JPH0648499B2

JPH0648499B2 - Processing unit

Info

Publication number: JPH0648499B2
Application number: JP63010775A
Authority: JP
Inventors: 薫内田
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1988-01-22
Filing date: 1988-01-22
Publication date: 1994-06-22
Anticipated expiration: 2009-06-22
Also published as: JPH01187638A

Description

[Industrial Field of Application]

The present invention relates to a processing unit for a data flow processor in which a memory unit and an arithmetic unit are connected by a pipelined bus and the order of operations is controlled in a data-driven manner.

[Prior Art]

A known data flow processor is the one described in Japanese Unexamined Patent Publication No. 56-169152, which has been commercialized as the μPD7281 manufactured by NEC Corporation. The μPD7281 has the configuration shown in FIG. 7. In the μPD7281, a token, the unit of data input to the device from the external bus, carries a data value, an identifier used to reference the link table 92 after input, and a module number indicating the device in which the token is to be processed. The token input unit 91 takes a token in when the module number of a token passing on the external bus matches the number of the device; otherwise it outputs the token unchanged to the external bus through the token output unit 97. An input token references the link table 92 using its identifier, obtains there a function table address for referencing the function table 93 and an identifier for its next reference to the link table 92, and is then sent to the function table 93.

In the function table 93 the token performs a lookup using its function table address. There the management information for the data memory 94 is referenced and updated, and at the same time the token obtains a processing code indicating the processing to be performed in the processing unit 96 and an access address for the data memory 94. The token is then sent to the data memory 94, where, as required, it waits for the partner operand of a binary operation or a constant is read out for an operation with a constant. The queue memory 95 temporarily holds tokens while the processing unit 96 is processing a previous token and cannot accept input. When the processing unit 96 is not busy, the token is sent from the queue memory 95 to the processing unit 96 and, according to its processing code, undergoes one of the following: addition, subtraction, or multiplication of integer data, logical operation, shift, comparison, bit inversion, priority encoding, flow branching, numeric value generation, copy, cumulative addition using an internal register, and so on. If the token's processing code indicates output, the token is sent from the queue memory 95 to the token output unit 97, converted to the same form as an input token, and output to the external bus. A token processed by the processing unit 96 is sent to the link table 92, where it again performs a lookup using its identifier. In this way the token circulates on the internal ring bus until an output instruction is executed, and its data value receives the necessary processing.

[Problems to be Solved by the Invention]

Numerical computation in application fields such as numerical simulation and pattern recognition must handle floating-point data to secure sufficient precision. In particular, multiplication of a matrix by a vector and multiplication of two matrices, with floating-point elements, occur frequently in these applications.

Consider performing, on the data flow processor described above, the so-called convolution processing that appears in such multiplications: each of a number of data items held in external memory is multiplied by a coefficient, and the products are summed. The conventional data flow processor has only one arithmetic unit, and each time a token completes one lap of the internal ring and enters the processing unit, only a single binary operation between the token's two data values can be performed. A convolution therefore requires tokens to circulate around the internal ring and enter the processing unit N times to multiply the N data pairs and (N−1) times to add the results. Moreover, those additions must be performed serially in time, so the processing time becomes long.

To address part of this problem, namely speeding up the addition of consecutive data, the μPD7281 provides a register in the processing unit and performs cumulative addition. Even so, only one of multiplication and addition can be performed at a time, so the multiplications are not accelerated. To speed up convolution, therefore, a multiplier that multiplies the coefficient by the input data and an arithmetic unit that accumulates the products using a register must be arranged in cascade.

In addition, the arithmetic operations supported by the data flow processor described above are limited to addition, subtraction, and multiplication of integer data. Processing that requires operations on floating-point data cannot be executed directly and must be implemented in software, which increases the processing time.

To solve this, it is possible to provide a processor in which the floating-point arithmetic hardware used in ordinary processors is built into the processing unit of the data flow processor. In that case, however, a floating-point operation takes longer than the fixed-point operations. Because a data flow processor transfers tokens internally in synchronization with a pipeline clock, the pipeline cycle of the whole device must be lengthened, and the overall processing speed falls.

The usual remedy when the floating-point arithmetic unit requires such complex processing is to divide it internally into several stages and operate the stages in pipeline fashion, which shortens the overall clock period. Even then, when the elements of a vector are accumulated, for example, the next addition cannot start until the operation producing the previous sum has finished. Those additions cannot be pipelined, and the processing time increases accordingly.

For example, consider multiplying an m×n matrix A (elements a[i,j]) by a vector of length n (elements x[j]), both with floating-point elements, on a data flow processor that has a floating-point multiplier and a pipelined floating-point adder consisting of s stages.

To obtain each element of the product, the two continuously input data sequences are multiplied, and accumulating the results then requires the sum of n products. Each of the (n−1) additions this takes costs s steps, and all of them must be performed serially in time, so they take s×(n−1) steps per element and s×(n−1)×m steps in total.
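The step count in the preceding paragraph can be written out as a small function. This is only a sketch of the arithmetic stated in the text; the function name and the sample sizes are illustrative assumptions, not values from the patent.

```python
def serial_accumulation_steps(m: int, n: int, s: int) -> int:
    """Adder steps when every row sum is accumulated serially.

    m: number of rows of A, n: number of columns of A,
    s: number of pipeline stages in the floating-point adder.
    Each row needs (n - 1) additions, each addition occupies the
    adder for s steps, and no two additions overlap.
    """
    if m < 1 or n < 1 or s < 1:
        raise ValueError("m, n and s must be positive")
    return s * (n - 1) * m


# Illustrative sizes only: a 32 x 8 matrix and a 5-stage adder.
print(serial_accumulation_steps(m=32, n=8, s=5))
```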

An object of the present invention is to solve the problems described above and to provide a processing unit for a data flow processor that can execute multiply-accumulate operations on floating-point data with as little loss of pipeline performance as possible, thereby speeding up processing of the kind described.

[Means for Solving the Problems]

The processing unit of the present invention is characterized by comprising: an arithmetic calculation unit that operates on the two operand data of a token, carrying control information and operand data, that is input from the internal memory unit via the bus, and outputs a token carrying the result data; a register file consisting of a plurality of registers that temporarily hold the result data on the output token of the arithmetic calculation unit and intermediate results of addition; an adder that operates on the result data on the output token of the arithmetic calculation unit and data read from the register file, and sends the result to the register file; a delay circuit for synchronizing the control information on the output token of the arithmetic calculation unit with the data passing through the adder; and a result token generation unit that generates a token carrying the operation result data from the result output data of the adder and the control information obtained from the delay circuit.

[Action]

When a data flow processor having the processing unit of the present invention multiplies the matrix A by the vector as described above, the vector elements x[1], x[2], ..., x[n] are supplied from outside in advance and held in internal memory. For the processing itself, n×m tokens carrying the element data of matrix A in the order a[1,1], a[2,1], ..., a[m,1], a[1,2], ..., a[m,2], ..., a[1,n], ..., a[m,n] are input to the data flow processor. As each input token passes through the internal memory it acquires a processing code and the vector element that is the partner operand of the required binary operation. The partner data are accessed m×n times in the order x[1], ..., x[1] (m times), x[2], ..., x[2], ..., x[n], ..., x[n].

The tokens, each carrying two operands, are input one after another to the arithmetic calculation unit that forms the front stage of the processing unit. There the two input operands are multiplied, and tokens carrying the resulting product p[i,j] = a[i,j] * x[j] are sent to the cumulative addition unit in the rear stage in the order p[1,1], p[2,1], ..., p[m,1], p[1,2], ..., p[1,n], ..., p[m,n]. The cumulative addition unit uses the registers of its internal register file as a FIFO of length m. It thereby holds the partial sum of row i, q[i,k−1] = p[i,1] + ... + p[i,k−1], cyclically until p[i,k] is input, and sends the held partial sum to the adder in step with the next datum to be added to it, which arrives m clocks later. The operand pairs that enter the adder consecutively belong to row (i−1), row i, row (i+1), and so on, and are independent of one another, so every stage of the adder pipeline can be kept fully occupied. After each addition, the resulting partial sum is stored in the register file again.

In this way, while the m×n data pairs are being input, the m sums of products are obtained by an adder operating in pipeline fashion.
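The interleaved accumulation described above can be sketched in software. The sketch below models only the data ordering: products arrive column by column, and a FIFO of length m returns the partial sum of row i exactly when the next product of row i arrives. Adder latency is ignored here, indices are 0-based, and the function and variable names are assumptions made for illustration.

```python
from collections import deque


def matvec_interleaved(a, x):
    """Multiply matrix a (list of m rows) by vector x (length n).

    Tokens are consumed in the column-major order given in the text:
    a[0][0], a[1][0], ..., a[m-1][0], a[0][1], ...
    """
    m, n = len(a), len(x)
    fifo = deque([0.0] * m)          # partial sums, row 0 at the head
    for j in range(n):               # one pass per column of a
        for i in range(m):           # consecutive tokens hit different rows
            p = a[i][j] * x[j]       # front stage: product p[i,j]
            q = fifo.popleft()       # partial sum of row i reaches the head
            fifo.append(q + p)       # rear stage: updated sum re-enters the tail
    return list(fifo)                # after n passes, row 0 is at the head again


print(matvec_interleaved([[1, 2], [3, 4], [5, 6]], [10, 1]))
```

Because consecutive additions belong to different rows, none of them depends on the one immediately before it, which is what lets a pipelined adder stay full.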

[Embodiment]

An embodiment of the present invention will now be described with reference to the drawings.

FIG. 2 is an internal block diagram showing the overall configuration of the data flow processor 1 in one embodiment of the present invention. Reference numeral 10 denotes a token input unit, 11 a link table, 12 an operand fetch table, 13 a data memory, 14 a function table, 15 a buffer queue, 16 a processing unit, and 17 a token output unit. The link table 11, operand fetch table 12, data memory 13, function table 14, buffer queue 15, and processing unit 16 are connected in this order into a ring by a pipelined bus, as shown in the figure, and tokens are transferred on this internal ring bus. The processing unit 16 consists of an arithmetic calculation unit 20 and a cumulative addition unit 21 arranged in cascade.

FIG. 3 is an overall configuration diagram of an example of a data processing apparatus that uses the data flow processor of FIG. 2. In this apparatus a plurality of data flow processors 1, ..., 2 and one memory interface circuit 3 are connected by an external bus 5, and the external bus 5 is connected to a memory 4 through the memory interface circuit 3. Tokens are transferred asynchronously on the external bus 5 by a handshake protocol.

FIG. 4 shows the format of the token, the unit of data used in the data flow processor of FIG. 2 and the data processing apparatus of FIG. 3. The token 60 on the external bus 5, shown in FIG. 4(a), consists of a module number 61, a control flag 62, a link table address 63, and a data field 64. The control flag 62 distinguishes the tokens used before program execution to load the program into the internal tables of the data flow processor, such as the link table 11, from the execution tokens used while processing is running.

From among the tokens input from the preceding data flow processor or the memory interface circuit, the token input unit 10 takes in only those whose module number 61 equals the number assigned to the device and sends them to the link table 11 in synchronization with the pipeline cycle. All other tokens are sent unchanged to the token output unit 17 as pass-through tokens. If the destination, whether the link table 11 or the token output unit 17, is busy, the token is not sent, and further input from the preceding data flow processor or memory interface circuit is halted by withholding the handshake acknowledge signal.
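The routing rule of the token input unit 10 can be summarized as a small decision function. This is a sketch under assumptions: the function name, the string labels, and the boolean return convention are hypothetical and only mirror the behavior described in the paragraph above.

```python
def route_input_token(token_module, own_module, link_table_busy, output_busy):
    """Decide where an incoming external-bus token goes.

    Returns (destination, acknowledge). The destination is "link_table" for
    tokens addressed to this device, "token_output" for pass-through tokens,
    or "stall" when the chosen destination is busy. acknowledge is False
    while stalled, modelling the withheld handshake acknowledge signal.
    """
    if token_module == own_module:
        destination, busy = "link_table", link_table_busy
    else:
        destination, busy = "token_output", output_busy
    if busy:
        return ("stall", False)   # hold the token and stop further input
    return (destination, True)


print(route_input_token(3, 3, link_table_busy=False, output_busy=True))
```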

The link table 11 receives tokens from the processing unit 16 or the token input unit 10. If both request input at the same time, input from the token input unit 10 takes priority. The link table 11 is referenced using the link table address 63 of the token 60 input from the processing unit 16 or the token input unit 10. The token obtains an address for accessing the operand fetch table 12, an address for accessing the function table 14, and a link table address for its next reference to the link table 11, and is then sent to the operand fetch table 12.

The operand fetch table 12 is referenced using the operand fetch table access address that the input token 60 obtained from the link table 11. At that address, the instruction code for reading, writing, and binary-operation queue control of the data memory 13 is referenced, and the state management information is referenced and updated. The token thereby receives an access address for the data memory 13 and a data memory processing code that specifies the operation to perform in the data memory 13.

The data memory 13 is accessed using the data memory access address carried by the input token 60. As required, it is used as a queue in which the two data of a binary operation wait for each other, or as a memory that temporarily stores one operand of a binary operation. For example, data input to the data flow processor from external memory in advance can be held by writing them to consecutive addresses of the data memory 13. Later, during the computation, a data memory read token carrying the first operand of a binary operation reads the stored data from the data memory 13, and the data read out are attached to the token as the second operand for use in the binary operation in the processing unit 16. In addition, the first and second operands can be exchanged at the exit of the data memory 13 according to the data memory processing code.

In the function table 14, an input token accesses the internal table using its function table access address, and a processing code indicating the processing to be performed in the processing unit 16 is attached to the token. At the same time, the link table address field of the passing token may be changed according to the internal state stored in the function table 14, which controls the token flow as required. Instead of this flow control operation, the data held in the internal state holding section can be attached to the token as the second operand and input to the processing unit 16 together with the first operand that the token carried on entering the function table 14. The processing code fetched from the function table 14 consists of an arithmetic calculation unit processing code that specifies the processing in the arithmetic calculation unit 20, a cumulative addition unit processing code that specifies the processing in the cumulative addition unit 21, and an output selection code that specifies whether the token carrying the result of the processing unit 16 is sent to the link table 11 or to the token output unit.

The buffer queue 15 is a FIFO memory that temporarily holds tokens before they are input to the processing unit 16. It stops output to the processing unit 16 while the processing unit 16 is not accepting tokens. The format of a token sent from the buffer queue 15 to the processing unit 16 is shown as token 65 in FIG. 4(b). The token 65 has a control flag 69, a link table address 70, and the first operand 71 and second operand 72 to be processed, together with the arithmetic calculation unit processing code 66, the cumulative addition unit processing code 67, and the output selection code 68 fetched from the function table 14.
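The two token formats of FIG. 4 can be pictured as plain records. The sketch below lists only the fields named in the text, tagged with their reference numerals; the class names, field names, and Python types are assumptions, and the patent text here gives no field widths.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExternalToken:
    """Token 60 on the external bus 5, FIG. 4(a)."""
    module_number: int        # 61: device that should process the token
    control_flag: int         # 62: program-load token or execution token
    link_table_address: int   # 63
    data: float               # 64


@dataclass(frozen=True)
class ProcessingToken:
    """Token 65 sent from the buffer queue 15 to the processing unit 16, FIG. 4(b)."""
    arithmetic_code: int      # 66: processing in the arithmetic calculation unit 20
    accumulate_code: int      # 67: processing in the cumulative addition unit 21
    output_select_code: int   # 68: link table 11 or token output unit 17
    control_flag: int         # 69
    link_table_address: int   # 70
    operand1: float           # 71
    operand2: float           # 72


token = ProcessingToken(1, 2, 0, 0, 5, operand1=1.5, operand2=2.0)
print(token.operand1 * token.operand2)
```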

As shown in FIG. 1, the processing unit 16 consists of the arithmetic calculation unit 20 and the cumulative addition unit 21 connected in series. The two units operate independently and act on the input tokens in pipeline fashion as the tokens pass through them in turn.

The arithmetic calculation unit 20 executes a binary operation on the first and second operands of an input token, or a unary operation on the first operand, according to the arithmetic calculation unit processing code among the processing codes fetched from the function table 14. It holds no internal state, and it outputs a token carrying the result data to the cumulative addition unit 21 via signal line 101. The available operations include arithmetic operations, logical operations, shifts, comparisons, and bit manipulation. In particular, when the token carries floating-point data and a processing code that specifies floating-point multiplication, the unit multiplies the two input floating-point data and outputs a result token carrying the floating-point product. To raise the overall pipeline clock rate, the hardware of the arithmetic calculation unit 20 may itself be divided into several stages that operate in pipeline fashion.

According to the cumulative addition unit processing code carried by the token input from the arithmetic calculation unit 20 via signal line 101, the cumulative addition unit 21 either sends the token's data to the adder 22 to be added to data read from the register file 23, or sets the value in a suitable register of the register file 23. The input token from the arithmetic calculation unit 20 is held in the input token register 30. Of its contents, the data value to be processed is output on signal line 102 and the remaining token information, used for control, is output on signal line 107. The token information on signal line 107 comprises the link table address 70, control flag 69, cumulative addition unit processing code 67, and output selection code 68 carried by the input token 65 to the processing unit 16 shown in FIG. 4(b). Of these, the cumulative addition unit processing code 67 contains a write control code and a read control code for the register file 23 and a result token generation control code. The output selection code 68 contains a flag indicating whether the result token is output to the token output unit 17 or to the link table 11, and the module number the token must carry when it is output from the token output unit 17 to the external bus.

The register file 23 consists of r registers. In response to signal 109 from the register file write control unit 24, the data on signal line 105 are written to the specified register. The register file write control unit 24 is controlled by the register file write control code on signal 107 or signal 108 and writes the data input from either signal line 102 or signal line 104 to the specified register of the register file 23 via signal line 105. Between signal 107 and signal 108, signal 107 takes priority. The register file read control unit 25 is controlled by the register file read control code on signal 107 and outputs the data of the specified register of the register file 23 on signal line 103.

The adder 22 performs pipelined addition on the two floating-point data on signal lines 102 and 103 and outputs result data in the same format on signal line 104. FIG. 5 shows an example of the adder 22 built from five stages. In this example the data conform to the floating-point format of the IEEE 754 standard. The exponent and mantissa of each datum are separated and then carried through five internal latch stages as the operation proceeds. In FIG. 5, the elements marked L are the latches that form the pipeline stages. The operation is briefly as follows. The two input data are compared in the comparison/selection unit 150, which outputs the exponent of the larger datum on signal 151, its mantissa on signal 152, the mantissa of the smaller datum on signal 153, and the absolute value of the difference between the two exponents on signal 154. The mantissa of the smaller datum is shifted right by the exponent difference in the right shifter 155 and added to the other mantissa in the adder 156. The zero counter 157 counts the number of consecutive zeros from the most significant bit of the binary result, and the left shifter 158 shifts the mantissa sum left by that count to give the normalized mantissa of the result. At the same time, the output of the zero counter 157 is added to the original larger exponent in the adder 159 to give the exponent of the result. Although a five-stage adder 22 is shown here, the description below assumes s stages in general.
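The data path of FIG. 5 can be followed step by step in a simplified model. The sketch below is not IEEE 754: it assumes positive, normalized operands given as (exponent, mantissa) pairs with value = mantissa * 2**exponent, an 8-bit mantissa, and truncation instead of rounding. The grouping into five steps and the exponent adjustment `1 - zeros` (which accounts for the extra carry bit in this toy format) are assumptions of the sketch, not details taken from the patent.

```python
MANT_BITS = 8              # assumed mantissa width; normalized means top bit set
SUM_BITS = MANT_BITS + 1   # one extra bit so a carry out of the sum is kept


def fp_add(a, b):
    """Add two positive normalized (exponent, mantissa) pairs."""
    # Step 1 (comparison/selection 150): pick the operand with the larger
    # exponent and compute the absolute exponent difference.
    (big_exp, big_mant), (small_exp, small_mant) = (a, b) if a[0] >= b[0] else (b, a)
    diff = big_exp - small_exp
    # Step 2 (right shifter 155): align the smaller mantissa.
    aligned = small_mant >> diff
    # Step 3 (adder 156): add the aligned mantissas.
    total = big_mant + aligned
    # Step 4 (zero counter 157): leading zeros within the SUM_BITS-wide field.
    zeros = SUM_BITS - total.bit_length()
    # Step 5 (left shifter 158, adder 159): normalize and adjust the exponent,
    # then drop the lowest bit to return to MANT_BITS.
    normalized = total << zeros
    return (big_exp + 1 - zeros, normalized >> 1)


exp, mant = fp_add((3, 128), (0, 128))   # 1024 + 128
print(exp, mant, mant * 2 ** exp)
```

In hardware each step would sit between two of the latches marked L, so a new operand pair can enter every clock while earlier pairs are still in flight.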

The delay circuit 26 consists of s delay latches connected in series. By delaying the token information on signal line 107 by s stages, it delivers that information to the result token generation unit 29 and the register file write control unit 24 in synchronization with the operation data passing through the adder 22.
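A chain of s latches behaves like a fixed-length shift register. The sketch below models it with a bounded deque; the class name and the `clock` method are hypothetical, and each call stands for one pipeline clock.

```python
from collections import deque


class DelayLine:
    """s latches in series: a value clocked in comes out s clocks later."""

    def __init__(self, stages, fill=None):
        if stages < 1:
            raise ValueError("stages must be at least 1")
        self._latches = deque([fill] * stages, maxlen=stages)

    def clock(self, value):
        oldest = self._latches[0]     # contents of the last latch in the chain
        self._latches.append(value)   # maxlen discards the oldest entry
        return oldest


line = DelayLine(stages=3)
print([line.clock(v) for v in ["t0", "t1", "t2", "t3", "t4"]])
```

With the delay set to the adder's stage count s, the control information of a token leaves the chain on the same clock as that token's sum leaves the adder.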

The result token generation unit 29 is controlled by the result token generation control code on signal 107. It attaches the link table address, control flag, and output selection code from the token information sent on signal line 107 to the floating-point result data 104 output by the adder 22, arranges them in the format of an output token of the processing unit 16, and outputs the result token on signal line 110 at the specified timing.

An output token from the processing unit 16 is normally sent to the link table 11. When the token carries an output selection code indicating that it should be output from the data flow processor, the module number required for an external bus token, which is contained in the output selection code, is attached to the token and the token is sent to the token output unit 17. If the token output unit 17 is busy, output to it is stopped and input from the buffer queue 15 to the processing unit 16 is also inhibited.

The token output unit 17 outputs a token received from the processing unit 16 or the token input unit 10, via the external bus 5, to the data flow processor in the following stage or to the memory interface circuit 3. If the processing unit 16 and the token input unit 10 issue requests at the same time, the input from the token input unit 10 takes priority, and the token output unit 17 stops accepting tokens from the processing unit 16 by sending it a signal indicating the busy state. Likewise, when the following data flow processor or memory interface circuit is busy and does not return the handshake acknowledge signal, output is stopped and acceptance of tokens from the processing unit 16 is also stopped.
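The priority rule of the token output unit can be written as a small decision function. This is a sketch under assumptions: the function name `arbitrate`, its arguments, and the returned labels are hypothetical, and the busy signal is assumed to be asserted whenever the token input unit is being served, a case the text states explicitly only for simultaneous requests.

```python
def arbitrate(input_unit_request, processing_unit_request, downstream_busy):
    """Return (selected source or None, busy signal to the processing unit)."""
    if downstream_busy:
        # No handshake acknowledge from the next stage: stop all output.
        return None, True
    if input_unit_request:
        # The token input unit has priority; the processing unit is held off
        # (assumed to apply even when the processing unit is not requesting).
        return "token_input_unit", True
    if processing_unit_request:
        return "processing_unit", False
    return None, False


print(arbitrate(True, True, False))
```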

In the processing unit described in the above embodiment, let s be the number of stages of the adder 22 and r the number of registers in the register file 23. For the matrix-vector multiplication handled by the present invention, with a matrix A of size m × n, the relation r ≥ m ≥ s must hold.

Next, the operation of this embodiment is described for the computation mentioned earlier: matrix A (size m × n) × vector (size n). In this embodiment the number of stages of the adder 22 is s = 5, the number of registers in the register file 23 is r = 32, and, in accordance with the above condition, m = 32. Although m = r here, when m < r the access register selection uses only m of the r registers of the register file 23 as a FIFO, so processing proceeds in exactly the same way.

First, prior to the computation, the vector elements x[1], x[2], ..., x[n] are set in the data memory. Next, tokens carrying the elements of the matrix A are input one after another from the external memory 4 to the data flow processor via the memory interface circuit 3. The external memory 4 is accessed so that the m × n matrix elements are input in the order a[1,1], a[2,1], ..., a[m,1], a[1,2], a[2,2], ..., a[m,2], ..., a[1,n], a[2,n], ..., a[m,n].

The input tokens pass from the token input unit 10 to the link table 11. In the operand fetch table 12, access is controlled so that the vector elements in the data memory 13 are read in input order: x[1] m times, x[2] m times, ..., x[n] m times. As a result, the operand pairs entering the processing unit 16 are (a[1,1], x[1]), (a[2,1], x[1]), ..., (a[m,1], x[1]), (a[1,2], x[2]), (a[2,2], x[2]), ..., (a[m,2], x[2]), ..., (a[1,n], x[n]), (a[2,n], x[n]), ..., (a[m,n], x[n]).
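The column-by-column ordering of the operand pairs can be reproduced in a few lines. This is an illustrative sketch only: the function name `operand_pairs` is hypothetical, and Python's 0-based indices stand in for the 1-based indices of the description.

```python
def operand_pairs(a, x):
    """List (a[i][j], x[j]) pairs column by column, as they enter the unit."""
    m, n = len(a), len(x)
    # Outer loop over columns j, inner loop over rows i: each x[j] is used m times.
    return [(a[i][j], x[j]) for j in range(n) for i in range(m)]


a = [[1, 2], [3, 4], [5, 6]]  # m = 3 rows, n = 2 columns
x = [10, 20]
print(operand_pairs(a, x))
```

Feeding the matrix by columns means that consecutive tokens belong to different rows, so consecutive additions are independent of one another.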

Next, in the function table 14, each token fetches the processing codes that specify the processing to be performed in the processing unit 16. As described earlier, the processing codes are as follows; each is listed together with the mnemonics used in the description below.

1. Code specifying the processing in the arithmetic calculation unit 20
   fmul: floating-point multiplication of the two input operands
2. Register file write control code of the cumulative addition unit 21
   w0: cyclically write the data on the signal line 102
   w1: cyclically write the data on the signal line 104
3. Register file read control code of the cumulative addition unit 21
   rdcyc: read the registers cyclically
   -: do not read
4. Token generation control code of the cumulative addition unit 21
   fadd: generate a token
   -: do not generate an output token
5. Output selection code
   out: send the token to the token output unit 17
   -: send the token to the link table 11

Writing the five codes of each token in order inside parentheses, the m × n tokens that flow through in this processing carry the following code sets:
for the first m tokens: (fmul, w0, -, -, -)
for the next m × (n − 2) tokens: (fmul, w1, rdcyc, -, -)
for the last m tokens: (fmul, -, rdcyc, fadd, out)
In this way, m × n tokens, each carrying the data to be computed in the processing unit 16 and the processing codes for it, flow into the processing unit 16 consecutively, that is, one per clock.
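The assignment of code sets to the token stream can be sketched as follows. The function name `code_sets` is hypothetical, and the sketch assumes n ≥ 2 so that the first and last groups of m tokens do not overlap.

```python
def code_sets(m, n):
    """Return the five-code tuple carried by each of the m * n tokens."""
    first = [("fmul", "w0", "-", "-", "-")] * m
    middle = [("fmul", "w1", "rdcyc", "-", "-")] * (m * (n - 2))
    last = [("fmul", "-", "rdcyc", "fadd", "out")] * m
    return first + middle + last


codes = code_sets(3, 4)
print(len(codes), codes[0], codes[3], codes[-1])
```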

In the arithmetic calculation unit 20, the code specifying the processing is fmul for every token, so the unit performs floating-point multiplication of the two input operands and sends the products, again as a continuous data stream, to the cumulative addition unit 21. For simplicity, let p[i,j] = a[i,j]・x[j]. The data of the output tokens of the arithmetic calculation unit 20 are then, in order, p[1,1], p[2,1], ..., p[m,1], p[1,2], p[2,2], ..., p[m,2], ..., p[1,n], p[2,n], ..., p[m,n], and these too flow into the cumulative addition unit 21 consecutively, one per clock.

In accordance with the processing codes of the tokens arriving in the above order, the cumulative addition unit 21 operates as follows:
a) for the first m tokens, the data are written cyclically to the register file 23;
b) for the next m × (n − 2) tokens, the data read cyclically from the register file 23 and the input data are added by the adder 22, and the result is written cyclically to the register file 23;
c) for the last m tokens, the data read cyclically from the register file 23 and the input data are added by the adder 22, a token carrying the result is generated, and it is sent to the token output unit as the output of the processing unit 16.
Because the register file 23 is used in this way as a FIFO of length m, the first m tokens are delayed by m clocks and enter the adder 22 in synchronization with the tokens of the next column p[1,2], ..., p[m,2]. The resulting partial sums (the second partial sums) are delayed by s clocks while passing through the s stages of the adder 22, are then written once to the register file 23, and are read out (m − s) clocks later, for a total delay of m clocks, so that they enter the adder 22 in synchronization with the tokens of the next column p[1,3], ..., p[m,3]. Repeating this yields the column of n-th partial sums, which is output from the result token generation unit 29 as the final result, completing the processing.
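The interplay of the register file FIFO and the s-stage adder can be checked with a clock-by-clock simulation. This is a simplified sketch under assumptions, not the hardware: the function name `accumulate` and its variables are hypothetical, integers replace floating-point data, the control codes are derived from the token position rather than carried on the token, and a result leaving the adder is written before the register file is read in the same clock (which matters only when s = m).

```python
from collections import deque


def accumulate(p_stream, m, n, s):
    """Simulate the cumulative addition unit; return (results, clocks used)."""
    assert m >= s >= 1 and n >= 2
    regs = [0] * m             # m registers used cyclically as a FIFO
    wr = rd = 0                # cyclic write and read pointers
    pipe = deque([None] * s)   # the s stages of the adder
    results, clocks = [], 0
    for t in range(m * n + s):  # m * n input clocks plus s clocks to drain
        clocks += 1
        done = pipe.popleft()  # sum that entered the adder s clocks ago
        if done is not None:
            value, is_final = done
            if is_final:
                results.append(value)  # fadd/out: emit a result token
            else:
                regs[wr] = value       # w1: write the adder output
                wr = (wr + 1) % m
        entering = None
        if t < m * n:
            p = p_stream[t]
            if t < m:
                regs[wr] = p           # w0: write the first column directly
                wr = (wr + 1) % m
            else:
                # rdcyc: add the stored partial sum to the incoming product.
                entering = (p + regs[rd], t >= m * (n - 1))
                rd = (rd + 1) % m
        pipe.append(entering)
    return results, clocks


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
x = [1, 0, 2]
p = [A[i][j] * x[j] for j in range(3) for i in range(3)]  # column order
print(accumulate(p, m=3, n=3, s=2))
```

In this model the run takes m × n + s clocks, and the condition m ≥ s ensures that each partial sum has been written back before it is read again.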

FIG. 6 outlines the timing of the data processed in each part of the cumulative addition unit 21 during this processing. In the figure, p_{i,j} denotes p[i,j] in the description, and q_{i,j} denotes the partial sum q_{i,j} = p_{i,1} + p_{i,2} + ... + p_{i,j}; hence q_{i,n} = y_i. The figure also uses one shorthand meaning that p_{1,j}, p_{2,j}, ..., p_{m,j} flow in order during the indicated interval, and a similar shorthand meaning that the operand pairs (p_{1,j}, q_{1,j-1}), (p_{2,j}, q_{2,j-1}), ..., (p_{m,j}, q_{m,j-1}) are input to the adder 22 in order. In the latter case the outputs of the adder 22 are naturally q_{1,j}, q_{2,j}, ..., q_{m,j}. Because the adder 22 takes s steps from the input of an operand pair to the output of the result, the register file write control unit 24 and the result token generation unit 29, which use that result, operate s clocks later.

[Effects of the Invention]

As described above, the present invention has the following effects. (1) Dedicated hardware is provided for floating-point addition, which requires a large amount of hardware and would otherwise take a long processing time, and this hardware is divided into multiple stages that operate as a pipeline. This raises the throughput of floating-point computation, that is, shortens the effective operation time, and also makes it possible to shorten the pipeline cycle of the processing unit and of the data flow processor as a whole, which improves processing performance.

(2) Furthermore, because an arithmetic calculation unit capable of floating-point multiplication is arranged in cascade with this adder, a convolution of floating-point data can be performed by passing the data through the processing unit only once.

(3) With such a pipelined adder, a conventional matrix × vector computation involving product-sum operations took s × (n − 1) × m steps, where s is the number of adder stages. In the present invention, as FIG. 6 shows, the computation finishes in about m × n + s steps, so processing is faster. This is because the use of the register file keeps every pipeline stage of the adder fully occupied.

These effects make it possible to speed up numerical computation.
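The two step counts stated in effect (3) can be compared numerically. The function names are hypothetical; s = 5 and m = 32 are the values of the embodiment, while n = 32 is an assumed value chosen only for illustration.

```python
def conventional_steps(m, n, s):
    """Step count stated for the conventional computation: s * (n - 1) * m."""
    return s * (n - 1) * m


def pipelined_steps(m, n, s):
    """Approximate step count stated for the invention: m * n + s."""
    return m * n + s


m, n, s = 32, 32, 5  # n = 32 is an assumption made for this example
print(conventional_steps(m, n, s), pipelined_steps(m, n, s))
```

For these values the counts are 4960 and 1029, a ratio of a little under five, close to the number of adder stages.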

[Brief description of drawings]

FIG. 1 is a block diagram showing the configuration of an embodiment of the processing unit of the present invention; FIG. 2 is a configuration diagram of a data flow processor using the processing unit of FIG. 1; FIG. 3 is an overall configuration diagram showing an example of a data flow processing device using the data flow processor of FIG. 2; FIG. 4 is a diagram showing the token format used in explaining the present invention; FIG. 5 is a block diagram showing an example of the configuration of the adder in the cumulative addition unit; FIG. 6 is a timing chart showing the data processing operation in the cumulative addition unit; and FIG. 7 is a diagram showing the configuration of a conventional data flow processor.

16 ... Processing unit
20 ... Arithmetic calculation unit
21 ... Cumulative addition unit
22 ... Adder
23 ... Register file
24 ... Register file write control unit
25 ... Register file read control unit
26 ... Delay circuit
29 ... Result token generation unit
30 ... Input token register

Claims

[Claims]

1. A processing unit of a data flow processor in which tokens, each a unit of data, flow on a pipeline-shaped bus connecting an internal memory unit and an arithmetic unit, and in which the order of operations is controlled by a data-driven method, the processing unit comprising: an arithmetic calculation unit that operates on two operand data on a token, input from the internal memory unit via the bus and having control information and operand data, and outputs a token having result data; a register file composed of a plurality of registers that temporarily hold the result data on the output token of the arithmetic calculation unit and intermediate results of addition; an adder that operates on the result data on the output token of the arithmetic calculation unit and the data read from the register file and sends the result to the register file; a delay circuit for synchronizing the control information on the output token of the arithmetic calculation unit with the data passing through the adder; and a result token generation unit that generates a token having result data from the result output data of the adder and the control information obtained from the delay circuit.