JP2928684B2 - VLIW type arithmetic processing unit

Info

Publication number: JP2928684B2
Application number: JP15519092A
Authority: JP
Inventors: 禎石川
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 1991-10-31
Filing date: 1992-06-15
Publication date: 1999-08-03
Anticipated expiration: 2014-08-03
Also published as: JPH05197547A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001] Field of the Invention. The present invention relates to a VLIW (Very Long Instruction Word) arithmetic processing unit in a stored-program digital computer.

[0002] Description of the Related Art. The processing performance of a microprocessor is determined by three factors: its operating frequency, the number of cycles required per instruction (CPI, Cycles Per Instruction), and the number of instructions needed to execute the program. That is, to raise processing performance it is necessary to raise the operating frequency, lower the CPI, and reduce the number of instructions executed.
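The relation among the three factors can be illustrated with a minimal sketch. The function name and the numeric values below are illustrative assumptions, not figures from this publication; the formula itself is the standard one (execution time = instruction count x CPI / clock frequency).

```python
def execution_time_seconds(instruction_count, cpi, clock_hz):
    """Execution time = instruction count x CPI / clock frequency."""
    return instruction_count * cpi / clock_hz


# Hypothetical numbers: one million instructions on a 50 MHz chip.
baseline = execution_time_seconds(1_000_000, 2.0, 50e6)
# Same program and clock, but parallel issue brings the average CPI below 1.
lower_cpi = execution_time_seconds(1_000_000, 0.5, 50e6)

print(baseline, lower_cpi)
```

With frequency and instruction count fixed, halving the CPI halves the run time, which is why the paragraphs that follow concentrate on lowering CPI through parallel execution.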

[0003] Of these three factors, the operating frequency of microprocessor chips rises year by year with advances in semiconductor manufacturing technology, but the operating frequency of the chips available at any given time has an upper limit.

[0004] Microprocessors come in CISC (Complex Instruction Set Computer) and RISC (Reduced Instruction Set Computer) types, as well as hybrids of the two. A RISC chip can achieve a smaller CPI than a CISC chip, but even so the CPI can only be lowered to 1. The number of instructions executed depends on the skill of the programmer (or the performance of the compiler), and there is a limit to how far it can be reduced.

[0005] Against this background, the VLIW architecture and the superscalar architecture emerged. A processor adopting either architecture finds instructions in the program that can be executed in parallel and executes multiple operations in parallel. That is, it decodes multiple instructions at the same time, sends them to the execution units, and processes them simultaneously. Such parallel processing makes it possible to lower the CPI to 1 or below.

[0006] VLIW processors and superscalar processors both incorporate multiple pipelines and are alike in that these pipelines allow multiple operations to be executed simultaneously, but they differ in the following respects.

(1) Difference in instruction stream scheduling method

[0007] In a superscalar processor, the instruction decoder finds instructions that can be executed in parallel and schedules them dynamically at the instruction execution stage. Parallelism can therefore be detected only among the instructions that have been fetched.

[0008] In VLIW, by contrast, the program is scheduled at compile time (or when it is coded in assembler). The burden on the compiler (or the programmer) is therefore greater than with superscalar, but the hardware (the instruction decoder) is simpler.

(2) Difference in instruction format

[0009] In general, a superscalar processor has the same 32-bit instruction format as a conventional RISC processor. Its decoder selects, from the fetched instructions, those that can be executed in parallel and issues them to the pipelines. A superscalar processor supporting a 32-bit instruction format can use the instruction format of a conventional RISC processor unchanged, and so can maintain software compatibility with existing RISC processors at the object code level.

[0010] VLIW, by contrast, generally has an instruction format longer than 32 bits, and a single instruction can specify multiple operations. Software compatibility with existing RISC processors therefore cannot be maintained, but higher performance than superscalar can be obtained.

[0011] In a processor with a pipeline structure, one instruction is divided into multiple stages for execution. Since each stage is in principle executed in one clock, every instruction appears to execute in one clock as long as the pipeline runs smoothly. In practice, however, the pipeline is disturbed for various reasons and the effective processing speed of the processor falls. These causes are called hazards. A mechanism that detects a hazard in hardware and automatically delays the pipeline is called an interlock.

[0012] The present invention relates to a parallel processing system based on the VLIW architecture rather than the superscalar architecture. In particular, it relates to a VLIW arithmetic processing unit in which the interlock function described above is improved so that pipeline delay is minimized and processing is accelerated.

[0013] In general, a VLIW arithmetic processing unit executes fixed-length instruction words, each of which can specify multiple operations. The multiple operations specified in one instruction word must appear, at least to the software (the program), to be executed simultaneously.

[0014] To improve its processing capability, a VLIW arithmetic processing unit also supports floating-point operations and the like. Some kinds of floating-point operation (division, for example) take several cycles to process.

[0015] In a conventional VLIW arithmetic processing unit, however, the multiple operations in one instruction that are specified for simultaneous execution must always appear to the software to be executed simultaneously. Consequently, when one of the simultaneously specified operations takes a long time to process, the arithmetic units whose operations have already finished are not given the next operation and are left idle until the long-running operation completes.

[0016] Problems to Be Solved by the Invention. As described above, the multiple operations in one instruction word that are specified for simultaneous execution must, from the software's point of view, always be executed simultaneously. A conventional VLIW arithmetic processing unit therefore does not execute the next instruction word until every operation contained in the preceding instruction word has finished. As a result, when one of the simultaneously specified operations takes a long time to process, the arithmetic units that have already finished are kept waiting, idle, without executing the next instruction until that operation completes. The parallel operations then take longer overall, and the parallel processing capability of the VLIW architecture cannot be fully exploited.

[0017] The present invention has been made in view of the above circumstances. Its object is to provide a VLIW arithmetic processing unit that guarantees the execution order of operations to the software while allowing subsequent instruction words to be executed in parallel wherever possible, even when a time-consuming operation started earlier is still in progress, thereby raising the utilization of the arithmetic units for parallel processing and improving the overall processing capability of the unit.

[0018] Means for Solving the Problems. The VLIW arithmetic processing unit of the present invention comprises: an instruction buffer (201) that stores a VLIW containing a plurality of instruction words;

[0019] an arithmetic unit (203) that has pipelines consisting of a plurality of fields, the number of which corresponds to the number of instruction words contained in the VLIW, and that executes the instructions contained in the VLIW independently in these pipelines; hardware resources (200) referenced by the operand (Rj) of each instruction word executed by the arithmetic unit;

[0020] a comparator (205) that compares a first operand (R4), which is to store the execution result of an instruction (MUL) currently being executed in a certain field (i = 1) of the arithmetic unit, with second operands (R1 to R4) referenced by the instructions (ADD; i = 0, 2) about to be executed in the fields (i = 0 to 3) of the arithmetic unit, and that provides a match signal when the second operands (R1 to R4) include one (R4) that matches the first operand (R4); means (208) for generating an incomplete signal, field by field, when instruction execution in a field (i) of the arithmetic unit has not finished;

[0021] a logic circuit (209, 210) that takes, for each field (i) of the arithmetic unit, the logical AND of the match signal from the comparator and the incomplete signal from the generating means, outputs an instruction-fetch inhibit signal when the logical AND is true for any field (i), and outputs an instruction-fetch signal when the logical AND is false for every field (i);

[0022] and an instruction output controller (201A) that, when the logic circuit outputs the instruction-fetch signal, transfers one or more instruction words contained in the VLIW simultaneously from the instruction buffer to the pipelines of the arithmetic unit, and that, when the logic circuit outputs the instruction-fetch inhibit signal, stops the transfer of VLIW instruction words from the instruction buffer to the arithmetic unit.

[0023] Operation. In the VLIW arithmetic processing unit of the present invention, while an instruction whose execution started before the instruction word about to be executed is still in progress, the number of the register that is to store that instruction's result is compared with the register numbers read by the instruction about to be executed. If a register number matches and the contents of that register are recognized as not yet usable, the instruction about to be executed is made to wait. Otherwise, even while an instruction requiring multiple cycles is in progress, subsequent instructions that do not reference its result are executed in parallel, which speeds up the processing of multiple instruction words.
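The interlock condition described in this paragraph can be sketched as a small predicate. This is only an illustrative software model under assumed names (`must_stall`, `in_flight`, `next_sources` are hypothetical); the publication describes hardware, not code.

```python
def must_stall(in_flight, next_sources):
    """Return True if the next instruction word has to wait.

    in_flight:    list of (dest_register, completed) pairs, one per
                  earlier instruction that was started previously.
    next_sources: set of register names read by the instruction word
                  that is about to be issued.
    """
    return any(
        dest in next_sources and not completed
        for dest, completed in in_flight
    )


# A multi-cycle MUL is still writing R4 (not yet completed).
in_flight = [("R4", False)]

print(must_stall(in_flight, {"R1", "R2", "R3", "R4"}))  # reads R4
print(must_stall(in_flight, {"R1", "R2"}))              # independent
```

A word that reads the pending register waits, while a word that reads only other registers is allowed to proceed alongside the long-running instruction.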

[0024] Embodiment. FIG. 1 shows an example of a 128-bit VLIW in which four 32-bit instruction words (load, operation, store, jump) are placed in fields 0 to 3, respectively. The operation instructions include addition (ADD), subtraction (SUB), multiplication (MUL), division (DIV), no operation (NOP), and so on.
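The 128-bit layout of FIG. 1 can be modelled as four 32-bit fields packed into one integer. The sketch below is an assumption for illustration: the publication does not state the bit order, so placing field 0 in the most significant 32 bits is a hypothetical choice, as are the function names.

```python
FIELD_BITS = 32
FIELDS = 4
FIELD_MASK = (1 << FIELD_BITS) - 1


def pack_vliw(words):
    """Pack four 32-bit instruction words into one 128-bit integer.

    Assumption: field 0 occupies the most significant 32 bits.
    """
    if len(words) != FIELDS:
        raise ValueError("a VLIW holds exactly four instruction words")
    value = 0
    for word in words:
        if not 0 <= word <= FIELD_MASK:
            raise ValueError("each instruction word must fit in 32 bits")
        value = (value << FIELD_BITS) | word
    return value


def unpack_vliw(value):
    """Split a 128-bit VLIW back into its four 32-bit fields."""
    return [
        (value >> (FIELD_BITS * (FIELDS - 1 - i))) & FIELD_MASK
        for i in range(FIELDS)
    ]


fields = [0x11111111, 0x22222222, 0x33333333, 0x44444444]
vliw = pack_vliw(fields)
print(hex(vliw), unpack_vliw(vliw) == fields)
```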

[0025] FIG. 2 shows an example of VLIW parallel processing when the present invention is not applied. In this example, while the division (R3 / R4) is executed in field 3 over 4 clocks, the execution units of fields 0 to 2 sit idle for 3 clocks, and while the multiplication (R1 * R2) is executed in field 1 over 2 clocks, the execution units of fields 0, 2, and 3 sit idle for 1 clock. Because of this idle time, the processing illustrated in FIG. 2 takes a total of 8 clocks to complete.

[0026] FIG. 3 shows an example of VLIW parallel processing corresponding to FIG. 2 when the present invention is applied. In this example, while the 4-clock division (R3 / R4) is being executed in field 3, field 1 executes a 2-clock multiplication (R1 * R2) and two 1-clock subtractions (R1 - R2), and fields 0 and 2 execute two 1-clock additions (R1 + R2; R3 + R4) and one 1-clock NOP for timing alignment. In the processing illustrated in FIG. 3 (with the invention applied), the execution units in the pipelines spend a smaller share of the time idle (executing NOP) during the division and multiplication, so processing equivalent to that of FIG. 2 (without the invention) finishes in a total of 5 clocks. FIG. 6 is a flowchart explaining how parallel processing such as that illustrated in FIG. 3 proceeds in hardware.
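The 8-clock versus 5-clock comparison can be reproduced with a toy timing model. This is a sketch under stated assumptions, not the circuit of the publication: the latencies (DIV 4 clocks, MUL 2 clocks, others 1 clock) come from the text, but the destination registers of the 1-clock operations, the single execution unit per field, and all names (`Op`, `lockstep_clocks`, `interlocked_clocks`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Op:
    name: str                    # "NOP" means the field is unused
    latency: int = 1             # clocks the field's unit stays busy
    dest: Optional[str] = None   # register written (hypothetical)
    srcs: Tuple[str, ...] = ()   # registers read


NOP = Op("NOP")


def lockstep_clocks(program):
    """Conventional scheme: a word ends when its slowest operation ends."""
    return sum(max(op.latency for op in word) for word in program)


def interlocked_clocks(program):
    """Issue the next word as soon as no needed unit or register is busy."""
    busy_until = [0] * len(program[0])  # last busy clock of each field
    pending = {}                        # register -> clock its result lands
    clock = finish = 0
    for word in program:
        clock += 1                      # at most one word issued per clock
        while any(op.name != "NOP" and busy_until[i] >= clock
                  for i, op in enumerate(word)) or \
              any(pending.get(r, 0) >= clock
                  for op in word for r in op.srcs):
            clock += 1                  # interlock: hold the word one clock
        for i, op in enumerate(word):
            if op.name == "NOP":
                continue
            end = clock + op.latency - 1
            busy_until[i] = end
            if op.dest is not None:
                pending[op.dest] = end
            finish = max(finish, end)
    return finish


ADD0 = Op("ADD", 1, "R5", ("R1", "R2"))
SUB1 = Op("SUB", 1, "R6", ("R1", "R2"))
ADD2 = Op("ADD", 1, "R7", ("R3", "R4"))
PROGRAM = [
    [ADD0, SUB1, ADD2, Op("SUB", 1, "R8", ("R1", "R2"))],
    [ADD0, SUB1, ADD2, Op("DIV", 4, "R8", ("R3", "R4"))],
    [ADD0, Op("MUL", 2, "R4", ("R1", "R2")), ADD2, NOP],
    [ADD0, SUB1, ADD2, NOP],
]

print(lockstep_clocks(PROGRAM), interlocked_clocks(PROGRAM))
```

Under these assumptions the lockstep total is 1 + 4 + 2 + 1 = 8 clocks, while the interlocked model issues words 1 to 3 on consecutive clocks, holds word 4 for one clock until the multiplication finishes, and completes everything on clock 5 together with the division.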

[0027] First, suppose that instruction word 1 of FIG. 2, which corresponds to instruction word 1 of FIG. 3, is fetched into the hardware (the parallel execution unit) (ST100). This fetch issues the ADD, SUB, ADD, and SUB of instruction word 1 to the four pipelines of fields 0 to 3 all at once (synchronously).

[0028] In each of fields 0 to 3, the operands of the issued instructions (ADD, SUB, ADD, SUB) are checked for availability (ST110 to ST113). (An operand is the address or data referenced by an instruction word; here it is the register number Rj of the register file referenced by each pipeline, where j is, for example, 1 to 8.) Since processing has only just begun, all the operands (R1 to R4) of fields 0 to 3 required by the issued instructions are available (ST114 to ST117, Yes).

[0029] Once all the operands of fields 0 to 3 are available (ST120, Yes), a check is made as to whether the hardware resources needed for the operations of fields 0 to 3 (registers in the register file that are currently usable) exist (ST130 to ST133). Since processing has only just begun, all hardware resources are available (ST134 to ST137, Yes).

[0030] Once all the hardware resources of fields 0 to 3 are available (ST140, Yes), processing of the issued instructions (ADD, SUB, ADD, SUB) starts simultaneously in fields 0 to 3 (ST150 to ST153). With the above steps, the processing of instruction word 1 of FIG. 3 completes in 1 clock.

[0031] Next, instruction word 2 of FIG. 2, which corresponds to instruction word 2 of FIG. 3, is fetched into the hardware (the parallel execution unit) (ST100). This fetch issues the ADD, SUB, ADD, and DIV of instruction word 2 to the four pipelines of fields 0 to 3.

[0032] In each of fields 0 to 3, the operands of the issued instructions (ADD, SUB, ADD, DIV) are checked for availability (ST110 to ST113). Instruction word 1 completed simultaneously in 1 clock, and the operand (register R8) referenced by the DIV instruction in field 3 is still unused, so all the operands (R1 to R4, R8) of fields 0 to 3 required by the issued instructions (ADD, SUB, ADD, DIV) are available (ST114 to ST117, Yes).

[0033] Once all the operands of fields 0 to 3 are available (ST120, Yes), a check is made as to whether the hardware resources needed for the operations of fields 0 to 3 exist (ST130 to ST133). None of the hardware resources required by the issued instructions (ADD, SUB, ADD, DIV) is in use by another instruction, so all of them are available (ST134 to ST137, Yes).

[0034] Once all the hardware resources of fields 0 to 3 are available (ST140, Yes), processing of the issued instructions (ADD, SUB, ADD, DIV) starts simultaneously in fields 0 to 3 (ST150 to ST153). Instruction word 2 of FIG. 3 is processed in this way.

[0035] Next, instruction word 3 of FIG. 2, which corresponds to instruction word 3 of FIG. 3, is fetched into the hardware (ST100). This fetch issues the ADD, MUL, and ADD of instruction word 3 to the three pipelines of fields 0 to 2. (Here it is assumed that the compiler that generated the processing program of FIG. 2 corresponding to FIG. 3 knows that the DIV instruction issued to field 3 in instruction word 2 is a time-consuming operation. At this point, therefore, no new instruction performing substantive processing is issued to field 3 of FIG. 2, and a NOP is inserted instead. This NOP in field 3 is ignored when the processing of FIG. 3 is executed.)

[0036] In each of fields 0 to 2, the operands of the issued instructions (ADD, MUL, ADD) are checked for availability (ST110 to ST113). Field 3 is still processing the DIV of instruction word 2, and this DIV is being processed within the two ALUs of field 3. The operands of field 3 (R3, R4, R8) are therefore not yet blocked and are available (ST117, Yes). In addition, fields 0 to 2 of instruction word 2 completed simultaneously in 1 clock, so all the operands (R1 to R4) of fields 0 to 2 required by the issued instructions (ADD, MUL, ADD) are also available (ST114 to ST116, Yes).

[0037] Once all the operands of fields 0 to 3 are available (ST120, Yes), a check is made as to whether the hardware resources needed for the operations of fields 0 to 3 exist (ST130 to ST133).

[0038] In this case, the MUL instruction being processed in field 1 uses a resource (register R4), and the ADD instruction in field 2 references that resource (register R4), so the hardware resource needed for the field 2 operation is not available (ST136, No). Field 2 therefore waits, inserting a do-nothing NOP, until this resource (register R4) becomes free, that is, until execution of the MUL instruction in field 1 finishes (loop of ST132 and ST136).

[0039] At this time, the ADD instruction in field 2 also references the resource (register R3) used by the ADD instruction in field 0. Therefore, while field 2 executes a NOP (1 clock) and waits for the MUL instruction in field 1 to finish, field 0 also executes a NOP (1 clock) and waits (loop of ST130 and ST134).

[0040] Inserting NOPs in fields 0 and 2 makes all the hardware resources of fields 0 to 3 available (ST140, Yes), and in fields 0 to 3 the issued instructions (ADD, MUL, ADD) and the instruction in progress (DIV) are processed in parallel, proceeding simultaneously over 2 clocks (ST150 to ST153). Instruction word 3 of FIG. 3 is processed in this way.

[0041] Next, instruction word 4 of FIG. 2, which corresponds to instruction word 4 of FIG. 3, is fetched into the hardware (ST100). This fetch issues the ADD, SUB, and ADD of instruction word 4 to the three pipelines of fields 0 to 2. (Here it is assumed that the compiler that generated the processing program of FIG. 2 corresponding to FIG. 3 knows that the DIV instruction issued to field 3 in instruction word 2 is a time-consuming operation. Here too, therefore, no new instruction performing substantive processing is issued to field 3 of FIG. 2, and a NOP is inserted instead. This NOP in field 3 is ignored when the processing of FIG. 3 is executed.)

[0042] In each of fields 0 to 2, the operands of the issued instructions (ADD, SUB, ADD) are checked for availability (ST110 to ST113). Fields 0 to 2 of instruction word 3 completed simultaneously in 2 clocks, so all the operands (R1 to R4) of fields 0 to 2 required by the issued instructions (ADD, SUB, ADD) are available (ST114 to ST116, Yes). Field 3 is still processing the DIV of instruction word 2 in its ALUs, and the operands (R3, R4, R8) for this DIV instruction are available (ST117, Yes).

[0043] Once all the operands of fields 0 to 3 are available (ST120, Yes), a check is made as to whether the hardware resources needed for the operations of fields 0 to 3 exist (ST130 to ST133). The hardware resources required by the issued instructions (ADD, SUB, ADD) and by the instruction in progress (DIV) are not blocked, so all of them are available (ST134 to ST137, Yes).

[0044] Once all the hardware resources of fields 0 to 3 are available (ST140, Yes), the issued instructions (ADD, SUB, ADD) and the instruction in progress (DIV) are processed in parallel in fields 0 to 3, proceeding simultaneously (ST150 to ST153). Instruction word 4 of FIG. 3 is processed in this way.

[0045] FIG. 7 is a block diagram explaining the configuration of a VLIW parallel arithmetic processing unit according to one embodiment of the present invention. The unit of this embodiment comprises a register file 200, an instruction buffer unit 201, an arithmetic unit 203, a register comparison circuit 205, an operation-designation decoder 206, an OR circuit 207, an AND circuit 209, and an OR circuit 210.

[0046] The instruction buffer unit 201 stores VLIWs, each consisting of a group of instruction words containing multiple operations as shown in FIG. 1. The instruction buffer unit 201 includes an instruction output controller 201A, which outputs the VLIW held in an instruction buffer 201B in accordance with a signal 211 described later, and a buffer 201C, which stores the instruction to be executed after the one in the instruction buffer 201B.

[0047] The VLIW stored in the instruction buffer unit 201 is transferred to the arithmetic unit 203 via a signal line 202A. The arithmetic unit 203 processes the multiple operations contained in the transferred VLIW in parallel in four pipelines (fields 0 to 3), using the register file 200. For this purpose, each of these pipelines has several (for example, two) arithmetic units (ALUs) of its own.

[0048] The register number Rj (j is, for example, 1 to 8) of the register in the register file 200 that is to store the result of the operation currently being executed in the arithmetic unit 203 is supplied to the register comparison circuit 205 via a signal line 204i (i = 0, 1, 2, 3). The register numbers referenced by the VLIW about to be executed next are also supplied to the register comparison circuit 205, via a signal line 202B. The register comparison circuit 205 compares the former register number with the latter register numbers and checks whether they match.

[0049] That is, the register comparison circuit 205 consists of four comparison blocks 0 to 3, which correspond to fields 0 to 3 of the arithmetic unit 203, respectively. Each block i (i = 0 to 3) checks whether the register number Rj (j = 1 to 8) that is to store the result of the operation currently being executed in field i matches any of the register numbers referenced by the next instruction word.

[0050] The operation-designation decoder 206 receives the instruction word about to be executed next from the instruction buffer 201B via the signal line 202B. From the given instruction word, the decoder 206 detects any field i in which an operation other than NOP is specified. If such a field i is detected, the decoder 206 outputs a signal 206i that makes execution of the next instruction word wait until the arithmetic unit of that field i becomes free.

[0051] The OR circuit 207 takes the logical OR of the output 205i of each block i of the register comparison circuit 205 and the output 206i of the operation-designation decoder 206 for each field i. The AND circuit 209 takes the logical AND of the output 207i of each OR gate i of the OR circuit 207 and a signal 208i. The signal 208i is taken from each field i of the arithmetic unit 203 and indicates, for each field i, that the operation currently being executed has not yet completed.

[0052] The output 209i of each AND gate i of the AND circuit 209 is input to the OR circuit 210. The output of the OR circuit 210 is sent to the instruction output controller 201A via the signal line 211.
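
The gate network formed by the circuits 207, 209 and 210 reduces to a few bitwise operations. The sketch below is a software analogy under assumed names (`wait_signal` and its arguments are not from the source); it takes the per-field signals 205i, 206i and 208i as lists of 0/1 values.

```python
def wait_signal(out_205, out_206, out_208):
    """Return (signal on line 211, outputs 207i, outputs 209i)."""
    # OR circuit 207: the next word reads field i's pending result (205i)
    # or needs field i's arithmetic unit (206i).
    out_207 = [a | b for a, b in zip(out_205, out_206)]
    # AND circuit 209: the conflict matters only while field i is busy (208i).
    out_209 = [a & b for a, b in zip(out_207, out_208)]
    # OR circuit 210: a conflict in any field holds back the whole VLIW.
    out_211 = int(any(out_209))
    return out_211, out_207, out_209


# Hypothetical case 1: field 2 is still busy and the next word needs it.
print(wait_signal([0, 0, 1, 0], [1, 1, 1, 0], [0, 0, 1, 1]))
# (1, [1, 1, 1, 0], [0, 0, 1, 0])

# Hypothetical case 2: only field 3 is busy, and the next word neither
# uses field 3 nor reads its result, so no wait is requested.
print(wait_signal([0, 0, 0, 0], [1, 1, 1, 0], [0, 0, 0, 1]))
# (0, [1, 1, 1, 0], [0, 0, 0, 0])
```

The second case is the one the design is meant to exploit: a long-running operation in one field does not hold back an instruction word that has no conflict with it.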

[0053] Next, the operation of the VLIW-type arithmetic processing unit configured as above will be described with reference to FIGS. 8 and 9. (This operation is carried out by hardware logic, not by a microprogram.)

[0054] The VLIW-type arithmetic processing unit of this embodiment has the following feature. In principle it allows a plurality of instruction words to be processed in parallel. However, when the register number (operand) in the register file 200 that will receive the result of an instruction word currently executing matches a register number (operand) referenced by the instruction word to be executed next, and the operation that stores its result in the matching register has not yet completed, the instruction fetch from the instruction buffer unit 201 to the arithmetic unit 203 is made to wait.

[0055] When one VLIW is fetched from the instruction buffer unit 201 into the arithmetic unit 203 via the signal line 202A (ST10) and processed, the register number (Rj) that will receive the result of the operation currently executing in field i of the arithmetic unit 203 is taken from the signal line 204i, and the register numbers referenced by all the instruction words in the VLIW to be executed next are taken from the signal line 202B of the instruction buffer 201B. The comparison circuit 205 checks whether the register numbers taken from the signal lines 204i and 202B match.

[0056] The four blocks 0 to 3 of the comparison circuit 205 correspond to fields 0 to 3 of the arithmetic unit 203. Each block i checks whether the register number that will receive the result of the operation currently executing in field i matches any of the register numbers referenced by the instruction words in the next VLIW. If there is a match (ST14, YES), the output 205i of the comparison circuit 205 becomes "1" (ST16); if there is no match (ST14, NO), the output 205i becomes "0" (ST18).

[0057] The match output 205i from the comparison circuit 205 is applied to one input of the corresponding OR gate 0 to 3 of the OR circuit 207. The output 206i from the operation designation decoder 206 is applied to the other input of each OR gate 0 to 3.

[0058] The operation designation decoder 206 detects NOPs in the instruction word to be executed next, together with the fields i in which they are specified (ST20). When a NOP is detected in field i (ST22, YES), an output 206i of logic "0" is applied to the OR gate corresponding to that field i (ST24). If no NOP is detected (ST22, NO), the output 206i becomes "1" (ST26).

[0059] If either the output 205i or the output 206i is "1" (ST28, YES), the output 207i of the corresponding OR gate 0 to 3 of the OR circuit 207 becomes "1" (ST30). An output 207i of "1" means either that "the instruction word to be executed next uses the hardware resource for field i" or that "the instruction word to be executed next references the result of the operation currently executing in field i". This output 207i is applied to one input of the corresponding AND gate of the AND circuit 209.

[0060] Meanwhile, if the operation currently executing in field i has not finished within the current processing cycle (ST32, YES), the output 208i of that unfinished field i becomes "1" (ST34). This output 208i is applied to the other input of the corresponding AND gate of the AND circuit 209.

[0061] If the outputs 207i and 208i for field i are both "1" (ST36, YES), the output 209i of the AND gate of the AND circuit 209 corresponding to field i becomes "1" (ST38). An output 209i of "1" means "wait until the operation currently executing in field i finishes".

[0062] If the output 209i of any of the AND gates 0 to 3 of the AND circuit 209 is true (logic "1") (ST40, YES), the output of the OR circuit 210 becomes true (logic "1") (ST42). A wait signal (logic "1") is then sent via the signal line 211 to the output controller 201A of the instruction buffer unit 201, and the instruction buffer unit 201 stops the batch transfer of the VLIW to the arithmetic unit 203.

[0063] When the operations in all of fields 0 to 3 have completed, the outputs 208i all become "0" (ST44, YES). The outputs 209i of the AND circuit 209 then all become "0" regardless of the outputs 207i of the OR circuit 207, so the output on the signal line 211 also becomes "0" (ST46). The instruction transfer inhibition in the instruction output controller 201A is thereby released, and the next VLIW in the buffer 201B is fetched into the arithmetic unit 203 (ST48).

[0064] Assuming that instruction words 1 to 4 as shown in FIG. 2 are stored in the instruction buffer unit 201 and are to be executed in sequence, the VLIW-type arithmetic processing unit of the above embodiment processes them as follows.

[0065] Consider, at the compiler level, the reference relationships among instruction words 2, 3, and 4. Instruction word 2 performs four operations (two additions, one subtraction, and one division) and stores the results in registers R1, R3, R4, and R8. Of these, R1, R3, and R4 are read by instruction word 3, but R8 is referenced by neither instruction word 3 nor instruction word 4. R8 holds the division result, which takes four clocks to produce, but instruction words 3 and 4 need not wait for the division to complete.

[0066] Instruction word 3 executes three operations in total (two additions and one multiplication) and stores the results in R1, R3, and R4. These registers R1, R3, and R4 are read by instruction word 4; in particular, R4, which receives the multiplication result, is referenced by instruction word 4. Consequently, although instruction word 4 immediately follows instruction word 3, it must wait for the multiplication in instruction word 3 to complete.

[0067] Therefore, when instruction words 1 to 4 shown in FIG. 2 are processed in sequence, instruction words 3 and 4 can be executed while the division of instruction word 2 is still in progress, as shown in FIG. 3. However, because instruction word 4 references the multiplication result of instruction word 3, it is executed only after that result becomes available. In this case instruction words 3 and 4 together take three clocks, but since they run in parallel with the division of instruction word 2, instruction word 2 and instruction word 4 complete at the same time.

[0068] Thus, when instruction words 1 to 4 as shown in FIG. 2 are executed, the processing and timing are as shown in FIG. 3. In FIG. 3, the total processing time (required number of clocks) for instruction words 1 to 4 is five clocks, three fewer than the eight clocks in the case of FIG. 2.
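
The clock counts quoted above can be checked with simple arithmetic. The per-word clock counts below are assumptions: only the four-clock division is stated in the text, while the one-clock figure for instruction words 1 and 4 and the two-clock figure for the multiplication in instruction word 3 are inferred from the stated totals.

```python
# Assumed clocks per instruction word (longest operation in each word).
# Only the 4-clock division of word 2 is stated in the text; the other
# values are inferred from the totals of 8 and 5 clocks.
word_clocks = {1: 1, 2: 4, 3: 2, 4: 1}

# Without overlap: each word starts only after the previous one has finished.
serial_total = sum(word_clocks.values())

# With overlap: word 3 is issued one clock after word 2, and word 4
# follows once the multiplication of word 3 has finished.
start2 = word_clocks[1]
end2 = start2 + word_clocks[2]
start3 = start2 + 1
end4 = start3 + word_clocks[3] + word_clocks[4]
overlapped_total = max(end2, end4)

print(serial_total, overlapped_total)  # 8 5
```

Under these assumptions `end2` and `end4` are both 5, which agrees with the statement that instruction words 2 and 4 complete at the same time.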

[0069] When instruction words 1 to 4 shown in FIG. 4 are processed in sequence, the processing and timing are as shown in FIG. 5. In FIG. 2, field 3 of instruction word 4 is a NOP, whereas in FIG. 4, field 3 of instruction word 4 specifies an addition. Otherwise the instruction words of FIG. 2 and FIG. 4 are identical. The effect of this difference in field 3 of instruction word 4 on the processing is described below.

[0070] In the arithmetic unit 203, each of fields 0 to 3 can perform any one of addition, subtraction, multiplication, division, or NOP, but can process only one type of operation at a time. Instruction word 4 of FIG. 4 therefore cannot be executed until the division in field 3 of instruction word 2 has completed. For this reason, when a subsequent instruction word tries to use an arithmetic unit that is still executing a preceding operation, control is needed to make the subsequent operation wait until the preceding operation completes. That is, execution of instruction word 4 of FIG. 4 should start after the division of instruction word 2 has completed. When instruction words 1 to 4 shown in FIG. 4 are processed in sequence under this control, the result is as shown in FIG. 5, and the total execution time from instruction word 1 through instruction word 4 is six clocks.

[0071] If each of fields 0 to 3 of the arithmetic unit 203 were configured so that different types of operations could be executed simultaneously, instruction word 4 of FIG. 4 could be processed in parallel with the division of instruction word 2, and the total execution time of the instruction sequence of FIG. 4 would be only five clocks.
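
The six-clock and five-clock figures can be reproduced with a small cycle-level model of the issue control. This is a sketch under stated assumptions, not the disclosed hardware: the latencies (other than the four-clock division), the contents of instruction word 1, the source registers, and the placement of operations in fields 0 to 2 are hypothetical, chosen only to be consistent with the description of FIG. 4.

```python
LATENCY = {"add": 1, "sub": 1, "mul": 2, "div": 4}  # assumed clocks per operation


def run(program, check_unit_busy=True):
    """Return the number of clocks needed to run a list of instruction words.

    Each word is a list of four fields; a field is None (NOP) or a tuple
    (operation, destination register, source registers).
    check_unit_busy=False drops the "arithmetic unit still in use" check,
    modelling fields that could run several operations at once.
    """
    in_flight = []  # (field, destination register, clock at which it finishes)
    clock = 0
    for word in program:
        refs = {r for f in word if f for r in f[2]}
        while True:
            pending = [(i, d) for i, d, end in in_flight if end > clock]
            busy = {i for i, _ in pending}                # fields with 208i = 1
            match = {i for i, d in pending if d in refs}  # fields with 205i = 1
            wanted = set()                                # fields with 206i = 1
            if check_unit_busy:
                wanted = {i for i, f in enumerate(word) if f}
            if not ((match | wanted) & busy):             # every 209i is 0
                break
            clock += 1                                    # hold the word one clock
        for i, f in enumerate(word):
            if f:
                in_flight.append((i, f[1], clock + LATENCY[f[0]]))
        clock += 1
    return max((end for _, _, end in in_flight), default=0)


# Hypothetical program consistent with the description of FIG. 4.
word1 = [("add", 2, ()), ("add", 5, ()), ("add", 6, ()), ("add", 7, ())]
word2 = [("add", 1, (2, 5)), ("add", 3, (5, 6)), ("sub", 4, (6, 7)), ("div", 8, (2, 7))]
word3 = [("add", 1, (1, 3)), ("add", 3, (3, 4)), ("mul", 4, (1, 4)), None]
word4 = [("add", 1, (1, 4)), ("add", 3, (3, 4)), ("add", 4, (1, 4)), ("add", 5, (1, 3))]
program = [word1, word2, word3, word4]

print(run(program))                         # 6
print(run(program, check_unit_busy=False))  # 5
```

Replacing field 3 of `word4` with `None` gives the FIG. 2 sequence, for which the same model returns 5 clocks.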

[0072] As described above, the embodiment of the present invention introduces control that makes a subsequent instruction word wait until the operation completes and its result is final in two cases: when the subsequent instruction word tries to reference a register whose content is not yet determined because a preceding operation is still executing, and when it tries to use an arithmetic unit that is still executing an operation. In all other cases, a subsequent instruction word can in principle be executed without waiting for the operation of a preceding instruction word to complete, even while that operation is still executing, which improves parallel processing capability.

[0073] FIG. 10 shows a VLIW parallel arithmetic processing unit according to another embodiment of the present invention. The embodiment of FIG. 10 is the embodiment of FIG. 7 with the component that detects NOP instructions in the VLIW to be executed next (the operation designation decoder 206) omitted.

[0074] In the configuration of FIG. 10, the operand reference relationships among the VLIWs are taken into account at the compiler level, and the following control is performed to ensure parallelism of the processing in fields 0 to 3.

[0075] That is, if a previously started operation (for example, a multiplication) that will store its result in an operand (register number) referenced by a given instruction word has not yet completed, that instruction word waits for it to complete. Otherwise, the subsequent instruction word is executed in parallel with any previously started operation (for example, a division) that is still in progress.
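
With the decoder 206 omitted, the wait condition plausibly reduces to a per-field AND of the match signal 205i and the incomplete signal 208i, which is also how the claim words it. The sketch below assumes that reading; the function name `wait_signal_fig10` and the example values are hypothetical.

```python
def wait_signal_fig10(out_205, out_208):
    """Wait signal when only register matches are checked (no decoder 206)."""
    out_209 = [m & u for m, u in zip(out_205, out_208)]
    return int(any(out_209))


# Hypothetical: a multiplication in field 2 will write R4, which the next
# word reads, while a division in field 3 writes R8, which it does not.
print(wait_signal_fig10([0, 0, 1, 0], [0, 0, 1, 1]))  # 1: wait for the multiplication
# Later the multiplication is done; only the division is still pending.
print(wait_signal_fig10([0, 0, 0, 0], [0, 0, 0, 1]))  # 0: issue alongside the division
```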

[0076] As a result, in FIG. 3 for example, other operations can be executed in parallel in fields 0 to 2 while the division is in progress in field 3. In other words, in the operand fetch stage of the VLIW, the instruction words are issued synchronously and all at once to all fields, but in the execution stage each field processes its instruction word independently and in parallel, so that idle time of the arithmetic means (ALU, etc.) of each field is reduced as far as possible.

[0077]

[Effects of the Invention] As described above, in the present invention, when the register number that will receive the result of a preceding instruction word is not referenced by the operation of a subsequent instruction word (or when the arithmetic unit in use by the preceding instruction word is not used by the subsequent instruction word), the subsequent instruction word is executed in parallel without waiting for the preceding instruction word to complete. Parallel processing efficiency is therefore high, and arithmetic processing as a whole is sped up.

[0078] Furthermore, with the above configuration, the processing capability of the unit of the present invention can be improved further by providing a compiler that generates code in which an instruction word that references the result of a time-consuming operation is placed as far as possible from the instruction word that performs that operation.
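
Why such spacing helps can be seen with a deliberately simplified issue model: at most one instruction word is issued per clock, and a word waits only for source registers that are still being written. The model, the function name `total_clocks`, and the registers and latencies in the example are assumptions for illustration, not a description of any actual compiler.

```python
def total_clocks(words):
    """words: list of (destination register, source registers, latency)."""
    ready = {}   # register -> clock at which its value becomes available
    clock = 0
    finish = 0
    for dest, sources, latency in words:
        # Wait until every source register has been written.
        clock = max([clock] + [ready.get(r, 0) for r in sources])
        ready[dest] = clock + latency
        finish = max(finish, ready[dest])
        clock += 1  # the next word is issued one clock later at the earliest
    return finish


div = (8, (), 4)       # hypothetical 4-clock division writing R8
use = (9, (8,), 1)     # hypothetical word that reads R8
other_a = (1, (), 1)   # independent one-clock words
other_b = (2, (), 1)

print(total_clocks([div, use, other_a, other_b]))  # 7
print(total_clocks([div, other_a, other_b, use]))  # 5
```

Moving the dependent word away from the division lets the independent words fill the clocks that would otherwise be spent waiting.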

[Brief Description of the Drawings]

FIG. 1 is a diagram showing an example of a VLIW instruction.

FIG. 2 is a diagram illustrating an example of VLIW parallel arithmetic processing when the present invention is not applied.

FIG. 3 is a diagram illustrating an example of VLIW parallel arithmetic processing when the present invention is applied.

FIG. 4 is a diagram illustrating another example of VLIW parallel arithmetic processing when the present invention is not applied.

FIG. 5 is a diagram illustrating another example of VLIW parallel arithmetic processing when the present invention is applied.

FIG. 6 is a flowchart illustrating the flow of VLIW parallel arithmetic processing according to the present invention.

FIG. 7 is a block diagram illustrating the configuration of a VLIW parallel arithmetic processing unit according to one embodiment of the present invention.

FIG. 8 is one part of a flowchart explaining the operation of VLIW parallel arithmetic processing by the unit of FIG. 7.

FIG. 9 is the other part of the flowchart explaining the operation of VLIW parallel arithmetic processing by the unit of FIG. 7.

FIG. 10 is a block diagram showing the configuration of a VLIW parallel arithmetic processing unit according to another embodiment of the present invention.

[Explanation of symbols]

200: register file (hardware resource); 201: instruction buffer unit (storage means); 201A: instruction output controller (stop means); 201B, 201C: instruction buffers; 203: arithmetic unit (execution means); 205: register comparison circuit (providing means); 206: operation designation decoder (execution means); 207: OR circuit; 208: signal line (generating means); 209: AND circuit (output means); 210: OR circuit (output means); 205i: match signal; 208i: incomplete signal; 211: instruction fetch signal ("0") / instruction fetch inhibition signal ("1").

Claims

(57) [Claims]

Storage means for storing a VLIW including a plurality of instruction words; execution means having pipelines formed of a plurality of fields corresponding in number to the instruction words included in the VLIW, for independently executing, through these pipelines, the instruction words included in the VLIW; hardware resources referenced by the operands of the respective instruction words executed by the execution means; providing means for comparing a first operand, in which the execution result of the instruction word currently being executed in each field of the execution means is to be stored, with second operands referenced by the instruction words currently being executed in each field of the execution means, and for providing a match signal if the second operands include one that matches the first operand; generating means for generating an incomplete signal for each field of the execution means in which instruction execution has not completed; output means for taking, for each field of the execution means, the logical AND of the match signal from the providing means and the incomplete signal from the generating means, for outputting an instruction fetch inhibition signal when the logical AND is true for any of the fields, and for outputting an instruction fetch signal when the logical AND yields a false result; and means for simultaneously transferring one or more instruction words included in the VLIW from the storage means to the pipelines of the execution means when the output means outputs the instruction fetch signal, and for stopping the transfer of the instruction words of the VLIW from the storage means to the execution means when the output means outputs the instruction fetch inhibition signal.