JPH03214235A

JPH03214235A - Parallel processing method for plural paths

Info

Publication number: JPH03214235A
Application number: JP838490A
Authority: JP
Inventors: Yasuhiro Nakatsuka; 康弘中塚; Kenichi Kurosawa; 黒沢　憲一
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1990-01-19
Filing date: 1990-01-19
Publication date: 1991-09-19
Anticipated expiration: 2014-03-08
Also published as: JP2866421B2

Abstract

PURPOSE: To extract 100% of the parallelism in a program by providing a plurality of instruction fetch sections. CONSTITUTION: When one processing system executes a split instruction, the instruction fetch means of another processing system accesses the instruction cache memory 101 using either the instruction address set by the split instruction or an instruction address that has already been set. Parallel processing can therefore continue in each processing system even when a branch occurs within the instruction sequence being processed on either side. The processing systems that have started operating independently must eventually be narrowed back down to one. At that point a fusion instruction is executed by one instruction execution means 105 (107), which stops the instruction fetch means of the other processing system; the corresponding instruction execution means stops as well. Consequently, the branch instructions themselves can also be executed in parallel.

Description

DETAILED DESCRIPTION OF THE INVENTION [Field of Industrial Application] The present invention relates to a multi-path parallel processing method for processing a plurality of instructions simultaneously.

[Prior Art] Processing devices that execute multiple instructions simultaneously to speed up arithmetic processing are described in Japanese Patent Laid-Open No. 63-49843, U.S. Patent No. 4,766,566, and European Patent No. 87110751.2; these are the same invention filed in different countries. What they claim is the parallelization scheme that has recently come to be called superscalar, in which two instructions at consecutive addresses are executed at the same time. As described in Figs. 2-7 of U.S. Patent No. 4,766,566, the two instructions to be executed together are held in a single instruction buffer and issued simultaneously. Because consecutive instructions can no longer be executed once a branch occurs, when one execution unit executes a branch instruction the instruction buffer is cancelled and branch handling is performed.

[Problems to Be Solved by the Invention] The prior art described above suffers from an inherent loss of parallelism caused by branch instructions. A branch instruction in one execution unit cancels the instruction buffer, so the two execution units cannot operate independently at that point. Consequently, the range over which instruction-level parallelism can be optimized is limited to the code between an entry and the next branch instruction, where an entry is either a branch target or the instruction immediately following a branch instruction. Branch instructions are said to appear at a rate of at least one in every five instructions, so optimization over such a short range is almost meaningless. The result is that, despite doubling the amount of hardware, simultaneous execution of two instructions yields little or no performance improvement.

An object of the present invention is to provide a multi-path parallel processing method that eliminates this loss of parallelism caused by branch instructions and allows the branch instructions themselves to be executed in parallel.

[Means for Solving the Problems] To achieve the above object, the present invention provides a plurality of processing systems that can operate independently of one another, each consisting of pipelined instruction fetch means, decode means, and instruction execution means. It further provides a split instruction and a fusion instruction which, when executed by one processing system, respectively start and stop the operation of another processing system; a flag that indicates how far instruction execution has progressed; and a conditional pause instruction that refers to the flag to control instruction execution in each processing system.

[Operation] When one processing system executes a split instruction, the instruction fetch means of another processing system accesses the instruction cache memory using either the instruction address set by the split instruction or an instruction address that has already been set. Because each instruction fetch means has its own program counter, the processing systems operate essentially independently, so parallel processing continues in each of them even when a branch occurs within the instruction sequence being processed on either side.

Conversely, the processing systems that have begun operating independently in this way must at some point be narrowed back down to one. This is done by having one instruction execution means execute a fusion instruction, which stops the instruction fetch means of the other processing system. The instruction execution means stops along with it; that is, the pipelines are narrowed down. A pipeline stopped by a fusion instruction is restarted only by a split instruction from a running pipeline or by a timer check mechanism that runs at fixed intervals.

When there is a data dependency between the work of two processing systems, the compiler uses its knowledge of the required execution order to embed ordering instructions, namely conditional pause instructions, in the executable code. Each processing path has its own synchronization flag, which is reset by the split instruction and incremented by the conditional pause instruction. A conditional pause instruction decides whether the next operation must wait, based on the flags of the processing systems and the processing result of its own processing system at that time, so that each operation is carried out only after the data it needs has been produced.

[Embodiment] An embodiment of the present invention is described in detail below. FIG. 1 is a block diagram of a processor to which the present invention is applied. Instruction fetch units 100 and 102 each contain an independently operating program counter and issue instruction fetch requests to the instruction cache memory 101 over signal lines 111 and 112.

The cache memory 101 accepts these requests either one at a time, alternating between the two, or both simultaneously. When both are accepted simultaneously the corresponding data is output as is; when they are accepted alternately the data is buffered inside the cache memory 101. In either case data is output on signal lines 114 and 115 every cycle. Control blocks 103 and 104, each consisting of an independently operable decoder and sequencer, decode the instructions supplied by the cache memory 101. Based on the decoded information they control the instruction execution section described below in parallel over signal lines 118 and 119, and based on information such as branch instructions they also control the instruction fetch units 100 and 102 over signal lines 113 and 116 and data lines 122 and 128. The control blocks 103 and 104 are connected to each other by signal line 117, which is used, for example, when the two need to be synchronized. The instruction execution section consists of one register file 106 and two execution units 105 and 107. Part of the register file 106 and the execution unit 105 are controlled by control signal 118, and part of the register file 106 and the execution unit 107 are controlled by control signal 119. The parts of the register file 106 controlled by control signals 118 and 119 may or may not overlap. The execution units 105 and 107 have independent source data buses 120 and 121, respectively, and independent target data buses 124 and 126, respectively, for storing operation results in the register file 106.

Memory access is performed by passing the memory addresses calculated by the execution units 105 and 107 to memory access controllers 108 and 110, which issue access requests 129 and 130 to the data cache memory 109. The cache memory 109 accepts these requests alternately or simultaneously and supplies data to the register file 106 over the dedicated load/store bus 125.

Except in special cases, the resources on the left and right sides of FIG. 1 can operate independently, a configuration well suited to extracting the parallelism in a program and executing it concurrently. In the extreme case, entirely different programs can be run on the left and right sides. The internal structure of the components shared by both sides in FIG. 1, namely the instruction cache memory 101, the register file 106, and the data cache memory 109, may vary somewhat with the processor implementation and its intended use. The cache memories 101 and 109 may each be built from two cache memories or from a single multiport cache memory. Alternatively, a single cache memory that accepts fetch requests alternately can be combined with means in the instruction fetch units for buffering the fetched instructions, so that instructions appear to be fetched every cycle. The register file 106 may be a single register file fully shared by both sides, or it may consist of two completely independent register files. A partially shared arrangement is of course also possible.

Any shared portion must have the independent buses 120 and 121 that feed the left and right sides.

Next, the operation of the system realized in this way is described with reference to FIG. 2. The clock signal K1 is high during the first half of a machine cycle and low during the second half, and each part performs its assigned processing within this one cycle. The output signals 114 and 115 of the instruction cache memory 101 are each two instructions wide, so an instruction read is needed only once every two cycles. The output signal 114 is decoded by the control block 103 and becomes signal 118 in the next stage. This signal in turn controls data 120 in the following stage and data 124 in the stage after that. Dividing one instruction into a series of steps in this way forms a pipeline that processes instructions in parallel. The same applies to signal lines 115, 119, 121, and 126, which drive the right half, but the right half is assumed to be idle here.

To activate these signals, that is, the right half of FIG. 1, the processor has a special instruction or a special field embedded in the instruction code. Such an instruction is called here a split instruction (Fission Branch Instruction).

Suppose this split instruction is executed by the execution unit 105 in time slot S1. Control signal 201 in FIG. 2 then turns on, and the instruction fetch unit 102 is enabled. The unit 102 accesses the instruction cache memory 101 using the address set by the split instruction or an address that has already been set.

The pipeline in the right half of FIG. 1 thereby starts operating as well. The right and left halves of FIG. 1 then operate independently.

Conversely, pipelines that have begun operating independently in this way must eventually be narrowed back down to one. For this purpose, as with splitting, the processor has a special instruction or a special field embedded in the instruction code. Such an instruction is called here a fusion instruction (Fusion Branch Instruction). Suppose this fusion instruction is executed by the execution unit 105 in time slot S2. Control signal 202 in FIG. 2 then turns on, and the instruction fetch unit 100 stops operating. The control block 103 and the execution unit 105 stop along with it, so the left-half pipeline comes to a complete stop; that is, the pipelines are narrowed down to the right half only. A pipeline stopped by a fusion instruction is restarted only by a split instruction from a running pipeline or by a timer check mechanism that runs at fixed intervals. While parallel processing is stopped by the fusion instruction, the stopped processing system can also be put to use in the conventional superscalar manner.

Next, effective programming for such a processor is described with reference to FIGS. 3 and 4. A superscalar machine that processes two instructions at consecutive addresses simultaneously is taken as the basis for comparison. The superscalar approach is effective when the amount of hardware is constrained, for example when the machine is built from LSIs with a low degree of integration, but its restriction to two instructions at consecutive addresses keeps the degree of parallelism from improving. The speedup achieved by this approach is estimated at about 10 to 20 percent at most.

FIG. 3 shows eight instructions with differing execution cycle counts (OP301 to OP308) executed under both schemes. For simplicity, assume there are no data conflicts between the instructions. Four of the instructions, OP308 among them, take two cycles to execute, and the remaining instructions execute in one cycle. The left side of FIG. 3 shows the superscalar machine. Because a superscalar machine processes two instructions at consecutive addresses together, both instructions of a pair start executing at the same time. When the two instructions of a pair have different execution times, idle slots appear here and there, and the whole sequence takes eight cycles, time slots T1 to T8. The multi-path parallel processing method of the present invention, shown on the right side of FIG. 3, has no such restriction and finishes in six cycles, time slots T1 to T6.
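The FIG. 3 comparison can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the text states that four of the eight instructions, OP308 among them, take two cycles, but which other three do is an assumption made here (one two-cycle instruction per consecutive pair), as is the assignment of the first four instructions to one path and the last four to the other. The function names are hypothetical.

```python
# Assumed latencies: only "four two-cycle instructions, including OP308" comes
# from the text; the choice of OP302, OP304 and OP306 is an assumption.
latency = {
    "OP301": 1, "OP302": 2, "OP303": 1, "OP304": 2,
    "OP305": 1, "OP306": 2, "OP307": 1, "OP308": 2,
}
ops = list(latency)


def superscalar_cycles(ops, latency):
    """Consecutive pairs start together; the next pair waits for the slower one."""
    return sum(max(latency[a], latency[b]) for a, b in zip(ops[0::2], ops[1::2]))


def multipath_cycles(left_ops, right_ops, latency):
    """Each path runs its own instruction stream back to back."""
    left = sum(latency[o] for o in left_ops)
    right = sum(latency[o] for o in right_ops)
    return max(left, right)


superscalar = superscalar_cycles(ops, latency)
multipath = multipath_cycles(ops[:4], ops[4:], latency)  # assumed 4/4 split
print(superscalar, multipath)
```

Under these assumptions every superscalar pair is held up by its two-cycle member, while each independent path simply accumulates its own latencies, which gives the eight-cycle and six-cycle totals quoted in the text.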

The above compares a simple case without branches, but a more important difference is that the achievable degree of parallelization is fundamentally different. In a superscalar machine two consecutive instructions cannot always be executed simultaneously, so even with code generated by an optimizing compiler some portion inherently remains that cannot be parallelized. Branch instructions are a good example. A superscalar machine can execute only one branch at a time, and parallelism can be extracted only within the short stretch of code between a branch entry and the next branch instruction. Branch instructions are said to appear once in every five instructions, so parallelism has to be extracted from an average of four instructions. FIG. 4 shows an example of code containing branch instructions (OP401 to OP408).

OP403 and OP408 are conditional branch instructions that branch to OP401 and OP404, respectively. OP403 branches seven times and falls through once, and OP408 branches five times and falls through once. OP401 to OP403 are therefore executed eight times and OP404 to OP408 six times. A conditional branch takes four cycles when taken; a conditional branch that is not taken, and every other instruction, completes in one cycle.

With a conventional architecture that has a single pipeline (single mode), shown in FIG. 4(a), 54 instructions are executed and 12 branches are taken, for a total of 90 cycles (54 + 12 × 3, since each taken branch costs three cycles beyond its single issue cycle). In the superscalar mode shown in FIG. 4(b), OP401 and OP402, OP404 and OP405, and OP406 and OP407 can each be paired, so the number of instruction pairs issued is 34; with the same 12 taken branches as in single mode, the total is 70 cycles. With the multi-path parallel processing method of the present invention, shown in FIG. 4(c), the two loops can be assigned to separate pipelines, so the branch instructions are parallelized as well. OP401 to OP403 are assigned to the left-half pipeline of the processor and the rest to the right half. The left-half pipeline executes 24 instructions and takes 7 branches, for a total of 45 cycles. The right-half pipeline executes 30 instructions and takes 5 branches, again for a total of 45 cycles. The overall execution time is therefore 45 cycles, which is twice the parallelism of the 90 cycles of single mode. There is of course some overhead associated with the split branch, but it becomes negligible as the loop counts grow. This is why the present invention is fundamentally superior to the superscalar mode.
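The cycle counts quoted for FIG. 4 follow from the stated costs (four cycles for a taken branch, one cycle for everything else) and can be checked with a short calculation. The helper name `total_cycles` and the dictionary layout are illustrative choices, not part of the source; the loop lengths, iteration counts, and pairings are taken from the text.

```python
TAKEN_BRANCH_CYCLES = 4                      # cost of a taken conditional branch
EXTRA_PER_TAKEN = TAKEN_BRANCH_CYCLES - 1    # cycles beyond the normal issue cycle


def total_cycles(issue_slots: int, taken_branches: int) -> int:
    """One cycle per issue slot plus the extra cycles of every taken branch."""
    return issue_slots + EXTRA_PER_TAKEN * taken_branches


# Loop A is OP401-OP403 (8 iterations, 7 taken branches).
# Loop B is OP404-OP408 (6 iterations, 5 taken branches).
loop_a = {"length": 3, "iterations": 8, "taken": 7, "paired_slots": 2}
loop_b = {"length": 5, "iterations": 6, "taken": 5, "paired_slots": 3}
loops = (loop_a, loop_b)

instructions = sum(l["length"] * l["iterations"] for l in loops)        # 54
taken = sum(l["taken"] for l in loops)                                  # 12
pair_slots = sum(l["paired_slots"] * l["iterations"] for l in loops)    # 34

single = total_cycles(instructions, taken)
superscalar = total_cycles(pair_slots, taken)
left = total_cycles(loop_a["length"] * loop_a["iterations"], loop_a["taken"])
right = total_cycles(loop_b["length"] * loop_b["iterations"], loop_b["taken"])
multipath = max(left, right)

print(single, superscalar, left, right, multipath, single / multipath)
```

The calculation ignores the split and fusion overhead, as the text does when it notes that this overhead becomes negligible for large loop counts.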

When such a multi-path parallel processing method is used, the case where there is a data dependency between the two paths, that is, between the two pipelines, must also be considered. Such situations are extremely likely to arise, and without a countermeasure the inherent performance of the multi-path parallel processing method cannot be realized. FIG. 5 shows the flow of a program under the multi-path parallel processing method. The split instruction 501 divides the processor's pipeline into two, but the two paths operate in relation to each other. Procedure 504 writes to a first register or memory location (W1), which procedure 509 on the other path reads (R1). Procedure 509 in turn writes to a second register or memory location (W2), which procedure 512 reads (R2). Meanwhile, procedure 508 writes to a third register or memory location (W3), which procedure 513 reads (R3).

The pipelines are then merged into one by a fusion instruction. The execution order of the program flow in FIG. 5 is dictated by its data dependencies.

That is, procedure 504 must be executed before procedure 509, procedure 508 before procedure 513, and procedure 509 before procedure 512.

This execution order cannot be controlled by hardware alone, but the compiler that generates the program's code knows the required order and can embed ordering instructions in the code. This instruction is called the conditional pause instruction (Increment and Conditional Pause), and its operation is as follows. As explained above, procedure 504 must be executed before procedure 509, so the conditional pause instructions 506 and 507 are used to synchronize. The two paths have synchronization flags 502 and 503, respectively, which are reset by the split instruction 501 and incremented by conditional pause instructions. The conditional pause instruction 507 makes the execution of procedure 509 wait until procedure 504 has finished.

By the time procedure 504 has finished, the conditional pause instruction 506 has been executed and flag 502 is 1. The conditional pause instruction 507 sets flag 503 to 1 and compares it with flag 502, whose contents it can read over signal line 515. If flag 502 is 0, that is, if flag 502 is not equal to flag 503, the conditional pause instruction 507 stalls its pipeline and waits for the conditional pause instruction 506 to finish executing. The conditional pause instructions 510 and 511 work similarly, but because procedures 512 and 513 cannot start until procedures 509 and 508, respectively, have finished, they use the form of the conditional pause instruction whose condition is to stall the pipeline whenever its flag differs from the other path's flag.
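The flag protocol can be tried out in a cycle-stepped software model. The sketch below is an illustration, not the patented circuit: it applies the symmetric rule of claim 2 to every conditional pause (increment the own flag, then stall while the two flags differ), models each procedure as a fixed number of cycles, and assumes a procedure 505 on the right-hand path before pause 507, which the text does not describe. The function name `simulate` and the program encoding are hypothetical.

```python
def simulate(left_prog, right_prog, max_cycles=1000):
    """Cycle-stepped model of two paths synchronised by conditional pauses.

    A program is a list of ("work", name, cycles) or ("pause",) items.
    Returns (total_cycles, start, end) with per-procedure start and end cycles.
    """
    progs = [left_prog, right_prog]
    pc = [0, 0]               # index of the next item on each path
    flag = [0, 0]             # flags 502 and 503, reset by the split instruction
    bumped = [False, False]   # has the current pause already incremented its flag?
    remaining = [0, 0]        # cycles left in the current work item
    start, end = {}, {}

    for cycle in range(1, max_cycles + 1):
        active = [i for i in (0, 1) if pc[i] < len(progs[i])]
        if not active:
            return cycle - 1, start, end
        # Phase 1: a path reaching a pause increments its own flag exactly once.
        for i in active:
            if progs[i][pc[i]][0] == "pause" and not bumped[i]:
                flag[i] += 1
                bumped[i] = True
        flags_equal = flag[0] == flag[1]
        # Phase 2: paused paths proceed only when the flags match; others work.
        for i in active:
            op = progs[i][pc[i]]
            if op[0] == "pause":
                if flags_equal:
                    pc[i] += 1
                    bumped[i] = False
            else:
                _, name, cycles = op
                if remaining[i] == 0:
                    remaining[i] = cycles
                    start[name] = cycle
                remaining[i] -= 1
                if remaining[i] == 0:
                    end[name] = cycle
                    pc[i] += 1
    raise RuntimeError("paths never resynchronised")


# Left path: 504, pause 506, 508, pause 510, 512 (cycle counts are made up).
left = [("work", "504", 3), ("pause",), ("work", "508", 2), ("pause",), ("work", "512", 1)]
# Right path: assumed 505, pause 507, 509, pause 511, 513.
right = [("work", "505", 1), ("pause",), ("work", "509", 1), ("pause",), ("work", "513", 1)]

total, start, end = simulate(left, right)
print(total, start, end)
```

Stepping both paths in lockstep mirrors the clocked hardware; a naive multithreaded version of the same equality test could deadlock if one thread reached its next pause before the other had observed the match, which is why the sketch avoids threads.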

In this way the target program can be executed in parallel without malfunction.

[Effects of the Invention] According to the present invention, providing a plurality of instruction fetch sections eliminates the loss of parallelism caused by branch instructions and allows the branch instructions themselves to be parallelized, so that 100% of the parallelism in the program at hand can be extracted. In addition, by adopting instructions and hardware for synchronizing the plurality of instruction execution units, the execution order of instructions can be controlled without malfunction.

[Brief explanation of drawings]

FIG. 1 is a block diagram showing an embodiment of a processing device to which the method of the present invention is applied; FIG. 2 is an explanatory diagram of pipeline control in the processing device of FIG. 1; FIGS. 3 and 4 are diagrams showing processing examples under the conventional method and the method of the present invention; and FIG. 5 is an explanatory diagram of the execution order control method used when there are data dependencies between instructions. 100, 102: instruction fetch units; 103, 104: control blocks; 105, 107: execution units; 502, 503: flags.

Claims

[Scope of Claims]

1. A multi-path parallel processing method characterized in that: a plurality of independently operable processing systems are provided, each comprising instruction fetch means for fetching an instruction from a memory, decode means for decoding the instruction fetched by that means, and execution means for executing the instruction decoded by that means; a split instruction and a fusion instruction are provided; when the instruction sequence being processed by one processing system contains instructions that can be processed simultaneously, that processing system executes the split instruction to start simultaneous parallel processing by another processing system; and when the simultaneous parallel processing of the plural processing systems is to be consolidated into processing by one processing system, that one processing system executes the fusion instruction to stop the operation of the processing systems, other than itself, that were performing the simultaneous parallel processing.

2. The multi-path parallel processing method according to claim 1, wherein, when simultaneous parallel processing is performed by the plurality of processing systems, a flag is provided for each of the processing systems and conditional pause instructions are inserted into the instruction sequences to be processed by the processing systems; the flag of each processing system is cleared by the split instruction that starts the simultaneous parallel processing; and each processing system, on executing a conditional pause instruction, increments its own flag by one and checks the flag of the other processing system, and if the flag values of its own and the other processing system differ, suspends processing of its next instruction until the values match.

3. The multi-path parallel processing method according to claim 1, wherein, when the fusion instruction has been executed and the consecutive instructions of the processing continued by the processing system that executed the fusion instruction contain no branch instruction, part of those consecutive instructions is processed simultaneously and in parallel by the processing system whose operation was stopped by the fusion instruction.