JP2002182928A

JP2002182928A - Instruction emulation method

Info

Publication number: JP2002182928A
Application number: JP2000377145A
Authority: JP
Inventors: Yasuto Omiya; 康人近江谷
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2000-12-12
Filing date: 2000-12-12
Publication date: 2002-06-28

Abstract

PROBLEM TO BE SOLVED: To increase instruction simulation speed by shortening the time spent in the pipeline mechanism and raising the utilization of the superscalar mechanism during instruction simulation. SOLUTION: The instruction simulation comprises: a step of copying the processing of the next instruction after the processing of the current instruction being simulated; a step of obtaining the critical path (the long instruction processing), that is, the one requiring the longer time, by comparing the time taken when processing variables are allocated to the current-instruction processing and resources are occupied to execute it with the time taken when different processing variables are allocated to the next-instruction processing and resources are occupied to execute it; and a step of, where the other (short) instruction processing uses resources that do not overlap during idle execution cycles within the steps of the long instruction processing, allocating the steps of the short instruction processing to that idle time and executing them in parallel.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to computer software, and in particular to a processing method for realizing the operation of a computer having one instruction set (computer B) on a computer having a different instruction set (computer A).

[0002]

[Prior Art] Conventionally, for each computer model, manufacturers defined instruction specifications, such as machine instruction codes and their functions, together with common specifications that do not depend on individual instructions, and developed processors (central processing units) that were program-compatible within that model. Meanwhile, as processors have moved to LSI, a processor can be built from one or a few chips and can execute instructions at a high clock rate; on the other hand, LSI development is enormously expensive, and the development cost cannot be recovered unless a large number of units are shipped. Furthermore, the performance of microprocessors used as industry standards continues to improve, and the performance gap relative to dedicated processors with proprietary instruction sets is widening. Although there is a major trend toward migrating to industry-standard processors, the operating systems that run on them are not necessarily all-purpose, and there are concerns about how well they support user applications. In addition, there is a need to respect the user's assets, namely user applications, user data, and the operating environment, and accordingly to keep conventional computers in service while lowering their price. As a solution, techniques such as "instruction simulation", "instruction emulation", and "instruction conversion", which use an industry-standard processor to simulate the execution of a computer of conventional specification, have recently come into wide use.

[0003] There are two approaches to simulating the instruction behavior of a computer: one uses software, the other hardware. The first, software approach is a computer simulation technique in which a computer with one instruction set (B) is simulated by a program on a computer with a different instruction set (A). It has long been used for analyzing computer characteristics, debugging programs, and developing software for microprocessors in a host-computer environment.

[0004] The second, hardware approach is mainly used in the central processing unit of a computer that has multiple instruction sets. That is, hardware optimized for instruction set A is used, and only an instruction decoder, a microprogram, and a very small amount of additional hardware are provided for another instruction set B. This method is called "emulation". With the advent of horizontally microprogrammable microprocessors, it has also been used to implement minicomputer processors. This emulation method is ten to several tens of times faster than simulation by software, and it generally achieved a fraction of the performance of a dedicated circuit.

[0005] However, with advances in LSI technology, pipeline circuits and RISC (Reduced Instruction Set Computer) architectures came to be adopted as ways to raise processing speed. As a result, the microprogram can no longer be changed by anyone other than the LSI manufacturer, and adding external circuitry increases the delay between circuits, increases the clock count, and increases the overhead caused by pipeline disturbance, so that effective emulation that exploits the performance of the processor is no longer possible. On the other hand, simulated execution of instructions by software (hereinafter "instruction simulation") has the drawback that the nature of the processing prevents it from exploiting the high-speed hardware of the processor, which has hindered faster instruction simulation. This problem is described below.

[0006] FIG. 14 shows the flow of a general instruction simulation, and FIG. 15 is a flowchart of a simple program example that simulates one instruction. Although some instructions perform complex operations, analysis of actual core business programs shows that loads from memory to registers, stores from registers to memory, and branches occur very frequently. It is also known empirically that, even within the processing of those instructions, common processing that does not depend on the individual instruction, such as instruction decoding and memory operand address calculation, accounts for a large share of the instruction count.

[0007] Decoding an instruction requires branching that depends on data called the instruction code, which is formed from specific fields of the instruction word. High-performance processors prefetch the instruction stream at a branch target, but several consecutive branches incur a large overhead from pipeline disturbance. In addition, because the same decoding instruction sequence is used over and over, the speedup obtained from retaining branch history is weakened. Furthermore, a scheme that determines the branch target by using the instruction code value as an offset stalls or restarts the pipeline and consumes many clocks.

[0008] Within instruction decoding, determining which register to use requires bit-shifting and ANDing the instruction code and then accessing the data array in memory that holds the register values; the shift, the AND, and the memory access must be performed sequentially. This sequential operation causes register interference in the pipeline processing of a high-performance processor. It also makes it hard to exploit the advantage of superscalar execution, a technique for executing multiple instructions simultaneously.
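
The sequential dependency described in this paragraph can be illustrated with a minimal sketch. The 20-bit layout used here (an 8-bit instruction code followed by three 4-bit register fields) and the names `REGS` and `read_r1` are assumptions made only for illustration; the source does not give a concrete encoding at this point.

```python
# Hypothetical layout (an assumption, not from the source):
# bits 19..12 = instruction code, bits 11..8 = r1, bits 7..4 = r2, bits 3..0 = r3.
REGS = [0] * 16  # registers of computer B, held as an array in computer A's memory


def read_r1(instruction_word: int) -> int:
    shifted = instruction_word >> 8  # step 1: shift the r1 field down
    index = shifted & 0xF            # step 2: the AND needs the result of the shift
    return REGS[index]               # step 3: the memory access needs the result of the AND


REGS[3] = 1234
word = (0x58 << 12) | (3 << 8)       # instruction code 0x58 is an arbitrary example value
value = read_r1(word)
```

Each of the three steps consumes the result of the one before it, so a dual-issue host cannot run them side by side; this is the chain that the paragraph identifies as the obstacle.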

[0009] On the other hand, although its processing flow is not illustrated, there is also a scheme that decodes the instruction fields indicating register numbers together with the instruction code, increasing the number of branch target addresses so that the memory address corresponding to each register number is fixed in advance rather than computed dynamically. This scheme has the advantage that pipeline disturbance is small in principle. However, the memory area occupied by the instructions grows, and cache-miss overhead increases markedly. For example, as in FIG. 16, with an 8-bit instruction field and three 4-bit register fields, a decode width of 8 bits fits in the primary cache and a width of 12 bits fits in the secondary cache, but anything wider incurs heavy cache-miss overhead. Thus, whether one accepts pipeline or branch overhead or accepts cache-miss overhead, the speed of the processor cannot be exploited.
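
The growth in branch targets behind this trade-off is simple arithmetic. The sketch below counts the distinct targets for the FIG. 16 field widths given in the text; the function name is hypothetical, and no routine size in bytes is assumed because the source does not state one.

```python
OPCODE_BITS = 8          # instruction field width, from the text
REGISTER_FIELD_BITS = 4  # width of each register field, from the text


def dispatch_targets(register_fields_decoded: int) -> int:
    """Number of distinct branch targets when the instruction code is decoded
    together with the given number of 4-bit register fields."""
    width = OPCODE_BITS + REGISTER_FIELD_BITS * register_fields_decoded
    return 2 ** width


for fields in range(4):
    width = OPCODE_BITS + REGISTER_FIELD_BITS * fields
    print(width, "bits:", dispatch_targets(fields), "targets")
```

Decoding all three register fields together with the instruction code gives a 20-bit decode width, far beyond the 12 bits that the text says still fits in the secondary cache.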

[0010] With reference to FIGS. 17 to 20, an example of hardware behavior is described for the case where a processor (A) with a superscalar mechanism runs an instruction simulation of a computer (B) with a different instruction set. FIG. 17 shows, in pseudocode, the format and function of B's "load instruction" (hereinafter "L instruction"), and FIG. 18 shows the processing flow of the instruction simulation performed on A and how A's registers are used. The letters A to Y in the figures each correspond to one instruction (unit process) of processor A. Processor A is assumed to have the following hardware.
a) The instruction fetch unit, the instruction decode unit, and the execution units operate as a pipeline. The execution units comprise two arithmetic execution units and one memory execution unit.
b) The arithmetic units each execute a register-to-register operation in one cycle, independently and in parallel.
c) There is one memory execution unit, consisting of a three-stage pipeline: address calculation; address translation and cache tag memory access; and data access.
d) The instruction cache and the data cache are independent.
e) When register interference occurs, the execution units wait for one another, and the result is forwarded by bypass.
f) The instruction decode unit can dispatch up to two instructions simultaneously. Dispatch never overtakes an earlier instruction; the execution order may be reversed, but the hardware guarantees consistency so that the sequential execution order of the program is preserved.
g) Branches are processed by an arithmetic execution unit, which accesses the tag memory of the instruction cache in the execution stage.
h) The instruction register holds up to four instructions until they are passed to the instruction decode unit; fetch holds four instructions' worth and, when empty, reads the required amount from the instruction cache memory.
FIG. 21 describes the mnemonic codes, notation, and meaning of part of processor A's instruction set.

[0011] FIG. 19 shows the machine instruction sequence of processor A when it performs instruction simulation of B's L instruction. Instructions A to E in the figure are processing common to all of B's instructions; F, which is instruction-specific processing, need not be placed immediately after E. Processor A's instruction sequences for B's other instructions are omitted from the figure. FIG. 20 shows, with execution clock cycles running vertically, the state of the pipeline stages when processor A executes two consecutive L instructions of B using the instruction sequence of FIG. 19. The letters in FIGS. 19 and 20 are identifiers introduced for explanation and correspond to processor A's instructions; likewise, the numbers are clock cycle numbers. An identifier in parentheses indicates that the result of that instruction's pipeline stage operation is not effectively used in that cycle. In this example, simulating one L instruction of B takes 27 cycles. Thus only 25 instructions (see the figure) are executed in 27 cycles. Since the hardware of processor A can execute up to two instructions per cycle, the theoretical maximum is 54 instructions, of which only 46% is actually used.
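
The 46% figure follows directly from the numbers in this paragraph. The short calculation below uses only those values; the variable names are illustrative.

```python
cycles = 27                 # cycles to simulate one L instruction (from the text)
issue_width = 2             # instructions processor A can execute per cycle (from the text)
instructions_executed = 25  # instructions actually executed in those cycles (from the text)

peak = cycles * issue_width                # theoretical maximum: 54 instructions
utilization = instructions_executed / peak # fraction of the peak actually used

print(f"{instructions_executed}/{peak} = {utilization:.1%}")
```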

[0012]

[Problems to Be Solved by the Invention] As described above, when conventional instruction simulation is performed on a processor that has a different instruction set together with a pipeline structure and a superscalar mechanism, it is difficult, unlike with ordinary application programs, to exploit the speed of the processor because of the general characteristics of instruction simulation. In particular, when a general-purpose computer is the simulation target, instructions for which the superscalar function is easily exploited are used very rarely, whereas loads, stores, and operations on memory are frequent, and many cycles are needed to extract register information and calculate addresses, so that the cycle count grows and the execution time becomes longer.

[0013] The present invention has been made to solve the above problems. Its object is to shorten the time spent in the pipeline mechanism during instruction simulation, raise the utilization of the superscalar mechanism, and thereby improve instruction simulation speed.

[0014]

[Means for Solving the Problems] The instruction emulation method according to the present invention is a method of simulating a program written for another machine instruction set by using a simulation program composed of machine instructions of a processor having a pipeline mechanism and a multiple-instruction simultaneous execution mechanism (superscalar mechanism), the method comprising: a step of copying the processing of the next instruction after the processing of the current instruction being simulated; a step of obtaining the critical path (the long instruction processing), which requires the longer time, by comparing the time taken when processing variables are allocated to the current-instruction processing and resources are occupied for its execution with the time taken when processing variables different from those of the current instruction are allocated to the next-instruction processing and resources are occupied for its execution; and a step of, where the other, short instruction processing uses resources whose use does not overlap during idle execution cycles contained in the steps of the long instruction processing, allocating the steps of the short instruction processing to that idle time so that they execute in parallel.
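
The idle-time allocation step can be sketched as follows. This is a simplified illustration, not the method as claimed: it assumes the critical path has already been laid out cycle by cycle, uses the unit counts of hardware assumption a) (two arithmetic units, one memory unit, two instructions issued per cycle), places at most one step of the short stream per cycle to keep that stream in order, and ignores multi-cycle latencies. All names and the example data are hypothetical.

```python
from collections import Counter

CAPACITY = {"alu": 2, "mem": 1}  # execution units available in each cycle
ISSUE_WIDTH = 2                  # instructions dispatched per cycle


def fill_idle_slots(long_path, short_steps):
    """long_path: per-cycle list of resources used by the critical path
    (an empty list is a stall cycle).
    short_steps: in-order list of (name, resource) for the other processing.
    Returns (per-cycle name of the short step placed, or None; leftover steps)."""
    pending = list(short_steps)
    schedule = []
    for used in long_path:
        placed = None
        if pending and len(used) < ISSUE_WIDTH:
            name, resource = pending[0]
            # place the step only if its resource is not fully occupied this cycle
            if Counter(used)[resource] < CAPACITY[resource]:
                placed = name
                pending.pop(0)
        schedule.append(placed)
    return schedule, pending


long_path = [["mem"], [], [], ["alu"], ["mem"], [], [], ["alu", "alu"]]
short_steps = [("5a", "mem"), ("5b", "alu"), ("5c", "mem"), ("5d", "alu")]
schedule, leftover = fill_idle_slots(long_path, short_steps)
```

In this example the first short step cannot share cycle 0 because the single memory unit is busy, so it moves to the first stall cycle; all four short steps then fit inside the eight cycles of the long path without lengthening it.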

[0015] Furthermore, the method comprises: a step of, in a later step of the current-instruction processing, reading the next instruction word and comparing it with the instruction of the next-instruction processing that is already being executed in parallel; and a step of, when the compared instructions do not match, discarding the next-instruction processing currently being executed in parallel and performing the next-instruction processing anew.

[0016] Furthermore, the critical path is obtained by separating the instruction processing into its constituent unit processes and determining the order among the separated unit processes on the basis of their execution conditions; and the method comprises a step of, where unit processes other than the group of unit processes on the critical path use resources whose use does not overlap during idle execution cycles contained in the instruction execution steps of the unit-process group that gives the critical path, allocating the steps of those other unit processes to the idle time so that they execute in parallel.
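
One way to picture how a critical path is found among unit processes is a longest-path search over their ordering constraints. The sketch below is an illustration under assumptions: the unit names, latencies, and dependencies are invented for the example, and the source does not prescribe this particular algorithm.

```python
def critical_path(latency, deps):
    """latency: {unit: cycles}; deps: {unit: [units that must finish first]}.
    Returns (total_cycles, path) for the longest dependency chain."""
    finish = {}
    prev = {}

    def visit(unit):
        if unit in finish:
            return finish[unit]
        start, best = 0, None
        for d in deps.get(unit, []):
            t = visit(d)
            if t > start:
                start, best = t, d
        finish[unit] = start + latency[unit]
        prev[unit] = best
        return finish[unit]

    end = max(latency, key=visit)  # unit with the latest finish time
    path = []
    while end is not None:
        path.append(end)
        end = prev[end]
    return finish[path[0]], path[::-1]


# Hypothetical unit processes of one simulated load instruction.
latency = {"fetch": 3, "extract_op": 1, "extract_reg": 1, "table": 3,
           "reg_read": 3, "addr": 1, "load": 3, "branch": 1}
deps = {"extract_op": ["fetch"], "extract_reg": ["fetch"],
        "table": ["extract_op"], "reg_read": ["extract_reg"],
        "addr": ["reg_read"], "load": ["addr"], "branch": ["table"]}
total, path = critical_path(latency, deps)
```

Units that are not on the returned path (here `extract_op`, `table`, and `branch`) are the candidates for placement in the idle cycles of the critical path.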

[0017] Furthermore, when the instruction processing contains a branch step, the branch step is placed after both the current-instruction processing and the next-instruction processing.

[0018] Furthermore, when the resource is a pipelined arithmetic mechanism, steps are allocated on the basis that different register portions of the pipeline are not in overlapping use.

[0019]

[Embodiments of the Invention] Embodiment 1. An embodiment of the present invention is described with reference to the drawings. FIG. 1 shows a typical processing flow of a program that performs instruction simulation according to one embodiment of the invention. As shown in the figure, the instruction simulation method of this embodiment simulates the machine instruction processing of computer B by a program on processor A, which has pipeline and superscalar mechanisms. In this simulation, each instruction consists of processing classified into one of the following.
1) Set the required variables and constants in memory or registers; this is common processing that does not depend on the instruction code.
2) Extract the instruction code and process it.
3) Perform the following processing 4) to 6), which is provided for each instruction. That is:
4) Within the instruction-specific processing, process the instruction corresponding to the address indicated by the program counter.
5) Within the instruction-specific processing, fetch and process the instruction corresponding to the address that the conceptual program counter indicates next, or is predicted to indicate next.
6) Compare the instruction word re-fetched after a store to main memory with the instruction word speculatively fetched in advance.
7) Copy the next instruction word into the instruction word.
8) Loop processing that moves on to the next instruction after the processing for each instruction code.
9) After the comparison, replace the previously fetched instruction word with the re-fetched instruction word.

[0020] The functional operation of instruction simulation is described with reference to FIG. 1. In 2, shown as processes 2a to 2f in the figure, the common processing, which comprises instruction decoding and part of the execution, is performed by: fetching the instruction word (2a); extracting the instruction code field (2b); fetching from the 256-entry branch table corresponding to the 8-bit instruction code (2c); extracting the fields that indicate the instruction's register numbers (2d); speculatively fetching in advance the memory contents corresponding to registers that are, or may be, used by the instruction for address calculation or as data (2e); and branching to the routine specific to each instruction (2f).
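
Steps 2a to 2f can be sketched in code as follows. The 32-bit instruction layout (8-bit instruction code, 4-bit register field, 4-bit base register field, 16-bit displacement), the code value 0x58, the memory size, and all names are assumptions chosen for illustration; only the sequence of steps and the 256-entry table come from the text.

```python
MEMORY = bytearray(64)  # main memory of computer B (hypothetical size)
REGS = [0] * 16         # B's registers, held in A's memory


def op_load(word, r1, base_value):
    # instruction-specific routine: load a 4-byte value into register r1
    addr = base_value + (word & 0xFFFF)
    REGS[r1] = int.from_bytes(MEMORY[addr:addr + 4], "big")
    return "load"


def op_undefined(word, r1, base_value):
    return "undefined"


BRANCH_TABLE = [op_undefined] * 256  # one entry per 8-bit instruction code
BRANCH_TABLE[0x58] = op_load         # 0x58 is an arbitrary example code


def common_processing(pc):
    word = int.from_bytes(MEMORY[pc:pc + 4], "big")  # 2a: fetch the instruction word
    opcode = word >> 24                              # 2b: extract the instruction code
    handler = BRANCH_TABLE[opcode]                   # 2c: fetch from the branch table
    r1 = (word >> 20) & 0xF                          # 2d: extract register-number fields
    base = (word >> 16) & 0xF
    base_value = REGS[base]                          # 2e: speculative register fetch
    return handler(word, r1, base_value)             # 2f: branch to the specific routine


MEMORY[0:4] = ((0x58 << 24) | (1 << 20) | (2 << 16) | 0x0010).to_bytes(4, "big")
MEMORY[32:36] = (0xCAFE).to_bytes(4, "big")
REGS[2] = 16
result = common_processing(0)
```

The register value is read in 2e before the routine knows whether it will need it, which is what makes the fetch speculative.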

[0021] Each process 3) consists of two streams: the stream of process 4), and the stream of processes 5) to 7). Process 4), with 4e to 4l, handles the instruction indicated by the program counter; process 5), with 5a to 5m, handles the instruction indicated not by the current program counter but by the next one. For 5), it is irrelevant when the register or memory variable that actually holds the program counter is updated; what matters is the address that the program counter, as a concept in the program, indicates next. When the instruction at the current program counter does not branch, this is the current instruction address plus the word length of the current instruction. When it does branch, it is either the specified or computed address, or the current address plus the word length. Moreover, for a branching instruction, it may be the instruction at the next program counter that exactly matches the actual behavior of B's program, or it may be an address chosen speculatively without waiting for the condition produced by the preceding instruction to be determined. Conceptually, processes 4) and 5) are two streams that flow in parallel from top to bottom, as the dashed arrows show. In A's program, however, they are laid out as the solid arrows show and, at the level of A's program semantics, execute in the order of the solid arrows. On A's hardware, the superscalar mechanism dispatches streams 4) and 5) to the execution units in parallel, and at the clock level they execute as the dashed arrows show. As 5m and 5a in FIG. 1 indicate, 4) and 5) need not necessarily alternate.

[0022] Process 5) contains all or most of process 2). In the figure, 5a corresponds to 2a, 5b to 2b, and so on through 5f, which corresponds to 2f. By including process 2) in process 5), that is, by performing the 2) portion of the next instruction in parallel while the current instruction executes in 4), the processor clocks formerly spent on process 2) can be regarded as almost entirely eliminated. The fastest simulation is achieved by loop 8a.
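
The arrangement in which each instruction-specific routine also decodes the next instruction can be sketched as below. The two-instruction toy machine (a 16-bit word with an 8-bit code and an 8-bit operand) and every name are hypothetical. Python itself gains nothing from the overlap; the sketch only shows how the code is arranged so that, in machine code on a superscalar host, the two independent streams could be issued together.

```python
def decode(word):
    # process 2): extract the instruction code and look up its routine
    return HANDLERS[word >> 8], word & 0xFF


def op_add(state, program, operand):
    # process 5): decode the next instruction; this does not depend on the add below
    next_pc = state["pc"] + 1
    next_decoded = decode(program[next_pc])
    # process 4): instruction-specific work for the current instruction
    state["acc"] += operand
    state["pc"] = next_pc
    return next_decoded


def op_halt(state, program, operand):
    return None, 0


HANDLERS = {0x01: op_add, 0x00: op_halt}


def run(program):
    state = {"acc": 0, "pc": 0}
    handler, operand = decode(program[0])  # process 2) runs separately only once
    while handler is not None:
        # corresponds to loop 8a: go straight from one routine to the next
        handler, operand = handler(state, program, operand)
    return state["acc"]


total = run([0x0105, 0x0107, 0x0000])  # add 5, add 7, halt
```

After the first instruction, the main loop never performs a separate decode step; every routine hands the already-decoded successor back to the loop.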

[0023] In conventional computers of the so-called von Neumann type, instructions are permitted to rewrite instruction code. To maintain program compatibility, program rewriting, which occurs with very low probability, must be detected, and malfunction must be prevented by using a re-fetched instruction word instead of the prefetched value. Path 8n in the figure is an important path in the present invention; it is provided for instructions that store to memory, where the instruction fetch of process 5) takes place before the store of process 4). In the ST instruction processing in the figure, control passes from 8n to 6), where the instruction word is re-fetched and compared. Normally the words match, so the work done in 5) is used and only the branch 5f is performed. If the instruction word has been rewritten, however, the words do not match, and decoding is redone from 2b with the correct instruction word substituted in process 6). For instructions that are executed infrequently, splitting the processing into 4) and 5) would yield little performance benefit, so process 5) is deliberately omitted and control returns to 2a via loop 8c. In some cases, 6) is not performed within 3) but is implemented as common processing.
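
The re-fetch and compare performed after a store can be sketched as follows. Memory is modeled as a plain list of instruction words, and the function name and return convention are assumptions for illustration.

```python
def store_then_validate(memory, store_addr, value, next_pc, prefetched_word):
    """Perform the store of process 4), then re-fetch the next instruction word
    and compare it with the speculatively prefetched one (process 6)).
    Returns (word_to_use, predecode_still_valid)."""
    memory[store_addr] = value
    refetched = memory[next_pc]
    if refetched == prefetched_word:
        return prefetched_word, True  # usual case: keep the work done in process 5)
    return refetched, False           # rewritten: use the new word and redo decoding


memory = [0x0105, 0x0107, 0x0000]
prefetched = memory[1]

# A store that does not touch the next instruction leaves the prefetch valid.
unchanged = store_then_validate(memory, 2, 0x0000, 1, prefetched)

# A store that overwrites the next instruction is detected by the comparison.
rewritten = store_then_validate(memory, 1, 0x0000, 1, prefetched)
```

Only store routines need this check, which is consistent with the text providing path 8n specifically for instructions that store to memory.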

[0024] Because FIG. 1 shows the end result, the process of transforming an ordinary operation flow into it is hard to see. The reduction of processing time in simulation according to this embodiment is therefore explained using a separate figure for each step. FIG. 2 illustrates the preparation that makes the parallel processing of FIG. 1 possible. FIG. 2(a) shows the programming method for ordinary simulation, and FIG. 2(b) shows the programming method for simulation after the preparation of this embodiment has been carried out. That is, in FIG. 2(b) there is a step in which process 2), from reading the instruction word for the next instruction through the branch, is copied (as many times as there are current-instruction processes 4), if branching and the like make them multiple) and placed as process 5) before the branch step of the current-instruction processing, so that it completes by the end of current-instruction process 4). Similarly, FIG. 3 illustrates how the steps within processes 4) and 5) are allocated so that the two run in parallel on the superscalar mechanism of the same pipeline. FIG. 3(a) shows, for example, current process 4) and next process 5) each allocated different processing variables and laid out as separate streams, from top to bottom in time order of their operation steps, taking the conditions for processing into account. FIG. 3(b) further expresses the instruction operation steps at the level of execution cycles in order to identify the critical path, the one requiring the longer processing time, and shows that current process 4) is the critical path; this is the step of obtaining the critical path. FIG. 3(c) shows next process 5) allocated to the idle cycles within the execution cycles of current process 4), the critical path, wherever resources are free; this is the step of allocating to idle time the unit processes that make up the other instruction processing.

【００２５】[0025] The operation is described with reference to FIGS. 1 to 3. In the ordinary simulation program shown in FIG. 2(a), process 2), which is common to every instruction code read and runs from extracting the instruction code to fetching the branch table, is completed first; 3) a separate handler is prepared for each branch target, one per instruction; and 4) the instruction is decoded and executed, through to updating the program counter for the next operation. In this embodiment, in addition to this processing, process 2), from reading the next instruction word to the branch, is copied and moved into process 4) as process 5), as shown by the bold frame in FIG. 2(b). A specific instruction is described next with reference to FIGS. 3 and 4. FIG. 4 is an operation flow for one load instruction, written so as to correspond to the format of FIG. 1. In the figure, the next-instruction processing marked 5 (5a, 5b#1, and so on) in the leftmost symbol column 411 was conventionally performed only after the current instruction's own processing, marked 4 (4d#1, 4d#2, and so on) in the same column, had finished. In other words, as process 2) of FIG. 2(a), the next instruction was extracted only after the program counter had been updated in process 4). The next-instruction processing 5 shown in bold frames in FIG. 4 is thus the result of the copy in FIG. 2(b): process 4) for the current instruction and process 5) for the next instruction are prepared together as one large process, once for each branch target.
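The restructuring from FIG. 2(a) to FIG. 2(b) can be sketched in Python as follows. The three-instruction guest ISA, its word layout (opcode in the upper byte, operand in the lower byte), and all function names are assumptions made for illustration and do not come from the patent. Python does not overlap the two pieces of work at the cycle level, so the sketch shows only where the copy of process 2) is placed; it also assumes every program ends with HALT.

```python
# Hypothetical toy guest ISA: word = opcode * 256 + operand.
LOAD_IMM, ADD_IMM, HALT = 0, 1, 2


def encode(opcode, operand=0):
    return opcode * 256 + operand


def run_conventional(program):
    """FIG. 2(a) style: common fetch/decode 2), branch 3), handler 4)."""
    acc, pc = 0, 0
    while True:
        opcode, operand = divmod(program[pc], 256)  # 2) fetch and extract
        if opcode == HALT:                          # 3) branch per instruction
            return acc
        if opcode == LOAD_IMM:                      # 4) execute
            acc = operand
        elif opcode == ADD_IMM:
            acc += operand
        pc += 1                                     # 4) update program counter


def run_merged(program):
    """FIG. 2(b) style: every handler carries its own copy of 2) as process 5)."""
    def decode(pc):
        return divmod(program[pc], 256)

    def load_imm(acc, pc, operand):
        nxt = decode(pc + 1)           # 5) next-instruction fetch and decode
        return operand, pc + 1, nxt    # 4) current-instruction work

    def add_imm(acc, pc, operand):
        nxt = decode(pc + 1)           # 5) copied into this handler as well
        return acc + operand, pc + 1, nxt

    handlers = {LOAD_IMM: load_imm, ADD_IMM: add_imm}
    acc, pc = 0, 0
    opcode, operand = decode(pc)       # 2) runs only for the first instruction
    while opcode != HALT:
        acc, pc, (opcode, operand) = handlers[opcode](acc, pc, operand)
    return acc


program = [encode(LOAD_IMM, 5), encode(ADD_IMM, 7), encode(HALT)]
print(run_conventional(program), run_merged(program))
```

Because each handler contains both streams, a compiler or a hand scheduler is free to interleave the statements of 4) and 5) inside it, which is the property the later figures exploit.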

【００２６】[0026] What is to be explained in detail here is how the large process prepared in this way becomes a flow in which process 4) and process 5) operate in parallel. Even within the same large process, the current instruction and the next instruction are assigned different processing variables, so process 4) and process 5) can be kept separate even though they use the same resources. First, the execution order of each is simply written from top to bottom, as in FIG. 3(a). What is executed is shown by the assembler code 414 in FIG. 4, and what it does is explained in the description column 415. Because writing all of this out would be unwieldy, each item is abbreviated by an identification symbol 412. For this load instruction, each item is abbreviated by such a symbol, and by following the execution order through the drawings, the symbols 411 (2 and so on, denoting processing stages 2) to 7)) show how each item corresponds to the unit operations shown in FIGS. 1 to 5. FIG. 3(a) captures only the order within process 4); FIG. 3(b) also takes execution cycles into account. For example, the code abbreviated H takes three cycles to execute. Likewise, N takes three cycles and A takes three cycles. Execution times in cycles are obtained in this way. As a result, process 4) takes longer than process 5): process 4) takes 15 cycles and process 5) takes 10 cycles, so process 4) is the critical path.

【００２７】[0027] Next, whereas conventionally nothing is done in the idle cycles 311, the next-instruction processing of process 5) is executed in those idle cycles, as shown in FIG. 3(c). This is of course impossible where resources conflict, but the pipeline's execution units process data in parallel, and apparent parallel operation is possible as long as different registers are used. Allocating instructions of process 5) that do not use the same resources to the idle cycles within the critical-path process 4) yields the parallel operation flow of FIG. 3(c). In the example of FIG. 3, the coding order of the instructions becomes H, A, J, M, N, B, C, D, F, O, G, P, Q, R, I, S, K, T, U, V, L, Z, X, W, E. The total in FIG. 3(c) is 15 execution cycles, far fewer than the 25 cycles of the two processes of the original FIG. 3(b) combined. Furthermore, to fill the remaining idle cycles in FIG. 3(c), Q, U, and V are moved as in FIG. 3(d), saving two more cycles.
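The allocation of process 5) steps to idle cycles of the critical-path process 4) can be illustrated with a small greedy scheduler. This is a simplified sketch, not the procedure of FIG. 3: each stream is modeled as a strictly sequential chain, the only resource modeled is the issue slot, and the chains are shortened. The three-cycle latencies of H, N, and A come from the text; every other latency and the function names are assumptions.

```python
def chain_issue_cycles(chain):
    """Issue each step when its predecessor in the same chain has finished."""
    cycles, t = {}, 0
    for name, latency in chain:
        cycles[name] = t
        t += latency
    return cycles, t  # issue cycle per step, total cycles of the chain


def fill_idle_cycles(critical, other, width=1):
    """Place `other` into issue slots left free by `critical`.

    Both arguments are lists of (name, latency) that must run in order.
    `width` is the number of instructions that may issue per cycle.
    Returns (issue cycle of every step, total cycles).
    """
    placed, total = chain_issue_cycles(critical)
    used = {}
    for cycle in placed.values():
        used[cycle] = used.get(cycle, 0) + 1
    ready = 0
    for name, latency in other:
        cycle = ready
        while used.get(cycle, 0) >= width:  # slot taken: try the next cycle
            cycle += 1
        used[cycle] = used.get(cycle, 0) + 1
        placed[name] = cycle
        ready = cycle + latency
    return placed, max(total, ready)


process4 = [("H", 3), ("M", 1), ("N", 3), ("O", 1)]  # critical path (shortened)
process5 = [("A", 3), ("B", 1), ("C", 1)]            # next-instruction work
schedule, merged_cycles = fill_idle_cycles(process4, process5)
serial_cycles = chain_issue_cycles(process4)[1] + chain_issue_cycles(process5)[1]
print(schedule, merged_cycles, serial_cycles)
```

In this toy case the merged schedule is no longer than process 4) alone, which mirrors the 25-to-15-cycle reduction described for FIG. 3(c) in kind, though not in its actual numbers.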

【００２８】[0028] FIG. 4, which illustrates the reduction in simulation time for a specific load instruction, also includes process 2), the reading and extraction of the current load instruction. The portion from the middle downward, the parallel program portion of process 4) and process 5), is arranged in the coding order of FIG. 3(d). FIG. 5 shows the corresponding execution cycles: in cycles 12 through 28, instruction fetch 452, instruction decode 453, operation #1 454, operation #2 455, memory access 456, instruction cache 457, and data cache 458 are executed without resource overlap.

【００２９】[0029] The lowercase letters a, b, f, c, g, d, i, n, and e in FIG. 4 are process 2) of FIG. 1; when a sequence of L instructions is written and executed consecutively, this path is not executed. When L instructions of computer B are processed consecutively, the operations from row H to row E of FIG. 4 are repeated in cycles 15 through 29 of FIG. 5. To simplify the explanation, it is assumed that the instruction cache and data cache always hit and that the store buffer, which temporarily holds data bound for memory or cache and drains it when they are idle, never overflows.

【００３０】[0030] Processes A, B, C, D, and E shown in FIG. 5 must be executed sequentially with respect to one another, and form the flow that becomes critical when branching to the processing of computer B's next instruction. Specifically, A runs in cycles 16 to 18, B in cycle 19, C in cycle 20, D in cycles 21 to 23, nothing on this path in cycles 24 to 25, and E in cycle 27, so only three of the 11 cycles are wasted for this path. The path takes nine cycles even though it has only five instructions. Processes F and G belong to the next instruction: they compute the index-register address ahead of time, as an advance investment on the assumption that the next instruction will also use it. Because this work must execute sequentially and needs two or more cycles, it is assigned to the three cycles 20 to 22. Processes K and L are likewise an advance investment, for the base register, but incur no penalty because they use idle cycles before the load T executes in cycle 24. A to G together with I, K, and L correspond to process 5), and the remaining H, J, and M to W correspond to process 4). X updates the program counter and may be regarded as belonging to either 4) or 5). Within process 4) there is the work of "fetching the values of the index register and base register, adding them to form the address, fetching from memory, and storing the result at the address corresponding to the register". Of this, M, N, O, P, S, T, and W must be executed sequentially; taken alone, this path executes seven instructions over the 11 cycles from cycle 17 to cycle 27, leaving gaps in the 15-cycle pipeline. Combining it with process 5) fills those idle cycles and puts the pipeline and the superscalar mechanism to use.

【００３１】[0031] In FIG. 5, the 15 cycles shown in the bold frame are the effective number of cycles needed to process one instruction of computer B. Against the 30 instruction slots obtained by multiplying this by the superscalar mechanism's maximum of two instructions per cycle, 25 instructions are executed, a high utilization of 83%. The cycle count is also only 55% of the conventional example of FIG. 20 (25 instructions in 27 cycles), which is roughly twice the performance. The ST instruction of FIGS. 6 and 7 takes 16 cycles for 26 instructions, giving a superscalar utilization of 81%, and its execution time, although not illustrated, is 57% of the conventional 28 cycles. Both results come from interleaving the processing of the current instruction with that of the next instruction, which fills pipeline gaps, enables simultaneous superscalar issue, and raises the utilization of the execution units.
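The percentages quoted for the load and store instructions follow from the instruction and cycle counts given in the text. The short Python check below reproduces them; the helper name `percent` is an assumption, and percentages are truncated to whole numbers, which is what matches the quoted 55%.

```python
def percent(numerator, denominator):
    """Integer percentage, truncated toward zero."""
    return 100 * numerator // denominator


ISSUE_WIDTH = 2  # maximum instructions per cycle stated in the text

# Load instruction: 25 instructions in 15 cycles, conventional 27 cycles.
load_utilization = percent(25, 15 * ISSUE_WIDTH)
load_time_ratio = percent(15, 27)

# Store instruction: 26 instructions in 16 cycles, conventional 28 cycles.
store_utilization = percent(26, 16 * ISSUE_WIDTH)
store_time_ratio = percent(16, 28)

print(load_utilization, load_time_ratio, store_utilization, store_time_ratio)
```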

【００３２】[0032] In this embodiment, process 4) or process 5) may call a common function. There need not be one process 3) per instruction code: several instruction codes may map to a single 3) and branch further inside it. Process 3) need not correspond to the instruction code alone; it may correspond to the instruction code combined with other fields, or to only some bits of the instruction code. In the embodiment described above the branch table resides in memory, but it may instead be held in registers. The branch target address may also be determined directly from the instruction code without using a table. FIG. 1 shows an example in which process 5) copies all of process 2), but process 5) may copy only part of it and then jump into the middle of process 2). For an instruction simulation whose specification does not allow programs to be rewritten by executing instructions, 7 may be provided in place of 6, and the paths 8p to 8q may be omitted.

【００３３】[0033] FIGS. 4 and 5 cover the load instruction; the store instruction is illustrated as well. FIGS. 6 and 7 show the final coding and the execution cycles for a store instruction and correspond to FIGS. 4 and 5, respectively. FIGS. 8 to 13 show the process of shortening the operation flow, which applies to both load and store instructions; in their details they correspond to the load instruction. FIG. 8 shows the initial coding sequence when 5d and 5e of FIGS. 1 and 3 are not performed. For the load instruction, the flow on the left expresses the processing in more general terms, and the right column (632) gives the corresponding assembly language. The middle column (631) corresponds to the symbols of FIG. 19. Here S601 to S618 correspond to the current-instruction processing (4), S619 to S623 and S625 to (5), and S624 to (7). In FIG. 8, independent registers are assigned as intermediate variables so that the ordering dependencies between the individual processes are minimized. FIG. 9 is a relationship table that extracts, for each unit process identified by an abbreviated symbol 631 in the sequence of FIG. 8 (for example U), its inputs 641, its output 642, and its parent instruction symbols 643 as constraints. From this flow, the parallelism and dependencies of the processing are found and the order of processing is rearranged. First, as shown in FIG. 9, the register updated by each instruction (642) and the registers that are its operation inputs (641) are written down. The value of (631) is copied into the matching column among the update-instruction-symbol columns (644), that is, the column for the register named in (642). In the parent-instruction-symbol column (643), the entry for each register listed in column 641 is the symbol found furthest down when that register's column of 644 is scanned from the top. Taking the row of instruction symbol J as an example, 643 is filled in for J as follows. The sixth column of 644, which corresponds to rt_Rxd on the "right side", yields "H". Likewise, for rt_Rxm, the fifth column of 644 yields "I". Repeating this procedure gives 643.
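The table-filling procedure described for FIG. 9 amounts to recording, for each input register of a unit process, the most recent earlier unit process that wrote it. A minimal Python sketch follows. Only the facts that H writes rt_Rxd, I writes rt_Rxm, and J reads both come from the text; the other register names (`src_h`, `src_i`, `rt_addr`) and the function name are placeholders assumed for illustration.

```python
def parent_symbols(unit_processes):
    """Return {symbol: sorted list of parent symbols}.

    `unit_processes` is a list of (symbol, input_registers, output_register)
    in initial coding order, playing the role of columns 631, 641, and 642.
    """
    last_writer = {}  # register -> symbol that most recently updated it (644)
    parents = {}      # symbol -> parent instruction symbols (643)
    for symbol, inputs, output in unit_processes:
        parents[symbol] = sorted(
            {last_writer[reg] for reg in inputs if reg in last_writer}
        )
        last_writer[output] = symbol
    return parents


table = [
    ("H", ("src_h",), "rt_Rxd"),            # input name is a placeholder
    ("I", ("src_i",), "rt_Rxm"),            # input name is a placeholder
    ("J", ("rt_Rxd", "rt_Rxm"), "rt_addr"), # output name is a placeholder
]
print(parent_symbols(table))
```

A unit process with an empty parent list has no ordering constraint from earlier steps and is therefore a candidate for being moved into an idle cycle.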

【００３４】[0034] FIG. 10 expands the connection relationships between the unit processes of FIG. 9. For example, P must be executed after R and O; operations must be performed in the order of the arrows. In the dependency graph of FIG. 10, the tail of each arrow is the parent and the head is the child, so FIG. 9 amounts to the algorithm for constructing FIG. 10. The dashed arrows in FIG. 10 indicate constraints under which execution of a process must be delayed: Z must be executed only after all of Q, F, K, U, and A have been executed. This rule is not shown in FIG. 9, but it expresses the requirement that the row whose symbol (631) is "Z", whose "left side" (642) value is "ro_Instrword", must come after the symbol (631) of every row whose inputs (641) include that register. FIG. 11 shows the critical path, the longest in time, in FIG. 10; the processing that starts from F and K takes the longest. That is, FIG. 11 is FIG. 10 with the pipeline behavior added and cycle counts shown. The cycle counts of I, H, M, N, A, D, E, and T are extended to three. This shows that the path from F and K to W is the long one. FIG. 11 does not consider serialization due to resource overlap. Solid arrows mark the critical path that governs the time, and dashed arrows mark the paths that do not.
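Finding the critical path in a dependency graph annotated with cycle counts is a longest-path computation over a directed acyclic graph. The Python sketch below shows one way to do it. The graph is a fragment assumed for illustration: the chain M, N, O, P, S, T, W and the edge from R to P are taken from the text, as are the three-cycle latencies of M, N, and T, while the one-cycle latencies of the other nodes are assumptions. It does not reproduce FIG. 11 and, like FIG. 11, ignores resource conflicts.

```python
from functools import lru_cache


def critical_path(latency, parents):
    """Return (total_cycles, path) for the longest-latency chain in a DAG.

    `latency` maps node -> cycles; `parents` maps node -> iterable of nodes
    that must finish before it starts.
    """
    @lru_cache(maxsize=None)
    def finish(node):
        # Finish time = own latency + finish time of the slowest parent.
        best = max(
            (finish(p) for p in parents.get(node, ())),
            key=lambda result: result[0],
            default=(0, ()),
        )
        return best[0] + latency[node], best[1] + (node,)

    return max((finish(n) for n in latency), key=lambda result: result[0])


latency = {"M": 3, "N": 3, "O": 1, "R": 1, "P": 1, "S": 1, "T": 3, "W": 1}
parents = {
    "N": ("M",), "O": ("N",), "P": ("O", "R"),
    "S": ("P",), "T": ("S",), "W": ("T",),
}
total_cycles, path = critical_path(latency, parents)
print(total_cycles, path)
```

Nodes that are not on the returned path have slack, and it is this slack that the later allocation steps use.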

【００３５】[0035] FIG. 12 simply counts the number of cycles each resource used in FIG. 11 requires; it totals the number of uses of each resource. Because there is only one memory access unit, whereas dispatch and arithmetic can each handle two operations at once, the theoretical limits are 13, 9, and 8 cycles respectively, and 13 cycles is seen to be the ideal optimum. FIG. 13 allocates the operation timing of the memory access unit, whose resource constraint is the tightest in FIGS. 11 and 12. The numbers 651 in the bottom row of the figure are the number of memory-access-unit activations, and the unit processes in bold frames (H, N, and so on) have had their timing changed. FIG. 13 thus shows execution timing being allocated with attention to the scarcest resource, the memory access unit. I, H, N, and M, which all start at the same time in FIG. 11, are staggered by one cycle each. Q is delayed by three cycles, but Q and P end up one cycle apart, and the delay at W is only two cycles. The memory-access-unit timing of A to D is then allocated. Next, the arithmetic units are allocated starting from the more critical paths, and Q, U, V, X, and Z are assigned to the cycles left free. E is assigned so that it starts executing at the same time as W. The instructions in the source code are then reordered according to these execution timings. The point of FIGS. 8 to 13 is that the pipeline has idle cycles, and that changing F, G, K, and L from process 4d to 5d and I from 4e to 5e and then applying the procedure of FIGS. 8 to 13 yields FIG. 3(d), FIG. 4, and FIG. 5.
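The theoretical limits of 13, 9, and 8 cycles are resource lower bounds: the number of uses of each resource divided by the number of units available, rounded up. The sketch below shows the calculation in Python. The unit counts (one memory access unit, two-wide dispatch and arithmetic) come from the text; the use counts for dispatch and arithmetic are placeholders chosen only so that the bounds match the quoted figures, since the text does not state them.

```python
import math


def resource_lower_bounds(uses, units):
    """Minimum cycles each resource needs if it were perfectly packed."""
    return {name: math.ceil(uses[name] / units[name]) for name in uses}


units = {"memory_access": 1, "dispatch": 2, "arithmetic": 2}
uses = {
    "memory_access": 13,  # one unit, so 13 uses give the 13-cycle bound
    "dispatch": 18,       # placeholder count, not stated in the text
    "arithmetic": 16,     # placeholder count, not stated in the text
}
bounds = resource_lower_bounds(uses, units)
bottleneck = max(bounds, key=bounds.get)
print(bounds, bottleneck)
```

Scheduling the bottleneck resource first, as FIG. 13 does for the memory access unit, is a common heuristic because no schedule can beat that resource's bound.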

【００３６】[0036] A to G, I, K, and L, shown in bold frames in FIG. 4 and in bold italics in FIG. 5, are the processing for the instruction that follows the one at the program counter; that is, they are the part performing the parallel processing described in this embodiment. The hatched area indicates the effective execution cycles of the load instruction. B to G, I, K, and L, shown in bold frames in FIG. 6 and in bold italics in FIG. 7, are likewise the processing for the instruction that follows the one at the program counter. In FIGS. 4 and 6, the speculative register fetch corresponding to 5e of FIG. 1 is limited to mask data, so no register conflict arises in the instruction that follows computer B's L instruction. The comparison step is therefore unnecessary, and so is logic for avoiding register conflicts.

【００３７】[0037] This embodiment does not use linked branches (subroutine calls) such as function calls, but linked branches may be used at the branch points. In this embodiment, for store-type instructions, the instruction word re-fetched after the store to memory is compared with the instruction word fetched before the store. However, any of the following checks a) to f) may be used instead: a) comparing parts of the instruction words with each other; b) comparing the address of the instruction word with the address of the store data; c) comparing some of the bits that make up the addresses of the instruction word and the store data; d) checking whether the address of the instruction word lies within the range from the store-data address to that address plus the store-data length; e) checking whether the address of the instruction word lies within the range from the store-data address to that address plus a fixed value; f) checking whether the address of the instruction word lies within the range from the store-data address with some of its bits cleared to zero to that value plus a fixed value.
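The address-based alternatives d), e), and f), together with the word comparison used in the embodiment, can be written as simple predicates. The Python sketch below is one reading of them under stated assumptions: ranges are treated as half-open, only the start address of the instruction word is tested (as the text is worded), and the function and parameter names are invented for illustration.

```python
def word_changed(prefetched_word, refetched_word):
    """Embodiment's check: word fetched before the store vs. re-fetched after."""
    return prefetched_word != refetched_word


def in_store_range(instr_addr, store_addr, store_len):
    """Option d): instruction address within store address + store length."""
    return instr_addr in range(store_addr, store_addr + store_len)


def in_fixed_window(instr_addr, store_addr, window):
    """Option e): instruction address within store address + fixed value."""
    return instr_addr in range(store_addr, store_addr + window)


def in_aligned_window(instr_addr, store_addr, low_mask, window):
    """Option f): clear some address bits (here the low bits in `low_mask`),
    then test against that base plus a fixed value."""
    base = store_addr & ~low_mask
    return instr_addr in range(base, base + window)


print(in_store_range(0x102, 0x100, 4), in_aligned_window(0x10C, 0x107, 0x7, 16))
```

Options e) and f) trade precision for a cheaper test: they may report a possible overwrite where none occurred, which costs only an unnecessary re-fetch.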

【００３８】[0038] This embodiment has described a superscalar degree of parallelism of two, but it may be three or more. This embodiment also does not describe a specific example of speculatively executing register and memory writes for the next instruction in advance. However, whether register conflicts or memory-data conflicts exist can be partly determined while decoding the next instruction, so the same effect can be obtained by executing such writes ahead of time, or by adding detection and handling of mis-speculation. Furthermore, this embodiment shows an example in which only the current instruction and the next instruction are executed in parallel. As the superscalar degree of parallelism rises, the benefit becomes harder to obtain unless three or more instructions are processed in parallel, and three or more instructions may be processed in parallel in that way.

【００３９】[0039] When the machine instruction sequences for processing the instruction at the program counter, for decoding the instruction the program counter will point to next, and for reading the register values that next instruction will use are written interleaved so that they can execute in parallel, operations that individually have little parallelism gain parallelism. The pipelining and superscalar parallelism of the processor running the simulation are then put to use, the number of clock cycles required is greatly reduced, and high-speed instruction simulation is achieved. In addition, compared with the difficulty of optimizing by reordering within one large process full of dependencies, this invention achieves considerable optimization by reordering the execution of two independent processes, so performance is easy to improve when a new instruction simulator is designed.

【００４０】[0040] Alternatively, without interleaving the processing of the instruction at the program counter with that of the next instruction, an existing instruction simulation can be sped up with few changes by either of two methods: filling the pipeline gaps that arise in branch processing through a table in memory with decoding or speculatively executable work that depends only on information available before the branch, thereby hiding that time; or performing per-field processing that can only be executed sequentially in parallel with the long-latency processing of another field.

【００４１】[0041] In general, when a program running on computer hardware performs meaningful work, compiler optimizations that are aware of pipelining and superscalar execution yield some performance improvement, but optimizing at the level of the algorithm, or down to analysis of the variables that make up the algorithm, is very difficult and gives a poor return on the effort invested. In the instruction simulation of this invention, by contrast, the simulator that emulates another computer is itself what runs most of the time, so performance can be improved without analyzing or optimizing the algorithms of any application other than the simulator.

【００４２】[0042]

[Effects of the Invention] As described above, according to the present invention, the current instruction and the following instruction or instructions are combined, one of each, into one large process; each is assigned its own processing variables; and unit processes of the other instruction are allocated to the idle execution cycles of the instruction whose processing forms the critical path. Operations are thus performed in parallel, which has the effect of shortening the processing time.

[Brief description of the drawings]

FIG. 1 is a diagram showing the concept and operation flow of emulation in Embodiment 1 of the present invention.

FIG. 2 is a diagram explaining the instruction copying and the change of position in the operation flow that form part of the concept of the present invention.

FIG. 3 is a diagram explaining the program reordering for time-saving parallel operation that forms part of the concept of the present invention.

FIG. 4 is a coding diagram for parallel operation of a load instruction in Embodiment 1.

FIG. 5 is an execution cycle diagram for parallel operation of a load instruction in Embodiment 1.

FIG. 6 is a coding diagram for parallel operation of a store instruction in Embodiment 1.

FIG. 7 is an execution cycle diagram for parallel operation of a store instruction in Embodiment 1.

FIG. 8 is a coding diagram detailing, unit process by unit process, another load instruction in Embodiment 1.

FIG. 9 is a table that breaks the processing of FIG. 8 into unit processes and lists the execution-order constraints between instructions.

FIG. 10 is a diagram showing the relationships of FIG. 9 as connections.

FIG. 11 is a diagram showing the critical path of FIG. 10.

FIG. 12 is a diagram showing the required resource counts on the critical path of FIG. 11.

FIG. 13 is a diagram explaining the allocation of other unit processes to idle time in the execution cycles.

FIG. 14 is an operation flow diagram of conventional serial instruction simulation.

FIG. 15 is a diagram showing an example of a conventional program that simulates one instruction.

FIG. 16 is a diagram showing the relationship between the decode bit width of a cache and the amount of hardware required.

FIG. 17 is a diagram showing an example of an instruction format.

FIG. 18 is a diagram showing the processing performed for a conventional load instruction.

FIG. 19 is a diagram showing an example of conventional load instruction execution on a superscalar computer.

FIG. 20 is an execution cycle diagram for the pipeline stages of FIG. 19.

FIG. 21 is a diagram showing the codes, notation, and meanings of a conventional instruction set.

[Explanation of symbols]

1: initialization processing (common processing)
2: instruction word fetch and decoding
2a: fetch of the instruction word
2b: extraction of the instruction code field
2c: fetch of the branch table entry corresponding to the instruction code
2d: extraction of the register field of the instruction
2e: partially speculative fetch of the registers used in instruction processing
2f: branch to the routine specific to each instruction
3: processing for one instruction to be simulated
4: processing for the instruction indicated by the program counter
4e: read of a register or memory of the processor (A) that performs the instruction simulation
4g: field extraction from the instruction word and decoding
4h: address calculation
4i: data read from the memory of the processor (A) that performs the instruction simulation
4j: arithmetic processing (register data access)
4l: write to a register or memory of the processor (A) that performs the instruction simulation
5: processing for the instruction indicated by the next program counter
5a: fetch of the next instruction word
5b: extraction of the next instruction code
5c: fetch of the branch table entry for the next instruction code
5d: extraction of the field indicating the register number of the next instruction
5e: partially speculative fetch of the register contents used by the next instruction
5f: branch to the routine specific to each instruction
6: processing that sets the instruction word re-fetched after a store to main memory into the next instruction word variable and compares it with the instruction word speculatively fetched in advance
7: copy of the instruction word variables
8: flow transition to next instruction processing after the processing for each instruction code
8a: path that branches directly from 5f to the processing 3 corresponding to the next instruction
8c: path that, for some instructions, branches to the common part 2 instead of providing the processing 5 for the next instruction
8n: path that checks, inside the processing of 5, whether the instruction word fetched before the data store has been changed
8o: path that checks, outside the processing of 5, whether the instruction word fetched before the data store has been changed
8p: path taken when the instruction word fetched before the data store and the instruction word fetched after the store are compared outside the processing of 5 and match
8q: path taken when the instruction words fetched before and after the data store are compared and do not match
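The flow named by reference signs 2a to 2f, 5a, 6, 7, and 8q can be illustrated with a small interpreter loop. The following is a minimal sketch only: the 16-bit toy instruction format, the opcodes, and the names `BRANCH_TABLE` and `run` are assumptions made for illustration and do not appear in the patent. It fetches the next instruction word speculatively before executing the current one and, after a store to main memory, re-fetches and compares that word so that a stale speculative fetch is discarded.

```python
# Hypothetical toy guest ISA (an assumption, not taken from the patent):
# 16-bit words, bits 15-12 opcode, bits 11-8 register, bits 7-0 immediate/address.
OP_HALT, OP_LOADI, OP_ADDI, OP_STORE = 0, 1, 2, 3

def op_loadi(regs, memory, reg, imm):
    regs[reg] = imm
    return False                      # no store to main memory

def op_addi(regs, memory, reg, imm):
    regs[reg] = (regs[reg] + imm) & 0xFFFF
    return False

def op_store(regs, memory, reg, imm):
    memory[imm] = regs[reg]           # cf. 4l: write to memory
    return True                       # the store may have changed the next word

# cf. 2c: branch table indexed by instruction code
BRANCH_TABLE = {OP_LOADI: op_loadi, OP_ADDI: op_addi, OP_STORE: op_store}

def run(memory, max_steps=1000):
    """Interpret `memory` from address 0; return (registers, re-fetch count)."""
    regs = [0] * 16
    pc, refetches = 0, 0
    word = memory[pc]                                # cf. 2a: fetch instruction word
    for _ in range(max_steps):
        opcode = (word >> 12) & 0xF                  # cf. 2b: cut out opcode field
        if opcode == OP_HALT:
            break
        handler = BRANCH_TABLE[opcode]               # cf. 2c: branch table fetch
        reg, imm = (word >> 8) & 0xF, word & 0xFF    # cf. 2d: cut out register field
        next_word = memory[pc + 1]                   # cf. 5a: speculative next fetch
        stored = handler(regs, memory, reg, imm)     # cf. 2f: instruction routine
        if stored and memory[pc + 1] != next_word:   # cf. 6 and 8q: mismatch
            next_word = memory[pc + 1]               # discard the stale word
            refetches += 1
        pc, word = pc + 1, next_word                 # cf. 7: copy word variable
    return regs, refetches
```

In a program whose store overwrites the word at the following address, the comparison detects the change and the loop continues with the re-fetched word; without that check the interpreter would execute the stale, speculatively fetched word.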

Claims

1. An instruction emulation method in which a program created for another machine instruction set is simulated using a simulation program that combines machine instructions of a processor having a pipeline mechanism and a multiple-instruction simultaneous execution mechanism (superscalar mechanism), the method comprising: a step of copying the operation processing of the next instruction after the operation processing of the current instruction being simulated; a step of allocating processing variables to the current instruction operation processing, allocating to the next instruction operation processing variables different from those of the current instruction operation processing, comparing the time each requires to execute its operations while occupying resources, and obtaining the critical path (the long instruction operation processing) that requires the longer time; and a step of, when the other, short instruction operation processing uses resources whose use does not overlap during idle time in the execution cycles included in the steps of the long instruction operation processing, allocating steps of the short instruction operation processing to that idle time and performing the operations in parallel.

2. The instruction emulation method according to claim 1, further comprising: a step, following the current instruction operation processing, of reading the next instruction word and comparing it with the instruction of the next instruction operation processing that is already being performed in parallel; and a step of, when the two do not match, discarding the next instruction operation being performed in parallel and performing the next instruction operation processing again.

3. The instruction emulation method according to claim 1, wherein the critical path is determined by dividing the instruction operation processing into its constituent unit processes and determining the order among the resulting unit processes based on the operation execution conditions, the method further comprising a step of, when a unit operation process other than the unit process group on the critical path uses resources whose use does not overlap during idle time in the execution cycles included in the instruction execution steps of the unit process group that gives the determined critical path, allocating steps of that other unit operation process to the idle time and performing the operations in parallel.

4. The instruction emulation method according to claim 1, wherein when a branch step is present during the instruction operation processing, the branch step is provided after both the current instruction operation processing and the next instruction operation processing.

5. The instruction emulation method according to claim 1, wherein, when the resource is a pipelined operation mechanism, steps are allocated on the assumption that different register stages of the pipeline do not overlap in use.
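The critical-path and idle-time allocation steps of claims 1 and 3 can be illustrated as follows. This is a minimal sketch under stated assumptions, not the method as implemented by the applicant: the unit processes, their latencies, the two resources `mem` and `alu`, and the function names `critical_path` and `schedule` are all hypothetical, every latency is at least one cycle, and each resource is modeled as fully occupied for the latency of a unit process (the pipelined case of claim 5 is not modeled). The sketch finds the longest dependency chain and then, cycle by cycle, places ready unit processes on free resources, giving priority to those on the critical path.

```python
def critical_path(ops):
    """ops: {name: (resource, latency, [dependencies])}.
    Returns (length in cycles, list of names on the longest chain)."""
    finish, prev = {}, {}

    def visit(n):
        # Earliest finish time of n, ignoring resource conflicts (memoized).
        if n not in finish:
            _, lat, deps = ops[n]
            best = max(deps, key=visit, default=None)
            prev[n] = best
            finish[n] = (visit(best) if best is not None else 0) + lat
        return finish[n]

    end = max(ops, key=visit)
    path, node = [], end
    while node is not None:
        path.append(node)
        node = prev[node]
    return finish[end], path[::-1]

def schedule(ops):
    """Place unit processes cycle by cycle, critical-path processes first.
    Returns (start cycle per process, total cycles, critical path length)."""
    length, path = critical_path(ops)
    on_path = set(path)
    start, free_at = {}, {}
    pending = list(ops)
    t = 0
    while pending:
        ready = [n for n in pending
                 if all(d in start and start[d] + ops[d][1] <= t
                        for d in ops[n][2])]
        ready.sort(key=lambda n: n not in on_path)   # critical path first
        for n in ready:
            res, lat, _ = ops[n]
            if free_at.get(res, 0) <= t:             # resource use must not overlap
                start[n] = t
                free_at[res] = t + lat
                pending.remove(n)
        t += 1
    total = max(start[n] + ops[n][1] for n in ops)
    return start, total, length

# Hypothetical unit processes: a current load instruction plus the fetch and
# decode of the next instruction, which use their own variables.
OPS = {
    "fetch":     ("mem", 2, []),
    "decode":    ("alu", 1, ["fetch"]),
    "addr":      ("alu", 1, ["decode"]),
    "load":      ("mem", 2, ["addr"]),
    "writeback": ("alu", 1, ["load"]),
    "nfetch":    ("mem", 2, []),
    "ndecode":   ("alu", 1, ["nfetch"]),
}
```

Calling `schedule(OPS)` places `nfetch` and `ndecode` in cycles where the memory and ALU resources are idle along the current instruction's chain, so in this example the total schedule is no longer than the critical path itself.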