JP2886838B2

JP2886838B2 - Apparatus and method for parallel decoding of variable length instructions in super scalar pipelined data processor

Info

Publication number: JP2886838B2
Application number: JP492397A
Authority: JP
Inventors: 希聖尚; 自強汪
Original assignee: KOGYO GIJUTSU KENKYUIN
Current assignee: KOGYO GIJUTSU KENKYUIN
Priority date: 1997-01-14
Filing date: 1997-01-14
Publication date: 1999-04-26
Anticipated expiration: 2017-01-14
Also published as: JPH10207707A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates generally to superscalar processors that use variable-length instructions. More particularly, the present invention relates to a memory and a method for storing and using instruction boundary information for a sequence of instructions stored in a cache memory, so as to enable parallel decoding of multiple variable-length instructions.

[0002]

[Prior Art] A stored-program digital computer typically includes a main memory that stores program instructions and data. The main memory supplies the program instructions and data to functional units, such as an arithmetic logic unit (ALU) and a floating point unit (FPU), in a central processing unit (CPU) for execution. The CPU of a reduced instruction set computer (RISC) typically uses fixed-length instructions. The CPU of a complex instruction set computer (CISC), on the other hand, uses variable-length instructions. In an x86-architecture CISC CPU, for example, instructions are 1 to 12 bytes long (not counting prefix codes).

[0003] Cache memories are often connected to the CPU and other functional units to exploit the spatial and temporal locality of reference of instructions. Spatial locality of reference is the tendency of instructions to be stored contiguously in the same order in which they are executed. Temporal locality of reference is the tendency of functional units to access the same instructions repeatedly. Temporal locality arises from program flow control instructions, such as loops, branches, and subroutines, which cause the functional units to repeat the execution of certain recently executed instructions. Functional units therefore tend to repeatedly access instructions stored at the same locations in memory.

[0004] To exploit the spatial and temporal locality of instruction execution and storage, entire localized sequences of instructions are copied from main memory into the cache memory. Generally speaking, data words (typically byte-sized units) are organized into contiguous data lines of, for example, 64 data words. When an instruction is fetched, the entire data line containing the instruction (or all of the data lines, if the instruction spans more than one data line) is read from main memory and written into a cache line storage location of the cache memory. Once an entire sequence of instructions has been loaded into the cache memory, the likelihood increases that the cache memory can satisfy future accesses, whether to the same instructions that were accessed previously or to instructions that have not yet been accessed (provided that the future accesses are to other instructions in data lines already stored in the cache memory).
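The line-fill behavior described above can be illustrated with a small sketch. It assumes byte-sized data words and the 64-word line size given as an example in the text; the function name `lines_touched` is hypothetical and is used only for illustration.

```python
def lines_touched(addr, length, line_size=64):
    """Return the indices of the data lines occupied by an instruction.

    addr      -- address of the instruction's first data word
    length    -- instruction length in data words
    line_size -- data words per data line (64 is the example in the text)
    """
    first = addr // line_size
    last = (addr + length - 1) // line_size
    # Every line from the first to the last must be copied into the cache.
    return list(range(first, last + 1))
```

An instruction that starts near the end of a line and spills into the next one requires both lines to be loaded, which is the "spans more than one data line" case mentioned in the paragraph.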

[0005] Using a cache memory improves system performance because the access time of the cache memory, that is, the time needed to fetch an instruction, is shorter than that of the slower main memory. Occasionally, however, an instruction is requested that is not present in the cache memory. In this case a so-called "cache miss" occurs, and the requested instruction must be obtained from main memory. In most cases, though, because of temporal and spatial locality, the requested instruction is in the cache memory and a so-called "cache hit" occurs. A cache hit means that the requested instruction is obtained from the fast cache memory. In digital computer systems, the proportion of cache hits for instructions reaches the order of 90% owing to locality of reference.
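The benefit of a high hit rate can be made concrete with a short calculation. The sketch below is only illustrative: the 1-cycle cache latency and 10-cycle main-memory latency are assumed values that do not come from the text, and only the hit rate of about 90% is taken from the paragraph above.

```python
def average_fetch_time(hit_rate, cache_cycles, memory_cycles):
    """Expected fetch time when a hit costs cache_cycles and a miss costs memory_cycles."""
    return hit_rate * cache_cycles + (1.0 - hit_rate) * memory_cycles


# Assumed latencies: 1 cycle for the cache, 10 cycles for main memory.
avg = average_fetch_time(0.9, 1, 10)
```

Under these assumed latencies the average fetch costs about 1.9 cycles instead of 10, which is why even a modest cache pays off when locality of reference is high.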

[0006] In the history of CPU development, CPUs were first designed so that their functional units received a sequence of instructions and executed one instruction at a time. Such CPUs are called "scalar" processors because they execute only one instruction at a time. The operating speed of conventional scalar processors has been pushed to its limits by advances in circuit technology, computer organization, and computer architecture. With each new generation of computing machines, however, new acceleration mechanisms must be found for conventional scalar machines.

[0007] Modern processors generally use a technique known as pipelining to improve performance. Pipelining is an instruction execution technique analogous to an assembly line. Instruction execution can often be viewed as a series of steps: fetching the instruction from memory, decoding the instruction into its operation and operands, fetching the operands of the instruction, applying the decoded operation to the operands (referred to below simply as "executing" the instruction), and storing the result in memory or a register. With pipelining, these steps are overlapped across successive instructions. For example, while the CPU stores the result of the first instruction in an instruction sequence, it simultaneously executes the second instruction in the sequence, fetches the operands of the third instruction, decodes the fourth instruction, and fetches the fifth instruction. Pipelining thus reduces the execution time for a sequence of instructions.
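The overlap described in this paragraph can be sketched as follows. The sketch assumes an idealized pipeline in which each of the five steps named above takes exactly one cycle and no stalls occur; the function names and the 0-based numbering of instructions and cycles are illustrative assumptions.

```python
# The five steps named in the paragraph, in order.
STAGES = ["fetch", "decode", "operand fetch", "execute", "store"]


def stage_in_cycle(instr, cycle):
    """Return the stage that instruction `instr` occupies in `cycle`, or None.

    Instruction i enters the pipeline in cycle i, so in cycle c it is in
    stage c - i (idealized: one cycle per stage, no stalls).
    """
    s = cycle - instr
    return STAGES[s] if 0 <= s < len(STAGES) else None


def total_cycles(n, pipelined=True):
    """Cycles needed for n instructions, with or without overlapping the stages."""
    k = len(STAGES)
    return n + k - 1 if pipelined else n * k
```

In cycle 4 the first instruction is storing its result while the fifth is being fetched, matching the example in the text. Under these assumptions five instructions complete in 9 cycles rather than 25.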

[0008] Another technique for improving performance is to execute two or more instructions in parallel, that is, simultaneously. Processors that use this technique are generally called superscalar processors. Such processors may incorporate additional techniques in which a sequence of instructions is executed, and the results of those instructions are stored, in a somewhat arbitrary order that differs from the strictly sequential order in which the instruction sequence is stored. These techniques are called out-of-order issue and out-of-order completion, respectively.

[0009] The ability of a superscalar processor to execute two or more instructions simultaneously depends on the particular instructions being executed. Likewise, the flexibility to issue or complete instructions out of order depends on the particular instructions to be issued or completed. There are three types of such instruction dependencies: resource conflicts, procedural dependencies, and data dependencies. A resource conflict arises when two instructions executing in parallel compete for access to the same resource, for example, the system bus. A data dependency arises when the completion of a first instruction changes a value stored in a register or memory that is later accessed by a second instruction that completes afterward.

[0010] Data dependencies are classified into three types: "true data dependencies", "anti-dependencies", and "output data dependencies". See Mike Johnson, "Superscalar Microprocessor Design", pages 9-24 (1991). An instruction that uses a value computed by a previous instruction has a "true" (or data) dependency on that previous instruction. An example of an output dependency arises in out-of-order completion when first and second sequential instructions both assign different values to the same register or memory location, and a third instruction that follows them uses the value stored in that register or memory location as an operand. The earlier (first) instruction must not be allowed to complete after the later (second) instruction; otherwise the third instruction uses the wrong value. An example of an anti-dependency arises in out-of-order execution when a later instruction, executed out of order ahead of an earlier instruction, destroys a value used by the earlier instruction. As an illustration of true dependencies, output dependencies, and anti-dependencies, consider the following sequence of instructions, where "op" denotes an operation:

(1) R3 := R3 op R5
(2) R4 := R3 + 1
(3) R3 := R5 + 1
(4) R7 := R3 op R4

Instruction (2) has a true dependency on instruction (1) because the value stored in R3 that is used as an operand of instruction (2) is determined by instruction (1). Instruction (3) has an anti-dependency on instruction (2) because instruction (3) changes the contents of register R3. If instruction (3) were executed out of order ahead of instruction (2), instruction (2) would use the wrong value stored in register R3, namely the value as modified by instruction (3). Instructions (1) and (3) have an output dependency. Instruction (1) must not be allowed to complete out of order after instruction (3). The value determined by instruction (3), rather than the value determined by instruction (1), must be the final value stored in register R3, so that instruction (4) executes on the correct operand value in register R3. False dependencies are removed using register renaming and a reorder buffer.
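The three kinds of data dependency in the four-instruction example can be checked mechanically. The sketch below is a simplified illustration, not a method taken from the source: each instruction is reduced to a destination register and a set of source registers, and the `classify` helper (a hypothetical name) compares one earlier instruction with one later instruction without accounting for intervening writes.

```python
# Each instruction is (destination register, set of source registers).
I1 = ("R3", {"R3", "R5"})   # (1) R3 := R3 op R5
I2 = ("R4", {"R3"})         # (2) R4 := R3 + 1
I3 = ("R3", {"R5"})         # (3) R3 := R5 + 1
I4 = ("R7", {"R3", "R4"})   # (4) R7 := R3 op R4


def classify(earlier, later):
    """Return the set of dependencies of `later` on `earlier` (pairwise check only)."""
    e_dest, e_srcs = earlier
    l_dest, l_srcs = later
    deps = set()
    if e_dest in l_srcs:
        deps.add("true")     # later reads what earlier writes
    if l_dest in e_srcs:
        deps.add("anti")     # later overwrites what earlier reads
    if l_dest == e_dest:
        deps.add("output")   # both write the same location
    return deps
```

With these definitions, `classify(I1, I2)` reports a true dependency, `classify(I2, I3)` an anti-dependency, and `classify(I1, I3)` includes the output dependency discussed in the text. Because instruction (1) also reads R3, the pair (1), (3) additionally shows an anti-dependency under this pairwise check.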

[0011] A procedural dependency arises when the execution of a first instruction depends on the outcome of the execution of a preceding instruction, such as a branch instruction. See Mike Johnson, "Superscalar Microprocessor Design", pages 57-77 (1991). It is difficult to know with certainty whether a particular branch will be taken. For simplicity, assume that a branch instruction either continues execution at some pre-specified non-sequential address or continues execution sequentially at the immediately following instruction. In the former case the branch is said to be taken, and in the latter case the branch is said to be not taken. Branch instructions are further complicated by indexed branch instructions, in which the address at which execution continues upon branching varies dynamically according to a value stored in memory or a register. Such indexed branch instructions are not discussed further here. It is therefore difficult to know with certainty the sequence of instructions to be executed after a branch instruction. Nevertheless, a number of branch prediction techniques, which achieve an accuracy of about 90%, are used to predict whether a branch will be taken. With branch prediction, a prediction is made as to whether the branch will be taken, and the sequence of instructions that would execute if the prediction is correct is fetched and executed. Any results of those instructions, however, are treated only as "tentative" until the branch instruction is actually executed. When the branch instruction is executed, a determination is made as to whether the prediction was correct. If the outcome of the branch instruction was predicted correctly, the tentative results are accepted. If the branch was mispredicted, a misprediction recovery step is performed, in which the tentative results are undone or discarded and the correct sequence of instructions is fetched for execution.

[0012] A number of techniques, in software and/or hardware, are used to perform branch prediction. For example, a software compiler can make predictions from the original source code as to whether branches will be taken. Predictions may be static (they do not change during execution) or dynamic (they change during execution). Two dynamic hardware branch prediction schemes use a 2-bit counter, as in the Pentium (registered trademark) processor, or a two-level branch target buffer, as in the Pentium Pro (registered trademark) processor. In a simple branch target buffer technique, each branch instruction is initially presumed not to be taken. As branch instructions are executed and their branches are taken, the addresses to which execution branches are stored in a branch target buffer operated, for example, as a queue. Thereafter, each branch instruction is identified during the instruction fetch cycle, and its branch is predicted to go to the address at the head of the branch target buffer queue. At least one instruction located at addresses beginning with the address at the head of the branch target buffer queue is then fetched. If the branch is not taken when the branch instruction is executed, the prediction was wrong. In that case, misprediction recovery is performed and the branch target buffer queue is not updated. If instead the branch is taken when the branch instruction is executed, but the address to which execution branches does not match the address at the head of the branch target buffer queue, the prediction was again wrong. In that case, the new address is stored in (the tail of) the branch target buffer queue and misprediction recovery is performed. If, on the other hand, the branch is taken and the address to which execution branches matches the address at the head of the branch target buffer queue, the branch was predicted correctly. The address is removed from the head of the branch target buffer queue and stored at its tail.
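The queue-based branch target buffer policy described above can be sketched as a small update function. This is one possible reading under stated assumptions, not an implementation taken from the source: the name `resolve_branch` is hypothetical, and the behavior for an empty queue (predict not taken, record the target if the branch is taken) is an assumption derived from the statement that each branch is initially presumed not taken.

```python
from collections import deque


def resolve_branch(btb, taken, target=None):
    """Update the branch target buffer queue after a branch executes.

    Returns True if the prediction was correct, False if recovery is needed.
    """
    if not btb:
        # Assumption: with nothing recorded, the branch is predicted not taken.
        if taken:
            btb.append(target)      # record the new target at the tail
        return not taken
    predicted = btb[0]              # prediction: branch to the head address
    if not taken:
        return False                # mispredicted; queue is left unchanged
    if target != predicted:
        btb.append(target)          # mispredicted; store new address at the tail
        return False
    btb.rotate(-1)                  # correct; move the head address to the tail
    return True
```

A short usage trace: starting from `btb = deque()`, a taken branch to address 0x100 is mispredicted and recorded, and a repeat of the same branch is then predicted correctly.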

[0013] The instruction set of a CISC processor includes variable-length instructions. When such instructions are packed end to end in memory without gaps, the start and end of each individual CISC instruction cannot easily be identified without prior knowledge or an indication of each instruction's boundaries. A finite amount of time is therefore consumed in separating the variable-length instructions during the instruction fetch operation. Because the start of an instruction cannot be determined until the length (and end) of the immediately preceding instruction has been determined, this time cannot easily be recovered by overlapping the instruction separation step with other instruction execution steps. This drawback becomes more pronounced as the complexity of the instruction set architecture increases (for example, as the number of permissible instruction lengths increases).
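The serial nature of boundary detection can be seen in a minimal sketch. The encoding used here is purely hypothetical and is not x86: the low two bits of an instruction's first byte are assumed to give its length minus one. The point is only that each iteration needs the length found in the previous one.

```python
def toy_length(first_byte):
    """Hypothetical encoding: low 2 bits of the first byte give length - 1 (1 to 4 bytes)."""
    return (first_byte & 0x03) + 1


def find_starts(code):
    """Walk the byte stream sequentially and return the start offset of each instruction."""
    starts, pos = [], 0
    while pos < len(code):
        starts.append(pos)
        # The next start is unknown until this instruction's length is decoded.
        pos += toy_length(code[pos])
    return starts
```

For the byte string `bytes([0x01, 0xAA, 0x00, 0x02, 0xBB, 0xCC, 0x00])`, the walk visits offsets 0, 2, 3, and 6 in that order, and none of them can be computed before its predecessor.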

[0014] As processor design moves from single-issue scalar designs to high-performance, pipelined, superscalar systems, a need arises to decode multiple instructions in parallel. Decoding multiple variable-length instructions in parallel is difficult because the next instruction cannot be decoded until the length of the preceding instruction is known. This problem is significant in superscalar CISC architectures, but not in superscalar RISC architectures, where the use of fixed-length instructions allows the start and end of each instruction to be determined in a simple manner.

[0015] In a CISC architecture, an instruction cache is used to receive instructions from memory and hold them until a functional unit is ready to accept them. Instructions are thus fetched early for input to a decoder, which decodes each instruction before it is input to a functional unit for execution. The advent of superscalar (that is, multiple functional unit) microprocessors, such as the Intel (registered trademark) Pentium Pro (registered trademark), places emphasis on keeping each functional unit busy for as many clock cycles as possible (functional units should not be needlessly idle). This has led to the development of several schemes for decoding variable-length instructions in a superscalar, pipelined environment.

[0016] A prefetcher is provided to fetch instructions from the cache or main memory. The contents of the prefetcher typically indicate the starting address in the cache from which the next instruction is to be fetched. The design of the prefetcher takes into account whether the particular system uses fixed-length or variable-length instructions. In prior art processors that use variable-length instructions, when the instruction cache is first loaded, neither the prefetcher nor the instruction cache has any way of knowing the word alignment, that is, where any given instruction starts and ends. Consequently, when the prefetcher requests an instruction, a large sequence of contiguous data words must be fetched from the instruction cache. This sequence must be large enough to accommodate an instruction of the maximum possible length, so as to guarantee that all data words of the desired instruction are contained in the sequence. Unfortunately, this means that the prefetcher must scan the sequence of data words supplied by the cache memory and discard the data words that lie beyond the desired instruction before transferring the instruction to other elements of the system. This is a serious limitation in superscalar systems, because the prefetcher must spend several clock cycles determining the exact start and end of the desired instruction.

[0017] Several prior art approaches have been devised to address this problem. The Advanced Micro Devices (registered trademark) AMD5K86 (registered trademark) and the Intel (registered trademark) Pentium (registered trademark) and Pentium Pro (registered trademark) processors use various schemes to decode multiple variable-length instructions in parallel. The AMD5K86 (registered trademark) predecodes x86 instructions as they are fetched from memory and written into the instruction cache. During predecoding, the predecoder appends 5 bits to each byte before the byte is written into the cache. These bits are used mainly to indicate the status of the byte within an instruction. In particular, the status records whether the byte is at the start or end of an instruction. A drawback of this predecoding technique is that the predecode bits considerably increase the size of the instruction cache. Extra hardware is needed in the AMD5K86 (registered trademark) scheme to hold and process the large number of predecode bits. See M. Slater, "AMD's K5 Designed to Outrun Pentium", Microprocessor Report, vol. 8, no. 14, pages 1, 6-11, October 1994.
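The size cost of per-byte predecode bits follows from simple arithmetic, sketched below. The figure of 5 predecode bits per byte comes from the paragraph above; the 16 KiB cache size is an assumed value chosen only for illustration, and `predecode_overhead` is a hypothetical helper name.

```python
def predecode_overhead(bits_per_byte, cache_bytes):
    """Return (extra storage in bits, overhead as a fraction of instruction storage)."""
    extra_bits = bits_per_byte * cache_bytes
    return extra_bits, extra_bits / (8 * cache_bytes)


# 5 predecode bits per byte, assumed 16 KiB instruction cache.
extra, fraction = predecode_overhead(5, 16 * 1024)
```

With 5 bits added to every 8-bit byte, the extra storage is 62.5% of the instruction storage regardless of the cache size, which illustrates why the text calls the increase considerable.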

[0018] The Pentium (registered trademark), on the other hand, records instruction boundaries within each line of the instruction cache. The boundaries are recorded to support split-line access, which allows an instruction that spans a line boundary to be fetched in a single clock cycle. When an instruction is first decoded, its length is fed back to the instruction cache. Each instruction cache directory entry marks the instruction boundaries within the line. It is not entirely clear, however, how such directory entry marks are maintained and used, that is, how many bits are used for the marks, whether they are allocated per data line, per data word, or per instruction, where they are stored, and so on. See D. Anderson, "Pentium Processor System Architecture", 2nd edition, 1995; Intel, "A Tour of the Pentium Pro Processor Microarchitecture" (http://intel.com/procs/ppro/info/p6 white/index.htm); and L. Gwennap, "Intel's P6 Uses Decoupled Superscalar Design", Microprocessor Report, vol. 9, no. 2, pages 9-15, February 1995.

[0019]

[Problems to Be Solved by the Invention] Other decoder schemes for superscalar systems are described in U.S. Pat. Nos. 5,202,967, 5,337,415, 5,459,844, and 5,488,745. All of these patents involve significant additional hardware components to provide predecoding or storage of boundary information for variable-byte-length instructions. A technique is therefore needed for decoding multiple variable-length instructions in parallel using a minimum of additional hardware components, so as to maximize parallelism at reduced hardware cost.

[0020]

[Means for Solving the Problems] These and other objects are achieved by the present invention. According to one embodiment, a method is provided for determining the start and end of each instruction of a variable-length instruction set. Data lines are stored in a first storage area, for example, an instruction cache. Each data line comprises a sequence of data words stored at sequential addresses in main memory. The data lines contain multiple encoded variable-length instructions stored contiguously in main memory. A plurality of indicators, including one indicator associated with each data word of the data lines stored in the first storage area, is stored in a second storage area. Each indicator indicates whether its associated data word is the first data word of a variable-length instruction.
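The per-word indicators described in this embodiment can be sketched as a vector holding one flag per data word. The sketch assumes a single bit per data word and assumes that the first instruction starts at word 0 of the line; neither the bit width nor the helper name `start_indicators` is specified by the source.

```python
def start_indicators(lengths, line_size):
    """Build one indicator per data word: 1 if the word starts an instruction, else 0.

    lengths   -- lengths, in data words, of the instructions packed into the line
    line_size -- number of data words in the data line
    """
    bits = [0] * line_size
    pos = 0
    for n in lengths:
        if pos >= line_size:
            break               # remaining instructions fall in a later line
        bits[pos] = 1           # first data word of this instruction
        pos += n
    return bits
```

For instruction lengths 2, 1, 3, and 1 in an 8-word line, the vector is `[1, 0, 1, 1, 0, 0, 1, 0]`. The first storage area would hold the eight data words and the second storage area would hold this vector.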

[0021] According to another embodiment, a sequence of data words is fetched from at least one sequential data line stored in the cache. The fetched sequence of data words includes a starting data word and at least as many data words as are contained in an instruction of the maximum permissible length. A plurality of indicators (that is, a vector of indicators), including the indicator associated with each data word of the fetched sequence, is fetched from the second storage area. Using the indicators as delimiters of the sequence of instructions to be decoded, at least one non-overlapping subsequence of the sequence of data words is identified, where each subsequence of data words belongs to a different, sequential instruction to be decoded. Each subsequence of data words is decoded as a separate instruction.
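Using the indicators as delimiters amounts to cutting the fetched sequence at every flagged word. The sketch below assumes one indicator per data word with the value 1 marking the first word of an instruction; the function name `split_by_indicators` is hypothetical.

```python
def split_by_indicators(words, bits):
    """Split `words` into non-overlapping subsequences, one per flagged start word."""
    starts = [i for i, b in enumerate(bits) if b]
    ends = starts[1:] + [len(words)]
    # Each slice runs from one start indicator up to (not including) the next.
    return [words[s:e] for s, e in zip(starts, ends)]
```

Because every cut point is known up front, the subsequences can be handed to separate decoders at the same time instead of being discovered one after another. Note that the final subsequence may be an incomplete instruction if the fetched window ends in the middle of one.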

[0022] For example, the length of each instruction is determined simultaneously and separately, without using the delimiting indicators. These lengths are compared with the indicators to determine whether they agree. A sequence of instructions is output for dispatch to one or more functional units. This sequence includes the first instruction, which begins at the starting data word of the fetched sequence of data words, and each other instruction that sequentially follows the first instruction and is preceded only by one or more instructions for which the fetched vector of indicators agrees with the determined lengths. A vector of replacement indicators is generated, for example, from the determined lengths. Using the vector of replacement indicators, each variable-length instruction preceded by at least one instruction for which the fetched vector of indicators does not agree with the determined length is re-parsed from the fetched sequence of data words and re-decoded.
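The verification step can be sketched by comparing the fetched indicators with lengths that are determined independently of them. Everything specific in this sketch is an assumption made for illustration: the toy length encoding (low two bits of the first word give length minus one), the one-bit indicators, and the names `toy_length` and `verify`. An instruction is counted as output when every instruction before it agreed with its indicators, following the rule stated above.

```python
def toy_length(first_word):
    """Hypothetical encoding: low 2 bits of the first word give length - 1."""
    return (first_word & 0x03) + 1


def verify(words, fetched_bits):
    """Return (replacement_bits, issued) for a fetched sequence of data words.

    replacement_bits -- indicators regenerated from the determined lengths
    issued           -- number of leading instructions that may be output
    """
    replacement = [0] * len(words)
    spans, pos = [], 0
    while pos < len(words):                 # lengths found without the indicators
        n = toy_length(words[pos])
        replacement[pos] = 1
        spans.append((pos, min(pos + n, len(words))))
        pos += n
    issued = 0
    for start, end in spans:
        issued += 1                         # all predecessors agreed, so output it
        next_bit = fetched_bits[end] if end < len(words) else 1
        agrees = (fetched_bits[start] == 1
                  and not any(fetched_bits[start + 1:end])
                  and next_bit == 1)
        if not agrees:
            break                           # later instructions must be re-parsed
    return replacement, issued
```

With the words `[0x01, 0xAA, 0x00, 0x02, 0xBB, 0xCC, 0x00]` and stale indicators `[1, 0, 1, 0, 1, 0, 1]`, the sketch outputs the first two instructions, flags the rest for re-parsing, and produces the replacement vector `[1, 0, 1, 1, 0, 0, 1]`.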

【００２３】他の実施例によれば、標識を更新する方法
が提供される。少なくとも１個のデコードされるべき命
令の系列は、第２の記憶領域からフェッチされた標識の
ベクトルを用いて、キャッシュからフェッチされたデー
タワードの系列から識別される。デコードされるべき命
令の識別された系列はデコードされる。命令をデコード
するのと同時に、デコードされるべき命令の系列の各命
令の最初及び最後のデータワードは標識を用いることな
く配置される。上記の長さは、フェッチされた命令の正
確さを照合するため使用される。置換標識がかくして配
置された最初及び最後のデータワードに従って発生さ
れ、得られた標識の代わりに第２の記憶領域に格納され
る。
According to another embodiment, a method for updating the indicators is provided. A sequence of at least one instruction to be decoded is identified in a sequence of data words fetched from the cache, using a vector of indicators fetched from the second storage area. The identified sequence of instructions is decoded. Concurrently with the decoding, the first and last data words of each instruction in the sequence are located without using the indicators. The lengths so obtained are used to verify the accuracy of the fetched instructions. Replacement indicators are generated according to the first and last data words thus located and are stored in the second storage area in place of the fetched indicators.

【００２４】例えば、処理システムは、第１及び第２の
記憶領域と共に実現される。プリフェッチ器は、第１の
記憶領域からデコードされるべき命令を含むデータワー
ドの系列を受け、第２の記憶領域から標識のベクトルを
受ける。プリフェッチ器は、多数のデコーダへの同時の
出力に対し１個以上の可変長命令をパーシングするた
め、区切り記号として標識を使用する。プリフェッチ器
は、更に、標識のベクトル及びパーシングされた命令を
命令長照合器に出力する。命令長照合器は、デコーダが
命令をデコードする間に、同時に多数の仕事を行う。命
令長照合器は、標識を使用することなく各命令の長さを
判定する。例えば、命令長照合器は、各命令の長さ（即
ち、最初及び最後のデータワード）を判定するため、デ
ータワードの系列を厳密に調べる。命令長照合器は、単
一サイクル内の多数の命令の長さを同時に判定すること
ができる。命令長照合器は、第２の記憶領域に格納する
ため、長さから判定された標識のベクトルを出力する。
命令長照合器は、更に、標識が命令長と一致しなかった
１個以上の命令により先行された上記命令ワードの再パ
ーシング及び再デコーディングを生じさせる。これは、
新たに判定された標識のベクトルを含む適当な信号を、
プリフェッチ器及びデコーダに出力することにより達成
される。
For example, a processing system is implemented with the first and second storage areas. A prefetcher receives a sequence of data words containing the instructions to be decoded from the first storage area and a vector of indicators from the second storage area. The prefetcher uses the indicators as delimiters to parse one or more variable-length instructions for simultaneous output to multiple decoders. The prefetcher also outputs the vector of indicators and the parsed instructions to an instruction length verifier. The instruction length verifier performs several tasks concurrently while the decoders decode the instructions. It determines the length of each instruction without using the indicators. For example, it examines the sequence of data words to determine the length of each instruction (i.e., its first and last data words). The instruction length verifier can determine the lengths of multiple instructions simultaneously within a single cycle. It outputs a vector of indicators derived from those lengths for storage in the second storage area. The instruction length verifier also causes re-parsing and re-decoding of those instruction words that are preceded by one or more instructions whose indicators did not agree with the instruction length. This is achieved by outputting appropriate signals, including the newly determined vector of indicators, to the prefetcher and the decoders.

【００２５】[0025]

【発明の実施の形態】添付図面を参照して以下の詳細な
説明を読むことにより本発明のより良い理解が得られ
る。図１に示されるように、スーパースカラパイプライ
ン式データ処理システム１０は、例えば、命令キャッシ
ュ３０、命令プリフェッチ器５０、命令長照合器７０、
Ｍ（Ｍ＞１）台のデコーダ８０、及びＮ（Ｎ＞１）台の
機能ユニット１００により構成される。処理システム１
０は、データキャッシュ（図示しない）のような他の構
成部品を含んでもよい。機能ユニット１００は、受けら
れた命令に基づいて処理システム１０の一部を制御す
る。機能ユニット１００の例はＡＬＵ及びＦＰＵである
が、機能ユニットは命令を実行する装置であれば、どの
ような装置でも構わない。
DESCRIPTION OF THE EMBODIMENTS A better understanding of the present invention may be obtained by reading the following detailed description with reference to the accompanying drawings. As shown in FIG. 1, a superscalar pipelined data processing system 10 comprises, for example, an instruction cache 30, an instruction prefetcher 50, an instruction length verifier 70, M (M > 1) decoders 80, and N (N > 1) functional units 100. The processing system 10 may include other components, such as a data cache (not shown). Each functional unit 100 controls part of the processing system 10 in accordance with the instructions it receives. Examples of functional units 100 are an ALU and an FPU, but a functional unit may be any device that executes instructions.

【００２６】例えば、処理システム１０は、命令境界メ
モリ２０及びタグメモリ４０を更に含む。命令境界メモ
リ２０、命令キャッシュ３０及びタグメモリ４０は、単
一のメモリ構造の記憶領域、又は、命令境界メモリ２０
と命令キャッシュ３０とタグメモリ４０とに対し別々に
アクセス可能なメモリ回路の記憶領域として実現され
る。例えば、命令キャッシュ３０は、各々が同じバイト
数、例えば、１６バイトからなる固定長のキャッシュラ
イン記憶場所に構造化される。タグメモリ４０は、命令
キャッシュ３０の各キャッシュライン記憶場所に対応し
て１個のタグ要素を含むタグの配列を有する。各タグ要
素は、対応した記憶場所が、無効データ、変更されてい
ない有効データ、変更された有効データ等を含むかどう
かを示す標識（ビット）を含む。キャッシュ及びデータ
ラインの“支配権”、書き送りスキーム及びメモリ整合
性の動作の説明は、本発明の範囲を超えているが、名称
“Ｉ／Ｏデバイスと主メモリとの間でデータを転送する
メモリの矛盾の無い先行の支配方法及びシステム(Memor
y Consistent Pre-Ownership Method and System forTr
ansferring Data Between an I/O Device and a Main M
emory) ”の米国特許出願第０８／０７１，７２１号明
細書に記載されている。命令境界メモリ２０は以下に詳
細に説明する。更に、分岐ターゲットバッファ６０がプ
リフェッチ器５０に接続される。分岐ターゲットバッフ
ァ６０は、分岐命令の結果を推定し、ある状況下で上記
命令の付加的な並列化を可能にする。
For example, the processing system 10 further includes an instruction boundary memory 20 and a tag memory 40. The instruction boundary memory 20, the instruction cache 30, and the tag memory 40 are implemented either as storage areas of a single memory structure or as storage areas of separately accessible memory circuits, one each for the instruction boundary memory 20, the instruction cache 30, and the tag memory 40. For example, the instruction cache 30 is organized into fixed-length cache line storage locations, each holding the same number of bytes, e.g., 16 bytes. The tag memory 40 holds an array of tags with one tag element for each cache line storage location of the instruction cache 30. Each tag element contains indicators (bits) showing whether the corresponding storage location holds invalid data, unmodified valid data, modified valid data, and so on. A description of cache and data line "ownership", write-back schemes, and memory consistency operations is beyond the scope of the present invention but can be found in U.S. patent application Ser. No. 08/071,721, entitled "Memory Consistent Pre-Ownership Method and System for Transferring Data Between an I/O Device and a Main Memory". The instruction boundary memory 20 is described in detail below. In addition, a branch target buffer 60 is connected to the prefetcher 50. The branch target buffer 60 predicts the outcome of branch instructions and, under certain circumstances, enables additional parallelism among such instructions.

【００２７】１．命令境界及び命令キャッシュ配列の構
造例えば、命令キャッシュ３０は、図６に示されているよ
うに、データワード（例えば、バイト）のＶ×Ｕの２次
元のバイト配列３００を有する。命令キャッシュ配列３
００は、Ｖ個のデータワード（例えば、バイト）のライ
ンサイズを伴うＵ個のデータライン格納場所を有する。
命令キャッシュ配列３００の各キャッシュライン内の最
上位データワードは、バイトＶ−１（３１０）である。
命令キャッシュ配列３００の各キャッシュライン内の最
下位データワードは、バイト０（３０１）である。命令
キャッシュ配列３００の最上位キャッシュラインは、ラ
インＵ−１（３５９）である。命令キャッシュ配列３０
０の最下位キャッシュラインは、ライン０（３５１）で
ある。
1. Structure of the Instruction Boundary and Instruction Cache Arrays
For example, as shown in FIG. 6, the instruction cache 30 has a two-dimensional V × U array 300 of data words (e.g., bytes). The instruction cache array 300 has U data line storage locations, each with a line size of V data words (e.g., bytes). The most significant data word in each cache line of the instruction cache array 300 is byte V-1 (310), and the least significant data word is byte 0 (301). The most significant cache line of the instruction cache array 300 is line U-1 (359), and the least significant cache line is line 0 (351).

【００２８】命令は、メインメモリから（命令が潜在的
に２本以上のデータラインに拡がる場合を想定して）デ
ータライン全体又は命令を含むデータラインを読み、デ
ータラインを命令キャッシュのキャッシュライン格納場
所に書き込むことによりロードされる。例えば、命令は
１乃至６個のデータワード長、各アドレスは１６ビット
長、データラインのサイズＶが１６バイトである場合を
想定する。アドレス0001 0000 0000 0010 即ち、４０９
８の命令を実行すべきであるが、命令キャッシュ３０に
その命令が存在しない場合を考える。この場合に、デー
タワード４０９６乃至４１１１を含むデータライン４０
９６の全体が主メモリから読み出され、命令キャッシュ
３０の例えばキャッシュラインＸに格納される。命令の
第１のデータワードは、命令キャッシュ配列３００に格
納されたデータラインの第３のデータワードであること
が予め分かる。しかし、逐次的に続く命令の第１のデー
タワードがアドレス４０９９、４１００、４１０１、４
１０２、４１０３又は４１０４で始まるかどうかは、ア
ドレス４０９８から始まる命令の長さ（即ち、最後のデ
ータワード）が判定されるまで分からない。
Instructions are loaded by reading from main memory the entire data line or data lines containing the instruction (an instruction may span two or more data lines) and writing each data line into a cache line storage location of the instruction cache. For example, assume that instructions are 1 to 6 data words long, each address is 16 bits long, and the data line size V is 16 bytes. Suppose the instruction at address 0001 0000 0000 0010 (i.e., 4098) is to be executed but is not present in the instruction cache 30. In this case, the entire data line 4096, which contains data words 4096 through 4111, is read from main memory and stored in the instruction cache 30, for example in cache line X. It is known in advance that the first data word of the instruction is the third data word of the data line stored in the instruction cache array 300. However, whether the first data word of the sequentially following instruction begins at address 4099, 4100, 4101, 4102, 4103, or 4104 is not known until the length (i.e., the last data word) of the instruction beginning at address 4098 is determined.
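The address arithmetic of this example can be sketched in Python. This is only an illustration under the paragraph's stated assumptions (16-word data lines, instructions of 1 to 6 data words); the names `locate` and `candidate_next_starts` are hypothetical and do not appear in the specification.

```python
LINE_SIZE = 16   # V: data words per data line (assumed, as in the example)
MAX_LENGTH = 6   # longest instruction in data words (assumed, as in the example)


def locate(address, line_size=LINE_SIZE):
    """Return (data line base address, offset within the line) for an address."""
    offset = address % line_size
    return address - offset, offset


def candidate_next_starts(address, max_length=MAX_LENGTH):
    """Addresses where the next instruction could begin while the length is unknown."""
    return [address + length for length in range(1, max_length + 1)]


address = 0b0001_0000_0000_0010   # 4098
base, offset = locate(address)    # data line 4096, offset 2 (the third data word)
candidates = candidate_next_starts(address)
```

The six candidate addresses are exactly the ambiguity that the boundary indicators described next are meant to remove.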

【００２９】このため、命令境界メモリ２０は、図５及
び図７乃至図９に示されているように、Ｖ×Ｕの標識の
２次元配列４００を有する。例えば、各標識は単一ビッ
トで実現してもよい。命令境界メモリ４００の各標識
は、命令キャッシュ配列３００内の固有のデータワード
格納場所に対応する。例えば、０≦ｕ≦Ｕかつ０≦ｖ≦
Ｖであるとき、ｕ＝０、ｖ＝０なる標識は、命令キャッ
シュ配列３００の位置０のキャッシュライン０に格納さ
れたデータワードに対応し、ｕ＝０、ｖ＝１なる標識
は、命令キャッシュ配列３００の位置１のキャッシュラ
イン０に格納されたデータワードに対応し、以下同様で
ある。８キロワードの命令キャッシュ配列３００に対
し、必要な境界情報を保持するため１キロの標識（１キ
ロビット）が標識配列４００内に要求される。標識配列
４００内の各標識は、対応する命令キャッシュ配列３０
０内のデータワードが命令の最初又は第１のデータワー
ドであるかどうかを指定するため使用される。データワ
ードが命令の中の最初のデータワードであるならば、対
応する標識配列４００内の標識ビットは、例えば、１に
設定され、そうでなければ標識配列４００内の標識ビッ
トは０に設定される。
For this purpose, the instruction boundary memory 20 has a two-dimensional V × U array 400 of indicators, as shown in FIG. 5 and FIGS. 7 to 9. For example, each indicator may be implemented as a single bit. Each indicator in the indicator array 400 corresponds to a unique data word storage location in the instruction cache array 300. For example, with 0 ≤ u < U and 0 ≤ v < V, the indicator at u = 0, v = 0 corresponds to the data word stored at position 0 of cache line 0 of the instruction cache array 300, the indicator at u = 0, v = 1 corresponds to the data word stored at position 1 of cache line 0, and so on. For an 8-kiloword instruction cache array 300, one kilo of indicators (1 kilobit) is required in the indicator array 400 to hold the necessary boundary information. Each indicator in the indicator array 400 specifies whether the corresponding data word in the instruction cache array 300 is the initial, or first, data word of an instruction. If it is, the corresponding indicator bit in the indicator array 400 is set to, for example, 1; otherwise the indicator bit is set to 0.
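The one-indicator-per-data-word mapping can be illustrated with a short Python sketch. It assumes a flat list stands in for the two-dimensional indicator array and that the instruction lengths are already known; `indicator_index` and `mark_boundaries` are hypothetical names, not terms from the specification.

```python
def indicator_index(u, v, line_size):
    """Flat position of the indicator for data word v of cache line u."""
    return u * line_size + v


def mark_boundaries(lengths, total_words):
    """Set a 1 at the first data word of each instruction and 0 elsewhere."""
    bits = [0] * total_words
    position = 0
    for length in lengths:
        if position >= total_words:
            break
        bits[position] = 1
        position += length
    return bits


# Four back-to-back instructions of 2, 1, 3 and 2 data words.
bits = mark_boundaries([2, 1, 3, 2], total_words=8)
```

Here the instructions start at data words 0, 2, 3 and 6, so only those four indicator bits are set.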

【００３０】命令境界メモリ２０がプリフェッチ器５０
に書き込む境界情報は、命令キャッシュ３０によりプリ
フェッチ器５０に転送されたデータワードの系列内の各
データワードに対応する１個の標識ビットを含む標識の
ベクトルの形をなす。標識のベクトルは、“０”標識ビ
ットと“１”標識ビットの系列により構成され、第１の
“１”標識ビットは、命令キャッシュ３０により出力さ
れたデータライン内の命令の最初のデータワードの場所
に対応する。現在の命令の長さは、（現在の命令の第１
のデータワードを示す）現在の命令に対する“１”の値
が付けられた標識と、（本当に次の命令の第１のデータ
ワードを示す）本当に次の命令の“１”の値が付けられ
た標識ビットとの間仕切りに対応する。プリフェッチ器
５０は、命令境界メモリ２０から標識のベクトルを受
け、現在要求されている命令の境界を識別するため標識
のベクトルを使用する。例えば、プリフェッチ器５０
は、第１の値“１”が与えられた標識ビットを配置する
ため、標識のベクトルをパーシングする。第１の値
“１”が与えられた標識ビットは実行されるべき本当に
次の命令の最初又は第１のデータワードに対応する。プ
リフェッチ器５０は、次に、第２番目に生じる値“１”
が与えられた標識ビットを、データラインに格納される
ような直ぐ後に続く命令の第１のデータワードを示す標
識ベクトルに配置する。かかる情報を使用することによ
り、プリフェッチ器５０は、命令の長さ、及び、より重
要な最後又は末尾のデータワードを判定することができ
る。先頭及び末尾の記憶場所が識別された後、完全な命
令を構成するデータワードが分かる。例えば、プリフェ
ッチ器５０は、多数の可変長命令を各サイクルで同時に
（即ち、並列に）パーシングすることができる。
The boundary information that the instruction boundary memory 20 supplies to the prefetcher 50 takes the form of a vector of indicators containing one indicator bit for each data word in the sequence of data words transferred to the prefetcher 50 by the instruction cache 30. The vector consists of a sequence of "0" and "1" indicator bits, the first "1" bit corresponding to the location of the first data word of an instruction in the data line output by the instruction cache 30. The length of the current instruction corresponds to the distance between the "1"-valued indicator bit of the current instruction (marking its first data word) and the "1"-valued indicator bit of the very next instruction (marking that instruction's first data word). The prefetcher 50 receives the vector of indicators from the instruction boundary memory 20 and uses it to identify the boundaries of the currently requested instructions. For example, the prefetcher 50 parses the vector to locate the first indicator bit with the value "1", which corresponds to the initial, or first, data word of the very next instruction to be executed. The prefetcher 50 then locates the second occurrence of a "1"-valued indicator bit in the vector, which marks the first data word of the immediately following instruction as stored in the data line. From this information the prefetcher 50 can determine the length of the instruction and, more importantly, its last, or ending, data word. Once the first and last storage locations are identified, the data words that make up the complete instruction are known. For example, the prefetcher 50 can parse multiple variable-length instructions simultaneously (i.e., in parallel) in each cycle.
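The delimiting role of the "1" bits can be sketched as follows. This is a sequential software illustration only; the specification describes hardware that parses several instructions in the same cycle. The function name `parse_by_indicators` and its parameters are hypothetical.

```python
def parse_by_indicators(words, indicators, max_instructions):
    """Cut `words` into instructions, each running from one 1 bit to the next."""
    starts = [i for i, bit in enumerate(indicators) if bit == 1]
    instructions = []
    # An instruction ends just before the next instruction's first data word.
    for begin, end in zip(starts, starts[1:]):
        if len(instructions) == max_instructions:
            break
        instructions.append(words[begin:end])
    return instructions


words = [10, 11, 12, 13, 14, 15, 16, 17]
indicators = [1, 0, 1, 1, 0, 0, 1, 0]
parsed = parse_by_indicators(words, indicators, max_instructions=4)
```

The instruction that begins at the last "1" bit is not emitted in this sketch, because its ending data word cannot be known until a further "1" bit is seen.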

【００３１】２．命令境界メモリの動作命令を実行し、命令境界メモリ２０を更新する処理は図
２及び図３のフローチャートにより説明される。ステッ
プ２０２において、プリフェッチ器５０は、命令キャッ
シュ３０からの（２個以上の命令を含む）データワード
の系列を要求する。“リードミス”があったならば、キ
ャッシュコントローラ（図示しない）は、主メモリ（図
示しない）から要求されたデータワードの系列を含む２
個以上のデータラインを取り出す。命令キャッシュ３０
内に“リードヒット”があったならば、図２のステップ
２０４に従って、命令キャッシュ３０は、命令全体が確
実に含まれるのに十分な大きさがある命令キャッシュ配
列３００の隣接したデータワードの系列をフェッチす
る。かかる系列は、命令キャッシュ３０のキャッシュラ
イン内の特定のオフセットからの全てのデータワード
と、直ぐ次のキャッシュラインの最初の数個のデータワ
ードの中の２個以上のデータワードとを含む。ここで、
“次”は、データラインのアドレスに関して判定され、
必ずしも命令キャッシュ配列３００内の隣でなくてもよ
い。デコーダ８０によるデータワードの消費レートは、
プリフェッチ器５０によるデータワードの作成レートと
一致する必要がない点に注意すべきである。かかる場合
に、プリフェッチ器５０は、先にフェッチされた命令の
データワードがデコーディングのため提示される間に格
納されるバッファを有する。一実施例において、プリフ
ェッチ器５０のバッファ内のデータワードの占有がある
レベルよりも低下したとき、プリフェッチ器５０は命令
キャッシュ３０から一定数のデータワードをフェッチす
る。プリフェッチ器５０は、直ぐ次の命令の第１のデー
タワードを配置し、デコーダ８０によりデコーディング
のため出力されたデータワードの系列が隣接することを
保証するため、例えば、バッファ内のデータワードを
“回転”させ、詰め込む。フェッチされたデータワード
の系列は、例えば、少なくとも最大の許容可能な長さの
命令と同じ長さである。データワードがフェッチされる
のと同時に、命令境界メモリ２０は、図２のステップ２
０４を完了するため、境界情報を標識のベクトルの形式
でプリフェッチ器５０に出力する。
2. Operation of the Instruction Boundary Memory
The process of executing instructions and updating the instruction boundary memory 20 is described with reference to the flowcharts of FIGS. 2 and 3. In step 202, the prefetcher 50 requests a sequence of data words (containing two or more instructions) from the instruction cache 30. On a "read miss", a cache controller (not shown) retrieves from main memory (not shown) two or more data lines containing the requested sequence of data words. On a "read hit" in the instruction cache 30, the instruction cache 30 fetches, in accordance with step 204 of FIG. 2, a sequence of contiguous data words from the instruction cache array 300 that is large enough to be certain of containing the entire instruction. Such a sequence includes all data words from a particular offset within a cache line of the instruction cache 30 and two or more of the first few data words of the immediately following cache line. Here, "following" is determined with respect to data line addresses; the line need not be adjacent within the instruction cache array 300. Note that the rate at which the decoders 80 consume data words need not match the rate at which the prefetcher 50 produces them. For this case, the prefetcher 50 has a buffer in which the data words of previously fetched instructions are held while they are presented for decoding. In one embodiment, the prefetcher 50 fetches a fixed number of data words from the instruction cache 30 when the occupancy of its buffer falls below a certain level. The prefetcher 50, for example, "rotates" and packs the data words in the buffer to locate the first data word of the very next instruction and to ensure that the sequence of data words output for decoding by the decoders 80 is contiguous. The fetched sequence of data words is, for example, at least as long as the longest permissible instruction. Concurrently with the fetching of the data words, the instruction boundary memory 20 outputs the boundary information to the prefetcher 50 in the form of a vector of indicators, completing step 204 of FIG. 2.
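A fetch that runs off the end of one cache line into the "following" line can be sketched as below. The sketch assumes the cache is modeled as a dictionary keyed by data line base address, which makes "following" a matter of address rather than of position in the array; `fetch_window` is a hypothetical name, and a missing line simply raises `KeyError` here in place of the read-miss handling described above.

```python
def fetch_window(lines, address, count, line_size=16):
    """Collect `count` contiguous data words starting at `address`."""
    window = []
    for word_address in range(address, address + count):
        offset = word_address % line_size
        base = word_address - offset      # base address of the containing line
        window.append(lines[base][offset])
    return window


# Two cached data lines, deliberately listed out of address order.
lines = {
    4112: list(range(4112, 4128)),
    4096: list(range(4096, 4112)),
}
window = fetch_window(lines, address=4108, count=6)
```

For readability each data word in this toy cache holds its own address, so the fetched window shows directly which locations were read.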

【００３２】図２のステップ２０６において、プリフェ
ッチ器５０は、データワードの系列から２個以上の可変
長命令を同時にパーシングして出力する。その動作を行
う際に、プリフェッチ器５０は、得られた情報を別々の
命令に分割するため、標識のベクトルを調べる。上記の
如く、プリフェッチ器５０は、実行されるべき直ぐ次の
命令の第１のデータワードと、直後に続く命令とを判定
するため、第１番目及び第２番目に生じる値“１”が与
えられた標識ビットを探して標識を調べる。データワー
ドをパーシングして出力された上記情報は、第１の命令
に対応する。同様に、プリフェッチ器５０は、直後に続
く命令の末尾又は最後のデータワードを判定するため第
３番目に生じる値“１”が与えられた標識ビットを探索
し、以下同様である。かかる“探索”は、例えば、直線
的には行われず、むしろ、標識のベクトルを、データワ
ードの系列を２個以上の別個の命令の系列に抽出／パー
シングする適当な論理回路に入力することにより行われ
る。上記回路は、当業者の技術の範囲内にあり、簡単の
ためここで再検討を行わない。かくしてパーシングされ
た各命令は、夫々のデコーダ８０に出力される。例え
ば、４台のデコーダ８０があるならば、４個のデータワ
ードの系列がパーシングして出力される。かくしてパー
シングされ、又は、分割された全ての命令は、標識のベ
クトルと共に命令長照合器７０に出力される。図示され
るように、デコーダ８０によるデコーディングと、命令
長照合器７０による命令長（より詳細に言うと、標識の
ベクトルの正確さ）の照合とが並列に行われる。
In step 206 of FIG. 2, the prefetcher 50 simultaneously parses two or more variable-length instructions out of the sequence of data words. In doing so, the prefetcher 50 examines the vector of indicators to divide the fetched information into separate instructions. As described above, the prefetcher 50 examines the indicators for the first and second occurrences of "1"-valued indicator bits to determine the first data word of the very next instruction to be executed and that of the immediately following instruction. The data words parsed out on the basis of this information correspond to the first instruction. Similarly, the prefetcher 50 looks for the third occurrence of a "1"-valued indicator bit to determine the last, or ending, data word of the immediately following instruction, and so on. Such a "search" is, for example, not performed linearly; rather, the vector of indicators is input to suitable logic circuitry that extracts, or parses, the sequence of data words into two or more separate instruction sequences. Such circuitry is within the skill of those in the art and, for brevity, is not reviewed here. Each instruction thus parsed is output to a respective decoder 80. For example, if there are four decoders 80, four sequences of data words are parsed out. All instructions thus parsed, or divided, are output together with the vector of indicators to the instruction length verifier 70. As shown, decoding by the decoders 80 and verification of the instruction lengths (more precisely, of the accuracy of the vector of indicators) by the instruction length verifier 70 proceed in parallel.

【００３３】図２のステップ２０８において、各デコー
ダ８０は、プリフェッチ器５０により供給された命令を
デコードしようとする。ステップ２０８と並列に行われ
るステップ２０９において、命令長照合器７０は、プリ
フェッチ器５０により供給された命令に対する正しい命
令長を照合する。特に、命令長照合器７０は、データワ
ードの系列から各命令を抽出し、かくして抽出された最
初又は第１番目のデータを記録する。命令長照合器７０
は、多数の命令の長さを同時（即ち、並列）に照合し得
る。例えば、４台のデコーダ８０があるならば、プリフ
ェッチ器は、好ましくは、４個のデータワードの部分系
列をパーシングする。上記の４個の部分系列は、データ
ワードの部分系列に含まれる各命令の長さを同時に判定
する命令長照合器７０に並列に受けられる。かくして判
定された命令の長さは、各命令の境界を識別する。命令
長照合器７０により発生させられた照合後の命令情報
は、命令境界メモリ２０により出力された標識のベクト
ルと類似した標識のベクトルの形に形成される。正確な
照合後の境界情報は、命令長照合器７０から命令境界メ
モリ２０に書き込まれる。書込みの回数を削減するた
め、命令長照合器７０は、命令境界メモリ２０から受け
た標識がエラーを含む場合に限り、命令境界メモリ２０
の標識を書き換える。
In step 208 of FIG. 2, each decoder 80 attempts to decode the instruction supplied to it by the prefetcher 50. In step 209, performed in parallel with step 208, the instruction length verifier 70 verifies the correct instruction length of each instruction supplied by the prefetcher 50. Specifically, the instruction length verifier 70 extracts each instruction from the sequence of data words and records the initial, or first, data word thus extracted. The instruction length verifier 70 can verify the lengths of multiple instructions simultaneously (i.e., in parallel). For example, if there are four decoders 80, the prefetcher preferably parses out four subsequences of data words. These four subsequences are received in parallel by the instruction length verifier 70, which simultaneously determines the length of each instruction contained in them. The instruction lengths thus determined identify the boundaries of each instruction. The verified instruction information produced by the instruction length verifier 70 is formed into a vector of indicators similar to the vector output by the instruction boundary memory 20. The accurate, verified boundary information is written from the instruction length verifier 70 to the instruction boundary memory 20. To reduce the number of writes, the instruction length verifier 70 rewrites the indicators in the instruction boundary memory 20 only if the indicators received from the instruction boundary memory 20 contain an error.
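The verification and conditional write-back can be sketched in Python. The sketch rests on a made-up toy encoding in which the first data word of each instruction holds that instruction's length; real length determination depends on the instruction set and is done in parallel hardware. All names (`toy_length`, `verified_indicators`, `write_back_needed`) are hypothetical.

```python
def toy_length(words, start):
    """Toy encoding (an assumption): the first data word holds the length."""
    length = words[start]
    if length < 1:
        raise ValueError("toy encoding requires a length of at least 1")
    return length


def verified_indicators(words, length_of=toy_length):
    """Derive the indicator vector from the data words alone."""
    bits = [0] * len(words)
    position = 0   # the fetched sequence starts on a known instruction boundary
    while position < len(words):
        bits[position] = 1
        position += length_of(words, position)
    return bits


def write_back_needed(fetched, verified):
    """Rewrite the boundary memory only when the fetched indicators are wrong."""
    return fetched != verified


words = [2, 0, 1, 3, 0, 0, 2, 0]
verified = verified_indicators(words)
```

A reset (all-zero) vector fetched for these words would trigger a write-back, whereas a vector that already equals `verified` would not.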

【００３４】命令境界メモリ２０からプリフェッチ器５
０に最初にフェッチされた標識のベクトルは、対応する
データワードがフェッチされたデータワードの系列に含
まれる命令の第１のデータワードであるかどうかを正確
に示すという保証がない点に注意が必要である。かかる
不確実さは何通りかの状況で発生する。第１の状況は、
命令データワードが実行中にそのままの状態を維持する
ことが要求されない場合（例えば、自己修正コード）で
ある。第２の状況は、命令キャッシュ３０内のデータラ
インに対するキャッシュライン格納場所が、書き直し又
はキャッシュフラッシングスキームに従って別のデータ
ライン（即ち、主メモリの別の場所）から上書きされる
場合である。このような場合に、対応する標識のベクト
ルはリセットされ、即ち、命令キャッシュ３０のキャッ
シュライン格納場所と関係した各標識ビットは値“０”
にリセットされる。何れにしても、命令境界情報は新た
に書かれたデータラインに対し利用できない。従って、
標識ビットの現在の値がキャッシュのキャッシュライン
格納場所に含まれた命令の境界を正確に反映する可能性
は間違いなくない。第３の状況は、ＣＰＵ１０が初期化
されたとき、配列４００に格納された標識がリセットさ
れた場合である。キャッシュは、データラインに基づい
てデータのフラッシュ又は書き直しを行い、一方、プリ
フェッチ器５０に供給されたデータの系列及び標識の各
ベクトルは、夫々のデータラインの内部で整列され、又
は、命令キャッシュ３０の単一のキャッシュライン格納
場所の内部に完全に収容される必要がないので、上記ス
キームは非常に複雑になり得ることに注意が必要であ
る。更に、分岐ターゲットバッファ６０は実行された分
岐命令の予測結果を絶えず出力することに注意が必要で
ある。かかる予測は、例えば、命令キャッシュ３０から
フラッシュされ、又は、他のデータラインにより上書き
される等により無効にされたデータラインに実行が分岐
するので、たとえ稀であるとしても、上記の第２の状況
の発生を増加させることを示す可能性がある。
Note that there is no guarantee that the vector of indicators initially fetched from the instruction boundary memory 20 into the prefetcher 50 accurately shows whether each corresponding data word is the first data word of an instruction contained in the fetched sequence of data words. Such uncertainty arises in several situations. The first is when the instruction data words are not required to remain unchanged during execution (e.g., self-modifying code). The second is when the cache line storage location of a data line in the instruction cache 30 is overwritten with a different data line (i.e., from a different location in main memory) under a write-back or cache flushing scheme. In that case the corresponding vector of indicators is reset, i.e., each indicator bit associated with that cache line storage location of the instruction cache 30 is reset to the value "0". Either way, no instruction boundary information is available for the newly written data line, so the current values of the indicator bits cannot be relied upon to reflect the boundaries of the instructions contained in that cache line storage location. The third situation is when the indicators stored in the array 400 are reset upon initialization of the CPU 10. Note that this scheme can become quite complex, because the cache flushes or writes back data on a data line basis, whereas the sequences of data and the vectors of indicators supplied to the prefetcher 50 need not be aligned within their data lines or be wholly contained within a single cache line storage location of the instruction cache 30. Note further that the branch target buffer 60 continually outputs predictions of the outcome of executed branch instructions. Such predictions may, even if rarely, increase the occurrence of the second situation, for example when execution branches to a data line that has been invalidated by being flushed from the instruction cache 30 or overwritten by another data line.

【００３５】命令の開始及び終了データワードは、デー
タラインの境界と並べる必要はないことに注意すべきで
ある。これは、可変長命令の隙間のない詰め込みと、デ
ータライン内のあらゆるデータワードから始まる命令へ
の分岐を生じる分岐命令との作用である。上記の如く、
プリフェッチ器５０は、データラインの中間から始まる
不連続的なデータワードの系列をフェッチし得る。プリ
フェッチ器５０がそのように動作するとき、プリフェッ
チ器５０はかかるデータラインの開始データワード位置
を記録する必要がある。命令長照合器７０が境界標識の
ベクトルを判定した後、境界標識のベクトルは、プリフ
ェッチ器５０に格納された開始データワード位置を用い
て命令境界メモリ２０を書き換える。フェッチされたデ
ータワードの系列が２個以上のデータラインに拡がる場
合、例えば、境界標識のベクトルを書き換えるため、フ
ェッチされたデータワードの系列が拡がる各データライ
ンに対する１回の書込みからなる多重の書込みが必要で
ある。多数のデコーダ８０は、フェッチされたデータワ
ードの系列の長さ、即ち、フェッチされたデータワード
の系列が拡がるデータラインの個数を増加させ得る場合
を想定する。これにより、標識のベクトルを書き換える
際に、命令長照合器７０により行われる書込み動作の回
数が増加される。標識のベクトルを格納するため多重の
書込みが必要であり、かつ、命令境界メモリ２０が十分
に高速ではないので係属中の各書込み動作を実行できな
いならば、バッファは書込みを係属したままの状態にし
なければならない。勿論、バッファリングは、たとえ、
並列デコーディングの可能性を低下させるとしても、全
くエラーを発生させない。同様に、得られた境界標識の
ベクトルが命令長照合器７０により正確であることが照
合されたならば、命令境界メモリ２０を更新する必要が
ない。これにより、命令境界メモリ２０に関して行われ
る書込みの回数が削減される利点が得られる。
Note that the first and last data words of an instruction need not be aligned with data line boundaries. This is a consequence of the tight packing of variable-length instructions and of branch instructions, which can cause a branch to an instruction beginning at any data word within a data line. As described above, the prefetcher 50 may fetch a discontinuous sequence of data words beginning in the middle of a data line. When it does so, the prefetcher 50 must record the starting data word position within that data line. After the instruction length verifier 70 determines the vector of boundary indicators, that vector is written back into the instruction boundary memory 20 using the starting data word position stored in the prefetcher 50. If the fetched sequence of data words spans two or more data lines, multiple writes, for example one write for each data line spanned, are needed to write back the vector of boundary indicators. Consider that a large number of decoders 80 can increase the length of the fetched sequence of data words and hence the number of data lines it spans. This increases the number of write operations the instruction length verifier 70 performs when writing back the vector of indicators. If multiple writes are needed to store the vector of indicators and the instruction boundary memory 20 is not fast enough to perform each pending write operation, a buffer must hold the pending writes. Of course, such buffering causes no errors, although it may reduce the opportunity for parallel decoding. Likewise, if the instruction length verifier 70 confirms that the fetched vector of boundary indicators is accurate, the instruction boundary memory 20 need not be updated, which advantageously reduces the number of writes to the instruction boundary memory 20.
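How a verified vector breaks down into one write per spanned data line can be sketched as follows. The sketch assumes 16-word data lines and uses the recorded starting address to place each bit; `writes_per_line` is a hypothetical helper, not part of the specification.

```python
def writes_per_line(start_address, indicators, line_size=16):
    """Group (offset, bit) pairs by the data line they must be written to."""
    writes = {}
    for i, bit in enumerate(indicators):
        word_address = start_address + i
        offset = word_address % line_size
        base = word_address - offset
        writes.setdefault(base, []).append((offset, bit))
    return writes


# A six-word window starting at offset 12 of data line 4096.
writes = writes_per_line(4108, [1, 0, 0, 1, 0, 1])
```

The window crosses one line boundary, so two separate writes result, and each write touches only part of its line, which is the partial-update case.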

【００３６】更に、出力された標識のベクトルは、完全
なデータラインに対応する全ての標識ではなく、デコー
ディングのため得られたデータラインの一部に対応する
標識しか含まない点に注意が必要である。例えば、かか
る部分的な更新が行えるような回路が与えられる。ステ
ップ２１０によれば、命令長照合器７０は、プリフェッ
チ器５０により与えられた境界情報（即ち、命令境界メ
モリ２０により最初に供給された標識のベクトル）を、
命令長照合器７０により判定されたような確認された命
令の長さと比較する。最初に、照合された命令長がプリ
フェッチ器５０により供給された境界情報を確認する場
合を考える。例えば、命令長照合器７０により判定され
たような先頭のデータワード及び命令長が、命令境界メ
モリ２０により最初に出力された標識のベクトルによっ
て指定された先頭のデータワードと一致する場合を考え
る。このような場合、命令長照合器７０は、第１のデコ
ードされた命令、或いは、命令長照合器７０により判定
された命令長がデコードされた命令をパーシングするた
め使用された標識と一致する命令だけが先行したデコー
ドされた命令を受ける各デコーダ８０にイネーブル（使
用可能状態）信号を出力する。上記イネーブル信号は、
ステップ２１４に従って上記デコードされた命令をディ
スパッチ論理回路９０に転送するため上記デコーダ８０
を使用可能状態にさせる。
Note further that the output vector of indicators contains only the indicators for the portion of the data line fetched for decoding, not all the indicators for the complete data line. Circuitry is provided, for example, that permits such partial updates. In step 210, the instruction length verifier 70 compares the boundary information supplied by the prefetcher 50 (i.e., the vector of indicators originally supplied by the instruction boundary memory 20) with the verified instruction lengths that the instruction length verifier 70 has determined. Consider first the case in which the verified instruction lengths confirm the boundary information supplied by the prefetcher 50, for example because the first data words and instruction lengths determined by the instruction length verifier 70 agree with the first data words designated by the vector of indicators originally output by the instruction boundary memory 20. In this case the instruction length verifier 70 outputs an enable signal to each decoder 80 that receives either the first decoded instruction or a decoded instruction preceded only by instructions whose lengths, as determined by the instruction length verifier 70, agree with the indicators used to parse them. The enable signal enables that decoder 80 to transfer its decoded instruction to the dispatch logic 90 in accordance with step 214.

【００３７】次に、命令長照合器７０により判定された
命令長が命令境界メモリ２０からフェッチされた境界情
報と適応又は一致しない場合を想定する。即ち、境界情
報は、失われているか、不完全であるか、又は、不正確
である。このような場合に、命令長照合器７０は、ステ
ップ２１２に従って、命令長照合器７０により判定され
た長さと一致しない標識を用いてパーシングされた２個
以上の命令により先行された命令を受ける各デコーダ８
０にディスエーブル（使用禁止状態）信号を出力する。
Next, consider the case in which the instruction lengths determined by the instruction length verifier 70 do not conform to or agree with the boundary information fetched from the instruction boundary memory 20, i.e., the boundary information is missing, incomplete, or inaccurate. In this case, in accordance with step 212, the instruction length verifier 70 outputs a disable signal to each decoder 80 that receives an instruction preceded by two or more instructions parsed using indicators that do not agree with the lengths determined by the instruction length verifier 70.

【００３８】第１のデータワードの系列の最初のデータ
ワードは、データワードの系列がパーシングされた最初
のデータワードであることに注意する必要がある。かく
して、最初のデータ系列から形成された第１の命令の最
初のデータワードは、常に正確に並べられる。従って、
第１の命令は、最初のデータ系列を受けるデコーダ８０
により常にデコードされる。更に、最初の系列に続く第
２のデータワードの系列の中の最初のデータワードを第
２の命令の開始と正確に並べるとが可能である。かかる
場合に、第１及び第２の命令は、夫々のデコーダ８０に
よりデコードされ、以下同様である。一般的に言うと、
第１の整列されていないデータワードの系列は、命令長
照合器７０により識別される。上記第１の整列されてい
ないデータワードの系列は、最初にパーシングされたデ
ータワードの系列ではあり得ないことに注意が必要であ
る。命令長照合器７０は、次に、上記整列されていない
データワードの系列をデコードするデコーダ８０と、上
記第１の整列されていないデータワードの系列の後にパ
ーシングされたデータワードの系列を受ける各デコーダ
８０とを使用禁止状態にする。これは、上記第１の整列
されていないシーケンスの後にパーシングされた全ての
データワードの系列も整列されていないと考えられるか
らである。
Note that the first data word of the first sequence of data words is the first data word from which the sequences of data words are parsed. The first data word of the first instruction, formed from the first sequence, is therefore always correctly aligned, and the first instruction is always decoded by the decoder 80 that receives the first sequence. It is also possible for the first data word of the second sequence of data words, which follows the first sequence, to be correctly aligned with the start of the second instruction. In that case the first and second instructions are decoded by their respective decoders 80, and so on. In general, the first misaligned sequence of data words is identified by the instruction length verifier 70. Note that the first misaligned sequence cannot be the first sequence parsed. The instruction length verifier 70 then disables the decoder 80 decoding the misaligned sequence and each decoder 80 that receives a sequence parsed after the first misaligned sequence, because every sequence of data words parsed after the first misaligned sequence is presumed to be misaligned as well.
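The enable/disable rule can be sketched as a comparison of two length lists: the lengths implied by the fetched indicators and the lengths the verifier determined. This is a simplified software model, and `decoder_enables` is a hypothetical name. A decoder stays enabled as long as every earlier instruction matched; after the first mismatch, all later decoders are disabled.

```python
def decoder_enables(fetched_lengths, verified_lengths):
    """Return one enable flag per decoder, in instruction order."""
    enables = []
    aligned = True   # the first sequence always starts on a true boundary
    for fetched, verified in zip(fetched_lengths, verified_lengths):
        enables.append(aligned)
        if fetched != verified:
            # Everything parsed after this instruction starts at a wrong word.
            aligned = False
    return enables


all_match = decoder_enables([2, 1, 3, 2], [2, 1, 3, 2])
second_wrong = decoder_enables([2, 1, 3, 2], [2, 2, 3, 2])
```

In the second call the mismatch is in the second instruction, so the third and fourth decoders, whose sequences start at the wrong data word, are the ones disabled.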

【００３９】適当なデコーダ８０を使用禁止状態にする
ことは、多数の方法で実現される。例えば、パーシング
された命令ワードが一連のデコーダ８０に出力される。
もし、デコーダ８０の中の１台が整列されていないデー
タワードの系列を受けたために使用禁止状態にされるべ
きであるならば、逐次的にパーシングされたデータワー
ドを受ける各デコーダ８０も使用禁止状態にされる。例
えば、これは、ディスエーブル信号を（逐次的に続くデ
ータワードの系列を受ける）逐次的に続くデコーダ８０
に通知する使用禁止状態にされたデコーダ８０により達
成される。かかるディスエーブル信号は、そのデコーダ
から、逐次的に続く命令を受けた次のデコーダ８０に伝
搬し、以下同様である。命令長照合器７０は、更に、プ
リフェッチ器５０による上記（更新された）境界情報を
使用した上記データワードの系列からの命令の再パーシ
ングと、再パーシングされた情報のデコーダ８０への出
力とを同時に行わせる。正確な境界情報が、このステッ
プにおいて、命令長照合器７０によりプリフェッチ器に
転送されることに注意が必要である。Disabling the appropriate decoders 80 can be accomplished in a number of ways. For example, the parsed instruction words are output to a series of decoders 80. If one of the decoders 80 is to be disabled because it received an unaligned sequence of data words, each decoder 80 that receives subsequently parsed data words is also disabled. This can be achieved, for example, by having the disabled decoder 80 pass the disable signal to the sequentially following decoder 80 (which receives the sequentially following sequence of data words). The disable signal propagates from that decoder to the next decoder 80 that received the sequentially following instruction, and so on. At the same time, the instruction length collator 70 causes the prefetcher 50 to reparse the instructions from the sequence of data words using the (updated) boundary information and to output the reparsed information to the decoders 80. Note that the correct boundary information is transferred to the prefetcher by the instruction length collator 70 in this step.
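As an illustration only, and not part of the disclosed circuit, the chained disable signal described above can be sketched in Python. The function name `propagate_disable` and the boolean input list are hypothetical; the alignment flags stand in for the result of the check made by the instruction length collator 70.

```python
def propagate_disable(aligned):
    """Return the enable flag of each decoder, in program order.

    aligned[i] is True when decoder i received a data word sequence that
    starts on a real instruction boundary (hypothetical input).
    """
    enabled = []
    disabled_upstream = False  # disable signal arriving from the previous decoder
    for is_aligned in aligned:
        # A decoder is disabled if its own input is unaligned or if the
        # disable signal has reached it from the decoder before it.
        disabled_upstream = disabled_upstream or not is_aligned
        enabled.append(not disabled_upstream)
    return enabled


print(propagate_disable([True, True, False, True]))
```

In this sketch the third decoder receives an unaligned sequence, so it and the fourth decoder are disabled while the first two remain enabled.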

【００４０】ステップ２１６では、境界情報は正確であ
り、有効な命令はデコーダ８０によりデコードされ、ス
テップ２１４によりデコードされた命令がディスパッチ
論理回路９０に書き込まれていることが仮定される。ス
テップ２１６において、ディスパッチ論理回路９０は、
デコードされた各命令を実行すべき機能ユニット１００
を判定する。浮動小数点演算を実行する命令、分岐条件
テスト及びアドレス罫線を実行する命令等のような上記
命令に依存して、特定の機能ユニットが利用される。ス
ーパースカラプロセッサは、少なくとも１台の機能ユニ
ット１００を含む。その場合に、少なくとも１台の機能
ユニット１００が、可能であれば命令を並列に実行する
ため同時に選択される。複数台の機能ユニット１００は
同一の機能を維持することが可能であり、或いは、機能
ユニット１００のある部分集合が重なり合う機能の部分
集合を維持することが可能である。２台以上の機能ユニ
ット１００が頻繁に使用される命令を同時に実行し、こ
れにより命令レベルの実行の並列化を増加させる方が有
利である。機能ユニット１００が選択された後、ディス
パッチ論理回路９０は、デコードされた命令を上記機能
ユニット１００に転送する。ステップ２１８によれば、
機能ユニット１００はデコードされた命令を実行する。
ステップ２２０において、上記の処理は、要求された全
ての命令が実行されるまで繰り返される。Step 216 assumes that the boundary information is correct, that valid instructions have been decoded by the decoders 80, and that the decoded instructions have been written to the dispatch logic 90 in step 214. In step 216, the dispatch logic 90 determines which functional unit 100 should execute each decoded instruction. A particular functional unit is used depending on the instruction, for example an instruction that performs a floating point operation or an instruction that performs a branch condition test and address calculation. The superscalar processor includes at least one functional unit 100. In that case, at least one functional unit 100 is selected at the same time so that instructions are executed in parallel where possible. Several functional units 100 may support the same functions, or some subset of the functional units 100 may support overlapping subsets of functions. It is advantageous for two or more functional units 100 to execute frequently used instructions simultaneously, thereby increasing instruction-level parallelism. After a functional unit 100 is selected, the dispatch logic 90 transfers the decoded instruction to that functional unit 100. In step 218, the functional unit 100 executes the decoded instruction. In step 220, the above process is repeated until all requested instructions have been executed.
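The selection of functional units in step 216 can be pictured with the following minimal Python sketch. It is an assumption-laden illustration, not the dispatch logic 90 itself: the unit names, the operation classes, and the in-order stopping rule are all hypothetical choices made for the example.

```python
def dispatch_cycle(decoded, units):
    """Assign each decoded instruction to a free functional unit for one cycle.

    decoded: list of operation classes in program order, e.g. ["alu", "fpu"]
             (hypothetical names)
    units:   dict mapping unit name -> set of operation classes it supports
    Returns the (operation, unit) pairs issued this cycle, in program order.
    """
    free = dict(units)  # copy, so the caller's table is not modified
    issued = []
    for op in decoded:
        unit = next((name for name, ops in free.items() if op in ops), None)
        if unit is None:
            break  # assumed policy: stop at the first op with no free unit
        issued.append((op, unit))
        del free[unit]  # a unit accepts one instruction per cycle
    return issued


units = {"alu0": {"alu"}, "alu1": {"alu", "branch"}, "fpu0": {"fpu"}}
print(dispatch_cycle(["alu", "alu", "fpu"], units))
```

Because two units in this example support the "alu" class, two such instructions can be issued in the same cycle, which is the benefit the paragraph above attributes to duplicated or overlapping functional units.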

【００４１】ステップ２０２乃至２２０は逐次的に示さ
れているが、パイプライン式処理のパラダイムに従って
並列した夫々のユニットにより実行してもよい。Although steps 202 through 220 are shown sequentially, they may be performed by respective units in parallel according to the paradigm of pipelined processing.

【００４２】[0042]

【実施例】以下、命令境界メモリ２０の構造、機能及び
更新の最良の例を説明する。Ａ．電源投入時又は命令キャッシュデータラインの置換
時データラインタグ配列４０及び標識配列４００内の全て
の有効な標識は、電源投入後に、“０”にリセット（即
ち、“クリア”）される。クリアされた標識配列は図５
に示され、クリアされたデータラインタグ配列は図４に
示される。更に、命令キャッシュライン３５９のような
命令キャシュラインが置換され、或いは、無効のマーク
が付けられたとき、対応した標識配列の行４５９のよう
な対応する標識のベクトルは、“０”にリセットされ
る。DESCRIPTION OF THE PREFERRED EMBODIMENTS
The best example of the structure, function, and updating of the instruction boundary memory 20 is described below.
A. At power-up or upon replacement of an instruction cache data line
All valid indicators in the data line tag array 40 and the indicator array 400 are reset to "0" (i.e., "cleared") after power-up. The cleared indicator array is shown in FIG. 5, and the cleared data line tag array is shown in FIG. 4. Further, when an instruction cache line, such as instruction cache line 359, is replaced or marked invalid, the corresponding vector of indicators, such as row 459 of the indicator array, is reset to "0".
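The reset behaviour described in section A can be modelled, purely as a sketch, with a small Python class. The class name `IndicatorArray` and its methods are hypothetical; the model only assumes one indicator bit per data word and one row per cache line, as the text describes.

```python
class IndicatorArray:
    """One indicator bit per data word, organised as rows that mirror cache lines."""

    def __init__(self, num_lines, words_per_line):
        self.words_per_line = words_per_line
        # Power-up state: every indicator is 0, so no boundary is claimed.
        self.rows = [[0] * words_per_line for _ in range(num_lines)]

    def invalidate_line(self, line):
        """Clear the row for a cache line that was replaced or marked invalid."""
        self.rows[line] = [0] * self.words_per_line


arr = IndicatorArray(num_lines=4, words_per_line=8)
arr.rows[2][3] = 1          # pretend a boundary was learned earlier
arr.invalidate_line(2)      # the cache line is replaced, so its row is cleared
```

Clearing the row rather than leaving stale bits behind means that a newly loaded cache line starts with no boundary claims, so the prefetcher falls back to the known first data word until fresh indicators are written.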

【００４３】Ｂ．命令の初期フェッチ及び標識配列の初
期ローディングこの例の場合、命令キャッシュ配列３００は、図６に示
されたような命令でロードされる。図６のキャッシュ配
列のライン３５１、３５４乃至３５６には、種々の命令
が示されている。図６のキャッシュライン３５９には、
全く命令が収容されていない。これは、対応するデータ
ラインタグ配列４０内の有効／無効ビットがクリアさ
れ、キャッシュライン３５９が無効データを含むことを
示すためである。キャッシュライン３５９は、理論的に
は、無効かつ意味のない任意のデータを格納することが
できる。明瞭さと簡単のため、キャッシュライン３５９
は、値０が与えられたデータだけを含む場合が示され
る。各命令は、キャッシュ配列３００の隣接した一つ以
上のキャッシュライン内のデータワードに収容される。
例えば、命令は、逐次的に順序付けされ、隣接したライ
ン３５５、３５６を占有するように示されている。例え
ば、図６において、１個の５データワード命令は、デー
タラインｕ＝Ｘ（３５５）のデータワードｖ＝Ｖ−２
（３０９）から始まり、データラインｕ＝Ｘ＋１（３５
６）のデータワードｖ＝２で終了する。先行する命令、
即ち、上記の５データワード命令の前に現れた命令は、
データラインｕ＝Ｘ（３５５）のデータワードｖ＝Ｖ−
３（３０８）で終了し、データラインｕ＝Ｘ（３５５）
のデータワードｖ＝１（３０２）から始まる。上記の５
データワードの後の命令は、図６に示されているよう
に、ラインｕ＝Ｘ＋１（３５６）のデータワードｖ＝３
（図示しない）から始まり、ラインｕ＝Ｘ＋１（３５
６）のデータワードｖ＝Ｖ−２（３０９）で終了する。
しかし、この方法は、命令がキャッシュに格納する唯一
又は典型的な方法ではない。それどころか、命令の系列
を含むデータラインは、特に、キャッシュの構造（例え
ば、セット−連合）及び置換の方式に依存して、命令キ
ャッシュ３００の逐次的ではない、又は、隣接しないキ
ャッシュライン格納場所でも構わない。
B. Initial fetching of instructions and initial loading of the indicator array
In this example, the instruction cache array 300 is loaded with instructions as shown in FIG. 6. Various instructions are shown in lines 351 and 354 to 356 of the cache array of FIG. 6. Cache line 359 of FIG. 6 contains no instructions, because the valid/invalid bit in the corresponding data line tag array 40 is cleared, indicating that cache line 359 contains invalid data. In principle, cache line 359 may hold arbitrary invalid and meaningless data; for clarity and simplicity, it is shown containing only data of value 0. Each instruction is contained in data words in one or more adjacent cache lines of the cache array 300. For example, the instructions are shown sequentially ordered and occupying the adjacent lines 355 and 356. In FIG. 6, one 5-data-word instruction begins at data word v = V-2 (309) of data line u = X (355) and ends at data word v = 2 of data line u = X+1 (356). The preceding instruction, that is, the instruction that appears before the 5-data-word instruction, begins at data word v = 1 (302) of data line u = X (355) and ends at data word v = V-3 (308) of data line u = X (355). The instruction that follows the 5-data-word instruction begins, as shown in FIG. 6, at data word v = 3 (not shown) of line u = X+1 (356) and ends at data word v = V-2 (309) of line u = X+1 (356). However, this is neither the only nor the typical way in which instructions are stored in the cache. Rather, the data lines containing a sequence of instructions may occupy non-sequential or non-adjacent cache line storage locations of the instruction cache 300, depending in particular on the structure of the cache (e.g., set-associative) and on the replacement policy.

【００４４】実行が開始されたとき、プリフェッチ器５
０は、（図示しないプログラムカウンタから利用可能
な）実行されるべき現在の命令の第１のデータワードの
メモリアドレスだけが供給される。データワードの個
数、又は、直ぐ隣の命令の最後のデータワードのアドレ
スに関する情報は提供されず、何れかの後続の命令の第
１のデータワードのアドレスに関する情報は得られな
い。従って、プリフェッチ器５０は、少なくとも命令中
に与えることができる最大のデータワード数がフェッチ
されるように、第１のデータワードと、第１のデータワ
ードの次に続くデータワードの個数とを含むデータワー
ドの系列を要求することにより、５データワード命令だ
けが得られる。この場合、系列は、キャッシュラインＸ
（３５５）のデータワードＶ−２（３０９）からキャッ
シュラインＸ＋１（３５６）のデータワード３（図示し
ない）までのデータワードを含むと考えられる。When execution begins, the prefetcher 50 is supplied only with the memory address of the first data word of the current instruction to be executed (available from a program counter, not shown). No information is provided about the number of data words or about the address of the last data word of the immediately following instruction, and no information is available about the address of the first data word of any subsequent instruction. The prefetcher 50 can therefore obtain the 5-data-word instruction only by requesting a sequence of data words that contains the first data word and a number of the data words that follow it, chosen so that at least the maximum number of data words permitted in an instruction is fetched. In this case, the sequence is taken to contain the data words from data word V-2 (309) of cache line X (355) through data word 3 (not shown) of cache line X+1 (356).
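The fetch request described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `fetch_window` is hypothetical, data words are numbered 0 to V-1 within a line, and the next logical line is written X+1 even though, as noted earlier, its physical location in the cache may differ.

```python
def fetch_window(line, word, count, words_per_line):
    """List the (line, word) positions of `count` consecutive data words.

    Starts at the known first data word of the instruction and wraps to
    word 0 of the next logical line when the end of a line is reached.
    """
    positions = []
    for _ in range(count):
        positions.append((line, word))
        word += 1
        if word == words_per_line:
            line, word = line + 1, 0
    return positions


# With V = 8 and X = 5, a six-word request starting at word V-2 of line X
# covers words V-2 and V-1 of line X and words 0 to 3 of line X+1.
print(fetch_window(line=5, word=6, count=6, words_per_line=8))
```

The count is a parameter here because the text only requires that at least the maximum instruction length be fetched; the six-word window matches the range given in the example.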

【００４５】命令キャッシュ３０が命令をプリフェッチ
器５０に転送するのと同時に、命令境界メモリ２０は、
プリフェッチ器５０に転送された命令キャッシュ配列３
００内の各データワードに対応する標識のベクトルを転
送する。しかし、上記例の場合に、標識配列４００は、
全ての標識が“０”に設定された電源投入時の状態、又
は、クリアされた状態を維持する。従って、上記例にお
いて、（行Ｘ（４５５）のビットＶ−２の標識から、行
Ｘ＋１（４５６）のビット３（図示しない）の標識まで
の）全ての標識ビットは、現時点で“０”に設定され、
要求された命令に対し境界情報が得られないことを示
す。At the same time that the instruction cache 30 transfers the instruction to the prefetcher 50, the instruction boundary memory 20 transfers the vector of indicators corresponding to each data word of the instruction cache array 300 that was transferred to the prefetcher 50. In this example, however, the indicator array 400 is still in its power-up, or cleared, state, with every indicator set to "0". Thus all of the indicator bits in question (from the indicator at bit V-2 of row X (455) through the indicator at bit 3 (not shown) of row X+1 (456)) are currently "0", which means that no boundary information is available for the requested instruction.

【００４６】上記の如く、プリフェッチ器５０は、それ
が正確な情報を有するかどうかを判定することができな
い。それどころか、標識ビットは論理回路に供給され、
そこで、（上記例の場合に誤りがあるにも係わらず）上
記標識を命令が終了する場所（多重の命令の場合には、
それらが始まる場所）に関する区切り記号として用いて
受けられたデータワードの系列から要求された命令を抽
出する。間違ってパーシングされた所望の命令のデータ
ワードは、デコーダ８０の中の１台に出力される。しか
し、第１の命令の中の第１のデータワードは分かってい
ることに注意する必要がある。かくして、上記第１のデ
ータワードから始まるデータワードの部分系列を受ける
一次デコーダ８０は、たとえ標識が間違っている場合で
も、上記の如く第１の命令をデコードすることができ
る。As mentioned above, the prefetcher 50 cannot determine whether the information it holds is correct. Instead, the indicator bits are supplied to logic circuitry, which extracts the requested instruction from the received sequence of data words by using the indicators (even though they are erroneous in this example) as delimiters marking where the instruction ends (and, in the case of multiple instructions, where they begin). The data words of the incorrectly parsed desired instruction are output to one of the decoders 80. Note, however, that the first data word of the first instruction is known. Thus the primary decoder 80, which receives the subsequence of data words beginning with that first data word, can decode the first instruction as described above even if the indicators are wrong.

【００４７】デコータ８０が命令をデコードしている間
に、命令長照合器７０は、プリフェッチ器５０により供
給された命令が正しくパーシングされたかどうかを、そ
の命令の長さを標識ビット内の命令境界情報と照合する
ことにより判定する。そのため、命令長照合器７０は、
各デコーダ８０に出力された命令及び標識を受ける。命
令長照合器７０により、標識は命令を正確に区切ること
が別個の命令長の照合に基づいて判定されたならば、命
令長照合器７０は、デコードされた命令をディスパッチ
論理回路９０に出力するため全てのデコーダ８０を使用
可能状態にさせる。この場合、標識はデコーダ８０によ
りデコードされた命令の長さを正確に示さない。かくし
て、命令長照合器７０は、第１の命令を受ける一次デコ
ーダ８０を除く、命令の最初のデータワードと並べられ
ないデータワードの系列が供給されたデコーダ８０にデ
ィスエーブル信号を出力する。一方、一次デコーダ８０
は、第１の命令の第１のデータワードと並べられた第１
の命令のデータワードの系列が供給される。一次デコー
ダ８０に最大長の命令と長さが等しいデータワードの系
列が供給されるならば、一次デコーダ８０は、標識が正
しいかどうかとは無関係に常に第１の命令をデコードす
ることができる。同様に、第１の命令の長さは正しいと
して照合され、第２の命令の長さが正しいと照合されな
かった場合を想定する。この場合、第２の命令のデータ
ワードの系列は、それにもかかわらず、（たとえ第３の
命令のデータワードの系列が第３の命令と並べられない
としても）第２の命令の始めと揃えられる。第１及び第
２の命令は、共に、夫々のデコーダ８０によりデコード
され、以下同様である。While the decoders 80 decode the instructions, the instruction length collator 70 determines whether the instructions supplied by the prefetcher 50 were parsed correctly by checking the length of each instruction against the instruction boundary information in the indicator bits. To do so, the instruction length collator 70 receives the instructions and indicators that were output to each decoder 80. If the instruction length collator 70 determines, from its own independent check of the instruction lengths, that the indicators delimit the instructions correctly, it enables all of the decoders 80 to output their decoded instructions to the dispatch logic 90. In the present example, however, the indicators do not correctly indicate the length of the instruction decoded by the decoder 80. The instruction length collator 70 therefore outputs a disable signal to every decoder 80 that was supplied with a sequence of data words not aligned with the first data word of an instruction, excluding the primary decoder 80 that receives the first instruction. The primary decoder 80, by contrast, is supplied with a sequence of data words that is aligned with the first data word of the first instruction. Provided the primary decoder 80 is supplied with a sequence of data words as long as the longest instruction, it can always decode the first instruction, whether or not the indicators are correct. Similarly, suppose that the length of the first instruction is verified as correct but the length of the second instruction is not. In that case the sequence of data words of the second instruction is nevertheless aligned with the beginning of the second instruction (even though the sequence of data words of the third instruction is not aligned with the third instruction). The first and second instructions are both decoded by their respective decoders 80, and so on.
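The check performed by the instruction length collator 70 can be illustrated with a short Python sketch. It rests on assumptions that are not in the text: the function name `verify_parse`, the list of parsed start offsets, and a hypothetical helper `length_of` that returns an instruction's length in data words from its first data word (a toy encoding, not a real instruction set).

```python
def verify_parse(words, parsed_starts, length_of):
    """Return one enable flag per parsed subsequence.

    words:         fetched data words; words[0] is known to start an instruction
    parsed_starts: offsets at which the prefetcher began each subsequence,
                   derived from the indicators (parsed_starts[0] == 0)
    length_of:     hypothetical helper giving an instruction's length in data
                   words from its first data word
    """
    enables = []
    expected = 0  # true start of the next instruction, found without indicators
    for start in parsed_starts:
        if start != expected:
            # Misaligned: this decoder and all later ones stay disabled.
            enables.extend([False] * (len(parsed_starts) - len(enables)))
            break
        enables.append(True)
        expected = start + length_of(words[start])
    return enables


# Toy encoding: the first data word of an instruction holds its length.
words = [2, 0, 3, 0, 0, 1]
print(verify_parse(words, [0, 2, 4], lambda w: w))
```

In the example the second instruction really is three data words long, but the indicators placed the third start after only two. The first two subsequences are still aligned and stay enabled, and only the third is disabled, which mirrors the case discussed at the end of the paragraph above.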

【００４８】使用禁止状態にされたデコーダ８０は、デ
コードされた命令をディスパッチ論理回路９０に出力す
ることを止める。命令長照合器７０は、標識配列４００
への格納用の正しい標識のベクトルを命令境界メモリ２
０に出力する。補正された標識ビットは、命令長照合器
７０が長さを照合し得ず、デコードされなかった少なく
とも１個の命令により先行された上記命令をプリフェッ
チ器５０に再パーシングさせる信号と共に、プリフェッ
チ器５０に出力される。A disabled decoder 80 stops outputting its decoded instruction to the dispatch logic 90. The instruction length collator 70 outputs the correct vector of indicators to the instruction boundary memory 20 for storage in the indicator array 400. The corrected indicator bits are also output to the prefetcher 50, together with a signal that causes the prefetcher 50 to reparse those instructions that were not decoded because they were preceded by at least one instruction whose length the instruction length collator 70 could not verify.

【００４９】データワード記憶場所を開始する命令と、
適当な標識ビットベクトルとを命令長照合器７０から受
けた後、命令境界メモリ２０は標識ビットベクトルを適
当な場所に格納する。この場合、行ｕ＝Ｘ（４５５）上
の標識ビットｖ＝Ｖ−１（４１０）及びＶ−２（４０
９）と、ラインｕ＝Ｘ＋１（４５６）上の標識ビットｖ
＝０乃至ｖ＝２（図示しない）は、夫々、’０１’及
び’０００’に変更される。換言すれば、第１の命令は
５データワード長であることを反映させるため、標識が
更新される。命令長照合器７０は、例えば、１キャッシ
ュラインずつに基づいて、上記の更新を行う。かくし
て、キャッシュラインＸの最上位２バイトに対し判定さ
れた標識ビットの部分系列’０１’は、直ぐに更新され
る。このことは図８に示されている。しかし、命令長照
合器７０は、例えば、キャッシュラインＸ＋１内の処理
されるべき全てのデータワードがデコードされ、それに
対し標識ビットが得られるまで、標識ビットの部分系
列’０００’の更新を延期する。これは、図９に示され
る。例えば、標識ビットの更新は、キャッシュラインの
最後のデータワードがデコードされるまで、或いは、他
のキャッシュライン内の命令への分岐が行われるまで延
期される。何れにしても、特定のキャッシュライン内の
データワードのデコードは止められる。上記のような時
まで、標識ビットが生成されると、命令長照合器７０に
よりキャッシュラインに対するベクトルビットの系列を
形成するためその時点で未だ更新されていない標識ビッ
トに連結される。After receiving the starting data word location of the instruction and the appropriate indicator bit vector from the instruction length collator 70, the instruction boundary memory 20 stores the indicator bit vector in the appropriate location. In this case, indicator bits v = V-1 (410) and V-2 (409) on row u = X (455) and indicator bits v = 0 to v = 2 (not shown) on line u = X+1 (456) are changed to '01' and '000', respectively. In other words, the indicators are updated to reflect that the first instruction is 5 data words long. The instruction length collator 70 performs this update, for example, one cache line at a time. Thus the indicator bit subsequence '01' determined for the two most significant bytes of cache line X is updated immediately, as shown in FIG. 8. The instruction length collator 70 may, however, defer updating the indicator bit subsequence '000' until, for example, all of the data words to be processed in cache line X+1 have been decoded and indicator bits have been obtained for them, as shown in FIG. 9. For example, the update of the indicator bits is deferred until the last data word of the cache line is decoded or until a branch is taken to an instruction in another cache line; in either case, decoding of data words in that particular cache line ceases. Until then, indicator bits are concatenated by the instruction length collator 70, as they are generated, with the indicator bits that have not yet been written back, so as to form the sequence of indicator bits for the cache line.
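The per-line grouping of replacement indicators can be sketched in Python as follows. The function name `replacement_indicators` and the dictionary layout are hypothetical, chosen only to show how one verified instruction yields a separate indicator subsequence for each cache line it touches.

```python
def replacement_indicators(start_line, start_word, length, words_per_line):
    """Indicator bits for one verified instruction, grouped by cache line.

    Returns {line: {word: bit}} with bit 1 on the first data word and 0 on
    the remaining data words of the instruction.
    """
    updates = {}
    line, word = start_line, start_word
    for i in range(length):
        updates.setdefault(line, {})[word] = 1 if i == 0 else 0
        word += 1
        if word == words_per_line:
            line, word = line + 1, 0
    return updates


# With V = 8 and X = 5: the 5-data-word instruction starting at word V-2.
print(replacement_indicators(start_line=5, start_word=6, length=5, words_per_line=8))
```

For these values the result holds bits 1 and 0 for words V-2 and V-1 of line X, and three 0 bits for words 0 to 2 of line X+1, matching the '01' and '000' subsequences in the text. Keeping the groups separate is what allows the line X update to be written at once while the line X+1 update is held back and concatenated with later bits.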

【００５０】命令の再パーシングの際に、プリフェッチ
器５０は、命令長照合器７０により供給されるような正
確な境界情報を使用する。かくして、命令照合器７０に
よる引き続く比較は、命令が正しいということを示す。
その場合、命令長照合器７０は、機能ユニット１００に
よる引き続く実行のため、夫々のデコードされた命令長
が照合された各デコーダ８０に、再デコードされた命令
をディスパッチ論理回路９０に出力するよう指示する。
同時に、命令長照合器７０は、プリフェッチ器７０に、
現在の命令が有効であり、次の命令を処理するよう知ら
せる。When reparsing the instructions, the prefetcher 50 uses the correct boundary information supplied by the instruction length collator 70. The subsequent comparison by the instruction length collator 70 therefore shows that the instructions are correct. The instruction length collator 70 then instructs each decoder 80 whose decoded instruction length has been verified to output its re-decoded instruction to the dispatch logic 90 for subsequent execution by a functional unit 100. At the same time, the instruction length collator 70 informs the prefetcher 50 that the current instruction is valid and that the next instruction should be processed.

【００５１】命令キャッシュ３０から種々の命令をフェ
ッチ及び再フェッチする処理において、命令長照合器７
０は、命令境界メモリ２０を徐々に更新する。上記の説
明では簡単のため、単一の命令だけがフェッチされ、デ
コードされることに注意が必要である。多数の命令がプ
リフェッチ器５０によって同時にフェッチされ、デコー
ドされる方が有利である。かくして、漸増的な境界情報
の判定が非常に高速に得られる。命令キャッシュ３０内
の全ての命令がフェッチされ、命令長照合器７０により
その長さが少なくとも１回判定されたならば、上記の命
令に対する全ての境界情報が命令境界メモリ２０に得ら
れる。図９には、図６に示された命令キャッシュ配列３
００に格納された各命令に対する境界情報を含む標識配
列４００が示される。上記の如く、本発明は、命令の参
照特性の空間的及び時間的な局在性を非常に有利に利用
する。プロセッサの命令キャッシュ３０内の“ヒット”
率は、通常９０％を超えるという事実を仮定するなら
ば、多数の命令の並列デコーディングのため本発明を用
いる成功率は非常に高い。In the course of fetching and re-fetching various instructions from the instruction cache 30, the instruction length collator 70 gradually updates the instruction boundary memory 20. Note that, for simplicity, the description above has only a single instruction fetched and decoded. Advantageously, multiple instructions are fetched by the prefetcher 50 and decoded at the same time, so that the boundary information is built up incrementally very quickly. Once every instruction in the instruction cache 30 has been fetched and has had its length determined at least once by the instruction length collator 70, all of the boundary information for those instructions is available in the instruction boundary memory 20. FIG. 9 shows the indicator array 400 containing the boundary information for each instruction stored in the instruction cache array 300 shown in FIG. 6. As described above, the invention takes great advantage of the spatial and temporal locality of instruction references. Given that the "hit" rate in a processor's instruction cache 30 typically exceeds 90%, the success rate of using the invention for parallel decoding of multiple instructions is very high.

【００５２】Ｃ．命令境界メモリ情報を用いた命令の実
行この例では、命令キャッシュ配列３００は、前の例から
変更されないままの状態である。命令キャッシュ配列３
００の内容は図６に示される。しかし、命令キャッシュ
３０の各命令は、少なくとも１回フェッチされているの
で、境界情報配列２０内の境界情報は、対応する命令キ
ャッシュ配列の各命令の開始（及び終了）を正確に反映
している。標識配列４００の命令境界情報は図９に示さ
れる。標識配列４００に配置された正確な情報の状況
は、典型的に、スーパースカラ処理システムにおける命
令の参照特性の時間的及び空間的局在性に起因する。
C. Executing instructions using the instruction boundary memory information
In this example, the instruction cache array 300 remains unchanged from the previous example; its contents are shown in FIG. 6. However, because each instruction in the instruction cache 30 has been fetched at least once, the boundary information in the boundary information array 20 correctly reflects the start (and end) of each instruction in the corresponding instruction cache array. The instruction boundary information of the indicator array 400 is shown in FIG. 9. This situation, in which correct information is present in the indicator array 400, typically results from the temporal and spatial locality of instruction references in a superscalar processing system.

【００５３】上記の如く、種々の命令が、図６のキャッ
シュ配列データライン記憶場所３５１、３５４乃至３５
６に示される。図６のデータライン記憶場所３５９には
命令が収容されていない。各命令は、キャッシュ配列３
００の１個以上のデータライン記憶場所内の隣接したデ
ータワード内に収容される。例えば、図６において、１
個の５データワード命令は、データライン記憶場所Ｘ＋
１（３５６）のデータワード２（３０３）で終了し、デ
ータライン記憶場所Ｘ（３５５）のデータワードＶ−２
（３０９）から始まる。上記の５データワード命令の前
の命令は、データライン記憶場所Ｘ（３５５）のデータ
ワードＶ−３（３０８）で終了し、データラインＸ（３
５５）のデータワード１（３０２）から始まる。上記の
５データワード命令の後の命令は、データライン記憶場
所Ｘ＋１（３５６）のデータワードＶ−２（３０９）で
終了し、データラインＸ＋１（３５６）のデータワード
３（図示しない）から始まる。As described above, various instructions are shown in cache array data line storage locations 351 and 354 to 356 of FIG. 6. Data line storage location 359 of FIG. 6 contains no instructions. Each instruction is contained in adjacent data words in one or more data line storage locations of the cache array 300. For example, in FIG. 6, one 5-data-word instruction begins at data word V-2 (309) of data line storage location X (355) and ends at data word 2 (303) of data line storage location X+1 (356). The instruction preceding the 5-data-word instruction begins at data word 1 (302) of data line X (355) and ends at data word V-3 (308) of data line storage location X (355). The instruction following the 5-data-word instruction begins at data word 3 (not shown) of data line X+1 (356) and ends at data word V-2 (309) of data line storage location X+1 (356).

【００５４】前の例と同様に、プリフェッチ器５０が５
データワード命令を要求するとき、プリフェッチ器５０
は、データライン記憶場所Ｘ＋１（３５６）のデータワ
ード３（図示しない）から、例えば、データライン記憶
場所Ｘ（３５５）のデータワードＶ−２（３０９）まで
のデータワードの系列を要求することにより、その要求
を行う。これは、フェッチされたデータワードから最長
の命令ワードが抽出可能であることを保証する。As in the previous example, when the prefetcher 50 requests the 5-data-word instruction, it does so by requesting a sequence of data words running from data word V-2 (309) of data line storage location X (355) through, for example, data word 3 (not shown) of data line storage location X+1 (356). This ensures that the longest instruction word can be extracted from the fetched data words.

【００５５】命令キャッシュ３０が命令をプリフェッチ
器５０に出力するのと同時に、命令境界メモリ２０は、
命令キャッシュ配列３００から出力されたデータワード
の系列に対応する標識のベクトルをプリフェッチ器５０
に出力する。この例では、図９に内容が示された標識ベ
クトル配列４００は、要求された命令の先頭と、直ぐ次
の命令の先頭とを示す２個の“１”ビットを含む。プリ
フェッチ器５０は、データワードの系列から次の命令を
パーシングするため、標識のベクトル、特に、“１”標
識ビットの相対位置を使用する。命令境界メモリ２０に
より供給された境界情報を使用することにより、特に、
多重命令が並列にフェッチされるべき場合に、要求され
た命令を取得し、その命令をデコーダ８０に与えるため
必要な時間が著しく短縮される。標識のベクトルが、部
分系列内の各命令の区切り記号を示し、各命令を同時に
パーシング又は分割するため使用できる仮定する。At the same time that the instruction cache 30 outputs the instruction to the prefetcher 50, the instruction boundary memory 20 outputs to the prefetcher 50 the vector of indicators corresponding to the sequence of data words output from the instruction cache array 300. In this example, the vector taken from the indicator array 400, whose contents are shown in FIG. 9, contains two "1" bits, which mark the start of the requested instruction and the start of the immediately following instruction. The prefetcher 50 uses the vector of indicators, in particular the relative positions of the "1" indicator bits, to parse the next instruction from the sequence of data words. Using the boundary information supplied by the instruction boundary memory 20 significantly reduces the time needed to obtain the requested instruction and supply it to the decoder 80, especially when multiple instructions are to be fetched in parallel. It is assumed that the vector of indicators marks the delimiters of each instruction in the subsequence and can be used to parse, or split out, all of the instructions simultaneously.
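Parsing with the indicator vector can be illustrated with the following minimal Python sketch. The function name `parse_by_indicators` is hypothetical, and the data words are shown as single letters purely for readability.

```python
def parse_by_indicators(words, indicators):
    """Split a fetched data word sequence at every indicator bit that is 1.

    words[0] is treated as an instruction start regardless of indicators[0],
    since the fetch address is known to point at an instruction.
    """
    starts = [0] + [i for i in range(1, len(words)) if indicators[i] == 1]
    ends = starts[1:] + [len(words)]
    return [words[s:e] for s, e in zip(starts, ends)]


# Six fetched data words; "1" bits mark the requested instruction and the
# start of the instruction that follows it.
print(parse_by_indicators(list("abcdef"), [1, 0, 0, 0, 0, 1]))
```

Here the first subsequence is the complete 5-data-word instruction and the second holds only the first data word of the next instruction, which may be incomplete within this window. All start positions are read from the vector in one pass, with no need to decode one instruction before locating the next, which is why the indicators permit parallel parsing.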

【００５６】デコーダ８０がプリフェッチ器５０により
パーシングされた命令をデコードする間に、命令長照合
器７０は、プリフェッチ器５０により与えられた命令の
長さを標識のベクトルに対し照合する。上記命令に先行
する命令が、命令長照合器７０により（データワードの
系列だけから）独立に判定されたような対応する長さ
が、標識のベクトルにより指定されたような命令境界と
一致する命令だけであるならば、命令長照合器７０は、
上記命令を受けたデコーダ８０をデコードされた命令を
ディスパッチ論理回路９０に出力させるため使用可能状
態にする。上記例では、単一の命令をパーシング、デコ
ーディング、照合する場合が説明されることに注意が必
要である。命令長照合器７０は、任意のサイクルで、プ
リフェッチ器５０により出力されたパーシングされた各
データワードの系列の長さを同時に照合することが可能
である。While the decoder 80 decodes the instruction parsed by the prefetcher 50, the instruction length collator 70 checks the length of the instruction supplied by the prefetcher 50 against the vector of indicators. If every instruction that precedes the instruction is one whose length, as determined independently by the instruction length collator 70 (from the sequence of data words alone), matches the instruction boundaries specified by the vector of indicators, the instruction length collator 70 enables the decoder 80 that received the instruction to output the decoded instruction to the dispatch logic 90. Note that the example above describes the parsing, decoding, and checking of a single instruction. In any given cycle, the instruction length collator 70 can simultaneously check the length of every parsed sequence of data words output by the prefetcher 50.

【００５７】以上の説明は、本発明の一実施例の配置及
び動作の説明である。本発明の範囲は、上記の実施例と
共に、当業者に明らかな他の実施例を含むことが意図さ
れる。The above is a description of the arrangement and operation of one embodiment of the present invention. The scope of the present invention is intended to include the above embodiments, as well as other embodiments that would be apparent to those skilled in the art.

[Brief description of the drawings]

【図１】本発明のディジタル処理装置の概要図である。FIG. 1 is a schematic diagram of a digital processing device of the present invention.

【図２】本発明のディジタル処理装置の動作を表わすフ
ローチャートである。FIG. 2 is a flowchart showing the operation of the digital processing device of the present invention.

【図３】本発明のディジタル処理装置の動作を表わすフ
ローチャートである。FIG. 3 is a flowchart showing the operation of the digital processing device of the present invention.

【図４】クリア（電源投入）状態のタグ配列に格納され
たデータラインタグの配列を表わす図表である。FIG. 4 is a table showing an array of data line tags stored in a tag array in a clear (power-on) state.

【図５】クリア（電源投入又はリフレッシュ）状態の命
令境界メモリに格納された標識の配列を表わす図表であ
る。FIG. 5 is a chart showing an array of indicators stored in an instruction boundary memory in a clear (power-on or refresh) state.

【図６】命令が命令キャッシュに書き込まれた後に、命
令キャッシュに格納されたデータワードの配列を表わす
図表である。FIG. 6 is a chart representing an array of data words stored in the instruction cache after the instruction has been written to the instruction cache.

【図７】命令が命令キャッシュにロードされた後に、命
令境界メモリに格納された標識の配列を表わす図表であ
る。FIG. 7 is a diagram representing an array of indicators stored in the instruction boundary memory after an instruction has been loaded into the instruction cache.

【図８】ある命令に対する境界情報が判定された後に、
命令境界メモリ内の標識の配列を表わす図表である。FIG. 8 is a chart showing the array of indicators in the instruction boundary memory after boundary information has been determined for an instruction.

【図９】境界情報が命令キャッシュ内の全命令に対し判
定された後に、命令境界メモリ内の標識の配列を表わす
図表である。FIG. 9 is a chart representing an array of indicators in an instruction boundary memory after boundary information has been determined for all instructions in the instruction cache.

【符号の説明】１０スーパースカラパイプライン式データ処理装置２０命令境界メモリ３０命令キャッシュ４０タグメモリ５０命令プリフェッチ器６０分岐ターゲットバッファ７０命令長照合器８０デコーダ９０ディスパッチ論理回路１００機能ユニット３００バイト配列４００標識配列[Description of Signs]
10 superscalar pipelined data processing device
20 instruction boundary memory
30 instruction cache
40 tag memory
50 instruction prefetcher
60 branch target buffer
70 instruction length collator
80 decoder
90 dispatch logic circuit
100 functional unit
300 byte array
400 indicator array

フロントページの続き (58)調査した分野(Int.Cl.⁶，ＤＢ名) G06F 9/38 G06F 9/32 Continued from the front page (58) Fields searched (Int. Cl.⁶, DB name): G06F 9/38, G06F 9/32

Claims

(57) [Claims]

1. A method for determining the start and end of each instruction in a processor that decodes and executes multiple variable length instructions in a single cycle, comprising the steps of: (a) storing in a cache each data line, which is composed of a sequence of data words stored at sequential addresses in a main memory and which includes a plurality of encoded variable length instructions stored contiguously in said main memory; and (b) storing in a memory a plurality of indicators associated with each data word of said data line, including an indicator indicating whether the associated data word is the first data word of a variable length instruction.

2. The method of claim 1, further comprising the step of: (c) in response to initialization of said processor, resetting said indicators stored in said memory so as not to indicate that any data word is the first data word of a variable length instruction.

3. fetching a sequence of adjacent data words of one or more data lines of said cache; fetching a vector of indicators including the indicator associated with each data word of the fetched subsequence; and parsing the subsequence in parallel into one or more variable length instructions, using the indicators as data word delimiters.

4. (e) simultaneously decoding the parsed sequence of each variable length instruction; (f) simultaneously with step (e), and without using the vector of indicators, determining the length of each of said variable length instructions in order to determine whether the vector of indicators matches the determined lengths; and (g) outputting, for distribution to at least one functional unit, a sequence of instructions that includes a first instruction beginning with the first data word of the fetched sequence of data words and each other instruction that sequentially follows the first instruction and is preceded by at least one instruction for which the vector of indicators matches the determined length.

5. The method of claim 4, further comprising the steps of: (h) generating a vector of replacement indicators from the determined lengths; and (i) using the vector of replacement indicators, reparsing from the fetched subsequence of data words, and re-decoding, each of said variable length instructions preceded by at least one instruction for which the vector of indicators does not match said determined length.

6. (h) distributing each of the simultaneously output decoded instructions to at least one of a plurality of functional units; and (i) executing each of the decoded instructions in parallel on the at least one functional unit to which it has been distributed.

7. (e) determining the length of each instruction without using the indicators; (f) generating a vector of replacement indicators from the determined lengths; and (g) storing the vector of replacement indicators in the memory in place of the fetched vector of indicators.

8. The method of claim 7, wherein step (g) comprises: (g1) dividing the vector of replacement indicators into a first plurality of indicator subsequences, each corresponding to one of the data lines and including each indicator of the vector associated with a data word of the corresponding data line; and (g2) storing the respective indicator subsequences during separate storing steps.

9. The method of claim 8, further comprising: (g3) deferring step (g2) for each indicator subsequence until steps (c), (e) and (f) are performed on a data line other than the data line corresponding to the subsequence for which step (g2) is deferred; (g4) performing steps (c), (e), (f) and (g1) on a second sequence of data words to obtain a second vector of indicators and a second plurality of indicator subsequences; and (g5) concatenating the indicator subsequences of the first and second pluralities of subsequences that correspond to the same data line.

10. The method of claim 1, further comprising the step of: (c) in response to overwriting a data line in the cache memory or invalidating a data line in the cache memory, resetting each indicator associated with each data word of the overwritten or invalidated data line so as not to indicate that any data word of that data line is the first data word of a variable length instruction.

11. A method of decoding one or more instructions in a processor that decodes and executes multiple variable length instructions in a single cycle, comprising the steps of: (a) fetching a sequence of data words of one or more data lines stored in a cache, the sequence beginning with a first data word and including at least the maximum number of data words permitted in an instruction; (b) fetching from the cache a plurality of indicators, one for each data word of the fetched sequence, each indicating whether the respective data word is the first data word of an instruction; (c) identifying one or more non-overlapping subsequences of said sequence of data words, each constituting a separate, sequential instruction of the sequence of instructions to be decoded; and (d) decoding the data words of each subsequence as a separate instruction.

12. A method for decoding and executing one or more instructions in a processor that decodes and executes multiple variable length instructions in a single cycle, the method comprising: (a) using a plurality of indicators obtained from a memory, the indicators being associated with each data word of a sequence of data words and including an indicator of whether the associated data word is the first data word of an instruction, identifying a sequence of one or more instructions to be decoded from the sequence of data words in a cache that includes each data word of the sequence of instructions to be decoded; (b) decoding the identified sequence of instructions to be decoded; (c) while performing step (b), simultaneously locating the first and last data words of each instruction in the sequence of instructions to be decoded without using the indicators, and verifying the accuracy of the indicators; and (d) generating replacement indicators according to the first and last data words located in step (c), and storing the replacement indicators in the memory in place of the obtained indicators.
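Steps (c) and (d) of claim 12 can be sketched as below. The length rule in `toy_length` (the low two bits of the first data word give the length minus one) is a made-up encoding used only so the example runs; it is not the x86 encoding and is not taken from the patent. The sketch also checks sequentially what the claimed apparatus checks concurrently with decoding.

```python
from typing import Callable, List, Sequence, Tuple

def toy_length(words: Sequence[int], pos: int) -> int:
    """Hypothetical encoding: low two bits of the first word give length - 1."""
    return (words[pos] & 0x3) + 1

def verify_indicators(
    words: Sequence[int],
    fetched: Sequence[bool],
    length_of: Callable[[Sequence[int], int], int] = toy_length,
) -> Tuple[bool, List[bool]]:
    """Recompute start indicators from the data words alone and compare.

    Returns (indicators_were_accurate, replacement_indicators).
    """
    replacement = [False] * len(words)
    pos = 0
    while pos < len(words):
        replacement[pos] = True          # first data word of an instruction
        pos += length_of(words, pos)     # skip to the word after its last word
    return replacement == list(fetched), replacement
```

When the fetched indicators are all reset, as after initialization, the comparison fails and the replacement vector is what would be written back in place of the fetched one.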

13. A processor for decoding and executing a number of variable length instructions in a single cycle and determining the first and last data words of each instruction, comprising: (a) a first cache memory area for storing data lines, each data line comprising a series of data words stored at sequential addresses of a main memory and including a number of encoded variable length instructions stored contiguously in the main memory; and (b) a second memory area for storing a plurality of indicators associated with each data word of the data lines stored in the first cache memory area, including an indicator of whether the associated data word is the first data word of a variable length instruction.

14. The processing device according to claim 13, wherein, in response to initialization of the processor, each indicator stored in the second memory area is reset so that no data word is indicated as being the first data word of a variable length instruction.

15. The processing device according to claim 13, further comprising: (c) a prefetcher for fetching a sequence of adjacent data words of one or more data lines of the cache, for fetching a vector of indicators including an indicator associated with each data word of the fetched sequence, and for using the indicators of the vector as delimiters of adjacent data words in the fetched sequence of data words to parse the sequence into one or more variable length instructions in parallel.

16. The processing device according to claim 15, further comprising: (d) a plurality of decoders, each for decoding a separate one of the variable length instructions parsed in parallel; and (e) an instruction length checker for determining the length of each instruction, without using the vector of indicators, while the decoders decode the parsed variable length instructions, and for determining whether the vector of indicators agrees with the determined lengths; wherein the decoders decode each instruction of an instruction sequence that includes a first instruction starting from the first data word of the fetched sequence of data words and each other instruction that successively follows the first instruction and is preceded by at least one instruction for which the vector of indicators agrees with the determined length; and wherein each decoder receives from the instruction length checker an enable signal that enables the decoder to output its decoded instruction for dispatch to one of the functional units.

17. The processing device according to claim 16, wherein the instruction length checker generates a vector of replacement indicators from the determined lengths; the prefetcher uses the vector of replacement indicators to re-parse, from the fetched sequence of data words, each variable length instruction that is preceded by at least one instruction for which the fetched vector of indicators does not agree with the determined length; and the decoders re-decode the re-parsed instructions in parallel.

18. The processing device according to claim 16, further comprising: (f) a dispatch logic circuit for dispatching each of the simultaneously output decoded instructions to one of a plurality of functional units; and (g) the plurality of functional units, which execute the decoded instructions in parallel.

19. The processing device according to claim 15, further comprising: (d) an instruction length checker for determining the length of each of the instructions without using the indicators and for generating a vector of replacement indicators from the determined lengths, wherein the second memory area stores the vector of replacement indicators in place of the fetched vector of indicators.

20. The processing device according to claim 19, wherein the instruction length checker divides the vector of indicators into a first plurality of indicator subsequences, each corresponding to a separate one of the data lines and including each indicator of the vector that is associated with a data word of the corresponding data line, and the respective indicator subsequences are stored during separate storage steps.

21. The processing device according to claim 20, wherein the instruction length checker defers the storage of each indicator subsequence until it has generated a vector of replacement indicators for at least one instruction in a data line other than the data line corresponding to the subsequence whose storage is deferred, obtains a second vector of indicators and a second plurality of indicator subsequences, and concatenates the indicator subsequences from the first and second pluralities of subsequences.

22. The processing device according to claim 13, wherein, in response to the first cache memory area overwriting a data line therein or invalidating a data line therein, the second memory area resets each indicator associated with each data word of the overwritten or invalidated data line so that no data word of the overwritten or invalidated data line is indicated as being the first data word of a variable length instruction.

23. A processing device for decoding one or more instructions, the processing device decoding and executing a number of variable length instructions in a single cycle, comprising: (a) a cache for storing data lines, each including a plurality of variable length instructions stored adjacently; (b) a memory area for storing a plurality of indicators, including an indicator of whether each of the data words in the data lines stored in the cache is the first data word of a variable length instruction; (c) a prefetcher for fetching a sequence of data words of one or more sequential data lines stored in the cache, the sequence starting with a first data word and including at least as many data words as the longest allowable instruction, for fetching from the cache the plurality of indicators associated with each of the data words of the fetched sequence, and for using the indicators as delimiters of the sequence of instructions to be decoded to identify one or more non-overlapping subsequences in the sequence of data words, each subsequence comprising an instruction to be decoded; and (d) a plurality of decoders including a decoder for decoding the data words of each subsequence as a separate instruction.

24. A processor for decoding and executing multiple variable length instructions in a single cycle, comprising: (a) a prefetcher for using a plurality of indicators obtained from a memory, the indicators being associated with each data word of a sequence of data words and including an indicator of whether the associated data word is the first data word of an instruction, to identify a sequence of one or more instructions to be decoded from the sequence of data words in a cache that includes each data word of the sequence of instructions to be decoded; (b) a plurality of decoders for decoding each instruction in the identified sequence of instructions to be decoded; (c) an instruction length checker for locating, without using the indicators and while the decoders are decoding the instructions in the sequence, the first and last data words of each instruction in the sequence of instructions to be decoded, for verifying the accuracy of the indicators, and for generating replacement indicators according to the located first and last data words; and (d) a memory for storing the replacement indicators in place of the obtained indicators.