JP5356531B2

JP5356531B2 - Instruction optimization performance based on sequence detection or information associated with the instruction

Info

Publication number: JP5356531B2
Application number: JP2011534805A
Authority: JP
Inventors: Ohad Falik; Lihu Rappoport; Ron Gabor; Yulia Kurolap; Michael Mishaeli
Original assignee: Intel Corp
Current assignee: Intel Corp
Priority date: 2008-11-05
Filing date: 2009-10-30
Publication date: 2013-12-04
Anticipated expiration: 2029-10-30
Also published as: TW201030606A; US8543796B2; KR20110050722A; JP2012507805A; KR101267911B1; US20100115240A1; US20130346728A1; CN101788903A; TWI434213B; BRPI0920790A2; US8935514B2; WO2010053837A3; WO2010053837A2; CN101788903B

Abstract

In one embodiment, the present invention includes an instruction decoder that can receive an incoming instruction and a path select signal and decode the incoming instruction into a first instruction code or a second instruction code responsive to the path select signal. The two different instruction codes, both representing the same incoming instruction, may be used by an execution unit to perform an operation optimized for different data lengths. Other embodiments are described and claimed.

Description

In most processor-based systems, the processor provides instructions tailored for the efficient execution of operations such as copy and store. Software optimized for memory copy operations is tuned to a specific processor implementation. Because the best way to copy data often changes, compilers, operating system (OS) kernels, and application writers must maintain many pieces of code, each tuned to different scenarios, different microarchitectures, and so on; the code is, in effect, a moving target.

A repeated copy instruction can be used to copy a number of data elements specified by one of the instruction's parameters. Repeated copy instructions may have various native data element lengths, for example byte, word, doubleword, or quadword. The longer the native length, the more efficiently the instruction moves a given amount of data, because larger load and store operations can be used. For example, the repeat move byte (REP MOVSB) instruction of the Intel® architecture (IA32) uses the value in a predetermined register as the length of the copy. The instruction also receives a source pointer and a destination pointer as input parameters. The instruction is defined to move data one byte at a time. Under certain conditions the operation is performed using longer operations (for example, 16 bytes at a time); in that case execution of the instruction is said to switch to a "fast mode". The IA32 programmer's reference manual specifies the conditions under which fast mode may be used on current processors.

In many cases the copy length and the configured operation are unknown at compile time. One way to improve copy efficiency with conventional implementations of repeated copy operations is therefore to use a first repeated copy instruction to move most of the string and then a second repeated copy instruction to move the rest of the data (for example, the first copy operation moves a doubleword at a time and the second moves the last 0 to 3 bytes). Such a sequence has two drawbacks: (a) executing the second instruction consumes additional cycles even when the remainder is zero; and (b) the optimization is tuned to a limited sequence, a first repeated copy instruction of a particular length followed by the second instruction, so other combinations incur a significant performance loss.
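
The two-instruction sequence described above can be modeled in a few lines. This is only an illustrative sketch in Python, not processor code: `rep_movs` and `copy_dword_then_byte` are hypothetical helper names, and byte arrays stand in for memory.

```python
def rep_movs(dst, src, offset, count, width):
    """Model of a repeated move: copy `count` elements of `width` bytes each."""
    for _ in range(count):
        dst[offset:offset + width] = src[offset:offset + width]
        offset += width
    return offset


def copy_dword_then_byte(dst, src, length):
    """Move most of the string as doublewords, then the last 0-3 bytes."""
    dwords, rest = divmod(length, 4)
    offset = rep_movs(dst, src, 0, dwords, 4)  # first instruction: bulk of the data
    rep_movs(dst, src, offset, rest, 1)        # second instruction: issued even if rest == 0
    return dwords, rest


src = bytes(range(11))
dst = bytearray(11)
dwords, rest = copy_dword_then_byte(dst, src, 11)
```

The second call is made even when `rest` is zero, which corresponds to drawback (a) in the text.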

In a pipelined machine there are also cases in which the most suitable behavior of an instruction must be chosen when the instruction is decoded, even though some of the data needed to make that choice is unknown or not yet committed. One example is selecting a branch based on a flag that has not yet been computed. The most common scheme for solving this problem is a branch predictor. However, a predictor needs training time (to build history), is costly (it must keep a large amount of state), and its performance under fragmented patterns is uncertain.

The drawings comprise, in order: a flowchart of a method according to an embodiment of the present invention; a block diagram of a sequence detector according to an embodiment of the present invention; a state diagram of an example sequence decoder state machine according to an embodiment of the present invention; a block diagram of a processor according to an embodiment of the present invention; and a block diagram of a system according to an embodiment of the present invention.

In various embodiments, characteristics of compiler-generated copy operations are used to perform repeated copy operations more efficiently. As used herein, the term "copy" operation refers generically to memory copy, memory move, and memory set operations, which move data within, into, or out of memory. Different environments may give these common operations different names. The "fast mode" of these copy operations can be used in many cases. Even when it cannot (for example, when an aliasing risk test fails), in many cases (assuming a random distribution) a mode faster than native mode, in which one data element is copied at a time, can still be used. An optimized copy sequence attempts from the start to perform the copy using one of several different fast modes (that is, modes faster than native mode), and only in a small number of cases must the copy be performed using native-length operations. A processor's instruction set may include one or more instructions that direct the processor to perform a memory copy or memory set (store) operation; if these operations are implemented efficiently, the processor hardware can maintain performance boundaries across different microarchitecture generations and different architecture generations.

As described below, one embodiment may include several major phases (detailed below): (1) a part that checks the rules required to start a "fast copy" and sets up operations for the following phases; (2) a head portion in which conditional copies are performed (covering pipeline latency and, by using conditional operations, avoiding the bubbles caused by propagation); (3) fast fixed-size iterations, with variants to handle specific cases; and (4) a tail portion. The check and head portions (steps 1 and 2) are performed for every string length (that is, copy length or block length). The head portion is executed when all checks pass; if a check fails, the hardware enters a native loop that performs one native-size copy operation at a time. The fast loop and tail portions are executed as needed, depending on the copy length analyzed in the head portion. Because the decision is made early, the execution path is selected with minimal pipeline bubbles and without branch mispredictions. Additional restrictions may be applied to some of the length or src-dst distance handling. For example, some implementations of the "fast loop" use imprecise exception detection that requires part of the operation to be re-executed; this may exist in addition to the checks performed in the head portion. If going back by up to 64B is allowed, it is necessary to check whether the destination pointer is behind the source pointer by less than 63B (that is, (dst mod 4K) − (src mod 4K) < 63B). If such an additional check fails, execution can branch to a different, less optimal code routine that still executes correctly. As an option, some embodiments may also handle the special case in which the copy length is very long and caching hints can be used to improve performance. Although copy operations of particular sizes are described herein, the scope of the present invention is not limited in this regard, and embodiments can handle copy operations optimized for other sizes (for example, different byte counts and cache line widths).

FIG. 1 is a flowchart of a method according to an embodiment of the present invention. Method 100 may be performed in various locations of a processor, such as a general-purpose or dedicated hardware unit, and may be used to perform repeated copy operations in an optimized manner. As shown in FIG. 1, method 100 begins by performing checks and preparing for the copy operation (block 110). Specifically, various checks determine the type of copy operation to perform, and counters associated with the copy operation are initialized by loading them with the count values the operation will use. First, several checks determine whether a fast flow is possible, in which the copy is performed using load/store operations longer than the native length of the instruction. If any of these checks fails, a native mode loop is executed, and the copy is performed using the native length of the instruction, for example byte-wise for a byte instruction or doubleword-wise for a doubleword instruction (block 120). The checks use data obtained at the execution stage, where the necessary information is already available and known. If any check fails, there is an associated performance hit and a misprediction cost, but this is rare in normal use, and the relative loss is low compared with the cost of the native loop.

In one embodiment, the conditions checked include the distance between the destination (dst) and source (src) pointers of the strings, to ensure that reading src ahead does not change the behavior of the operation. The distance check determines whether 0 bytes (B) < ((dst mod 4K) − (src mod 4K)) < 16B; if the distance is in this range, the flow may go to native mode. In embodiments where memory aliasing between pages is not a concern, the check may be performed without the "mod 4K". The direction (DF) flag is also checked; if DF == 1, the flow may go to native mode. A check for address space wraparound (for both src and dst) may also be performed; if it is true, the flow goes to native mode. Other implementations may add further conditions or remove some of the conditions for entering fast mode.
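
The checks listed above can be collected into a single predicate. The following is a minimal sketch under stated assumptions: addresses are plain Python integers, `fast_mode_allowed`, `addr_bits`, and `use_mod_4k` are hypothetical names, and the wraparound test (an end address beyond the address space) is only one plausible reading of the text.

```python
PAGE = 4096  # the "4K" in the distance check


def fast_mode_allowed(src, dst, length, df=0, addr_bits=64, use_mod_4k=True):
    """Return True if none of the described checks forces native mode."""
    if df == 1:                      # direction flag set: use native mode
        return False
    limit = 1 << addr_bits
    if src + length > limit or dst + length > limit:
        return False                 # assumed wraparound test for src and dst
    if use_mod_4k:
        distance = (dst % PAGE) - (src % PAGE)
    else:
        distance = dst - src         # variant where page aliasing is not a concern
    if 0 < distance < 16:            # dst shortly after src: reading ahead is unsafe
        return False
    return True
```

For example, a destination 8 bytes past the source fails the distance check, and so does a destination 8 bytes past the same page offset on a different page unless `use_mod_4k` is turned off.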

Block 110 may also prepare the fast loop (for example, "FastLoop") and the tail portion. In one embodiment, this phase includes computing the counter for the fast CL loop (for example, if the length is given in bytes in the rcx register and each loop iteration handles 64 bytes, the number of iterations is computed as rcx/64) and loading that value into the zero-overhead counter register (assuming the "head" portion copies up to 64B of data, as described below, the counter is decremented by 1 on the jump into the fast loop). If the "head" portion handles more than 64B of data (for example, 128B), a constant may need to be subtracted from the computed rcx/64 value. The tail condition is then computed and placed in the zero-overhead jump control register.
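
As a rough illustration of this preparation step, the sketch below computes a counter value and a tail condition from the length. It is one possible reading, not the patented hardware: `prepare_fast_loop` is a hypothetical name, the constant subtracted for a larger head is assumed to be `head_bytes // 64 - 1`, and the tail condition is assumed to be "data remains after the head and the length is not a multiple of 64".

```python
FAST_LOOP_BYTES = 64  # bytes handled per fast-loop iteration in the example


def prepare_fast_loop(rcx, head_bytes=64):
    """Return (counter, tail_condition) for a copy of `rcx` bytes."""
    counter = rcx // FAST_LOOP_BYTES
    # Assumed adjustment when the head covers more than 64B (e.g. 128B).
    counter -= head_bytes // FAST_LOOP_BYTES - 1
    counter = max(counter, 0)
    # Assumed tail condition: bytes remain beyond the head and the fast loop.
    tail_condition = rcx > head_bytes and rcx % FAST_LOOP_BYTES != 0
    return counter, tail_condition
```

With a 64B head and a 200-byte copy this gives a counter of 3; after the decrement on the jump into the loop, two 64B iterations run and 8 bytes are left for the tail.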

If any of the checks fails, control passes to block 120 and the copy may be performed in native mode. In various embodiments, native mode performs the copy operation according to the native length mode, after which method 100 may end. Thus, when the conditions needed to bundle copy operations are not met, each copy iteration, run in a zero-overhead loop, uses the native length (for example, one byte per iteration for the repeat move byte instruction, REP MOVSB).

If the checks complete and it is determined (based on the checks and computations of block 110) that a fast copy operation can be performed, control passes from block 110 to block 130, where the head portion of the copy operation may be executed. Specifically, conditional loads/stores handle any length up to a predetermined amount of data, for example 64 bytes. As described herein, in one embodiment up to eight copy operations may be performed to copy up to 64 bytes of data. Once the checks of block 110 have passed, the processor knows that copy operations longer than the native copy length can be performed without affecting the correctness of the result.

In block 130 the copy operations are "conditional" operations: each N-byte conditional copy is performed only if at least N bytes remain. Because the condition is checked at execution time, it does not depend on length information propagating back from execution to the decode stage. In addition to the copy, each step increments the src and dst pointers used by the next operation by N and decrements the remaining length by N.

The number of copy operations is chosen so that the preparation done in the "check" phase (block 110) can complete and propagate through the pipeline, so that no penalty is incurred when its results are needed at the decode stage. The time it takes a "load zero-overhead counter" or "zero-overhead branch condition" to travel from the decode stage to final execution is the window in which the conditional operations are decoded and executed, and equals the depth of the pipe from decode to execute. If the largest load/store length (in bytes) the machine can handle is N = 2^n, the copy sequence can use a sequence of power-of-two lengths (which may be called a power-of-two tree): 1, 1, 2, 4, ..., N/2, N, N, N. For example, with N = 16 and assuming the processor needs eight operations to cover the pipeline delay, the sequence is 1, 1, 2, 4, 8, 16, 16, 16, for a maximum copy size of 64B. For every value from 0 to 64B there is a subset of the operations that moves exactly that amount of data (for example, to move 3 bytes, perform the 1 and the 2; for 10 bytes, perform the 2 and the 8). As another example, with N = 32 and eight operations needed to cover the pipeline delay, the sequence is 1, 1, 2, 4, 8, 16, 32, 32, for a total of 96B. In some embodiments it is efficient for the maximum amount of data the conditional portion can handle to be an integer multiple of the FastLoop size (for example, 64B × 1 = 64B or 64B × 2 = 128B).
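
The two example sequences can be reproduced with a small generator. This is an illustrative sketch only; `head_sequence` and its parameter names are hypothetical, and padding with N until the required operation count is reached is an assumption drawn from the two examples in the text.

```python
def head_sequence(n_max, num_ops):
    """Build 1, 1, 2, 4, ..., n_max and pad with n_max up to num_ops entries."""
    if n_max < 1 or n_max & (n_max - 1):
        raise ValueError("n_max must be a power of two")
    seq = [1]
    size = 1
    while size <= n_max:
        seq.append(size)   # 1, 2, 4, ..., n_max
        size *= 2
    while len(seq) < num_ops:
        seq.append(n_max)  # repeat the largest operation to cover pipeline delay
    return seq


seq16 = head_sequence(16, 8)
seq32 = head_sequence(32, 8)
```

Summing each list gives the maximum head size for that configuration, 64B and 96B respectively.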

In one embodiment, executing the sequence in the reverse of the order given above (for example, 16, 16, 16, 8, 4, 2, 1, 1) makes it simple for the head portion of block 130 to generate the subset of operations needed to copy exactly any number of bytes from 0 to 64B. Each operation is conditioned on the remaining length: if Remainder_Length − N > 0 the operation is performed; otherwise it is skipped. Remainder_Length is updated after each copy operation using that operation's length. Instead of updating the src and dst pointers after every copy operation, it is also possible to update only an offset from the original src and dst pointers and to update the pointers themselves to their new values at the end of block 130 (or at another snapshot point within the block). This saves one "add" operation in each conditional step.
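
A software model of the reverse-order head portion might look like the sketch below. It is an illustration, not the hardware flow: `conditional_head_copy` is a hypothetical name, byte arrays stand in for memory, a single offset is tracked instead of separate pointers, and the condition is written as "at least N bytes remain", following the description of the conditional operations in block 130.

```python
def conditional_head_copy(dst, src, length, sizes=(16, 16, 16, 8, 4, 2, 1, 1)):
    """Run each conditional copy only if at least that many bytes remain."""
    offset = 0            # offset from the original src/dst pointers
    remaining = length
    executed = []
    for n in sizes:
        if remaining >= n:
            dst[offset:offset + n] = src[offset:offset + n]
            offset += n
            remaining -= n
            executed.append(n)
    return executed, remaining


src = bytes(range(64))
dst = bytearray(64)
executed, remaining = conditional_head_copy(dst, src, 10)
```

For a 10-byte copy only the 8-byte and 2-byte operations run. For lengths above 64 all eight operations run and the leftover is returned for the later phases.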

At the end of head portion 130, a multi-way decision is made using the counter, the selected loop type, and the conditions prepared in block 110. Specifically, if the zero-overhead counter value is 1 or more, the counter is decremented by 1 and the FastLoop of block 140 is executed. If it is less than 1 and the tail condition is true (that is, the number of remaining bytes is less than 64 and greater than zero), the tail portion is executed in block 135. Otherwise, no further data is to be copied and method 100 ends. The zero-overhead counter value determines whether "FastLoop" needs to be invoked: the counter has been loaded with the number of iterations plus 1, and if the counter value is greater than 1 it is decremented and execution jumps to the top of "FastLoop". If the counter value is 1 or less, the loop need not be invoked.

As shown in FIG. 1, if the remaining count is greater than 63 bytes, control passes to block 140, where fast fixed-size iterations may be executed, each moving, for example, 64 bytes and/or one cache line of data. This is a fast loop that handles copy operations of a predetermined length using the preloaded zero-overhead loop counter. In some embodiments, before the copy operations of block 140 are executed, checks are made whose hits incur a misprediction penalty (although "fast execution" is still possible). (1) A further pointer distance check is performed, which is needed when the restrictions of FastLoop are stricter than those of the conditional copies in the head portion. For example, a FastLoop that does not track its progress may need to be re-executed from the very beginning, so in addition to all the earlier checks it must check ((src mod 4K) − (dst mod 4K)) > 63B. If this check fails, control passes to block 160 and a second fast loop is executed (detailed below; it has no such restriction but may be slower). (2) The remaining length of the string is checked; if it is greater than a specified threshold (NT_threshold), control passes to block 150, which executes a loop that uses cache hints, such as non-temporal hints for the load and store operations (for example, the Intel® MOVNTDQA or MOVNTSQ instructions), to avoid cache pollution. In one embodiment, the NT_threshold parameter can be tuned to the cache size to achieve the best performance effect. In another implementation, multiple threshold levels may be used to decide which of various caching hints is most appropriate.

During each iteration of the loop of block 140, 64B of data is copied in the fastest possible manner (that is, using a code sequence optimized for that copy length). The number of iterations is determined by the zero-overhead loop counter. At the end of the FastLoop of block 140, the condition for handling the tail is checked (at zero overhead, since the condition is preset): if the tail condition is true, control passes to the tail portion in block 135; otherwise method 100 ends without copying any further data.
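
To see how a copy length divides across the phases, the following sketch splits a length into head bytes, FastLoop iterations, and tail bytes. It assumes the example configuration from the text (a head of up to 64B and 64B per FastLoop iteration); `plan_copy` is a hypothetical name, and the function models only the byte accounting, not the counter registers.

```python
HEAD_MAX = 64         # assumed maximum bytes handled by the head portion
FAST_LOOP_BYTES = 64  # assumed bytes per FastLoop iteration


def plan_copy(length):
    """Return (head_bytes, fast_iterations, tail_bytes) for a copy length."""
    head = min(length, HEAD_MAX)
    after_head = length - head
    fast_iterations = after_head // FAST_LOOP_BYTES
    tail = after_head % FAST_LOOP_BYTES  # nonzero means the tail condition is true
    return head, fast_iterations, tail


plan = plan_copy(200)
```

A 200-byte copy uses a full 64B head, two FastLoop iterations, and an 8-byte tail; a 40-byte copy is finished entirely by the head.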

In block 160, fast_16 loop is similar to FastLoop but copies 16B per iteration (using a sequence optimized for that copy length). The zero-overhead counter is adjusted for 16B iterations before the loop is executed.

After as many 64B chunks (or chunks of the size of the copy operations of blocks 140, 150, and 160) as possible have been copied, 63B or less may remain to be copied (the processor reaches this state only when such a tail exists). The tail is handled in block 135 with a sequence of conditional copy operations similar to the one used in the head portion, except that the sequence starts with one 1 rather than two (1, 2, ...). The tail length is set to the amount of data in one FastLoop iteration minus 1 (for example, 63B = 64B − 1) and does not depend on the pipeline depth. In the example above, with N = 16 and a 64B FastLoop, the tail is copied in chunks of 16, 16, 16, 8, 4, 2, and 1 bytes (seven operations). As explained for the head portion, the reverse order is used to simplify selecting the subset of operations to perform. For N = 32 the tail sequence is 32, 16, 8, 4, 2, 1 (six operations).
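
The tail sequences quoted above can be derived mechanically from N and the FastLoop size. The sketch below is illustrative; `tail_sequence` is a hypothetical name, and the greedy construction is simply one way to reproduce the two sequences given in the text.

```python
def tail_sequence(n_max, loop_bytes=64):
    """Descending operation sizes that together cover loop_bytes - 1 bytes."""
    seq = []
    remaining = loop_bytes - 1   # the tail never exceeds one iteration minus 1
    size = n_max
    while remaining > 0:
        while size > remaining:
            size //= 2           # step down to the next power of two that fits
        seq.append(size)
        remaining -= size
    return seq


tail16 = tail_sequence(16)
tail32 = tail_sequence(32)
```

Each list sums to 63, so applying the conditional "at least N bytes remain" rule to it can copy any tail of 0 to 63 bytes.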

If the DF flag checked in block 110 is 1, the string is in "reverse order" and the source and destination pointers are decremented. The algorithm described above handles this case in the native loop (control passes to block 120). In another implementation, such copy operations may be implemented with a similar "fast copy" sequence, using a symmetric scheme that reverses the pointer adjustment operations.

Although the implementation of method 100 described above is for repeated copy operations using the REP MOVSB instruction, other implementations may use other copy instructions. For example, an algorithm for a store instruction (for example, REP STOSB) can follow the same scheme as REP MOVSB, using the same phases as above except that only a store is performed where the copy operation uses a load plus a store. In addition, REP STOSB allows some simplifications: (1) the distance between src and dst need not be checked, and (2) conditions on the src pointer need not be checked. A new step is needed, however, to prepare a store data register with the length of the longest store operation (N = 16 or N = 32 in the examples above) that holds replicated copies of the data for the store (STOSB has 1 byte of data, which must be duplicated into every byte of the store data register).
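
The extra preparation step for the store variant, replicating the single STOSB byte across an N-byte store register, can be sketched as follows. This is a simplified model with hypothetical names (`replicate_byte`, `fast_store`): it fills with full N-byte stores and then uses smaller power-of-two stores for the remainder, rather than reproducing the head, FastLoop, and tail phases.

```python
def replicate_byte(value, n):
    """Build the n-byte store data pattern from the one-byte STOSB value."""
    return bytes([value & 0xFF]) * n


def fast_store(dst, value, length, n=16):
    """Fill `length` bytes of dst using stores of at most n bytes."""
    pattern = replicate_byte(value, n)
    offset = 0
    while length - offset >= n:          # largest stores first
        dst[offset:offset + n] = pattern
        offset += n
    size = n // 2
    while size >= 1:                     # conditional smaller stores for the rest
        if length - offset >= size:
            dst[offset:offset + size] = pattern[:size]
            offset += size
        size //= 2
    return offset


buf = bytearray(40)
stored = fast_store(buf, 0xAB, 37)
```

No source pointer appears anywhere in the model, which mirrors the simplifications listed for REP STOSB.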

Although the implementation of FIG. 1 is tuned for REP MOVSB and 64 bytes per iteration, other embodiments may handle fast copy operations of different lengths. Such operations may also be used to perform fast copies for instructions that move doublewords (for example, REP MOVSD) or for other instructions. Another embodiment may assume that there is no page "aliasing" (in which case the modulo 4K is dropped). As noted above, one part of a code sequence may be optimized to perform the desired operation in the most efficient way for the particular kind of instruction involved, while in another part of the sequence the same instruction may be executed in a non-optimized manner. In various embodiments, a sequence detection technique analyzes the incoming instruction sequence and provides codes to the execution units so that one or more instructions of a given code sequence can be executed in an optimized manner.

As one example, the IA32 REP MOVS and REP STOS operations are tuned to handle copy operations whose length is not known in advance. A current optimization uses REP MOVSD to move the bulk of the data and REP MOVSB for the remainder, which is known to be 0-3 bytes long (information that is used to optimize the REP MOVSB execution time). Example code implementing these copy operations is shown in Table 1 (a similar structure applies to REP STOS).
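Table 1 is not reproduced in this text, so the following is only a model of the general D+B pattern that the paragraph describes: the byte count is split into a doubleword count for REP MOVSD and a 0-3 byte remainder for REP MOVSB. The function name `copy_d_plus_b` and the use of Python byte buffers are assumptions made for illustration.

```python
from typing import Tuple


def copy_d_plus_b(src: bytes, dst: bytearray, count: int) -> Tuple[int, int]:
    """Copy count bytes from src to dst as doublewords followed by single bytes.

    Returns (dword_count, byte_count), i.e. the counts that the REP MOVSD
    and REP MOVSB steps would each receive in this model.
    """
    dwords = count >> 2   # count for the REP MOVSD step (4 bytes per element)
    tail = count & 3      # count for the REP MOVSB step, always in 0..3
    pos = 0
    for _ in range(dwords):          # models REP MOVSD
        dst[pos:pos + 4] = src[pos:pos + 4]
        pos += 4
    for _ in range(tail):            # models REP MOVSB on the remainder
        dst[pos] = src[pos]
        pos += 1
    return dwords, tail
```

Because the remainder is computed as `count & 3`, the byte step can never see a count above 3, which is the property the REP MOVSB optimization relies on.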

REP MOVSB is optimized by handling lengths of 0-3 quickly at the cost of a penalty for other lengths. Because of the preceding operations, the above scheme ensures that the count always falls in the range 0-3. However, various other sequences may be used to perform such an optimization, and in particular a different sequence may be used to set the count for the REP MOVSB instruction. If the behavior of REP MOVSB is optimized for lengths other than 0-3, for example for the 0-7 remainder left by a REP MOVSQ instruction, the above code does not work well and in many cases performance may degrade (in this example, for lengths of 4-7). Similarly, other optimizations that make REP MOVSB efficient for any length, but degrade the 0-3 case as a side effect, do not work well with the code of Table 1 and cause performance loss. Although the value of ecx may be known only at instruction execution time, the decision of which length REP MOVSB should handle must be made at instruction decode time, in order to avoid pipeline delays that create "bubbles" and cause performance loss.

In the optimization above (Table 1), REP MOVSB follows shortly after the REP MOVSD instruction (referred to as a D+B sequence), which serves as a hint that the programmer intends the REP MOVSB instruction to handle a limited number of bytes, e.g., 0-3 bytes. Embodiments use this sequence hint to provide different instruction codes to the execution unit, allowing (at least) the second copy instruction to be optimized. Because the complete instruction sequence may vary, and other code may be used to achieve the same result, the hardware does not search for one specific sequence. Instead, it looks for a REP MOVSB that follows a REP MOVSD instruction within a small number of instructions (e.g., 1-9). Execution is correct regardless of which flow is decoded and which optimization is selected for a given data length, so there is no need to guarantee that a D+B sequence is always detected, or that one is never detected falsely.

FIG. 2 shows a block diagram of a sequence detector in accordance with one embodiment of the present invention. As shown in FIG. 2, processor 200 may include an instruction decoder 210 that receives instructions to be executed. Instructions received at the decoder may be stored in a buffer 215. Buffer 215 provides the next instruction to be executed to decode logic 220, which also receives a decode path select signal from a feedback path that includes a sequence detector state machine 240. Based on this select signal and the various rules of decode logic 220, the instruction may be decoded and provided to an execution unit 230 for execution. In general, decode logic 220 receives an incoming instruction and generates a decoded instruction from it. In one embodiment, the decoded instruction may take the form of a machine code corresponding to the instruction, which is provided to execution unit 230 so that the instruction can be executed. For example, the instruction code may cause the instruction unit to execute a microcode sequence, or may select a given functional unit to perform the desired operation. The decode logic may decode multiple instructions in parallel. Other decode logic may translate a single instruction into multiple operations for execution.

As shown in FIG. 2, the decoded instruction may also be supplied to an instruction comparator 225 in the feedback path, which compares it with a predicted instruction code received from state machine 240. The predicted instruction code may correspond to a given instruction code, present in the first part of a code sequence, that is to be optimized using state machine 240 and decode logic 220. Some implementations may realize this using an index into an internal micro-operation array. In some implementations, multiple such state machines and comparators may be present, each associated with a given instruction to be searched for in the code sequence. In other implementations, state machine 240 and comparator 225 may be extended to support comparison and analysis of multiple instructions.

In the embodiment of FIG. 2, which handles a single instruction, comparator 225 reports a match signal to state machine 240 when its two input codes match. As also shown in FIG. 2, state machine 240 receives a stall signal (or an indication of instruction decode) from instruction decoder 210 every cycle. Although FIG. 2 shows instructions being decoded one at a time, the scheme can be extended to the case where multiple instructions are decoded in parallel. Instruction decoder 210 holds the instruction that is supplied to decode logic 220. In one embodiment, decode logic 220 may include logic that parses the instruction using particular state information (e.g., a machine mode that makes certain instructions illegal). The decoder output, shown as the "decoded instruction", identifies the micro-operations that will be executed for this instruction. The nature of these operations depends on the microarchitectural implementation of the machine, but the output can be regarded as a binary value (or range of values) that uniquely represents the instruction. This code is supplied to execution unit 230, which performs the operations corresponding to the decoded instruction in one or more cycles.

In one implementation, the optimization is based loosely on the instruction sequence. Correct operation of the instruction is assumed to be guaranteed regardless of the decision, so there is no need to ensure that sequence detection is accurate in every case, and the detector can instead be optimized to detect the sequence in most cases. Instruction comparator 225 compares the current instruction code from state machine 240 with the "next instruction code" received from decode logic 220. As described below, the code from the state machine may cover a range of codes, or one or more codes, depending on the state machine flow. When the comparison matches, state machine 240 moves to the next stage. The state machine moves from one stage to another based on match detection (what counts as a match may change from state to state) or based on time or instruction decode count. When time is used, a stall indication is provided from instruction decoder 210 to prevent the state machine from "counting" while instruction decoder 210 is stalled (e.g., while waiting for a fetch from a lower-level cache or memory to complete, or while the execution unit is busy and cannot accept new instructions). With this stall indication, the count of execution cycles approximates the count of decoded instructions, which may be simpler to implement. Sequence detector state machine 240 feeds a state information signal, shown in FIG. 2 as the "decode path select signal", back to decode logic 220. This state information alters decode logic 220 so that, for the same instruction in instruction buffer 215, the decoder rules signal a different decoded instruction to execution unit 230.

To clarify the operation, consider an example of detecting and optimizing the execution of REP MOVSB, with two cases: (1) REP MOVSB is used on its own to copy data of unknown length that is expected to be more than 3 bytes (a "long REP MOVSB" instruction); and (2) REP MOVSB is used in a code sequence in association with REP MOVSD, so that its length is expected to be in the range of 0-3 bytes (referred to here as a "short REP MOVSB"). Instruction decoder 210 can output two different codes for these cases, and execution unit 230 performs the selected one of two differently optimized copy operations.
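The effect of the decode path select signal on the decoder can be sketched as a function that maps the same incoming instruction to one of two codes. This is a toy model, not the patented decode logic: the code names `LONG_REP_MOVSB` and `SHORT_REP_MOVSB`, the `decode` function, and the string-based encoding are all hypothetical.

```python
# Hypothetical decoded-instruction codes; real encodings are microarchitecture specific.
LONG_REP_MOVSB = "rep_movsb_long"    # path tuned for unknown, probably longer lengths
SHORT_REP_MOVSB = "rep_movsb_short"  # path tuned for an expected length of 0-3 bytes


def decode(instruction: str, short_path_selected: bool) -> str:
    """Return the decoded code for one instruction under a given path select signal."""
    if instruction == "REP MOVSB":
        return SHORT_REP_MOVSB if short_path_selected else LONG_REP_MOVSB
    # In this model every other instruction decodes identically on both paths.
    return instruction.lower().replace(" ", "_")
```

Both codes represent the same architectural instruction, so either choice yields a correct copy; the select signal only affects which optimized path the execution unit takes.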

FIG. 3 is a state diagram of an example sequence detector state machine in accordance with one embodiment of the present invention. As shown in FIG. 3, in operation 310 the state machine is reset to search for a REP MOVSD or REP STOSD instruction. In this state, the decode path select signal from the state machine is set so that a "long REP MOVSB" code is generated if a REP MOVSB occurs in the code sequence. At the same time, the instruction decoder is supplied with the codes for REP MOVSD and REP STOSD, and when either code occurs, an indication is provided to the sequence detection state machine. The state machine then switches to a mode that looks for an "immediately following" REP MOVSB or REP STOSB and provides a decode path select signal that causes the code for a "short REP MOVSB" operation to be generated. The state machine remains in this state (operations 320-340) for a threshold distance from the identified REP MOVSD or REP STOSD: a small number n of non-stall cycles or, equivalently, nl instructions. If one instruction is decoded at a time, nl equals n; if multiple instructions are decoded at once, nl is larger than n (e.g., 4n). If a fetch stall or any other stall prevents the decoder from issuing new instructions for this flow, the count is paused so that the sequence can still be detected. In this example n is a small number, e.g., 4. After this delay, regardless of whether a REP MOVSB has arrived, the sequencer returns to the initial state 310 and searches for REP MOVSD or REP STOSD, implicitly assuming that a new REP MOVSD+B sequence is starting. Returning when no REP MOVSB or REP STOSB was detected covers the scenario in which the code contains a REP MOVSD alone and, elsewhere, a REP MOVSB alone. In some embodiments, events such as an interrupt occurring in the middle of the state machine operation may be ignored, because the rate of such events multiplied by the misprediction penalty is small compared with the cost of the event itself.
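The window behavior of the state machine can be approximated in software as follows. This is a simplified sketch, assuming one instruction decoded per cycle (so nl equals n); it collapses states 320-340 into a single countdown. The class name `SequenceDetector`, its `step` method, and the choice to reopen the window when another trigger instruction arrives are assumptions for illustration, not details confirmed by the text.

```python
from typing import Optional

# Instructions that open the detection window (state 310 searches for these).
TRIGGERS = {"REP MOVSD", "REP STOSD"}


class SequenceDetector:
    """Toy model of the sequence detector state machine of FIG. 3."""

    def __init__(self, n: int = 4) -> None:
        self.n = n            # threshold distance in non-stall cycles
        self.remaining = 0    # 0 models state 310; >0 models states 320-340

    def step(self, decoded: Optional[str], stalled: bool = False) -> bool:
        """Advance one cycle.

        Returns the path select value applied to the instruction decoded in
        this cycle: True selects the "short" path, False the "long" path.
        """
        select = self.remaining > 0
        if stalled or decoded is None:
            return select                 # stalls do not consume the window
        if decoded in TRIGGERS:
            self.remaining = self.n       # open (or, by assumption, reopen) the window
        elif self.remaining > 0:
            self.remaining -= 1           # one non-stall cycle has elapsed
        return select
```

For example, with n = 4 a REP MOVSB that arrives two instructions after a REP MOVSD is decoded on the short path, while one that arrives after four intervening instructions is decoded on the long path. A stalled cycle leaves the countdown unchanged, matching the paused count described above.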

Optionally, the state machine can be implemented so that, once a REP MOVSB or REP STOSB is found in state 320 or state 330, it exits back to state 310 without going through the remaining states. This is not required when the code sequence is short (on the assumption that no REP MOVSD immediately follows the REP MOVSB and would go undetected during the fixed delay). In another embodiment, when the distance between the identified instructions of a sequence is long, the state machine may reset to the initial search state (state 310) upon detecting the second (i.e., further) instruction.

The fact that correct execution is guaranteed regardless of the selected optimization also covers cases such as an exception occurring between the REP MOVSD and REP MOVSB instructions. When such a rare condition occurs, REP MOVSB may take the non-optimal path, which may cost some performance but does not compromise correct execution of the code. There are also cases, such as mispredictions, that cause a pipeline flush (e.g., REP MOVSB is decoded after REP MOVSD and then flushed). In such cases it is generally desirable not to reset the state machine, since with high probability the REP MOVSB will be decoded again within the allowed delay window.

In one embodiment, the implementation of the sequence detector state machine may be relaxed in order to correctly handle cases where the flow is not exact and varies. For example, using a timer instead of searching for an exact sequence addresses this issue.

Current decoders can decode multiple instructions at once. The implementation above can be extended in several ways to cover this case. First, decoding of the "searched-for" instructions can be restricted to one at a time; in the REP MOVSB example, the REP MOVSD and REP STOSD instructions would each be decoded alone. Second, multiple compare operations can be performed on the output of each decoder, either serializing (flushing the younger operations) or using multiple comparators for all of the predicted codes so that the state machine can follow the code sequence from any operation. When non-serializing decode is used, the state machine may be extended to support multiple stage transitions at once (e.g., supporting the decode of a second match in parallel with the first).

Embodiments thus enable optimizations of the REP MOVSB instruction that bring significant benefit to new code without causing performance loss for existing code that is optimized to use the REP MOVSD+B sequence.

FIG. 4 shows a block diagram of a processor in accordance with one embodiment of the present invention. As shown in FIG. 4, processor 400 may be a multi-stage pipelined out-of-order processor. Processor 400 is shown in a relatively simplified view in FIG. 4 to illustrate various features used in connection with the instruction tuning described above.

As shown in FIG. 4, processor 400 includes a front end unit 410, which fetches macro-instructions to be executed and prepares them for later use in the processor. For example, front end unit 410 may include a fetch unit 404, an instruction cache 406, and an instruction decoder 408. In some implementations, front end unit 410 may further include a trace cache, along with microcode storage and micro-operation (μOP) storage. Fetch unit 404 may fetch macro-instructions, e.g., from memory or instruction cache 406, and supply them to instruction decoder 408, which decodes them into primitives, i.e., μOPs, for execution by the processor. In one embodiment of the present invention, instruction decoder 408 is configured with logic that performs sequence detection to determine whether an incoming group of instructions includes a predetermined sequence of two or more instructions (or, as described above, selected instructions in close proximity to one another). This logic causes instruction decoder 408 to provide different decoded instructions, e.g., μOPs to be executed later in the processor pipeline, in order to optimize performance. In some implementations, when a given macro-instruction is received, instruction decoder 408 causes a given microcode sequence to be sent for execution; this sequence may handle a fast-mode copy operation in accordance with an embodiment of the present invention. In other implementations, the execution unit may be extended with dedicated hardware that efficiently performs a fast copy operation in response to the decoded instruction.

Coupled between front end unit 410 and execution unit 420 is an out-of-order (OOO) engine 415, which may be used to receive the micro-instructions and prepare them for execution. Specifically, OOO engine 415 may include various buffers to reorder the micro-instruction flow and allocate the various resources needed for execution. It also provides renaming of logical registers onto storage locations within various register files, such as register file 430 and extended register file 435. Register file 430 may include separate register files for integer and floating-point operations. Extended register file 435 may provide storage for vector-sized units, e.g., 256 or 512 bits per register.

Various resources may be present in execution unit 420, including, for example, various integer, floating-point, and single instruction multiple data (SIMD) logic units, among other specialized hardware. Results are provided to retirement logic, namely a reorder buffer (ROB) 440. Specifically, ROB 440 may include various arrays and logic to receive information associated with executed instructions. ROB 440 examines this information to determine whether an instruction can be validly retired and its result data committed to the architectural state of the processor, or whether one or more exceptions occurred that prevent proper retirement of the instruction. Of course, ROB 440 may handle other operations associated with retirement.

As shown in FIG. 4, ROB 440 is coupled to a cache 450, which in one embodiment may be a low-level cache (e.g., an L1 cache), although the present invention is not limited in this regard. Execution unit 420 may also be coupled directly to cache 450. From cache 450, communication may occur with higher-level caches, system memory, and so forth. Although this configuration is shown at a high level in the embodiment of FIG. 4, the present invention is not limited in this regard.

Embodiments may be implemented in many different system types. FIG. 5 shows a block diagram of a system in accordance with one embodiment of the present invention. As shown in FIG. 5, multiprocessor system 500 is a point-to-point interconnect system that includes a first processor 570 and a second processor 580 coupled via a point-to-point interconnect 550. Each of processors 570 and 580 is a multicore processor that includes first and second processor cores (i.e., processor cores 574a and 574b and processor cores 584a and 584b). Each processor core may include hardware, software, and firmware to perform the instruction tuning described with reference to FIGS. 1-4.

As also shown in FIG. 5, first processor 570 includes a memory controller hub (MCH) 572 and point-to-point (P-P) interfaces 576 and 578. Similarly, second processor 580 includes an MCH 582 and P-P interfaces 586 and 588. MCHs 572 and 582 couple the processors to respective memories, namely a memory 532 and a memory 534, which may be portions of main memory (e.g., dynamic random access memory (DRAM)) locally attached to the respective processors. First processor 570 and second processor 580 may be coupled to a chipset 590 via P-P interconnects 552 and 554, respectively. As shown in FIG. 5, chipset 590 includes P-P interfaces 594 and 598.

Chipset 590 also includes an interface 592 that couples chipset 590 to a high-performance graphics engine 538. In turn, chipset 590 may be coupled to a first bus 516 via an interface 596. As shown in FIG. 5, various I/O devices 514 may be coupled to first bus 516, along with a bus bridge 518 that couples first bus 516 to a second bus 520. Various devices may be coupled to second bus 520, including, for example, a keyboard/mouse 522, communication devices 526, and a storage unit 528, such as a disk drive or other mass data storage device, which in one embodiment may include code 530. An audio I/O 524 may also be coupled to second bus 520.

Embodiments may be implemented in code and may be stored on a storage medium holding instructions that can be used to program a system to perform the instructions. The storage medium may include, but is not limited to: any type of disk, including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks; semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs) and static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, and electrically erasable programmable read-only memories (EEPROMs); magnetic or optical cards; or any other type of medium suitable for storing electronic instructions.

Although the present invention has been described with reference to a limited number of embodiments, it will be apparent to those skilled in the art that many variations and modifications are possible. The appended claims are intended to cover all such variations and modifications that fall within the spirit and scope of the present invention. Examples of embodiments of the present invention are listed as items below.
[Item 1]
An apparatus comprising:
an execution unit that performs an operation indicated by an instruction code; and
an instruction decoder that receives input instructions,
wherein the instruction decoder includes first logic that receives a first input instruction and receives a path selection signal from a feedback path,
the feedback path is coupled to the instruction decoder and includes a sequence detector, coupled to the instruction decoder, that generates the path selection signal,
the path selection signal corresponds to detection of a different input instruction received by the instruction decoder within a threshold distance of the first input instruction, and
the first logic decodes the first input instruction into a first instruction code or a second instruction code in response to the path selection signal.
[Item 2]
The apparatus according to item 1, further comprising a comparator that receives an instruction code from the instruction decoder and a prediction code from the sequence detector and generates a match signal when the instruction code and the prediction code match.
[Item 3]
The apparatus according to item 2, wherein the sequence detector, when the match signal is not generated, generates the path selection signal in a first state that causes the first logic to decode the first input instruction into the first instruction code, and the first instruction code corresponds to a copy operation optimized for a first data length.
[Item 4]
The apparatus according to item 3, wherein the sequence detector, in response to the match signal, generates the path selection signal in a second state that causes the first logic to decode the first input instruction into the second instruction code, and the second instruction code corresponds to a copy operation optimized for a second data length different from the first data length.
[Item 5]
The apparatus according to item 4, wherein the second instruction code causes the execution unit to execute a copy operation having a finite length.
[Item 6]
The apparatus according to item 4, wherein the sequence detector generates the path selection signal in the second state when the different input instruction is received by the instruction decoder within a first number of instructions corresponding to the threshold distance from the first input instruction.
[Item 7]
The apparatus according to item 6, wherein the threshold distance is approximated by a number of cycles and decode stall information.
[Item 8]
The apparatus according to item 6, wherein the sequence detector has a state machine, and the state machine is reset if the different input instruction is not received within the first number of instructions.
[Item 9]
A method comprising:
determining whether an iterative copy instruction can be optimized, based at least in part on information associated with the iterative copy instruction;
if so, executing a first portion of the iterative copy instruction by a first sequence of conditional copy operations that copies up to a first amount of data, in up to a first number of chunks, from a first source location to a first destination location using a power-of-two tree of copies;
executing a second portion of the iterative copy instruction by copying a second amount of data from a second source location to a second destination location using a fast loop of copy operations, if the remaining amount of data to be copied is greater than a first threshold; and
thereafter, if any data remains to be copied, executing a third portion of the iterative copy instruction by a second sequence of conditional copy operations that copies up to a third amount of data, in up to a third number of chunks, from a third source location to a third destination location.
[Item 10]
The method according to item 9, further comprising obtaining setup information for the fast loop and for the second sequence of conditional copy operations before executing the first sequence of conditional copy operations.
[Item 11]
The method according to item 9, further comprising determining whether the second amount of data is greater than a second threshold and, if so, copying the second amount of data directly to memory using a caching hint, without storing it in a cache.
[Item 12]
The method according to item 9, wherein a first one of the first sequence of conditional copy operations copies an N-byte chunk of data, increments a first pointer and a second pointer associated with the first sequence of conditional copy operations, and updates a counter associated with the remaining data to be copied.
[Item 13]
The method according to item 9, wherein the power-of-two tree begins with a first power-of-two length corresponding to the maximum load length or maximum store length of the processor, and ends with a last power-of-two length corresponding to one byte.
[Item 14]
The method according to item 9, further comprising determining whether a difference between a first pointer and a second pointer associated with the iterative copy instruction is between a third threshold and a fourth threshold,
wherein, if the result of the determination is true, the second amount of data is copied using copy operations having a width shorter than one iteration of the fast loop.
[Item 15]
A system comprising:
a processor having a front end including a decoder; and
a dynamic random access memory (DRAM) coupled to the processor,
wherein the decoder includes first decode logic,
the first decode logic, when receipt by the decoder of a sequence of instructions including an input copy instruction and at least one other copy instruction is indicated, receives the input copy instruction, receives a selection signal from second logic on a feedback path coupled to the decoder, and decodes the input copy instruction into a first instruction code or a second instruction code in response to the selection signal, and
the processor further has an execution unit that receives the first instruction code or the second instruction code and, in response to the received instruction code, executes a first copy operation or a second copy operation, respectively.
[Item 16]
The system according to item 15, wherein the second logic includes a sequence detector,
and the sequence detector generates the selection signal so as to cause the decode logic to generate the second instruction code if the input copy instruction is received within a first number of instructions after a second input copy instruction corresponding to the at least one other copy instruction, and otherwise to cause the decode logic to generate the first instruction code.
[Item 17]
The system according to item 16, wherein, if the input copy instruction is not received within the first number of instructions, the sequence detector is reset to a first state in which it searches for the second input copy instruction.
[Item 18]
The system according to item 17, wherein the sequence detector proceeds from the first state to a second state, in which it searches for the input copy instruction, after the second input copy instruction is detected.
[Item 19]
The system of item 16, further comprising a comparator that receives the instruction code from the decoder and the prediction code from the sequence detector and generates a match signal if the instruction code and the prediction code match.
[Item 20]
The system according to item 15, wherein the first copy operation is optimized for a first data length and the second copy operation is optimized for a second data length different from the first data length.
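
The three-part copy method of items 9 to 13 can be illustrated with a minimal Python sketch. It is a software model only, not the microcode of any processor. The constants MAX_CHUNK, LOOP_BYTES and FAST_THRESHOLD, the helper names tree_copy and repeat_copy, and the choice of using the head to align the destination offset are all assumptions made for illustration; the items do not fix these values. Overlapping source and destination ranges (item 14) and caching hints (item 11) are not modeled.

```python
MAX_CHUNK = 16                   # assumed widest single load/store, in bytes
LOOP_BYTES = 2 * MAX_CHUNK       # assumed bytes moved per fast-loop iteration
FAST_THRESHOLD = LOOP_BYTES - 1  # assumed "first threshold" for entering the loop


def tree_copy(dst, d, src, s, n):
    """Copy n bytes (0 <= n < 2 * MAX_CHUNK) with a power-of-two tree.

    Each size MAX_CHUNK, ..., 4, 2, 1 is tried once, and a chunk is copied
    only if the matching bit of n is set (a conditional copy). Returns the
    advanced offsets and the number of chunks actually copied.
    """
    assert 0 <= n < 2 * MAX_CHUNK
    chunks = 0
    size = MAX_CHUNK
    while size >= 1:
        if n & size:
            dst[d:d + size] = src[s:s + size]
            d += size   # advance destination pointer
            s += size   # advance source pointer
            chunks += 1
        size //= 2
    return d, s, chunks


def repeat_copy(dst, d, src, s, count):
    """Copy count bytes from src[s:] to dst[d:] in head, fast-loop and tail parts."""
    assert 0 <= s and s + count <= len(src)
    assert 0 <= d and d + count <= len(dst)

    # Part 1: conditional copies; here (an assumption) they align the
    # destination offset to MAX_CHUNK.
    head = min(count, (-d) % MAX_CHUNK)
    d, s, head_chunks = tree_copy(dst, d, src, s, head)
    count -= head

    # Part 2: fast loop, entered only if enough data remains.
    iterations = 0
    if count > FAST_THRESHOLD:
        while count >= LOOP_BYTES:
            dst[d:d + LOOP_BYTES] = src[s:s + LOOP_BYTES]
            d += LOOP_BYTES
            s += LOOP_BYTES
            count -= LOOP_BYTES
            iterations += 1

    # Part 3: conditional copies for whatever is left (fewer than LOOP_BYTES).
    tail = count
    d, s, tail_chunks = tree_copy(dst, d, src, s, tail)

    return {
        "head_bytes": head,
        "head_chunks": head_chunks,
        "loop_iterations": iterations,
        "tail_bytes": tail,
        "tail_chunks": tail_chunks,
    }
```

Under these assumed constants, the head never needs more than four chunks (8, 4, 2, 1) and the tail never more than five (16, 8, 4, 2, 1), which corresponds to the "up to a first number of chunks" and "up to a third number of chunks" bounds in item 9.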

Claims

An apparatus comprising:
an execution unit that performs an operation indicated by an instruction code;
an instruction decoder that receives input instructions; and
a sequence detector that is connected to the instruction decoder and generates a path selection signal,
wherein the instruction decoder includes first logic that receives instructions and receives the path selection signal from the sequence detector,
the path selection signal indicates whether the threshold distance has been exceeded since a predetermined input instruction, which is a second type of copy instruction, was received by the first logic, and
the first logic, upon receiving a first input instruction that is a first type of copy instruction, decodes the first input instruction into a first instruction code or a second instruction code based on the path selection signal.

The apparatus of claim 1, further comprising a comparator that receives an instruction code from the instruction decoder and a prediction code from the sequence detector, and generates a match signal when the instruction code and the prediction code match, wherein the prediction code is an instruction code corresponding to the second type of copy instruction.

The apparatus of claim 2, wherein the sequence detector, when the match signal is not generated or when the threshold distance is exceeded after the match signal is generated, generates the path selection signal in a first state that causes the first logic to decode the first input instruction into the first instruction code, and
the first instruction code corresponds to a copy operation optimized for a first data length.

The apparatus of claim 3, wherein the sequence detector, in response to the threshold distance not having been exceeded after the match signal is generated, generates the path selection signal in a second state that causes the first logic to decode the first input instruction into the second instruction code, and
the second instruction code corresponds to a copy operation optimized for a second data length different from the first data length.

The apparatus of claim 4, wherein the second instruction code causes the execution unit to perform a copy operation of a finite length.

The apparatus of claim 4, wherein the sequence detector generates the path selection signal in the second state within a first number of instructions, corresponding to the threshold distance, after the predetermined input instruction is received by the first logic.

The apparatus of claim 6, wherein the length that is compared with the threshold distance, running from the predetermined input instruction up to the first input instruction, is obtained from a number of cycles and decode stall information.

The apparatus of claim 6, wherein the sequence detector has a state machine, and
the state machine is reset if the first input instruction is not received by the first logic within the first number of instructions after the predetermined input instruction was received by the first logic.

A system comprising:
a processor having a front end including a decoder; and
a dynamic random access memory (DRAM) coupled to the processor,
wherein the processor includes second logic that is connected to the decoder and generates a selection signal,
the decoder includes first decode logic that receives instructions and receives the selection signal from the second logic,
the selection signal, in response to a predetermined copy instruction that is a second type of copy instruction being received by the first decode logic, indicates that a sequence of instructions including the predetermined copy instruction and an input copy instruction is received by the first decode logic if the instruction received following the predetermined copy instruction is the input copy instruction, which is a first type of copy instruction,
the first decode logic, upon receiving the input copy instruction that is the first type of copy instruction, decodes the input copy instruction into a first instruction code or a second instruction code based on the selection signal, and
the processor further includes an execution unit that receives the first instruction code or the second instruction code and, in response to the received instruction code, executes a first copy operation or a second copy operation, respectively.

The system of claim 9, wherein the second logic includes a sequence detector,
the sequence detector generates the selection signal, and
the selection signal causes the first decode logic to generate the second instruction code when the input copy instruction is received within a first number of instructions after a second input copy instruction corresponding to the predetermined copy instruction, and otherwise causes the first decode logic to generate the first instruction code.

The system of claim 10, wherein, when the input copy instruction is not received within the first number of instructions, the sequence detector is reset to a first state in which it searches for the second input copy instruction.

The system of claim 11, wherein the sequence detector proceeds from the first state to a second state, in which it searches for the input copy instruction, after the second input copy instruction is detected.

The system of claim 10, further comprising a comparator that receives an instruction code from the decoder and a prediction code from the sequence detector, and generates a match signal when the instruction code and the prediction code match, wherein the prediction code is an instruction code corresponding to the second type of copy instruction.

The system of claim 9, wherein the first copy operation is optimized for a first data length, and the second copy operation is optimized for a second data length different from the first data length.
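
The decoder and sequence detector recited in the claims above can be modeled, very roughly, as a small state machine. The following Python sketch is illustrative only: the class SequenceDetector, the function decode_stream, the code names copy_code_1 and copy_code_2, and the instruction mnemonics used in the example are hypothetical, and the threshold distance is counted simply in instructions rather than approximated from cycles and decode stalls.

```python
from dataclasses import dataclass

FIRST_CODE = "copy_code_1"    # hypothetical code optimized for a first data length
SECOND_CODE = "copy_code_2"   # hypothetical code optimized for a second data length


@dataclass
class SequenceDetector:
    predicted: str       # "second type" copy instruction the comparator looks for
    threshold: int       # window size, counted in instructions (an assumption)
    remaining: int = 0   # 0 means the reset state: searching for `predicted`

    def path_select(self):
        """True while the detector is in its second state (inside the window)."""
        return self.remaining > 0

    def observe(self, insn):
        """Update the state machine with one decoded instruction."""
        if insn == self.predicted:
            # Comparator match: open (or reopen) the window.
            self.remaining = self.threshold
        elif self.remaining > 0:
            # One instruction further from the match; at 0 the machine is reset.
            self.remaining -= 1


def decode_stream(stream, detector, target):
    """Decode each instruction; `target` is the "first type" copy instruction."""
    codes = []
    for insn in stream:
        if insn == target:
            # The path selection signal picks one of two codes for the same instruction.
            codes.append(SECOND_CODE if detector.path_select() else FIRST_CODE)
        else:
            codes.append(insn)  # other instructions pass through in this toy model
        detector.observe(insn)
        return_value = codes
    return codes
```

For example, with `SequenceDetector(predicted="rep_movsd", threshold=2)` and `target="rep_movsb"`, a `rep_movsb` that arrives within two instructions after a `rep_movsd` is decoded to the second code, while an isolated `rep_movsb` is decoded to the first code.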