JP7064273B2

JP7064273B2 - Load-store unit with partitioned reorder queues using a single CAM port

Info

Publication number: JP7064273B2
Application number: JP2020517847A
Authority: JP
Inventors: シンハロイ、バララム; ロイド、ブライアン; ゴンザレス、クリストファー
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2017-10-06
Filing date: 2018-10-03
Publication date: 2022-05-10
Anticipated expiration: 2038-10-03
Also published as: GB2579534A; WO2019069255A1; GB2579757B; DE112018004004T5; GB202006344D0; GB202006338D0; DE112018004006T5; JP2020536310A; DE112018004006B4; CN111133413B; JP2020536308A; JP7025100B2; WO2019069256A1; GB2579757A; GB2579534B; CN111133421B; CN111133421A; CN111133413A

Description

Embodiments of the present invention relate generally to out-of-order (OoO) processors, and more specifically to a load-store unit (LSU) that implements partitioned load and store reorder queues with a single content-addressable memory (CAM) port in order to efficiently support out-of-order execution of instructions in an OoO processor.

In an OoO processor, an instruction sequencing unit (ISU) dispatches instructions to various issue queues, renames registers in support of OoO execution, issues instructions from the various issue queues to the execution pipelines, completes executed instructions, and handles exception conditions. Register renaming is typically performed by mapper logic in the ISU before an instruction is placed in its issue queue. The ISU includes one or more issue queues that contain dependency matrices for tracking dependencies between instructions. A dependency matrix typically includes one row and one column for each instruction in the issue queue.

A load-store unit with partitioned reorder queues that use a single CAM port is provided.

Embodiments of the present invention include methods, systems, and computer program products for implementing an effective-address-based load-store unit in an out-of-order processor. A non-limiting example is a processing unit for executing one or more instructions that includes a load-store unit (LSU), where the LSU executes a plurality of instructions in an out-of-order (OoO) window using multiple LSU pipes. The execution includes selecting an instruction from the OoO window, the instruction using an effective address. In response to the instruction being a load instruction, and in response to the processing unit operating in single-thread mode, the execution includes creating an entry in a first partition of a load reorder queue based on the instruction being issued on a first load pipe, and creating an entry in a second partition of the load reorder queue based on the instruction being issued on a second load pipe. In response to the processing unit operating in a multi-thread mode in which multiple threads are processed simultaneously, the execution also includes creating, by a first thread of the processing unit, an entry in a first predetermined portion of the first partition of the load reorder queue based on the instruction being issued on the first load pipe.
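The allocation rule in the preceding paragraph can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed hardware: the partition size, the thread count, the even split of a partition into per-thread portions, and the names `LoadReorderQueue` and `allocate` are all hypothetical. The source only states that in single-thread mode the load pipe selects the partition, and that in multi-thread mode a thread allocates into a predetermined portion of the partition selected by the pipe.

```python
from dataclasses import dataclass, field

PARTITION_SIZE = 8   # assumed entries per partition (not given in the source)
NUM_PIPES = 2        # load pipes LD0 and LD1
NUM_THREADS = 4      # assumed thread count in multi-thread mode


@dataclass
class LoadReorderQueue:
    multithreaded: bool = False
    # partitions[pipe] is a fixed-size list of slots; None marks a free slot
    partitions: list = field(default_factory=lambda: [
        [None] * PARTITION_SIZE for _ in range(NUM_PIPES)])

    def _slot_range(self, thread: int) -> range:
        """Slots that `thread` may use inside a partition."""
        if not self.multithreaded:
            return range(PARTITION_SIZE)           # ST mode: whole partition
        portion = PARTITION_SIZE // NUM_THREADS    # MT mode: assumed even split
        return range(thread * portion, (thread + 1) * portion)

    def allocate(self, pipe: int, thread: int, effective_address: int):
        """Create an entry for a load issued on `pipe`.

        Returns (partition, slot), or None if the allowed slots are full.
        """
        partition = self.partitions[pipe]          # the pipe selects the partition
        for slot in self._slot_range(thread):
            if partition[slot] is None:
                partition[slot] = (thread, effective_address)
                return pipe, slot
        return None
```

With these assumed sizes, a load issued on LD1 in single-thread mode lands in partition 1, while in multi-thread mode thread 1 issuing on LD0 is confined to slots 2 and 3 of partition 0.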

According to one or more embodiments, a computer-implemented method for out-of-order execution of one or more instructions by a processing unit includes receiving, by a load-store unit (LSU) of the processing unit, an out-of-order (OoO) window of instructions that includes a plurality of instructions to be executed out of order, and issuing, by the LSU, instructions from the OoO window. Issuing an instruction includes selecting the instruction from the OoO window, the instruction using an effective address. In response to the instruction being a load instruction, and in response to the processing unit operating in single-thread mode, the issuing includes creating an entry in a first partition of a load reorder queue based on the instruction being issued on a first load pipe, and creating an entry in a second partition of the load reorder queue based on the instruction being issued on a second load pipe. In response to the processing unit operating in a multi-thread mode in which multiple threads are processed simultaneously, the issuing also includes creating, by a first thread of the processing unit, an entry in a first predetermined portion of the first partition of the load reorder queue based on the instruction being issued on the first load pipe.

According to one or more embodiments, a computer program product includes a computer-readable storage medium having program instructions embodied therewith, the program instructions being executable by a processing unit to cause the processing unit to perform operations. The operations include receiving, by a load-store unit (LSU) of the processing unit, an out-of-order (OoO) window of instructions that includes a plurality of instructions to be executed out of order, and issuing, by the LSU, instructions from the OoO window. Issuing an instruction includes selecting the instruction from the OoO window, the instruction using an effective address. In response to the instruction being a load instruction, and in response to the processing unit operating in single-thread mode, the issuing includes creating an entry in a first partition of a load reorder queue based on the instruction being issued on a first load pipe, and creating an entry in a second partition of the load reorder queue based on the instruction being issued on a second load pipe. In response to the processing unit operating in a multi-thread mode in which multiple threads are processed simultaneously, the issuing also includes creating, by a first thread of the processing unit, an entry in a first predetermined portion of the first partition of the load reorder queue based on the instruction being issued on the first load pipe.

Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention. For a better understanding of the invention with its advantages and features, refer to the description and to the drawings.

The specifics of the exclusive rights described herein are particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other features and advantages of the embodiments of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings.

A block diagram of a system that includes an effective-address-based load-store unit in an out-of-order processor, in accordance with one or more embodiments of the present invention.

An exemplary block diagram of a processor architecture of an OoO processor in which an effective address directory (EAD) and the associated mechanisms for utilizing the EAD are implemented, in accordance with one or more embodiments of the present invention.

A diagram of a load-store unit (LSU) of a processing core, in accordance with one or more embodiments of the present invention.

An exemplary block of an effective address directory (EAD) structure (L1 cache), in accordance with one illustrative embodiment.

An exemplary block of an effective real translation (ERT) table structure, in accordance with one illustrative embodiment.

A flowchart of an exemplary method for accessing memory to execute instructions by an LSU, in accordance with one or more embodiments of the present invention.

A flowchart of a method for reloading the ERT, in accordance with one or more embodiments of the present invention.

A diagram of an exemplary structure of a synonym detection table (SDT), in accordance with one or more embodiments of the present invention.

A flowchart of a method for performing an ERT and SDT EA swap, in accordance with one or more embodiments of the present invention.

A diagram of an ERT eviction (ERTE) table, in accordance with one or more embodiments of the present invention.

A flowchart of an exemplary method for adding an entry to the ERTE table, in accordance with one or more embodiments of the present invention.

An exemplary sequence diagram of an exemplary set of instructions being initiated, in accordance with one or more embodiments of the present invention.

A flowchart of an exemplary method for issuing instructions by an LSU in multi-pipe mode and in an OoO manner, depending on whether the processor is operating in single-thread (ST) mode or multi-thread (MT) mode, in accordance with one or more embodiments of the present invention.

A block diagram of a computer system for implementing some or all aspects of one or more embodiments of the present invention.

The figures depicted herein are illustrative. Many variations of the figures or of the operations described herein are possible without departing from the spirit of the invention. For instance, the operations can be performed in a differing order, or operations can be added, deleted, or modified. Also, the term "coupled" and variations thereof describe a communication path between two elements and do not imply a direct connection between the elements with no intervening elements or connections between them. All of these variations are considered a part of the specification.

One or more embodiments of the invention described herein provide an effective address (EA) based load-store unit (LSU) for an out-of-order (OoO) processor, with dynamic removal of entries in an effective-real address table within the OoO processor. The technical solutions described herein use an effective address directory (EAD) in conjunction with an effective real table (ERT) and a synonym detection table (SDT), among other components, to facilitate a reduction in chip area and, further, to improve the timing of OoO processors. In addition, the technical solutions described herein facilitate an OoO LSU executing load and store instructions out of order. The OoO LSU executes load and store instructions using multiple pipes to improve performance. The multi-pipe implementation of the LSU is based on a partitioned ERT, load reorder queue (LRQ), and store reorder queue (SRQ), as described herein.

Most modern computing devices support virtual memory. Virtual memory is a technique that gives an application program the impression that it has contiguous working memory, or address space, when in fact the physical memory may be fragmented and may even overflow onto disk storage. Essentially, the application program is given a view of the memory of the computing device in which the application accesses seemingly contiguous memory using effective addresses in an effective address space visible to the application. Each effective address is then translated into a physical address of the actual physical memory or storage device in order to perform the access operation. An effective address is the value used to specify the memory location accessed by an operation from the perspective of the entity issuing the operation (for example, an application, process, thread, interrupt handler, or kernel component).

That is, if a computing device does not support the concept of virtual memory, the effective address and the physical address are one and the same. If the computing device does support virtual memory, however, the effective address of a particular operation submitted by the application is translated by the memory mapping unit of the computing device into a physical address that specifies the location in the physical memory or storage device where the operation is to be performed.
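As a toy model of this translation, the following Python sketch maps an effective address to a physical address through a page table. The 4 KiB page size, the dictionary-based table, and the function name `translate` are illustrative assumptions; the source does not specify a page size or a table format.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def translate(effective_address: int, page_table: dict) -> int:
    """Translate an effective address into a physical address.

    `page_table` maps effective page numbers to physical page numbers.
    """
    page, offset = divmod(effective_address, PAGE_SIZE)
    if page not in page_table:
        # A real memory mapping unit would raise a translation fault here.
        raise KeyError(f"no mapping for effective page {page:#x}")
    return page_table[page] * PAGE_SIZE + offset
```

The offset within the page is preserved; only the page number changes.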

Further, in modern computing devices, the processor of the computing device uses a processor instruction pipeline, comprising a series of data processing elements, to process the instructions (operations) submitted by entities such as applications and processes. Instruction pipelining is a technique that increases instruction throughput by splitting the processing of computer instructions into a series of steps, with storage at the end of each step. Instruction pipelining allows the control circuitry of the computing device to issue instructions to the processor instruction pipeline at the processing rate of the slowest step, which is much faster than the time needed to perform all of the steps at once. Processors with instruction pipelines (that is, pipelined processors) are internally organized into stages that can work semi-independently on separate jobs. Each stage is organized and linked to the next stage in a chain, so that the output of each stage is fed to another stage until the final stage of the pipeline.
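The throughput argument can be checked with a little arithmetic. The sketch below compares the total time for `n` instructions with and without pipelining, assuming an idealized pipeline with no stalls whose cycle time equals the slowest stage; the stage latencies in the usage note are made-up numbers, not values from the source.

```python
def unpipelined_time(stage_times, n):
    """Each instruction passes through every stage before the next one starts."""
    return n * sum(stage_times)


def pipelined_time(stage_times, n):
    """Idealized pipeline: one instruction enters per cycle, and the cycle
    time is set by the slowest stage."""
    cycle = max(stage_times)
    return (len(stage_times) + n - 1) * cycle
```

For four hypothetical stages of 1, 2, 1, and 1 ns and 100 instructions, the unpipelined time is 500 ns, while the pipelined time is (4 + 99) × 2 = 206 ns.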

Such pipelined processors may take the form of in-order or out-of-order pipelined processors. In an in-order pipelined processor, instructions are executed in order, so that if data is not available for an instruction to be processed at a particular stage of the pipeline, execution of instructions through the pipeline stalls until the data becomes available. Out-of-order pipelined processors, on the other hand, allow the processor to avoid the stalls that occur when the data needed to perform an operation is unavailable. The instruction pipeline of an out-of-order processor avoids these stalls by filling the "slots" in time with other instructions that are ready to be processed and then reordering the results at the end of the pipeline so that it appears that the instructions were executed in order. The way the instructions are ordered in the original computer code is known as program order, whereas in the processor they are handled in data order, that is, the order in which the data and operands become available in the processor's registers.
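The difference between the two issue policies can be illustrated with a deliberately simplified Python model. It assumes a single issue slot per cycle, independent instructions, and a known cycle at which each instruction's operands become ready; the function names and the `ready_at` list are hypothetical and only serve the illustration.

```python
def in_order_cycles(ready_at):
    """Issue strictly in program order, stalling until the oldest
    instruction's operands are ready. Returns total cycles used."""
    cycle = 0
    for r in ready_at:               # program order
        cycle = max(cycle, r) + 1    # stall if needed, then issue
    return cycle


def out_of_order_cycles(ready_at):
    """Each cycle, issue any one instruction whose operands are ready,
    filling slots that an in-order pipeline would waste."""
    pending = list(ready_at)
    cycle = 0
    while pending:
        ready = [r for r in pending if r <= cycle]
        if ready:
            pending.remove(ready[0])  # issue one ready instruction
        cycle += 1
    return cycle
```

If the first instruction's operands arrive at cycle 3 and three later instructions are ready immediately (`ready_at = [3, 0, 0, 0]`), the in-order model needs 7 cycles, while the out-of-order model finishes in 4.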

Modern processor instruction pipelines track the effective address of an instruction as the instruction flows through the pipeline. Tracking the effective address matters because it is used whenever the processing of an instruction results in an exception being taken, the instruction flushes to a prior state, the instruction branches to a new memory location relative to its current memory location, or the instruction completes execution.

Tracking the effective address of an instruction is costly in terms of processor chip area, power consumption, and the like. This is because effective addresses are large (for example, 64 bits) and modern processor instruction pipelines are deep, that is, they have many stages, so the lifetime of an instruction from the instruction fetch stage to the completion stage of the pipeline is very long. The cost can be even higher in highly multithreaded out-of-order processors, that is, processors that execute instructions from multiple threads out of order, because a vast number of instructions from different address ranges can be processing at the same time (that is, be "in flight").

In one or more examples, computing devices use a combination of pipeline latches, a branch information queue (BIQ), and a global completion table (GCT) to track the effective address of an instruction. The base effective address (EA) of a group of instructions is carried from the front end of the pipeline in latches until it can be deposited and tracked in the GCT of the instruction sequencer unit (ISU). The number of latches needed to hold this data is on the order of the number of pipeline stages between the fetch stage and the dispatch stage. These latches are wasteful, because the EA is typically not needed between these stages. The data is simply payload that is "along for the ride" with the instruction group as it flows through the pipeline. In addition, this approach leads to duplicate storage, because branch instructions have their EAs in both the BIQ and the GCT.

Accordingly, computing devices have been developed that remove these inefficiencies by tracking the EA solely in the GCT. For example, these newer computing devices (instruction sequencer units) create an entry in the GCT at fetch time. The EA is loaded into the GCT at that point and removed when the instruction completes. This eliminates many pipeline latches throughout the machine. Instead of a full EA that is as wide as the number of address lines (for example, a 64-bit EA), a small tag is carried along with the instruction group through the pipeline. The tag points back to the entry in the GCT that holds the base EA for the instruction group. Address storage in the BIQ is no longer needed, because a branch can retrieve its EA directly from the GCT when it issues. Such techniques improve area efficiency, but they are not applicable to out-of-order processors. Further, they lack sufficient information to process address requests that arrive out of program order. In addition, they cannot support the dispatch and completion bandwidth required for out-of-order execution, because they lack the ability to track instruction groups that may have been formed from multiple disjoint address ranges. Historically, such mechanisms have supported only instruction groups from a single address range, which can significantly reduce the number of instructions available to execute out of order. Further, a content-addressable memory (CAM) is used to look up corresponding addresses, such as the real address (RA) that corresponds to an EA, or vice versa. A CAM implements lookup-table functionality in a single clock cycle using dedicated comparison circuitry. The overall function of a CAM is to take a search word and return the matching memory location. However, such a CAM takes up chip area and consumes power for each such lookup.
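The behavior of a CAM can be modeled in software as a search by content rather than by index. The sketch below is only a functional model: real CAM hardware compares all entries in parallel in one clock cycle, whereas the loop here visits them one by one. The class name `Cam` and its methods are hypothetical.

```python
class Cam:
    """Functional model of a content-addressable memory."""

    def __init__(self, size):
        self.entries = [None] * size   # None marks an empty location

    def write(self, index, word):
        self.entries[index] = word

    def search(self, word):
        """Return every location whose stored word matches `word`.

        Hardware performs these comparisons simultaneously; each
        comparator costs area and power, which is the overhead the
        text refers to.
        """
        return [i for i, stored in enumerate(self.entries) if stored == word]
```

Each additional CAM port replicates this comparison circuitry, which is why reducing the port count saves both area and power.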

The illustrative embodiments of the technical solutions described herein improve on these techniques by providing an effective address directory (EAD), an effective real table (ERT), and a synonym detection table (SDT) that have the area efficiency of the GCT solution described above and can also support a wide-issue out-of-order pipeline without inhibiting performance. The technical solutions described herein further facilitate the processor running with EAs only, as long as the processor can avoid EA synonyms within the out-of-order (OoO) window. The OoO window is the set of instructions in the instruction pipeline of the processor. By preventing EA synonyms in the OoO window, the processor can avoid translating EAs within the OoO window, which reduces the chip area and power consumed by address translation. This is because, with no EA synonyms present in the OoO window, load-hit-store (LHS), store-hit-load (SHL), and load-hit-load (LHL) conditions are no longer detected for in-flight instructions.

In other words, the technical solutions described herein address the technical problem by policing EA aliasing within the OoO window and by reducing the translation data structures and hardware for the load and store ports. Accordingly, the technical solutions described herein facilitate a reduction in chip area by tracking only one address, the EA. Further, the technical solutions facilitate the OoO processor running in a 2-load and 2-store mode with partitioned load and store queues, further reducing the CAM ports that are typically used for address translation.

In addition, if the OoO processor supports multithreaded (MT) operation, it needs to provide multiple CAM ports for each load and store queue in the load-store unit in order to translate EAs to RAs and RAs to EAs for each thread operating out of order. For example, consider an OoO processor executing four threads in MT mode, with each thread running concurrently by executing independent instructions. In this case, the load-store unit (LSU) of the OoO processor typically uses four or more CAM ports per load and store queue to translate effective addresses to real addresses and real addresses to effective addresses. Such multiple CAM ports for address translation occupy considerable chip area and consume additional power. The technical solutions described herein address this technical challenge of multiple CAM ports for multiple threads.

One or more illustrative embodiments of the invention described herein address aspects of the technical challenges described herein by using a single CAM port for the load and store queues, thereby reducing the chip area and power used for address translation. For example, the illustrative embodiments described herein can facilitate the LSU being a multi-load and multi-store LSU that includes partitioned load and store queues, which facilitates a reduction in the number of CAM ports for address translation. A "multi-load LSU" is an LSU that issues multiple load instructions concurrently, each on a separate pipe. For example, a "2-load LSU" issues two load instructions concurrently on two separate pipes (LD0 and LD1). Similarly, a "multi-store LSU" is an LSU that issues multiple store instructions concurrently, each on a separate pipe. For example, a "2-store LSU" issues two store instructions concurrently on two separate pipes (ST0 and ST1).
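A "2-load LSU" can be pictured as pairing up to two ready load instructions with the pipes LD0 and LD1 in each cycle. The Python sketch below shows only that pairing step; the function name `issue_loads` and the representation of instructions as strings are assumptions made for the illustration, and the selection policy (oldest first) is not taken from the source.

```python
LOAD_PIPES = ("LD0", "LD1")  # pipe names as given in the text


def issue_loads(ready_loads):
    """Assign at most one ready load to each load pipe for this cycle.

    Returns (issued, remaining): `issued` maps pipe name to instruction,
    `remaining` holds the loads that must wait for a later cycle.
    """
    issued = dict(zip(LOAD_PIPES, ready_loads))   # zip stops at two pipes
    remaining = ready_loads[len(issued):]
    return issued, remaining
```

A 2-store LSU would follow the same pattern with the pipes ST0 and ST1.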

Turning now to FIG. 1, a block diagram is generally shown of a system 100 that includes an instruction sequencing unit (ISU) of an out-of-order (OoO) processor that implements the technical solutions for preventing EA synonyms within the OoO instruction window, in accordance with one or more embodiments of the present invention. The system 100 shown in FIG. 1 includes an instruction fetch unit/instruction decode unit (IFU/IDU) 106 that fetches and decodes instructions for input to a setup block 108, which prepares the decoded instructions for input to a mapper 110 of the ISU. In accordance with one or more embodiments of the present invention, six instructions at a time from a thread can be fetched and decoded by the IFU/IDU 106. In accordance with one or more embodiments of the present invention, the six instructions sent to the setup block 108 can include six non-branch instructions, five non-branch instructions and one branch instruction, or four non-branch instructions and two branch instructions. In accordance with one or more embodiments of the present invention, the setup block 108 checks that sufficient resources, such as entries in the issue queues, completion table, mappers, and register files, exist before sending the fetched instructions to these blocks in the ISU.
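The three group compositions listed above all amount to "six instructions with at most two branches". A minimal Python check of that rule follows; the function name and the boolean-list encoding are hypothetical, and the sketch assumes that a group always contains exactly six instructions, which the source does not state explicitly.

```python
GROUP_SIZE = 6      # instructions fetched and decoded at a time
MAX_BRANCHES = 2    # largest branch count among the listed compositions


def is_valid_group(is_branch):
    """`is_branch` holds one boolean per instruction in the group."""
    return len(is_branch) == GROUP_SIZE and sum(is_branch) <= MAX_BRANCHES
```

A group of three branches and three non-branches would be rejected under this rule.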

The mappers 110 shown in FIG. 1 map programmer instructions (e.g., logical register names) to physical resources of the processor (e.g., physical register addresses). A variety of mappers 110 are shown in FIG. 1, including a condition register (CR) mapper; a link/count (LNK/CNT) register mapper; an integer exception register (XER) mapper; a unified mapper (UMapper) for mapping general purpose registers (GPRs) and vector-scalar registers (VSRs); an architected mapper (ARCH Mapper) for mapping GPRs and VSRs; and a floating point status and control register (FPSCR) mapper.

The output from the setup block 108 is also input to a global completion table (GCT) 112 for tracking all of the instructions currently in the ISU. The output from the setup block 108 is also input to a dispatch unit 114 for dispatching the instructions to an issue queue. The embodiment of the ISU shown in FIG. 1 includes a CR issue queue (CR ISQ) 116, which receives and tracks instructions from the CR mapper and issues (120) them to the instruction fetch unit (IFU) 124 to execute CR logical instructions and move instructions. Also shown in FIG. 1 is a branch issue queue (Branch ISQ) 118, which receives and tracks branch instructions and LNK/CNT physical addresses from the LNK/CNT mapper. The Branch ISQ 118 can issue an instruction to the IFU 124 to redirect instruction fetching if a predicted branch address and/or direction was incorrect.

Instructions output from the dispatch logic, together with renamed registers from the LNK/CNT mapper, XER mapper, UMapper (GPR/VSR), ARCH Mapper (GPR/VSR), and FPSCR mapper, are input to the issue queue 102. As shown in FIG. 1, the issue queue 102 tracks dispatched fixed point instructions (Fx), load instructions (L), store instructions (S), and vector-and-scalar unit (VSU) instructions. As shown in the embodiment of FIG. 1, the issue queue 102 is broken up into two parts, ISQ0 1020 and ISQ1 1021, each holding N/2 instructions. When the processor is executing in single-threaded (ST) mode, the issue queue 102 can be used as a single logical issue queue containing both ISQ0 1020 and ISQ1 1021 to process all of the instructions (in this example, all N instructions) of a single thread.

When the processor is executing in multi-threaded (MT) mode, ISQ0 1020 can be used to process N/2 instructions from a first thread, and ISQ1 1021 is used to process N/2 instructions from a second thread.
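The ST/MT behavior of the split issue queue can be illustrated with a toy model. This is only a sketch: the class name `SplitIssueQueue`, the `dispatch` method, the fill-ISQ0-first policy in ST mode, and the mapping of threads to halves by parity are all illustrative assumptions, not details given in the description.

```python
class SplitIssueQueue:
    """Toy model: an N-entry issue queue built from two N/2-entry halves."""

    def __init__(self, n, multithreaded):
        self.half_capacity = n // 2
        self.multithreaded = multithreaded
        self.halves = ([], [])  # index 0 models ISQ0, index 1 models ISQ1

    def dispatch(self, thread_id, instr):
        """Place instr in a half; return the half index, or None if full."""
        if self.multithreaded:
            # MT mode: each thread is confined to its own half (assumed: by parity).
            candidates = [thread_id % 2]
        else:
            # ST mode: both halves act as one logical queue (assumed: ISQ0 fills first).
            candidates = [0, 1]
        for half in candidates:
            if len(self.halves[half]) < self.half_capacity:
                self.halves[half].append(instr)
                return half
        return None  # no free entry; a real dispatcher would stall


# ST mode: one thread can use all N = 4 entries.
st = SplitIssueQueue(n=4, multithreaded=False)
st_placement = [st.dispatch(0, f"i{k}") for k in range(5)]

# MT mode: thread 0 is limited to the N/2 = 2 entries of its own half.
mt = SplitIssueQueue(n=4, multithreaded=True)
mt_placement = [mt.dispatch(t, "op") for t in (0, 1, 0, 0)]
```

In this model a single thread in ST mode can occupy all N entries, while in MT mode a thread stalls once its own N/2-entry half is full, even if the other half has room.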

As shown in FIG. 1, the issue queue 102 issues instructions to execution units 104, which are split into two groups of execution units, 1040 and 1041. Both groups of execution units 1040 and 1041 shown in FIG. 1 include a full fixed point execution unit (Full FX0, Full FX1); a load execution unit (LU0, LU1); a simple fixed point, store data, and store address execution unit (Simple FX0/STD0/STA0, Simple FX1/STD1/STA1); and a floating point, vector multimedia execution, decimal floating point, and store data execution unit (FP/VMX/DFP/STD0, FP/VMX/DFP/STD1). Collectively, LU0, Simple FX0/STD0/STA0, and FP/VMX/DFP/STD0 form a load-store unit (LSU) 1042. Similarly, LU1, Simple FX1/STD1/STA1, and FP/VMX/DFP/STD1 form a load-store unit (LSU) 1043. The two LSUs 1042 and 1043 are together referred to as the LSU of the system 100.

As shown in FIG. 1, when the processor is executing in ST mode, the first group of execution units 1040 executes instructions issued from ISQ0 1020 and the second group of execution units 1041 executes instructions issued from ISQ1 1021. In an alternate embodiment of the invention, when the processor is executing in ST mode, instructions issued from both ISQ0 1020 and ISQ1 1021 in the issue queue 102 can be issued to execution units in either the first group of execution units 1040 or the second group of execution units 1041.

According to one or more embodiments of the invention, when the processor is executing in MT mode, the first group of execution units 1040 executes instructions of the first thread issued from ISQ0 1020 and the second group of execution units 1041 executes instructions of the second thread issued from ISQ1 1021.

The number of entries in the issue queue 102 and the sizes of other elements (e.g., bus widths, queue sizes) shown in FIG. 1 are intended to be exemplary in nature, as embodiments of the invention can be implemented for issue queues and other elements of a variety of different sizes. According to one or more embodiments of the invention, the sizes are selectable, or programmable.

In one or more examples, the system 100 is an OoO processor in accordance with the example embodiments. FIG. 2 is an exemplary block diagram of a processor architecture of an OoO processor in which an effective address directory (EAD) and the associated mechanisms for utilizing this EAD are implemented, according to one or more embodiments of the invention. As shown in FIG. 2, the processor architecture includes an instruction cache 202, an instruction fetch buffer 204, an instruction decode unit 206, and an instruction dispatch unit 208. Instructions are fetched by the instruction fetch buffer 204 from the instruction cache 202 and provided to the instruction decode unit 206. The instruction decode unit 206 decodes each instruction and provides the decoded instruction to the instruction dispatch unit 208. Depending on the type of instruction, the output of the instruction dispatch unit 208 is provided to the global completion table 210 and to one or more of the branch issue queue 212, the condition register issue queue 214, the unified issue queue 216, the load reorder queue 218, and/or the store reorder queue 220. The instruction type is determined through the decoding and mapping of the instruction decode unit 206. The issue queues 212-220 provide inputs to various ones of the execution units 222-240. The data cache 250, and the register files contained with each respective unit, provide the data for use with the instructions.

The instruction cache 202 receives instructions from the L2 cache 260 via the second level translation unit 262 and the pre-decode unit 270. The second level translation unit 262 uses its associated segment look-aside buffer 264 and translation look-aside buffer 266 to translate the addresses of fetched instructions from effective addresses to system memory addresses. The pre-decode unit partially decodes instructions arriving from the L2 cache and augments them with unique identifying information that simplifies the work of the downstream instruction decoders.

The instructions fetched into the instruction fetch buffer 204 are also provided to the branch prediction unit 280 if the instruction is a branch instruction. The branch prediction unit 280 includes a branch history table 282, a return stack 284, and a count cache 286. These elements predict the next effective address (EA) that should be fetched from the instruction cache. A branch instruction is a point in a computer program where the flow of control is altered. It is a low-level machine instruction generated from control constructs in a computer program, such as if-then-else or do-while statements. A branch can be not taken, in which case the flow of control is unchanged and the next instruction to be executed is the instruction immediately following it in memory, or it can be taken, in which case the next instruction to be executed is an instruction at some other place in memory. If the branch is taken, a new EA needs to be presented to the instruction cache.

The EA and associated prediction information from the branch prediction unit are written into the effective address directory 290. This EA is later confirmed by the branch execution unit 222. If correct, the EA remains in the directory until all instructions from this address region have completed their execution. If incorrect, the branch execution unit flushes the address and the corrected address is written in its place. The EAD 290 also includes a logic unit that facilitates using the directory as a CAM.

Instructions that read from or write to memory (such as load or store instructions) are issued to the LS/EX execution units 238, 240. The LS/EX execution unit retrieves data from the data cache 250 using a memory address specified by the instruction. This address is an effective address and needs to first be translated to a system memory address via the second level translation unit before being used. If an address is not found in the data cache, the load miss queue is used to manage the miss request to the L2 cache. To reduce the penalty for such cache misses, an advanced data prefetch engine predicts the addresses that are likely to be used by instructions in the near future. In this manner, data is likely to already be in the data cache when an instruction needs it, thereby avoiding the long latency of a miss request to the L2 cache.

The LS/EX execution units 238, 240 execute instructions out of program order by tracking instruction ages and memory dependences in the load reorder queue 218 and the store reorder queue 220. These queues are used to detect when out-of-order execution generated a result that is not consistent with an in-order execution of the same program. In such cases, the current program flow is flushed and performed again.

The processor architecture further includes the effective address directory (EAD) 290, which maintains the effective addresses of groups of instructions in a centralized manner, such that an effective address is available when needed but is not required to be passed through the pipeline. Moreover, the EAD 290 includes circuitry and/or logic for supporting out-of-order processing. FIG. 2 shows the EAD 290 being accessed via the branch prediction unit 280; however, it should be appreciated that circuitry may be provided for allowing various ones of the units shown in FIG. 2 to access the EAD 290 without having to go through the branch prediction unit 280.

Those of ordinary skill in the art will appreciate that the hardware in FIGS. 1-2 may vary depending on the implementation. Other internal hardware or peripheral devices, such as flash memory, equivalent non-volatile memory, or optical disk drives, may be used in addition to or in place of the hardware depicted in FIGS. 1-2. In addition, the processes of the example embodiments may be applied to a multiprocessor data processing system, other than the SMP system mentioned previously, without departing from the spirit and scope of the present invention.

Moreover, the data processing system 100 may take the form of any of a number of different data processing systems, including client computing devices, server computing devices, a tablet computer, a laptop computer, a telephone or other communication device, a personal digital assistant (PDA), or the like. In some examples, the data processing system 100 may be a portable computing device configured with flash memory to provide non-volatile memory for storing, for example, operating system files and/or user-generated data. Essentially, the data processing system 100 may be any known or later developed data processing system without architectural limitation.

As will be appreciated by one skilled in the art, the present invention may be embodied as a system, apparatus, or method. In one example embodiment, the mechanisms are provided entirely in hardware, e.g., circuitry, hardware modules, or units of a processor. However, in other example embodiments, a combination of software and hardware may be utilized to provide or implement the features and mechanisms of the example embodiments. The software may be provided, for example, in firmware, resident software, microcode, or the like. The various flowcharts set forth hereafter provide an outline of operations that may be performed by this hardware and/or combination of hardware and software.

In example embodiments in which the mechanisms are at least partially implemented in software, any combination of one or more computer-usable or computer-readable media storing this software may be utilized. The computer-usable or computer-readable medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), and the like.

Typically, for every load and every store instruction, an EA is converted to the corresponding RA. Such an EA-to-RA conversion is also performed for an instruction fetch (I-fetch). For retrieving an instruction from lower-order memory, such a conversion typically required an effective to real address table (ERAT). In the technical solutions described herein, the EA-to-RA conversion is not performed for every load and store instruction, but only in the case of load misses, I-fetch misses, and all stores.
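The translation policy in the preceding paragraph can be written as a small predicate. This is a minimal sketch under stated assumptions: the `Access` enum, the function name `needs_ra_translation`, and the `l1_hit` flag are hypothetical names introduced for illustration, and the sample trace is made up.

```python
from enum import Enum, auto


class Access(Enum):
    LOAD = auto()
    STORE = auto()
    IFETCH = auto()


def needs_ra_translation(kind, l1_hit):
    """Return True when the EA must be converted to an RA."""
    if kind is Access.STORE:
        return True    # every store is translated
    return not l1_hit  # loads and I-fetches are translated only on a miss


# Illustrative trace of (access kind, L1 hit?) pairs.
trace = [
    (Access.LOAD, True),
    (Access.LOAD, False),
    (Access.STORE, True),
    (Access.IFETCH, True),
    (Access.IFETCH, False),
]
translations = sum(needs_ra_translation(kind, hit) for kind, hit in trace)
```

A design that translates on every access would perform five translations for this trace; under the policy above, only the load miss, the store, and the I-fetch miss are translated.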

By using only the EA for these operations, the technical solutions facilitate removal of RA bits (for example, bits 8:51) from one or more data structures, such as the EA directory (also referred to as the L1 directory), LRQF entries, and LMQ entries. Further, the SRQ LHS RA compare logic is not executed if only the EA is being used. Removing such elements reduces the chip area of the processor, thus facilitating a reduction in chip area over typical processors.

Further, by using only the EA, the technical solutions herein eliminate the ERAT lookup on every load and store address generation. The technical solutions further eliminate RA bus switching throughout the unit and also avoid the fast SRQ LHS RA CAM. By not performing these operations, the technical solutions help the processor consume less power than typical processors.

Further yet, the technical solutions herein also facilitate improvements to L1 latency. For example, by eliminating the address conversion, the technical solutions herein are at least one cycle faster in determining the "final dval" than typical processors that perform the EA-to-RA conversion. Latency also improves because using only the EA (without RA conversion) eliminates "bad dval" conditions, such as a setup table multi-hit or a setup table hit with an RA miss. In a similar manner, the technical solutions herein facilitate improvements to L2 latency.

The technical challenges of using an EA-only LSU include being able to handle snoops from the L2. For example, the LSU needs to be able to perform a reverse RA-to-EA translation. Accordingly, the technical solutions herein facilitate converting an RA-based snoop from the L2 into an EA-based snoop to the LSU subunits.
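One way to picture the RA-to-EA conversion of a snoop is a table that is indexed in both directions. The sketch below is an assumption-laden illustration, not the patent's mechanism: the class `ReverseMap`, its methods, the 12-bit page offset (inferred from the EA(0:51) and RA(8:51) page fields), and the one-EA-per-RA simplification are all hypothetical.

```python
PAGE_OFFSET_BITS = 12  # assumed: bits 52:63 of the address are the page offset


class ReverseMap:
    """Toy EA<->RA page table with a reverse index for snoops."""

    def __init__(self):
        self._ea_to_ra = {}  # EA page -> RA page
        self._ra_to_ea = {}  # RA page -> EA page (assumes one EA per RA)

    def install(self, ea_page, ra_page):
        self._ea_to_ra[ea_page] = ra_page
        self._ra_to_ea[ra_page] = ea_page

    def snoop_to_ea(self, ra):
        """Convert an RA-based snoop address into an EA, or None if unmapped."""
        ea_page = self._ra_to_ea.get(ra >> PAGE_OFFSET_BITS)
        if ea_page is None:
            return None  # no EA in this toy table maps to the snooped RA
        offset = ra & ((1 << PAGE_OFFSET_BITS) - 1)
        return (ea_page << PAGE_OFFSET_BITS) | offset


table = ReverseMap()
table.install(ea_page=0x12345, ra_page=0xABCDE)
snooped_ea = table.snoop_to_ea(0xABCDE123)
```

The page offset passes through unchanged; only the page number is swapped, so the EA-based subunits can compare the result against their own EA fields.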

Further, an LSU based only on EAs has the technical challenge of handling same-thread synonyms (that is, two different EAs from one thread that map to the same RA). The technical solutions address this challenge by using either the synonym detection table (SDT) or the ERT eviction (ERTE) table described herein. For example, consider L1 accesses across LHS, SHL, and LHL, where a synonym is defined as follows:
Tid = w, EA(0:51) = x => RA(8:51) = z
Tid = w, EA(0:51) = y => RA(8:51) = z
In this way, different EAs correspond to the same RA. The technical solutions described herein facilitate rejecting the EA synonym and restarting with the corresponding primary EA.
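The synonym condition defined above (same thread, different EA pages, same RA page) can be sketched as a lookup keyed by thread and RA page. This is a toy illustration only: the class `SynonymFilter`, the `resolve` method, and the rule that the first EA seen becomes the primary EA are assumptions, and the sketch does not model the SDT or ERTE table structures themselves.

```python
class SynonymFilter:
    """Toy detector for same-thread EA synonyms."""

    def __init__(self):
        self._primary = {}  # (tid, ra_page) -> primary EA page

    def resolve(self, tid, ea_page, ra_page):
        """Return (is_synonym, ea_page_to_use).

        The first EA page seen for a (tid, RA page) pair is treated as the
        primary EA (assumed policy). A later, different EA page for the same
        pair is reported as a synonym, along with the primary EA to restart with.
        """
        primary = self._primary.setdefault((tid, ra_page), ea_page)
        return (primary != ea_page, primary)


x, y, z, w = 0x1000, 0x2000, 0x0ABC, 0
synonyms = SynonymFilter()
first = synonyms.resolve(w, x, z)         # establishes x as primary
second = synonyms.resolve(w, y, z)        # y is a synonym of x
other_thread = synonyms.resolve(1, y, z)  # different thread: not a same-thread synonym
```

Here `second` carries the primary EA `x`, which corresponds to rejecting the access that used `y` and restarting it with the primary EA.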

Referring again to the figures, FIG. 3 depicts a load-store unit (LSU) 104 of a processing core according to one or more embodiments of the present invention. The LSU 104 depicted facilitates execution in a 2-load, 2-store mode; however, it should be noted that the technical solutions described herein are not limited to such an LSU. The execution flow of the LSU is described below. From a load or store instruction, the EA (the effective address, as used by the programmer in a computer program) is generated. Similarly, an EA is also generated for an instruction fetch. Typically, the EA was converted to an RA (the real address, as used by the hardware after EA-to-RA translation) for every instruction, which required a larger chip area and frequent translations, among other technical challenges. The technical solutions described herein address such challenges by using only the EA (without translation to an RA) and using an effective real table (ERT) 255 to generate the RA only on load misses, I-fetch misses, and stores.

The LSU 104 includes a load reorder queue (LRQF) 218, where all load operations are tracked from dispatch to completion, similar to the LRQ 218 in typical LSU designs. The LSU 104 further includes a second load reorder queue, LRQE 225. When a load is rejected (because of a cache miss or a translation miss, or because a previous instruction it depends on was rejected), the load is taken out of the issue queue and placed in an LRQE entry, from which it is re-issued. The depicted LRQE 225 is partitioned into two instances (LRQE0 and LRQE1) for the two-load mode, with 12 entries each (24 entries total). In ST mode, no thread/pipe-based partition exists. In MT mode, T0 and T2 operations are launched on pipe LD0, and T1 and T3 operations are launched on pipe LD1, for relaunch. Here Tx denotes thread x; for example, T0 is thread 0, T1 is thread 1, T2 is thread 2, and T3 is thread 3. It should be noted that although the examples herein use four threads in MT mode, in other examples the MT mode may include concurrent execution of a different number of threads (such as 8, 16, or any other number). In one or more examples, the number of threads in MT mode is configurable. Further, although in the examples herein the LSU 104 uses two load pipes (LD0 and LD1), in other examples the number of pipes may be different (for example, 3, 4, or 8). In one or more examples, the LRQF 218 is partitioned into as many partitions as there are pipes.

As depicted, the LRQF 218 is partitioned into two instances (LRQF0 and LRQF1) for the two-load mode, with 40 entries per instance. The LRQF 218 uses circular in-order entry allocation, circular in-order entry drain, and circular in-order entry deallocation. Further, in MT mode, T0 and T2 operations are launched on pipes LD0 and ST0, and T1 and T3 operations are launched on pipes LD1 and ST1. In ST mode, the LRQF has no pipe/thread partition.

In one or more examples, in SMT4 mode, the LRQF 218 (and other structures described herein) is partitioned as follows: T0: LRQF0[0:19] circular queue; T1: LRQF1[0:19] circular queue; T2: LRQF0[20:39] circular queue; and T3: LRQF1[20:39] circular queue.

In one or more examples, in SMT2 mode, the LRQF 218 (and other structures described herein) is partitioned as follows: T0: LRQF0[0:39] circular queue and T1: LRQF1[0:39] circular queue. Further, in one or more examples, in ST mode, LRQF0[0:39] is a single circular queue and LRQF1 is a copy of LRQF0. For other data structures, a similar partition pattern is used, with the second instance being a copy of the first instance in ST mode.
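The per-mode partitioning of the two 40-entry LRQF instances can be summarized as a lookup from (mode, thread) to an entry range. A minimal sketch: the function name `lrqf_partition`, the mode strings, and the returned tuple layout are illustrative choices, while the ranges themselves are the ones listed in the text.

```python
def lrqf_partition(mode, thread):
    """Return (instance, first_entry, last_entry) for a thread's circular queue.

    instance 0 models LRQF0 and instance 1 models LRQF1.
    """
    if mode == "ST":
        if thread != 0:
            raise ValueError("ST mode has a single thread")
        return (0, 0, 39)  # LRQF1 only mirrors LRQF0 in this mode
    if mode == "SMT2":
        if thread not in (0, 1):
            raise ValueError("SMT2 mode has threads 0 and 1")
        return (thread, 0, 39)  # one whole instance per thread
    if mode == "SMT4":
        if thread not in (0, 1, 2, 3):
            raise ValueError("SMT4 mode has threads 0 to 3")
        instance = thread % 2     # T0, T2 -> LRQF0; T1, T3 -> LRQF1
        base = 20 * (thread // 2)  # T0, T1 -> lower half; T2, T3 -> upper half
        return (instance, base, base + 19)
    raise ValueError(f"unknown mode: {mode}")


smt4_layout = [lrqf_partition("SMT4", t) for t in range(4)]
```

The even/odd split matches the pipe assignment described earlier (T0 and T2 on LD0, T1 and T3 on LD1), so each instance only ever serves the threads of one pipe.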

In the case of a cross invalidation flush (XI flush), for the LRQF, NTC+1 flushes any thread that an XI or store drain from another thread hits, so that explicit L/L ordering flushes on syncs are not performed by the LSU 104 in the case of the XI flush.

All stores check against the LRQF 218 for SHL detection, upon which the LRQF 218 initiates a flush of the load, or of everything (any instruction/operation) after the store. Further, DCB instructions check against the LRQF 218 for SHL cases, upon which the LRQF 218 causes a flush of the load, or of everything after the DCB. Further, all loads check against the LRQF 218 for LHL detection (sequential load consistency), upon which the LRQF 218 causes a flush of the younger load, or of everything after the older load. In one or more examples, the LRQF 218 provides quad-word atomicity: the LQ checks against the LRQF 218 for quad atomicity and is flushed if it is not atomic. Further yet, in the case of LARX instructions, the LSU 104 checks against the LRQF 218 for larx-hit-larx cases, and in response flushes the younger LARX, or everything after the older larx instruction.
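The SHL check can be sketched as a search for younger loads that already executed against the address a store is now writing. This is a simplified illustration under stated assumptions: the function name `shl_flush_point` is hypothetical, ages are modeled as plain integers where a larger iTag is younger (real tags wrap), and addresses are compared for exact equality rather than by ERT ID, EA bits, and byte overlap.

```python
def shl_flush_point(store_itag, store_addr, executed_loads):
    """Return the iTag to flush from, or None if no younger load was hit.

    executed_loads: iterable of (itag, addr) pairs for loads that have
    already obtained their data.
    """
    hits = [
        itag
        for itag, addr in executed_loads
        if itag > store_itag and addr == store_addr
    ]
    # Flushing from the oldest offending load also removes any younger ones.
    return min(hits) if hits else None


loads = [(3, 0x100), (7, 0x200), (9, 0x200), (12, 0x300)]
flush_from = shl_flush_point(store_itag=5, store_addr=0x200, executed_loads=loads)
```

An LHL check has the same shape, with an older load in place of the store; the choice between flushing only the load and flushing everything after the store is left out of this sketch.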

Thus, the LRQF 218 facilitates tracking all load operations from issue to completion. Entries in the LRQF 218 are indexed with Real_Ltag (rltag), which is the physical location in the queue structure. The age of a load operation/entry in the LRQF 218 is determined with Virtual_Ltag (vltag), which is in order. The LRQF flushes a load using GMASK and performs a partial group flush using GTAG and IMASK. The LRQF logic can flush from the current iTag, from iTag+1, or from the precise load iTag.
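The distinction between rltag (where an entry sits) and vltag (how old it is) can be illustrated with a small circular queue. This is a sketch under assumptions: the class `CircularLoadQueue` and its methods are hypothetical, and vltag is modeled as an unbounded allocation counter, whereas hardware would use a bounded tag with wrap handling.

```python
class CircularLoadQueue:
    """Toy circular queue with in-order allocation and deallocation."""

    def __init__(self, size):
        self.size = size
        self.entries = [None] * size  # indexed by rltag (physical slot)
        self.next_vltag = 0           # next tag to allocate
        self.oldest_vltag = 0         # next tag to deallocate

    def allocate(self, payload):
        """Allocate the next slot in circular order; return (rltag, vltag)."""
        if self.next_vltag - self.oldest_vltag == self.size:
            raise OverflowError("queue full")
        vltag = self.next_vltag
        rltag = vltag % self.size
        self.entries[rltag] = (vltag, payload)
        self.next_vltag += 1
        return rltag, vltag

    def deallocate(self):
        """Free the oldest entry and return its payload."""
        rltag = self.oldest_vltag % self.size
        _, payload = self.entries[rltag]
        self.entries[rltag] = None
        self.oldest_vltag += 1
        return payload

    def is_older(self, rltag_a, rltag_b):
        """Compare ages by vltag, not by physical position."""
        return self.entries[rltag_a][0] < self.entries[rltag_b][0]


queue = CircularLoadQueue(size=4)
tags = [queue.allocate(f"ld{i}") for i in range(4)]
freed = queue.deallocate()       # frees slot 0
wrapped = queue.allocate("ld4")  # reuses slot 0 with a newer vltag
```

After the wrap, the entry in physical slot 0 is the youngest in the queue, which is why age has to come from the in-order vltag rather than from the rltag index.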

Further, the LRQF does not include the commonly used RA (8:51) field; it is instead EA-based and includes an ERT ID (0:6) and EA (40:51) (a saving of 24 bits). The LRQF page match on SHL and LHL is based on an ERT ID match. In addition, each LRQ entry includes a "Force Page Match" bit. When an ERT ID that matches the ERT ID of an LRQ entry is invalidated, the Force Page Match bit of that entry is set. The LRQ detects LHL, SHL, and store-ordering flushes involving any entry with Force Page Match = 1.

In this way, the LRQF 218 addresses the technical challenge of multiple CAM ports, which occupy chip area and consume power for address translation, by maintaining a partitioned load request queue. The load request queue is partitioned according to a predetermined number of instructions and a predetermined number of threads that the OoO processor can execute simultaneously.

The SRQ 220 of the LSU 104 has a structure similar to that of the LRQF 218, with two instances, SRQR0 and SRQR1, of 40 entries each. SRQR0 and SRQR1 use circular in-order entry allocation, circular in-order entry drain, and circular in-order entry deallocation. Further, the SRQ 220 is partitioned in the same manner as the LRQF 218 (for example, T0 and T2 operations are launched on pipes LD0 and ST0, T1 and T3 operations are launched on pipes LD1 and ST1, and there is no pipe/thread partition in ST mode). In ST mode both copies hold identical values, whereas in the MT modes the copies differ. In SMT4 mode both instances are further partitioned, and each thread is allocated 20 entries from the SRQ 220 (see the exemplary LRQF partitions described herein). In one or more examples, for store drain arbitration, intra-SRQ read pointer multiplexing is performed in SMT4 mode. Alternatively or in addition, multiplexing between SRQ0 and SRQ1 is performed in SMT2 and SMT4 modes. In ST mode, drain is performed only on SRQ0.
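By way of illustration only, the SRQ partitioning described above can be sketched as a mapping from an operating mode and a thread to a region of one SRQ instance. The names `SrqPartition` and `srq_partition` are hypothetical, and two details are assumptions rather than statements of the description: that SMT2 runs T0 on SRQR0 and T1 on SRQR1, and that in SMT4 the lower-numbered thread of each instance owns the lower 20 entries.

```python
from dataclasses import dataclass

ENTRIES_PER_INSTANCE = 40  # per the text: SRQR0 and SRQR1 hold 40 entries each


@dataclass(frozen=True)
class SrqPartition:
    instance: int     # 0 -> SRQR0, 1 -> SRQR1
    first_entry: int  # first entry index owned by the thread (inclusive)
    num_entries: int  # number of entries owned by the thread


def srq_partition(mode: str, thread: int) -> SrqPartition:
    """Return the SRQ region a thread allocates from (illustrative sketch)."""
    if mode == "ST":
        if thread != 0:
            raise ValueError("ST mode runs a single thread")
        # No pipe/thread partition: SRQR1 mirrors SRQR0, and drain uses SRQ0.
        return SrqPartition(0, 0, ENTRIES_PER_INSTANCE)
    if mode == "SMT2":
        if thread not in (0, 1):
            raise ValueError("SMT2 sketch assumes threads T0 and T1")
        # Assumption: T0 -> SRQR0 (pipes LD0/ST0), T1 -> SRQR1 (pipes LD1/ST1).
        return SrqPartition(thread, 0, ENTRIES_PER_INSTANCE)
    if mode == "SMT4":
        if thread not in (0, 1, 2, 3):
            raise ValueError("SMT4 sketch assumes threads T0..T3")
        instance = thread % 2  # T0, T2 -> SRQR0; T1, T3 -> SRQR1
        half = thread // 2     # assumption: T0/T1 lower half, T2/T3 upper half
        size = ENTRIES_PER_INSTANCE // 2
        return SrqPartition(instance, half * size, size)
    raise ValueError(f"unknown mode: {mode}")
```

Under these assumptions, `srq_partition("SMT4", 3)` yields entries 20 to 39 of SRQR1, matching the 20 entries per thread stated above.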

Here, Tx denotes thread x; for example, T0 is thread 0, T1 is thread 1, T2 is thread 2, and T3 is thread 3. It should be noted that although the examples herein use four threads in MT mode, in other examples the MT mode may include simultaneous execution of a different number of threads (such as 8, 16, or any other number). In one or more examples, the number of threads in MT mode is configurable. Further, although in the examples herein the LSU 104 uses two store pipes (ST0 and ST1), in other examples the number of store pipes may differ (e.g., 3, 4, 8, and so on). In one or more examples, the SRQR 220 is divided into as many partitions as there are store pipes.

Each entry of the SRQ 220 contains a store TID (0:1), an ERT ID (0:6), EA (44:63), and RA (8:51). To detect a load-hit-store (LHS), the LSU uses {store TID, EA (44:63)}, which eliminates the RA-based LHS alias check. The ERT ID is used to "catch" mis-speculation on a partial match of EA (44:63). The SRQ entry holds RA (8:51), which is translated at store agen (address generation) and is used only when a store request is sent to the L2 (that is, when the store instruction is drained, not when it is issued). Each SRQ entry also includes a "Force Page Match" bit, which is set when an ERT ID that matches the ERT ID of the SRQ entry is invalidated. The SRQ can detect an LHS involving any entry with Force Page Match = 1. For example, an LHS against an entry with Force Page Match = 1 causes the load instruction to be rejected. Further, a store drain forces a miss in the L1 cache when Force Page Match = 1 for the SRQ entry. This works in tandem with the "extended store hit reload" LMQ action.
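As a minimal sketch of the EA-based LHS check described above, the following compares a load against older stores using {TID, EA (44:63)} and then consults the ERT ID and the Force Page Match bit. The names `SrqEntry`, `check_load`, and `invalidate_ert_id` and the returned labels are hypothetical. What the hardware does on a partial match is not specified here, so the sketch only reports it. Bit 0 is taken to be the most significant bit of the 64-bit EA.

```python
from dataclasses import dataclass
from typing import Iterable, List


def ea_bits(ea: int, first: int, last: int) -> int:
    """Extract EA(first:last) from a 64-bit EA, where bit 0 is the MSB."""
    width = last - first + 1
    return (ea >> (63 - last)) & ((1 << width) - 1)


@dataclass
class SrqEntry:
    tid: int        # store TID(0:1)
    ert_id: int     # ERT ID(0:6)
    ea_44_63: int   # EA(44:63) of the store
    force_page_match: bool = False


def invalidate_ert_id(entries: List[SrqEntry], ert_id: int) -> None:
    """Set Force Page Match on every entry whose ERT ID was invalidated."""
    for entry in entries:
        if entry.ert_id == ert_id:
            entry.force_page_match = True


def check_load(tid: int, ea: int, ert_id: int,
               older_stores: Iterable[SrqEntry]) -> str:
    """Classify a load against older stores (first match wins in this sketch)."""
    key = ea_bits(ea, 44, 63)
    for store in older_stores:
        if store.tid != tid or store.ea_44_63 != key:
            continue                   # different thread or different EA(44:63)
        if store.force_page_match:
            return "reject"            # LHS against Force Page Match = 1
        if store.ert_id != ert_id:
            return "partial_match"     # EA(44:63) matched but the page did not
        return "lhs"                   # same thread, same page, same EA(44:63)
    return "none"
```

Because only 20 EA bits plus a 7-bit ERT ID are compared, no real address is needed for hazard detection in this sketch, which is the point the paragraph above makes.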

For example, for the LMQ, an LMQ address match is a match of {ERT ID, EA page offset (xx:51), EA (52:56)}. Further, the "Force Page Match" bit of each LMQ entry is set (= 1) when an ERT ID that matches the ERT ID of the LMQ entry is invalidated. The LMQ rejects a load miss if a valid LMQ entry [X] has Force Page Match = 1 and EA (52:56) of the load miss equals EA (52:56) of LMQ entry [X]. The LMQ also provides an extended store hit reload. For example, the LMQ suppresses reload enable if EA (52:56) of the reload equals EA (52:56) of SRQ entry [X] and SRQ entry [X] has Force Page Match = 1. Alternatively or in addition, the LMQ suppresses reload enable if EA (52:56) of LMQ entry [X] equals EA (52:56) of the store drain and the store drain has Force Page Match = 1.
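The LMQ conditions above are simple predicates over EA (52:56) and the Force Page Match bit; a minimal sketch follows. The type and function names are hypothetical, and only the conditions stated in the paragraph above are modeled.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class LmqEntry:
    valid: bool
    ea_52_56: int
    force_page_match: bool = False


@dataclass
class SrqLine:
    ea_52_56: int
    force_page_match: bool = False


def lmq_rejects_load_miss(miss_ea_52_56: int,
                          lmq: Iterable[LmqEntry]) -> bool:
    """Reject if a valid LMQ entry has Force Page Match = 1 and equal EA(52:56)."""
    return any(e.valid and e.force_page_match and e.ea_52_56 == miss_ea_52_56
               for e in lmq)


def suppress_reload_for_srq(reload_ea_52_56: int,
                            srq: Iterable[SrqLine]) -> bool:
    """Suppress reload enable if an SRQ entry with Force Page Match = 1 matches."""
    return any(s.force_page_match and s.ea_52_56 == reload_ea_52_56
               for s in srq)


def suppress_reload_for_drain(lmq_entry: LmqEntry, drain: SrqLine) -> bool:
    """Suppress reload enable if a draining store with Force Page Match = 1 matches."""
    return drain.force_page_match and lmq_entry.ea_52_56 == drain.ea_52_56
```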

To save further chip area, the illustrated LSU 104 folds the store data queue (SDQ) into the SRQ 220 itself. An operand whose size fits within an SRQ entry (e.g., 8 bytes) is stored in the SRQ entry itself. For wider operands (e.g., 16 bytes wide), such as vector operands, the SRQ in MT mode stores the operand using two consecutive entries of the SRQ 220. In ST mode, wider operands are stored across SRQ0 and SRQ1 (e.g., 8 bytes in each).
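As a rough sketch of the operand placement just described, the following returns the SRQ slots that would hold an operand. The function name, the tuple layout, and the treatment of an exactly 8-byte operand as fitting in one entry are assumptions; wrap-around of the circular queue is deliberately ignored.

```python
from typing import List, Tuple

ENTRY_BYTES = 8  # example entry data width taken from the text


def place_operand(size_bytes: int, mode: str, instance: int,
                  row: int) -> List[Tuple[int, int, int]]:
    """Return (instance, row, bytes) slots for a store operand (illustrative)."""
    if size_bytes <= ENTRY_BYTES:
        # Narrow operand: held in the SRQ entry itself.
        return [(instance, row, size_bytes)]
    if size_bytes != 2 * ENTRY_BYTES:
        raise ValueError("sketch covers operands of up to 8 bytes or 16 bytes")
    if mode == "ST":
        # Wide operand in ST mode: 8 bytes in SRQ0 and 8 bytes in SRQ1.
        return [(0, row, ENTRY_BYTES), (1, row, ENTRY_BYTES)]
    # Wide operand in an MT mode: two consecutive entries of one instance.
    return [(instance, row, ENTRY_BYTES), (instance, row + 1, ENTRY_BYTES)]
```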

The SRQ 220 queues operations of type store, barrier, DCB, ICBI, or TLB. A single stag is used for both store_agen and store_data. The SRQ 220 handles load-hit-store (LHS) cases (same thread only). For example, every issued load is checked against the SRQ 220 to ensure that no older store with a data conflict exists. A data conflict is detected, for example, by comparing the EA and data byte flags of the load against the older stores in the SRQ EA array.

SRQ entries are allocated at dispatch, when the dispatched instruction tags (itags) are filled into the correct row. SRQ entries are deallocated on store drain. In one or more examples, itag arrays hold "overflow" dispatches. For example, if the desired row in the SRQ (say, SRQ entry x) is still in use at dispatch, the information is written into the overflow itag array. When SRQ entry x is deallocated, the corresponding row of the SRQ overflow itag structure is read out and copied into the main SRQ itag array structure (the read of the overflow itag structure is gated by whether any valid entry exists in the overflow itag array for the given thread/region). The main SRQ0/1 itag array is searched by the CAM port (or half of it is searched in SMT4) to determine which physical row to write into on store issue, so that the ISU can issue stores based on the itag. The SRQ 220 sends the itag to the ISU when a store drains and is deallocated.
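The interplay between the main itag array and the overflow itag array can be sketched as follows. This is a software model only; the class name `SrqItagArrays` and the assumption of a single overflow slot per row are illustrative and not taken from the description. The linear search in `find_row` stands in for the CAM search on store issue.

```python
from typing import List, Optional


class SrqItagArrays:
    """Illustrative model of the main and overflow SRQ itag arrays."""

    def __init__(self, rows: int) -> None:
        self.main: List[Optional[int]] = [None] * rows
        self.overflow: List[Optional[int]] = [None] * rows

    def dispatch(self, row: int, itag: int) -> None:
        """Fill the itag into the desired row, or park it in overflow."""
        if self.main[row] is None:
            self.main[row] = itag
        elif self.overflow[row] is None:
            self.overflow[row] = itag   # desired row is still in use
        else:
            raise RuntimeError("no overflow slot free (sketch limitation)")

    def find_row(self, itag: int) -> Optional[int]:
        """Stand-in for the CAM search: physical row to write on store issue."""
        for row, stored in enumerate(self.main):
            if stored == itag:
                return row
        return None

    def deallocate(self, row: int) -> Optional[int]:
        """Drain a row; promote any overflow itag; return the drained itag."""
        drained = self.main[row]
        self.main[row] = self.overflow[row]  # copy overflow row into main
        self.overflow[row] = None
        return drained                       # would be sent to the ISU
```

In this model an itag parked in the overflow array is not visible to the search until the row it is waiting for has drained.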

FIG. 4 is an exemplary block diagram of an effective address directory (EAD) structure (L1 cache) 290 according to one example embodiment. In one or more examples, the EAD is part of the LSU 104. As shown in FIG. 3, the EAD 290 consists of one or more entries (e.g., entries 0 to N), each of which includes multiple fields of information about a group of one or more instructions. For example, in one example embodiment, each entry in the EAD 290 may represent 1 to 32 instructions. An entry in the EAD 290 is created in response to a fetch of an instruction that lies in a new cache line of the processor cache (e.g., the L2 cache 260 of FIG. 2). The entry is updated as additional instructions are fetched from the cache line. Each entry in the EAD 290 is terminated on a taken branch (i.e., a branch instruction fetched from the cache is resolved as "taken"), on a cache line crossing (i.e., the next fetched instruction lies in a different cache line from the current one), or on a flush of the processor pipeline (such as when a branch misprediction occurs).

As shown in FIG. 3, the fields of an EAD 290 entry include a base effective address 310, a first instruction identifier 320, a last instruction identifier 330, an end identifier 340, a global history vector field 350, a link stack pointer field 360, a branch taken identifier 370, and a branch information field 380. The EAD 290 is organized like an L1 data cache, as a set-associative structure. For example, in one or more examples, it has 32 indexes addressed by EA (52:56) and 8 ways selected using EA (0:51).
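The set-associative organization described above (32 indexes addressed by EA (52:56), 8 ways selected by EA (0:51)) can be sketched as a small directory model. The class and method names are hypothetical, replacement policy is left to the caller, and bit 0 is taken as the most significant bit of the 64-bit EA.

```python
from typing import List, Optional

NUM_SETS = 32   # indexes addressed by EA(52:56)
NUM_WAYS = 8    # ways selected by comparing EA(0:51)


def ea_bits(ea: int, first: int, last: int) -> int:
    """Extract EA(first:last) from a 64-bit EA, where bit 0 is the MSB."""
    width = last - first + 1
    return (ea >> (63 - last)) & ((1 << width) - 1)


class EaDirectory:
    """Illustrative EA-tagged directory; holds only tags, no data."""

    def __init__(self) -> None:
        self.tags: List[List[Optional[int]]] = [
            [None] * NUM_WAYS for _ in range(NUM_SETS)
        ]

    def install(self, ea: int, way: int) -> None:
        """Record EA(0:51) as the tag of the chosen way in set EA(52:56)."""
        self.tags[ea_bits(ea, 52, 56)][way] = ea_bits(ea, 0, 51)

    def lookup(self, ea: int) -> Optional[int]:
        """Return the hitting way, or None on a directory miss."""
        tag = ea_bits(ea, 0, 51)
        for way, stored in enumerate(self.tags[ea_bits(ea, 52, 56)]):
            if stored == tag:
                return way
        return None
```

A hit in such a directory is decided from the EA alone, so no address translation is involved in the lookup.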

The base effective address 310 is the starting effective address (EA) of the group of instructions. Each instruction in the group has the same base EA and an offset from that base EA. For example, in one example embodiment, the EA is a 64-bit address comprising bits 0:63. In one example embodiment, the base EA may comprise bits 0:56 of this EA, and bits 57:61 represent the offset from the base EA of a specific instruction within the group. Bits 62 and 63 point to a specific byte of each instruction. In the example embodiment, each address refers to an instruction that is 32 bits (i.e., 4 bytes) long, and each byte in memory is addressable. An instruction cannot be divided further into addressable subcomponents, so bits 62 and 63 of an instruction address are always zero. Bits 62 and 63 therefore need not be stored and can always be assumed by the EAD to be zero.
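The bit layout described above can be made concrete with a short sketch that splits a 64-bit instruction EA into its base (bits 0:56), instruction offset (bits 57:61), and byte bits (62:63), and joins them again. The function names are hypothetical; bit 0 is the most significant bit, consistent with bits 62 and 63 being the byte-within-instruction bits.

```python
from typing import Tuple


def split_instruction_ea(ea: int) -> Tuple[int, int, int]:
    """Return (base EA(0:56), offset EA(57:61), byte bits EA(62:63))."""
    base = ea >> 7              # upper 57 bits
    offset = (ea >> 2) & 0x1F   # 5-bit instruction offset within the group
    byte = ea & 0x3             # always 0 for 4-byte aligned instructions
    return base, offset, byte


def join_instruction_ea(base: int, offset: int) -> int:
    """Rebuild the EA from base and offset, assuming bits 62:63 are zero."""
    return (base << 7) | (offset << 2)
```

For example, the EA `0x10000084` splits into base `0x200001`, offset 1, and byte bits 0, and joining base and offset reproduces the original address.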

The first instruction identifier field 320 stores the effective address offset bits (e.g., bits 57:61 of the EA) of the first instruction in the group of instructions corresponding to the EAD 290 entry. The combination of the base EA from field 310 and the offset bits in the first instruction identifier field 320 yields the EA of the first instruction in the group represented by the EAD 290 entry. As described later, this field 320 may be used to recover the refetch address and branch prediction information, for example when the pipeline is flushed.

The last instruction identifier field 330 stores the effective address offset bits (e.g., bits 57:61 of the EA) of the last instruction in the group of instructions corresponding to the EAD 290 entry. The EAD logic updates this field as additional instructions in the group represented by the EAD 290 entry are fetched. The EAD logic stops updating this field 330 in a particular EAD 290 entry once that entry has been terminated because a cache line crossing or a taken branch was detected. The field then remains unchanged unless a pipeline flush occurs that clears part of the EAD entry. In that case, the EAD logic updates the field to store the effective address offset bits of the instruction that, as a result of the flush, is now the new last instruction in the entry. As described later, this field is ultimately used at completion to release the entry in the EAD 290.

The end identifier field 340 is used to indicate that the EAD 290 entry has been terminated and that no further instruction fetches will be made for the instruction group corresponding to that entry. An EAD 290 entry may be terminated for a variety of reasons, including a cache line crossing, a taken branch, or a pipeline flush. Any of these conditions may cause the value in the end field 340 to be set (e.g., to a value of "1") to indicate that the EAD entry has been terminated. As described in detail later, this field 340 is used at completion to release the entry in the EAD 290.

The global history vector field 350 identifies the global history vector of the first instruction fetch group that created the entry in the EAD 290. As described in detail later, the global history vector is used to identify the history of whether branches were taken or not taken. It is used for branch prediction and helps determine, based on the recent taken/not-taken history, whether the current branch is likely to be taken.

The link stack pointer field 360 identifies the link stack pointer of the first instruction fetch group that created the entry in the EAD 290. The link stack pointer is another branch prediction mechanism, described in detail later.

The branch taken field 370 indicates whether the group of instructions corresponding to the EAD 290 entry contained a branch instruction whose branch was taken. The value in the branch taken field 370 is updated in response to a branch instruction of the instruction group represented by the EAD 290 entry being predicted as taken. In addition, once a branch among the instructions of the EAD 290 entry is taken, the entry is also terminated by writing the appropriate value to the end field 340. Because the branch taken field is written speculatively at prediction time, its value may need to be replaced with the correct value when the branch is actually executed. For example, a branch may be predicted as not taken, in which case "0" is written to the branch taken field. Later, during execution, the branch may be found to have been taken, in which case the field must be corrected by writing a value of "1". This second write occurs only when the branch was mispredicted.

The branch information field 380 stores miscellaneous branch information that is used to update the branch prediction structures when a branch resolves, or the architected EA state when a branch instruction completes.

The ERT_ID field 385 stores an index into the ERT table (described further below) that identifies the corresponding ERT entry. When an ERT entry is invalidated, the associated ERT_ID is invalidated, and all associated entries in the L1 cache and the L1 D-cache are invalidated as well.

Entries in the EAD 290 are accessed using an effective address tag (eatag) that comprises at least two parts: a base eatag and an eatag offset. In one example embodiment, the eatag is a 10-bit value, which is much smaller than the 64-bit effective address. In one implementation, with a 10-bit eatag and an EAD 290 of 14 entries, the eatag consists of a first 5 bits, called the base eatag, that identify an entry in the EAD 290, and a second 5 bits, called the eatag offset, that provide the offset of a specific instruction within the group of instructions represented by that entry. The first of the 5 bits that identify the entry may be used as a wrap bit, indicating whether a wrap occurred when moving from the topmost entry to the bottommost entry of the EAD 290; this bit may be used for age detection. The second through fifth bits index into the EAD and identify the base EA of the instruction (i.e., EA (0:56)). The 5-bit offset value may provide, for example, bits 57:61 of the effective address of the specific instruction. This exemplary eatag is shown below.
eatag (0:9) = row (0:4) || offset (0:4)
row (0): wrap bit of the EAD, indicating whether a wrap occurred when moving from the topmost entry to the bottommost entry of the EAD.
row (1:4): index into the 14-entry EAD, used to determine EA (0:56) of the instruction.
offset (0:4): bits 57:61 of the EA of the instruction.
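The exemplary eatag layout above, eatag (0:9) = row (0:4) || offset (0:4), can be illustrated with a small pack/unpack sketch. The function names are hypothetical, and the list `base_ea_table` is an assumed stand-in for the base EA (0:56) values held in the 14 EAD entries.

```python
from typing import List, Tuple

EAD_ENTRIES = 14  # size of the EAD in the example implementation


def pack_eatag(wrap: int, index: int, offset: int) -> int:
    """Build a 10-bit eatag: row(0) = wrap, row(1:4) = index, then offset(0:4)."""
    if wrap not in (0, 1) or not 0 <= index < EAD_ENTRIES or not 0 <= offset < 32:
        raise ValueError("field out of range")
    row = (wrap << 4) | index
    return (row << 5) | offset


def unpack_eatag(eatag: int) -> Tuple[int, int, int]:
    """Return (wrap bit, EAD index, instruction offset) from a 10-bit eatag."""
    row = (eatag >> 5) & 0x1F
    return (row >> 4) & 0x1, row & 0xF, eatag & 0x1F


def eatag_to_ea(eatag: int, base_ea_table: List[int]) -> int:
    """Recover the instruction EA: base EA(0:56) from the EAD plus EA(57:61)."""
    _, index, offset = unpack_eatag(eatag)
    return (base_ea_table[index] << 7) | (offset << 2)
```

With this layout, a 10-bit tag is enough to recover a full 64-bit instruction EA, provided that the EAD entry it names is still allocated.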

FIG. 5 shows an exemplary effective real table (ERT) structure according to one or more embodiments of the invention. In one or more examples, the ERT 255 contains a total of 128 entries; it should be noted, however, that in other examples the total number of entries may differ, and that the number of entries may be selectable or programmable. Further, when the LSU 104 uses multiple pipes, each pipe has its own partition within the ERT 255. In one or more examples, the predetermined maximum number of entries in the ERT 255 is divided evenly among the pipes. For example, with two pipes (i.e., two separate instructions in parallel), the LSU maintains two partitions of the ERT 255 (e.g., ERT0 and ERT1), each containing 64 entries (half). For example, LD0 and ST0 use ERT0, and LD1 and ST1 use ERT1. In ST mode, the first partition of the ERT 255 is used together with another partition that is a copy of it; for example, ERT0 is used together with ERT1, which is a copy of ERT0. Alternatively, in one or more examples, when the LSU uses a single load pipe and a single store pipe, the entire ERT 255 is used as a single partition. Unless otherwise specified, the following description refers to any one partition of the ERT 255.

The ERT 255 contains valid ERT entries, which generally exist for any page that is active in the L1 I-cache or D-cache directory (EAD 290), or in an SRQ entry, an LRQF entry, or an LMQ entry. In other words, the ERT 255 is a table of all RPNs that are active in the LSU and the IFU (L1 DC, SRQ, LRQE, LRQF, LMQ, IC). In one or more examples, when the processor 106 is operating in ST mode, all entries in the ERT 255 are used for the single thread being executed. Alternatively, in one or more examples, the entries in the ERT 255 are divided into multiple sets, and in ST mode each set holds the same content. For example, if the ERT 255 contains a total of 128 entries and supports up to two threads, then when the processor is operating in ST mode the ERT 255 contains two sets of 64 entries each, and the two sets hold the same content.

Alternatively, when the processor 106 is operating in MT mode, the ERT entries are divided among the threads being executed. For example, with two threads, the ERT entries are divided into two equal sets, with the first set of entries associated with the first thread and the second set associated with the second thread. For example, one copy, ERT0, serves L1 misses on the LD0 pipe, launches on the ST0 pipe, and T0/T2 I-fetches; ERT0 handles T0 in SMT2 mode and T0/T2 in SMT4 mode. The other copy, ERT1, serves L1 misses on the LD1 pipe, launches on the ST1 pipe, and T1/T3 I-fetches; ERT1 handles T1 in SMT2 mode and T1/T3 in SMT4 mode.

In one or more examples, each ERT entry includes at least the following ERT fields: ERT_ID (0:6), Tid_en (0:1), page size (0:1), EA (0:51), and RA (8:51). The ERT_ID field is a unique index for each ERT entry. For example, the ERT_ID may include a sequential number that identifies the ERT entry. The ERT_ID is stored in the ERT_ID field 285 of the EAD 290 and in other data structures used by the LSU. The Tid_en field indicates whether the entry is enabled for use in MT mode and, in one or more examples, indicates the thread identifier of the instruction that is using the ERT entry. The page size indicates the size of the memory page to which the ERT entry refers. The RA contains the real address associated with the ERT entry.

The LSU refers to the ERT 255 only when the RA is needed to complete execution of an instruction. As described herein, the ERT 255 is consulted by the LSU for the following four functions: (1) an I-fetch, load, or store that misses the L1 cache; (2) a store from another thread within the core; (3) a snoop (XI) from another core; and (4) TLB and SLB invalidation.

In the first case, in which an I-fetch, load, or store misses the L1 cache, the EA and thread_id are used to index into the ERT 255, and if a valid ERT entry exists, the RA from the corresponding ERT entry is sent to the L2 cache. On an ERT miss, that is, when no valid ERT entry exists for the EA and thread_id, the SLB/TLB is used.

In the second case, a store from another thread within the core, a store drained from the SRQ checks the ERT 255 and the ERTE table (described further below) for a hit from another thread. If there is no hit from a different thread, then no load from another thread is using the same RA. If there is a hit from a different thread using the same RA, the LSU checks the LRQ. Although rare, a hit from another thread exists when the RA is in use by another thread. In that case, the LSU searches the ERT table 400 to find the EA or EAs associated with the common RA. Those EAs are then used to look into the LRQ for a match (any store issue in that cycle is rejected). Because the LRQ is partitioned per thread, the LSU looks only into the LRQ of the relevant thread. If one or more matching loads exist in the LRQ, the LSU flushes the oldest of the matching loads.

In the third case, a snoop from another core of the processor, the LSU operates as in the second case and checks for a hit from any of the other threads being executed. In the case in which the TLB/SLB is invalidated, the ERT 255 is invalidated as well.

FIG. 6 shows a flowchart of an exemplary method for accessing memory in order to execute an instruction by the LSU, according to one or more embodiments of the invention. The instruction may be a load, a store, or an instruction fetch of the OoO processor 106. As shown at 505 and 510, on receiving the instruction the LSU uses the parameters of the instruction to check whether the EAD 290 contains an entry corresponding to that instruction. In one or more examples, the parameters used for the check include, among others, the thread identifier, the page size, and the EA.

If the LSU encounters an EAD hit in the EAD 290 (i.e., the EA of the instruction matches an entry in the EAD table 300), the LSU reads the content of the matching EAD entry, as shown at 520, and determines the corresponding ERT entry. Each EAD entry contains an ERT_ID (0:6) field 285. As described above, when an ERT entry is invalidated, the associated ERT_ID is invalidated, and all associated entries in the EAD table 300 are invalidated as well. An EAD hit therefore implies an ERT hit, because the ERT_ID field 285 can be used to find the ERT entry for the load/store instruction. Accordingly, on an EAD hit, after identifying the corresponding EAD entry, the LSU reads the ERT_ID from the EAD entry and sends it to the SRQ, the LMQ, or the LRQF, or a combination thereof, as shown at 530. The SRQ, the LMQ, or the LRQF, or a combination thereof, use the EA from the identified EAD entry. For a store instruction, which uses the RA, the RA is read from the ERT entry in order to access the L2, as shown at 540 and 545. Because the RA is not used anywhere other than for store instructions, a core that implements the technical solutions herein is referred to as an EA-only core.

Now consider the case in which the instruction misses in the EAD 290, that is, no entry in the EAD table 300 matches the EA of the instruction. As shown at 550, the thread_id and the EA are compared against each entry of the ERT 255. If an ERT hit occurs, that is, an ERT entry matches the parameters, the LSU reads RA(8:51) from the ERT entry, as shown at 555 and 530. For a load request, the LSU sends the RA to the L2 cache for the access (530). For a store instruction, as shown at 540 to 545, the LSU stores the RA in the SRQ and later sends the RA to the L2 cache when the store drains to the L2 cache.

If an ERT miss occurs, the LSU initiates a reload of the ERT 255, as shown at 555 and 560. In addition, ERT entry replacement is initiated. ERT entry replacement is LRU-based, and during this process the LSU ensures that synonyms within the out-of-order window are tracked.
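The lookup order described above (EAD first, then ERT, then an ERT reload) can be sketched as a small software model. This is a minimal sketch in Python, not the hardware design: the names `LsuModel`, `ErtEntry`, and `lookup` are hypothetical, the EAD is keyed by page rather than by cache line, and a single 4 KiB page size is assumed.

```python
from dataclasses import dataclass

PAGE_SHIFT = 12  # assumption: one fixed 4 KiB page size


@dataclass
class ErtEntry:
    ert_id: int
    thread_id: int
    ea_page: int  # effective page number
    ra_page: int  # real page number


class LsuModel:
    """Toy model of the EAD -> ERT lookup order (hypothetical names)."""

    def __init__(self):
        self.ead = {}  # (thread_id, ea_page) -> ert_id
        self.ert = {}  # ert_id -> ErtEntry

    def _ra(self, ert_id, ea):
        # Combine the real page number with the untranslated page offset.
        offset = ea & ((1 << PAGE_SHIFT) - 1)
        return (self.ert[ert_id].ra_page << PAGE_SHIFT) | offset

    def lookup(self, thread_id, ea, is_store):
        page = ea >> PAGE_SHIFT
        ert_id = self.ead.get((thread_id, page))
        if ert_id is not None:
            # An EAD hit implies an ERT hit; only stores need the RA.
            ra = self._ra(ert_id, ea) if is_store else None
            return ("ead_hit", ert_id, ra)
        for entry in self.ert.values():
            if entry.thread_id == thread_id and entry.ea_page == page:
                return ("ert_hit", entry.ert_id, self._ra(entry.ert_id, ea))
        # An ERT miss would trigger an ERT reload in the method above.
        return ("ert_miss", None, None)
```

In this sketch a load that hits the EAD never produces an RA, which mirrors the point that translation is skipped on the load path unless the EAD misses.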

Thus, by implementing the above method for loads, no address translation is performed when there is an EA hit in the EA-based L1 directory. This improves on a typical processor, in which the L1 directory is RA-based and a load that misses in the L1 directory causes the EA to be sent to an EART table for translation in order to obtain the RA that is sent to the L2 directory and beyond.

Further, for stores, the method described herein requires the LSU to look up the ERT table to determine the RA, which is then stored in the SRQR so that it can be drained to the caches (L1, L2, memory) when the store drains from the SRQ. The SRQR holds all the RAs for the stores. The RA is stored only for draining to the nest (that is, the L2, memory, and other units of the memory subsystem). Unlike in typical solutions, the RA is not used for out-of-order hazard detection of any type, whether load-hit-store, store-hit-load, or load-hit-load. The RA calculation for a store occurs before the store completes, because the LSU cannot handle an interrupt for a store after the store has completed (a store can generate an address-translation-related interrupt, which is handled before the store completes). Here, the RA calculation is performed when the store is issued (from the SRQR), which spares the LSU from having to perform the address translation. A store is thus issued and executed out of order, then completed in order, and subsequently drained from the SRQ in order. Until a store drains, no other thread or core knows about it (only the current thread does). After the store drains from the SRQ, it is written to the L1 (if the line already exists in the L1) and to the L2 cache (if caching is enabled), at which point the store becomes known to all other threads and cores in the system 100.

For an instruction fetch that misses the EA-based L1 I-cache, the EA is translated to an RA using the ERT 255, and the RA is sent to the nest to fetch the I-cache line. Here, LHS (load-hit-store), SHL (store-hit-load), and LHL (load-hit-load) are all determined based on the EA and the ERT index stored in the directory entries of the EA-based L1 cache (EAD 290). Every entry in the EAD table 300 has a valid translation in the ERT table 400, and that translation can be used once the LHS, SHL, and LHL have been determined. If an ERT entry is invalidated, the corresponding L1 cache entries are invalidated.

The LRQF, the load reorder queue, ensures that all load operations are tracked from dispatch to completion. If a load is rejected (because of a cache miss or a translation miss, or because an earlier instruction on which the load depends was rejected), the load is removed from the issue queue and placed in the LRQE, from which it is reissued.

FIG. 7 shows a flowchart of a method for reloading the ERT, according to one or more embodiments of the invention. An ERT reload occurs in response to an ERT miss and causes an entry in the ERT to be created or updated based on that miss. The ERT receives the RA to be added to the ERT 255 and compares that RA with each entry in ERT0 and ERT1, as shown at 605. If the RA is not present in the ERT 255 and a new entry can be created, the ERT 255 creates a new entry with a new ERT_ID to store the RA, as shown at 610 and 615. The new entry is created in ERT0 or ERT1 depending on whether the executing thread is the first thread or the second thread, respectively. If the processor is operating in ST mode, ERT0 is updated. If the ERT 255 has no free slot for the new entry, an existing entry is replaced based on a least-recently-used or other policy, as shown at 615.

If an existing entry in the ERT 255 is found with the same RA as the received RA (the RA being reloaded), the ERT 255 compares the page size(0:1) of the existing entry with the page size of the received RA, as shown at 620. If the page size of the existing entry is smaller than that of the RA being reloaded, the existing entry for that RA is removed from the ERT 255 and a new entry with a new ERT_ID is added for the RA with the larger page size, as shown at 625. If the existing entry has the same or a larger page size and the implementation uses the SDT, an entry for the RA being reloaded is created in the SDT, as shown at 627. Note that this operation need not be performed if the LSU uses the ERTE.

If the page size of the existing entry is the same as that of the RA being reloaded, the ERT 255 checks whether the existing entry is in the local ERT of the executing thread, as shown at 630. Here, the local ERT is the ERT associated with the thread being executed (for example, ERT0 for the first thread and ERT1 for the second thread). If the RA hit is in the other ERT (that is, the ERT that is not the local ERT), the ERT 255 creates a new entry in the local ERT with an ERT_ID that matches the ERT_ID in the non-local ERT, as shown at 632. For example, if the RA hit is in ERT1 for an instruction being executed by thread 0, an entry is created in ERT0 with an ERT_ID that matches the entry in ERT1.

If the RA hit is in the local ERT instance and the EA also matches, then both the EA and the RA match the existing entry, and yet an ERT miss occurred for this thread and triggered the ERT reload. The ERT takes this to indicate that the two threads share the same EA-to-RA mapping (with the same page size). Accordingly, the tid_en(0) or tid_en(1) bit of the existing matching entry that corresponds to the reloading thread is turned on to indicate this case, as shown at 634.

If the RA hit is in the local ERT instance, the EA does not match the existing entry, and the existing entry is for the same thread as the RA being reloaded, the ERT identifies an aliasing case in which two different EAs map to the same RA from the same thread, as shown at 636. If the processor uses an SDT-based implementation, a synonym entry that maps to the ERT_ID and EA offset(40:51) of the existing matching entry is installed in the SDT. If the processor uses an ERTE-based implementation, the LSU rejects the instruction until it becomes non-speculative, and then removes the entry from the ERT and adds the entry to the ERTE.

If the RA hit is in the local ERT instance, the EA does not match the existing entry, and the existing entry is for a different thread, the ERT identifies an aliasing case in which two EAs map to the same RA from different threads, as shown at 638. If the processor uses an SDT-based implementation, a synonym entry that maps to the ERT_ID and EA offset(40:51) of the existing matching entry is installed in the SDT. If the processor uses an ERTE-based implementation, a new local ERT entry is added with a new ERT_ID and with tid_en valid only for the thread that took the ERT miss.
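The four outcomes for an RA hit with an equal page size (steps 632 to 638) form a small decision tree, which the following sketch makes explicit. It is only an illustration under stated assumptions: the function name, its boolean inputs, and the `Action` enum are hypothetical, and the enum values simply reuse the flowchart step numbers as labels.

```python
from enum import Enum


class Action(Enum):
    # Values reuse the flowchart step numbers purely as labels.
    COPY_TO_LOCAL_ERT = 632   # hit in the non-local ERT
    SET_TID_EN = 634          # same EA and RA: threads share the mapping
    SAME_THREAD_ALIAS = 636   # different EA, same thread
    CROSS_THREAD_ALIAS = 638  # different EA, different thread


def classify_same_size_hit(hit_in_local_ert, ea_matches,
                           entry_thread, reload_thread):
    """Classify an ERT reload whose RA hit has the same page size."""
    if not hit_in_local_ert:
        return Action.COPY_TO_LOCAL_ERT
    if ea_matches:
        return Action.SET_TID_EN
    if entry_thread == reload_thread:
        return Action.SAME_THREAD_ALIAS
    return Action.CROSS_THREAD_ALIAS
```

What happens after classification (an SDT install, a reject until non-speculative, or a new local ERT entry) depends on whether the implementation is SDT-based or ERTE-based, and is deliberately left out of the sketch.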

The above method ensures that, in an ERTE-based implementation, two threads that have the same RA but different EAs use two different ERT entries, and that, in an SDT-based implementation, when two threads have the same RA but different EAs, one of the translations uses an ERT entry and the other uses an SDT entry. The ERT entry thus supports the case in which the same EA and the same RA are used across different threads by including the tid_en field in the ERT entry. For example, Tid_en(0:1) = {tid0en, tid1en} in the ERT0 instance and Tid_en(0:1) = {tid1en, tid1en} in the ERT1 instance. Further, the ERT supports the case in which the same EA corresponds to different RAs across different threads by holding multiple entries in ERT0 and ERT1, each with its thread identifier. The ERT also supports the case of different EAs corresponding to the same RA (whether in the same thread or in different threads). Two cases are described further below, depending on whether the implementation uses the ERTE or the SDT.

If the LSU uses an implementation with the SDT (synonym detection table), then when a new instruction is detected at ERT reload with a different EA corresponding to the same RA, the LSU installs an entry in the SDT instead of the ERT 255. On an SDT hit, the instruction is relaunched using the EA of the original (earlier) ERT entry. If the page size of the new synonym is larger than the page size in the existing ERT entry with the matching RA, the existing ERT entry is replaced by the new synonym (with the larger page size) instead of the synonym being installed in the SDT. The old ERT entry is eventually reinstalled in the SDT as a synonym.

Alternatively, if the LSU uses an implementation with the ERTE and the instruction with the different EA corresponding to the same RA is from a different thread, the LSU installs a new entry in the ERT table with the appropriate Tid_en enabled. If the instruction is from the same thread, the LSU rejects the load/store until it becomes non-speculative. The LSU then removes the existing ERT entry and places it in the ERTE table, tagged with the Itag of the youngest in-flight instruction from the thread. The LSU further installs the new EA-RA pair in the ERT table 400. This guarantees that two different EAs are never mapped to the same RA from the same thread.

Further, referring again to the ERT, consider the case in which the LSU receives a snoop from another core of the processor 106. Snoops can come from a different core in the system (a snoop indicates that another core or thread is modifying data at the same real address). The LSU also checks stores from a thread within the core as potential snoops to the other threads within the core. All snoops (from other cores) and stores (from other threads within the core) carry an RA. In such cases, the LSU reverse-translates the RA to determine the corresponding EA, ERT_ID, and page size based on the ERT 255. The LSU compares this information with the ERT_ID, PS, and EA(40:56) stored in each of the following structures to detect a snoop hit and take the appropriate action. For example, if a snoop hit is detected in an LRQF entry, the LSU flags a potential load-hit-load out-of-order hazard. If a snoop hit is detected in the EAD 290 and the snoop is from a different core, the LSU initiates an L1 invalidate. If the store is from another thread to a shared line, the line automatically picks up the new store and is updated.

The technical solutions described herein thus facilitate a reduction in the chip area of the LSU by tracking only one address (the EA). Further, they allow the processor core to run in a 2-load/2-store mode with split load/store queues, further reducing the CAM ports for translation and the power consumed by translation. Further, by using only the EA, they have the advantage that translation to an RA is not performed in the load/store path unless an EAD miss occurs. Further, detecting hazards such as LHL, SHL, and LHS and suppressing DVAL in time do not cause timing problems. Because the LSU uses only the EA, detection of LHS, SHL, and LHL could fail when two different EAs map to the same RA. The technical solutions described herein address this technical challenge by using the EA and the ERT index from the EAD. Further, on detecting an EA synonym, the LSU handles the instruction by using the SDT or the ERTE table for instructions in the OoO window.

If the LSU uses the SDT (as opposed to the ERTE) and a snoop hit is in the LMQ, the LSU also updates the LMQ entry so that it is not stored in the L1 D-cache. The SRQ entries are not used for snoops in the SRQ; they are used only for LHS checks of the EA-miss, RA-hit type, and a new SDT entry is created for the snoop hit.

FIG. 8 shows an exemplary structure of a synonym detection table (SDT) 700 according to one or more embodiments of the invention. Although the depicted example contains 16 entries, in other examples the SDT 700 may contain a different number of entries. The SDT 700 is common across the multiple threads and pipes of the LSU 104. For example, LD0, LD1, ST0, and ST1 all access the entries in the SDT 700, and the SDT 700 does not contain a separate partition for each.

An entry in the SDT 700 includes at least the following fields: an issue address {issue Tid(0:1), issue EA(0:51)}, a page size(0:1) (for example, 4k, 64k, 2MB, 16MB), and a relaunch address {EA(40:51), ERT ID(0:6)}. The Tid (thread identifier) field indicates which thread of the OoO processor is executing the instruction associated with the entry in the SDT 700. For an instruction whose launch misses the L1, the LSU compares the instruction against the SDT 700. If the launched instruction hits the SDT on the original-address compare, the LSU rejects the instruction and relaunches it using the corresponding replacement address from the SDT entry. For example, the LSU uses the replacement address(40:51) for the SRQ LHS and "force matches" the ERT ID in the execution pipeline.
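As a rough illustration of the SDT entry layout and the original-address compare, the sketch below models the table as a plain list. The names `SdtEntry` and `sdt_lookup` are hypothetical, and the sketch ignores page-size-dependent masking of the issue EA, which a real compare would presumably need.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SdtEntry:
    issue_tid: int        # issue Tid(0:1)
    issue_ea: int         # issue EA(0:51) of the synonym
    page_size: int        # page size(0:1), encoded
    relaunch_ea_low: int  # relaunch EA(40:51)
    ert_id: int           # relaunch ERT ID(0:6)


def sdt_lookup(sdt, tid, ea_0_51):
    """Return (relaunch EA bits, ERT ID) on an SDT hit, else None.

    Simplification: compares the full issue EA and ignores page size.
    """
    for entry in sdt:
        if entry.issue_tid == tid and entry.issue_ea == ea_0_51:
            return (entry.relaunch_ea_low, entry.ert_id)
    return None
```

On a hit, the returned pair stands in for the replacement address with which the rejected instruction would be relaunched.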

As described herein, entries are added to the SDT 700 during ERT reload. For example, during an ERT reload, the reload RA is compared against the valid ERT entries. If an ERT entry with a matching RA already exists, and it is not the EA-hit case in which only an additional tid_en bit is set in the original ERT entry, EA(32:51) is read from the existing ERT entry and an entry is installed in the SDT 700 instead of an entry being added to the ERT 255.

Because the SDT 700 has a limited number of entries, entries are replaced. In one or more examples, entries are replaced based on a least recently used (LRU) policy or any other policy. In one or more examples, if an SDT entry is replaced, a subsequent launch that uses the secondary address re-triggers the SDT entry install sequence. Further, a CAM clears any SDT entry whose ERT ID matches an invalidated ERT entry.

FIG. 9 shows a flowchart of a method for performing an ERT and SDT EA swap, according to one or more embodiments of the invention. In one or more examples, the LSU performs the swap when the ERT entry and the SDT entry have the same page size. The swap improves the efficiency of the processor 106 when different EAs correspond to the same RA in different instructions, whether in the same thread or in different threads. For example, consider two instructions x and y such that EAx => RAz and EAy => RAz. If EAx misses the ERT first, that is, before EAy, the LSU installs an ERT entry with the mapping of EAx to RAz, as described herein. If EAy subsequently misses the ERT, the LSU looks up the ERT using RAz, gets an RA hit, and installs in the SDT 700 an entry with original address = EAy and replacement address = EAx.

If most subsequent accesses to RAz then use EAy, the LSU has to use the SDT more often than the EAD itself. In one or more examples, a technical solution that improves the efficiency of the LSU by reducing such frequent SDT lookups includes incrementing a counter in each SDT entry. As shown at 810 in FIG. 8, the LSU launches an instruction with an ERT ID that matches the ERT ID from an SDT entry. If the ERT ID of the SDT entry matches, the LSU further compares the EA of the launched instruction with the original EA in the SDT entry, as shown at 820. If the SDT entry has an original address value that matches the EA from the instruction, the counter of the SDT entry is incremented, as shown at 830 and 835. If the launched instruction has an EA that differs from the original address of the SDT entry, the counter of the SDT entry is reset, as shown at 840.

In one or more examples, the counter is a 4-bit field, which implies a maximum value of 15. It should be understood that in other examples the field may have a different length, a different maximum value that is used as the threshold, or both. For example, after the instruction is launched, the counter value is compared with the threshold, as shown at 845 and 850. If the counter is below the threshold, the LSU continues to operate as described. If the counter exceeds the threshold or, in some cases, equals it, the LSU invalidates the ERT entry corresponding to the SDT entry, as shown at 860. For example, the ERT entry with the ERT ID from the SDT entry is invalidated. Invalidating the ERT entry causes the corresponding entries to be invalidated from the EA directory, the LRQF, the LMQ, and the SRQ.
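The counter behavior (increment on a matching original EA, reset on a different EA, invalidate at a threshold) can be illustrated with a short sketch. The class name `SdtSwapCounter` and its `observe` method are hypothetical, the counter is assumed to saturate at the 4-bit maximum, and the sketch picks the "equals the threshold" variant of the comparison.

```python
COUNTER_MAX = 15  # 4-bit field


class SdtSwapCounter:
    """Per-SDT-entry usage counter (hypothetical software model)."""

    def __init__(self, threshold=COUNTER_MAX):
        self.count = 0
        self.threshold = threshold

    def observe(self, launch_ea, sdt_original_ea):
        """Update on a launch whose ERT ID matched this SDT entry.

        Returns True when the corresponding ERT entry should be
        invalidated so that the synonym EA can take its place.
        """
        if launch_ea == sdt_original_ea:
            # Assumption: the counter saturates rather than wraps.
            self.count = min(self.count + 1, COUNTER_MAX)
        else:
            self.count = 0
        return self.count >= self.threshold
```

With a threshold of 3, three consecutive launches that use the synonym EA would trigger the invalidation, while any intervening launch with another EA restarts the count.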

Further, the LSU addresses the technical challenge of an exception in a launched instruction that requires the original EA in order to finish, as follows. For example, consider the case in which a launched instruction hits the SDT and is to be relaunched with the replacement address from the SDT entry instead of the original launch address, but an exception is taken that requires the original EA in order to finish. Such a condition can occur for DAWR/SDAR and the like.

An LSU that implements the technical solutions described herein addresses this challenge by maintaining the original address in a queue in the LRQE. The LRQE also maintains an SDT hit flag (bit) and an SDT index(0:3) for each LRQE entry. On relaunch, the SDT index is read one cycle early to obtain the replacement address. Before relaunch, the LRQE further multiplexes between the address of the LRQE entry (the original address) and the replacement address (read from the SDT). For exception cases such as the above, in which the original address is needed in order to finish, the LRQE includes an additional SDT hit override flag (bit) for each entry, which is set on a DAWR partial match and the like. The LRQE relaunches the case in which there was an SDT hit that finishes with an exception and forces the original address to be launched. SRQ relaunch is similar to the LRQE relaunch described herein, with the SDT hit override flag used when it is determined before relaunch that the instruction will finish with an exception.
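The address selection at relaunch amounts to a two-input multiplexer controlled by the SDT hit flag and the SDT hit override flag. The following sketch shows that selection only; the function name and argument names are hypothetical, and the SDT is reduced to a list of replacement addresses indexed by the SDT index.

```python
def relaunch_address(original_ea, sdt_hit, sdt_index,
                     sdt_hit_override, sdt_replacements):
    """Pick the address used to relaunch an LRQE entry.

    The replacement address is used only on an SDT hit that has not
    been overridden by an exception needing the original EA.
    """
    if sdt_hit and not sdt_hit_override:
        return sdt_replacements[sdt_index]
    return original_ea
```

When the override flag is set, the original address wins even though the entry recorded an SDT hit, which corresponds to the exception cases described above.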

FIG. 10 shows an ERT eviction (ERTE) table 900 according to one or more embodiments of the invention. The ERTE table 900 makes it easier for the LSU to track the rows evicted (or invalidated) from the ERT 255. It further makes it easier to check, when an entry is being created in the ERT 255, whether a different EA for the same RA exists in the same thread. The ERTE table 900 is shared by all the simultaneous threads. In one or more examples, part of the ERTE table 900 is reserved for NTC entries. An entry in the ERTE table 900 includes thread ID, Itag, EA, and RA fields. In one or more examples, an ERTE table entry may include additional fields. In one or more examples, the thread ID may be a 4-bit field.

The ERTE table 900 can be regarded as a combination of two tables with a one-to-one correspondence (ERT_EA and ERT_RA). The ERT_EA table is looked up by EA, and the ERT_RA table is looked up by RA. In one or more examples, each table contains 64 entries, while in other examples the number of entries may be variable or programmable. When an EA-to-RA translation is removed from the ERTE table 900, the associated cache lines in the EAD table 300 are also invalidated. In this way, the ERT 255 is a superset of all the translations in the processor core (excluding the TLB and SLB).

The ERTE table 900 tracks all the translations that are not in the ERT 255 but are used by in-flight instructions. An ERTE table entry is tagged with the youngest instruction that could have used the evicted entry. Because loads and stores issue out of order, the youngest Itag of all the active instructions in the OoO window is stored in the ERTE table 900. On a flush, the Itag of the last surviving instruction is stored in all valid entries. On completion, all entries with the same or an older Itag are released. If the ERTE table 900 is full, it blocks dispatch and waits for instructions to complete (or be flushed, or both), so that the table eventually frees up entirely. Note that although the examples described herein use the Itag to track the age of launched instructions, other examples may instead use another tag that increases monotonically and wraps (such as an EA tag or LS tag).
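The tag-and-release bookkeeping of the ERTE table can be sketched as follows. This is an illustrative model only: `ErteTable` and its methods are hypothetical names, Itags are treated as plain integers that never wrap, release is assumed to be per thread, and flush handling is omitted.

```python
from dataclasses import dataclass


@dataclass
class ErteEntry:
    tid: int
    itag: int  # youngest in-flight Itag when the entry was evicted
    ea: int
    ra: int


class ErteTable:
    """Toy ERTE model: tag on eviction, release on completion."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = []

    def blocks_dispatch(self):
        # A full table stalls dispatch until entries are released.
        return len(self.entries) >= self.capacity

    def evict_from_ert(self, tid, youngest_itag, ea, ra):
        if self.blocks_dispatch():
            raise RuntimeError("ERTE full: wait for completion or flush")
        self.entries.append(ErteEntry(tid, youngest_itag, ea, ra))

    def on_complete(self, tid, completed_itag):
        # Assumption: a smaller Itag is older, with no wraparound.
        self.entries = [
            e for e in self.entries
            if not (e.tid == tid and e.itag <= completed_itag)
        ]
```

An entry is kept until the youngest instruction that might still use the evicted translation has completed, after which the translation is no longer needed.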

When a translation in the ERT 255 or the EAD 290 is evicted or invalidated, the EA-RA of the evicted entry is added to the ERTE table 900 without a predetermined number of its last bits (for example, the last 12 bits). Further, the entry in the ERTE table 900 with the youngest valid Itag of the same thread to which the evicted translation belongs is marked, for example using a flag (bit).

When a new address translation (EA to RA) is performed, the LSU compares the RA against the ERT 255 and checks whether a different EA for that RA, from the same thread, already exists in the ERT 255. If one exists, the LSU installs the new translation in the ERT 255 as a synonym. In this way, the ERT 255 (when the ERTE is used) can contain two entries with different EAs that point to the same RA for the same thread. Because synonyms are not permitted among in-flight instructions, in one or more examples the LSU initiates only an NTC+1 flush to guarantee forward progress.
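The synonym check in the preceding paragraph can be sketched as a lookup by RA within one thread. This is a minimal sketch under stated assumptions: the `ErtEntry` record and the `find_synonym` function are hypothetical names, and the ERT is modeled as a simple list rather than a CAM-searched hardware table.

```python
from dataclasses import dataclass


@dataclass
class ErtEntry:
    """Hypothetical model of one ERT row: a per-thread EA-to-RA translation."""
    thread: int
    ea: int
    ra: int


def find_synonym(ert, thread, ea, ra):
    """Return an existing entry of the same thread that maps a different EA
    to the same RA, or None if the new translation is not a synonym."""
    for entry in ert:
        if entry.thread == thread and entry.ra == ra and entry.ea != ea:
            return entry
    return None
```

A non-`None` result corresponds to the case in which the new translation would be installed as a synonym.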

A balance flush is a thread control mechanism that flushes a target thread that is stalled, consuming resources, or both, from the system as a whole in order to restore fairness or balance in resource usage. A balance flush includes a next-to-complete-plus-one (NTC+1) flush, which flushes all instruction groups on the selected thread that follow the next-to-complete instruction group. The NTC+1 balance flush flushes the execution units, the global completion table, and the EAD for the selected thread. A balance flush is performed on a thread only when the thread is stalled at dispatch. Balance flushing may be enabled or disabled using the <bf:1> field in the thread switch control register.

In one or more examples, entries in the ERTE are marked invalid after the OoO window has finished executing. Note that an ERTE entry is marked valid when it is written with a new EA-RA translation pair that is being evicted from the ERT table 255.

FIG. 11 shows a flowchart of an exemplary method for adding an entry to the ERTE table 900 according to one or more embodiments of the present invention. As described herein, when a new entry is added to the ERT 255, the LSU writes both the EA and the RA of the new translation to a particular row, which may be managed by LRU, as shown at 1010. As shown at 1012, the ERTE table 900 is searched using the RA to check whether the RA already exists in an ERTE table 900 entry that corresponds to a different EA, and to check for a possible multi-hit case on the same thread at install time. If the RA already exists in the ERTE table 900, the ERT table refuses to create the EA-RA entry until NTC and installs it when NTC is reached, as shown at 1015.

As shown at 1020, before overwriting an entry in the ERT 255, the LSU reads the EA and RA of the existing ERT entry that the new entry will overwrite and, as shown at 1030, stores the entry that was read out in the ERTE table 900. Further, as shown at 1040 and 1050, when there is a snoop from another core of the processor or a store drain, the ERTE table 900 is searched and the EA is read.

FIG. 12 shows an exemplary sequence diagram for an exemplary set of instructions launched according to one or more embodiments of the present invention. The instructions are shown in program order on the left; they are launched out of order, which produces a sequence of operations that differs from the sequence of the instructions. For example, consider the following events occurring in chronological order. (1) Instruction M issues out of order and uses the translation "ea1, ra1". (2) Instruction K issues out of order, misses in the ERT, installs a new entry, and evicts "ra2 ea2" from the ERT. At this point, the last Itag in use is N (all lines removed from the same thread); that is, instructions up to N may have used "ra2 ea2", and instructions after N can no longer use "ra2 ea2". (3) Instruction H issues out of order, misses in the ERT, and evicts "ra1 ea1" from the ERT. At this point, the last Itag in use is Q. (4) The pipeline is flushed and the last surviving instruction has Itag E; the next instructions to be fetched are R and S. (5) Instructions E through R complete in a given cycle, which releases all entries in the ERTE.
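The event sequence above can be replayed with a small model of the ERTE tagging rules (tag on eviction, retag on flush, release on completion). This is only a sketch under stated assumptions: `ErteTable`, its method names, and `figure12_trace` are hypothetical; Itags are modeled as integers derived from the instruction letters, so tag wraparound is ignored; and the EA and RA values are arbitrary placeholders.

```python
from dataclasses import dataclass


@dataclass
class ErteEntry:
    ea: int
    ra: int
    itag: int  # youngest instruction that may have used this translation


class ErteTable:
    def __init__(self, capacity=64):  # 64 entries is the example size in the text
        self.capacity = capacity
        self.entries = []

    def add_evicted(self, ea, ra, youngest_itag):
        """Record a translation evicted from the ERT, tagged with the
        youngest Itag currently active in the OoO window."""
        if len(self.entries) >= self.capacity:
            raise RuntimeError("ERTE full: dispatch must stall")
        self.entries.append(ErteEntry(ea, ra, youngest_itag))

    def on_flush(self, last_surviving_itag):
        """On a flush, every valid entry takes the Itag of the last survivor."""
        for entry in self.entries:
            entry.itag = last_surviving_itag

    def on_complete(self, completed_itag):
        """Release entries whose Itag is the same as or older than the completed one."""
        self.entries = [e for e in self.entries if e.itag > completed_itag]


def figure12_trace():
    """Replay events 2 to 5 and return the ERTE occupancy after each one."""
    itag = {name: ord(name) for name in "EHKMNQRS"}  # letter order stands in for age
    erte = ErteTable()
    sizes = []
    erte.add_evicted(ea=0x2000, ra=0xB000, youngest_itag=itag["N"])  # event 2
    sizes.append(len(erte.entries))
    erte.add_evicted(ea=0x1000, ra=0xA000, youngest_itag=itag["Q"])  # event 3
    sizes.append(len(erte.entries))
    erte.on_flush(itag["E"])                                         # event 4
    sizes.append(len(erte.entries))
    erte.on_complete(itag["R"])                                      # event 5
    sizes.append(len(erte.entries))
    return sizes
```

In this model the flush does not free any entries by itself; it only retags them, and the subsequent completion of E through R is what empties the table.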

The technical solutions described herein thus facilitate using only the EA, and they provide a technical advantage in that the EART (which processors have typically used) is not referenced in the load/store path and, further, detecting SHL and suppressing DVAL in time do not cause timing problems. In addition, the technical solutions described herein address technical problems that arise from using only the EA, for example that detection of LHS, SHL, and LHL can fail when two different EAs map to the same RA. The technical solutions described herein address such problems by using either a synonym detection table (SDT) or an ERT eviction table for the instructions in the OoO window. These technical solutions provide various technical advantages, including in particular reduced chip area (by not storing the RA), reduced power consumption (by not performing EA-RA translation), and improved latency.

Further, these technical solutions facilitate power savings by eliminating the lookup that determines the RA for an EA on every load and store address generation. Instead, the EA is used until an EAD miss and an ERT miss occur. In addition, because only a single CAM port is used, these technical solutions facilitate eliminating RA bus switching throughout the unit.

FIG. 13 shows a flowchart of an exemplary method for issuing instructions by the LSU 104 in multi-pipe mode and in an OoO manner, depending on whether the processor is operating in ST mode or MT mode, according to one or more embodiments of the present invention. For example, the LSU may be operating in a 2-load/2-store mode (multi-pipe mode). As shown at block 1310, the LSU 104 selects an instruction to be issued from the OoO window. The selected instruction may be a load instruction, a store instruction, or any instruction derived from such instructions that the LSU 104 issues (for example, a LARX instruction).

As shown at block 1320, the LSU 104 determines whether the OoO processor is operating in ST mode or MT mode. When ST mode is in use, the processor is using a single thread, and the LSU 104 determines only the LSU pipe associated with the instruction, as shown at block 1330. For example, if the instruction is a load instruction, the LSU 104 may associate it with the LD0 pipe, the LD1 pipe, or any other load pipe. Alternatively, if the instruction is a store instruction, the LSU 104 may associate it with the ST0 pipe, the ST1 pipe, or any other store pipe.

Further, as shown at block 1340, the LSU 104 creates and accesses entries using the partitions of the LRQF 218, SRQR 220, LRQE 222, and ERT 255 that are associated with the pipe, and issues the instruction. For example, if the instruction is a load instruction and the associated pipe is LD0, the instruction uses entries from the partitions LRQF0, LRQE0, and ERT0. Similarly, for a store instruction on the ST0 pipe, the partitions SRQR0 and ERT0 are used. For the LD1 pipe or the ST1 pipe, the LRQF1, LRQE1, ERT1, and SRQR1 partitions are used. Entries are created within a partition on a first-in, first-out basis.

Alternatively, if the processor is operating in MT mode, that is, multiple threads are executing simultaneously, the LSU 104 determines the thread identifier of the thread associated with the selected instruction, as shown at block 1350. As shown at block 1360, the LSU 104 further determines the LSU pipe associated with the instruction. Further, as shown at block 1370, the LSU 104 identifies the partition, and the position within the partition, of the LRQF 218, SRQR 220, LRQE 222, and ERT 255 in order to create and access entries, and issues the instruction based on the combination {thread id, pipe}. For example, the LSU restricts particular threads to particular pipes, such as restricting even-numbered threads to LD0 and ST0 and odd-numbered threads to LD1 and ST1. Note that the assignment of threads to pipes may differ in other examples. The LD0 and ST0 pipes are associated with the partitions that carry the suffix "0", and the LD1 and ST1 pipes are associated with the partitions that carry the suffix "1" (or vice versa).

In one or more exemplary embodiments of the present invention, each partition is further divided into portions according to the number of threads the processor is executing in MT mode. For example, if the processor is executing four threads, each of the two partitions in the LSU is further divided into two portions (a first portion for a first thread and a second portion for a second thread), and each partition is associated with a pair of threads. In one or more other exemplary embodiments in which the number of threads in MT mode differs from four, the partitions are divided into a different number of portions based on the number of threads associated with each partition. In the above example, in which a pair of threads is associated with each partition and each partition is further divided into equal portions, the first thread of the pair uses the first portion and the second thread uses the second portion. Accordingly, instructions executing on T0 in LD0/ST0 are associated with the first portion of the LRQF0, LRQE0, SRQR0, and ERT0 partitions, and instructions executing on T2 in LD0/ST0 are associated with the second portion of the LRQF0, LRQE0, SRQR0, and ERT0 partitions. Further, instructions executing on T1 in LD1/ST1 are associated with the first portion of the LRQF1, LRQE1, SRQR1, and ERT1 partitions, and instructions executing on T3 in LD1/ST1 are associated with the second portion of the LRQF1, LRQE1, SRQR1, and ERT1 partitions.
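The partition selection described for ST mode and for four-thread MT mode can be summarized in a short sketch. The function `select_partitions` and its return shape are hypothetical; the mapping assumes the example assignment from the text (even threads on LD0/ST0, odd threads on LD1/ST1, two equal portions per partition), which the text notes may differ in other examples.

```python
def select_partitions(mode, pipe, thread_id=0):
    """Return (partition names, portion index) for an instruction.

    mode      -- "ST" (single thread) or "MT" (four threads assumed here)
    pipe      -- one of "LD0", "LD1", "ST0", "ST1"
    thread_id -- 0..3, only meaningful in MT mode
    The portion index is None in ST mode, 0 for the first portion,
    and 1 for the second portion.
    """
    if pipe not in ("LD0", "LD1", "ST0", "ST1"):
        raise ValueError(f"unknown pipe: {pipe}")
    suffix = int(pipe[-1])
    # Loads use the load reorder queues; stores use the store reorder queue.
    structures = ["LRQF", "LRQE", "ERT"] if pipe.startswith("LD") else ["SRQR", "ERT"]
    names = [f"{name}{suffix}" for name in structures]

    if mode == "ST":
        return names, None
    if mode != "MT":
        raise ValueError(f"unknown mode: {mode}")
    if thread_id not in range(4):
        raise ValueError("this sketch models exactly four threads")
    if thread_id % 2 != suffix:
        raise ValueError("even threads use pipe 0, odd threads use pipe 1")
    # T0 and T1 take the first portion; T2 and T3 take the second.
    return names, thread_id // 2
```

For instance, a load on T2 resolves to the second portion of LRQF0, LRQE0, and ERT0, while a store on T1 resolves to the first portion of SRQR1 and ERT1.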

Reference is now made to FIG. 14, which is a block diagram of a computer system 1400 for implementing some or all aspects of one or more embodiments of the present invention. The processing described herein may be implemented in hardware, software (for example, firmware), or a combination of hardware and software. In exemplary embodiments, the methods described may be implemented, at least in part, in hardware and may be part of the microprocessor of a special-purpose or general-purpose computer system 1400, such as a mobile device, personal computer, workstation, microcomputer, or mainframe computer.

In an exemplary embodiment, as shown in FIG. 14, the computer system 1400 includes a processor 1405, memory 1412 coupled to a memory controller 1415, and one or more input devices 1445 and/or output devices 1447, such as peripherals, that are communicatively coupled via a local I/O controller 1435. These devices 1447 and 1445 may include, for example, a printer, a scanner, a microphone, and the like. A conventional keyboard 1450 and mouse 1455 may be coupled to the I/O controller 1435. The I/O controller 1435 may be, for example, one or more buses or other wired or wireless connections known in the art. The I/O controller 1435 may include additional elements, omitted here for simplicity, such as controllers, buffers (caches), drivers, repeaters, and receivers to enable communication.

The I/O devices 1447, 1445 may further include devices that communicate both inputs and outputs, for example disk and tape storage, a network interface card (NIC) or modulator/demodulator (for accessing other files, devices, systems, or a network), a radio frequency (RF) or other transceiver, a telephone interface, a bridge, a router, and the like.

The processor 1405 is a hardware device for executing hardware instructions or software, particularly software stored in the memory 1412. The processor 1405 may be a custom-made or commercially available processor, a central processing unit (CPU), an auxiliary processor among several processors associated with the computer system 1400, a semiconductor-based microprocessor (in the form of a microchip or chip set), a macroprocessor, or any other device for executing instructions. The processor 1405 can include caches such as, but not limited to, an instruction cache to speed up executable instruction fetch, a data cache to speed up data fetch and store, and a translation look-aside buffer (TLB) used to speed up virtual-to-physical address translation for both executable instructions and data. The caches may be organized as a hierarchy of multiple cache levels (L1, L2, and so on).

The memory 1412 may include one or a combination of volatile memory elements (for example, random access memory (RAM) such as DRAM, SRAM, or SDRAM) and nonvolatile memory elements (for example, ROM, erasable programmable read only memory (EPROM), electronically erasable programmable read only memory (EEPROM), programmable read only memory (PROM), tape, compact disc read only memory (CD-ROM), disk, floppy (R) disk, cartridge, cassette, or the like). Moreover, the memory 1412 may incorporate electronic, magnetic, optical, or other types of storage media. Note that the memory 1412 can have a distributed architecture, in which various components are situated remote from one another but can be accessed by the processor 1405.

The instructions in the memory 1412 may include one or more separate programs, each of which comprises an ordered listing of executable instructions for implementing logical functions. In the example of FIG. 14, the instructions in the memory 1412 include a suitable operating system (OS) 1411. The operating system 1411 essentially controls the execution of other computer programs and provides scheduling, input-output control, file and data management, memory management, and communication control and related services.

Additional data, including, for example, instructions for the processor 1405 or other retrievable information, may be stored in storage 1427, which may be a storage device such as a hard disk drive or a solid state drive. The instructions stored in the memory 1412 or in the storage 1427 may include instructions that enable the processor 1405 to execute one or more aspects of the dispatch systems and methods of this disclosure.

The computer system 1400 may further include a display controller 1425 coupled to a display 1430. In an exemplary embodiment, the computer system 1400 may further include a network interface 1460 for coupling to a network 1465. The network 1465 may be an IP-based network for communication between the computer system 1400 and an external server, client, and the like via a broadband connection. The network 1465 transmits and receives data between the computer system 1400 and external systems. In an exemplary embodiment, the network 1465 may be a managed IP network administered by a service provider. The network 1465 may be implemented in a wireless fashion, for example using wireless protocols and technologies such as WiFi and WiMax. The network 1465 may also be a packet-switched network such as a local area network, wide area network, metropolitan area network, the Internet, or other similar type of network environment. The network 1465 may be a fixed wireless network, a wireless local area network (LAN), a wireless wide area network (WAN), a personal area network (PAN), a virtual private network (VPN), an intranet or other suitable network system, and may include equipment for receiving and transmitting signals.

Systems and methods for providing partitioned load request queues and store request queues can be embodied, in whole or in part, in a computer program product or in a computer system 1400 such as that shown in FIG. 14.

Various embodiments of the present invention are described herein with reference to the related drawings. Alternative embodiments of the invention can be devised without departing from the scope of the invention. Various connections and positional relationships (for example, over, below, adjacent, and so on) are set forth between elements in the following description and in the drawings. These connections and/or positional relationships, unless specified otherwise, can be direct or indirect, and the present invention is not intended to be limiting in this respect. Accordingly, a coupling of entities can refer to either a direct or an indirect coupling, and a positional relationship between entities can be a direct or an indirect positional relationship. Moreover, the various tasks and process steps described herein can be incorporated into a more comprehensive procedure or process having additional steps or functionality not described in detail herein.

The following definitions and abbreviations are to be used for the interpretation of the claims and the specification. As used herein, the terms "comprises," "comprising," "includes," "including," "has," "having," "contains," or "containing," or any other variation thereof, are intended to cover a non-exclusive inclusion. For example, a composition, a mixture, a process, a method, an article, or an apparatus that comprises a list of elements is not necessarily limited to only those elements but can include other elements not expressly listed or inherent to such composition, mixture, process, method, article, or apparatus.

Additionally, the term "exemplary" is used herein to mean "serving as an example, instance, or illustration." Any embodiment or design described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. The terms "at least one" and "one or more" may be understood to include any integer number greater than or equal to one (that is, one, two, three, four, and so on). The term "a plurality" may be understood to include any integer number greater than or equal to two (that is, two, three, four, five, and so on). The term "connection" may include both an indirect "connection" and a direct "connection."

The terms "about," "substantially," "approximately," and variations thereof are intended to include the degree of error associated with measurement of the particular quantity based upon the equipment available at the time of filing the application. For example, "about" can include a range of ±8%, or 5%, or 2% of a given value.

For the sake of brevity, conventional techniques related to making and using aspects of the invention may or may not be described in detail herein. In particular, various aspects of computing systems and specific computer programs to implement the various technical features described herein are well known. Accordingly, in the interest of brevity, many conventional implementation details are only mentioned briefly herein or are omitted entirely without providing the well-known system and/or process details.

The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions thereon for causing a processor to carry out aspects of the present invention.

The computer-readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer-readable storage medium includes the following: a portable floppy (R) disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy (R) disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (for example, light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

The computer-readable program instructions described herein can be downloaded to respective computing/processing devices from a computer-readable storage medium, or to an external computer or external storage device via a network (for example, the Internet, a local area network, a wide area network, and/or a wireless network). The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective computing/processing device.

The computer-readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Java (R), Smalltalk (R), C++, or the like, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, a programmable data processing apparatus, other devices, or a combination thereof to function in a particular manner, such that the computer-readable storage medium having the instructions stored therein comprises an article of manufacture including instructions that implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.

The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus, or other device to produce a computer-implemented process, such that the instructions that execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowcharts or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It should also be noted that each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations, can be implemented by special-purpose hardware-based systems that perform the specified functions or acts, or that carry out combinations of special-purpose hardware and computer instructions.

The descriptions of the various embodiments of the present invention have been presented for purposes of illustration, but are not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

A processing unit for executing one or more instructions, the processing unit comprising:
a load/store unit (LSU) configured to execute a plurality of instructions in an out-of-order (OoO) window using a plurality of LSU pipes by:
selecting an instruction from the OoO window, the instruction using an effective address; and
in response to the instruction being a load instruction:
in response to the processing unit operating in a single-thread mode, creating an entry in a first partition of a load reorder queue based on the instruction being issued on a first load pipe, and creating the entry in a second partition of the load reorder queue based on the instruction being issued on a second load pipe; and
in response to the processing unit operating in a multithreaded mode in which a plurality of threads are processed simultaneously, creating the entry in a first predetermined portion of the first partition of the load reorder queue based on the instruction being issued on the first load pipe by a first thread of the processing unit.

The processing unit according to claim 1, wherein, in the multithreaded mode, the first predetermined portion of the first partition of the load reorder queue is dedicated to load instructions issued by the first thread of the processing unit using the first load pipe.

The processing unit according to claim 1, wherein the load/store unit is further configured to perform:
in response to the instruction being a store instruction:
in response to the processing unit operating in the single-thread mode, creating a store entry in a first partition of a store reorder queue based on the store instruction being issued on a first store pipe, and creating the store entry in a second partition of the store reorder queue based on the store instruction being issued on a second store pipe; and
in response to the processing unit operating in the multithreaded mode, creating the store entry in a first predetermined portion of the first partition of the store reorder queue based on the store instruction being issued on the first store pipe by the first thread of the processing unit.

The processing unit according to claim 1, wherein the load reorder queue includes one partition for each load pipe of the LSU.

The processing unit according to claim 4, wherein the LSU processes a plurality of load instructions simultaneously, with one load instruction using each load pipe.

The processing unit according to claim 3, wherein the store reorder queue includes one partition for each store pipe of the LSU.

The processing unit according to claim 6, wherein the LSU processes a plurality of store instructions simultaneously, with one store instruction using each load pipe.
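
The allocation rule recited in the processing-unit claims above can be pictured with a minimal Python sketch. It is illustrative only and is not taken from the patent: the queue size (`QUEUE_ENTRIES`), the number of pipes (`NUM_PIPES`), the function name `queue_region`, and the even split of a partition among threads in multithreaded mode are all assumptions. The claims state only that a predetermined portion of the partition is used, not how that portion is sized.

```python
# Hypothetical sizes; the claims do not specify them.
QUEUE_ENTRIES = 64
NUM_PIPES = 2
PARTITION_SIZE = QUEUE_ENTRIES // NUM_PIPES


def queue_region(pipe: int, multithreaded: bool,
                 thread: int = 0, num_threads: int = 2) -> range:
    """Return the reorder-queue entry indices an instruction may allocate into.

    Single-thread mode: the whole partition that belongs to the issuing pipe.
    Multithreaded mode: a fixed portion of that partition for the issuing
    thread (an equal split is ASSUMED here).
    """
    if not 0 <= pipe < NUM_PIPES:
        raise ValueError("unknown pipe")
    base = pipe * PARTITION_SIZE
    if not multithreaded:
        return range(base, base + PARTITION_SIZE)
    if not 0 <= thread < num_threads:
        raise ValueError("unknown thread")
    portion = PARTITION_SIZE // num_threads
    start = base + thread * portion
    return range(start, start + portion)
```

Because the region depends only on the pipe, the mode, and the thread, the same function could describe either the load reorder queue or the store reorder queue, under the same assumptions.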

A computer-implemented method for out-of-order execution of one or more instructions by a processing unit, the method comprising:
receiving, by a load/store unit (LSU) of the processing unit, an out-of-order (OoO) window of instructions comprising a plurality of instructions to be executed out of order; and
issuing, by the LSU, an instruction from the OoO window, the issuing comprising:
selecting an instruction from the OoO window, the instruction using an effective address; and
in response to the instruction being a load instruction:
in response to the processing unit operating in a single-thread mode, creating an entry in a first partition of a load reorder queue based on the instruction being issued on a first load pipe, and creating the entry in a second partition of the load reorder queue based on the instruction being issued on a second load pipe; and
in response to the processing unit operating in a multithreaded mode, creating the entry in a first predetermined portion of the first partition of the load reorder queue based on the instruction being issued on the first load pipe by a first thread of the processing unit.

The computer-implemented method according to claim 8, wherein, in the multithreaded mode, the first predetermined portion of the first partition of the load reorder queue is dedicated to load instructions issued by the first thread of the processing unit using the first load pipe.

The computer-implemented method according to claim 8, further comprising, in response to the instruction being a store instruction:
in response to the processing unit operating in the single-thread mode, creating a store entry in a first partition of a store reorder queue based on the store instruction being issued on a first store pipe, and creating the store entry in a second partition of the store reorder queue based on the store instruction being issued on a second store pipe; and
in response to the processing unit operating in the multithreaded mode, creating the store entry in a first predetermined portion of the first partition of the store reorder queue based on the store instruction being issued on the first store pipe by the first thread of the processing unit.

The computer-implemented method according to claim 8, wherein the load reorder queue includes one partition for each load pipe of the LSU.

The computer-implemented method according to claim 11, wherein the LSU processes a plurality of load instructions simultaneously, with one load instruction using each load pipe.

The computer-implemented method according to claim 10, wherein the store reorder queue includes one partition for each store pipe of the LSU.

The computer-implemented method according to claim 13, wherein the LSU processes a plurality of store instructions simultaneously, with one store instruction using each load pipe.
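
The method steps above (receive a window, issue each instruction, and create an entry in the partition that matches the issuing pipe) can be sketched for the single-thread case as follows. This is a hedged illustration, not the patented implementation: `Instr`, `PartitionedQueue`, `issue_window`, the first-free-slot allocation policy, and the partition sizes are all hypothetical names and choices made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Instr:
    tag: str    # hypothetical instruction identifier
    kind: str   # "load" or "store"
    pipe: int   # index of the pipe the instruction is issued on


class PartitionedQueue:
    """Reorder queue with one fixed-size partition per pipe (sizes assumed)."""

    def __init__(self, num_pipes: int, entries_per_partition: int) -> None:
        self.partitions: List[List[Optional[str]]] = [
            [None] * entries_per_partition for _ in range(num_pipes)
        ]

    def allocate(self, pipe: int, tag: str) -> int:
        """Store tag in the first free slot of the pipe's partition (assumed policy)."""
        partition = self.partitions[pipe]
        for slot, occupant in enumerate(partition):
            if occupant is None:
                partition[slot] = tag
                return slot
        raise RuntimeError(f"partition {pipe} is full")


def issue_window(window: List[Instr], lrq: PartitionedQueue,
                 srq: PartitionedQueue) -> Dict[str, Tuple[str, int, int]]:
    """Create a queue entry for every instruction in the window.

    Returns tag -> (queue name, partition index, slot index).
    """
    placed: Dict[str, Tuple[str, int, int]] = {}
    for instr in window:
        if instr.kind == "load":
            queue, name = lrq, "LRQ"
        elif instr.kind == "store":
            queue, name = srq, "SRQ"
        else:
            raise ValueError(f"unsupported instruction kind: {instr.kind}")
        placed[instr.tag] = (name, instr.pipe, queue.allocate(instr.pipe, instr.tag))
    return placed
```

In this model, two loads issued on different pipes never compete for the same partition, which is the property the one-partition-per-pipe claims rely on.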

A computer program product comprising a computer-readable storage medium having program instructions embodied therewith, the program instructions being executable by a processing unit to cause the processing unit to perform operations comprising:
receiving, by a load/store unit (LSU) of the processing unit, an out-of-order (OoO) window of instructions comprising a plurality of instructions to be executed out of order; and
issuing, by the LSU, an instruction from the OoO window, the issuing comprising:
selecting an instruction from the OoO window, the instruction using an effective address; and
in response to the instruction being a load instruction:
in response to the processing unit operating in a single-thread mode, creating an entry in a first partition of a load reorder queue based on the instruction being issued on a first load pipe, and creating the entry in a second partition of the load reorder queue based on the instruction being issued on a second load pipe; and
in response to the processing unit operating in a multithreaded mode, creating the entry in a first predetermined portion of the first partition of the load reorder queue based on the instruction being issued on the first load pipe by a first thread of the processing unit.

The computer program product according to claim 15, wherein, in the multithreaded mode, the first predetermined portion of the first partition of the load reorder queue is dedicated to load instructions issued by the first thread of the processing unit using the first load pipe.

The computer program product according to claim 15, wherein the operations further comprise, in response to the instruction being a store instruction:
in response to the processing unit operating in the single-thread mode, creating a store entry in a first partition of a store reorder queue based on the store instruction being issued on a first store pipe, and creating the store entry in a second partition of the store reorder queue based on the store instruction being issued on a second store pipe; and
in response to the processing unit operating in the multithreaded mode, creating the store entry in a first predetermined portion of the first partition of the store reorder queue based on the store instruction being issued on the first store pipe by the first thread of the processing unit.

The computer program product according to claim 15, wherein the load reorder queue includes one partition for each load pipe of the LSU.

The computer program product according to claim 18, wherein the LSU processes a plurality of load instructions simultaneously, with one load instruction using each load pipe.

The computer program product according to claim 17.