JP7525101B2 - Systems, methods and apparatus for heterogeneous computing

Info

Publication number: JP7525101B2
Application number: JP2022099279A
Authority: JP
Inventors: Rajesh M. Sankaran; Gilbert Neiger; Narayan Ranganathan; Stephen R. Van Doren; Joseph Nuzman; Niall D. McDonnell; Michael A. O'Hanlon; Lokpraveen B. Mosur; Tracy Garrett Drysdale; Eriko Nurvitadhi; Asit K. Mishra; Ganesh Venkatesh; Deborah T. Marr; Nicholas P. Carter; Jonathan D. Pearce; Edward T. Grochowski; Richard J. Greco; Robert Valentine; Jesus Corbal; Thomas D. Fletcher
Original assignee: Intel Corp
Current assignee: Intel Corp
Priority date: 2020-12-07
Filing date: 2022-06-21
Publication date: 2024-07-30
Anticipated expiration: 2036-12-31
Also published as: JP2022123079A; JP2021064378A; JP7164267B2; JP2024133635A

Description

本開示は、概してコンピューティングデバイスの分野、より具体的には、ヘテロジニアスコンピューティング方法、デバイス及びシステムに関する。 The present disclosure relates generally to the field of computing devices, and more specifically to heterogeneous computing methods, devices and systems.

現在のコンピュータでは、ＣＰＵは、アプリケーションソフトウェア及びオペレーティングシステムを実行するなどの汎用計算タスクを実行する。ある専門分野に特化した計算タスク、例えば、グラフィックス及び画像処理は、グラフィックスプロセッサ、画像プロセッサ、デジタル信号プロセッサ及び固定機能アクセラレータにより処理される。現在のヘテロジニアスマシンでは、各タイプのプロセッサは、様々な態様でプログラミングされる。 In modern computers, the CPU performs general-purpose computing tasks, such as running application software and operating systems. Specialized computing tasks, such as graphics and image processing, are handled by graphics processors, image processors, digital signal processors, and fixed function accelerators. In modern heterogeneous machines, each type of processor is programmed in a different way.

ビッグデータ処理の時代では、今日の汎用プロセッサと比較して、より低いエネルギーでより高い性能が求められている。アクセラレータ（例えば、カスタム固定機能ユニット又はオーダーメイドプログラマブルユニットのいずれか一方）は、これらの要求を満足させることに役立っている。この分野は、アルゴリズム及びワークロードの両方において急速な進化を遂げており、利用可能なアクセラレータのセットは、事前に予測することが難しく、製品型内のストックユニットにわたって枝分かれして、製品型と共に進化する可能性が極めて高い。 The era of big data processing demands higher performance at lower energy compared to today's general-purpose processors. Accelerators (e.g., either custom fixed-function units or bespoke programmable units) help meet these demands. The field is undergoing rapid evolution in both algorithms and workloads, and the set of available accelerators is difficult to predict in advance and will most likely evolve with product types, branching out across stock units within product types.

添付の図面と併せて以下の詳細な説明により、実施形態が容易に理解されるであろう。この説明を容易にするために、同様の参照番号は、同様の構造的要素を指定する。実施形態は、例として示され、添付の図面の図に制限することを目的としたものではない。 The embodiments will be readily understood from the following detailed description in conjunction with the accompanying drawings. To facilitate this description, like reference numerals designate like structural elements. The embodiments are illustrated by way of example and are not intended to be limiting to the figures of the accompanying drawings.

ヘテロジニアスマルチプロセッシングの実行環境を表現したものである。 A representation of a heterogeneous multiprocessing execution environment.

ヘテロジニアススケジューラの例示的な実装を示す。 Illustrates an exemplary implementation of a heterogeneous scheduler.

コンピュータシステムのシステムブート及びデバイス発見についての実施形態を示す。 Illustrates an embodiment of system boot and device discovery for a computer system.

処理要素の３つのタイプに対するプログラムフェーズのマッピングに基づいたスレッド移行の例を示す。 Shows an example of thread migration based on a mapping of program phases to three types of processing elements.

ヘテロジニアススケジューラにより実行される例示的な実施フローである。 An exemplary implementation flow performed by a heterogeneous scheduler.

ヘテロジニアススケジューラによるスレッド宛先選択のための方法についての例を示す。 Illustrates an example of a method for thread destination selection by a heterogeneous scheduler.

論理ＩＤに対する縞模様マッピングの使用についての概念を示す。 Illustrates the concept of using striped mapping for logical IDs.

論理ＩＤに対する縞模様マッピングの使用についての例を示す。 Shows an example of using striped mapping for logical IDs.

コアグループの例を示す。 Shows an example of a core group.

バイナリトランスレータ切替メカニズムを利用するシステムにおけるスレッド実行の方法についての例を示す。 Shows an example of a method of thread execution in a system that uses a binary translator switching mechanism.

アクセラレータに対するホットコードのコア割り当てについての例示的な方法を示す。 Illustrates an exemplary method of core allocation for hot code to an accelerator.

ページディレクトリベースレジスタイベントに対するウェイクアップ又は書き込みのための可能性があるコア割り当てについての例示的な方法を示す。 Illustrates an exemplary method of potential core allocation for a wake-up or a write to a page directory base register event.

直列フェーズスレッドの例を示す。 Shows an example of serial phase threads.

スリープコマンドイベントに対するスレッド応答のための潜在的なコア割り当てについての例示的な方法を示す。 Illustrates an exemplary method of potential core allocation for a thread response to a sleep command event.

フェーズ変更イベントに応じたスレッドのための潜在的なコア割り当てについての例示的な方法を示す。 Illustrates an exemplary method of potential core allocation for a thread in response to a phase change event.

加速領域を記述するコードの例を示す。 Shows an example of code that delineates an acceleration region.

ハードウェアプロセッサコアにおけるＡＢＥＧＩＮを用いた実行についての方法の実施形態を示す。 Illustrates an embodiment of a method of execution using ABEGIN in a hardware processor core.

ハードウェアプロセッサコアにおいてＡＥＮＤを用いた実行についての方法の実施形態を示す。 Illustrates an embodiment of a method of execution using AEND in a hardware processor core.

パターンマッチングを用いてＡＢＥＧＩＮ／ＡＥＮＤ等価を提供するシステムを示す。 Illustrates a system that provides ABEGIN/AEND equivalence using pattern matching.

パターン認識にさらされる非加速型記述スレッドについての方法の実施形態を示す。 Illustrates an embodiment of a method for a thread, not delineated for acceleration, that is subjected to pattern recognition.

メモリ依存性の様々なタイプ、これらのセマンティクス、オーダリング要求及び使用事例を示す。 Illustrates various types of memory dependencies, their semantics, ordering requirements, and use cases.

ＡＢＥＧＩＮ命令により指し示されるメモリデータブロックの例を示す。 Shows an example of a memory data block pointed to by an ABEGIN instruction.

ＡＢＥＧＩＮ／ＡＥＮＤセマンティクスを用いるように構成されるメモリ２５０３の例を示す。 Shows an example of memory 2503 that is configured to use ABEGIN/AEND semantics.

ＡＢＥＧＩＮ／ＡＥＮＤを用いた実行についての異なるモードでの動作の方法の例を示す。 Shows an example of a method of operating in different modes of execution using ABEGIN/AEND.

一実施例に関する追加の詳細を示す。 Illustrates additional details of one implementation.

アクセラレータの実施形態を示す。 Illustrates an embodiment of an accelerator.

マルチプロトコルリンクを介してプロセッサに結合されるアクセラレータ及び１又は複数のコンピュータプロセッサチップを含むコンピュータシステムを示す。 Illustrates a computer system that includes one or more computer processor chips and an accelerator coupled to a processor via a multi-protocol link.

実施形態に係るデバイスバイアスフローを示す。 Illustrates device bias flows according to an embodiment.

一実施例に従う例示的な処理を示す。 Illustrates an exemplary process according to one implementation.

オペランドが１又は複数のＩ／Ｏデバイスから解放される場合の処理を示す。 Illustrates a process in which operands are released from one or more I/O devices.

２つの異なるタイプのワークキューを用いた実施例を示す。 Illustrates an implementation using two different types of work queues.

Ｉ／Ｏファブリックインタフェースを介してサブミットされた記述子をＲＥＣＥＩＶＥする複数のワークキューを有するデータストリーミングアクセラレータ（ＤＳＡ）デバイスの実施例を示す。 Illustrates an implementation of a data streaming accelerator (DSA) device having multiple work queues that receive descriptors submitted over an I/O fabric interface.

２つのワークキューを示す。 Illustrates two work queues.

エンジン及びグループ化を用いた別の構成を示す。 Illustrates another configuration using engines and groupings.

記述子の実施例を示す。 Illustrates an implementation of a descriptor.

完了記録の実施例を示す。 Illustrates an implementation of a completion record.

例示的な非ｏｐ記述子及びｎｏ－ｏｐ完了記録を示す。 Illustrates an exemplary no-op descriptor and no-op completion record.

例示的なバッチ記述子及びｎｏ－ｏｐ完了記録を示す。 Illustrates an exemplary batch descriptor and no-op completion record.

例示的なドレイン記述子及びドレイン完了記録を示す。 Illustrates an exemplary drain descriptor and drain completion record.

例示的なメモリ移動記述子及びメモリ移動完了記録を示す。 Illustrates an exemplary memory move descriptor and memory move completion record.

例示的なフィル記述子を示す。 Illustrates an exemplary fill descriptor.

例示的な比較記述子及び比較完了記録を示す。 Illustrates an exemplary compare descriptor and compare completion record.

例示的な比較中間記述子を示す。 Illustrates an exemplary compare intermediate descriptor.

例示的な作成データ記録記述子及び作成差分記録完了記録を示す。 Illustrates an exemplary create data record descriptor and create delta record completion record.

差分記録のフォーマットを示す。 Illustrates the format of a delta record.

例示的な適合差分記録記述子を示す。 Illustrates an exemplary apply delta record descriptor.

作成差分記録及び適合差分記録オペレーションの利用についての一実施例を示す。 Illustrates one implementation of the use of the create delta record and apply delta record operations.

例示的なデュアルキャストを用いたメモリコピー記述子及びデュアルキャストを用いたメモリコピー完了記録を示す。 Illustrates an exemplary memory copy with dual cast descriptor and memory copy with dual cast completion record.

例示的なＣＲＣ生成記述子及びＣＲＣ生成を示す。 Illustrates an exemplary CRC generation descriptor and CRC generation.

ＣＲＣ生成記述子を用いた例示的なコピーを示す。 Illustrates an exemplary copy with CRC generation descriptor.

例示的なＤＩＦ挿入記述子及びＤＩＦ挿入完了記録を示す。 Illustrates an exemplary DIF insert descriptor and DIF insert completion record.

例示的なＤＩＦストリップ記述子及びＤＩＦストリップ完了記録を示す。 Illustrates an exemplary DIF strip descriptor and DIF strip completion record.

例示的なＤＩＦ更新記述子及びＤＩＦ更新完了記録を示す。 Illustrates an exemplary DIF update descriptor and DIF update completion record.

例示的なキャッシュフラッシュ記述子を示す。 Illustrates an exemplary cache flush descriptor.

ＥＮＱＣＭＤにより生成された６４バイトのエンキュー格納データを示す。 Illustrates the 64-byte enqueue store data generated by ENQCMD.

ＭＯＶＤＩＲＩ命令を処理するために、プロセッサにより実行される方法の実施形態を示す。 Illustrates an embodiment of a method performed by a processor to process a MOVDIRI instruction.

ＭＯＶＤＩＲＩ６４Ｂ命令を処理するために、プロセッサにより実行される方法の実施形態を示す。 Illustrates an embodiment of a method performed by a processor to process a MOVDIRI64B instruction.

ＥＮＣＱＭＤ命令を処理するために、プロセッサにより実行される方法の実施形態を示す。 Illustrates an embodiment of a method performed by a processor to process an ENQCMD instruction.

ＥＮＱＣＭＤＳ命令に関するフォーマットを示す。 Illustrates the format of the ENQCMDS instruction.

ＵＭＯＮＩＴＯＲ命令を処理するために、プロセッサにより実行される方法の実施形態を示す。 Illustrates an embodiment of a method performed by a processor to process a UMONITOR instruction.

ＵＭＷＡＩＴ命令を処理するために、プロセッサにより実行される方法の実施形態を示す。 Illustrates an embodiment of a method performed by a processor to process a UMWAIT instruction.

ＴＰＡＵＳＥ命令を処理するために、プロセッサにより実行される方法の実施形態を示す。 Illustrates an embodiment of a method performed by a processor to process a TPAUSE instruction.

ＵＭＷＡＩＴ及びＵＭＯＮＩＴＯＲ命令を用いた実行の例を示す。 Shows an example of execution using the UMWAIT and UMONITOR instructions.

ＴＰＡＵＳＥ及びＵＭＯＮＩＴＯＲ命令を用いた実行の例を示す。 Shows an example of execution using the TPAUSE and UMONITOR instructions.

アクセラレータがキャッシュコヒーレントインタフェースを通じて複数のコアに通信可能に結合される例示的な実装を示す。 Illustrates an exemplary implementation in which an accelerator is communicatively coupled to multiple cores through a cache coherent interface.

データ管理ユニット、複数の処理要素及び高速オンチップストレージを含むアクセラレータ及び前述の他のコンポーネントの別の図を示す。 Illustrates another view of the accelerator and the other components described above, including a data management unit, multiple processing elements, and fast on-chip storage.

処理要素により実行された処理の例示的なセットを示す。 Illustrates an exemplary set of operations performed by the processing elements.

ベクトルｙを生成するために、ベクトルｘに対する疎行列間の乗算の例を図示する。 Illustrates an example of multiplying a sparse matrix by a vector x to produce a vector y.

各値が（値、行インデックス）ペアとして格納される行列ＡのＣＳＲ表現を示す。 Illustrates the CSR representation of matrix A, in which each value is stored as a (value, row index) pair.

（値、列インデックス）ペアを用いる行列ＡのＣＳＣ表現を示す。 Illustrates the CSC representation of matrix A, which uses (value, column index) pairs.

計算パターンの擬似コードを示す。 Shows pseudocode for the computation patterns.

データ管理ユニット及び処理要素の一実施例に関する処理フローを示す。 Illustrates the processing flow for one implementation of the data management unit and the processing elements.

ｓｐＭｓｐＶ＿ｃｓｃ及びｓｃａｌｅ＿ｕｐｄａｔｅ演算に関するパスを（点線を用いて）強調表示する。 Highlights (using dotted lines) the paths for the spMspV_csc and scale_update operations.

ｓｐＭｄＶ＿ｃｓｒ演算に関するパスを示す。 Illustrates the paths for the spMdV_csr operation.

隣接行列としてのグラフを表す例を示す。 Shows an example of representing a graph as an adjacency matrix.

頂点プログラムを示す。 Illustrates a vertex program.

頂点プログラムを実行するための例示的なプログラムコードを示す。 Illustrates exemplary program code for executing a vertex program.

ＧＳＰＭＶの定式化を示す。 Shows the GSPMV formulation.

フレームワークを示す。 Illustrates a framework.

カスタマイズ可能な論理ブロックが各ＰＥ内に提供されることを示す。 Illustrates that customizable logic blocks are provided inside each PE.

各アクセラレータタイルの処理を示す。 Illustrates the operation of each accelerator tile.

テンプレートの一実施例についてのカスタマイズ可能なパラメータを要約したものである。 Summarizes the customizable parameters of one implementation of the template.

チューニング検討事項を示す。 Illustrates tuning considerations.

最も一般的な疎行列フォーマットの１つを示す。 Illustrates one of the most common sparse matrix formats.

ＣＲＳデータフォーマットを用いた疎行列－密ベクトル乗算についての実施例に関する段階を示す。 Shows the steps of an implementation of sparse matrix-dense vector multiplication using the CRS data format.

アクセラレータ論理ダイと、ＤＲＡＭの１又は複数のスタックとを含むアクセラレータについての実施例を示す。 Illustrates an implementation of an accelerator that includes an accelerator logic die and one or more stacks of DRAM.

上部視点からＤＲＡＭダイのスタックの方を向いたアクセラレータ論理チップの一実施例を示す。 Illustrates one implementation of the accelerator logic chip, viewed from the top and facing the stack of DRAM dies.

ＤＰＥの大まかな概観図を提供する。 Provides a high-level overview of a DPE.

ブロッキングスキームの実施例を示す。 Illustrates an implementation of a blocking scheme.

ブロック記述子を示す。 Shows a block descriptor.

単一のドット積エンジンのバッファ内に合致する２行行列を示す。 Illustrates a two-row matrix that fits within the buffers of a single dot-product engine.

このフォーマットを用いるドット積エンジン内のハードウェアの一実施例を示す。 Illustrates one implementation of the hardware in a dot-product engine that uses this format.

キャプチャを行うマッチ論理ユニットの内容を示す。 Illustrates the contents of the match logic unit that performs the capture.

実施例に係る疎行列－疎ベクトル乗算をサポートするドット積エンジン設計の詳細を示す。 Illustrates details of a dot-product engine design that supports sparse matrix-sparse vector multiplication according to an implementation.

特定の値を用いる例を示す。 Illustrates an example using specific values.

計算の両方のタイプを処理できるドット積エンジンを生じさせるように、疎－密及び疎－疎ドット積エンジンがどのように組み合わられるかを示す。 Illustrates how the sparse-dense and sparse-sparse dot-product engines are combined to yield a dot-product engine that can handle both types of computation.

１２個のアクセラレータスタックを用いたソケット交換の実施を示す。 Illustrates a socket replacement implementation with 12 accelerator stacks.

プロセッサ／コアのセット及び８つのスタックを用いたマルチチップパッケージ（ＭＣＰ）実装を示す。 Illustrates a multi-chip package (MCP) implementation with a set of processors/cores and eight stacks.

アクセラレータスタックを示す。 Illustrates accelerator stacks.

６４個のドット積エンジン、８つのベクトルキャッシュ及び統合メモリコントローラを含むＷＩＯ３ＤＲＡＭスタックの下に位置することが意図されるアクセラレータの潜在的なレイアウトを示す。 Shows a potential layout for an accelerator intended to sit under a WIO3 DRAM stack, including 64 dot-product engines, eight vector caches, and an integrated memory controller.

７つのＤＲＡＭ技術を比較したものである。 Compares seven DRAM technologies.

スタック型ＤＲＡＭを示す。 Illustrates stacked DRAM.

幅優先探索（ＢＦＳ）のリストを示す。 Illustrates a breadth-first search (BFS) listing.

一実施例に従うラムダ関数を規定するために用いられる記述子のフォーマットを示す。 Illustrates the format of the descriptors used to specify lambda functions according to one implementation.

実施形態におけるヘッダワードの下位６バイトを示す。 Illustrates the lower six bytes of the header word in an embodiment.

行列値バッファ、行列インデックスバッファ及びベクトル値バッファを示す。 Illustrates the matrix values buffer, the matrix indices buffer, and the vector values buffer.

ラムダデータパスの一実施例の詳細を示す。 Illustrates details of one implementation of the lambda datapath.

命令エンコーディングの実施例を示す。 Illustrates an implementation of an instruction encoding.

ある特定の命令のセットに対するエンコーディングを示す。 Illustrates the encodings for one particular set of instructions.

例示的な比較述語のエンコーディングを示す。 Illustrates the encodings of exemplary comparison predicates.

バイアスを用いた実施形態を示す。 Illustrates an embodiment using biasing.

ワークキューベースの実装と共に用いられるメモリマッピングされたＩ／Ｏ（ＭＭＩＯ）空間レジスタを示す。 Illustrates memory-mapped I/O (MMIO) space registers used with a work-queue-based implementation.

行列の乗算の例を示す。 Illustrates an example of matrix multiplication.

２分木低減ネットワークを用いたｏｃｔｏＭＡＤＤ命令処理を示す。 Illustrates octoMADD instruction operation with a binary tree reduction network.

積和演算命令を処理するために、プロセッサにより実行される方法の実施形態を示す。 Illustrates an embodiment of a method performed by a processor to process a multiply-add instruction.

ＭＡＤＤ命令を実行するための例示的なハードウェアを示す。 Illustrates exemplary hardware for executing a MADD instruction.

ハードウェアヘテロジニアススケジューラ回路及びメモリとのそのインタラクションの例を示す。 Illustrates an example of a hardware heterogeneous scheduler circuit and its interaction with memory.

ソフトウェアヘテロジニアススケジューラの例を示す。 Illustrates an example of a software heterogeneous scheduler.

ポストシステムブートデバイス発見のための方法の実施形態を示す。 Illustrates an embodiment of a method for post-system-boot device discovery.

共有メモリ内のスレッドに対する移動の例を示す。 Illustrates an example of movement for threads in shared memory.

ヘテロジニアススケジューラにより実行され得るスレッド移動のための例示的な方法を示す。 Illustrates an exemplary method for thread movement that may be performed by a heterogeneous scheduler.

詳細に上述されたように、抽象実行環境を提示するプロセッサのブロック図である。 A block diagram of a processor that presents an abstract execution environment, as detailed above.

例示的なマルチチップ構成を示す簡易ブロック図である。 A simplified block diagram illustrating an exemplary multi-chip configuration.

マルチチップリンク（ＭＣＬ）の例示的な実装を含むシステムの少なくとも一部を表すブロック図を示す。 A block diagram representing at least a portion of a system that includes an exemplary implementation of a multi-chip link (MCL).

例示的なＭＣＬの例示的な論理ＰＨＹのブロック図を示す。 A block diagram of an exemplary logical PHY of an exemplary MCL.

ＭＣＬを実装するために用いられる論理の別の表現を示すことを簡易ブロック図が示されることを図示する。 A simplified block diagram showing another representation of the logic used to implement an MCL.

ＡＢＥＧＩＮ／ＡＥＮＤがサポートされていない場合の実行の例を示す。 Illustrates an example of execution when ABEGIN/AEND is not supported.

本発明の一実施形態に係るレジスタアーキテクチャのブロック図である。 A block diagram of a register architecture according to one embodiment of the present invention.

本発明の実施形態に係る、例示的なインオーダパイプライン及び例示的なレジスタリネーミング・アウトオブオーダ発行／実行パイプラインの両方を示すブロック図である。 A block diagram illustrating both an exemplary in-order pipeline and an exemplary register renaming, out-of-order issue/execution pipeline according to embodiments of the present invention.

本発明の実施形態に係るプロセッサに含まれるインオーダアーキテクチャコアの例示的な実施形態及び例示的なレジスタリネーミング・アウトオブオーダ発行／実行アーキテクチャコアの両方を示すブロック図である。 A block diagram illustrating both an exemplary embodiment of an in-order architecture core and an exemplary register renaming, out-of-order issue/execution architecture core to be included in a processor according to embodiments of the present invention.

より具体的な例示的インオーダコアアーキテクチャのブロック図を示し、コアは、チップ内の（同じタイプ及び／又は異なるタイプの他のコアを含む）いくつかの論理ブロックのうちの１つであろう。 A block diagram of a more specific exemplary in-order core architecture, in which the core would be one of several logic blocks (including other cores of the same type and/or different types) in a chip.

本発明の実施形態に係る、１つより多くのコアを有してよく、統合メモリコントローラを有してよく、かつ、統合グラフィックスを有してよいプロセッサのブロック図である。 A block diagram of a processor that may have more than one core, may have an integrated memory controller, and may have integrated graphics according to embodiments of the present invention.

本発明の実施形態に係るシステムのブロック図を示す。 A block diagram of a system according to an embodiment of the present invention.

本発明の実施形態に係る第１のより具体的な例示的システムのブロック図である。 A block diagram of a first more specific exemplary system according to an embodiment of the present invention.

本発明の実施形態に係る第２のより具体的な例示的システムのブロック図である。 A block diagram of a second more specific exemplary system according to an embodiment of the present invention.

本発明の実施形態に従うＳｏＣのブロック図である。 A block diagram of a SoC according to an embodiment of the present invention.

本発明の実施形態に係る、ソース命令セット内のバイナリ命令をターゲット命令セット内のバイナリ命令に変換するソフトウェア命令変換器の使用を対比するブロック図である。 A block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set according to embodiments of the present invention.

以下の詳細な説明では、本明細書の一部を形成する添付の図面への参照が行われ、同様の符号が全体を通じて同様の部品を指し、実践され得る例示的な実施形態を用いて示される。他の実施形態が利用されてよく、構造的又は論理的な変更が本開示の範囲から逸脱することなく行われてよいことが理解されるべきである。したがって、以下の詳細な説明は、限定的な意味にとられるべきでなく、実施形態の範囲は、添付の特許請求の範囲及びこれらの同等物により規定される。 In the following detailed description, reference is made to the accompanying drawings which form a part hereof, where like numerals refer to like parts throughout and illustrate exemplary embodiments which may be practiced. It is to be understood that other embodiments may be utilized and structural or logical changes may be made without departing from the scope of the present disclosure. Therefore, the following detailed description is not to be taken in a limiting sense, and the scope of the embodiments is defined by the appended claims and their equivalents.

様々なオペレーションが、特許請求の範囲に記載の主題を理解する際に最も役立つ態様で、複数の別個の動作又は処理として順番に説明され得る。しかしながら、説明の順序は、これらの処理が必然的に順序に依存することを示唆するものとして解釈されるべきではない。特に、これらの処理は、提示の順序で実行されないくてもよい。説明される処理は、説明される実施形態とは異なる順序で実行されてよい。追加の実施形態では、様々な追加の処理が実行されてよく、及び／又は、説明される処理が省略されてもよい。 Various operations may be described sequentially as multiple separate actions or processes in a manner that is most helpful in understanding the claimed subject matter. However, the order of description should not be construed as implying that these processes are necessarily order dependent. In particular, these processes need not be performed in the order presented. Processes described may be performed in a different order than in the described embodiment. In additional embodiments, various additional processes may be performed and/or processes described may be omitted.

本開示の目的のために、「Ａ及び／又はＢ」という用語は、（Ａ）、（Ｂ）又は（Ａ及びＢ）を意味する。本開示の目的のために、「Ａ、Ｂ及び／又はＣ」という用語は、（Ａ）、（Ｂ）、（Ｃ）、（Ａ及びＢ）、（Ａ及びＣ）、（Ｂ及びＣ）又は（Ａ、Ｂ及びＣ）を意味する。 For purposes of this disclosure, the term "A and/or B" means (A), (B) or (A and B). For purposes of this disclosure, the term "A, B and/or C" means (A), (B), (C), (A and B), (A and C), (B and C) or (A, B and C).

説明では、「一実施形態において」又は「複数の実施形態において」という用語を用いてよく、同じ又は異なる実施形態のうちの１又は複数をそれぞれ指し得る。さらに、「備える（ｃｏｍｐｒｉｓｉｎｇ）」、「含む（ｉｎｃｌｕｄｉｎｇ）」及び「有する（ｈａｖｉｎｇ）」などの用語は、本開示の実施形態に関して用いられる場合、同義である。 In the description, the terms "in one embodiment" or "in embodiments" may be used and may refer to one or more of the same or different embodiments, respectively. Additionally, terms such as "comprising," "including," and "having" are synonymous when used in reference to embodiments of the present disclosure.

背景技術に記載したように、アクセラレータの様々な混合を実装する幅広いストックユニット及びプラットフォームがあるので、アクセラレータの解決手段を展開し、ポータブルに利用するアクセラレータの複雑性を管理することは困難であり得る。さらに、非常に多数のオペレーティングシステム（及びバージョン、パッチなど）を考慮すると、デバイスドライバモデルを用いてアクセラレータを配置するには、ビッグデータ処理についての開発者の取り組み、非移植性及び厳密な性能要件に起因する採用に対するハードルを含む制限がある。アクセラレータは、典型的には、汎用プロセッサ上で実行するソフトウェアよりも効率的に機能を実行するハードウェアデバイス（回路）である。例えば、ハードウェアアクセラレータは、特定のアルゴリズム／タスク（例えば、ビデオエンコーディング又はデコーディング、特定のハッシュ関数など）、又は、アルゴリズム／タスクのクラス（例えば、機械学習、疎データ操作、暗号化、グラフィックス、物理学、正規表現、パケット処理、人工知能、デジタル信号プロセッシングなど）の実行を改善するために用いられ得る。アクセラレータの例は、限定されることはないが、グラフィックス処理ユニット（「ＧＰＵ」）、固定機能フィールドプログラマブルゲートアレイ（「ＦＰＧＡ」）アクセラレータ、及び、固定機能特定用途向け集積回路（「ＡＳＩＣ」）を含む。アクセラレータは、いくつかの実施例において、ＣＰＵがシステム内の他のプロセッサよりも効率的である場合、汎用の中央処理装置（「ＣＰＵ」）であってよいことに留意する。 As described in the Background, with a wide range of stock units and platforms that implement a diverse mix of accelerators, it can be difficult to deploy an accelerator solution and manage the complexity of using accelerators in a portable manner. Furthermore, given the large number of operating systems (and versions, patches, etc.), deploying accelerators using a device driver model has limitations including hurdles to adoption due to developer efforts, non-portability, and strict performance requirements for big data processing. Accelerators are typically hardware devices (circuitry) that perform functions more efficiently than software running on a general-purpose processor. For example, hardware accelerators can be used to improve the execution of a particular algorithm/task (e.g., video encoding or decoding, a particular hash function, etc.) or a class of algorithms/tasks (e.g., machine learning, sparse data manipulation, cryptography, graphics, physics, regular expressions, packet processing, artificial intelligence, digital signal processing, etc.). Examples of accelerators include, but are not limited to, graphics processing units ("GPUs"), fixed function field programmable gate array ("FPGA") accelerators, and fixed function application specific integrated circuits ("ASICs"). 
Note that in some embodiments, the accelerator may be a general purpose central processing unit ("CPU") where the CPU is more efficient than other processors in the system.

所与のシステム（例えば、システムオンチップ（「ＳｏＣ」）、プロセッサストックユニット、ラックなど）の電力量は、利用可能なシリコンエリアの一部上のみで処理要素により消費され得る。これは、たとえ、ハードウェアブロックのすべてが同時にアクティブになり得ることはないとしても、特定の処理に対するエネルギー消費を低減する様々な特化型のハードウェアブロックを構築することを有利にする。 The power budget of a given system (e.g., a system-on-chip ("SoC"), processor stock unit, rack, etc.) may be consumed by processing elements on only a portion of the available silicon area. This makes it advantageous to build various specialized hardware blocks that reduce energy consumption for specific processes, even though not all of the hardware blocks may be active at the same time.

スレッドを処理する処理要素（例えば、コア又はアクセラレータ）を選択し、処理要素とインタフェース接続し、及び／又は、ヘテロジニアマルチプロセッサ環境内の電力消費を管理するためのシステム、方法及び装置の実施形態が詳細に説明される。例えば、様々な実施形態において、ヘテロジニアマルチプロセッサは、スレッド及び／又は処理要素の対応するワークロードの特性に基づいて、ヘテロジニアマルチプロセッサの異なるタイプの処理要素間でスレッドを動的に移行し、処理要素の１又は複数にプログラムインタフェースを提供し、特定の処理要素上での実行のためのコードを変換し、ワークロード及び選択された処理要素又はこれらの組み合わせについての特性に基づいて、選択された処理要素と共に用いる通信プロトコルを選択するように、（例えば、設計により、又は、ソフトウェアにより）構成される。 Embodiments of systems, methods, and apparatus for selecting a processing element (e.g., a core or accelerator) to process a thread, interfacing with the processing element, and/or managing power consumption in a heterogeneous multiprocessor environment are described in detail. For example, in various embodiments, the heterogeneous multiprocessor is configured (e.g., by design or by software) to dynamically migrate threads between different types of processing elements of the heterogeneous multiprocessor based on characteristics of the threads and/or corresponding workloads of the processing elements, provide program interfaces to one or more of the processing elements, translate code for execution on a particular processing element, and select a communication protocol to use with a selected processing element based on characteristics of the workload and the selected processing element or combinations thereof.

第１態様において、ワークロードディスパッチインタフェース、すなわち、ヘテロジニアススケジューラは、ホモジニアスマルチプロセッサプログラミングモデルをシステムプログラマに提示する。特に、この態様では、プログラマが特定のアーキテクチャをターゲットとするソフトウェア又は同等の抽象化を開発することを可能にし得る一方、開発されるソフトウェアへの対応する変更を要求することなく、基礎となるハードウェアに対する連続的な改善を容易する。 In a first aspect, the workload dispatch interface, i.e., the heterogeneous scheduler, presents a homogeneous multiprocessor programming model to the system programmer. In particular, this aspect may enable a programmer to develop software or an equivalent abstraction that targets a specific architecture, while facilitating continuous improvements to the underlying hardware without requiring corresponding changes to the software being developed.

第２態様において、マルチプロトコルリンクは、第１のエンティティ（ヘテロジニアススケジューラなど）が、通信と関連付けられたプロトコルを用いて、多数のデバイスと通信することを可能にする。これは、デバイス通信用に別々のリンクを有する必要性に取って代わるものである。特に、このリンクは、リンク上で動的に多重化される３又はそれより多いプロトコルを有する。例えば、共通のリンクは、１）１又は複数の独自又は業界標準（例えば、ＰＣＩエクスプレス仕様又は同等の代替手段など）において規定され得るように、デバイス発見、デバイス構成、エラー報告、割込み、ＤＭＡスタイルのデータ転送及び様々なサービスを可能にする生産者／消費者、発見、構成、割込み（ＰＤＣＩ）プロトコル、２）デバイスが、コヒーレントな読み出し及び書き込み要求を処理要素に発行することを可能にするキャッシングエージェントコヒーレンス（ＣＡＣ）プロトコル、及び、３）処理要素が、別の処理要素のローカルメモリにアクセスすることを可能にするメモリアクセス（ＭＡ）プロトコルからなるプロトコルをサポートする。 In a second aspect, the multi-protocol link allows a first entity (such as a heterogeneous scheduler) to communicate with multiple devices using a protocol associated with the communication. This replaces the need to have separate links for device communication. In particular, the link has three or more protocols that are dynamically multiplexed on the link. For example, a common link supports protocols consisting of: 1) a producer/consumer, discovery, configuration, interrupt (PDCI) protocol that enables device discovery, device configuration, error reporting, interrupts, DMA-style data transfers, and various services, as may be defined in one or more proprietary or industry standards (such as the PCI Express specification or equivalent alternatives); 2) a caching agent coherence (CAC) protocol that enables devices to issue coherent read and write requests to processing elements; and 3) a memory access (MA) protocol that enables a processing element to access the local memory of another processing element.
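
The idea of several protocols sharing one link can be sketched as follows. This is a minimal illustration, not the link design of the disclosure: the class names `Protocol`, `Message`, and `MultiProtocolLink`, and the tag-per-message scheme, are assumptions made for the example; only the three protocol names (PDCI, CAC, MA) come from the text.

```python
from dataclasses import dataclass
from enum import Enum


class Protocol(Enum):
    # The three protocols named in the text.
    PDCI = "producer/consumer, discovery, configuration, interrupt"
    CAC = "caching agent coherence"
    MA = "memory access"


@dataclass(frozen=True)
class Message:
    protocol: Protocol  # tag that lets the receiver demultiplex (assumed scheme)
    payload: str


class MultiProtocolLink:
    """Hypothetical model: one shared link, every message carries a protocol tag."""

    def __init__(self):
        self._wire = []  # messages in the order they were sent

    def send(self, protocol, payload):
        # Messages of different protocols are interleaved on the same link.
        self._wire.append(Message(protocol, payload))

    def demultiplex(self):
        """Group payloads by protocol, keeping per-protocol arrival order."""
        by_protocol = {p: [] for p in Protocol}
        for message in self._wire:
            by_protocol[message.protocol].append(message.payload)
        return by_protocol


if __name__ == "__main__":
    link = MultiProtocolLink()
    link.send(Protocol.PDCI, "discover device")
    link.send(Protocol.CAC, "coherent read")
    link.send(Protocol.MA, "load from device memory")
    link.send(Protocol.PDCI, "raise interrupt")
    for protocol, payloads in link.demultiplex().items():
        print(protocol.name, payloads)
```

The point of the sketch is only that a single link replaces separate per-protocol links; how the real link arbitrates and frames traffic is not modeled here.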

第３態様において、スレッドのスケジューリング、移行若しくはエミュレーション、又は、これらの一部が、スレッドのフェーズに基づいて行われる。例えば、スレッドのデータ並列フェーズは、典型的には、スケジューリングされ、又は、ＳＩＭＤコアに移行され、スレッドのスレッド並列フェーズは、典型的には、スケジューリングされ、又は、１又は複数のスカラコアに移行され、直列フェーズは、典型的には、スケジューリングされ、又は、アウトオブオーダコアに移行される。コアタイプのそれぞれは、両方ともスレッドのスケジューリング、移行又はエミュレーションについて考慮されるエネルギー又はレイテンシのいずれか一方を最小化する。エミュレーションは、スケジューリング又は移行が可能でない又は有利でない場合に用いられてよい。 In a third aspect, thread scheduling, migration, or emulation, or portions thereof, are performed based on the phase of the thread. For example, a data-parallel phase of a thread is typically scheduled or migrated to a SIMD core, a thread-parallel phase of a thread is typically scheduled or migrated to one or more scalar cores, and a serial phase is typically scheduled or migrated to an out-of-order core. Each core type minimizes either energy or latency, both of which are considered for thread scheduling, migration, or emulation. Emulation may be used when scheduling or migration is not possible or advantageous.
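
The phase-to-core mapping described in the third aspect can be sketched as a small lookup. The mapping itself follows the text; the names `Phase`, `CoreType`, and `place_thread`, and the rule used to pick a fallback core for emulation, are assumptions made for illustration.

```python
from enum import Enum


class Phase(Enum):
    DATA_PARALLEL = "data-parallel"
    THREAD_PARALLEL = "thread-parallel"
    SERIAL = "serial"


class CoreType(Enum):
    SIMD = "SIMD core"
    SCALAR = "scalar core"
    OUT_OF_ORDER = "out-of-order core"


# Typical placement per phase, as stated in the text.
PREFERRED_CORE = {
    Phase.DATA_PARALLEL: CoreType.SIMD,
    Phase.THREAD_PARALLEL: CoreType.SCALAR,
    Phase.SERIAL: CoreType.OUT_OF_ORDER,
}


def place_thread(phase, available_cores):
    """Return (core type, action) for a thread in the given phase.

    If the preferred core type is available, the thread is scheduled on or
    migrated to it. Otherwise the phase is emulated on another core; picking
    the alphabetically first available type is an arbitrary assumption.
    """
    if not available_cores:
        raise ValueError("no processing element is available")
    preferred = PREFERRED_CORE[phase]
    if preferred in available_cores:
        return preferred, "schedule_or_migrate"
    fallback = min(available_cores, key=lambda core: core.name)
    return fallback, "emulate"


if __name__ == "__main__":
    cores = {CoreType.SIMD, CoreType.SCALAR}
    print(place_thread(Phase.DATA_PARALLEL, cores))
    print(place_thread(Phase.SERIAL, cores))
```

A real scheduler would also weigh energy and latency when choosing between candidates, which this sketch leaves out.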

第４態様において、スレッド又はこれらの一部は、オポチュニスティックに（ｏｐｐｏｒｔｕｎｉｓｔｉｃａｌｌｙ）、アクセラレータへオフロードされる。特に、スレッドのアクセラレータ開始（ＡＢＥＧＩＮ）命令及びアクセラレータ終了（ＡＥＮＤ）命令又はこれらの一部、ブックエンド命令が、アクセラレータ上で実行可能であり得る。アクセラレータが利用可能でない場合、次に、ＡＢＥＧＩＮとＡＥＮＤとの間の命令が通常通り実行される。しかしながら、アクセラレータが利用可能である場合、アクセラレータを用いる（例えば、少ない電力を用いる）ことが好ましく、次に、ＡＢＥＧＩＮとＡＥＮＤ命令との間の命令は、そのアクセラレータ上で実行するために変換され、そのアクセラレータの実行のためにスケジューリングされる。その結果、アクセラレータの使用はオポチュニスティックである。 In a fourth aspect, a thread or a portion thereof is opportunistically offloaded to an accelerator. In particular, an accelerator begin (ABEGIN) instruction and an accelerator end (AEND) instruction bookend the instructions of the thread, or portion thereof, that may be executable on an accelerator. If no accelerator is available, the instructions between ABEGIN and AEND are executed as normal. However, if an accelerator is available and using it is desirable (e.g., because it uses less power), the instructions between the ABEGIN and AEND instructions are translated to execute on that accelerator and scheduled for execution on it. As a result, the use of the accelerator is opportunistic.
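
The fallback behavior around an ABEGIN/AEND region can be sketched as below. This is a toy model under stated assumptions: a region is represented as a list of `(op, a, b)` tuples, and `ToyAccelerator`, its `translate` and `execute` methods, and `run_delineated_region` are hypothetical names, not an interface defined by the disclosure.

```python
import operator

# Toy instruction set shared by the core and the accelerator (assumption).
OPS = {"add": operator.add, "mul": operator.mul}


def execute_on_core(region):
    """Execute the delineated instructions as normal on the core."""
    return [OPS[op](a, b) for op, a, b in region]


class ToyAccelerator:
    """Hypothetical accelerator that needs the region translated first."""

    def __init__(self, available=True):
        self.available = available

    def translate(self, region):
        # Stand-in for code translation: resolve op names to callables.
        return [(OPS[op], a, b) for op, a, b in region]

    def execute(self, translated):
        return [fn(a, b) for fn, a, b in translated]


def run_delineated_region(region, accelerator=None):
    """Run the instructions that sit between ABEGIN and AEND.

    Offload is opportunistic: use the accelerator when one is present and
    available, otherwise fall back to normal execution on the core.
    """
    if accelerator is not None and accelerator.available:
        return "accelerator", accelerator.execute(accelerator.translate(region))
    return "core", execute_on_core(region)


if __name__ == "__main__":
    region = [("add", 1, 2), ("mul", 3, 4)]
    print(run_delineated_region(region))
    print(run_delineated_region(region, ToyAccelerator()))
```

Either path produces the same results for the region; only the place of execution differs, which is what makes the offload safe to treat as optional.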

第５態様において、スレッド又はその一部は、ＡＢＥＧＩＮ又はＡＥＮＤを用いることなく、アクセラレータに（オポチュニスティックな）オフロードのために解析される。ソフトウェア又はハードウェアパターンマッチは、アクセラレータ上で実行可能であり得るコード用に、スレッド又はその一部に対して実行される。アクセラレータが利用可能でない場合、又は、スレッド又はその一部それ自体がアクセラレータの実行に役立たない場合、スレッドの命令は、通常通りに実行される。しかしながら、アクセラレータが利用可能である場合、アクセラレータを用いる（例えば、少ない電力を用いる）ことが好ましく、次に、命令は、そのアクセラレータで実行するために変換され、そのアクセラレータ上での実行のためにスケジューリングされる。その結果、アクセラレータの使用はオポチュニスティックである。 In a fifth aspect, a thread or a portion thereof is analyzed for (opportunistic) offload to an accelerator without the use of ABEGIN or AEND. A software or hardware pattern match is run against the thread, or a portion thereof, to find code that may be executable on an accelerator. If no accelerator is available, or if the thread or portion thereof does not lend itself to accelerator execution, the instructions of the thread are executed as normal. However, if an accelerator is available and using it is desirable (e.g., because it uses less power), the instructions are translated to execute on that accelerator and scheduled for execution on it. As a result, the use of the accelerator is opportunistic.

In a sixth aspect, a transformation of a code fragment (a portion of a thread) is performed so that it better fits the selected destination processing element. For example, the code fragment is 1) translated to utilize a different instruction set, 2) made more parallel, 3) made less parallel (serialized), 4) made data parallel (e.g., vectorized), and/or 5) made less data parallel (e.g., devectorized).

第７態様において、（共有又は専用のいずれか一方の）ワークキューは、デバイスにより行われるワークの範囲を定義する記述子を受信する。専用のワークキューは、単一のアプリケーション用の記述子を格納する一方、共有のワークキューは、複数のアプリケーションによりサブミットされる記述子を格納する。ハードウェアインタフェース／アービタは、特定のアービトレーションポリシに従って（例えば、各アプリケーション及びＱｏＳ／公平性ポリシの処理要件に基づいて）、記述子をワークキューからアクセラレータ処理エンジンにディスパッチする。 In a seventh aspect, a work queue (either shared or dedicated) receives descriptors that define the scope of work to be performed by the device. A dedicated work queue stores descriptors for a single application, while a shared work queue stores descriptors submitted by multiple applications. A hardware interface/arbiter dispatches the descriptors from the work queue to the accelerator processing engines according to a particular arbitration policy (e.g., based on the processing requirements of each application and QoS/fairness policies).
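
A sketch of the dedicated/shared work queue distinction and of descriptor dispatch is given below. The class and field names are hypothetical, and plain round-robin is used only as a stand-in for the arbitration policy, which the text leaves open (QoS and fairness policies are mentioned as examples).

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional


@dataclass
class Descriptor:
    app_id: int   # application that submitted the descriptor
    work: str     # stand-in for the scope of work to be done


@dataclass
class WorkQueue:
    shared: bool
    owner: Optional[int] = None  # only meaningful for a dedicated queue
    entries: Deque[Descriptor] = field(default_factory=deque)

    def submit(self, desc: Descriptor) -> None:
        # A dedicated queue holds descriptors for a single application.
        if not self.shared and desc.app_id != self.owner:
            raise PermissionError("dedicated queue accepts only its owner")
        self.entries.append(desc)


def dispatch_round_robin(queues: List[WorkQueue]) -> List[Descriptor]:
    """Assumed policy: take one descriptor per non-empty queue per pass."""
    order: List[Descriptor] = []
    while any(q.entries for q in queues):
        for q in queues:
            if q.entries:
                order.append(q.entries.popleft())
    return order
```

A weighted or priority-based arbiter would replace only `dispatch_round_robin`; the queue semantics stay the same.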

In an eighth aspect, improvements to dense matrix multiplication allow two-dimensional matrices to be multiplied with the execution of a single instruction. A plurality of packed data (SIMD, vector) sources are multiplied against a single packed data source. In some examples, a binary tree is used for the multiplications.

FIG. 1 is a representation of a heterogeneous multiprocessing execution environment. In this example, a code fragment of a first type (e.g., one or more instructions associated with a software thread) is received by a heterogeneous scheduler 101. The code fragment may be in the form of any number of source code representations, including, for example, machine code, an intermediate representation, bytecode, or text-based code (e.g., assembly code or source code of a high-level language such as C++). The heterogeneous scheduler 101 presents a homogeneous multiprocessor programming model (e.g., such that all threads appear to a user and/or the operating system as if they were executing on a scalar core), determines a workload type (program phase) for the received code fragment, selects a type of processing element (scalar, out-of-order (OOO), single instruction multiple data (SIMD), or accelerator) corresponding to the determined workload type to process the workload (e.g., scalar for thread-parallel code, OOO for serial code, SIMD for data-parallel code, and an accelerator for data-parallel code), and schedules the code fragment for processing by the corresponding processing element. In the particular implementation shown in FIG. 1, the processing element types include a scalar core 103 (e.g., an in-order core), a SIMD core 105 that operates on packed data operands in which a register holds multiple data elements stored consecutively, a low-latency out-of-order core 107, and an accelerator 109. In some embodiments, the scalar core 103, the SIMD core 105, and the low-latency out-of-order core 107 are in a heterogeneous processor, and the accelerator 109 is external to this heterogeneous processor. It should be noted, however, that various different arrangements of processing elements may be utilized. In some implementations, the heterogeneous scheduler 101 translates or interprets the received code fragment, or a portion thereof, into a format corresponding to the selected type of processing element.
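
The mapping from workload type to processing element type described for FIG. 1 can be sketched as a lookup. The preference order within the data-parallel case (accelerator before SIMD) and the fallback to the scalar core are assumptions made for this example; the text gives only the pairing of phases with element types.

```python
from enum import Enum
from typing import Dict, List, Set


class Phase(Enum):
    THREAD_PARALLEL = "thread parallel"
    SERIAL = "serial"
    DATA_PARALLEL = "data parallel"


# Phase -> candidate processing element types, in assumed order of preference.
PREFERRED_PE: Dict[Phase, List[str]] = {
    Phase.THREAD_PARALLEL: ["scalar"],
    Phase.SERIAL: ["ooo"],
    Phase.DATA_PARALLEL: ["accelerator", "simd"],
}


def select_pe_type(phase: Phase, available: Set[str]) -> str:
    """Pick the first preferred element type that is currently available."""
    for pe_type in PREFERRED_PE[phase]:
        if pe_type in available:
            return pe_type
    # Assumption: any fragment can still run on a scalar core.
    return "scalar"
```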

処理要素１０３～１０９は、異なる命令セットアーキテクチャ（ＩＳＡ）をサポートしてよい。例えば、アウトオブオーダコアは、第１のＩＳＡをサポートしてよく、インオーダコアは、第２のＩＳＡをサポートしてよい。この第２のＩＳＡは、第１のＩＳＡの（サブ又はスーパー）セットであってよく、又は、異なっていてもよい。さらに、処理要素は、異なるマイクロアーキテクチャを有してよい。例えば、第１のアウトオブオーダコアは、第１のマイクロアーキテクチャをサポートし、インオーダコアは、異なる第２のマイクロアーキテクチャをサポートする。たとえ処理要素の特定のタイプ内であったとしても、ＩＳＡ及びマイクロアーキテクチャは、異なっていてもよいことに留意する。例えば、第１のアウトオブオーダコアは、第１のマイクロアーキテクチャをサポートしてよく、第２のアウトオブオーダコアは、異なるマイクロアーキテクチャをサポートしてよい。命令は、それらがＩＳＡの一部であるという点で、特定のＩＳＡに対して「ネイティブ」である。ネイティブ命令は、外部の変更（例えば、変換）を必要とすることなく、特定のマイクロアーキテクチャで実行する。 The processing elements 103-109 may support different instruction set architectures (ISAs). For example, an out-of-order core may support a first ISA and an in-order core may support a second ISA. This second ISA may be a (sub- or super) set of the first ISA or may be different. Additionally, the processing elements may have different microarchitectures. For example, a first out-of-order core may support a first microarchitecture and an in-order core may support a different second microarchitecture. Note that even within a particular type of processing element, the ISAs and microarchitectures may be different. For example, a first out-of-order core may support a first microarchitecture and a second out-of-order core may support a different microarchitecture. Instructions are "native" to a particular ISA in that they are part of the ISA. Native instructions execute in a particular microarchitecture without requiring external modification (e.g., translation).

いくつかの実施例では、処理要素の１又は複数は、例えば、システムオンチップ（ＳｏＣ）として、単一のダイに統合される。そのような実施例では、例えば、改善された通信レイテンシ、製造／コスト、低減されたピンカウント、プラットフォームの小型化などからの利益を得る場合がある。他の実施例では、処理要素は、まとめてパッケージ化され、それにより、単一のダイにある必要はなく、上記で参照したＳｏＣの利益の１又は複数を実現する。これらの実施例は、例えば、処理要素タイプ毎に最適化される異なる処理技術、歩留まり向上のためのより小さいダイサイズ、所有の知的財産ブロックの統合などからさらなる利益を得てよい。いくつかの従来のマルチパッケージ制限では、異なるデバイスが追加されるときに、それらと通信することが困難であるかもしれない。本明細書で説明されるマルチプロトコルリンクは、異なるタイプのデバイスに共通のインタフェースをユーザ、オペレーティングシステム（「ＯＳ」）などに提示することにより、この課題を最小化又は緩和する。 In some embodiments, one or more of the processing elements are integrated on a single die, e.g., as a system on a chip (SoC). Such embodiments may benefit, for example, from improved communication latency, manufacturing/cost, reduced pin count, platform compactness, etc. In other embodiments, the processing elements are packaged together, thereby achieving one or more of the benefits of SoC referenced above, without necessarily being on a single die. These embodiments may further benefit, for example, from different processing technologies optimized for each processing element type, smaller die size for improved yield, consolidation of proprietary intellectual property blocks, etc. With some traditional multi-package limitations, it may be difficult to communicate with different devices as they are added. The multi-protocol links described herein minimize or mitigate this challenge by presenting a common interface to the user, operating system ("OS"), etc. for different types of devices.

いくつかの実施例において、ヘテロジニアススケジューラ１０１は、プロセッサコア（例えば、ＯＯＯコア１０７）での実行のために、コンピュータ可読媒体（例えば、メモリ）に格納されたソフトウェアにおいて実装される。これらの実施例において、ヘテロジニアススケジューラ１０１は、ソフトウェアヘテロジニアススケジューラと称される。このソフトウェアは、バイナリトランスレータ、実行時（「ＪＩＴ」）コンパイラ、コードフラグメントを含むスレッドの実行をスケジューリングするＯＳ１１７、パターンマッチャ、内部モジュールコンポーネント又はこれらの組み合わせを実装してよい。 In some embodiments, the heterogeneous scheduler 101 is implemented in software stored on a computer-readable medium (e.g., memory) for execution on a processor core (e.g., OOO core 107). In these embodiments, the heterogeneous scheduler 101 is referred to as a software heterogeneous scheduler. This software may implement a binary translator, a just-in-time ("JIT") compiler, an OS 117 that schedules execution of threads that contain code fragments, a pattern matcher, an internal module component, or a combination thereof.

いくつかの実施例では、ヘテロジニアススケジューラ１０１は、回路及び／又は回路により実行される有限ステートマシンとして、ハードウェア内に実装される。これらの実施例では、ヘテロジニアススケジューラ１０１は、ハードウェアヘテロジニアススケジューラと称される。 In some embodiments, the heterogeneous scheduler 101 is implemented in hardware as a circuit and/or a finite state machine executed by a circuit. In these embodiments, the heterogeneous scheduler 101 is referred to as a hardware heterogeneous scheduler.

プログラム（例えば、ＯＳ１１７、エミュレーション層、ハイパーバイザ、セキュアモニタなど）の観点から、各タイプの処理要素１０３－１０９は、共有メモリアドレス空間１１５を利用する。いくつかの実施例において、共有メモリアドレス空間１１５は、図２に示されるように、２つのタイプのメモリ、メモリ２１１及びメモリ２１３を選択的に有する。そのような実施例において、メモリのタイプは、限定されることはないが、メモリ位置における差（例えば、異なるソケット上に配置される、など）、対応するインタフェース標準における差（例えば、ＤＤＲ４、ＤＤＲ５など）、所要電力における差、及び／又は、使用される基礎となるメモリ技術における差（例えば、高帯域幅メモリ（ＨＢＭ）、シンクロナスＤＲＡＭなど）を含む様々な方式で区別されてよい。 From the perspective of a program (e.g., OS 117, emulation layer, hypervisor, secure monitor, etc.), each type of processing element 103-109 utilizes a shared memory address space 115. In some embodiments, the shared memory address space 115 optionally has two types of memory, memory 211 and memory 213, as shown in FIG. 2. In such embodiments, the types of memory may be distinguished in various ways, including but not limited to differences in memory location (e.g., located on different sockets, etc.), differences in corresponding interface standards (e.g., DDR4, DDR5, etc.), differences in power requirements, and/or differences in the underlying memory technology used (e.g., high bandwidth memory (HBM), synchronous DRAM, etc.).

The shared memory address space 115 is accessible by each type of processing element. However, in some embodiments, different types of memory may be preferentially allocated to different processing elements, e.g., based on workload needs. For example, in some implementations, a platform firmware interface (e.g., BIOS or UEFI) or a memory storage includes fields that indicate the types of memory resources available in the platform and/or the affinity of a processing element for particular address ranges or memory types.

The heterogeneous scheduler 101 uses this information when analyzing a thread to determine where the thread should be executed at a given point in time. Typically, the thread management mechanism looks at the totality of the information available to it in order to make an informed decision about how to manage existing threads. This may manifest itself in a number of ways. For example, a thread executing on a particular processing element that has an affinity for an address range physically closer to that processing element may be given preferential treatment over a thread that would, under normal circumstances, be executed on that processing element.

Another example is that a thread that would benefit from a particular memory type (e.g., a faster version of DRAM) may have its data physically moved to that memory type, with the memory references in its code adjusted to point to that portion of the shared address space. For example, while a thread on the SIMD core 105 may utilize the second memory type 213, it may be moved off this memory when the accelerator 109 is active and needs that memory type 213 (or at least needs the portion allocated to the thread of the SIMD core 105).

例示的なシナリオは、メモリが一方よりも他方の処理要素に物理的に近い場合である。よくある例は、アクセラレータがコアとは異なるメモリタイプに直接接続されている場合である。 An example scenario is when memory is physically closer to one processing element than the other. A common example is when an accelerator is directly connected to a different memory type than the core.

これらの例では、典型的には、データ移動を開始するＯＳである。しかしながら、下位レベル（例えば、ヘテロジニアススケジューラ）が、独自で又は別のコンポーネント（例えば、ＯＳ）からの支援を伴ってこの機能を実行することを拒むものは何もない。前の処理要素のデータががフラッシュされ、ページテーブルエントリが無効にされるか否かは、データ移動を行うための実装及び不利益に依存する。データが、すぐ用いられる可能性が高くはない場合、一方のメモリタイプから他方にデータを移動するよりも、むしろストレージから単にコピーした方がより実現可能であるかもしれない。 In these examples, it is typically the OS that initiates the data movement. However, there is nothing to prevent a lower level (e.g., a heterogeneous scheduler) from performing this function on its own or with assistance from another component (e.g., the OS). Whether the data from the previous processing element is flushed and the page table entries invalidated depends on the implementation and the penalty for performing the data movement. If the data is not likely to be used immediately, it may be more feasible to simply copy it from storage rather than moving it from one memory type to another.

FIGS. 117(A)-(B) illustrate an example of movement for a thread in shared memory. In this example, two types of memory share an address space, and each has its own range of addresses within that space. In FIG. 117(A), shared memory 11715 includes a first type of memory 11701 and a second type of memory 11707. The first type of memory 11701 has a first address range 11703, and within that range are addresses dedicated to thread 1 11705. The second type of memory 11707 has a second address range 11709.

At some point during the execution of thread 1 11705, the heterogeneous scheduler decides to move thread 1 11705 so that a second thread 11711 can use the addresses in the first type of memory 11701 that were previously assigned to thread 1 11705. This is shown in FIG. 117(B). In this example, thread 1 11705 is reassigned to the second type of memory 11707 and given a new set of addresses to use; however, this need not be the case. Note that the difference between the types of memory may be physical or spatial (e.g., based on distance to a PE).

図１１８は、ヘテロジニアススケジューラにより実行され得るスレッド移動のための例示的な方法を示す。１１８０１において、第１のスレッドは、共有メモリ空間内の第１のタイプのメモリを用いて、コア又はアクセラレータなどの第１の処理要素（「ＰＥ」）上で実行されるよう指示される。例えば、図１１７の（Ａ）において、これは、スレッド１である。 Figure 118 illustrates an exemplary method for thread migration that may be performed by a heterogeneous scheduler. At 11801, a first thread is directed to run on a first processing element ("PE"), such as a core or accelerator, using a first type of memory in a shared memory space. For example, in Figure 117(A), this is thread 1.

後のいくつかの時点で、第２のスレッドを実行する要求が１１８０３において受信される。例えば、アプリケーション、ＯＳなどは、実行されるハードウェアスレッドを要求する。 At some later point in time, a request to execute the second thread is received at 11803. For example, an application, OS, etc. requests a hardware thread to be executed.

１１８０５において、共有アドレス空間内の第１のタイプのメモリを用いる第２のＰＥで第２のスレッドが実行されるべきとの判断が行われる。例えば、第２のスレッドは、第１のタイプのメモリに直接結合されるアクセラレータ上で実行され、当該実行（第１のスレッドが使用しているメモリを解放することを含む）は、第２のスレッドに第２のタイプのメモリを使用させるよりも効率的である。 At 11805, a determination is made that the second thread should be executed on a second PE using a first type of memory in a shared address space. For example, the second thread may be executed on an accelerator that is directly coupled to the first type of memory, and such execution (including freeing the memory used by the first thread) is more efficient than having the second thread use the second type of memory.

In some embodiments, at 11807, the data of the first thread is moved from the first type of memory to the second type of memory. This move does not necessarily occur if it is more efficient to simply halt execution of the first thread and start another thread in its place.

１１８０９において、第１のスレッドと関連付けられたトランスレーションルックアサイドバッファ（ＴＬＢ）エントリが無効にされる。さらに、最も多くの実施形態では、データのフラッシュが実行される。 At 11809, the translation lookaside buffer (TLB) entry associated with the first thread is invalidated. Additionally, in most embodiments, a data flush is performed.

１１８１１において、第２のスレッドは、第２のＰＥに向けられ、第１のスレッドに対して前に割り当てられていた第１のタイプのメモリ内のアドレスの範囲に割り当てられる。 At 11811, the second thread is directed to the second PE and assigned to the range of addresses in the first type of memory previously assigned to the first thread.
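
The migration steps 11807 through 11811 can be sketched as bookkeeping on a table of allocations. The `SharedAddressSpace` structure, the string memory-type labels, and the set used to model valid TLB entries are all assumptions made for illustration; real data movement and cache flushing are not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

AddressRange = Tuple[int, int]  # inclusive (start, end)


@dataclass
class SharedAddressSpace:
    # thread name -> (memory type, address range) currently assigned to it
    allocations: Dict[str, Tuple[str, AddressRange]] = field(default_factory=dict)
    # threads that still have valid TLB entries
    tlb_valid: Set[str] = field(default_factory=set)


def migrate(space: SharedAddressSpace, first: str, second: str,
            second_type: str, new_range: Optional[AddressRange]) -> None:
    """Hand the first thread's range to the second thread."""
    mem_type, old_range = space.allocations.pop(first)
    if new_range is not None:
        # 11807 (optional): the first thread's data moves to the other type.
        space.allocations[first] = (second_type, new_range)
    # 11809: invalidate the TLB entries associated with the first thread.
    space.tlb_valid.discard(first)
    # 11811: the second thread receives the previously assigned range.
    space.allocations[second] = (mem_type, old_range)
```

Passing `new_range=None` models the case in which the first thread is simply halted rather than relocated.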

図３は、ヘテロジニアススケジューラ３０１の例示的な実施例を示す。いくつかの例において、スケジューラ３０１は、ランタイムシステムの一部である。図示されるように、プログラムフェーズ検出器３１３は、コードフラグメントを受信し、対応するプログラムフェーズの実行が、直列、データ並列又はスレッド並列として最良の特徴であるか否かを判断するために、コードフラグメントの１又は複数の特性を識別する。これが判断される方法の例が以下に詳細に説明される。図１に関して詳細に説明したように、コードフラグメントは、任意の数のソースコード表現の形式であってよい。 Figure 3 illustrates an exemplary embodiment of a heterogeneous scheduler 301. In some examples, scheduler 301 is part of a runtime system. As shown, a program phase detector 313 receives a code fragment and identifies one or more characteristics of the code fragment to determine whether execution of the corresponding program phase is best characterized as serial, data parallel, or thread parallel. Examples of how this is determined are described in detail below. As described in detail with respect to Figure 1, the code fragment may be in the form of any number of source code representations.

反復的コードフラグメントについて、パターンマッチャ３１１は、この「ホット」コードを識別し、さらに、いくつかの例においては、コードフラグメントと関連付けられたワークロードが、異なる処理要素で処理するためにより適し得ることを示す対応する特性も識別する。パターンマッチャ３１１及びその動作に関するさらなる詳細は、例えば、図２０の文脈において以下で説明される。 For repetitive code fragments, pattern matcher 311 identifies this "hot" code and, in some instances, also identifies corresponding characteristics that indicate that the workload associated with the code fragment may be better suited for processing on a different processing element. Further details regarding pattern matcher 311 and its operation are described below, e.g., in the context of FIG. 20.

セレクタ３０９は、処理要素の特性と、電源マネージャ３０７により提供された熱及び／又は電力情報とに少なくとも部分的に基づいて、受信したコードフラグメントのネイティブ表現を実行するターゲット処理要素を選択する。当該ターゲット処理要素の選択は、コードフラグメントに対して最も適合するもの（すなわち、ワークロード特性と処理要素機能との間のマッチ）をできるだけ簡単に選択し得るが、システムの現在の電力消費レベル（例えば、電源マネージャ３０７により提供され得る場合）、処理要素の可用性、一方のタイプのメモリから他方へ移動するデータ量（及び、そのように行うことに対して関連付けられた不利益）などを考慮してもよい。いくつかの実施形態において、セレクタ３０９は、ハードウェア回路内に実装される、又は、ハードウェア回路により実行される有限ステートマシンである。 The selector 309 selects a target processing element to execute a native representation of the received code fragment based at least in part on the characteristics of the processing element and the thermal and/or power information provided by the power manager 307. The selection of the target processing element may be as simple as selecting the best match for the code fragment (i.e., the match between the workload characteristics and the processing element capabilities), but may also take into account the current power consumption level of the system (e.g., if provided by the power manager 307), the availability of processing elements, the amount of data to be moved from one type of memory to another (and the associated penalty for doing so), etc. In some embodiments, the selector 309 is a finite state machine implemented in or executed by a hardware circuit.

いくつかの実施形態において、セレクタ３０９は、ターゲット処理要素と通信するために、対応するリンクプロトコルも選択する。例えば、いくつかの実施例において、処理要素は、システムファブリック又はポイントツーポイント相互接続に関する複数のプロトコルを動的に多重化又はカプセル化することが可能な対応する共通のリンクインタフェースを利用する。例えば、特定の実施例において、サポートされるプロトコルは、１）１又は複数の独自又は業界標準（例えば、ＰＣＩエクスプレス仕様又は同等の代替手段など）において規定され得るように、デバイス発見、デバイス構成、エラー報告、割込み、ＤＭＡスタイルのデータ転送及び様々なサービスを可能にする生産者／消費者、発見、構成、割込み（ＰＤＣＩ）プロトコル、２）デバイスが、コヒーレントな読み出し及び書き込み要求を処理要素に発行することを可能にするキャッシングエージェントコヒーレンス（ＣＡＣ）プロトコル、及び、３）処理要素が、別の処理要素のローカルメモリにアクセスすることを可能にするメモリアクセス（ＭＡ）プロトコルを含む。セレクタ３０９は、処理要素に通信される要求のタイプに基づいて、これらのプロトコル間の選択を行う。例えば、生産者／消費者、発見、構成又は割込み要求は、ＰＤＣＩプロトコルを用い、キャッシュコヒーレンス要求は、ＣＡＣプロトコルを用い、ローカルメモリアクセス要求は、ＭＡプロトコルを用いる。 In some embodiments, the selector 309 also selects a corresponding link protocol to communicate with the target processing element. For example, in some embodiments, the processing elements utilize a corresponding common link interface that can dynamically multiplex or encapsulate multiple protocols over a system fabric or point-to-point interconnect. For example, in certain embodiments, the supported protocols include: 1) a Producer/Consumer, Discover, Configure, Interrupt (PDCI) protocol that enables device discovery, device configuration, error reporting, interrupts, DMA-style data transfers, and various services, as may be defined in one or more proprietary or industry standards (e.g., PCI Express specifications or equivalent alternatives, etc.); 2) a Caching Agent Coherence (CAC) protocol that enables devices to issue coherent read and write requests to the processing elements; and 3) a Memory Access (MA) protocol that enables a processing element to access the local memory of another processing element. The selector 309 selects between these protocols based on the type of request communicated to the processing element. For example, producer/consumer, discovery, configuration or interrupt requests use the PDCI protocol, cache coherence requests use the CAC protocol, and local memory access requests use the MA protocol.
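
The request-type-to-protocol rule stated above reduces to a table lookup, sketched here. The request-type strings are hypothetical labels chosen for the example; only the PDCI, CAC, and MA pairings come from the text.

```python
# Request type -> link protocol, following the pairing described in the text.
PROTOCOL_BY_REQUEST = {
    "producer_consumer": "PDCI",
    "discovery": "PDCI",
    "configuration": "PDCI",
    "interrupt": "PDCI",
    "cache_coherence": "CAC",
    "local_memory_access": "MA",
}


def select_link_protocol(request_type: str) -> str:
    """Return the protocol to multiplex onto the common link interface."""
    try:
        return PROTOCOL_BY_REQUEST[request_type]
    except KeyError:
        raise ValueError(f"unknown request type: {request_type}") from None
```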

In some implementations, a thread includes markers that indicate phase types, and as such a phase detector is not utilized. In some implementations, a thread includes hints or explicit requests for a processing element type, link protocol, and/or memory type. In these implementations, the selector 309 uses this information in its selection process. For example, a selection made by the selector 309 may be overridden by the thread and/or a user.

Depending on the implementation, the heterogeneous scheduler may include one or more converters that process the received code fragment and generate a corresponding native encoding for the target processing element. For example, the heterogeneous scheduler may include a translator that converts machine code of a first type into machine code of a second type, and/or a JIT compiler that converts an intermediate representation into a format native to the target processing element. Alternatively or in addition, the heterogeneous scheduler may include a pattern matcher that identifies recurring code fragments (i.e., "hot" code) and caches one or more native encodings of the code fragment or of the corresponding micro-operations. Each of these optional components is illustrated in FIG. 3. In particular, the heterogeneous scheduler 301 includes a binary translator 303 and a JIT compiler 305. When the heterogeneous scheduler 301 operates on object code or an intermediate representation, the JIT compiler 305 is invoked to convert the received code fragment into a format native to one or more of the target processing elements 103, 105, 107, 109. When the heterogeneous scheduler 301 operates on machine code (binary), e.g., when translating from one instruction set to another, the binary translator 303 converts the received code fragment into machine code native to one or more of the target processing elements. In alternative embodiments, the heterogeneous scheduler 301 may omit one or more of these components.
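
The choice between the JIT compiler and the binary translator can be sketched as below. The `kind` labels, the ISA strings, and the decision to raise an error when the needed component is absent are assumptions made for this example.

```python
from typing import Optional


def choose_converter(kind: str, source_isa: Optional[str], target_isa: str,
                     has_jit: bool = True,
                     has_translator: bool = True) -> Optional[str]:
    """Name the component that yields a native encoding, or None if none is needed."""
    if kind == "machine_code":
        if source_isa == target_isa:
            return None  # already native to the target processing element
        if has_translator:
            return "binary_translator"
    elif kind in ("object_code", "intermediate_representation"):
        if has_jit:
            return "jit_compiler"
    # Assumption: with the component omitted, the fragment cannot be converted here.
    raise LookupError(f"no converter for {kind} targeting {target_isa}")
```

The error path corresponds to the embodiments that omit a component, where the program itself has to supply code in a suitable format.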

例えば、いくつかの実施形態では、バイナリトランスレータは含まれていない。これは、スケジューラにこれを対処させる代わりに、潜在的に利用可能なアクセラレータ、コアなどをプログラムが考慮する必要があるので、プログラミングの複雑性が増すという結果をもたらし得る。例えば、プログラムは、異なるフォーマットにおけるルーチンのためのコードを含む必要があるかもしれない。しかしながら、いくつかの実施形態において、バイナリトランスレータがない場合、より高いレベルでコードを受け入れるＪＩＴコンパイラがあり、当該ＪＩＴコンパイラが必要な変換を実行する。パターンマッチャが存在する場合、特定の処理要素で実行されるべきコードを発見するためにホットコードがさらに検出されてよい。 For example, in some embodiments, a binary translator is not included. This may result in increased programming complexity as the program must take into account potentially available accelerators, cores, etc., instead of letting the scheduler handle this. For example, the program may need to include code for routines in different formats. However, in some embodiments, where there is no binary translator, there is a JIT compiler that accepts code at a higher level, and the JIT compiler performs the necessary translations. If a pattern matcher is present, hot code may be further detected to find code that should be executed on a particular processing element.

例えば、いくつかの実施形態では、ＪＩＴコンパイラは含まれていない。これはまた、スケジューラにこれを対処させる代わりに、プログラムがまず特定のＩＳＡ用のマシンコードにコンパイルする必要があるので、プログラミングの複雑性が増すという結果をもたらし得る。しかしながら、いくつかの実施形態において、バイナリトランスレータがあり、かつ、ＪＩＴコンパイラがない場合、スケジューラは、以下で詳細に説明するように、ＩＳＡ間で変換してよい。パターンマッチャが存在する場合、特定の処理要素で実行されるべきコードを発見するためにホットコードがさらに検出されてよい。 For example, in some embodiments, a JIT compiler is not included. This may also result in increased programming complexity as the program must first be compiled into machine code for a particular ISA instead of having the scheduler handle this. However, in some embodiments, if there is a binary translator and there is no JIT compiler, the scheduler may translate between ISAs, as described in more detail below. If a pattern matcher is present, hot code may be further detected to find code that should be executed on a particular processing element.

For example, in some embodiments, a pattern matcher is not included. This may also reduce efficiency, since code that could have been moved is more likely to remain on a core that is less efficient for the particular task being executed.

いくつかの実施形態では、バイナリトランスレータ、ＪＩＴコンパイラ又はパターンマッチャがない。これらの実施形態では、スレッドを移動させるフェーズ検出又は明示的な要求のみが、スレッド／処理要素割り当て／移行に利用される。 In some embodiments, there is no binary translator, JIT compiler, or pattern matcher. In these embodiments, only phase detection or explicit requests to move threads are used for thread/processing element allocation/migration.

図１～図３を再び参照すると、ヘテロジニアススケジューラ１０１は、ハードウェア（例えば、回路）、ソフトウェア（例えば、実行可能なプログラムコード）又はこれらの任意の組み合わせで実装されてよい。図１１４は、ハードウェアヘテロジニアススケジューラ回路及びメモリとのそのインタラクションの例を示す。ヘテロジニアススケジューラは、限定されることはないが、フィールドプログラマブルゲートアレイ（ＦＰＧＡ）ベース又は特定用途向け集積回路（ＡＳＩＣ）ベースのステートマシンとして、本明細書で詳細に説明される機能を提供するソフトウェアを内部に格納するメモリに結合される埋め込み型マイクロコントローラ、他のサブコンポーネントを有する論理回路（例えば、データハザード検出回路など）として、及び／又は、アウトオブオーダコアにより実行されるソフトウェア（例えば、ステートマシン）として、スカラコアにより実行されるソフトウェア（例えば、ステートマシン）として、ＳＩＭＤコアにより実行されるソフトウェア（例えば、ステートマシン）又はこれらの組み合わせとして含む多くの異なる様式で作成されてよい。図示された例では、ヘテロジニアススケジューラは、様々な機能を実行する１又は複数のコンポーネントを含む回路１１４０１である。いくつかの実施形態において、この回路１１４０１は、プロセッサコア１１４１９の一部であるが、チップセットの一部であってもよい。 Referring again to FIGS. 1-3, the heterogeneous scheduler 101 may be implemented in hardware (e.g., circuitry), software (e.g., executable program code), or any combination thereof. FIG. 114 illustrates an example of a hardware heterogeneous scheduler circuit and its interaction with memory. The heterogeneous scheduler may be created in many different ways, including, but not limited to, as a field programmable gate array (FPGA)-based or application specific integrated circuit (ASIC)-based state machine, as an embedded microcontroller coupled to a memory that stores therein software that provides the functionality described in detail herein, as a logic circuit having other subcomponents (e.g., data hazard detection circuitry, etc.), and/or as software (e.g., state machine) executed by an out-of-order core, as software (e.g., state machine) executed by a scalar core, as software (e.g., state machine) executed by a SIMD core, or combinations thereof. In the illustrated example, the heterogeneous scheduler is a circuit 11401 that includes one or more components that perform various functions. In some embodiments, this circuit 11401 is part of a processor core 11419, but may also be part of a chipset.

スレッド／処理要素（ＰＥ）トラッカー１１４０３は、システム及び各ＰＥで実行するスレッドごとにステータス（例えば、ＰＥの可用性、その現在の電力消費など）を維持する。例えば、トラッカー１１４０３は、テーブルなどのデータ構造において、アクティブ、アイドル又はインアクティブのステータスを維持する。 The thread/processing element (PE) tracker 11403 maintains a status for each thread executing in the system and each PE (e.g., the availability of the PE, its current power consumption, etc.). For example, the tracker 11403 maintains an active, idle, or inactive status in a data structure such as a table.
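
The status table kept by the tracker might look like the following sketch. The field names, the power unit, and the rule that an idle PE counts as available are assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Status(Enum):
    ACTIVE = "active"
    IDLE = "idle"
    INACTIVE = "inactive"


@dataclass
class PEEntry:
    status: Status
    power_watts: float  # current power consumption (assumed unit)


class Tracker:
    """Table of per-PE and per-thread status."""

    def __init__(self) -> None:
        self.pes: Dict[str, PEEntry] = {}
        self.threads: Dict[int, Status] = {}

    def update_pe(self, pe: str, status: Status, power_watts: float) -> None:
        self.pes[pe] = PEEntry(status, power_watts)

    def update_thread(self, thread_id: int, status: Status) -> None:
        self.threads[thread_id] = status

    def available_pes(self) -> List[str]:
        # Assumption: a PE is available for scheduling when it is idle.
        return sorted(pe for pe, e in self.pes.items() if e.status is Status.IDLE)
```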

いくつかの実施形態において、パターンマッチャ１１４０５は、「ホット」コード、アクセラレータコード、及び／又は、ＰＥ割り当てを要求するコードを識別する。このマッチングに関するさらなる詳細が後で提供される。 In some embodiments, the pattern matcher 11405 identifies "hot" code, accelerator code, and/or code that requires PE allocation. More details regarding this matching are provided later.

ＰＥ情報１１４１１は、どのようなＰＥ（及びこれらのタイプ）がシステムにあり、何がＯＳなどによりスケジューリングされ得るかに関する情報を格納する。 PE information 11411 stores information about what PEs (and their types) are in the system and what can be scheduled by the OS etc.

上記では、ヘテロジニアススケジューラ回路１１４０１内の別々のコンポーネントとして詳細に説明されているが、一方、コンポーネントは、組み合わせられてよい、及び／又は、ヘテロジニアススケジューラ回路１１４０１の外部に移動されてもよい。 While detailed above as separate components within the heterogeneous scheduler circuit 11401, the components may be combined and/or moved outside of the heterogeneous scheduler circuit 11401.

ヘテロジニアススケジューラ回路１１４０１に結合されるメモリ１１４１３は、追加の機能を提供する（コア及び／又はヘテロジニアススケジューラ回路１１４０１により）実行されるソフトウェアを含んでよい。例えば、ソフトウェアパターンマッチャ１１４１７は、「ホット」コード、アクセラレータコード、及び／又は、ＰＥ割り当てを要求するコードを識別するために用いられてよい。例えば、ソフトウェアパターンマッチャ１１４１７は、コードシーケンスを、メモリに格納されたパターンの予め決定されたセットと比較する。メモリは、ある命令セットから他の命令セットに（例えば、１つの命令設定からアクセラレータベースの命令又はプリミティブに）コードを変換する変換器を格納してもよい。 Memory 11413 coupled to heterogeneous scheduler circuitry 11401 may include software executed (by the cores and/or heterogeneous scheduler circuitry 11401) that provides additional functionality. For example, software pattern matcher 11417 may be used to identify "hot" code, accelerator code, and/or code that requires PE allocation. For example, software pattern matcher 11417 compares code sequences to a predetermined set of patterns stored in memory. The memory may store converters that convert code from one instruction set to another (e.g., from one instruction set to accelerator-based instructions or primitives).
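
Comparing a code sequence against a predetermined set of stored patterns can be sketched as a sliding-window match. The opcode names and the single pattern below are hypothetical; the text does not specify what the stored patterns look like.

```python
from typing import Dict, List, Tuple

# Predetermined patterns -> tag describing what a match suggests.
# The multiply-accumulate sequence is a made-up example.
PATTERNS: Dict[Tuple[str, ...], str] = {
    ("load", "mul", "add", "store"): "accelerator",
}


def find_matches(code: List[str],
                 patterns: Dict[Tuple[str, ...], str]) -> List[Tuple[int, str]]:
    """Return (start index, tag) for every place a stored pattern occurs."""
    hits: List[Tuple[int, str]] = []
    for start in range(len(code)):
        for pattern, tag in patterns.items():
            if tuple(code[start:start + len(pattern)]) == pattern:
                hits.append((start, tag))
    return hits
```

For instance, `find_matches(["cmp", "load", "mul", "add", "store", "jmp"], PATTERNS)` reports one match starting at index 1.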

These components feed a selector 11411, which selects the PE that is to execute a thread, the link protocol to use, what migration should occur if a thread is already executing on that PE, and so on. In some embodiments, the selector 11411 is a finite state machine implemented in, or executed by, hardware circuitry.

メモリ１１４１３は、例えば、いくつかの実施例において、１又は複数の変換器１１４１５（例えば、バイナリ、ＪＩＴコンパイラなど）が、選択されたＰＥのために異なるフォーマットにスレッドコードを変換すべく、メモリに格納されることを含んでもよい。 Memory 11413 may include, for example, in some embodiments, one or more converters 11415 (e.g., binary, JIT compilers, etc.) stored in the memory to convert thread code into a different format for a selected PE.

FIG. 115 illustrates an example of a software heterogeneous scheduler. The software heterogeneous scheduler may be created in many different ways, including, but not limited to: as a field programmable gate array (FPGA)-based or application specific integrated circuit (ASIC)-based state machine; as an embedded microcontroller coupled to a memory that stores software providing the functionality detailed herein; as logic circuitry comprising other subcomponents (e.g., data hazard detection circuitry, etc.); as software (e.g., a state machine) executed by an out-of-order core, by a scalar core, or by a SIMD core; or as a combination thereof. In the illustrated example, the software heterogeneous scheduler is stored in memory 11413. As such, memory 11413, which is coupled to processor core 11419, includes software that is executed (by a core) to schedule threads. In some embodiments, the software heterogeneous scheduler is part of the OS.

実装に応じて、コア内のスレッド／処理要素（ＰＥ）トラッカー１１４０３は、システム及び各ＰＥにおいて実行するスレッドごとにステータス（例えば、ＰＥの可用性、その現在の電力消費など）を維持する、又は、スレッド／ＰＥトラッカー１１５２１を用いてソフトウェアでこれが実行される。例えば、トラッカーは、テーブルなどのデータ構造において、アクティブ、アイドル又はインアクティブのステータスを維持する。 Depending on the implementation, a thread/processing element (PE) tracker 11403 in the core maintains a status for each thread executing in the system and each PE (e.g., the availability of the PE, its current power consumption, etc.), or this is performed in software using the thread/PE tracker 11521. For example, the tracker maintains the active, idle, or inactive status in a data structure such as a table.
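As one illustrative sketch, a software thread/PE tracker that maintains active, idle, or inactive status in a table may look like the following. The class names, the field names, and the choice to treat a PE as idle once no active thread remains on it are assumptions for illustration only.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    ACTIVE = "active"
    IDLE = "idle"
    INACTIVE = "inactive"


@dataclass
class PEEntry:
    pe_type: str                 # e.g. "ooo", "scalar", "simd", "accelerator"
    status: Status = Status.IDLE
    power_mw: float = 0.0        # current power consumption (illustrative unit)


class ThreadPETracker:
    """Table-backed status tracker for PEs and threads."""

    def __init__(self):
        self.pes = {}       # pe_id -> PEEntry
        self.threads = {}   # thread_id -> (Status, pe_id or None)

    def add_pe(self, pe_id, pe_type):
        self.pes[pe_id] = PEEntry(pe_type)

    def assign(self, thread_id, pe_id):
        self.pes[pe_id].status = Status.ACTIVE
        self.threads[thread_id] = (Status.ACTIVE, pe_id)

    def release(self, thread_id):
        _, pe_id = self.threads[thread_id]
        self.threads[thread_id] = (Status.INACTIVE, None)
        # The PE becomes idle only if no other active thread still uses it.
        still_used = any(p == pe_id for s, p in self.threads.values()
                         if s is Status.ACTIVE)
        if not still_used:
            self.pes[pe_id].status = Status.IDLE

    def available(self, pe_type):
        """Return the ids of idle PEs of the requested type, sorted."""
        return sorted(pid for pid, e in self.pes.items()
                      if e.pe_type == pe_type and e.status is Status.IDLE)
```

A selector could query `available("simd")` before choosing a destination for a data-parallel fragment.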

いくつかの実施形態において、パターンマッチャ１１４１７は、「ホット」コード、及び／又は、ＰＥ割り当てを要求するコードを識別する。このマッチングに関するさらなる詳細が後で提供される。 In some embodiments, pattern matcher 11417 identifies "hot" code and/or code that requires PE allocation. More details regarding this matching are provided later.

ＰＥ情報１１４０９及び／又は１１５０９は、どのようなＰＥがシステムにあり、何がＯＳなどによりスケジューリングされ得るかに関する情報を格納する。 PE information 11409 and/or 11509 stores information about what PEs are in the system and what can be scheduled by the OS etc.

ソフトウェアパターンマッチャ１１４１７は、「ホット」コード、アクセラレータコード、及び／又は、ＰＥ割り当てを要求するコードを識別するために用いられてよい。 Software pattern matcher 11417 may be used to identify "hot" code, accelerator code, and/or code that requires PE allocation.

The thread/PE tracker, the processing element information, and/or the pattern matches are fed to a selector 11411, which selects a PE to execute a thread and determines, for example, what link protocol to use and what migration should occur if a thread is already executing on that PE. In some embodiments, the selector 11411 is a finite state machine implemented and executed by the processor core 11419.

メモリ１１４１３は、例えば、いくつかの実施例において、１又は複数の変換器１１４１５（例えば、バイナリ、ＪＩＴコンパイラなど）が選択されたＰＥのために異なるフォーマットにスレッドコードを変換すべく、メモリに格納されることを含んでもよい。 Memory 11413 may include, for example, in some embodiments, one or more converters 11415 (e.g., binary, JIT compilers, etc.) stored in the memory to convert thread code into a different format for a selected PE.

動作中、ＯＳは、実行環境の抽象化を提示する（例えば、ヘテロジニアススケジューラ１０１、３０１など）ヘテロジニアススケジューラを利用して、処理対象のスレッドをスケジューリングし、このスレッドを実行する。 During operation, the OS utilizes a heterogeneous scheduler (e.g., heterogeneous scheduler 101, 301, etc.) that presents an abstraction of the execution environment to schedule and execute threads to be processed.

以下の表は、潜在的な抽象化特徴（すなわち、プログラムが何を参照するか）、潜在的な設計自由度及びアーキテクチャの最適化（すなわち、何がプログラマから隠れているか）、及び、抽象化における特定の特徴を提供するための潜在的な利益又は理由を要約したものである。
The following table summarizes potential abstraction features (i.e., what a program refers to), potential design freedoms and architectural optimizations (i.e., what is hidden from the programmer), and potential benefits or reasons for providing particular features in the abstraction.

いくつかの例示的な実装において、ヘテロジニアススケジューラは、他のハードウェア及びソフトウェアリソースとの組み合わせにおいて、すべてを実行し、すべてのプログラミング技術（例えば、コンパイラ、組み込み関数、アセンブリ、ライブラリ、ＪＩＴ、オフロード、デバイス）をサポートする完全なプログラミングモデルを提示する。他の例示的な実装は、他のプロセッサ開発企業、例えば、ＡＲＭホールディングス社、ＭＩＰＳ、ＩＢＭ又はこれらのラインセンシ若しくは採用者により提供されるものに適合する代替的な実行環境を提示する。 In some exemplary implementations, the heterogeneous scheduler presents a complete programming model that runs everything and supports all programming techniques (e.g., compilers, intrinsics, assembly, libraries, JIT, offload, device) in combination with other hardware and software resources. Other exemplary implementations present alternative execution environments that are compatible with those offered by other processor developers, e.g., ARM Holdings, Inc., MIPS, IBM, or their licensees or adopters.

図１１９は、詳細に上述されたように、抽象実行環境を提示するプロセッサのブロック図である。この例では、プロセッサ１１９０１は、いくつかの異なるコアタイプ、例えば、図１に詳細に説明されているものを含む。各（ワイド）ＳＩＭＤコア１１９０３は、密な算術のプリミティブをサポートする融合積和演算（ＦＭＡ）回路、独自のキャッシュ（例えば、Ｌ１及びＬ２）、特定用途実行回路及びスレッド状態用のストレージを含む。 Figure 119 is a block diagram of a processor presenting an abstract execution environment, as described in detail above. In this example, the processor 11901 includes several different core types, such as those described in detail in Figure 1. Each (wide) SIMD core 11903 includes fused multiply-add (FMA) circuitry supporting dense arithmetic primitives, its own caches (e.g., L1 and L2), special purpose execution circuitry, and storage for thread state.

各レイテンシ最適化（ＯＯＯ）コア１１９１３は、融合積和演算（ＦＭＡ）回路、独自のキャッシュ（例えば、Ｌ１及びＬ２）及びアウトオブオーダ実行回路を含む。 Each latency optimized (OOO) core 11913 includes fused multiply-add (FMA) circuitry, its own caches (e.g., L1 and L2) and out-of-order execution circuitry.

各スカラコア１１９０５は、融合積和演算（ＦＭＡ）回路、独自のキャッシュ（例えば、Ｌ１及びＬ２）、特定用途実行及びスレッド状態の格納を含む。典型的には、スカラコア１１９０５は、メモリレイテンシをカバーするのに十分なスレッドをサポートする。いくつかの実施例において、ＳＩＭＤコア１１９０３及びレイテンシ最適化コア１１９１３の数は、スカラコア１１９０５の数と比較して少ない。 Each scalar core 11905 includes fused multiply-add (FMA) circuitry, its own caches (e.g., L1 and L2), special purpose execution and thread state storage. Typically, a scalar core 11905 supports enough threads to cover memory latencies. In some embodiments, the number of SIMD cores 11903 and latency-optimized cores 11913 is small compared to the number of scalar cores 11905.

いくつかの実施形態において、１又は複数のアクセラレータ１１９１７が含まれる。これらのアクセラレータ１１９１７は、固定機能又はＦＰＧＡベースであってよい。これらのアクセラレータ１１９１７とは代替的に又はこれらに加えて、いくつかの実施形態では、アクセラレータ１１９１７は、プロセッサの外部にある。 In some embodiments, one or more accelerators 11917 are included. These accelerators 11917 may be fixed function or FPGA based. Alternatively or in addition to these accelerators 11917, in some embodiments, the accelerators 11917 are external to the processor.

プロセッサ１１９０１はまた、プロセッサ内にあるコア及び潜在的に任意のアクセラレータにより共有されるラストレベルキャッシュ（ＬＬＣ）１１９０７を含む。いくつかの実施形態において、ＬＬＣ１１９０７は、高速アトミック用の回路を含む。 The processor 11901 also includes a last level cache (LLC) 11907 that is shared by the cores within the processor and potentially any accelerators. In some embodiments, the LLC 11907 includes circuitry for high speed atomics.

１又は複数の相互接続１１９１５は、コア及びアクセラレータを互いに、及び、外部インタフェースに結合する。例えば、いくつかの実施形態では、メッシュ型の相互接続が様々なコアを結合する。 One or more interconnects 11915 couple the cores and accelerators to each other and to external interfaces. For example, in some embodiments, a mesh-type interconnect couples the various cores.

メモリコントローラ１１９０９は、コア及び／又はアクセラレータをメモリに結合する。 The memory controller 11909 couples the cores and/or accelerators to memory.

複数の入力／出力インタフェース（例えば、以下で詳細に説明されるＰＣＩｅ、共通のリンク）１１９１１は、プロセッサ１１９０１を外部デバイス、例えば、他のプロセッサ及びアクセラレータに接続する。 Multiple input/output interfaces (e.g., PCIe, a common link, described in more detail below) 11911 connect the processor 11901 to external devices, e.g., other processors and accelerators.

FIG. 4 illustrates an embodiment of system boot and device discovery for a computer system. The heterogeneous scheduler utilizes knowledge about the system, including, for example, what cores are available, how much memory is available, and what memory locations are associated with the cores. In some embodiments, this knowledge is built using the Advanced Configuration and Power Interface (ACPI).

４０１において、コンピュータシステムがブートされる。 At 401, the computer system is booted.

４０３において、構成設定のクエリが行われる。例えば、いくつかのＢＩＯＳベースのシステムでは、ブートされたときに、ＢＩＯＳは、システムの動作確認を行い、ドライブ及び他の構成設定を独自のメモリバンクにクエリすることにより、オペレーション用のコンピュータを準備する。 At 403, configuration settings are queried. For example, in some BIOS-based systems, when booted, the BIOS performs a system check and prepares the computer for operation by querying its own memory banks for drives and other configuration settings.

４０５において、プラグインコンポーネントの探索が行われる。例えば、ＢＩＯＳは、コンピュータ内の任意のプラグインコンポーネントを探索し、メモリ内のポインタ（割込みベクトル）をセットアップしてそれらのルーチンにアクセスする。ＢＩＯＳは、デバイスドライバ並びにアプリケーションプログラムから、ハードウェア及び他の周辺デバイスとのインタフェース接続に関する要求を受け入れる。 At 405, a search for plug-in components occurs. For example, the BIOS searches for any plug-in components in the computer and sets up pointers (interrupt vectors) in memory to access their routines. The BIOS accepts requests from device drivers and application programs for interfacing with hardware and other peripheral devices.

４０７において、システムコンポーネント（例えば、コア、メモリなど）のデータ構造が生成される。例えば、ＢＩＯＳは、典型的には、ＯＳが付属デバイスとインタフェースするハードウェアデバイス及び周辺デバイス構成情報を生成する。さらに、ＡＣＰＩは、システムボードに対する柔軟でスケーラブルなハードウェアインタフェースを定義し、コンピュータが、特に、ノートブックコンピュータなどのポータブルデバイスにおいて、電源管理を改善するために、その周辺機器をオン及びオフすることを可能にする。ＡＣＰＩ仕様は、ハードウェアインタフェースと、ソフトウェアインタフェース（ＡＰＩ）と、実装される場合、ＯＳ指向構成及び電源管理をサポートするデータ構造とを含む。ソフトウェアの設計者は、ＡＣＰＩを用いて、ハードウェア、オペレーティングシステム及びアプリケーションソフトウェアを含むコンピュータシステム全体の電源管理機能を統合できる。この統合は、どのデバイスがアクティブであり、コンピュータサブシステム及び周辺機器に対する電源管理リソースのすべてを処理するかをＯＳが判断することを可能にする。 At 407, data structures for system components (e.g., cores, memory, etc.) are generated. For example, the BIOS typically generates hardware device and peripheral device configuration information with which the OS interfaces with attached devices. Additionally, ACPI defines a flexible and scalable hardware interface to the system board, allowing the computer to turn its peripheral devices on and off for improved power management, especially in portable devices such as notebook computers. The ACPI specification includes hardware interfaces, software interfaces (APIs), and data structures that, if implemented, support OS-directed configuration and power management. Software designers can use ACPI to integrate power management functions for the entire computer system, including hardware, operating system, and application software. This integration allows the OS to determine which devices are active and handle all of the power management resources for computer subsystems and peripheral devices.

At 409, the operating system (OS) is loaded and gains control. For example, once the BIOS has completed its startup routines, it passes control to the OS. In the case of ACPI, the BIOS passes control of the computer to the OS and exports to the OS a data structure containing the ACPI namespace, which may be graphically represented as a tree. The namespace acts as a directory of the ACPI devices connected to the computer and includes objects that further define and provide status information for each ACPI device. Each node in the tree is associated with a device, while the nodes, subnodes, and leaves represent objects that, when evaluated by the OS, control the device and return specified information to the OS, as defined by the ACPI specification. The OS, or a driver accessed by the OS, may include a set of functions to enumerate and evaluate namespace objects. When the OS calls a function to return the value of an object in the ACPI namespace, the OS is said to evaluate that object.
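The enumeration and evaluation of namespace objects described above may be sketched, purely for illustration, with a toy tree. This is not the ACPI object model or any real ACPI API; the device name `ACC0` is hypothetical, and the `_STA` value is used only as an example of status information returned on evaluation.

```python
class NamespaceNode:
    """One node in a toy namespace tree (not the real ACPI object model)."""

    def __init__(self, name, evaluate=None):
        self.name = name
        self.children = {}          # child name -> NamespaceNode
        self._evaluate = evaluate   # callable returning the object's value

    def add(self, child):
        self.children[child.name] = child
        return child

    def evaluate(self):
        """Return the object's value, or None if it has nothing to evaluate."""
        return self._evaluate() if self._evaluate else None


def enumerate_objects(node, prefix=""):
    """Yield (path, node) pairs in depth-first order."""
    path = f"{prefix}.{node.name}" if prefix else node.name
    yield path, node
    for child in node.children.values():
        yield from enumerate_objects(child, path)


# Build a small tree: a bus scope, one hypothetical accelerator, one status object.
root = NamespaceNode("_SB")
acc = root.add(NamespaceNode("ACC0"))
acc.add(NamespaceNode("_STA", evaluate=lambda: 0x0F))

for path, obj in enumerate_objects(root):
    print(path, obj.evaluate())
```

Walking the tree corresponds to enumeration; calling `evaluate()` on a node corresponds to the OS evaluating that object.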

In some instances, the available devices change. For example, an accelerator, memory, etc. is added. An embodiment of a method for post-system-boot device discovery is illustrated in FIG. 116. For example, embodiments of this method may be used to discover an accelerator that has been added to a system after boot. At 11601, an indication that a connected device has been powered on or reset is received. For example, an endpoint device is plugged into a PCIe slot, or is reset (e.g., by the OS).

１１６０３において、リンクトレーニングが接続されたデバイスを用いて実行され、接続されたデバイスが初期化される。例えば、ＰＣＩｅのリンクトレーニングは、リンク幅、レーン極性、及び／又は、サポートされる最大データレートなどのリンク構成パラメータを確立するために実行される。いくつかの実施形態において、接続されたデバイスの性能は、（例えば、ＡＣＰＩテーブルに）格納される。 At 11603, link training is performed with the connected device to initialize the connected device. For example, PCIe link training is performed to establish link configuration parameters such as link width, lane polarity, and/or maximum data rate supported. In some embodiments, the capabilities of the connected device are stored (e.g., in an ACPI table).

１１６０５において、接続されたデバイスの初期化が完了した場合、準備完了メッセージ（ｒｅａｄｙｍｅｓｓａｇｅ）が接続されたデバイスからシステムに送信される。 At 11605, when the initialization of the connected device is complete, a ready message is sent from the connected device to the system.

１１６０７において、デバイスが構成を準備できたことを示すために、接続されたデバイスの準備完了ステータスビットが設定される。 At 11607, the connected device's ready status bit is set to indicate that the device is ready for configuration.

１１６０９において、初期化された、接続されたデバイスが構成される。いくつかの実施形態において、デバイス及びＯＳは、デバイス用のアドレス（例えば、メモリマッピングされたＩ／Ｏ（ＭＭＩＯ）アドレス）について合意する。デバイスは、ベンダ識別番号（ＩＤ）、デバイスＩＤ、モデル番号、シリアル番号、特性、リソース要件などのうちの１又は複数を含むデバイス記述子を提供する。ＯＳは、記述子データ及びシステムリソースに基づいて、デバイスに対する追加の動作及び構成パラメータを判断してよい。ＯＳは、構成クエリを生成してよい。デバイスは、デバイス記述子に応答してよい。次に、ＯＳは、構成データを生成して、このデータを（例えば、ＰＣＩハードウェアを通じて）デバイスに送信する。これは、デバイスと関連付けられるアドレス空間を定義するベースアドレスレジスタの設定を含んでよい。 At 11609, the initialized, connected device is configured. In some embodiments, the device and the OS agree on an address for the device (e.g., a memory-mapped I/O (MMIO) address). The device provides a device descriptor that includes one or more of a vendor identification number (ID), a device ID, a model number, a serial number, characteristics, resource requirements, etc. The OS may determine additional operational and configuration parameters for the device based on the descriptor data and system resources. The OS may generate a configuration query. The device may respond with the device descriptor. The OS then generates configuration data and sends this data to the device (e.g., through PCI hardware). This may include setting a base address register that defines the address space associated with the device.
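The post-boot discovery sequence of 11601 through 11609 may be sketched as follows. This is a simplified model under stated assumptions, not PCIe or OS code: the `System` and `Device` classes, the lane counts, the MMIO base address, and the size-aligned address allocation policy are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Device:
    vendor_id: int
    device_id: int
    mmio_size: int               # resource requirement in bytes (power of two)
    max_lanes: int = 16
    ready: bool = False          # ready status bit (11607)
    bar: Optional[int] = None    # base address register value (11609)


class System:
    def __init__(self, mmio_base=0x9000_0000, slot_lanes=8):
        self.next_mmio = mmio_base
        self.slot_lanes = slot_lanes
        self.capabilities = {}   # stand-in for a stored capabilities table
        self.log = []

    def hot_add(self, dev):
        self.log.append("11601 indication received")
        # 11603: link training settles on a width both sides support.
        width = min(dev.max_lanes, self.slot_lanes)
        self.capabilities[(dev.vendor_id, dev.device_id)] = {"link_width": width}
        self.log.append("11603 link trained")
        self.log.append("11605 ready message")
        dev.ready = True
        self.log.append("11607 ready bit set")
        # 11609: pick a size-aligned MMIO window and program the BAR.
        size = dev.mmio_size
        base = (self.next_mmio + size - 1) // size * size
        dev.bar = base
        self.next_mmio = base + size
        self.log.append("11609 configured")
        return dev
```

Each log entry corresponds to one numbered step, so the ordering of the method can be checked by inspecting `System.log`.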

After system knowledge is built, the OS utilizes a heterogeneous scheduler (e.g., heterogeneous scheduler 101, 301, etc.) to schedule threads for processing and have them executed. The heterogeneous scheduler then maps the code fragments of each thread, dynamically and transparently (e.g., to a user and/or the OS), to the most suitable type of processing element, thereby potentially avoiding the need to build hardware for legacy architectural features and, potentially, the need to expose microarchitectural details to the system programmer or the OS.

In some examples, the most suitable type of processing element is determined based on the capabilities of the processing elements and the execution characteristics of the code fragment. In general, programs and their associated threads may have different execution characteristics depending on the workload being processed at a given point in time. Exemplary execution characteristics, or phases of execution, include, for example, data-parallel phases, thread-parallel phases, and serial phases. The following table identifies these phases and summarizes their characteristics. The table also includes example workloads/operations, exemplary hardware useful in processing each phase type, and a typical goal of the phase and the hardware used.

In some implementations, the heterogeneous scheduler is configured to choose between thread migration and emulation. In configurations where each type of processing element can process any type of workload (in some cases requiring emulation to do so), the most suitable processing element is selected for each program phase based on one or more criteria, including, for example, the latency requirements of the workload, the increased execution latency associated with emulation, and the power and thermal characteristics and constraints of the processing elements. As detailed later, in some implementations the selection of a suitable processing element is accomplished by considering the number of threads running and detecting the presence of SIMD instructions or vectorizable code in the code fragment.

Moving a thread between processing elements is not without penalty. For example, data may need to be moved from a shared cache into a lower-level cache, and both the original processing element and the recipient processing element will have their pipelines flushed to accommodate the move. Accordingly, in some implementations, the heterogeneous scheduler implements hysteresis to avoid too-frequent migrations (e.g., by setting threshold values for the one or more criteria referenced above, or for a subset of them). In some embodiments, hysteresis is implemented by limiting thread migrations so that they do not exceed a predefined rate (e.g., one migration per millisecond). The migration rate is thus limited to avoid excessive overload due to code generation, synchronization, and data migration.
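A rate limit of the kind described (e.g., one migration per millisecond) may be sketched as follows. Tracking the limit per thread rather than system-wide, and expressing time in integer microseconds, are assumptions of this sketch.

```python
class MigrationLimiter:
    """Allow a thread to migrate at most once per `min_interval_us`."""

    def __init__(self, min_interval_us=1000):   # 1000 us = one per millisecond
        self.min_interval_us = min_interval_us
        self.last = {}   # thread_id -> timestamp (us) of last allowed migration

    def may_migrate(self, thread_id, now_us):
        """Return True and record the migration if the rate limit permits it."""
        last = self.last.get(thread_id)
        if last is not None and now_us - last < self.min_interval_us:
            return False     # too soon: keep the thread where it is
        self.last[thread_id] = now_us
        return True
```

A denied request leaves the recorded timestamp unchanged, so a thread that keeps asking is still allowed to move once the interval since its last actual migration has elapsed.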

いくつかの実施形態において、例えば、特定のスレッドに対する好ましアプローチであるものとして、ヘテロジニアススケジューラにより移行が選択されない場合、ヘテロジニアススケジューラは、割り当てられた処理要素にスレッドのための欠落した機能をエミュレートする。例えば、オペレーティングシステムに対して利用可能なスレッドの総数を一定に維持する実施形態において、ヘテロジニアススケジューラは、（例えば、ワイド同時マルチスレッディングコアにおいて）利用可能なハードウェアスレッドの数がオーバサブスクライブされる場合に、マルチスレッディングをエミュレートしてよい。スカラ又はレイテンシコア上で、スレッドの１又は複数のＳＩＭＤ命令がスカラ命令に変換される、又は、ＳＩＭＤコア上で、より多くのスレッドがスポーンされ、及び／又は、命令が、パックドデータを利用するために変換される。 In some embodiments, for example, if migration is not selected by the heterogeneous scheduler as being the preferred approach for a particular thread, the heterogeneous scheduler emulates missing functionality for the thread on the assigned processing element. For example, in an embodiment that maintains a constant total number of threads available to the operating system, the heterogeneous scheduler may emulate multithreading when the number of available hardware threads is oversubscribed (e.g., in a wide simultaneous multithreading core). On a scalar or latency core, one or more SIMD instructions of the thread are converted to scalar instructions, or on a SIMD core, more threads are spawned and/or instructions are converted to utilize packed data.
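The conversion of a SIMD instruction into scalar instructions for execution on a scalar or latency-optimized core may be sketched as follows. The instruction tuples, the `VADD`/`VMUL` mnemonics, and the register-file representation are hypothetical and serve only to show the per-lane expansion.

```python
import operator

SCALAR_OPS = {"ADD": operator.add, "MUL": operator.mul}
PACKED_TO_SCALAR = {"VADD": "ADD", "VMUL": "MUL"}   # hypothetical mnemonics


def scalarize(instr, lanes):
    """Expand one packed instruction into one scalar instruction per lane."""
    op, dst, a, b = instr
    return [(PACKED_TO_SCALAR[op], (dst, i), (a, i), (b, i))
            for i in range(lanes)]


def run_scalar(program, regs):
    """Execute scalar instructions against a register file of lane lists."""
    for op, (dst, i), (a, j), (b, k) in program:
        regs[dst][i] = SCALAR_OPS[op](regs[a][j], regs[b][k])
    return regs


regs = {"v0": [0] * 4, "v1": [1, 2, 3, 4], "v2": [10, 20, 30, 40]}
program = scalarize(("VADD", "v0", "v1", "v2"), lanes=4)
run_scalar(program, regs)
```

The four scalar additions produce the same register contents that a single four-lane packed add would, at the cost of more instructions.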

FIG. 5 illustrates an example of thread migration based on a mapping of program phases to three types of processing elements. As shown, the three types of processing elements are latency-optimized (e.g., out-of-order cores, accelerators, etc.), scalar (processing one data item at a time per instruction), and SIMD (processing a plurality of data elements per instruction). Typically, this mapping is performed by the heterogeneous scheduler, on a per-thread or per-code-fragment basis, in a manner that is transparent to the programmer and the operating system.

一実施例では、ヘテロジニアススケジューラを用いて、ワークロードの各フェーズを最も適したタイプの処理要素にマッピングする。理想的には、これは、レガシ機能用のハードウェアを構築する必要性を軽減し、コンパイルコード（マシンコード）、組み込み関数（プロセッサ又はアクセラレータ命令に直接マッピングするプログラミング言語論理構成）、アセンブリコード、ライブラリ、中間（ＪＩＴベース）、オフロード（一方のマシンタイプから別のマシンタイプへの移動）及びデバイスに固有などの複数のコードタイプをサポートする完全なプログラミングモデルをヘテロジニアススケジューラが提示するマイクロアーキテクチャの詳細をさらすことを回避する。 In one embodiment, a heterogeneous scheduler is used to map each phase of a workload to the most appropriate type of processing element. Ideally, this alleviates the need to build hardware for legacy functions and avoids exposing micro-architectural details. The heterogeneous scheduler presents a complete programming model that supports multiple code types, including compiled code (machine code), intrinsics (programming language logical constructs that map directly to processor or accelerator instructions), assembly code, libraries, intermediate (JIT-based), offload (movement from one machine type to another), and device-specific.

特定の構成において、ターゲット処理要素に対するデフォルトの選択は、レイテンシが最適化される処理要素である。 In a particular configuration, the default selection for the target processing element is the processing element for which latency is optimized.

Referring again to FIG. 5, a serial phase of execution 501 for a workload is initially processed on one or more latency-optimized processing elements. Upon detecting a phase shift (e.g., dynamically, as the code becomes more data parallel, or in advance of execution, as seen, for example, from the types of instructions found in the code before or during execution), the workload is migrated to one or more SIMD processing elements to complete a data-parallel phase of execution 503. Additionally, execution schedules and/or translations are typically cached. Thereafter, the workload is migrated back to the one or more latency-optimized processing elements, or to a second set of one or more latency-optimized processing elements, to complete the next serial phase of execution 505. Next, the workload is migrated to one or more scalar cores to process a thread-parallel phase of execution 507. The workload is then migrated back to one or more latency-optimized processing elements to complete the next serial phase of execution 509.
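The sequence of FIG. 5 may be sketched as a simple mapping from program phase to processing element type, with a migration recorded whenever the target type changes. The phase labels and PE type names are illustrative, and phase detection itself is assumed to have happened elsewhere.

```python
# Phase-to-PE mapping as described for FIG. 5 (names are illustrative).
PHASE_TO_PE = {
    "serial": "latency_optimized",
    "data_parallel": "simd",
    "thread_parallel": "scalar",
}


def plan_migrations(phases):
    """Return (placements, migrations) for an ordered list of phases."""
    placements, migrations = [], []
    current = None
    for phase in phases:
        target = PHASE_TO_PE[phase]
        if current is not None and target != current:
            migrations.append((current, target))   # the workload moves here
        current = target
        placements.append((phase, target))
    return placements, migrations


# Phases 501, 503, 505, 507, 509 of FIG. 5.
fig5 = ["serial", "data_parallel", "serial", "thread_parallel", "serial"]
placements, migrations = plan_migrations(fig5)
```

For the five phases of FIG. 5 this yields four migrations, each of which incurs the migration costs discussed elsewhere in this description.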

この図示された例は、レイテンシが最適化されたコアへの復帰を示す一方、ヘテロジニアススケジューラは、スレッドが終了されるまで１又は複数の対応するタイプの処理要素において、任意の後続のフェーズの実行についての実行を継続してよい。いくつかの実施例では、処理要素は、ワークキューを利用して、完了していないタスクを格納する。その結果、タスクは、すぐに開始しなくてもよいが、キュー内のこれらのスポットが現れたときに実行される。 While this illustrated example shows a return to a latency-optimized core, the heterogeneous scheduler may continue execution of any subsequent phases of execution in one or more corresponding types of processing elements until the thread is terminated. In some embodiments, the processing elements utilize work queues to store incomplete tasks. As a result, tasks may not start immediately, but will be executed when their spot in the queue appears.

FIG. 6 is an exemplary implementation flow performed by a heterogeneous scheduler, such as heterogeneous scheduler 101. This flow illustrates the selection of a processing element (e.g., a core). As illustrated, a code fragment is received by the heterogeneous scheduler. In some embodiments, an event has occurred, including, but not limited to: a thread wake-up command; a write to a page directory base register; a sleep command; a phase change in the thread; and one or more instructions indicating a desired reallocation.

At 601, the heterogeneous scheduler determines whether there is parallelism in the code fragment (e.g., whether the code fragment is in a serial phase or a parallel phase), for example, based on detected data dependencies, instruction types, and/or control flow instructions. For example, a thread full of SIMD code would be considered parallel. If the code fragment is not amenable to parallel processing, the heterogeneous scheduler selects one or more latency-sensitive processing elements (e.g., OOO cores) to process the code fragment in a serial phase of execution 603. Typically, OOO cores have (deep) speculation and dynamic scheduling, and usually have lower performance per watt than simpler alternatives.

In some embodiments, no latency-sensitive processing elements are available, as they typically consume more power and die space than scalar cores. In these embodiments, only scalar, SIMD, and accelerator cores are available.

６０５において、並列コードフラグメント、並列化可能なコードフラグメント及び／又はベクトル化可能コードフラグメントに関し、ヘテロジニアススケジューラは、コードの並列性についてのタイプを判断する。６０７において、スレッド並列コードフラグメントに関し、ヘテロジニアススケジューラは、スレッド並列処理要素（例えば、マルチプロセッサスカラコア）を選択する。スレッド並列コードフラグメントは、別々のスカラコアで同時に実行され得る独立した命令シーケンスを含む。 At 605, for parallel, parallelizable, and/or vectorizable code fragments, the heterogeneous scheduler determines a type of code parallelism. At 607, for thread-parallel code fragments, the heterogeneous scheduler selects thread-parallel processing elements (e.g., multiprocessor scalar cores). Thread-parallel code fragments include independent instruction sequences that can be executed simultaneously on separate scalar cores.

Data-parallel code occurs where each processing element executes the same task on different pieces of data. Data-parallel code can come in different data layouts: packed and random. At 609, the data layout is determined. Random data may be assigned to SIMD processing elements, but this requires the use of gather instructions 613 to pull data from disparate memory locations, a spatial computing array 615 (which maps a computation spatially onto an array of small programmable processing elements, e.g., an array of FPGAs), or an array of scalar processing elements 617. At 611, packed data is assigned to SIMD processing elements or to processing elements that use dense arithmetic primitives.

いくつかの実施形態において、選択された宛先処理要素をより良く適合させるようにコードフラグメントの変換が実行される。例えば、コードフラグメントは、１）異なる命令セットを利用するために変換され、２）より多く並列化され、３）あまり並列化されず（直列化され）、４）データを並列化し（例えば、ベクトル化され）、及び／又は、５）データをあまり並列化しない（例えば、非ベクトル化される）。 In some embodiments, transformations are performed on the code fragment to better suit the selected destination processing element. For example, the code fragment may be 1) transformed to take advantage of a different instruction set, 2) made more parallel, 3) made less parallel (serialized), 4) made data parallel (e.g., vectorized), and/or 5) made less data parallel (e.g., non-vectorized).

処理要素が選択された後、コードフラグメントは、実行のために判断された処理要素のうちの１つに送信される。 After a processing element is selected, the code fragment is sent to one of the determined processing elements for execution.
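The selection flow of FIG. 6 (601 through 617) may be sketched as a small decision function. The inputs are assumed to have been determined already (whether the fragment is parallel, the type of parallelism, and the data layout), and the returned labels are illustrative names for the processing element types, not identifiers from the disclosure.

```python
def select_processing_elements(parallel, kind=None, layout=None):
    """Return candidate PE types for one code fragment (FIG. 6, 601-617)."""
    if not parallel:
        return ["ooo_core"]                       # 603: serial phase
    if kind == "thread":
        return ["scalar_cores"]                   # 607: thread-parallel code
    if kind == "data":
        if layout == "packed":
            return ["simd", "dense_arithmetic"]   # 611: packed data
        if layout == "random":
            # 613, 615, 617: the options named for random data layouts
            return ["simd_with_gather", "spatial_array", "scalar_array"]
    raise ValueError(f"unsupported fragment: kind={kind!r}, layout={layout!r}")
```

Returning a list of candidates rather than a single element leaves room for a later step to pick among them using availability or power information.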

図７は、ヘテロジニアススケジューラによるスレッド宛先選択のための方法についての例を示す。いくつかの実施形態において、この方法は、バイナリトランスレータにより実行される。７０１において、評価対象のスレッド又はこれらのコードフラグメントが受信される。いくつかの実施形態において、限定されることはないが、スレッドウェイクアップコマンド、ページディレクトリベースレジスタへの書き込み、スリープコマンド、スレッドのフェーズ変更及び所望の再割り当てを示す１又は複数の命令を含むイベントが発生する。 Figure 7 illustrates an example method for thread destination selection by a heterogeneous scheduler. In some embodiments, the method is performed by a binary translator. At 701, a thread to be evaluated or a code fragment thereof is received. In some embodiments, an event occurs including, but not limited to, a thread wakeup command, a write to a page directory base register, a sleep command, a phase change of a thread, and one or more instructions indicating a desired reallocation.

At 703, a determination is made as to whether the code fragment is to be offloaded to an accelerator, that is, whether the code fragment is to be sent to an accelerator. The heterogeneous scheduler may know that this is the correct action when the code includes code identifying a desire to use an accelerator. This desire may be an identifier indicating that a region of code may be executed on an accelerator or executed natively (e.g., ABEGIN/AEND described herein), or an explicit command to use a particular accelerator.

いくつかの実施形態では、７０５において、選択された宛先処理要素をより良く適合させるようにコードフラグメントの変換が実行される。例えば、コードフラグメントは、１）異なる命令セットを利用するために変換され、２）より多く並列化され、３）あまり並列化されず（直列化され）、４）データを並列化し（例えば、ベクトル化され）、及び／又は、５）データをあまり並列化しない（例えば、非ベクトル化される）。 In some embodiments, at 705, a transformation of the code fragment is performed to better suit the selected destination processing element. For example, the code fragment may be 1) transformed to take advantage of a different instruction set, 2) made more parallel, 3) made less parallel (serialized), 4) made data parallel (e.g., vectorized), and/or 5) made less data parallel (e.g., non-vectorized).

典型的には、７０７において、変換されたスレッドは、後の使用のためにキャッシュされる。いくつかの実施形態において、バイナリトランスレータは、将来におけるバイナリトランスレータの使用のために利用可能となるように変換されたスレッドをローカルにキャッシュする。例えば、コードが「ホット」になる（繰り返し実行される）場合、キャッシュは、（送信コストがあり得るが）変換の不利益なく将来の利用のためのメカニズムを提供する。 Typically, at 707, the translated threads are cached for later use. In some embodiments, the binary translator caches the translated threads locally so that they are available for future use by the binary translator. For example, if the code becomes "hot" (is executed repeatedly), caching provides a mechanism for future use without the penalty of translation (although there may be transmission costs).
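Caching of translated code so that "hot" code avoids repeated translation may be sketched as follows. The `translate` callable stands in for a binary translator or JIT compiler, and keying the cache on the pair of code bytes and target PE type is an assumption of this sketch.

```python
class TranslationCache:
    """Memoize translations keyed by (code bytes, target PE type)."""

    def __init__(self, translate):
        self.translate = translate   # callable(code, target) -> translated form
        self.cache = {}
        self.misses = 0              # number of times translation actually ran

    def get(self, code, target):
        key = (code, target)
        if key not in self.cache:
            self.misses += 1
            self.cache[key] = self.translate(code, target)
        return self.cache[key]


# Stand-in translator: tags the code with its target instead of translating it.
cache = TranslationCache(lambda code, target: f"{target}:{code.hex()}")
first = cache.get(b"\x01\x02", "simd")
second = cache.get(b"\x01\x02", "simd")   # served from the cache
```

Only the first request for a given fragment and target pays the translation cost; later requests for the same pair are lookups.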

７０９において、（変換された）スレッドは、処理のために宛先処理要素に送信される（例えば、オフロードされる）。いくつかの実施形態において、変換されたスレッドは、将来の利用のためにローカルに利用可能であるように受け手によりキャッシュされる。さらに、受け手又はバイナリトランスレータは、コードが「ホット」であると判断した場合、このキャッシングは、使用されるエネルギーが少ない状態でより高速な実行を可能にする。 At 709, the (translated) thread is sent (e.g., offloaded) to the destination processing element for processing. In some embodiments, the translated thread is cached by the recipient so that it is locally available for future use. Additionally, if the recipient or binary translator determines that the code is "hot," this caching allows for faster execution with less energy used.
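
The caching described at 707 and 709 can be pictured as a lookup keyed on the fragment and its destination. The following is a minimal sketch under stated assumptions, not the implementation of the disclosure: the class `TranslationCache`, the `translate` callback, the key format, and `toy_translate` are all hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, Tuple


class TranslationCache:
    """Hypothetical cache of translated fragments keyed by (fragment id, target ISA)."""

    def __init__(self, translate: Callable[[bytes, str], bytes]) -> None:
        self._translate = translate
        self._entries: Dict[Tuple[int, str], bytes] = {}
        self.translations = 0  # number of times translation actually ran

    def get(self, fragment_id: int, code: bytes, target_isa: str) -> bytes:
        key = (fragment_id, target_isa)
        if key not in self._entries:
            # Cache miss: pay the translation cost once (705), then store (707).
            self._entries[key] = self._translate(code, target_isa)
            self.translations += 1
        return self._entries[key]


def toy_translate(code: bytes, target_isa: str) -> bytes:
    """Stand-in for a real binary translator (assumption for illustration only)."""
    return target_isa.encode() + b":" + code


cache = TranslationCache(toy_translate)
first = cache.get(1, b"\x90\x90", "simd")
second = cache.get(1, b"\x90\x90", "simd")  # "hot" reuse: no second translation
```

When the same fragment is requested again for the same destination, only the lookup cost (and any transmission cost) remains.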

At 711, the heterogeneous scheduler determines whether there is parallelism in the code fragment (e.g., whether the code fragment is in a serial phase or a parallel phase) based on, for example, detected data dependencies, instruction types, and/or control flow instructions. For example, a thread full of SIMD code would be considered parallel. If the code fragment is not amenable to parallel processing, the heterogeneous scheduler selects one or more latency-sensitive processing elements (e.g., OOO cores) to process the code fragment in serial phase of execution 713. Typically, OOO cores have (deep) speculation and dynamic scheduling, and therefore may have better performance per watt compared to scalar alternatives.

７１５において、並列コードフラグメント、並列化可能なコードフラグメント及び／又はベクトル化可能コードフラグメントに関し、ヘテロジニアススケジューラは、コードの並列性についてのタイプを判断する。７１７において、スレッド並列コードフラグメントに関し、ヘテロジニアススケジューラは、スレッド並列処理要素（例えば、マルチプロセッサスカラコア）を選択する。スレッド並列コードフラグメントは、別々のスカラコアで同時に実行され得る独立した命令シーケンスを含む。 At 715, for parallel, parallelizable, and/or vectorizable code fragments, the heterogeneous scheduler determines a type of code parallelism. At 717, for thread-parallel code fragments, the heterogeneous scheduler selects a thread-parallel processing element (e.g., a multiprocessor scalar core). A thread-parallel code fragment includes independent instruction sequences that can be executed simultaneously on separate scalar cores.

Data parallel code occurs when each processing element performs the same task on different pieces of data. Data parallel code can come in different data layouts: packed and random. At 719, the data layout is determined. Random data may be assigned to SIMD processing elements, but this requires the use of gather instructions 723, a spatial computation array 725, or an array of scalar processing elements 727. Packed data is assigned, at 721, to SIMD processing elements or to processing elements that use dense arithmetic primitives.
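
The decision flow of 711 through 727 can be summarized as a small decision tree. The sketch below assumes that the kind of parallelism and the data layout have already been classified; how that classification is performed is not shown. The names `Parallelism`, `Layout`, `Fragment`, `select_destination`, and the returned strings are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Parallelism(Enum):
    NONE = auto()    # serial phase
    THREAD = auto()  # independent instruction sequences
    DATA = auto()    # same task on different pieces of data


class Layout(Enum):
    PACKED = auto()
    RANDOM = auto()


@dataclass
class Fragment:
    parallelism: Parallelism
    layout: Layout = Layout.PACKED  # only meaningful for data-parallel code


def select_destination(frag: Fragment) -> str:
    """Return a label for the kind of processing element to use."""
    if frag.parallelism is Parallelism.NONE:
        return "ooo_core"               # 713: latency-sensitive element
    if frag.parallelism is Parallelism.THREAD:
        return "scalar_cores"           # 717: thread-parallel elements
    if frag.layout is Layout.PACKED:
        return "simd_dense"             # 721: SIMD / dense arithmetic primitives
    # 723: SIMD with gather; 725 and 727 are alternatives not modeled here.
    return "simd_with_gather"
```

For example, `select_destination(Fragment(Parallelism.DATA, Layout.RANDOM))` selects the gather-based path, while a fragment with no parallelism goes to an OOO core.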

いくつかの実施形態において、判断される宛先処理要素をより良く適合させるようにオフロードされていないコードフラグメントの変換が実行される。例えば、コードフラグメントは、１）異なる命令セットを利用するために変換され、２）より多く並列化され、３）あまり並列化されず（直列化され）、４）データを並列化し（例えば、ベクトル化され）、及び／又は、５）データをあまり並列化しない（例えば、非ベクトル化される）。 In some embodiments, transformations of the non-offloaded code fragments are performed to better suit the determined destination processing element. For example, the code fragments are 1) transformed to utilize a different instruction set, 2) made more parallel, 3) made less parallel (serialized), 4) made data parallel (e.g., vectorized), and/or 5) made less data parallel (e.g., non-vectorized).

ＯＳは、コア及びアクセラレータがアクセス可能であることに関わらず、潜在的に利用可能なスレッドの総数を参照する。以下の説明では、論理ＩＤと呼ばれるスレッド識別子（ＩＤ）によって各スレッドが列挙される。いくつかの実施例において、オペレーティングシステム及び／又はヘテロジニアススケジューラは、論理ＩＤを利用して、特定の処理要素のタイプ（例えば、コアタイプ）、処理要素ＩＤ及びその処理要素上のスレッドＩＤ（例えば、コアタイプのタプル、コアＩＤ、スレッドＩＤ）にスレッドをマッピングする。例えば、スカラコアは、コアＩＤ及び１又は複数のスレッドＩＤを有し、ＳＩＭＤコアは、コアＩＤ及び１又は複数のスレッドＩＤを有し、ＯＯＯコアは、コアＩＤ及び１又は複数のスレッドＩＤを有し、及び／又は、アクセラレータは、コアＩＤ及び１又は複数のスレッドＩＤを有する。 The OS sees the total number of potentially available threads, regardless of which cores and accelerators are accessible. In the following description, each thread is enumerated by a thread identifier (ID), referred to as the logical ID. In some embodiments, the operating system and/or heterogeneous scheduler uses the logical ID to map a thread to a particular processing element type (e.g., core type), processing element ID, and thread ID on that processing element (e.g., a tuple of core type, core ID, thread ID). For example, a scalar core has a core ID and one or more thread IDs, a SIMD core has a core ID and one or more thread IDs, an OOO core has a core ID and one or more thread IDs, and/or an accelerator has a core ID and one or more thread IDs.

Figure 8 illustrates the concept of using striped mapping for logical IDs. Striped mapping may be used by a heterogeneous scheduler. In this example, there are eight logical IDs and three core types, each having one or more threads. Typically, the mapping from logical ID to (core ID, thread ID) is computed using division and modulo and may be fixed to preserve software thread affinity. The mapping from logical ID to (core type) is performed flexibly by the heterogeneous scheduler to accommodate future new core types that become accessible to the OS.

図９は、論理ＩＤに対する縞模様マッピングの使用についての例を示す。例では、論理ＩＤ１、４及び５が第１のコアタイプにマッピングされ、その他すべての論理ＩＤが第２のコアタイプにマッピングされる。第３のコアタイプは利用されていない。 Figure 9 shows an example of the use of striped mapping for logical IDs. In the example, logical IDs 1, 4, and 5 are mapped to a first core type, and all other logical IDs are mapped to a second core type. A third core type is not utilized.
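
As an illustration of the division-and-modulo mapping, the following sketch maps eight logical IDs to (core type, core ID, thread ID) tuples using the core-type assignment described for Figure 9. The value of `THREADS_PER_CORE` and the labels `"type1"` and `"type2"` are assumptions; the description does not state how many threads each core has.

```python
from typing import Dict, Tuple

THREADS_PER_CORE = 2  # assumption for illustration; not given in the description


def core_and_thread(logical_id: int, threads_per_core: int = THREADS_PER_CORE) -> Tuple[int, int]:
    """Fixed mapping: core ID by division, thread ID by modulo."""
    return divmod(logical_id, threads_per_core)


# Flexible part, owned by the heterogeneous scheduler. Mirrors the Figure 9
# example: logical IDs 1, 4, and 5 use the first core type, all others the second.
core_type_of: Dict[int, str] = {
    lid: ("type1" if lid in (1, 4, 5) else "type2") for lid in range(8)
}


def map_logical_id(logical_id: int) -> Tuple[str, int, int]:
    core_id, thread_id = core_and_thread(logical_id)
    return (core_type_of[logical_id], core_id, thread_id)
```

Only the `core_type_of` table would need to change to accommodate a new core type; the (core ID, thread ID) arithmetic stays fixed.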

いくつかの実施例では、コアタイプのグループ化が作成される。例えば、「コアグループ」タプルが、１つのＯＯＯタプル及びすべてのスカラ、ＳＩＭＤ、並びに、論理ＩＤが同じＯＯＯタプルにマッピングするアクセラレータコアタプルからなってよい。図１０は、コアグループの例を示す。典型的には、直列フェーズ検出及びスレッド移行が同じコアグループ内で実行される。 In some embodiments, groupings of core types are created. For example, a "core group" tuple may consist of one OOO tuple and all scalar, SIMD, and accelerator core tuples whose logical IDs map to the same OOO tuple. Figure 10 shows an example of a core group. Typically, serial phase detection and thread migration are performed within the same core group.

図１１は、バイナリトランスレータ切替メカニズムを利用するシステムにおけるスレッド実行の方法の例を示す。１１０１において、スレッドがコア上で実行される。コアは、アクセラレータを含む、本明細書で詳細に説明されるタイプのいずれかであってよい。 Figure 11 illustrates an example method of thread execution in a system utilizing a binary translator switching mechanism. At 1101, a thread executes on a core. The core may be any of the types described in detail herein, including an accelerator.

１１０３において、スレッドの実行中のいくつかの時点で、潜在的なコアの再割り当てイベントが発生する。例示的なコアの再割り当てイベントは、限定されることはないが、スレッドウェイクアップコマンド、ページディレクトリベースレジスタへの書き込み、スリープコマンド、スレッドのフェーズ変更及び異なるコアへの所望の再割り当てを示す１又は複数の命令を含む。 At 1103, at some point during the execution of the thread, a potential core reallocation event occurs. Exemplary core reallocation events include, but are not limited to, a thread wakeup command, a write to the page directory base register, a sleep command, a phase change of the thread, and one or more instructions indicating a desired reallocation to a different core.

At 1105, the event is processed and a determination is made as to whether there is to be a change in the core allocation. Exemplary methods for handling particular core allocations are described in detail below.

いくつかの実施形態において、コアの（再）割り当ては、移動率の制限及び電力消費の制限などの１又は複数の制限要因を対象とする。移動率の制限は、コアタイプ、コアＩＤ及びスレッドＩＤ毎に追跡される。一旦スレッドがターゲット（コアタイプ、コアＩＤ、スレッドＩＤ）に割り当てられると、タイマが開始されてバイナリトランスレータにより維持される。タイマが期限切れになるまで、同じターゲットに移行されるスレッドは他にない。その結果、スレッドは、タイマが期限切れになる前にその現在のコアから離れて移行してもよい一方、その逆は成り立たない。 In some embodiments, the (re)allocation of cores is subject to one or more limiting factors, such as migration rate limits and power consumption limits. Migration rate limits are tracked per core type, core ID and thread ID. Once a thread is assigned to a target (core type, core ID, thread ID), a timer is started and maintained by the binary translator. No other threads are migrated to the same target until the timer expires. As a result, a thread may migrate away from its current core before the timer expires, while the reverse is not true.
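
The migration rate limit can be sketched as a per-target timer. This is a minimal illustration, assuming a caller-supplied clock value; `MigrationRateLimiter`, `hold_time`, and the method names are hypothetical and do not come from the disclosure.

```python
from typing import Dict, Tuple

Target = Tuple[str, int, int]  # (core type, core ID, thread ID)


class MigrationRateLimiter:
    """Blocks migration INTO a target until that target's timer expires."""

    def __init__(self, hold_time: float) -> None:
        self.hold_time = hold_time
        self._expires: Dict[Target, float] = {}

    def can_migrate_to(self, target: Target, now: float) -> bool:
        # Targets that were never assigned have no pending timer.
        return now >= self._expires.get(target, float("-inf"))

    def record_assignment(self, target: Target, now: float) -> None:
        # Start (or restart) the timer when a thread lands on the target.
        self._expires[target] = now + self.hold_time


limiter = MigrationRateLimiter(hold_time=10.0)
slot = ("simd", 0, 1)
limiter.record_assignment(slot, now=0.0)
```

Note that the limiter only gates migration toward a target. A thread leaving its current core before the timer expires is not checked, which matches the asymmetry described above.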

詳細に説明されるように、より多くのコアタイプ（アクセラレータを含む）がコンピューティングシステム（オン又はオフダイのいずれか一方）に追加されるにつれて、電力消費の制限に対する注目が高まる可能性が高い。いくつかの実施形態において、すべてのコア上のすべての実行スレッドにより消費される瞬間的な電力が計算される。計算された電力消費が閾値を超える場合、新たなスレッドが、より低い電力のコア、例えば、ＳＩＭＤ、スカラ及び専用のアクセラレータコアに割り当てられるだけであり、１又は複数のスレッドは、ＯＯＯコアからより低い電力のコアに強制的に移行させられる。いくつかの実施例では、電力消費の制限は、移動率の制限よりも優先されることに留意する。 As will be described in more detail, as more core types (including accelerators) are added to a computing system (either on- or off-die), limiting power consumption is likely to become an increased focus. In some embodiments, the instantaneous power consumed by all execution threads on all cores is calculated. If the calculated power consumption exceeds a threshold, new threads are only assigned to lower power cores, e.g., SIMD, scalar, and dedicated accelerator cores, and one or more threads are forced to migrate from an OOO core to a lower power core. Note that in some embodiments, power consumption limits take precedence over migration rate limits.
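
The power limit can be sketched as a filter on which core types a new thread may be assigned to. The per-thread power figures, the threshold, and the names below are hypothetical; forced migration of threads off OOO cores is not modeled.

```python
from typing import Sequence, Tuple

LOW_POWER_TYPES: Tuple[str, ...] = ("simd", "scalar", "accelerator")


def allowed_core_types(thread_power_watts: Sequence[float],
                       threshold_watts: float) -> Tuple[str, ...]:
    """Return the core types a NEW thread may be assigned to."""
    total = sum(thread_power_watts)  # instantaneous power of all running threads
    if total > threshold_watts:
        # Over budget: only lower-power cores are eligible.
        return LOW_POWER_TYPES
    return ("ooo",) + LOW_POWER_TYPES
```

In embodiments where the power limit takes precedence over the migration rate limit, a result of this kind would be applied before any rate-limit timer is consulted.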

Figure 12 illustrates an exemplary method of core allocation for hot code to an accelerator. At 1201, a determination is made that code is "hot." A hot portion of code may refer to a portion of code that is better suited to execute on one core than on another based on considerations such as power, performance, heat, other known processor metrics, or a combination thereof. This determination may be made using any number of techniques. For example, a dynamic binary optimizer may be utilized to monitor the execution of the thread. Hot code may be detected based on counter values that record the dynamic execution frequency of static code, for example during program execution.

In an embodiment where one core is an OOO core and another core is an in-order core, a hot portion of code may refer to a hot spot of the program code that is better suited to be executed on a serial core, which potentially has more available resources for execution of a highly recurrent section. Often, a section of code with a highly recurrent pattern may be optimized to execute more efficiently on an in-order core. Essentially, in this example, cold code (low recurrence) is distributed to the native OOO core, while hot code (high recurrence) is distributed to the software-managed in-order core.

A hot portion of code may be identified statically, dynamically, or by a combination thereof. In the first case, a compiler or user may determine that a section of program code is hot code. In one embodiment, decode logic in a core is adapted to decode a hot code identifier instruction from the program code, which identifies the hot portion of the program code. The fetch or decode of such an instruction may trigger translation and/or execution of the hot section of code on the core. In another example, code execution is profiled execution, and a region of the program code may be identified as hot code based on the characteristics of the profile, that is, power and/or performance metrics associated with execution. Similar to the operation of hardware, monitoring code may be executed on one core to perform the monitoring/profiling of program code being executed on another core. Note that such monitoring code may be code held in a storage structure within the core or held in a system that includes the processor. For example, the monitoring code may be microcode, or other code, held in a storage structure of the core.

As yet another example, a static identification of hot code is made as a hint. Dynamic profiling of the program code execution is nevertheless able to ignore the static identification of a region of code as hot; this type of static identification is often referred to as a compiler or user hint that dynamic profiling may take into account in determining which core is appropriate for code distribution. Moreover, as is the nature of dynamic profiling, identifying a region of code as hot does not restrict that section of code to always being identified as hot. After translation and/or optimization, a translated version of the code section is executed.
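
Counter-based detection of hot code can be sketched as follows. The threshold value and the names `HotCodeDetector`, `record_execution`, and `is_hot` are assumptions for illustration; the description does not specify a threshold or how static hints are weighed.

```python
from collections import Counter


class HotCodeDetector:
    """Counts dynamic executions of static code regions (hypothetical sketch)."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.counts: Counter = Counter()

    def record_execution(self, region: str) -> None:
        self.counts[region] += 1

    def is_hot(self, region: str) -> bool:
        # A region is hot once its execution count reaches the threshold.
        # Resetting the counts later lets a region stop being hot.
        return self.counts[region] >= self.threshold


detector = HotCodeDetector(threshold=3)
for _ in range(3):
    detector.record_execution("loop_body")
detector.record_execution("init")
```

Here `"loop_body"` would be reported as hot and `"init"` as cold. A real profiler could also fold in power or performance metrics, as noted above.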

１２０３において、適切なアクセラレータが選択される。バイナリトランスレータ、仮想マシンモニタ又はオペレーティングシステムは、利用可能なアクセラレータ及び所望の性能に基づいてこの選択を行う。多くの例では、アクセラレータは、より大きくてより一般的なコアよりも１ワットあたりの向上した性能でホットコードを実行するのにより適切である。 At 1203, an appropriate accelerator is selected. The binary translator, virtual machine monitor, or operating system makes this selection based on the available accelerators and the desired performance. In many instances, an accelerator is better suited to run hot code with improved performance per watt than a larger, more general purpose core.

１２０５において、ホットコードは、選択されたアクセラレータに送信される。この送信は、本明細書で詳細に説明されるように、適切な接続タイプを利用する。 At 1205, the hot code is transmitted to the selected accelerator, utilizing the appropriate connection type, as described in more detail herein.

最後に、１２０７において、ホットコードは、選択されたアクセラレータにより受信されて実行される。実行の間、ホットコードは、異なるコアへの割り当てについて評価されてよい。 Finally, at 1207, the hot code is received and executed by the selected accelerator. During execution, the hot code may be evaluated for allocation to different cores.

図１３は、ページディレクトリベースレジスタイベントに対するウェイクアップ又は書き込みのための可能性があるコア割り当てについての例示的な方法を示す。例えば、これは、コードフラグメントのフェーズを判断することを示す。１３０１において、ウェイクアップイベント又はページディレクトリベースレジスタ（例えば、タスク切替）イベントのいずれか一方が検出される。例えば、ウェイクアップイベントは、停止されたスレッド又は待機状態終了により受信された割込みのために発生する。ページディレクトリベースレジスタへの書き込みは、直列フェーズの開始又は停止を示し得る。典型的には、この検出は、バイナリトランスレータを実行しているコア上で発生する。 Figure 13 illustrates an exemplary method for potential core allocation for wake-up or write to page directory base register events. For example, this may refer to determining the phase of a code fragment. At 1301, either a wake-up event or a page directory base register (e.g., task switch) event is detected. For example, a wake-up event occurs due to an interrupt received by a stopped thread or a wait state exit. A write to the page directory base register may indicate the start or stop of a serial phase. Typically, this detection occurs on a core running a binary translator.

１３０３において、ウェイクアップした又はタスク切替を経験したスレッドと同じページテーブルベースポインタを共有するコアの数がカウントされる。いくつかの実施例において、テーブルは、論理ＩＤを特定のヘテロジニアスコアにマッピングするために用いられる。テーブルは、論理ＩＤによりインデックス付けされる。テーブルの各エントリは、論理ＩＤが現在有効であるか又は停止されているかを示すフラグ、ＳＩＭＤ又はスカラコアのどちらを好むかを示すフラグ、ページテーブルベースアドレス（例えば、ＣＲ３）、論理ＩＤが現在マッピングされているコアのタイプを示す値、及び、移動率を制限するカウンタを含む。 At 1303, the number of cores that share the same page table base pointer as the thread that woke up or experienced a task switch is counted. In some embodiments, a table is used to map logical IDs to specific heterogeneous cores. The table is indexed by the logical ID. Each entry in the table includes a flag indicating whether the logical ID is currently active or suspended, a flag indicating whether a SIMD or scalar core is preferred, a page table base address (e.g., CR3), a value indicating the type of core the logical ID is currently mapped to, and a counter that limits the migration rate.

同じ処理に属するスレッドは、同じアドレス空間、ページテーブル及びページディレクトリベースレジスタ値を共有する。 Threads belonging to the same process share the same address space, page table, and page directory base register value.

１３０５において、カウントされたコアの数が１より大きいか否かに応じた判断が行われる。このカウントは、スレッドが、直列又は並列フェーズにあるか否かを判断する。カウントが１である場合、次に、イベントを経験しているスレッドは、直列フェーズ１３０７にある。その結果、直列フェーズスレッドは、同じコアグループ内のすべてのスレッドの中で一意的なページディレクトリベースレジスタ値を有するスレッドである。図１４は、直列フェーズスレッドの例を示す。図示されるように、処理は、１又は複数のスレッドを有し、各処理は独自に割り当てられたアドレスを有する。 At 1305, a determination is made as to whether the number of counted cores is greater than one. This count determines whether the thread is in serial or parallel phase. If the count is one, then the thread experiencing the event is in serial phase 1307. As a result, a serial phase thread is one that has a unique page directory base register value among all threads in the same core group. Figure 14 shows an example of a serial phase thread. As shown, a process may have one or more threads, with each process having its own assigned address.

１３１３又は１３１５において、イベントを経験しているスレッドがＯＯＯコアに割り当てられていない場合、それはＯＯＯコアに移行され、ＯＯＯコア上の既存のスレッドは、ＳＩＭＤ又はスカラコアに移行される。イベントを経験しているスレッドがＯＯＯコアに割り当てられている場合、それは、多くの状況でそこに留まる。 At 1313 or 1315, if the thread experiencing the event is not assigned to an OOO core, it is migrated to an OOO core and the existing thread on the OOO core is migrated to a SIMD or scalar core. If the thread experiencing the event is assigned to an OOO core, it will remain there in most circumstances.

カウントが１より大きい場合、次に、１３０９において、イベントを経験しているスレッドは、並列フェーズにあり、並列フェーズのタイプの判断が行われる。イベントを経験しているスレッドがデータ並列フェーズにあるときに、スレッドがＳＩＭＤコアに割り当てられていない場合、当該スレッドはＳＩＭＤコアに割り当てられ、そうでない場合、１３１３において、既にそこにあるならば、当該スレッドはＳＩＭＤコア上に維持される。 If the count is greater than one, then at 1309, the thread experiencing the event is in a parallel phase and a determination of the type of parallel phase is made. When the thread experiencing the event is in a data parallel phase, if the thread is not assigned to a SIMD core, the thread is assigned to a SIMD core, otherwise at 1313, the thread is kept on the SIMD core if it is already there.

イベントを経験しているスレッドが、スレッド並列フェーズにあるときに、スレッドがスカラコアに割り当てられていない場合、当該スレッドはスカラコアに割り当てられ、そうでない場合、既に１３１５においてそこにあるならば、当該スレッドはスカラコア上に維持される。 When the thread experiencing the event is in the thread parallel phase, if the thread is not assigned to a scalar core, the thread is assigned to a scalar core, otherwise the thread is kept on the scalar core if it is already there at 1315.

さらに、いくつかの実施例では、スレッドが実行中であることを示すフラグがスレッドの論理ＩＤに対して設定される。 Furthermore, in some embodiments, a flag is set on the logical ID of the thread to indicate that the thread is running.
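
The handling of a wake-up or page directory base register write (1301 through 1315) can be sketched with the table described at 1303. The sketch makes several assumptions that the description leaves open: only running entries are counted, the waking thread's running flag is set before counting, and the SIMD/scalar preference flag stands in for the type of parallel phase. All names are hypothetical, and displacing an existing thread from the OOO core is not modeled.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Entry:
    running: bool          # logical ID currently active vs. suspended
    prefers_simd: bool     # preference flag: SIMD (True) or scalar (False)
    page_table_base: int   # e.g., the CR3 value
    core_type: str         # "ooo", "simd", or "scalar"
    migration_counter: int = 0


def on_wake_or_cr3_write(table: Dict[int, Entry], logical_id: int) -> str:
    entry = table[logical_id]
    entry.running = True
    # 1303: count running threads sharing this page table base (includes itself).
    sharers = sum(
        1 for e in table.values()
        if e.running and e.page_table_base == entry.page_table_base
    )
    if sharers == 1:
        entry.core_type = "ooo"      # 1307: serial phase
    elif entry.prefers_simd:
        entry.core_type = "simd"     # 1313: data parallel phase
    else:
        entry.core_type = "scalar"   # 1315: thread parallel phase
    return entry.core_type


table = {
    0: Entry(False, False, 0x1000, "scalar"),
    1: Entry(True, True, 0x2000, "simd"),
    2: Entry(True, False, 0x2000, "scalar"),
    3: Entry(False, True, 0x2000, "scalar"),
}
```

Waking logical ID 0 finds no other thread with its page table base, so it is treated as serial; waking logical ID 3 finds sharers and follows its SIMD preference.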

Figure 15 illustrates an exemplary method for potential core allocation for a thread in response to a sleep command event. For example, this illustrates determining the phase of a code fragment. At 1501, a sleep event affecting the thread is detected. For example, a halt, wait entry and timeout, or pause command has occurred. Typically, this detection occurs on a core running a binary translator.

いくつかの実施形態では、１５０３において、スレッドが実行中であることを示すフラグがスレッドの論理ＩＤに対してクリアされる。 In some embodiments, at 1503, a flag indicating that the thread is running is cleared for the logical ID of the thread.

At 1505, the number of threads on cores that share the same page table base pointer as the sleeping thread is counted. In some embodiments, a table is used to map logical IDs to particular heterogeneous cores. The table is indexed by logical ID. Each entry of the table includes a flag indicating whether the logical ID is currently active or suspended, a flag indicating whether a SIMD or scalar core is preferred, the page table base address (e.g., CR3), a value indicating the type of core to which the logical ID is currently mapped, and a counter to limit the migration rate. The first running thread from the group (with any page table base pointer) is noted.

１５０７において、システム内のＯＯＯコアがアイドルであるか否かに応じた判断が行われる。アイドルのＯＯＯコアは、アクティブに実行しているＯＳのスレッドがない。 At 1507, a determination is made as to whether an OOO core in the system is idle. An idle OOO core has no OS threads actively running.

If the page table base pointer is shared by exactly one thread in the core group, then, at 1509, that sharing thread is moved from its SIMD or scalar core to the OOO core. If the page table base pointer is shared by more than one thread, then, at 1511, the first running thread of the group, noted earlier, is the thread that is migrated from a SIMD or scalar core to the OOO core to make room for the released thread (which executes in the first running thread's place).
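
The sleep handling of 1501 through 1511 can be sketched as below. Two points are assumptions rather than statements of the description: a migration to the OOO core is only performed when an OOO core is idle (the determination at 1507), and "first running thread" is taken to mean the lowest running logical ID. The names `Entry` and `on_sleep` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entry:
    running: bool
    page_table_base: int
    core_type: str  # "ooo", "simd", or "scalar"


def on_sleep(table: Dict[int, Entry], logical_id: int, ooo_idle: bool) -> Optional[int]:
    """Return the logical ID migrated to the OOO core, or None if nothing moves."""
    sleeper = table[logical_id]
    sleeper.running = False                                    # 1503
    running = [lid for lid in sorted(table) if table[lid].running]
    sharers = [lid for lid in running                          # 1505
               if table[lid].page_table_base == sleeper.page_table_base]
    first_running = running[0] if running else None            # noted at 1505
    if not ooo_idle or not sharers:                            # 1507
        return None
    # 1509: exactly one sharer moves; 1511: otherwise the noted first thread moves.
    chosen = sharers[0] if len(sharers) == 1 else first_running
    if table[chosen].core_type == "ooo":
        return None  # already on an OOO core, nothing to migrate
    table[chosen].core_type = "ooo"
    return chosen


one_sharer = {
    0: Entry(True, 0xA, "ooo"),
    1: Entry(True, 0xA, "scalar"),
    2: Entry(True, 0xB, "simd"),
}
many_sharers = {
    0: Entry(True, 0xA, "ooo"),
    1: Entry(True, 0xB, "simd"),
    2: Entry(True, 0xA, "scalar"),
    3: Entry(True, 0xA, "scalar"),
}
```

With `one_sharer`, putting logical ID 0 to sleep leaves a single thread on that page table base, which becomes serial and moves to the OOO core. With `many_sharers`, the noted first running thread of the group is the one moved.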

図１６は、フェーズ変更イベントに応じたスレッドのための可能性があるコア割り当てについての例示的な方法を示す。例えば、これは、コードフラグメントのフェーズを判断することを示す。１６０１において、潜在的なフェーズ変更イベントが検出される。典型的には、この検出は、バイナリトランスレータを実行するコアで発生する。 Figure 16 illustrates an example method for potential core allocation for threads in response to a phase change event. For example, this illustrates determining the phase of a code fragment. At 1601, a potential phase change event is detected. Typically, this detection occurs on a core running a binary translator.

At 1603, a determination is made as to whether the logical ID of the thread is active on a scalar core and SIMD instructions are present. If there are no such SIMD instructions, the thread continues to execute as normal. However, if SIMD instructions are present in the thread running on the scalar core, the thread is migrated to a SIMD core at 1605.

At 1607, a determination is made as to whether the logical ID of the thread is active on a SIMD core and no SIMD instructions are present. If there are SIMD instructions, the thread continues to execute as normal. However, if no SIMD instructions are present in the thread running on the SIMD core, the thread is migrated to a scalar core at 1609.
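
The two checks at 1603 and 1607 reduce to a small function. The sketch assumes the presence of SIMD instructions has already been detected; the function name and string labels are hypothetical.

```python
def on_phase_change(core_type: str, has_simd_instructions: bool) -> str:
    """Return the core type the thread should run on after a phase change event."""
    if core_type == "scalar" and has_simd_instructions:
        return "simd"      # 1605: scalar thread started using SIMD instructions
    if core_type == "simd" and not has_simd_instructions:
        return "scalar"    # 1609: SIMD thread no longer uses SIMD instructions
    return core_type       # otherwise continue executing as normal
```

Threads on other core types, such as an OOO core, are left where they are by this check.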

この説明を通じて述べたように、バイナリトランスレータからアクセス可能なアクセラレータは、より効率的な実行（よりエネルギーの効率的な実行を含む）を提供し得る。しかしながら、それぞれの潜在的に利用可能なアクセラレータに対してプログラムを作成することを可能にすることは、不可能ではないにしても、難しい課題であるかもしれない。 As noted throughout this discussion, accelerators accessible to a binary translator may provide more efficient execution (including more energy efficient execution). However, being able to write programs for each potentially available accelerator may be a difficult, if not impossible, task.

Described in detail herein are embodiments that use delineating instructions to explicitly mark the beginning and end of potential accelerator-based execution of a portion of a thread. If no accelerator is available, the code between the delineating instructions is executed without the use of an accelerator. In some embodiments, the code between these instructions may relax some semantics of the core on which it runs.

Figure 17 illustrates an example of code that delineates an acceleration region. The first instruction of this region is an acceleration begin (ABEGIN) instruction 1701. In some embodiments, the ABEGIN instruction gives permission to enter a relaxed (sub-)mode of execution with respect to non-accelerator cores. For example, the ABEGIN instruction in some embodiments allows a programmer or compiler to indicate, in fields of the instruction, which features of the sub-mode differ from the standard mode. Exemplary features include, but are not limited to, one or more of: ignoring self-modifying code (SMC), weakening memory consistency model restrictions (e.g., relaxing store ordering requirements), altering floating-point semantics, changing performance monitoring (perfmon), and altering architectural flag usage.

In some embodiments, SMC is a write to a memory location in a code segment that is currently cached in the processor, which causes the associated cache line (or lines) to be invalidated. If the write affects a prefetched instruction, the prefetch queue is invalidated. This latter check is based on the linear address of the instruction. A write to, or a snoop of, an instruction in a code segment, where the target instruction is already decoded and resident in the trace cache, invalidates the entire trace cache. SMC may be ignored by adjusting the SMC detection circuitry in the translation lookaside buffer. Memory consistency model restrictions may be changed, for example, by changing settings in one or more registers or tables (e.g., a memory type range register or page attribute table). When floating-point semantics are altered, the way in which floating-point execution circuitry performs a floating-point calculation is changed through the use of one or more control registers that control the behavior of that circuitry (e.g., by setting the floating-point unit (FPU) control word register). Floating-point semantics that may change include, but are not limited to, rounding mode, how exception masks and status flags are treated, flush-to-zero, denormal settings, and precision (e.g., single, double, and extended) control. Additionally, in some embodiments, the ABEGIN instruction allows for an explicit accelerator type preference, such that an accelerator of the preferred type is selected if it is available.

非アクセラレータコード１７０３は、ＡＢＥＧＩＮ命令１７０１に従う。このコードは、システムのプロセッサコアに対してネイティブである。最悪の場合、利用可能なアクセスレータがない、又は、ＡＢＥＧＩＮがサポートされていない場合、このコードがそのままコア上で実行されてしまう。しかしながら、いくつかの実施例において、サブモードがその実行のために用いられる。 Non-accelerator code 1703 follows the ABEGIN instruction 1701. This code is native to the processor cores of the system. In the worst case, if there are no accelerators available or ABEGIN is not supported, this code will simply execute on the core. However, in some embodiments, a submode is used for the execution.

With an acceleration end (AEND) instruction 1705, execution is gated on the processor core until the accelerator appears to have completed its execution. Effectively, the use of ABEGIN and AEND allows a programmer to opt in or out of using an accelerator and/or a relaxed mode of execution.

図１８は、ハードウェアプロセッサコアにおけるＡＢＥＧＩＮを用いた実行についての方法の実施形態を示す。１８０１において、スレッドのＡＢＥＧＩＮ命令がフェッチされる。前に述べたように、ＡＢＥＧＩＮ命令は、典型的には、異なる（サブ）モードの実行を定義するために用いられる１又は複数のフィールドを含む。 Figure 18 illustrates an embodiment of a method for execution using ABEGIN in a hardware processor core. At 1801, an ABEGIN instruction for a thread is fetched. As previously mentioned, the ABEGIN instruction typically includes one or more fields that are used to define different (sub)modes of execution.

１８０３において、フェッチされたＡＢＥＧＩＮ命令は、デコード回路を用いてデコードされる。いくつかの実施形態において、ＡＢＥＧＩＮ命令は、マイクロオペレーションにデコードされる。 At 1803, the fetched ABEGIN instruction is decoded using the decode circuitry. In some embodiments, the ABEGIN instruction is decoded into micro-ops.

１８０５において、デコードされたＡＢＥＧＩＮ命令は、ＡＢＥＧＩＮ命令に従うが、ＡＥＮＤ命令の前である命令に対する、（ＡＢＥＧＩＮ命令の１又は複数のフィールドにより明示的に規定されてよい）異なるモードへとスレッドが入るように実行回路により実行される。実行のこの異なるモードは、アクセラレータの可用性及び選択範囲に応じて、アクセラレータ上又は既存のコア上であってよい。いくつかの実施形態において、アクセラレータの選択は、ヘテロジニアススケジューラにより実行される。 At 1805, the decoded ABEGIN instruction is executed by the execution circuitry such that the thread enters a different mode (which may be explicitly specified by one or more fields of the ABEGIN instruction) for instructions that follow the ABEGIN instruction but precede the AEND instruction. This different mode of execution may be on an accelerator or on an existing core, depending on the availability and selection of the accelerator. In some embodiments, the accelerator selection is performed by a heterogeneous scheduler.

At 1807, subsequent non-AEND instructions are executed in the different mode of execution. When an accelerator is used for execution, the instructions may first be translated into a different instruction set by a binary translator.

図１９は、ハードウェアプロセッサコアにおけるＡＥＮＤを用いた実行についての方法の実施形態を示す。１９０１において、ＡＥＮＤ命令がフェッチされる。 Figure 19 illustrates an embodiment of a method for execution with AEND in a hardware processor core. At 1901, an AEND instruction is fetched.

１９０３において、フェッチされたＡＥＮＤ命令は、デコード回路を用いてデコードされる。いくつかの実施形態において、ＡＥＮＤは、マイクロオペレーションにデコードされる。 At 1903, the fetched AEND instruction is decoded using the decode circuitry. In some embodiments, the AEND is decoded into a micro-op.

１９０５において、デコードされたＡＥＮＤ命令は、実行回路により実行されて、前にＡＢＥＧＩＮ命令により設定された実行の異なるモードから戻る。実行のこの異なるモードは、アクセラレータの可用性及び選択範囲に応じて、アクセラレータ上又は既存のコア上であってよい。 At 1905, the decoded AEND instruction is executed by the execution circuitry to return from the different mode of execution previously established by the ABEGIN instruction. This different mode of execution may be on the accelerator or on the existing core, depending on the availability and selection of the accelerator.

At 1907, subsequent non-AEND instructions are executed in the original mode of execution. If an accelerator is used for execution, the instructions may first be translated by a binary translator into a different instruction set.

図１２４は、ＡＢＥＧＩＮ／ＡＥＮＤがサポートされていない場合の実行の例を示す。１２４０１において、ＡＢＥＧＩＮ命令がフェッチされる。１２４０３において、ＡＢＥＧＩＮがサポートされていないとの判断が行われる。例えば、ＣＰＵＩＤは、サポートがないことを示す。 Figure 124 shows an example of execution when ABEGIN/AEND is not supported. At 12401, the ABEGIN instruction is fetched. At 12403, a determination is made that ABEGIN is not supported. For example, CPUID indicates no support.

サポートがない場合、典型的には、１２４０５において、スレッドと関連付けられるコンテキストを変更しないオペレーション（ｎｏｐ）が実行される。１２４０７において、実行モードにおける変更がないので、サポートされていないＡＢＥＧＩＮに続く命令を通常通り実行する。 In the absence of support, typically an operation (nop) is performed at 12405 that does not change the context associated with the thread. At 12407, since there is no change in execution mode, the instruction following the unsupported ABEGIN is executed normally.
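
The entry, exit and fallback behavior described for FIGS. 18, 19 and 124 can be illustrated with a small software model. This is only a sketch under stated assumptions: the `Thread` class, the `run` function and the mode names "normal" and "relaxed" are hypothetical, and treating an unsupported AEND as a nop (like an unsupported ABEGIN) is an assumption rather than something stated above.

```python
from dataclasses import dataclass, field


@dataclass
class Thread:
    """Toy thread model; the field names are illustrative only."""
    mode: str = "normal"                       # current execution mode
    trace: list = field(default_factory=list)  # (instruction, mode) pairs


def run(thread, instructions, abegin_supported=True):
    """Execute a toy instruction list and record the mode each one ran in."""
    for insn in instructions:
        if insn == "ABEGIN":
            if abegin_supported:
                thread.mode = "relaxed"  # enter the different mode
            # Unsupported: acts as a nop, thread context is unchanged.
        elif insn == "AEND":
            if abegin_supported:
                thread.mode = "normal"   # return to the original mode
            # Unsupported: assumed to be a nop as well.
        else:
            thread.trace.append((insn, thread.mode))
    return thread.trace
```

With support present, only the instructions between ABEGIN and AEND run in the relaxed mode; without support, every instruction runs normally and the thread context is untouched.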

いくつかの実施形態では、ＡＢＥＧＩＮ／ＡＥＮＤの同等の利用法が少なくともパターンマッチングを用いて実現される。このパターンマッチングは、ハードウェア、ソフトウェア及び／又は両方に基づいてよい。図２０は、パターンマッチングを用いてＡＢＥＧＩＮ／ＡＥＮＤ等価を提供するシステムを示す。図示されるシステムは、メモリ２００５に格納された変換器２００１（例えば、バイナリトランスレータ、ＪＩＴなど）を含むスケジューラ２０１５（例えば、詳細に上述されたようなヘテロジニアススケジューラ）を含む。コア回路２００７は、スケジューラ２０１５を実行する。スケジューラ２０１５は、明示的なＡＢＥＧＩＮ／ＡＥＮＤ命令を有しても有していなくてもよいスレッド２０１９を受信する。 In some embodiments, the equivalent usage of ABEGIN/AEND is achieved using at least pattern matching. This pattern matching may be based on hardware, software and/or both. FIG. 20 illustrates a system that provides ABEGIN/AEND equivalence using pattern matching. The illustrated system includes a scheduler 2015 (e.g., a heterogeneous scheduler as described in detail above) that includes a translator 2001 (e.g., a binary translator, JIT, etc.) stored in a memory 2005. Core circuitry 2007 executes scheduler 2015. Scheduler 2015 receives threads 2019 that may or may not have explicit ABEGIN/AEND instructions.

The scheduler 2015 manages the software-based pattern matcher 2003, performs traps and context switches during offload, manages the user-space save area (described in detail later), and generates or translates to accelerator code 2011. The pattern matcher 2003 recognizes (predefined) code sequences, stored in memory, that are found in the received thread 2019 and that may benefit from accelerator usage and/or a relaxed execution state, but that are not delineated using ABEGIN/AEND. Typically, the patterns themselves are stored in the translator 2001, but they are at least accessible to the pattern matcher 2003. The selector 2019 functions as previously detailed.

スケジューラ２０１５は、パフォーマンスモニタリングの特徴を提供してもよい。例えば、コードが、完全なパターンマッチを有していない場合、スケジューラ２０１５は、コードが、より効率的である要件の緩和をさらに必要とし得ることを認識し、状況に応じて、スレッドと関連付けられる動作モードを調整する。動作モードの関係は、詳細に上述されている。 Scheduler 2015 may provide performance monitoring features. For example, if the code does not have a perfect pattern match, scheduler 2015 recognizes that the code may need further relaxation of requirements to be more efficient, and adjusts the operating mode associated with the thread accordingly. The relationship of the operating modes is described in detail above.

スケジューラ２０１５はまた、ＡＢＥＧＩＮ／ＡＥＮＤ領域内でコアを循環させること、アクティブにされ又はストールされるアクセラレータを循環させること、ＡＢＥＧＩＮ呼び出しをカウントすること、アクセラレータのキューイングを遅延させること（同期処理）、及び、メモリ／キャッシュ統計値のモニタリングのうちの１又は複数を実行する。いくつかの実施形態において、バイナリトランスレータ２００１は、ボトルネックを識別する際に有用であり得るアクセラレータコードを解釈するために用いられるアクセラレータに固有のコードを含む。アクセラレータは、この変換されたコードを実行する。 The scheduler 2015 also performs one or more of the following: cycling through cores in the ABEGIN/AEND region, cycling through activated or stalled accelerators, counting ABEGIN calls, delaying accelerator queuing (synchronization), and monitoring memory/cache statistics. In some embodiments, the binary translator 2001 includes accelerator-specific code that is used to interpret accelerator code that may be useful in identifying bottlenecks. The accelerator executes this translated code.

いくつかの実施形態において、コア回路２００７は、格納されたパターン２０１７を用いて、受信したスレッド２０１９内の（予め定義された）コードシーケンスを認識するハードウェアパターンマッチャ２００９を含む。典型的には、このパターンマッチャ２００９は、ソフトウェアパターンマッチャ２００３と比較して軽量であり、表現が簡単な領域（例えば、ｒｅｐｍｏｖｓ）を探す。認識されたコードシーケンスは、スケジューラ２０１５によるアクセラレータの使用のために変換されてよく、及び／又は、スレッド用の動作モードの緩和を結果的にもたらし得る。 In some embodiments, the core circuitry 2007 includes a hardware pattern matcher 2009 that uses stored patterns 2017 to recognize (predefined) code sequences in received threads 2019. Typically, this pattern matcher 2009 is lightweight compared to the software pattern matcher 2003 and looks for areas that are easy to express (e.g., rep movs). The recognized code sequences may be transformed for accelerator use by the scheduler 2015 and/or may result in a relaxed operating mode for the thread.

システムへ結合されているのは、アクセラレータコード２０１１を受信して実行する１又は複数のアクセラレータ２０１３である。 Coupled to the system are one or more accelerators 2013 that receive and execute accelerator code 2011.

FIG. 21 illustrates an embodiment of a method of execution for a thread that is not delineated for acceleration and is subjected to pattern recognition. The method is performed by a system that includes at least one type of pattern matcher.

いくつかの実施形態では、２１０１において、スレッドが実行される。典型的には、このスレッドは、非アクセラレータコア上で実行される。実行中のスレッドの命令は、パターンマッチャへと供給される。しかしながら、当該スレッドの命令は、任意の実行の前にパターンマッチャへと供給されてもよい。 In some embodiments, at 2101, a thread is executed. Typically, the thread is executed on a non-accelerator core. The instructions of the executing thread are provided to the pattern matcher. However, the instructions of the thread may be provided to the pattern matcher prior to any execution.

２１０３において、スレッド内のパターンが認識（検出）される。例えば、ソフトウェアベースのパターンマッチャ又はハードウェアパターンマッチャ回路は、利用可能なアクセラレータと通常関連付けられているパターンを見つける。 At 2103, patterns are recognized (detected) within the threads. For example, a software-based pattern matcher or a hardware pattern matcher circuit finds patterns that are typically associated with available accelerators.

２１０５において、認識されたパターンは、利用可能なアクセラレータのために変換される。例えば、バイナリトランスレータは、当該パターンをアクセラレータコードに変換する。 At 2105, the recognized patterns are translated for the available accelerators. For example, a binary translator converts the patterns into accelerator code.

変換されたコードは、実行のために、２１０７において利用可能なアクセラレータに転送される。 The translated code is forwarded to an available accelerator at 2107 for execution.
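
The recognize, translate and forward steps of FIG. 21 can be sketched in software as follows. This is a minimal illustration, not the patented matcher: the pattern table, the accelerator names `fma_accel` and `copy_accel`, and the string form of the "translated" code are all hypothetical placeholders.

```python
# Hypothetical pattern table: instruction sequence -> accelerator name.
PATTERNS = {
    ("load", "mul", "add", "store"): "fma_accel",
    ("rep movs",): "copy_accel",
}


def find_patterns(instructions):
    """Return (start, length, accelerator) for each non-overlapping match."""
    matches = []
    i = 0
    while i < len(instructions):
        for pattern, accel in PATTERNS.items():
            n = len(pattern)
            if tuple(instructions[i:i + n]) == pattern:
                matches.append((i, n, accel))
                i += n  # skip past the matched sequence
                break
        else:
            i += 1  # no pattern starts here; advance one instruction
    return matches


def translate(instructions, match):
    """Stand-in for binary translation: tag each op with its accelerator."""
    start, length, accel = match
    code = [f"{accel}:{op}" for op in instructions[start:start + length]]
    return {"accelerator": accel, "code": code}
```

A real matcher would operate on decoded instructions rather than strings, and the translator would emit code in the accelerator's own instruction set.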

FIG. 22 illustrates an embodiment of a method of execution for a thread that is not delineated for acceleration and is subjected to pattern recognition. The method is performed by a system that includes at least one type of pattern matcher, such as the system shown in FIG. 20.

いくつかの実施形態では、２２０１において、スレッドが実行される。典型的には、このスレッドは、非アクセラレータコア上で実行される。実行中のスレッドの命令は、パターンマッチャへと供給される。しかしながら、当該スレッドの命令は、任意の実行前にパターンマッチャへと供給されてもよい。 In some embodiments, at 2201, a thread is executed. Typically, the thread is executed on a non-accelerator core. The instructions of the executing thread are provided to a pattern matcher. However, the instructions of the thread may be provided to the pattern matcher prior to any execution.

２２０３において、スレッド内のパターンが認識（検出）される。例えば、ソフトウェアベースのパターンマッチャ又はハードウェアパターンマッチャ回路は、利用可能なアクセラレータと通常関連付けられているパターンを見つける。 At 2203, patterns are recognized (detected) within the threads. For example, a software-based pattern matcher or a hardware pattern matcher circuit finds patterns that are typically associated with available accelerators.

At 2205, the binary translator adjusts the operating mode associated with the thread to use relaxed requirements, based on the recognized pattern. For example, the binary translator applies the settings associated with the recognized pattern.

As detailed, in some embodiments, parallel regions of code are delimited by the ABEGIN and AEND instructions. Within an ABEGIN/AEND block, the independence of certain memory load and store operations is guaranteed. Other loads and stores allow for potential dependencies. This enables an implementation to parallelize a block with little or no checking for memory dependencies. In all cases, serial execution of the block is permitted, because the serial case is included among the possible ways to execute the block. The binary translator performs static dependency analysis to create instances of parallel execution and maps these instances to the hardware. The static dependency analysis may parallelize the iterations of an outer, middle or inner loop. The slicing is implementation dependent. An implementation of ABEGIN/AEND extracts parallelism at the granularity most appropriate for that implementation.

ＡＢＥＧＩＮ／ＡＥＮＤブロックは、ネステッドループについての複数のレベルを含んでよい。実装は、自由に、サポートされる並列実行の量を選択する、又は、直列実行に対してフォールバックする。ＡＢＥＧＩＮ／ＡＥＮＤは、ＳＩＭＤ命令よりもはるかに大きい領域にわたる並列性を提供する。特定のタイプのコードに関し、ＡＢＥＧＩＮ／ＡＥＮＤは、マルチスレッディングより効率的なハードウェア実装を可能にする。 ABEGIN/AEND blocks may contain multiple levels of nested loops. Implementations are free to choose the amount of parallel execution supported or fall back to serial execution. ABEGIN/AEND provide parallelism over a much larger domain than SIMD instructions. For certain types of code, ABEGIN/AEND allow more efficient hardware implementations than multithreading.

ＡＢＥＧＩＮ／ＡＥＮＤの使用を通じて、プログラマ及び／又はコンパイラは、並列化の基準が満たされていない場合、ＣＰＵコアによる従来の直列実行に対してフォールバックすることができる。従来のアウトオブオーダＣＰＵコア上で実行されている場合、ＡＢＥＧＩＮ／ＡＥＮＤは、緩和されたメモリオーダリングの結果として、エリア、及び、メモリオーダリングバッファ（ＭＯＢ）の所要電力を縮小する。 Through the use of ABEGIN/AEND, a programmer and/or compiler can fall back to traditional serial execution by a CPU core if the parallelism criteria are not met. When running on a traditional out-of-order CPU core, ABEGIN/AEND reduces the area and power requirements of the Memory Ordering Buffer (MOB) as a result of relaxed memory ordering.

ＡＢＥＧＩＮ／ＡＥＮＤブロック内では、プログラマは、メモリ依存性を規定する。図２３は、メモリ依存性の様々なタイプ２３０１、これらのセマンティクス２３０３、オーダリング要求２３０５及び使用事例２３０７を示す。さらに、いくつかのセマンティクスは、実装に応じて、ＡＢＥＧＩＮ／ＡＥＮＤブロック内の命令に適用される。例えば、いくつかの実施形態において、レジスタの依存性が許容されているが、レジスタに対する修正は、ＡＥＮＤを超えて持続していない。さらに、いくつかの実施形態において、ＡＢＥＧＩＮ／ＡＥＮＤブロックは、ＡＢＥＧＩＮ／ＡＥＮＤブロックへの／からの分岐がない状態で、ＡＢＥＧＩＮで入り、ＡＥＮＤで終了され（又はパターン認識に基づいて同様の状態に入る）なければならない。最後に、典型的には、命令ストリームは、修正できない。 Within the ABEGIN/AEND block, the programmer specifies memory dependencies. Figure 23 shows various types of memory dependencies 2301, their semantics 2303, ordering requirements 2305, and use cases 2307. Additionally, some semantics apply to the instructions within the ABEGIN/AEND block, depending on the implementation. For example, in some embodiments, register dependencies are allowed, but modifications to registers do not persist beyond the AEND. Additionally, in some embodiments, the ABEGIN/AEND block must be entered at ABEGIN and exited at AEND (or a similar state entered based on pattern recognition) with no branches to or from the ABEGIN/AEND block. Finally, typically the instruction stream cannot be modified.

いくつかの実施例において、ＡＢＥＧＩＮ命令は、メモリデータブロックに対するポインタを含むソースオペランドを含む。このデータメモリブロックは、ＡＢＥＧＩＮ／ＡＥＮＤブロック内のコードを処理するために、ランタイム及びコア回路により利用される多くの情報を含む。 In some embodiments, the ABEGIN instruction includes a source operand that includes a pointer to a memory data block. This data memory block contains much of the information used by the runtime and core circuitry to process the code within the ABEGIN/AEND block.

図２４は、ＡＢＥＧＩＮ命令により指し示されるメモリデータブロックの例を示す。図示されるように、実装に応じて、メモリデータブロックは、シーケンス番号２４０１用のフィールド、ブロッククラス２４０３用のフィールド、実装識別子２４０５用のフィールド、保存状態エリアサイズ２４０７用のフィールド及びローカルストレージエリアサイズ２４０９用のフィールドを含む。 Figure 24 shows an example of a memory data block pointed to by an ABEGIN instruction. As shown, depending on the implementation, the memory data block includes a field for a sequence number 2401, a field for a block class 2403, a field for an implementation identifier 2405, a field for a saved state area size 2407, and a field for a local storage area size 2409.

シーケンス番号２４０１は、割込み前に、プロセッサがどれだけの（並列）計算を経たかを示す。ソフトウェアは、ＡＢＥＧＩＮの実行前に、シーケンス番号２４０１をゼロに初期化する。ＡＢＥＧＩＮの実行は、ゼロ以外の値をシーケンス番号２４０１に書き込んで、実行の進み具合を追跡する。完了すると、ＡＥＮＤの実行は、ゼロを書き込んで、その次の使用のために、シーケンス番号２４０１を再度初期化する。 Sequence number 2401 indicates how much (parallel) computation the processor has gone through before an interrupt. Software initializes sequence number 2401 to zero before execution of ABEGIN. Execution of ABEGIN writes a non-zero value to sequence number 2401 to track its progress. Upon completion, execution of AEND writes zero to reinitialize sequence number 2401 for its next use.
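
The memory data block of FIG. 24 and the sequence-number protocol can be modeled as follows. This is a sketch under assumptions: the field types, the `abegin`, `retire` and `aend` helper names, and the rule that a non-zero sequence number at ABEGIN means "resuming" are illustrative choices, not details given above.

```python
from dataclasses import dataclass


@dataclass
class MemoryDataBlock:
    """Fields follow FIG. 24; the Python types are assumptions."""
    sequence_number: int = 0          # 2401: zero means "not in progress"
    block_class: str = ""             # 2403
    implementation_id: int = 0        # 2405
    save_state_area_size: int = 0     # 2407
    local_storage_area_size: int = 0  # 2409


def abegin(block):
    """Enter the block; return True if this resumes earlier progress."""
    resuming = block.sequence_number != 0
    if not resuming:
        block.sequence_number = 1     # ABEGIN writes a non-zero value
    return resuming


def retire(block):
    """Record that one more unit of work completed inside the block."""
    block.sequence_number += 1


def aend(block):
    """Leave the block and reinitialize the sequence number for reuse."""
    block.sequence_number = 0
```

Because software zeroes the field before the first ABEGIN and AEND zeroes it again on completion, a non-zero value can only be observed while a block is in flight, which is what makes it usable as a progress marker after an interrupt.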

予め定義されたブロッククラス識別子２４０３（すなわち、ＧＵＩＤ）は、予め定義されたＡＢＥＧＩＮ／ＡＥＮＤブロッククラスを規定する。例えば、ＤＭＵＬＡＤＤ及びＤＧＥＭＭは、ブロッククラスとして予め定義され得る。予め定義されたクラスを用いて、バイナリトランスレータは、ヘテロジニアスハードウェアに対するマッピング解析を実行するために、バイナリを解析する必要はない。代わりに、変換器（例えば、バイナリトランスレータ）は、入力値を単に取得することにより、このＡＢＥＧＩＮ／ＡＥＮＤクラスのために予め生成された変換を実行する。ＡＢＥＧＩＮ／ＡＥＮＤに同封されたコードは、単に、非特化型コア中でこのクラスを実行するために用いられるコードとしての機能を果たすに過ぎない。 The predefined block class identifier 2403 (i.e., GUID) defines the predefined ABEGIN/AEND block class. For example, DMULADD and DGEMM may be predefined as block classes. With predefined classes, the binary translator does not need to parse the binary to perform mapping analysis to heterogeneous hardware. Instead, the translator (e.g., binary translator) performs a pre-generated translation for this ABEGIN/AEND class by simply taking the input values. The code enclosed with ABEGIN/AEND simply serves as the code used to execute this class in the non-specialized core.
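
The role of a predefined block class can be sketched as a lookup keyed by the class GUID: when the class is known, a pre-generated routine is run on the inputs, and otherwise the code enclosed by ABEGIN/AEND is used. The GUID value, the `PRETRANSLATED` table and the `dispatch` function are hypothetical, and the element-wise multiply-add stands in for whatever DMULADD actually computes.

```python
import uuid

# Hypothetical GUID for a predefined DMULADD block class.
DMULADD = uuid.UUID(int=1)


def dmuladd_pretranslated(a, b, c):
    """Stand-in for a pre-generated translation: element-wise a*b + c."""
    return [x * y + z for x, y, z in zip(a, b, c)]


# Pre-generated translations, keyed by block class identifier.
PRETRANSLATED = {DMULADD: dmuladd_pretranslated}


def dispatch(block_class, enclosed_code, *inputs):
    """Use the pre-generated translation if the class is known; otherwise
    fall back to the code enclosed by ABEGIN/AEND."""
    routine = PRETRANSLATED.get(block_class)
    if routine is not None:
        return routine(*inputs)  # no analysis of the binary is needed
    return enclosed_code(*inputs)
```

The enclosed code is never inspected when the class is recognized, which mirrors the statement that the translator need not analyze the binary for predefined classes.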

The implementation ID field 2405 indicates the type of execution hardware being used. Execution of ABEGIN updates this field 2405 to indicate the type of heterogeneous hardware in use. This helps an implementation migrate ABEGIN/AEND code to a machine that has a different acceleration hardware type or no accelerator at all. The field allows the saved context to be converted, where possible, to match the target implementation. That is, when ABEGIN/AEND code is interrupted and migrated to a machine that does not have the same accelerator type, an emulator is used after the migration to execute the code until it exits at AEND. This field 2405 may also allow the system to dynamically reassign an ABEGIN/AEND block to different heterogeneous hardware within the same machine, even when the block is interrupted in the middle of its execution.

状態保存エリアフィールド２４０７は、実装に固有である状態保存エリアのサイズ及びフォーマットを示す。実装は、状態保存エリアの実装に固有の部分が、ＣＰＵＩＤにおいて特定されたある最大値を超えないことを保証する。典型的には、ＡＢＥＧＩＮ命令の実行は、ＡＢＥＧＩＮ／ＡＥＮＤブロック、関連するフラグ及び追加の実装に固有の状態内で修正される汎用及びパックドデータレジスタの状態保存エリアへの書き込みを引き起こす。並列実行を容易にするために、レジスタの複数のインスタンスが書き込まれてよい。 The state save area field 2407 indicates the size and format of the state save area, which is implementation specific. The implementation guarantees that the implementation specific portion of the state save area does not exceed some maximum value specified in CPUID. Typically, execution of the ABEGIN instruction causes a write to the state save area of general purpose and packed data registers that are modified within the ABEGIN/AEND block, associated flags, and additional implementation specific state. Multiple instances of registers may be written to facilitate parallel execution.

The local storage area 2409 is allocated as local storage. The amount of storage to reserve is typically specified as an immediate operand to ABEGIN. When the ABEGIN instruction executes, the address of the local storage 2409 is written to a particular register (e.g., R9). If a fault occurs, this register is instead made to point to the sequence number.

並列実行の各インスタンスは、一意的なローカルストレージエリア２４０９を受ける。アドレスは、並列実行のインスタンスごとに異なる。直列実行では、１つのストレージエリアが割り当てられる。ローカルストレージエリア２４０９は、アーキテクチャの汎用及びパックドデータレジスタを超えた一時的なストレージを提供する。ローカルストレージエリア２４０９は、ＡＢＥＧＩＮ／ＡＥＮＤブロックの外部にアクセスされるべきではない。 Each instance of parallel execution receives a unique local storage area 2409. The address is different for each instance of parallel execution. Serial execution is assigned one storage area. The local storage area 2409 provides temporary storage beyond the general purpose and packed data registers of the architecture. The local storage area 2409 should not be accessed outside the ABEGIN/AEND blocks.
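
The per-instance allocation can be illustrated with a short address calculation. This is a sketch only: the contiguous layout, the `assign_local_storage` name and its parameters are assumptions, since the text above requires only that each parallel instance receive a distinct area.

```python
def assign_local_storage(base_address, area_size, instance_count=1):
    """Return one distinct, non-overlapping area address per instance.

    Serial execution corresponds to instance_count == 1.
    """
    if area_size <= 0 or instance_count < 1:
        raise ValueError("area_size must be positive, instance_count >= 1")
    return [base_address + i * area_size for i in range(instance_count)]
```

Each returned address would be the value placed in the designated register (e.g., R9) for the corresponding instance.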

図２５は、ＡＢＥＧＩＮ／ＡＥＮＤセマンティクスを用いるように構成されるメモリ２５０３の例を示す。ＡＢＥＧＩＮ／ＡＥＮＤをサポートし、かつ、このメモリ２５０３を利用するハードウェア（例えば、本明細書で説明される様々な処理要素）は図示されていない。詳細に説明されるように、メモリ２５０３は、使用対象のレジスタ２５０１、フラグ２５０５及び実装に固有の情報２５１１のインジケーションを含む保存状態エリア２５０７を含む。さらに、並列実行インスタンス毎にローカルストレージ２５０９がメモリ２５０３に格納される。 Figure 25 shows an example of a memory 2503 configured to use ABEGIN/AEND semantics. Not shown is the hardware (e.g., various processing elements described herein) that supports ABEGIN/AEND and utilizes this memory 2503. As will be described in more detail, the memory 2503 includes a saved state area 2507 that contains indications of registers 2501 to be used, flags 2505, and implementation specific information 2511. Additionally, local storage 2509 is stored in the memory 2503 for each parallel execution instance.

FIG. 26 illustrates an example of a method of operation in a different mode of execution using ABEGIN/AEND. Typically, this method is performed by a combination of entities, such as a translator and execution circuitry. In some embodiments, the thread is translated before entering this mode.

２６０１において、実行の異なるモードは、例えば、実行の緩和モード（アクセラレータを用いる又は用いない）などに入る。通常、ＡＢＥＧＩＮ命令の実行からこのモードに入る。しかしながら、詳細に上述されるように、パターンマッチにより、このモードに入ることもあり得る。このモードに入ることは、シーケンス番号のリセットを含む。 At 2601, a different mode of execution is entered, such as a relaxed mode of execution (with or without an accelerator). Typically, this mode is entered through execution of an ABEGIN instruction. However, as described in more detail above, this mode may also be entered through a pattern match. Entering this mode includes resetting the sequence numbers.

２６０３において、保存状態エリアへの書き込みが行われる。例えば、修正される汎用及びパックドデータレジスタ、関連するフラグ、追加の実装に固有の情報が書き込まれる。このエリアは、ブロック内で何か不具合（例えば、割込み）があった場合の実行の再開又はロールバックを可能にする。 At 2603, a saved state area is written, e.g., general and packed data registers that are modified, associated flags, and additional implementation-specific information. This area allows execution to be resumed or rolled back if something goes wrong in the block (e.g., an interrupt).

２６０５において、並列実行インスタンス毎にローカルストレージエリアが予約される。詳細に上述されたように、このエリアのサイズは、状態保存エリアフィールドにより指示される。 At 2605, a local storage area is reserved for each parallel execution instance. The size of this area is indicated by the state save area field, as described in detail above.

２６０７において、ブロックの実行中、ブロックの進行具合が追跡される。例えば、命令が、実行に成功してリタイアされた場合、ブロックのシーケンス番号が更新される。 At 2607, the progress of the block is tracked during execution of the block. For example, if an instruction is retired after successful execution, the sequence number of the block is updated.

At 2609, a determination is made as to whether the AEND instruction has been reached (e.g., to determine whether the block has completed). If it has not, then at 2613 the local storage area is updated with the intermediate results. If possible, execution resumes from these results. In some instances, however, a rollback to before the ABEGIN/AEND block occurs at 2615. For example, if an exception or interrupt occurs during execution of the ABEGIN/AEND block, the instruction pointer points to the ABEGIN instruction and the R9 register points to the memory data block, which is updated with the intermediate results. Upon resumption, the state saved in the memory data block is used to resume at the correct point. Additionally, a page fault is raised if the initial portion of the memory data block, up to and including the state save area, is not present or not accessible. For loads and stores to the local storage area, page faults are reported in the usual manner, i.e., on the first access to a page that is not present or not accessible. In some instances, a non-accelerator processing element is used upon resumption.

At 2611, if the block completed successfully, the registers that were set aside are restored to their original state along with the flags. Only the memory state differs after the block.
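
The flow of FIG. 26 (save state, track progress, handle an interrupt, resume, restore registers) can be modeled in a few lines. This is a simplified sketch under assumptions: the dictionary layout of `block`, the `run_block` name, the `interrupt_at` parameter used to simulate an interrupt, and the two sample operations are all hypothetical.

```python
def run_block(regs, memory, block, body, interrupt_at=None):
    """Run the operations in `body`; return True once AEND is reached."""
    if block["sequence_number"] == 0:
        block["saved_state"] = dict(regs)        # 2603: write save area
        block["sequence_number"] = 1             # mode entered
    else:
        regs.clear()
        regs.update(block["local_storage"])      # resume from 2613 results
    for index in range(block["sequence_number"] - 1, len(body)):
        if index == interrupt_at:
            block["local_storage"] = dict(regs)  # 2613: keep intermediates
            return False                         # AEND not reached
        body[index](regs, memory)
        block["sequence_number"] = index + 2     # 2607: track progress
    regs.clear()
    regs.update(block["saved_state"])            # 2611: restore registers
    block["sequence_number"] = 0                 # ready for the next use
    return True


def double(regs, memory):
    """Sample operation: modify a register inside the block."""
    regs["r1"] *= 2


def store(regs, memory):
    """Sample operation: write a result to memory."""
    memory["out"] = regs["r1"] + 1
```

After an interrupted run followed by a resumed run, the registers match their values from before the block and only `memory` reflects the block's work, consistent with the statement that only the memory state differs afterward.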

FIG. 27 illustrates an example of a method of operation in a different mode of execution using ABEGIN/AEND. Typically, this method is performed by a combination of entities, such as a binary translator and execution circuitry.

２７０１において、実行の異なるモードは、例えば、実行の緩和モード（アクセラレータを用いる又は用いない）などに入る。通常、ＡＢＥＧＩＮ命令の実行からこのモードに入る。しかしながら、詳細に上述されたように、パターンマッチにより、このモードに入ることもあり得る。このモードに入ることは、シーケンス番号のリセットを含む。 At 2701, a different mode of execution is entered, such as a relaxed mode of execution (with or without an accelerator). Typically, this mode is entered through execution of an ABEGIN instruction. However, this mode may also be entered through a pattern match, as described in detail above. Entering this mode includes resetting the sequence numbers.

２７０３において、保存状態エリアへの書き込みが行われる。例えば、修正される汎用及びパックドデータレジスタ、関連するフラグ及び追加の実装に固有の情報が書き込まれる。このエリアは、ブロック内で何か不具合（例えば、割込み）があった場合の実行の再開又はロールバックを可能にする。 At 2703, a saved state area is written, e.g., general and packed data registers that are modified, associated flags, and additional implementation-specific information. This area allows execution to be resumed or rolled back if something goes wrong in the block (e.g., an interrupt).

２７０５において、並列実行インスタンス毎にローカルストレージエリアが予約される。詳細に上述されたように、このエリアのサイズは、状態保存エリアフィールドにより指示される。 At 2705, a local storage area is reserved for each parallel execution instance. The size of this area is indicated by the state save area field, as described in detail above.

At 2706, the code within the block is translated for execution.

At 2707, during execution of the translated block, the progress of the block is tracked. For example, when an instruction executes successfully and retires, the sequence number of the block is updated.

At 2709, a determination is made as to whether the AEND instruction has been reached (e.g., to determine whether the block has completed). If it has not, then at 2713 the local storage area is updated with the intermediate results. If possible, execution resumes from these results. In some instances, however, a rollback to before the ABEGIN/AEND block occurs at 2715. For example, if an exception or interrupt occurs during execution of the ABEGIN/AEND block, the instruction pointer points to the ABEGIN instruction and the R9 register points to the memory data block, which is updated with the intermediate results. Upon resumption, the state saved in the memory data block is used to resume at the correct point. Additionally, a page fault is raised if the initial portion of the memory data block, up to and including the state save area, is not present or not accessible. For loads and stores to the local storage area, page faults are reported in the usual manner, i.e., on the first access to a page that is not present or not accessible. In some instances, a non-accelerator processing element is used upon resumption.

If the block completed successfully, then at 2711 the registers that were set aside are restored to their original state along with the flags. Only the memory state differs after the block.

上述したように、いくつかの実施例では、（マルチプロトコル共通リンク（ＭＣＬ）を呼び出す）共通のリンクが、デバイス（例えば、図１及び図２において説明した処理要素）に到達するために用いられる。いくつかの実施形態において、これらのデバイスは、ＰＣＩエクスプレス（ＰＣＩｅ）デバイスとして見られる。このリンクは、リンク上で動的に多重化される３又はそれより多いプロトコルを有する。例えば、共通のリンクは、１）１又は複数の独自又は業界標準（例えば、ＰＣＩエクスプレス仕様又は同等の代替手段など）において規定され得るように、デバイス発見、デバイス構成、エラー報告、割込み、ＤＭＡスタイルのデータ転送及び様々なサービスを可能にする生産者／消費者、発見、構成、割込み（ＰＤＣＩ）プロトコル、２）デバイスが、コヒーレントな読み出し及び書き込み要求を処理要素に発行することを可能にするキャッシングエージェントコヒーレンス（ＣＡＣ）プロトコル、及び、３）処理要素が、別の処理要素のローカルメモリにアクセスすることを可能にするメモリアクセス（ＭＡ）プロトコルからなるプロトコルをサポートする。これらのプロトコルの特定の例では、（例えば、インテル（登録商標）オンチップシステムファブリック（ＩＯＳＦ）、インダイ相互接続（ＩＤＩ）、スケーラブルメモリ相互接続３＋（ＳＭＩ３＋））が提供される一方、本発明の基礎となる原理は、任意の特定のプロトコルのセットに限定されない。 As mentioned above, in some embodiments, a common link (called a Multi-Protocol Common Link (MCL)) is used to reach devices (e.g., the processing elements described in Figures 1 and 2). In some embodiments, these devices are seen as PCI Express (PCIe) devices. This link has three or more protocols that are dynamically multiplexed on the link. For example, the common link supports protocols consisting of: 1) a Producer/Consumer, Discover, Configure, Interrupt (PDCI) protocol that enables device discovery, device configuration, error reporting, interrupts, DMA-style data transfers, and various services as may be defined in one or more proprietary or industry standards (e.g., PCI Express specifications or equivalent alternatives); 2) a Caching Agent Coherence (CAC) protocol that allows devices to issue coherent read and write requests to processing elements; and 3) a Memory Access (MA) protocol that allows a processing element to access the local memory of another processing element. While specific examples of these protocols are provided (e.g., Intel® On-Chip System Fabric (IOSF), In-Die Interconnect (IDI), Scalable Memory Interconnect 3+ (SMI3+)), the principles underlying the invention are not limited to any particular set of protocols.
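
Dynamic multiplexing of the three protocols on one link can be pictured with a toy model in which every transfer carries a protocol tag. This is only an illustration: the round-robin arbitration, the tuple format and the `multiplex`/`demultiplex` names are assumptions, and the text above does not specify how the link actually arbitrates between protocols.

```python
from collections import deque

# Protocol names taken from the description above.
PROTOCOLS = ("PDCI", "CAC", "MA")


def multiplex(queues):
    """Interleave per-protocol payloads onto one link, round-robin."""
    pending = {p: deque(queues.get(p, [])) for p in PROTOCOLS}
    link = []
    while any(pending.values()):      # empty deques are falsy
        for p in PROTOCOLS:
            if pending[p]:
                link.append((p, pending[p].popleft()))
    return link


def demultiplex(link):
    """Split tagged link traffic back into per-protocol streams."""
    out = {p: [] for p in PROTOCOLS}
    for protocol, payload in link:
        out[protocol].append(payload)
    return out
```

The tag lets the receiving side route each transfer to the right protocol handler while preserving per-protocol ordering.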

FIG. 120 is a simplified block diagram 12000 illustrating an exemplary multi-chip configuration 12005 that includes two or more chips, or dies (e.g., 12010, 12015), communicatively connected using an exemplary multi-chip link (MCL) 12020. While FIG. 120 illustrates an example of two (or more) dies interconnected using an exemplary MCL 12020, it should be understood that the principles and features described herein regarding implementations of an MCL can be applied to any interconnect or link connecting a die (e.g., 12010) and other components, including connecting two or more dies (e.g., 12010, 12015), connecting a die (or chip) to another component off-die, connecting a die to another device or to a die off-package (e.g., 12005), connecting a die to a BGA package, and implementations of a patch on interposer (POINT), among other potential examples.

いくつかの例において、より大きなコンポーネント（例えば、ダイ１２０１０、１２０１５）は、それら自体を、例えば、システムオンチップ（ＳｏＣ）、マルチプロセッサチップ、又は、デバイス上、例として単一のダイ（例えば、１２０１０、１２０１５）上の、コア、アクセラレータなどのような複数のコンポーネント（１２０２６～１２０３０及び１２０４０～１２０４５）を含む他のコンポーネントなどのＩＣシステムであり得る。ＭＣＬ１２０２０は、潜在的に複数の別個のコンポーネント及びシステムから複雑かつ多様なシステムを構築すること対する柔軟性をもたらす。例として、ダイ１２０１０、１２０１５のそれぞれが製造されてよく、そうでなければ、２つの異なるエンティティにより提供され得る。さらに、ダイ及び他のコンポーネントは、それら自体が、デバイス（例えば、１２０１０、１２０１５のそれぞれ）内のコンポーネント（例えば、１２０２６～１２０３０及び１２０４０～１２０４５）の間の通信のためのインフラストラクチャを提供する相互接続又は他の通信ファブリック（例えば、１２０３１、１２０５０）を含むことができる。様々なコンポーネント及び相互接続（例えば、１２０３１、１２０５０）は、複数の異なるプロトコルをサポートする又は用いる。さらに、ダイ（例えば、１２０１０、１２０１５）の間の通信は、複数の異なるプロトコルを介したダイ上の様々なコンポーネント間のトランザクションを潜在的に含むことができる。 In some examples, the larger components (e.g., die 12010, 12015) may themselves be IC systems, such as, for example, a system on a chip (SoC), a multiprocessor chip, or other components including multiple components (12026-12030 and 12040-12045) such as cores, accelerators, etc. on a device, e.g., on a single die (e.g., 12010, 12015). MCL 12020 provides flexibility to build complex and diverse systems from potentially multiple separate components and systems. By way of example, each of die 12010, 12015 may be manufactured or otherwise provided by two different entities. Additionally, the dies and other components may themselves include interconnects or other communications fabrics (e.g., 12031, 12050) that provide an infrastructure for communications between components (e.g., 12026-12030 and 12040-12045) within a device (e.g., 12010, 12015, respectively). The various components and interconnects (e.g., 12031, 12050) may support or use multiple different protocols. Additionally, communications between dies (e.g., 12010, 12015) may potentially include transactions between various components on the die via multiple different protocols.

マルチチップリンク（ＭＣＬ）の実施形態は、複数のパッケージオプション、複数のＩ／Ｏプロトコル、並びに、信頼性・可用性・保守性（ＲＡＳ）機能をサポートする。さらに、物理層（ＰＨＹ）は、物理的な電気層及び論理層を含むことができ、最大で、いくつかの場合において約４５ｍｍを超えるチャネル長を含む、より長いチャネル長をサポートできる。いくつかの実施例では、例示的なＭＣＬは、８～１０Ｇｂ／ｓを超えるデータレートを含む、高データレートで動作できる。 Embodiments of the multi-chip link (MCL) support multiple package options, multiple I/O protocols, and reliability, availability, and serviceability (RAS) features. Additionally, the physical layer (PHY) can include physical electrical and logical layers and can support longer channel lengths, including channel lengths up to and in some cases greater than about 45 mm. In some implementations, the exemplary MCL can operate at high data rates, including data rates in excess of 8-10 Gb/s.

ＭＣＬの１つの例示的な実施例において、ＰＨＹ電気層は、従来のマルチチャネル相互接続解決手段（例えば、マルチチャネルＤＲＡＭＩ／Ｏ）を改善し、数ある潜在的な例の中でも特に、例として、調整された中間レール終端、低電力アクティブクロストーク除去、回路冗長、ビット毎のデューティサイクル訂正及びデスキュー、ライン符号化及び送信機等化を含む多数の機能により、例として、データレート及びチャネル構成を拡張する。 In one exemplary embodiment of the MCL, the PHY electrical layer improves upon traditional multi-channel interconnect solutions (e.g., multi-channel DRAM I/O) to, for example, extend data rates and channel configurations with numerous features including, for example, tuned mid-rail termination, low power active crosstalk cancellation, circuit redundancy, per-bit duty cycle correction and deskew, line coding and transmitter equalization, among other potential examples.

In one exemplary implementation of the MCL, the PHY logical layer is implemented to further assist (e.g., the electrical layer features) in extending the data rate and channel configuration, while also enabling the interconnect to route multiple protocols across the electrical layer. Such implementations provide and define a modular common physical layer that is protocol agnostic and architected to work with potentially any existing or future interconnect protocol.

Referring to FIG. 121, a simplified block diagram 12100 is shown representing at least a portion of a system that includes an exemplary implementation of a multi-chip link (MCL). The MCL may be implemented using physical electrical connections (e.g., wires implemented as lanes) connecting a first device 12105 (e.g., a first die including one or more subcomponents) with a second device 12110 (e.g., a second die including one or more other subcomponents). In the specific example shown in the high-level representation of diagram 12100, all signals (in channels 12115, 12120) may be unidirectional, and lanes may be provided so that the data signals have both upstream and downstream data transfer. While the block diagram 12100 of FIG. 121 refers to the first component 12105 as the upstream component, the second component 12110 as the downstream component, the physical lanes of the MCL used in sending data as the downstream channel 12115, and the lanes used for receiving data (from component 12110) as the upstream channel 12120, it should be understood that the MCL between devices 12105, 12110 can be used by each device to both send and receive data between the devices.

In one exemplary implementation, the MCL can provide a physical layer (PHY) including the electrical MCL PHY 12125a,b (or collectively 12125) and executable logic implementing the MCL logical PHY 12130a,b (or collectively 12130). The electrical, or physical, PHY 12125 provides the physical connection over which data is communicated between the devices 12105, 12110. Signal conditioning components and logic can be implemented in connection with the physical PHY 12125 to establish the high data rate and channel configuration capabilities of the link, which in some applications involve tightly clustered physical connections at lengths of approximately 45 mm or more. The logical PHY 12130 includes circuitry for facilitating clocking, link state management (e.g., for link layers 12135a, 12135b), and protocol multiplexing between the potentially multiple different protocols used for communications over the MCL.

In one exemplary implementation, the physical PHY 12125 includes, for each channel (e.g., 12115, 12120), a set of data lanes over which in-band data is sent. In this particular example, 50 data lanes are provided in each of the upstream and downstream channels 12115, 12120, although any other number of lanes may be used as permitted by layout and power constraints, desired applications, device constraints, and so on. Each channel may further include one or more dedicated lanes for a strobe, or clock, signal for the channel, one or more dedicated lanes for a valid signal for the channel, one or more dedicated lanes for a stream signal, and one or more dedicated lanes for a link state machine management or sideband signal. The physical PHY may further include a sideband link 12140, which in some examples may be a bidirectional, lower-frequency control signal link used to coordinate state transitions and other attributes of the MCL connecting the devices 12105, 12110, among other examples.

上述したように、ＭＣＬの実装を用いて、複数のプロトコルがサポートされている。実際には、複数の独立したトランザクション層１２１５０ａ、１２１５０ｂが、各デバイス１２１０５、１２１１０において提供され得る。例として、各デバイス１２１０５、１２１１０は、数ある中でも、ＰＣＩ、ＰＣＩｅ、ＣＡＣなど、２又はそれより多いプロトコルをサポート及び利用してよい。ＣＡＣは、コアと、ラストレベルキャッシュ（ＬＬＣ）と、メモリと、グラフィックスとＩ／Ｏコントローラとの間で通信するオンダイに用いられるコヒーレントなプロトコルである。イーサネット（登録商標）プロトコル、インフィニバンドプロトコル及び他のＰＣＩｅファブリックベースのプロトコルを含む他のプロトコルもサポートされ得る。論理ＰＨＹ及び物理ＰＨＹの組み合わせは、数ある例の中でも特に、１つのダイ上のＳｅｒＤｅｓＰＨＹ（ＰＣＩｅ、イーサネット（登録商標）、インフィニバンド又は他の高速ＳｅｒＤｅｓ）を、他のダイ上に実装されているその上位層に接続するダイ間相互接続として用いられることもできる。 As mentioned above, multiple protocols are supported using the MCL implementation. In fact, multiple independent transaction layers 12150a, 12150b may be provided in each device 12105, 12110. By way of example, each device 12105, 12110 may support and utilize two or more protocols, such as PCI, PCIe, CAC, among others. CAC is a coherent protocol used on-die to communicate between the cores, last level cache (LLC), memory, graphics and I/O controllers. Other protocols may also be supported, including Ethernet protocols, InfiniBand protocols and other PCIe fabric-based protocols. The combination of logical PHY and physical PHY may also be used as an inter-die interconnect, connecting a SerDes PHY (PCIe, Ethernet, InfiniBand or other high-speed SerDes) on one die to its upper layer implemented on another die, among other examples.

The logical PHY 12130 supports multiplexing between these multiple protocols on the MCL. For example, the dedicated stream lane can be used to assert an encoded stream signal that identifies which protocol applies to the data sent substantially concurrently on the data lanes of the channel. In addition, the logical PHY 12130 negotiates the various types of link state transitions that the various protocols may support or request. In some examples, LSM_SB signals sent over the channel's dedicated LSM_SB lane can be used, together with the sideband link 12140, to communicate and negotiate link state transitions between the devices 12105, 12110. Further, link training, error detection, skew detection, deskewing, and other functionality of traditional interconnects can be replaced or governed, in part, using the logical PHY 12130. For example, valid signals sent over one or more dedicated valid signal lanes in each channel can be used to signal link activity, detect skew and link errors, and realize other features, among other examples. In the particular example of FIG. 121, multiple valid lanes are provided per channel. For example, data lanes within a channel may be bundled or clustered (physically and/or logically), and a valid lane may be provided for each cluster. Further, multiple strobe lanes may be provided, in some cases, to supply a dedicated strobe signal for each cluster among the multiple data lane clusters in a channel, among other examples.

上述したように、論理ＰＨＹ１２１３０は、ＭＣＬにより接続されたデバイス間で送信されるリンク制御信号をネゴシエート及び管理する。いくつかの実施例において、論理ＰＨＹ１２１３０は、ＭＣＬを介してリンク層制御メッセージを送信（すなわち、インバンド）するリンク層パケット（ＬＬＰ）生成回路１２１６０を含む。そのようなメッセージは、数ある例の中でも特に、データがリンク層制御データなどのリンク層－リンク層間メッセージングであることを識別するストリームレーンを有する、チャネルのデータレーンを介して送信され得る。ＬＬＰモジュール１２１６０を用いてイネーブルにされたリンク層メッセージは、デバイス１２１０５、１２１１０のリンク層１２１３５ａ、１２１３５ｂ間のそれぞれの他のリンク層間の特徴の中でも特に、リンク層状態遷移、電源管理、ループバック、ディセーブル、再センタリングスクランブルについてのネゴシエーション及び動作を支援する。 As described above, the logical PHY 12130 negotiates and manages link control signals transmitted between devices connected by the MCL. In some embodiments, the logical PHY 12130 includes a link layer packet (LLP) generation circuit 12160 that transmits link layer control messages (i.e., in-band) over the MCL. Such messages may be transmitted over the data lanes of the channel, with the stream lanes identifying the data as link layer-to-link layer messaging, such as link layer control data, among other examples. Link layer messages enabled using the LLP module 12160 support the negotiation and operation of link layer state transitions, power management, loopback, disable, recentering and scrambling, among other link layer features between the link layers 12135a, 12135b of the devices 12105, 12110, respectively.

Referring to FIG. 122, a simplified block diagram 12200 is shown illustrating an example logical PHY of an example MCL. The physical PHY 12205 may connect to a die that includes the logical PHY 12210 and additional logic supporting the link layer of the MCL. In this example, the die may further include logic to support multiple different protocols on the MCL. For example, in FIG. 122, PCIe logic 12215 is provided together with CAC logic 12220, such that the dies can communicate using either PCIe or CAC over the same MCL connecting the two dies, among potentially many other examples, including examples where more than two protocols, or protocols other than PCIe and CAC, are supported over the MCL. The various protocols supported between the dies may offer varying levels of service and features.

The logical PHY 12210 may include link state machine management logic 12225 for negotiating link state transitions in connection with requests of the upper layer logic of the die (e.g., received over PCIe or CAC). In some implementations, the logical PHY 12210 may further include link testing and debug logic (e.g., 12230). As noted above, an exemplary MCL can support control signals that are sent between dies over the MCL to facilitate the protocol-agnostic, high-performance, and power-efficient features of the MCL (among other exemplary features). For example, the logical PHY 12210 can support the generation and sending, as well as the receiving and processing, of valid signals, stream signals, and LSM sideband signals in connection with the sending and receiving of data over the dedicated data lanes, as described in the examples above.

In some implementations, multiplexing (e.g., 12235) and demultiplexing (e.g., 12240) logic may be included in, or otherwise accessible to, the logical PHY 12210. For example, the multiplexing logic (e.g., 12235) can be used to identify data (e.g., embodied as packets, messages, etc.) that is to be sent out on the MCL. The multiplexing logic 12235 can identify the protocol governing the data and generate a stream signal encoded to identify that protocol. For example, in one exemplary implementation, the stream signal may be encoded as one byte of two hexadecimal symbols (e.g., CAC: FFh; PCIe: F0h; LLP: AAh; sideband: 55h; etc.) and may be sent during the same window (e.g., a byte time period window) as the data governed by the identified protocol. Similarly, the demultiplexing logic 12240 can be used to interpret incoming stream signals, decoding the stream signal to identify the protocol that applies to the data received concurrently with the stream signal on the data lanes. The demultiplexing logic 12240 can then apply (or ensure) protocol-specific link layer handling and cause the data to be handled by the corresponding protocol logic (e.g., PCIe logic 12215 or CAC logic 12220).
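The stream-lane multiplexing described above can be illustrated with a minimal Python sketch. The four byte values are the example encodings given in the text; the function names `mux` and `demux`, the handler table, and the tuple-based "window" are illustrative assumptions and do not describe the actual circuitry.

```python
from enum import Enum
from typing import Callable, Dict, Tuple


class StreamCode(Enum):
    """One-byte stream-lane encodings taken from the example in the text."""
    CAC = 0xFF
    PCIE = 0xF0
    LLP = 0xAA
    SIDEBAND = 0x55


def mux(protocol: StreamCode, payload: bytes) -> Tuple[int, bytes]:
    """Pair an outgoing data window with the stream code sent alongside it."""
    return protocol.value, payload


def demux(stream_byte: int, payload: bytes,
          handlers: Dict[StreamCode, Callable[[bytes], object]]) -> object:
    """Decode the stream byte and pass the payload to the matching handler."""
    try:
        protocol = StreamCode(stream_byte)
    except ValueError:
        # Not one of the defined encodings: treat it as a stream-lane error.
        raise ValueError(f"unrecognized stream code {stream_byte:#04x}") from None
    return handlers[protocol](payload)


# Hypothetical upper-layer handlers standing in for the PCIe and CAC logic.
handlers = {
    StreamCode.PCIE: lambda data: ("pcie", data),
    StreamCode.CAC: lambda data: ("cac", data),
}
code, window = mux(StreamCode.PCIE, b"\x12\x34")
result = demux(code, window, handlers)
```

In this sketch the receiver never inspects the payload to choose a protocol; the choice depends only on the stream byte that arrives in the same window, which is what keeps the physical layer protocol agnostic.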

The logical PHY 12210 may further include link layer packet logic 12250, which can be used to handle various link control functions, including power management tasks, loopback, disable, re-centering, scrambling, and so on. The LLP logic 12250 can facilitate link layer-to-link layer messages over the MCLP, among other functions. Data corresponding to LLP signaling may likewise be identified by a stream signal, sent on the dedicated stream signal lane, that is encoded to identify that the data lanes carry LLP data. The multiplexing and demultiplexing logic (e.g., 12235, 12240) can also be used to generate and interpret the stream signals corresponding to LLP traffic and to cause such traffic to be handled by the appropriate die logic (e.g., LLP logic 12250). Likewise, some implementations of the MCLP may include a dedicated sideband (e.g., sideband 12255 and supporting logic), such as an asynchronous and/or lower-frequency sideband channel, among other examples.

論理ＰＨＹ論理１２２１０は、専用のＬＳＭサイドバンドレーンを介してリンクステート管理メッセージングを生成及び受信（及び使用）できるリンクステートマシン管理論理をさらに含むことができる。例として、ＬＳＭサイドバンドレーンは、数ある潜在的な例の中でも特に、リンクトレーニング状態に進むためにハンドシェーキングを実行し、電源管理状態（例えば、Ｌ１状態）を終了するために用いられ得る。ＬＳＭサイドバンド信号は、数ある例の中でも特に、リンクのデータ信号、有効信号及びストリーム信号と整合していないが、代わりにシグナリング状態遷移に対応し、リンクにより接続された２つのダイ又はチップ間のリンクステートマシンを調整するという点で非同期信号であり得る。専用のＬＳＭサイドバンドレーンを提供することは、いくつかの例では、数ある例示的な利益の中でも特に、アナログフロントエンド（ＡＦＥ）の従来のスケルチ及び受信検出回路が除去されることを可能にし得る。 The logical PHY logic 12210 may further include link state machine management logic that can generate and receive (and use) link state management messaging via a dedicated LSM sideband lane. By way of example, the LSM sideband lane may be used to perform handshaking to proceed to a link training state and exit a power management state (e.g., an L1 state), among other potential examples. The LSM sideband signals may be asynchronous signals in that they are not aligned with the data, valid, and stream signals of the link, but instead correspond to signaling state transitions and coordinate link state machines between two dies or chips connected by the link, among other examples. Providing a dedicated LSM sideband lane may, in some examples, allow conventional squelch and receive detection circuitry in an analog front end (AFE) to be eliminated, among other exemplary benefits.

Referring to FIG. 123, a simplified block diagram 12300 is shown illustrating another representation of logic used to implement the MCL. For example, the logical PHY 12210 is provided with a defined logical PHY interface (LPIF) 12305 through which any one of a plurality of different protocols (e.g., PCIe, CAC, PDCI, MA, etc.) 12315, 12320, 12325 and signaling modes (e.g., sideband) can interface with the physical layer of the exemplary MCL. In some implementations, the multiplexing and arbitration logic 12330 may also be provided as a layer separate from the logical PHY 12210. In one example, the LPIF 12305 may be provided as the interface on either side of this MuxArb layer 12330. The logical PHY 12210 can interface with the physical PHY (e.g., the analog front end (AFE) 12205 of the MCL PHY) through another interface.

The LPIF can abstract the PHY (logical and electrical/analog) from the upper layers (e.g., 12315, 12320, 12325), such that a completely different PHY can be implemented under the LPIF transparently to the upper layers. This helps promote modularity and reuse in design, as the upper layers can stay intact when the underlying signaling technology PHY is updated, among other examples. Further, the LPIF can define a number of signals enabling multiplexing/demultiplexing, LSM management, error detection and handling, and other functionality of the logical PHY. By way of example, the following table summarizes at least some of the signals that may be defined for an exemplary LPIF:

テーブルで触れられているように、いくつかの実施例では、アライメントメカニズムが、ＡｌｉｇｎＲｅｑ／ＡｌｉｇｎＡｃｋハンドシェイクを通じて提供され得る。例えば、物理層は、リカバリに入る場合、いくつかのプロトコルは、パケットフレーミングを失うかもしれない。パケットのアライメントは、例として、リンク層による訂正フレーミング識別を保証するために、訂正され得る。物理層は、リカバリに入った場合、ＳｔａｌｌＲｅｑ信号をアサートでき、その結果、リンク層は、新たにアラインされたパケットを転送する準備ができた場合に、ストール信号をアサートする。物理層論理は、パケットがアラインされるか否かを判断するために、ストール及び有効の両方をサンプリングできる。例として、パケットアライメントを支援するために有効を使用する他の代替的な実装を含む、数ある潜在的な実装の中でも特に、ストール及び有効がサンプリングされてアサートされるまで、物理層はｔｒｄｙを駆動してリンク層パケットを排出することを継続できる。 As mentioned in the table, in some embodiments, an alignment mechanism can be provided through an AlignReq/AlignAck handshake. For example, some protocols may lose packet framing when the physical layer goes into recovery. The alignment of the packet can be corrected, for example, to ensure correct framing identification by the link layer. The physical layer can assert a StallReq signal when it goes into recovery, which causes the link layer to assert a stall signal when it is ready to forward the newly aligned packet. The physical layer logic can sample both stall and valid to determine if the packet is aligned. For example, the physical layer can continue to drive trdy to drain the link layer packet until stall and valid are sampled and asserted, among other potential implementations, including other alternative implementations that use valid to assist in packet alignment.

Various fault tolerances may be defined for signals on the MCL. For example, fault tolerances may be defined for valid, stream, LSM sideband, low-frequency sideband, link layer packet, and other types of signals. Fault tolerance for packets, messages, and other data sent over the dedicated data lanes of the MCL may be based on the particular protocol governing the data. In some implementations, error detection and handling mechanisms may be provided, such as cyclic redundancy checks (CRC) and retry buffers, among other potential examples. For example, for PCIe packets sent over the MCL, a 32-bit CRC may be used for PCIe transaction layer packets (TLPs) (with guaranteed delivery, e.g., through a replay mechanism), and a 16-bit CRC may be used for PCIe link layer packets (which may be architected to be lossy, e.g., where replay is not applied). Further, for PCIe framing tokens, a particular Hamming distance (e.g., a Hamming distance of four (4)) may be defined for the token identifier, and parity and a 4-bit CRC may also be used, among other examples. For CAC packets, on the other hand, a 16-bit CRC may be used.
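The pairing of a 32-bit CRC with a replay mechanism can be sketched as follows. This is only an illustration under stated assumptions: Python's `zlib.crc32` stands in for the link's 32-bit CRC (the real bit ordering and framing are not given in the text), and `frame`, `check`, and `ReplaySender` are hypothetical names.

```python
import zlib
from typing import Dict, Optional


def frame(payload: bytes) -> bytes:
    """Append a 32-bit CRC to the payload (zlib CRC-32 used as a stand-in)."""
    return payload + zlib.crc32(payload).to_bytes(4, "little")


def check(framed: bytes) -> Optional[bytes]:
    """Return the payload if the trailing CRC matches, otherwise None."""
    payload, received = framed[:-4], framed[-4:]
    if zlib.crc32(payload).to_bytes(4, "little") != received:
        return None
    return payload


class ReplaySender:
    """Keeps each sent frame in a retry buffer until it is acknowledged."""

    def __init__(self) -> None:
        self.retry_buffer: Dict[int, bytes] = {}

    def send(self, seq: int, payload: bytes) -> bytes:
        framed = frame(payload)
        self.retry_buffer[seq] = framed  # held for a possible replay
        return framed

    def ack(self, seq: int) -> None:
        self.retry_buffer.pop(seq, None)  # delivery confirmed, drop the copy

    def replay(self, seq: int) -> bytes:
        return self.retry_buffer[seq]
```

A receiver that gets `None` from `check` would request a replay of that sequence number, while a lossy packet class (such as the link layer packets mentioned above) would simply be dropped without a replay.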

In some implementations, fault tolerance is defined for link layer packets (LLPs) that use the valid signal transitioning from low to high (i.e., 0 to 1) (e.g., to help assure bit and symbol lock). Further, in one example, a particular number of consecutive, identical LLPs may be defined to be sent, and a response may be expected for each request, with the requester retrying after a response timeout, among other defined characteristics that can be used as the basis for determining faults in LLP data on the MCL. In a further example, fault tolerance may be provided for the valid signal, for instance, by extending the valid signal across an entire time period window, or symbol (e.g., by keeping the valid signal high for eight unit intervals (UIs)). Additionally, errors or faults in the stream signal may be prevented by maintaining a Hamming distance between the encoding values of the stream signal, among other examples.
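To make the Hamming-distance point concrete, the short Python sketch below measures the pairwise distances between the example stream encodings given earlier in the text (CAC: FFh; PCIe: F0h; LLP: AAh; sideband: 55h). The helper names are illustrative assumptions, and the exact-match `classify` function is only one possible receiver policy, not one stated in the source.

```python
from itertools import combinations
from typing import Iterable, Optional

# Example one-byte stream encodings from the text.
STREAM_CODES = {"CAC": 0xFF, "PCIe": 0xF0, "LLP": 0xAA, "Sideband": 0x55}


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two code words differ."""
    return bin(a ^ b).count("1")


def min_pairwise_distance(codes: Iterable[int]) -> int:
    """Smallest Hamming distance between any two distinct code words."""
    return min(hamming_distance(a, b) for a, b in combinations(codes, 2))


def classify(received: int) -> Optional[str]:
    """Return the protocol for an exact match, or None to flag an error."""
    for name, code in STREAM_CODES.items():
        if code == received:
            return name
    return None
```

For these four values the minimum pairwise distance is 4, so corruption of up to three bits in a stream byte cannot turn one valid encoding into another; under the exact-match policy sketched here, such corruption is detected and not misrouted.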

論理ＰＨＹの実装は、エラー検出、エラー報告及びエラー処理論理を含む。いくつかの実施例において、例示的なＭＣＬの論理ＰＨＹは、数ある例の中でも特に、（例えば、有効及びストリームレーン上の）ＰＨＹ層デフレーミングエラー、（例えば、ＬＳＭ状態遷移に関する）サイドバンドエラー、（例えば、ＬＳＭ状態遷移にとって重大な）ＬＬＰ内のエラーを検出する論理を含むことができる。いくつかのエラー検出／解決は、数ある例の中でも特に、ＰＣＩｅに固有のエラーを検出するのに適合するＰＣＩｅ論理などの上位層論理に委任され得る。 The logical PHY implementation includes error detection, error reporting, and error handling logic. In some embodiments, the logical PHY of an example MCL may include logic to detect PHY layer deframing errors (e.g., on valid and stream lanes), sideband errors (e.g., related to LSM state transitions), errors in the LLP (e.g., critical to LSM state transitions), among other examples. Some error detection/resolution may be delegated to higher layer logic, such as PCIe logic adapted to detect PCIe specific errors, among other examples.

In the case of a deframing error, in some implementations, one or more mechanisms may be provided through the error handling logic. Deframing errors may be handled based on the protocol involved. For example, in some implementations, the link layer may be informed of the error to trigger a retry. Deframing may also cause a realignment of the logical PHY deframing. Further, re-centering of the logical PHY may be performed and symbol/window lock may be reacquired, among other techniques. Centering, in some examples, can include the PHY moving the receiver clock phase to the optimal point for detecting the incoming data. "Optimal," in this context, can refer to the point with the most margin for noise and clock jitter. Re-centering can include simplified centering functions, performed, for example, when the PHY wakes up from a low power state, among other examples.

他のタイプのエラーは、他のエラー処理技術に関連し得る。例として、サイドバンドで検出されたエラーは、（例えば、ＬＳＭの）対応する状態のタイムアウトメカニズムを通じて捕まえられ得る。エラーは、ログに記録され得、次に、リンクステートマシンは、リセットに遷移され得る。ＬＳＭは、再開コマンドがソフトウェアから受信されるまで、リセット状態に維持することができる。別の例では、ＬＬＰエラー、例えば、リンク制御パケットエラーは、ＬＬＰシーケンスに対する確認応答が受信されなかった場合、ＬＬＰシーケンスを再開できるタイムアウトメカニズムを用いて処理され得る。 Other types of errors may be associated with other error handling techniques. As an example, an error detected in the sideband may be caught through a timeout mechanism of the corresponding state (e.g., in the LSM). The error may be logged and then the link state machine may be transitioned to reset. The LSM may be kept in the reset state until a resume command is received from the software. In another example, an LLP error, e.g., a link control packet error, may be handled using a timeout mechanism that allows the LLP sequence to be resumed if no acknowledgement for the LLP sequence is received.
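The two timeout-based flows described above can be sketched in Python as follows. The state names, the `max_attempts` bound, and the callback-based interface are assumptions made for illustration; the text does not specify how many times an LLP sequence is restarted or which state follows a software restart.

```python
from typing import Callable, List


class LinkStateMachine:
    """Toy LSM: a sideband timeout is logged, then the link is held in reset."""

    def __init__(self) -> None:
        self.state = "ACTIVE"  # hypothetical state name
        self.error_log: List[str] = []

    def on_sideband_timeout(self, detail: str) -> None:
        self.error_log.append(detail)  # the error is logged first
        self.state = "RESET"           # then the LSM transitions to reset

    def on_software_restart(self) -> None:
        # The LSM leaves reset only when software issues a restart command.
        if self.state == "RESET":
            self.state = "ACTIVE"


def send_llp_sequence(send: Callable[[], None],
                      ack_received: Callable[[], bool],
                      max_attempts: int = 3) -> int:
    """Restart the LLP sequence whenever no acknowledgment arrives in time.

    Returns the number of attempts used. max_attempts is an assumed bound.
    """
    for attempt in range(1, max_attempts + 1):
        send()
        if ack_received():  # stands in for waiting until the timeout expires
            return attempt
    raise TimeoutError("LLP sequence was never acknowledged")
```

Bounding the retries is a design choice of this sketch; an implementation could equally escalate to the sideband or reset flow after repeated LLP timeouts.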

いくつかの実施形態において、上記のプロトコルのそれぞれは、ＰＣＩｅの変形である。ＰＣＩｅデバイスは、バスと関連付けられた共通のアドレス空間を用いて通信する。このアドレス空間は、バスアドレス空間又はＰＣＩｅアドレス空間である。いくつかの実施形態において、ＰＣＩｅデバイスは、ＰＣＩｅアドレス空間とは異なり得る内部アドレス空間内のアドレスを用いる。 In some embodiments, each of the above protocols is a variation of PCIe. PCIe devices communicate using a common address space associated with a bus. This address space is the bus address space or the PCIe address space. In some embodiments, the PCIe devices use addresses in an internal address space, which may be different from the PCIe address space.

ＰＣＩｅ仕様は、ＰＣＩｅデバイスがそのローカルメモリ（又はその一部）をバスにさらし得るメカニズムを定義し、ひいては、ＣＰＵ、又は、そのメモリに直接アクセスするバスに取り付けられる他のデバイスをイネーブルにする。典型的には、各ＰＣＩｅデバイスは、ＰＣＩベースアドレスレジスタ（ＢＡＲ）と称されるＰＣＩｅアドレス空間内の専用の領域を割り当てられる。さらに、デバイスがさらすアドレスは、ＰＣＩＢＡＲ内のそれぞれのアドレスにマッピングされる。 The PCIe specification defines a mechanism by which a PCIe device can expose its local memory (or a portion of it) to the bus, thus enabling the CPU, or other devices attached to the bus, to directly access that memory. Typically, each PCIe device is assigned a dedicated region in the PCIe address space called the PCI Base Address Register (BAR). Furthermore, the addresses that the device exposes are mapped to respective addresses in the PCI BAR.
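As a rough illustration of this mapping, the Python sketch below models one exposed window as a linear offset between device-local addresses and bus addresses. The class name `BarWindow`, its fields, and the assumption of a simple linear mapping are hypothetical; actual devices may map their exposed memory differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BarWindow:
    """One region of PCIe address space through which device memory is exposed."""
    bus_base: int     # start of the region assigned in the PCIe address space
    device_base: int  # first device-local address exposed through the region
    size: int         # length of the exposed window in bytes

    def to_bus(self, device_addr: int) -> int:
        """Map a device-local address to its PCIe bus address."""
        offset = device_addr - self.device_base
        if not 0 <= offset < self.size:
            raise ValueError("address is not exposed through this window")
        return self.bus_base + offset

    def to_device(self, bus_addr: int) -> int:
        """Map a PCIe bus address back to the device-local address."""
        offset = bus_addr - self.bus_base
        if not 0 <= offset < self.size:
            raise ValueError("bus address falls outside this window")
        return self.device_base + offset


# Illustrative values only: a 4 KiB window of device memory.
window = BarWindow(bus_base=0xF000_0000, device_base=0x1000, size=0x1000)
bus_addr = window.to_bus(0x1800)
```

A CPU or peer device that targets `bus_addr` would, in this model, reach device-local address `0x1800`, which is the sense in which the exposed addresses are mapped into the assigned region.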

In some embodiments, a PCIe device (e.g., an HCA) uses an input/output memory management unit (IOMMU) to translate between its internal addresses and PCIe bus addresses. In other embodiments, the PCIe device may use the PCI Address Translation Service (ATS) to perform address translation and resolution. In some embodiments, a tag, such as a Process Address Space ID (PASID) tag, is used to specify that the addresses to be translated belong to the virtual address space of a particular process.

図２８は、一実施例に関する追加の詳細を示す。上記で説明される実施例に示すように、この実施例は、ホストメモリ２８６０を有するホストプロセッサ２８０２にマルチプロトコルリンク２８００を介して結合される、アクセラレータメモリ２８５０を有するアクセラレータ２８０１を含む。すでに述べたように、アクセラレータメモリ２８５０は、ホストメモリ２８６０とは異なるメモリ技術を利用してよい（例えば、アクセラレータメモリは、ＨＢＭ又はスタック型ＤＲＡＭであってよく、一方、ホストメモリは、ＳＤＲＡＭであってよい）。 Figure 28 shows additional details regarding one embodiment. As with the embodiments described above, this embodiment includes an accelerator 2801 having accelerator memory 2850 coupled via a multi-protocol link 2800 to a host processor 2802 having host memory 2860. As previously mentioned, the accelerator memory 2850 may utilize a different memory technology than the host memory 2860 (e.g., the accelerator memory may be HBM or stacked DRAM, while the host memory may be SDRAM).

Multiplexers 2811 and 2812 are shown to highlight the fact that the multi-protocol link 2800 is a dynamically multiplexed bus supporting PCDI, CAC, and MA protocol (e.g., SMI3+) traffic, each of which may be routed to different functional components within the accelerator 2801 and the host processor 2802. By way of example, and not limitation, these protocols may include IOSF, IDI, and SMI3+. In one implementation, the PCIe logic 2820 of the accelerator 2801 includes a local TLB 2822 for caching virtual-to-physical address translations for use by one or more accelerator cores 2830 when executing commands. As mentioned, the virtual memory space is distributed between the accelerator memory 2850 and the host memory 2860. Similarly, the PCIe logic on the host processor 2802 includes an I/O memory management unit (IOMMU) 2810 for managing the memory accesses of PCIe I/O devices 2806 and, in one implementation, of the accelerator 2801. As illustrated, the PCIe logic 2820 on the accelerator and the PCIe logic 2808 on the host processor communicate using the PCDI protocol to perform functions such as device discovery, register access, device configuration and initialization, interrupt processing, DMA operations, and address translation services (ATS). As mentioned, the IOMMU 2810 on the host processor 2802 may operate as the principal point of control and coordination for these functions.

一実施例において、アクセラレータコア２８３０は、アクセラレータにより必要とされる機能を実行する処理エンジン（要素）を含む。さらに、アクセラレータコア２８３０は、ホストメモリ２８６０に格納されているページをローカルにキャッシングするためのホストメモリキャッシュ２８３４と、アクセラレータメモリ２８５０に格納されているページをキャッシングするためのアクセラレータメモリキャッシュ２８３２とを含んでよい。一実施例において、アクセラレータコア２８３０は、アクセラレータ２８０１とホストプロセッサ２８０２との間で共有されるキャッシュラインがコヒーレントを維持することを確保するために、ＣＡＣプロトコルを介してホストプロセッサ２８０２のコヒーレンス及びキャッシュ論理２８０７と通信する。 In one embodiment, the accelerator core 2830 includes a processing engine (element) that performs the functions required by the accelerator. Additionally, the accelerator core 2830 may include a host memory cache 2834 for locally caching pages stored in the host memory 2860, and an accelerator memory cache 2832 for caching pages stored in the accelerator memory 2850. In one embodiment, the accelerator core 2830 communicates with the coherence and cache logic 2807 of the host processor 2802 via a CAC protocol to ensure that cache lines shared between the accelerator 2801 and the host processor 2802 remain coherent.

アクセラレータ２８０１のバイアス／コヒーレンス論理２８４０は、マルチプロトコルリンク２８００を介した不必要な通信を減らしつつ、データコヒーレンスを確保するために本明細書で説明される様々なデバイス／ホストバイアス技術（例えば、ページレベルの粒度）を実装する。図示されるように、バイアス／コヒーレンス論理２８４０は、ＭＡメモリトランザクション（例えば、ＳＭＩ３＋）を用いてホストプロセッサ２８０２のコヒーレンス及びキャッシュ論理２８０７と通信する。コヒーレンス及びキャッシュ論理２８０７は、そのＬＬＣ２８０９、ホストメモリ２８６０、アクセラレータメモリ２８５０及びキャッシュ２８３２、２８３４、及び、コア２８０５の個別のキャッシュのそれぞれに格納されるデータのコヒーレンシを維持することを担っている。 The bias/coherence logic 2840 of the accelerator 2801 implements various device/host bias techniques (e.g., page-level granularity) described herein to ensure data coherency while reducing unnecessary communication over the multi-protocol link 2800. As shown, the bias/coherence logic 2840 communicates with the coherence and cache logic 2807 of the host processor 2802 using MA memory transactions (e.g., SMI3+). The coherence and cache logic 2807 is responsible for maintaining coherency of data stored in its LLC 2809, host memory 2860, accelerator memory 2850 and caches 2832, 2834, and each of the individual caches of the core 2805.

要約すると、アクセラレータ２８０１の一実施例は、ホストプロセッサ２８０２で実行されるソフトウェアに対するＰＣＩｅデバイスとして現れ、（多重化されたバスに対してＰＣＩｅプロトコルを効果的に再フォーマット化する）ＰＤＣＩプロトコルによりアクセスされる。アクセラレータ２８０１は、アクセラレータデバイスＴＬＢ及び標準のＰＣＩｅアドレス変換サービス（ＡＴＳ）を用いて共有仮想メモリに参加してよい。アクセラレータはまた、コヒーレンス／メモリエージェントとして処理され得る。特定機能（例えば、以下で説明されるＥＮＱＣＭＤ、ＭＯＶＤＩＲ）は、（例えば、ワークサブミッションのために）ＰＤＣＩ上で利用可能であり、一方、アクセラレータは、ＣＡＣを用いて、アクセラレータで、及び、特定のバイアス遷移フローにおいて、ホストデータをキャッシュしてよい。ホストからアクセラレータメモリへのアクセス（又は、アクセラレータからのホストバイアスアクセス）は、説明されるように、ＭＡプロトコルを用いてよい。 In summary, one embodiment of the accelerator 2801 appears as a PCIe device to software running on the host processor 2802 and is accessed via the PDCI protocol (effectively reformatting the PCIe protocol for a multiplexed bus). The accelerator 2801 may participate in shared virtual memory using an accelerator device TLB and standard PCIe address translation services (ATS). The accelerator may also be treated as a coherence/memory agent. Certain functions (e.g., ENQCMD, MOVDIR, described below) are available on the PDCI (e.g., for work submission), while the accelerator may cache host data at the accelerator using the CAC and in certain bias transition flows. Accesses from the host to the accelerator memory (or host bias accesses from the accelerator) may use the MA protocol, as described.

As shown in FIG. 29, in one embodiment, the accelerator includes PCI configuration registers 2902 and MMIO registers 2906 that can be programmed to provide access to device back-end resources 2905. In one embodiment, the base addresses for the MMIO registers 2906 are specified by a set of base address registers (BARs) 2901 in the PCI configuration space. Unlike previous implementations, one embodiment of the Data Streaming Accelerator (DSA) described herein does not implement multiple channels or PCI functions, so there is only one instance of each register in the device. However, there may be more than one DSA device in a single platform.

Embodiments may provide additional performance or debug registers that are not described herein. Any such registers should be considered implementation specific.

ＰＣＩ構成空間アクセスは、アラインされた１バイトアクセス、２バイトアクセス又は４バイトアクセスとして実行される。ＰＣＩ構成空間において、未実装のレジスタ及び予約されたビットにアクセスする規則については、ＰＣＩエクスプレスベースの仕様を参照する。 PCI configuration space accesses are performed as aligned 1-byte, 2-byte, or 4-byte accesses. Refer to the PCI Express Base specification for rules on accessing unimplemented registers and reserved bits in PCI configuration space.

ＢＡＲ０領域（機能、構成及びステータスレジスタ）へのＭＭＩＯ空間アクセスは、アラインされた１バイトアクセス、２バイトアクセス、４バイトアクセス又は８バイトアクセスとして実行される。８バイトアクセスは、８バイトレジスタにのみ用いられるべきである。ソフトウェアは、未実装のレジスタを読み出し又は書き込みすべきではない。ＢＡＲ２及びＢＡＲ４領域へのＭＭＩＯ空間アクセスは、ＥＮＱＣＭＤ、ＥＮＱＣＭＤＳ又はＭＯＶＤＩＲ６４Ｂ命令（以下で詳細に説明される）を用いて、６４バイトアクセスとして実行されるべきである。ＥＮＱＣＭＤ又はＥＮＱＣＭＤＳは、共有されるように構成されるワークキュー（ＳＷＱ）にアクセスするために用いられるべきであり、ＭＯＶＤＩＲ６４Ｂは、専用として構成されるワークキュー（ＤＷＱ）にアクセスするために用いられなければならい。 MMIO space accesses to the BAR0 region (Capabilities, Configuration, and Status Registers) are performed as aligned 1-byte, 2-byte, 4-byte, or 8-byte accesses. 8-byte accesses should only be used for 8-byte registers. Software should not read or write unimplemented registers. MMIO space accesses to the BAR2 and BAR4 regions should be performed as 64-byte accesses using the ENQCMD, ENQCMDS, or MOVDIR64B instructions (described in more detail below). ENQCMD or ENQCMDS should be used to access work queues (SWQs) that are configured as shared, and MOVDIR64B must be used to access work queues (DWQs) that are configured as dedicated.

ＤＳＡＰＣＩ構成空間の一実施例は、３つの６４ビットＢＡＲ２９０１を実装する。デバイス制御レジスタ（ＢＡＲ０）は、デバイス制御レジスタの物理ベースアドレスを含む６４ビットＢＡＲである。これらのレジスタは、デバイス性能、デバイスを構成及びイネーブルにする制御、及び、デバイスステータスに関する情報を提供する。ＢＡＲ０領域のサイズは、割込みメッセージストレージ２９０４のサイズに依存する。サイズは、割込みメッセージストレージエントリ２９０４の数×１６を３２ＫＢに加えて、次の２のべき乗に切り上げられる。例えば、デバイスが１０２４個の割込みメッセージストレージエントリ２９０４をサポートする場合、割込みメッセージストレージは１６ＫＢであり、ＢＡＲ０のサイズは６４ＫＢである。 One embodiment of the DSA PCI configuration space implements three 64-bit BARs 2901. The Device Control Register (BAR0) is a 64-bit BAR that contains the physical base address of the device control registers. These registers provide information about device capabilities, controls to configure and enable the device, and device status. The size of the BAR0 area depends on the size of the interrupt message storage 2904. The size is the number of interrupt message storage entries 2904 times 16 plus 32KB rounded up to the next power of two. For example, if a device supports 1024 interrupt message storage entries 2904, then the interrupt message storage is 16KB and the size of BAR0 is 64KB.
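The BAR0 sizing rule above can be checked with a short calculation. The following is a minimal Python sketch, not part of the described embodiment. The function names are hypothetical, and the sketch assumes that a size that is already a power of two is left unchanged.

```python
KB = 1024

def round_up_pow2(n: int) -> int:
    """Return the smallest power of two that is >= n (assumes n > 0)."""
    return 1 << (n - 1).bit_length()

def bar0_size(ims_entries: int) -> int:
    """BAR0 size in bytes: 16 bytes per interrupt message storage entry
    plus 32KB, rounded up to a power of two."""
    return round_up_pow2(ims_entries * 16 + 32 * KB)

# 1024 entries -> 16KB of storage + 32KB = 48KB, rounded up to 64KB
print(bar0_size(1024) // KB)
```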

BAR2 is a 64-bit BAR that contains the physical base address of the privileged and non-privileged portals. Each portal is 64 bytes in size and is located on a separate 4KB page, which allows portals to be independently mapped into different address spaces using the CPU page tables. Portals are used to submit descriptors to the device. Privileged portals are used by kernel mode software, and non-privileged portals are used by user mode software. The number of non-privileged portals is the same as the number of work queues (WQs) supported. The number of privileged portals is the number of WQs × (MSI-X table size - 1). The address of the portal used to submit a descriptor allows the device to determine which WQ to place the descriptor in, whether the portal is privileged or non-privileged, and which MSI-X table entry may be used for the completion interrupt. For example, if a device supports 8 WQs, the WQ for a given descriptor is (portal address >> 12) & 0x7. If portal address >> 15 is 0, the portal is non-privileged. Otherwise, the portal is privileged, and the MSI-X table 2903 index used for the completion interrupt is portal address >> 15. Bits 5:0 must be 0. Bits 11:6 are ignored; thus, any 64-byte-aligned address on the page can be used with the same effect.
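The portal address decoding for the 8-WQ example can be sketched as follows. This is an illustrative Python model only: the `Portal` type and `decode_portal` function are hypothetical names, and the sketch assumes the "portal address" is the offset of the portal within BAR2.

```python
from typing import NamedTuple, Optional

class Portal(NamedTuple):
    wq: int                    # work queue that receives the descriptor
    privileged: bool           # True for a privileged portal
    msix_index: Optional[int]  # MSI-X table entry for the completion interrupt

def decode_portal(addr: int) -> Portal:
    """Decode a portal address for a device with 8 WQs.

    Assumption: addr is the offset of the portal within BAR2.
    """
    if addr & 0x3F:
        raise ValueError("bits 5:0 must be 0 (64-byte alignment)")
    wq = (addr >> 12) & 0x7   # bits 14:12 select the WQ; bits 11:6 are ignored
    index = addr >> 15        # 0 means the portal is non-privileged
    if index == 0:
        return Portal(wq, False, None)
    return Portal(wq, True, index)
```

Because bits 11:6 are ignored, `decode_portal(0x3000)` and `decode_portal(0x3FC0)` describe the same portal.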

Descriptor submissions using a non-privileged portal are subject to the WQ occupancy threshold, as configured using the Work Queue Configuration (WQCFG) register. Descriptor submissions using a privileged portal are not subject to the threshold. Descriptors must be submitted to an SWQ using ENQCMD or ENQCMDS; any other write operation to an SWQ portal is ignored. Descriptors must be submitted to a DWQ using a 64-byte write operation; software uses MOVDIR64B to guarantee an unbroken 64-byte write. An ENQCMD or ENQCMDS to a disabled or dedicated WQ portal returns Retry. Any other write operation to a DWQ portal is ignored. Any read operation to the BAR2 address space returns all 1s. Kernel mode descriptors should be submitted using privileged portals in order to receive completion interrupts. If a kernel mode descriptor is submitted using a non-privileged portal, no completion interrupt can be requested. User mode descriptors may be submitted using either a privileged or a non-privileged portal.
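The portal write rules in this paragraph can be summarized as a small decision function. The Python sketch below is a simplified model under stated assumptions: the state and result strings are hypothetical labels, "submitted" only means the write is treated as a descriptor submission (occupancy checks are not modeled), and the behavior of non-ENQCMD writes to a disabled WQ is not stated in the text, so the sketch reports it as "unspecified".

```python
def portal_write_outcome(wq_state: str, instruction: str) -> str:
    """Classify a write to a WQ portal.

    wq_state: 'shared', 'dedicated', or 'disabled' (hypothetical labels).
    instruction: 'ENQCMD', 'ENQCMDS', 'MOVDIR64B', or any other write.
    """
    is_enqueue = instruction in ("ENQCMD", "ENQCMDS")
    if wq_state == "shared":
        # Only ENQCMD/ENQCMDS may submit to an SWQ; other writes are ignored.
        return "submitted" if is_enqueue else "ignored"
    if wq_state == "dedicated":
        if is_enqueue:
            return "retry"  # ENQCMD/ENQCMDS to a dedicated WQ returns Retry
        # A DWQ expects a 64-byte write, which MOVDIR64B guarantees.
        return "submitted" if instruction == "MOVDIR64B" else "ignored"
    if wq_state == "disabled":
        # The text only defines the ENQCMD/ENQCMDS case for a disabled WQ.
        return "retry" if is_enqueue else "unspecified"
    raise ValueError(f"unknown WQ state: {wq_state}")
```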

The number of portals in the BAR2 region is the number of WQs supported by the device × the MSI-X table 2903 size. The MSI-X table size is typically the number of WQs plus 1. So, for example, if the device supports 8 WQs, the useful size of BAR2 is 8 × 9 × 4KB = 288KB. The total size of BAR2 is rounded up to the next power of two, that is, 512KB.
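As a worked check of the BAR2 arithmetic, the following Python sketch computes the portal counts and sizes for a given number of WQs and MSI-X table size. The function name and the dictionary keys are illustrative assumptions, not terms from the specification.

```python
KB = 1024

def bar2_layout(num_wqs: int, msix_table_size: int) -> dict:
    """Portal counts and BAR2 sizes, with one 4KB page per portal."""
    non_privileged = num_wqs
    privileged = num_wqs * (msix_table_size - 1)
    useful_bytes = (non_privileged + privileged) * 4 * KB
    total_bytes = 1 << (useful_bytes - 1).bit_length()  # round up to a power of two
    return {
        "non_privileged": non_privileged,
        "privileged": privileged,
        "useful_bytes": useful_bytes,
        "total_bytes": total_bytes,
    }

layout = bar2_layout(num_wqs=8, msix_table_size=9)
print(layout["useful_bytes"] // KB, layout["total_bytes"] // KB)
```

Note that the non-privileged and privileged portal counts sum to the number of WQs × the MSI-X table size, which agrees with the total stated above.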

ＢＡＲ４は、ゲストポータルの物理ベースアドレスを含む６４ビットＢＡＲである。各ゲストポータルは、６４バイトのサイズであり、別々の４ＫＢページに配置される。これは、ポータルがＣＰＵ拡張ページテーブル（ＥＰＴ）を用いて異なるアドレス空間に独立にマッピングされることを可能にする。ＧＥＮＣＡＰ内の割込みメッセージストレージサポートフィールドが０である場合、このＢＡＲは実装されていない。 BAR4 is a 64-bit BAR that contains the physical base address of the guest portal. Each guest portal is 64 bytes in size and is located on a separate 4KB page. This allows portals to be independently mapped into different address spaces using the CPU Extended Page Tables (EPT). If the Interrupt Message Storage Support field in GENCAP is 0, then this BAR is not implemented.

Guest portals may be used by guest kernel mode software to submit descriptors to the device. The number of guest portals is the number of entries in the interrupt message storage × the number of WQs supported. The address of the guest portal used to submit a descriptor allows the device to determine the WQ for the descriptor and also the interrupt message storage entry to use to generate a completion interrupt when the descriptor completes (if it is a kernel mode descriptor and the Request Completion Interrupt flag is set in the descriptor). For example, if a device supports 8 WQs, the WQ for a given descriptor is (guest portal address >> 12) & 0x7, and the interrupt table entry index used for the completion interrupt is guest portal address >> 15.

In one embodiment, MSI-X is the only PCIe interrupt capability that the DSA provides; the DSA does not implement legacy PCI interrupts or MSI. Details of this register structure are given in the PCI Express specification.

In one embodiment, three PCI Express capabilities control address translation. Only certain combinations of values for these capabilities may be supported, as shown in Table A. The values are checked when the enable bit in the general control register (GENCTRL) is set to 1.

If any of these capabilities is changed by software while the device is enabled, the device may halt, and an error is reported in the software error register.

In one embodiment, software configures the PASID capability to control whether the device uses PASID to perform address translation. If PASID is disabled, only physical addresses may be used. If PASID is enabled, virtual or physical addresses may be used, depending on the IOMMU configuration. If PASID is enabled, both address translation services (ATS) and page request services (PRS) should be enabled.

In one embodiment, software configures the ATS capability to control whether the device should translate addresses before performing memory accesses. If address translation is enabled in the IOMMU 2810, ATS must be enabled in the device to obtain acceptable system performance. If address translation is not enabled in the IOMMU 2810, ATS must be disabled. If ATS is disabled, only physical addresses may be used, and all memory accesses are performed using untranslated accesses. If PASID is enabled, ATS must be enabled.

In one embodiment, software configures the PRS capability to control whether the device can request a page when an address translation fails. If PASID is enabled, PRS must be enabled; if PASID is disabled, PRS must be disabled.
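The constraints stated in the preceding paragraphs can be expressed as a simple consistency check. The Python sketch below encodes only the rules given in the prose (it does not reproduce Table A, which is not shown here); the function name and the message strings are hypothetical.

```python
def check_translation_config(pasid: bool, ats: bool, prs: bool,
                             iommu_translation: bool) -> list:
    """Return a list of rule violations for a PASID/ATS/PRS setting.

    An empty list means the combination is consistent with the prose rules.
    """
    problems = []
    if pasid and not ats:
        problems.append("PASID enabled requires ATS enabled")
    if pasid != prs:
        problems.append("PRS must be enabled exactly when PASID is enabled")
    if ats != iommu_translation:
        problems.append("ATS must match the IOMMU address translation setting")
    return problems

# Fully enabled configuration with IOMMU translation on: no violations
print(check_translation_config(True, True, True, True))
```

A device model might run such a check at the point where the enable bit in GENCTRL is set, since that is when the text says the values are checked.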

いくつかの実施例では、１又は複数のプロセッサコア、アクセラレータデバイス及び／又は他のタイプの処理デバイス（例えば、Ｉ／Ｏデバイス）間でシームレスに共有される仮想メモリ空間を利用する。特に、一実施例では、同じ仮想メモリ空間がコア、アクセラレータデバイス及び／又は他の処理デバイス間で共有される共有仮想メモリ（ＳＶＭ）アーキテクチャを利用する。さらに、いくつかの実施例では、共通の仮想メモリ空間を用いてアドレス指定されるヘテロジニアス形式の物理システムメモリを含む。ヘテロジニアス形式の物理システムメモリは、ＤＳＡアーキテクチャと接続するために、異なる物理インタフェースを用いてよい。例えば、アクセラレータデバイスは、高帯域幅メモリ（ＨＢＭ）などのローカルアクセラレータメモリに直接結合されてよく、各コアは、ダイナミックランダムアクセスメモリ（ＤＲＡＭ）などのホスト物理メモリに直接結合されてよい。この例において、共有仮想メモリ（ＳＶＭ）は、アクセラレータ、プロセッサコア及び／又は他の処理デバイスが、仮想メモリアドレスの整合性セットを用いて、ＨＢＭ及びＤＲＡＭにアクセスできるように、ＨＢＭ及びＤＲＡＭの組み合わせられた物理メモリにマッピングされる。 Some embodiments utilize a virtual memory space that is seamlessly shared between one or more processor cores, accelerator devices, and/or other types of processing devices (e.g., I/O devices). In particular, one embodiment utilizes a shared virtual memory (SVM) architecture in which the same virtual memory space is shared between cores, accelerator devices, and/or other processing devices. Additionally, some embodiments include a heterogeneous form of physical system memory that is addressed using a common virtual memory space. The heterogeneous form of physical system memory may use different physical interfaces to interface with the DSA architecture. For example, the accelerator device may be directly coupled to a local accelerator memory, such as a high bandwidth memory (HBM), and each core may be directly coupled to a host physical memory, such as a dynamic random access memory (DRAM). In this example, the shared virtual memory (SVM) is mapped to the combined physical memory of the HBM and DRAM such that the accelerator, processor cores, and/or other processing devices can access the HBM and DRAM using a consistent set of virtual memory addresses.

これら及び他の特徴のアクセラレータは、以下で詳細に説明される。概要の目的で、異なる実装は、以下のインフラストラクチャ機能のうちの１又は複数を含んでよい。 These and other features of the accelerator are described in detail below. By way of overview, different implementations may include one or more of the following infrastructure features:

共有仮想メモリ（ＳＶＭ）：いくつかの実施例では、ユーザレベルアプリケーションが、記述子内の仮想アドレスを用いて直接的にＤＳＡにコマンドをサブミットすることを可能にするＳＶＭをサポートする。ＤＳＡは、ページフォールトの処理を含む入力／出力メモリ管理ユニット（ＩＯＭＭＵ）を用いて、仮想アドレスを物理アドレスに変換することをサポートしてよい。記述子により参照される仮想アドレス範囲は、複数のヘテロジニアスメモリタイプにわたって分散された複数のページにまたがってよい。さらに、一実施例ではまた、データバッファが物理メモリ内で連続的である限り、物理アドレスの使用をサポートする。 Shared Virtual Memory (SVM): Some embodiments support SVM, which allows user-level applications to submit commands directly to the DSA using virtual addresses in the descriptors. The DSA may support translating virtual addresses to physical addresses using an Input/Output Memory Management Unit (IOMMU), including handling page faults. The virtual address range referenced by a descriptor may span multiple pages distributed across multiple heterogeneous memory types. Additionally, one embodiment also supports the use of physical addresses as long as the data buffer is contiguous in physical memory.

部分的な記述子完成：ＳＶＭサポートを用いて、動作は、アドレス変換中に、ページフォールトに遭遇する可能性がある。いくつかのケースでは、デバイスは、障害に遭遇した時点で、対応する記述子の処理を終了し、部分的な完了及び障害情報を示す完了記録をソフトウェアに提供して、ソフトウェアが、改善策を講じて、障害の解決後に動作をリトライすることを可能にしてよい。 Partial Descriptor Completion: With SVM support, an operation may encounter a page fault during address translation. In some cases, the device may terminate processing of the corresponding descriptor upon encountering the fault and provide a completion record to the software indicating partial completion and fault information, allowing the software to take remedial action and retry the operation after the fault is resolved.

Batch Processing: Some embodiments support submitting descriptors in a "batch". A batch descriptor points to a set of substantially contiguous work descriptors (i.e., descriptors that specify actual data operations). When processing a batch descriptor, the DSA fetches the work descriptors from the specified memory and processes them.

Stateless Device: Descriptors in one embodiment are designed so that all information needed to process a descriptor comes in the descriptor payload itself. This allows the device to store little client-specific state, which improves its scalability. One exception is the completion interrupt message, which, when used, is configured by trusted software.

キャッシュ割り当て制御：これは、アプリケーションが、キャッシュに書き込むか、キャッシュをバイパスしてメモリに直接的に書き込むかを規定することを可能にする。一実施例において、完了記録は、常にキャッシュに書き込まれる。 Cache Allocation Control: This allows the application to specify whether to write to the cache or to bypass the cache and write directly to memory. In one embodiment, completion records are always written to the cache.

Shared Work Queue (SWQ) Support: As described in more detail below, some embodiments support scalable work submission through shared work queues (SWQs) using the Enqueue Command (ENQCMD) and Enqueue Command Supervisor (ENQCMDS) instructions. In this embodiment, an SWQ is shared by multiple applications.

専用のワークキュー（ＤＷＱ）サポート：いくつかの実施例では、ＭＯＶＤＩＲ６４Ｂ命令を用いた、専用のワークキュー（ＤＷＱ）を通じた高スループットワークサブミッションに対するサポートがある。この実施例では、ＤＷＱは、ある特定のアプリケーションに専用のものである。 Dedicated Work Queue (DWQ) Support: In some embodiments, there is support for high throughput work submission through a Dedicated Work Queue (DWQ) using the MOVDIR64B instruction. In this embodiment, the DWQ is dedicated to a particular application.

ＱｏＳサポート：いくつかの実施例では、サービス品質（ＱｏＳ）レベルが、（例えば、カーネルドライバにより）ワークキューごとに特定されることを可能にする。次に、異なるワークキューを異なるアプリケーションに割り当ててよく、異なるアプリケーションからのワーク（ｗｏｒｋ）が、異なる優先度を用いてワークキューからディスパッチされることを可能にする。ワークキューは、ファブリックＱｏＳに対して特定のチャネルを用いるためにプログラミングされ得る。 QoS Support: Some embodiments allow a quality of service (QoS) level to be specified per work queue (e.g., by a kernel driver). Different work queues may then be assigned to different applications, allowing work from different applications to be dispatched from the work queue with different priorities. Work queues can be programmed to use specific channels for fabric QoS.
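The partial descriptor completion feature listed above implies a software retry loop: resolve the reported fault, then resubmit the remainder of the operation. The Python sketch below illustrates that loop against a toy device model. Everything in it is a hypothetical illustration: the `Completion` fields, the `FakeDevice` class, and the assumption that only source pages can fault are not taken from the specification.

```python
from dataclasses import dataclass
from typing import Optional

PAGE = 4096

@dataclass
class Completion:
    done: bool                           # True if the whole request finished
    bytes_completed: int                 # progress made before stopping
    fault_address: Optional[int] = None  # faulting address on partial completion

class FakeDevice:
    """Toy device: stops at the first source page that is not resident."""
    def __init__(self, resident_pages):
        self.resident = set(resident_pages)

    def submit(self, src: int, dst: int, size: int) -> Completion:
        done = 0
        while done < size:
            addr = src + done
            if addr // PAGE not in self.resident:
                return Completion(False, done, addr)
            done += min(size - done, PAGE - addr % PAGE)
        return Completion(True, done)

    def resolve_fault(self, addr: int) -> None:
        self.resident.add(addr // PAGE)  # stands in for the OS paging the page in

def run_with_retry(dev: FakeDevice, src: int, dst: int, size: int,
                   max_attempts: int = 16) -> int:
    """Submit, fix the reported fault, and resubmit the rest. Returns attempts used."""
    offset = 0
    for attempt in range(1, max_attempts + 1):
        rec = dev.submit(src + offset, dst + offset, size - offset)
        offset += rec.bytes_completed
        if rec.done:
            return attempt
        dev.resolve_fault(rec.fault_address)
    raise RuntimeError("operation did not complete")
```

In this model each resubmission starts where the previous attempt stopped, so work already reported as complete is not repeated.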

バイアスキャッシュコヒーレンスメカニズム Biased cache coherence mechanism

In one embodiment, directly attached memory, such as stacked DRAM or HBM, is used to improve accelerator performance, and application development is simplified for applications that make use of accelerators with directly attached memory. This embodiment allows accelerator-attached memory to be mapped as part of system memory and accessed using shared virtual memory (SVM) techniques (e.g., as used in current IOMMU implementations), but without incurring the typical performance drawbacks associated with full system cache coherence.

The ability to access accelerator-attached memory as part of system memory without onerous cache coherence overhead provides a beneficial operating environment for accelerator offload. The ability to access memory as part of the system address map allows host software to set up operands and access computation results without the overhead of traditional I/O DMA data copies. Such traditional copies involve driver calls, interrupts, and memory-mapped I/O (MMIO) accesses, all of which are inefficient compared to simple memory accesses. At the same time, the ability to access accelerator-attached memory without cache coherence overhead can be critical to the execution time of an offloaded computation. In cases with substantial streaming write memory traffic, for example, cache coherence overhead can cut the effective write bandwidth seen by the accelerator in half. The efficiency of operand setup, the efficiency of result accesses, and the efficiency of the accelerator computation all play a role in determining how well accelerator offload works. If the cost of offloading work (e.g., setting up operands and getting results) is too high, offloading may not pay off at all, or may limit the accelerator to only very large jobs. The efficiency with which the accelerator executes a computation can have the same effect.

In one embodiment, different memory access and coherence techniques are applied depending on the entity initiating the memory access (e.g., the accelerator, a core, etc.) and the memory being accessed (e.g., host memory or accelerator memory). These techniques are referred to generally as a "coherence bias" mechanism, which provides two sets of cache coherence flows for accelerator-attached memory: one optimized for efficient accelerator access to its attached memory, and a second optimized for host access to accelerator-attached memory and for shared accelerator/host access to accelerator-attached memory. The mechanism further includes two techniques for switching between these flows, one driven by application software and the other driven by independent hardware hints. In both sets of coherence flows, hardware maintains full cache coherence.

As shown generally in FIG. 30, one embodiment applies to a computer system that includes an accelerator 3001 and one or more computer processor chips having processor cores and I/O circuitry 3003, where the accelerator 3001 is coupled to the processor over a multi-protocol link 2800. In one embodiment, the multi-protocol link 3010 is a dynamically multiplexed link supporting a plurality of different protocols, including but not limited to those detailed above. It should be noted, however, that the underlying principles of the invention are not limited to any particular set of protocols. It should also be noted that the accelerator 3001 and the core I/O 3003 may be integrated on the same semiconductor chip or on different semiconductor chips, depending on the implementation.

例示された実施例では、アクセラレータメモリバス３０１２は、アクセラレータ３００１をアクセラレータメモリ３００５に結合し、別々のホストメモリバス３０１１は、コアＩ／Ｏ３００３をホストメモリ３００７に結合する。すでに述べたように、アクセラレータメモリ３００５は、高帯域幅メモリ（ＨＢＭ）又はスタック型ＤＲＡＭ（これらのいくつかの例は、本明細書で説明される）を有してよく、ホストメモリ３００７は、ＤＲＡＭ、例えば、ダブルデータレート・シンクロナスダイナミックランダムアクセスメモリ（例えば、ＤＤＲ３ＳＤＲＡＭ、ＤＤＲ４ＳＤＲＡＭなど）を有してよい。しかしながら、本発明の基礎となる原理は、任意の特定のタイプのメモリ又はメモリプロトコルに限定されない。 In the illustrated embodiment, an accelerator memory bus 3012 couples the accelerator 3001 to the accelerator memory 3005, and a separate host memory bus 3011 couples the core I/O 3003 to the host memory 3007. As previously mentioned, the accelerator memory 3005 may comprise high bandwidth memory (HBM) or stacked DRAM (some examples of which are described herein), and the host memory 3007 may comprise DRAM, for example, double data rate synchronous dynamic random access memory (e.g., DDR3 SDRAM, DDR4 SDRAM, etc.). However, the principles underlying the present invention are not limited to any particular type of memory or memory protocol.

一実施例において、アクセラレータ３００１、及び、プロセッサチップ３００３内の処理コア上で実行する「ホスト」ソフトウェアの両方は、「ホストバイアス」フロー及び「デバイスバイアス」フローと称されるプロトコルフローについての２つの別個のセットを用いて、アクセラレータメモリ３００５にアクセスする。以下で説明されるように、一実施例では、特定のメモリアクセスのためにプロトコルフローを変調すること及び／又は選択することに関する複数のオプションをサポートする。 In one embodiment, both the accelerator 3001 and the "host" software executing on the processing cores within the processor chip 3003 access the accelerator memory 3005 using two separate sets of protocol flows, referred to as "host biased" flows and "device biased" flows. As described below, one embodiment supports multiple options for modulating and/or selecting the protocol flows for a particular memory access.

The coherence bias flows are implemented, in part, on two protocol layers of the multi-protocol link 3010 between the accelerator 3001 and one of the processor chips 3003: a CAC protocol layer and an MA protocol layer. In one embodiment, the coherence bias flows are enabled by (a) using existing opcodes in the CAC protocol in new ways, (b) adding new opcodes to the existing MA standard, and (c) adding support for the MA protocol to the multi-protocol link 3010 (previously, the link included only CAC and PCDI). Note that the multi-protocol link is not limited to supporting only CAC and MA; in one embodiment, it is simply required to support at least those protocols.

As used herein, the "host bias" flows, shown in FIG. 30, are a set of flows that funnel all requests to the accelerator memory 3005, including requests from the accelerator itself, through the standard coherence controller 3009 in the processor chip 3003 to which the accelerator 3001 is attached. This causes the accelerator 3001 to take a circuitous route to access its own memory, but it allows accesses from both the accelerator 3001 and the processor core I/O 3003 to be kept coherent using the processor's standard coherence controller 3009. In one embodiment, the flows use CAC opcodes to issue requests over the multi-protocol link to the processor's coherence controller 3009, in the same or a similar manner to the way the processor core 3003 issues requests to the coherence controller 3009. For example, the processor chip's coherence controller 3009 may issue the UPI and CAC coherence messages (e.g., snoops) that result from a request from the accelerator 3001 to all peer processor core chips (e.g., 3003) and internal processor agents on behalf of the accelerator, just as it would for a request from a processor core 3003. In this manner, coherency is maintained between the data accessed by the accelerator 3001 and the processor core I/O 3003.

In one embodiment, the coherence controller 3009 also conditionally issues memory access messages to the accelerator's memory controller 3006 over the multi-protocol link 2800. These messages are similar to the messages that the coherence controller 3009 sends to the memory controllers local to its processor die, and they include new opcodes that allow data to be returned directly to an agent internal to the accelerator 3001, instead of forcing the data to be returned to the processor's coherence controller 3009 over the multi-protocol link 2800 and then returned to the accelerator 3001 as a CAC response over the multi-protocol link 2800.

In one embodiment of the "host bias" mode shown in FIG. 30, all requests from the processor core 3003 that target the accelerator-attached memory 3005 are sent directly to the processor's coherence controller 3009, just as they would be if they targeted normal host memory 3007. The coherence controller 3009 may apply its standard cache coherence algorithms and send its standard cache coherence messages, just as it does for accesses from the accelerator 3001 and just as it does for accesses to normal host memory 3007. The coherence controller 3009 also conditionally sends MA commands over the multi-protocol link 2800 for this class of requests, though in this case the MA flows return data across the multi-protocol link 2800.

The "device bias" flows, shown in FIG. 31, are flows that allow the accelerator 3001 to access its locally attached memory 3005 without consulting the host processor's cache coherence controller 3009. More specifically, these flows allow the accelerator 3001 to access its locally attached memory via the memory controller 3006 without sending a request over the multi-protocol link 2800.

In "device bias" mode, requests from the processor core I/O 3003 are issued as described for "host bias" above, but are completed differently in the MA portion of their flow. In "device bias", processor requests to the accelerator-attached memory 3005 are completed as though they had been issued as "uncached" requests. This "uncached" convention is adopted so that data subject to device bias flows can never be cached in the processor's cache hierarchy. It is this fact that allows the accelerator 3001 to access device-biased data in its memory 3005 without consulting the cache coherence controller 3009 on the processor.

In one embodiment, support for the "uncached" processor core 3003 access flow is implemented with a globally observed, use-once ("GO-UO") response on the processor's CAC bus. This response returns a piece of data to the processor core 3003 and instructs the processor to use the value of the data only once. This prevents caching of the data and satisfies the requirements of the "uncached" flow. In systems with cores that do not support the GO-UO response, the "uncached" flow may be implemented using a multi-message response sequence on the MA layer of the multiprotocol link 2800 and on the processor core 3003's CAC bus.

Specifically, when a processor core is found to target a "device bias" page at the accelerator 3001, the accelerator sets up some state to block future requests from the accelerator to the target cache line and sends a special "Device Bias Hit" response on the MA layer of the multiprotocol link 2800. In response to this MA message, the processor's cache coherence controller 3009 returns the data to the requesting processor core 3003 and immediately follows the data return with a snoop-invalidate message. When the processor core 3003 acknowledges the snoop-invalidate as complete, the cache coherence controller 3009 sends another special MA "Device Bias Block Complete" message back to the accelerator 3001 on the MA layer of the multiprotocol link 2800. This completion message causes the accelerator 3001 to clear the aforementioned blocking state.

FIG. 107 illustrates an embodiment using bias. In one embodiment, the selection between device bias and host bias flows is driven by a bias tracker data structure, which may be maintained as a bias table 10707 in the accelerator memory 3005. This bias table 10707 may be a page-granular structure (i.e., controlled at the granularity of a memory page) that includes one or two bits per accelerator-attached memory page. The bias table 10707 may be implemented in a stolen memory range of the accelerator-attached memory 3005, with or without a bias cache 10703 in the accelerator (e.g., to cache frequently or recently used entries of the bias table 10707). Alternatively, the entire bias table 10707 may be maintained within the accelerator 3001.

In one embodiment, the bias table entry associated with each access to the accelerator-attached memory 3005 is accessed prior to the actual access to the accelerator memory, causing the following operations:
・Local requests from the accelerator 3001 that find their page in device bias are forwarded directly to the accelerator memory 3005.
・Local requests from the accelerator 3001 that find their page in host bias are forwarded to the processor 3003 as CAC requests over the multiprotocol link 2800.
・MA requests from the processor 3003 that find their page in device bias complete the request using the "uncached" flow described above.
・MA requests from the processor 3003 that find their page in host bias complete the request like a normal memory read.
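As a minimal illustration only, the four lookup outcomes listed above can be modeled in Python. The `Bias` enum, the dictionary-based `bias_table`, the `route_access` function, and the 4 KiB page size are hypothetical names and values chosen for this sketch; the description does not prescribe any of them.

```python
from enum import Enum

PAGE_SIZE = 4096  # assumed page size; the text does not specify one


class Bias(Enum):
    HOST = 0
    DEVICE = 1


def route_access(bias_table, from_accelerator, address):
    """Pick the flow for one access to accelerator-attached memory.

    bias_table maps a page number to a Bias value (hypothetical layout).
    """
    bias = bias_table[address // PAGE_SIZE]  # lookup precedes the real access
    if from_accelerator:
        if bias is Bias.DEVICE:
            return "forward directly to accelerator memory"
        return "forward to processor as CAC request"
    if bias is Bias.DEVICE:
        return "complete with uncached flow"
    return "complete as normal memory read"


bias_table = {0: Bias.DEVICE, 1: Bias.HOST}
print(route_access(bias_table, True, 0x0010))   # accelerator, device-bias page
print(route_access(bias_table, False, 0x1010))  # processor, host-bias page
```

A hardware implementation would consult a bias cache or a stolen memory range rather than a dictionary; the sketch only captures the decision itself.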

ページのバイアス状態は、ソフトウェアベースのメカニズム、ハードウェア支援型のソフトウェアベースのメカニズムのいずれか一方により、又は、制限されたセットの場合、純粋にハードウェアベースのメカニズムにより変更され得る。 The bias state of a page can be changed by either software-based mechanisms, hardware-assisted software-based mechanisms, or, in a limited set of cases, purely hardware-based mechanisms.

バイアス状態を変更するための１つのメカニズムは、ＡＰＩコール（例えば、ＯｐｅｎＣＬ）を採用し、バイアス状態を変更するように指示するアクセラレータ３００１にメッセージを順番に送信（又は、コマンド記述子をエンキュー）するアクセラレータのデバイスドライバを順番に呼び出し、いくつかの遷移に関して、ホストにおいてキャッシュフラッシュ処理を実行する。キャッシュフラッシュ処理は、ホストバイアスからデバイスバイアスへの遷移に必要とされるが、逆の遷移には必須ではない。 One mechanism for changing the bias state employs an API call (e.g., OpenCL) that in turn invokes the accelerator's device driver, which in turn sends a message (or enqueues a command descriptor) to the accelerator 3001 instructing it to change the bias state, and for some transitions, performs a cache flush operation in the host. A cache flush operation is required for transitions from host bias to device bias, but is not required for the reverse transition.
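The asymmetry described above, in which only the host-to-device transition requires a host cache flush, can be sketched as follows. This is an illustrative model under stated assumptions, not the driver interface: `change_bias`, the string bias values, and the `flush_host_cache` callback are hypothetical.

```python
def change_bias(bias_table, page, new_bias, flush_host_cache):
    """Flip one page between "host" and "device" bias.

    flush_host_cache is a caller-supplied callback (hypothetical) that
    flushes the page from the host caches.
    """
    old_bias = bias_table[page]
    if old_bias == new_bias:
        return
    if old_bias == "host" and new_bias == "device":
        # Only the host -> device transition requires the flush.
        flush_host_cache(page)
    bias_table[page] = new_bias


flushed = []
table = {7: "host"}
change_bias(table, 7, "device", flushed.append)  # flushes page 7, then flips
change_bias(table, 7, "host", flushed.append)    # flips without flushing
```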

In some cases, it is too difficult for software to determine when to make the bias transition API calls and to identify the pages that require a bias transition. In such cases, the accelerator may implement a bias transition "hint" mechanism, in which it detects the need for a bias transition and sends a message to its driver indicating as much. The hint mechanism may be as simple as a mechanism, responsive to a bias table lookup, that triggers on accelerator accesses to host bias pages or host accesses to device bias pages and signals the event to the accelerator's driver via an interrupt.

Note that some embodiments may require a second bias state bit to enable a bias transition state value. This allows the system to continue accessing memory pages while those pages are in the process of a bias change (i.e., when the caches are partially flushed and incremental cache pollution due to subsequent requests must be suppressed).

一実施例に従う例示的な処理が図３２に示される。処理は、本明細書で説明されるシステム及びプロセッサアーキテクチャ上に実装され得るが、任意の特定のシステム又はプロセッサアーキテクチャに限定されない。 An exemplary process according to one embodiment is shown in FIG. 32. The process may be implemented on the systems and processor architectures described herein, but is not limited to any particular system or processor architecture.

３２０１において、ページの特定のセットがデバイスバイアス内に置かれる。すでに述べたように、これは、（例えば、各ページと関連付けられたビットを設定することにより）ページがデバイスバイアス内にあることを示すために、バイアステーブル内のこれらのページに対するエントリを更新することにより実現され得る。一実施例において、一旦デバイスバイアスに設定されると、ページは、ホストキャッシュメモリにキャッシュされていないことが保証される。３２０２において、ページがデバイスメモリから割り当てられる（例えば、ソフトウェアが、ドライバ／ＡＰＩコールを開始することによりページを割り当てる）。 At 3201, a particular set of pages are placed in the device bias. As previously mentioned, this may be accomplished by updating the entries for those pages in the bias table to indicate that the pages are in the device bias (e.g., by setting a bit associated with each page). In one embodiment, once placed in the device bias, the pages are guaranteed not to be cached in the host cache memory. At 3202, the pages are allocated from the device memory (e.g., software allocates the pages by initiating a driver/API call).

At 3203, operands are pushed from a processor core to the allocated pages. In one embodiment, this is accomplished by software using an API call (e.g., an OpenCL API call) to flip the operand pages to host bias. No data copies or cache flushes are required, and at this stage the operand data may end up at some arbitrary location within the host cache hierarchy.

At 3204, the accelerator device uses the operands to generate results. For example, it may execute commands and process data directly from its local memory (e.g., 3005, described above). In one embodiment, software uses the OpenCL API to flip the operand pages back to device bias (e.g., by updating the bias table). As a result of the API call, a work descriptor is submitted to the device (e.g., via shared or dedicated work queues, as described below). The work descriptor may instruct the device to flush the operand pages from the host cache, resulting in a cache flush (e.g., performed using CLFLUSH on the CAC protocol). In one embodiment, the accelerator executes with no host-related coherence overhead and dumps data to the result pages.

３２０５において、割り当てられたページから結果が引き出される。例えば、一実施例において、ソフトウェアは、結果ページをホストバイアスにフリップするために、（例えば、ＯｐｅｎＣＬＡＰＩを介して）１又は複数のＡＰＩコールを行う。この動作は、一部のバイアス状態を変更させ得るが、任意のコヒーレンス又はキャッシュフラッシュ動作を生じさせない。次に、ホストプロセッサコアは、必要に応じて、結果のデータにアクセスし、キャッシュし、共有することができる。最後に、３２０６において、割り当てられたページは、（例えば、ソフトウェアを介して）解放される。 At 3205, the results are pulled from the allocated page. For example, in one embodiment, software makes one or more API calls (e.g., via the OpenCL API) to flip the result page to the host bias. This operation may change some bias state, but does not cause any coherence or cache flush operations. The host processor core can then access, cache, and share the result data as needed. Finally, at 3206, the allocated page is freed (e.g., via software).

オペランドが１又は複数のＩ／Ｏデバイスから解放される同様の処理が、図３３に示される。３３０１において、ページの特定のセットがデバイスバイアス内に置かれる。すでに述べたように、これは、（例えば、各ページと関連付けられたビットを設定することにより）ページがデバイスバイアス内にあることを示すために、バイアステーブル内のこれらのページに対するエントリを更新することにより実現され得る。一実施例において、一旦デバイスバイアスに設定されると、ページは、ホストキャッシュメモリにキャッシュされていないことが保証される。３３０２において、ページがデバイスメモリから割り当てられる（例えば、ソフトウェアが、ドライバ／ＡＰＩコールを開始することによりページを割り当てる） A similar process in which operands are released from one or more I/O devices is shown in FIG. 33. At 3301, a particular set of pages are placed in the device bias. As previously mentioned, this may be accomplished by updating the entries for those pages in the bias table to indicate that the pages are in the device bias (e.g., by setting a bit associated with each page). In one embodiment, once placed in the device bias, the pages are guaranteed not to be cached in the host cache memory. At 3302, the pages are allocated from the device memory (e.g., software allocates the pages by initiating a driver/API call).

At 3303, operands are pushed from an I/O agent to the allocated pages. In one embodiment, this is accomplished by software posting a DMA request to the I/O agent and the I/O agent using non-allocating stores to write the data. In one embodiment, the data is never allocated in the host cache hierarchy and the target pages remain in device bias.

３３０４において、アクセラレータデバイスは、オペランドを用いて結果を生成する。例えば、ソフトウェアは、アクセラレータデバイスにワークをサブミットしてよく、必要とされるページ遷移はない（すなわち、ページはデバイスバイアス内に留まる）。一実施例において、アクセラレータデバイスは、ホスト関連コヒーレンスオーバヘッドなしで実行され、アクセラレータは、データを結果ページにダンプする。 At 3304, the accelerator device uses the operands to generate a result. For example, software may submit work to the accelerator device and no page transition is required (i.e., the page remains within the device bias). In one embodiment, the accelerator device executes with no host-related coherence overhead and the accelerator dumps the data into a result page.

３３０５において、（例えば、ソフトウェアからの指示の下）Ｉ／Ｏエージェントは、割り当てられたページから結果を引き出す。例えば、ソフトウェアは、ＤＭＡ要求をＩ／Ｏエージェントにポストしてよい。ソースページが、デバイスバイアスに留まる場合、必要とされるページ遷移はない。一実施例において、Ｉ／Ｏブリッジは、ＲｄＣｕｒｒ（現在の読み出し）要求を用いて、結果ページからデータのキャッシュ不能なコピーをつかむ。 At 3305, the I/O agent (e.g., under direction from software) pulls the results from the allocated page. For example, software may post a DMA request to the I/O agent. If the source page remains on the device bias, no page transition is required. In one embodiment, the I/O bridge grabs a non-cacheable copy of the data from the result page with an RdCurr (read current) request.

いくつかの実施例において、ワークキュー（ＷＱ）は、ソフトウェア、サービス品質（ＱｏＳ）を実装するために用いられるアービタ及び公平性ポリシ、記述子を処理するための処理エンジン、アドレス変換及びキャッシングインタフェース、及び、メモリ読み出し／書き込みインタフェースによりサブミットされた「記述子」を保持する。記述子は、行われるワークの範囲を定義する。図３４に図示されるように、一実施例では、専用のワークキュー３４００及び共有のワークキュー３４０１といった２つの異なるタイプのワークキューがある。専用のワークキュー３４００は、単一のアプリケーション３４１３に対する記述子を格納する一方、共有のワークキュー３４０１は、複数のアプリケーション３４１０－３４１２によりサブミットされた記述子を格納する。ハードウェアインタフェース／アービタ３４０２は、特定のアービトレーションポリシに従って（例えば、各アプリケーション３４１０－３４１３及びＱｏＳ／公平性ポリシの処理要件に基づいて）、ワークキュー３４００－３４０１からアクセラレータ処理エンジン３４０５に記述子をディスパッチする。 In some embodiments, a Work Queue (WQ) holds "descriptors" submitted by software, an arbiter used to implement quality of service (QoS) and fairness policies, processing engines for processing the descriptors, an address translation and caching interface, and a memory read/write interface. The descriptors define the scope of work to be done. As shown in FIG. 34, in one embodiment, there are two different types of Work Queues: Dedicated Work Queue 3400 and Shared Work Queue 3401. Dedicated Work Queue 3400 stores descriptors for a single application 3413, while Shared Work Queue 3401 stores descriptors submitted by multiple applications 3410-3412. A Hardware Interface/Arbiter 3402 dispatches the descriptors from the Work Queues 3400-3401 to the accelerator processing engines 3405 according to a particular arbitration policy (e.g., based on the processing requirements of each application 3410-3413 and QoS/fairness policies).

図１０８ａ～図１０８Ｂは、ワークキューベースの実装と共に用いられるメモリマッピングされたＩ／Ｏ（ＭＭＩＯ）空間レジスタを示す。バージョンレジスタ１０８０７は、デバイスによりサポートされているこのアーキテクチャ仕様のバージョンを報告する。 Figures 108a-108b show memory mapped I/O (MMIO) space registers for use with a work queue based implementation. Version register 10807 reports the version of this architecture specification supported by the device.

一般的な機能レジスタ（ＧＥＮＣＡＰ）１０８０８は、デバイスの一般的な機能、例えば、最大転送サイズ、最大バッチサイズなどを規定する。テーブルＢは、ＧＥＮＣＡＰレジスタにおいて特定され得る様々なパラメータ及び値を列挙する。
General Capabilities Register (GENCAP) 10808 defines the general capabilities of the device, e.g., maximum transfer size, maximum batch size, etc. Table B lists the various parameters and values that may be specified in the GENCAP register.

一実施例において、ワークキュー機能レジスタ（ＷＱＣＡＰ）１０８１０は、ワークキューの機能、例えば、動作についての専用及び／又は共有モードに関するサポート、エンジンの数、ワークキューの数などを規定する。以下のテーブルＣは、構成され得る様々なパラメータ及び値を列挙する。
In one embodiment, Work Queue Capability Register (WQCAP) 10810 defines the capabilities of the Work Queue, such as support for dedicated and/or shared modes of operation, number of engines, number of Work Queues, etc. Table C below lists various parameters and values that may be configured.

In one embodiment, Operation Capabilities Register (OPCAP) 10811 is a bitmask that defines the operation types supported by the device. Each bit corresponds to the operation type with the same code as the bit position. For example, bit 0 of this register corresponds to the No-op operation (code 0). A bit is set if the operation is supported and cleared if the operation is not supported.
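Testing whether an operation is supported amounts to testing one bit of the mask. The sketch below is illustrative: the function name and the example mask value are assumptions, and only operation code 0 (No-op) is taken from the text.

```python
NO_OP = 0  # operation code 0, per the text


def operation_supported(opcap, opcode):
    """Return True if the OPCAP bit at position `opcode` is set."""
    return (opcap >> opcode) & 1 == 1


opcap = 0b1011  # example value only: bits 0, 1 and 3 set
print(operation_supported(opcap, NO_OP))
print(operation_supported(opcap, 2))
```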

In one embodiment, General Configuration Register (GENCFG) 10812 defines Virtual Channel (VC) Steering Tags. See Table E below.

一実施例において、一般的な制御レジスタ（ＧＥＮＣＴＲＬ）１０８１３は、ハードウェア又はソフトウェアエラーに対して割込みが生成したか否かを示す。以下のテーブルＦを参照する。
In one embodiment, General Control Register (GENCTRL) 10813 indicates whether an interrupt was generated for a hardware or software error. See Table F below.

一実施例において、デバイスイネーブルレジスタ（ＥＮＡＢＬＥ）は、エラーコード、デバイスがイネーブルか否かに応じたインジケータ、及び、デバイスリセット値を格納する。さらなる詳細については、以下のテーブルＧを参照する。
In one embodiment, the device enable register (ENABLE) stores an error code, an indicator as to whether the device is enabled, and a device reset value. See Table G below for further details.

In one embodiment, the interrupt cause register (INTCAUSE) stores a value indicating the cause of the interrupt. See Table H below.

In one embodiment, the command register (CMD) 10814 is used to submit the drain WQ, drain PASID, and drain all commands. The abort field indicates whether the requested operation is a drain or an abort. Before writing to this register, software may ensure that any command previously submitted via this register has completed. Before writing to this register, software may also configure the command configuration register and, if a completion record is requested, the command completion record address register.

The drain all command drains or aborts all outstanding descriptors in all WQs and all engines. The drain PASID command drains or aborts descriptors that use a specified PASID in all WQs and all engines. The drain WQ command drains or aborts all descriptors in a specified WQ. Depending on the implementation, any drain command may wait for the completion of other descriptors in addition to the descriptors it is required to wait for.

アボート領域が１である場合、ソフトウェアは、影響のある記述子が廃棄されることを要求している。しかしながら、ハードウェアは、これらの一部又はすべてをさらに完了してよい。記述子が廃棄された場合、書き込まれる完了記録はなく、その記述子に対して生成される完了割込みはない。他のメモリアクセスの一部又はすべてが発生し得る。 If the abort field is 1, software has requested that the affected descriptors be discarded. However, the hardware may still complete some or all of them. If a descriptor is discarded, no completion record is written and no completion interrupt is generated for that descriptor. Some or all other memory accesses may occur.
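The scope of each drain command can be modeled as a filter over the outstanding descriptors. This is a sketch under stated assumptions: the `Descriptor` record and the command strings are hypothetical, and the model ignores the fact that hardware may wait on more descriptors than strictly required.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptor:
    wq: int     # work queue the descriptor was submitted to
    pasid: int  # PASID associated with the descriptor


def affected_descriptors(outstanding, command, wq=None, pasid=None):
    """Select the descriptors a drain command must drain or abort."""
    if command == "drain_all":
        return list(outstanding)
    if command == "drain_pasid":
        return [d for d in outstanding if d.pasid == pasid]
    if command == "drain_wq":
        return [d for d in outstanding if d.wq == wq]
    raise ValueError(f"unknown command: {command}")


outstanding = [Descriptor(0, 10), Descriptor(0, 11), Descriptor(1, 10)]
print(affected_descriptors(outstanding, "drain_pasid", pasid=10))
```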

コマンドの完了は、完了割込みを生成することにより（要求された場合）、このレジスタのステータスフィールドをクリアにすることにより示される。完了がシグナリングされたときに、すべての影響のある記述子は、完了又は廃棄のいずれか一方であり、任意の影響のある記述子に起因して生成されるさらなるアドレス変換、メモリ読み出し、メモリ書き込み又は割込みはない。以下のテーブルＩを参照する。
Completion of a command is indicated by generating a completion interrupt (if requested) and by clearing the status field of this register. When completion is signaled, all affected descriptors are either completed or discarded, and there are no further address translations, memory reads, memory writes or interrupts generated due to any affected descriptors. See Table I below.

In one embodiment, software error status register (SWERROR) 10815 stores several different types of errors, such as: an error in submitting a descriptor; an error translating the completion record address in a descriptor; an error validating a descriptor, if the completion record address valid flag in the descriptor is 0; and an error while processing a descriptor, such as a page fault, if the completion record address valid flag in the descriptor is 0. See Table J below.

一実施例において、ハードウェアエラーステータスレジスタ（ＨＷＥＲＲＯＲ）１０８１６は、ソフトウェアエラーステータスレジスタと同様の方式である（上記を参照）。 In one embodiment, the hardware error status register (HWERROR) 10816 is similar in format to the software error status register (see above).

一実施例において、グループ構成レジスタ（ＧＲＰＣＦＧ）１０８１７は、ワークキュー／エンジングループごとに構成データを格納する（図３６～図３７を参照）。特に、グループ構成テーブルは、エンジンに対するワークキューのマッピングを制御するＢＡＲ０におけるレジスタのアレイである。エンジンと同じ数のグループがあるが、ソフトウェアは、必要とするグループの数を構成してよい。それぞれのアクティブなグループは、１又は複数のワークキュー及び１又は複数のエンジンを含む。任意の未使用のグループは、０に等しいＷＱフィールド及びエンジンフィールドの両方を有していなければならない。グループ内の任意のＷＱにサブミットされた記述子は、グループ内の任意のエンジンにより処理されてよい。それぞれのアクティブなワークキューは、単一のグループ内になければならない。アクティブなワークキューは、対応するＷＱＣＦＧレジスタのＷＱサイズフィールドがゼロ以外のものである。グループ内に無いエンジンはいずれもインアクティブである。 In one embodiment, the Group Configuration Register (GRPCFG) 10817 stores configuration data for each Work Queue/Engine Group (see Figures 36-37). In particular, the Group Configuration Table is an array of registers in BAR0 that control the mapping of Work Queues to engines. There are as many groups as there are engines, but software may configure the number of groups required. Each active group contains one or more Work Queues and one or more engines. Any unused group must have both the WQ and Engine fields equal to 0. Descriptors submitted to any WQ in a group may be processed by any engine in the group. Each active Work Queue must be in a single group. An active Work Queue is one whose corresponding WQCFG register has a non-zero WQ Size field. Any engine not in a group is inactive.
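The group rules above lend themselves to a small consistency check. The sketch assumes that the WQ and engine fields of each group are bitmasks with one bit per work queue or engine; the text does not state the encoding, and the function names are hypothetical.

```python
def validate_groups(groups, active_wqs):
    """Check a group configuration against the rules in the text.

    groups: list of (wq_mask, engine_mask) pairs, one per group
            (bitmask encoding is an assumption of this sketch).
    active_wqs: set of WQ indices whose WQ Size field is non-zero.
    """
    for wq_mask, engine_mask in groups:
        # Active groups need at least one WQ and one engine;
        # unused groups must have both fields equal to 0.
        if (wq_mask == 0) != (engine_mask == 0):
            return False
    for wq in active_wqs:
        owners = sum(1 for wq_mask, _ in groups if (wq_mask >> wq) & 1)
        if owners != 1:  # each active WQ belongs to exactly one group
            return False
    return True


def inactive_engines(groups, num_engines):
    """Engines that appear in no group are inactive."""
    used = 0
    for _, engine_mask in groups:
        used |= engine_mask
    return [e for e in range(num_engines) if not (used >> e) & 1]


groups = [(0b011, 0b01), (0b100, 0b10), (0, 0)]
print(validate_groups(groups, {0, 1, 2}))
print(inactive_engines([(0b1, 0b01)], 2))
```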

各ＧＲＰＣＦＧレジスタ１０８１７は、３つのサブレジスタに分割されてよく、各サブレジスタは、１又は複数の３２ビットワードである（テーブルＫ～Ｍを参照）。デバイスはイネーブルである間、これらのレジスタは、読み取り専用であり得る。それらは、ＷＱＣＡＰのワークキュー構成サポートフィールドが０である場合も読み取り専用である。 Each GRPCFG register 10817 may be divided into three sub-registers, each of which is one or more 32-bit words (see Tables K-M). These registers may be read-only while the device is enabled. They are also read-only if the Work Queue Configuration Supported field in the WQCAP is 0.

ＢＡＲ０内のサブレジスタのオフセットは、グループＧごとに、０≦Ｇ＜エンジンの数であり、一実施例では以下のとおりである。
The offsets of the sub-registers in BAR0, for each group G, 0≦G<number of engines, are as follows in one embodiment:

In one embodiment, the Work Queue Configuration Register (WQCFG) 10818 stores data that defines the operation of each Work Queue. The WQ Configuration Table is an array of 16-byte registers in BAR0. The number of WQ configuration registers matches the Number of WQs field in WQCAP.

各１６バイトＷＱＣＦＧレジスタは、４つの３２ビットサブレジスタに分割され、アラインされた６４ビット読み出し又は書き込み動作を用いて、読み出され又は書き込まれてよい。 Each 16-byte WQCFG register is divided into four 32-bit sub-registers and may be read or written using aligned 64-bit read or write operations.
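Because the table is an array of 16-byte registers, each split into four 32-bit sub-registers, the byte offset of a sub-register follows from simple arithmetic. The sketch assumes that the sub-registers WQCFG-A through WQCFG-D are laid out in that order; `table_base` is a placeholder, since the text gives no offset for the table within BAR0.

```python
SUB_REGISTERS = "ABCD"  # WQCFG-A through WQCFG-D, 32 bits each (order assumed)


def wqcfg_offset(table_base, wq_index, sub_register):
    """Byte offset in BAR0 of one WQCFG sub-register.

    table_base is a placeholder: the text gives no offset for the table.
    """
    return table_base + 16 * wq_index + 4 * SUB_REGISTERS.index(sub_register)


# Example with an arbitrary base: sub-register C of WQ 3.
print(hex(wqcfg_offset(0x2000, 3, "C")))
```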

デバイスがイネーブルの間、又は、ＷＱＣＡＰのワークキュー構成サポートフィールドが０である場合、各ＷＱＣＦＧ－Ａサブレジスタは、読み取り専用である。 While the device is enabled or if the Work Queue Configuration Supported field in the WQCAP is 0, each WQCFG-A subregister is read-only.

Each WQCFG-B sub-register is writable at any time unless the Work Queue Configuration Support field in WQCAP is 0. If the WQ Threshold field contains a value greater than WQ Size at the time the WQ is enabled, the WQ is not enabled and the WQ Error Code is set to 4. If the WQ Threshold field is written with a value greater than WQ Size while the WQ is enabled, the WQ is disabled and the WQ Error Code is set to 4.

ＷＱがイネーブルの間、各ＷＱＣＦＧ－Ｃサブレジスタは読み取り専用である。それは、ＷＱイネーブルを１に設定する前又は同時に書き込まれてよい。ＷＱＣＡＰのワークキュー構成サポートフィールドが０である場合、以下のフィールド、すなわち、ＷＱモード、ＷＱ障害のブロック・イネーブル（ＷＱＦａｕｌｔｏｎＢｌｏｃｋＥｎａｂｌｅ）及びＷＱ優先度のフィールドは、常に読み取り専用である。たとえＷＱＣＡＰのワークキュー構成サポートフィールドが０であるとしても、ＷＱＣＦＧ－Ｃの以下のフィールド、すなわち、ＷＱＰＡＳＩＤ及びＷＱＵ／Ｓのフィールドは、ＷＱがイネーブルにされていない場合に書き込み可能である。 Each WQCFG-C subregister is read-only while WQ is enabled. It may be written before or at the same time as setting WQ Enable to 1. If the Work Queue Configuration Support field of WQCAP is 0, the following fields are always read-only: WQ Mode, WQ Fault on Block Enable, and WQ Priority. Even if the Work Queue Configuration Support field of WQCAP is 0, the following fields of WQCFG-C are writeable if WQ is not enabled: WQ PASID and WQ U/S.

各ＷＱＣＦＧ－Ｄサブレジスタは、いつでも書き込み可能である。しかしながら、デバイスがイネーブルにされていない場合、それは、ＷＱイネーブルを１に設定するエラーである。 Each WQCFG-D subregister is writable at any time. However, if the device is not enabled, it is an error to set WQ Enable to 1.

ＷＱイネーブルが１に設定されている場合、ＷＱイネーブル及びＷＱエラーコードフィールドの両方がクリアされる。次に、ＷＱイネーブル又はＷＱエラーコードのいずれか一方は、ＷＱのイネーブル化に成功したか否かを示すゼロ以外の値に設定される。 If WQ Enable is set to 1, then both the WQ Enable and WQ Error Code fields are cleared. Then, either WQ Enable or WQ Error Code is set to a non-zero value that indicates whether the WQ was successfully enabled.
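The enable sequence can be sketched as follows. Only the threshold check and error code 4 come from the text; the dictionary layout and function name are hypothetical, and the error code reported for a WQ whose size is 0 is not given, so the sketch rejects that case with an exception instead of guessing a code.

```python
WQ_ERROR_THRESHOLD = 4  # error code stated in the text


def enable_wq(wq):
    """Model a write of 1 to WQ Enable; wq is a dict of WQCFG fields."""
    if wq["size"] == 0:
        raise ValueError("a WQ with WQ Size 0 cannot be enabled")
    wq["enable"] = 0       # both fields are cleared first
    wq["error_code"] = 0
    if wq["threshold"] > wq["size"]:
        wq["error_code"] = WQ_ERROR_THRESHOLD
    else:
        wq["enable"] = 1
    # Exactly one of the two fields is now non-zero.
    return wq["enable"] == 1


good = {"size": 16, "threshold": 8, "enable": 0, "error_code": 0}
bad = {"size": 16, "threshold": 32, "enable": 0, "error_code": 0}
print(enable_wq(good), enable_wq(bad))
```

Software would read back the two fields after the write to learn which outcome occurred.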

The sum of the WQ Size fields of all WQCFG registers must not be greater than the Total WQ Size field in GENCAP. This constraint is checked when the device is enabled. A WQ whose WQ Size field is 0 cannot be enabled, and all other fields of such a WQCFG register are ignored. The WQ Size field is read-only while the device is enabled. See Table N for data regarding each of the sub-registers.

In one embodiment, Work Queue Occupancy Interrupt Control Registers 10819 (one per Work Queue (WQ)) allow software to request an interrupt when the occupancy of a Work Queue falls to a specified threshold. When the WQ Occupancy Interrupt Enable for a WQ is 1 and the current WQ occupancy is at or below the WQ Occupancy Limit, the following actions may be performed:
1. The WQ Occupancy Interrupt Enable field is cleared.
2. Bit 3 of the Interrupt Cause register (INTCAUSE) is set to 1.
3. If bit 3 of the Interrupt Cause register was 0 before step 2, an interrupt is generated using MSI-X table entry 0.
4. If the register is written with Enable = 1 and Limit ≥ the current WQ occupancy, the interrupt is generated immediately. Consequently, if the register is written with Enable = 1 and Limit ≥ WQ Size, the interrupt is always generated immediately.
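The occupancy interrupt behaves like a one-shot trigger. The model below is a sketch under stated assumptions: the class and attribute names are hypothetical, an interrupt is represented by a counter, and an immediate interrupt on a register write is assumed to follow the same bit 3 rule as step 3.

```python
class OccupancyInterruptControl:
    """Toy model of one WQ occupancy interrupt control register."""

    CAUSE_BIT = 3  # bit of the interrupt cause register, per the text

    def __init__(self):
        self.enable = 0
        self.limit = 0
        self.intcause = 0
        self.interrupts = 0  # interrupts raised on MSI-X table entry 0

    def write(self, enable, limit, occupancy):
        """Software writes the register; may fire immediately."""
        self.enable, self.limit = enable, limit
        self.update(occupancy)

    def update(self, occupancy):
        """Re-evaluate the condition for the current occupancy."""
        if self.enable == 1 and occupancy <= self.limit:
            self.enable = 0                                   # step 1
            was_set = (self.intcause >> self.CAUSE_BIT) & 1
            self.intcause |= 1 << self.CAUSE_BIT              # step 2
            if not was_set:
                self.interrupts += 1                          # step 3


reg = OccupancyInterruptControl()
reg.write(enable=1, limit=4, occupancy=10)  # above the limit: nothing yet
reg.update(occupancy=4)                     # reaches the limit: fires once
reg.update(occupancy=2)                     # enable was cleared: no repeat
print(reg.interrupts)
```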

一実施例において、ワークキューステータスレジスタ（ＷＱ毎に１つ）１０８２０は、各ＷＱにおける現在のエントリの数を規定する。この数は、記述子がキューにサブミットされ、又は、キューからディスパッチされるときにはいつでも変更する可能性があるので、ＷＱに空きがあるか否かを判断することに信頼できない。 In one embodiment, the Work Queue Status Registers (one per WQ) 10820 define the current number of entries in each WQ. This number can change any time a descriptor is submitted to or dispatched from the queue, so it cannot be relied upon to determine whether a WQ has room.

In one embodiment, MSI-X entry 10821 stores the MSI-X table data. The offset and number of entries are given in the MSI-X capability. The suggested number of entries is the number of WQs plus 2.

In one embodiment, the offset and number of entries of the MSI-X pending bit array 10822 are given in the MSI-X capability.

一実施例において、割込みメッセージストレージエントリ１０８２３は、テーブル構造内に割込みメッセージを格納する。このテーブルのフォーマットは、ＰＣＩｅで規定されるＭＳＩ－Ｘテーブルのフォーマットと同様であるが、サイズは、２０４８個のエントリに限定されない。しかしながら、いくつかの実施例において、このテーブルのサイズは、異なるＤＳＡ実装間で変化してよく、２０４８個のエントリより少なくてもよい。一実施例において、エントリの数は、一般的な機能レジスタの割込みメッセージストレージサイズフィールド内にある。割込みメッセージストレージサポート機能が０である場合、このテーブルは提示されない。ＤＳＡが多数の仮想マシン又はコンテナをサポートするために、サポートされるテーブルのサイズは、かなり大きい必要がある。 In one embodiment, the interrupt message storage entry 10823 stores interrupt messages in a table structure. The format of this table is similar to the format of the MSI-X table defined in PCIe, but the size is not limited to 2048 entries. However, in some embodiments, the size of this table may vary between different DSA implementations and may be less than 2048 entries. In one embodiment, the number of entries is in the interrupt message storage size field of the general capabilities register. If the interrupt message storage support capability is 0, this table is not present. In order for the DSA to support a large number of virtual machines or containers, the size of the supported table needs to be quite large.

一実施例において、ＩＭＳ内の各エントリのフォーマットは、以下のテーブルＰにおいて説明される。
In one embodiment, the format of each entry in the IMS is described in Table P below.

FIG. 35 illustrates one embodiment of a Data Streaming Accelerator (DSA) device having multiple work queues 3511-3512 that receive descriptors submitted via an I/O fabric interface 3501 (such as the multiprotocol link 2800 described above). The DSA uses the I/O fabric interface 3501 to receive downstream work requests from clients (such as processor cores, peer input/output (IO) agents (such as network interface controllers (NICs)), and/or software chained offload requests) and for upstream read, write, and address translation operations. The illustrated embodiment includes an arbiter 3513 that arbitrates between the work queues and dispatches work descriptors to one of multiple engines 3550. The operation of the arbiter 3513 and the work queues 3511-3512 may be configured through work queue configuration registers 3500. For example, the arbiter 3513 may be configured to implement various QoS and/or fairness policies for dispatching descriptors from each of the work queues 3511-3512 to each of the engines 3550.

一実施例において、ワークキュー３５１１－３５１２にキューイングされる記述子のいくつかは、ワーク記述子のバッチを含む／識別するバッチ記述子３５１５である。アービタ３５１３は、変換キャッシュ３５２０（プロセッサ上の潜在的に他のアドレス変換サービス）を通じて変換されたアドレスを使用いて、メモリから記述子３５１８のアレイを読み出すことによりバッチ記述子を処理するバッチ処理ユニット３５１６にバッチ記述子を転送する。一旦物理アドレスが識別されると、データ読み出し／書き込み回路３５４０は、メモリから記述子のバッチを読み出す。 In one embodiment, some of the descriptors queued in the work queues 3511-3512 are batch descriptors 3515 that contain/identify a batch of work descriptors. The arbiter 3513 forwards the batch descriptors to a batch processing unit 3516 which processes the batch descriptors by reading an array of descriptors 3518 from memory using addresses translated through a translation cache 3520 (and potentially other address translation services on the processor). Once the physical addresses have been identified, the data read/write circuitry 3540 reads the batch of descriptors from memory.

第２のアービタ３５１９は、バッチ処理ユニット３５１６により提供されるワーク記述子３５１８と、ワークキュー３５１１－３５１２から取得される個々のワーク記述子３５１４とのバッチ間でアービトレーションを実行し、ワーク記述子をワーク記述子処理ユニット３５３０に出力する。一実施例において、ワーク記述子処理ユニット３５３０は、（データＲ／Ｗユニット３５４０を介して）メモリを読み出し、データに対して要求されたオペレーションを実行し、出力データを生成し、（データＲ／Ｗユニット３５４０を介して）出力データ、完了記録及び割込みメッセージを書き込むステージを有する。 The second arbiter 3519 arbitrates between batches of work descriptors 3518 provided by the batch processing unit 3516 and individual work descriptors 3514 obtained from the work queues 3511-3512, and outputs the work descriptors to the work descriptor processing unit 3530. In one embodiment, the work descriptor processing unit 3530 has stages for reading memory (via data R/W unit 3540), performing requested operations on the data, generating output data, and writing (via data R/W unit 3540) the output data, completion records, and interrupt messages.

In one embodiment, the work queue configuration registers 3500 allow software to configure each WQ either as a shared work queue (SWQ), which receives descriptors using non-posted ENQCMD/S instructions, or as a dedicated work queue (DWQ), which receives descriptors using posted MOVDIR64B instructions. As already discussed above with respect to FIG. 34, a DWQ may process work descriptors and batch descriptors submitted from a single application, while an SWQ may be shared among multiple applications. The WQ configuration registers 3500 also allow software to control which WQs 3511-3512 feed which accelerator engines 3550, and the relative priorities of the WQs 3511-3512 that feed each engine. For example, an ordered set of priorities may be defined (e.g., high, medium, low; 1, 2, 3; etc.), and descriptors may generally be dispatched from higher priority work queues before, or more frequently than, descriptors from lower priority work queues. For example, with two work queues identified as high priority and low priority, for every 10 descriptors dispatched, 8 may be dispatched from the high priority work queue and 2 from the low priority work queue. Various other techniques may be used to achieve different priority levels between the work queues 3511-3512.
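
The 8-of-10 example above can be read as a weighted round-robin policy. The following is a minimal sketch of that reading, not the arbiter's actual algorithm; the function name `dispatch`, the queue names, and the per-round weights are hypothetical.

```python
from collections import deque


def dispatch(queues, weights, count):
    """Dispatch up to `count` descriptors using weighted round-robin.

    queues  -- dict mapping a queue name to a deque of descriptors
    weights -- dict mapping a queue name to descriptors served per round
    """
    # Serve the most heavily weighted (highest priority) queue first in each round.
    order = sorted(queues, key=lambda name: -weights[name])
    dispatched = []
    while len(dispatched) < count and any(queues.values()):
        for name in order:
            for _ in range(weights[name]):
                if len(dispatched) == count:
                    break
                if queues[name]:
                    dispatched.append((name, queues[name].popleft()))
    return dispatched
```

With weights of 8 and 2, one full round over two non-empty queues yields eight descriptors from the high priority queue and two from the low priority queue. An empty queue simply gives up its share of the round.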

In one embodiment, the Data Streaming Accelerator (DSA) is software-compatible with the PCI Express configuration mechanism and implements a PCI header and extended space in its configuration-mapped register set. The configuration registers can be programmed through CFC/CF8 or MMCFG from the root complex. All internal registers may also be accessible through the JTAG or SMBus interfaces.

一実施例において、ＤＳＡデバイスは、そのオペレーションを制御するためにメモリマップレジスタを用いる。機能、構成及びワークサブミッションレジスタ（ポータル）は、ＢＡＲ０、ＢＡＲ２及びＢＡＲ４レジスタにより規定されるＭＭＩＯ領域を通じてアクセス可能である（以下で説明される）。各ポータルは、それらがプロセッサページテーブルを用いて異なるアドレス空間（クライアント）に独立にマッピングされ得るように、別々の４Ｋページ上にあってよい。 In one embodiment, the DSA device uses memory-mapped registers to control its operation. Capability, configuration and work submission registers (portals) are accessible through the MMIO regions defined by the BAR0, BAR2 and BAR4 registers (described below). Each portal may be on a separate 4K page so that they can be independently mapped to different address spaces (clients) using the processor page tables.

As mentioned above, software specifies work for the DSA through descriptors. A descriptor specifies the type of operation for the DSA to perform, the addresses of data and status buffers, immediate operands, completion attributes, and so on (additional details regarding the descriptor format are described below). The completion attributes specify the address to which the completion record is written and the information needed to generate an optional completion interrupt.

一実施例において、ＤＳＡは、デバイス上のクライアント固有の状態を維持することを回避する。記述子を処理するすべての情報は、記述子自体に入っている。これは、ユーザモードアプリケーション間、並びに、仮想化されたシステム内の異なる仮想マシン（マシンコンテナ）間のその共有能力を改善する。 In one embodiment, the DSA avoids maintaining client-specific state on the device. All information to process the descriptor is in the descriptor itself. This improves its sharing capabilities between user mode applications as well as between different virtual machines (machine containers) in a virtualized system.

記述子は、オペレーション及び関連するパラメータを含んでよい（ワーク記述子と呼ばれる）、又は、記述子は、ワーク記述子のアレイのアドレスを含むことができる（バッチ記述子と呼ばれる）。ソフトウェアは、メモリに記述子を準備して、デバイスのワークキュー（ＷＱ）３５１１－３５１２に記述子をサブミットする。記述子は、ＷＱのモード及びクライアントの特権レベルに応じて、ＭＯＶＤＩＲ６４Ｂ、ＥＮＱＣＭＤ又はＥＮＱＣＭＤＳ命令を用いるデバイスにサブミットされる。 A descriptor may contain an operation and associated parameters (called a work descriptor), or the descriptor may contain the address of an array of work descriptors (called a batch descriptor). Software prepares the descriptors in memory and submits them to the device's work queue (WQ) 3511-3512. Descriptors are submitted to the device using the MOVDIR64B, ENQCMD, or ENQCMDS instructions, depending on the mode of the WQ and the privilege level of the client.

Each WQ 3511-3512 has a fixed number of slots and can therefore fill up under heavy load. In one embodiment, the device provides the feedback software needs to implement flow control. The device dispatches descriptors from the work queues 3511-3512 and submits them to the engines for further processing. When an engine 3550 completes a descriptor, or encounters a fault or error that results in an abort, it notifies the host software by writing a completion record in host memory, by issuing an interrupt, or both.

In one embodiment, each work queue is accessible through multiple registers, each in a separate 4KB page in the device MMIO space. For each WQ, one work submission register is called the "non-privileged portal" and is mapped into user space for use by user mode clients. Another work submission register is called the "privileged portal" and is used by the kernel mode driver. The remaining registers are guest portals, used by kernel mode clients within virtual machines.

As mentioned above, each work queue 3511-3512 can be configured to run in one of two modes: dedicated or shared. The DSA exposes capability bits in the Work Queue Capability Register to indicate support for dedicated and shared modes. The DSA also exposes a control in the Work Queue Configuration Registers 3500 to configure each WQ to operate in one of the modes. The mode of a WQ can only be changed while the WQ is disabled (i.e., WQCFG.Enabled = 0). Additional details of the WQ Capability Register and the WQ Configuration Registers are described below.

In one embodiment, in shared mode, a DSA client submits descriptors to a work queue using the ENQCMD or ENQCMDS instruction. ENQCMD and ENQCMDS use a 64-byte non-posted write and wait for a response from the device before completing. The DSA returns "success" (e.g., to the requesting client/application) if there is space in the work queue, or "retry" if the work queue is full. The ENQCMD and ENQCMDS instructions may return the status of the command submission in the zero flag (0 indicates success, 1 indicates retry). Using the ENQCMD and ENQCMDS instructions, multiple clients may submit descriptors directly and simultaneously to the same work queue. Because the device provides this feedback, clients can tell whether their descriptors were accepted.

共有モードにおいて、ＤＳＡは、カーネルモードクライアント用の特権ポータルを介して、サブミッションのための一部のＳＷＱ容量を予約してよい。非特権ポータルを介したワークサブミッションは、ＳＷＱ内の記述子の数が、ＳＷＱ用に設定された閾値に達するまで受け取られる。特権ポータルを介したワークサブミッションは、ＳＷＱが満杯になるまで受け取られる。ゲストポータルを介したワークサブミッションは、非特権ポータルと同じ方法で閾値により制限される。 In shared mode, the DSA may reserve some SWQ capacity for submissions through privileged portals for kernel mode clients. Work submissions through non-privileged portals are accepted until the number of descriptors in the SWQ reaches the threshold configured for the SWQ. Work submissions through privileged portals are accepted until the SWQ is full. Work submissions through guest portals are limited by the threshold in the same manner as non-privileged portals.

If the ENQCMD or ENQCMDS instruction returns "success", the descriptor has been accepted by the device and queued for processing. If the instruction returns "retry", software can either try to resubmit the descriptor to the SWQ or, if it is a user mode client using a non-privileged portal, request the kernel mode driver to submit the descriptor on its behalf through the privileged portal. This helps avoid denial of service and provides forward progress guarantees. Alternatively, software may use other methods (e.g., using the CPU to perform the work) if the SWQ is full.
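
The shared-mode behavior described above (a threshold for non-privileged submissions, a full-queue limit for privileged submissions, and a retry or fallback path) can be modeled in a few lines. This is a toy software model under stated assumptions, not the device interface: the class `SharedWorkQueue`, the function `submit`, the retry count, and the returned strings are all hypothetical.

```python
class SharedWorkQueue:
    """Toy model of a shared work queue (SWQ); not the device's real interface."""

    def __init__(self, size, threshold):
        self.size = size            # total slots in the queue
        self.threshold = threshold  # limit applied to non-privileged and guest portals
        self.entries = []

    def enqueue(self, descriptor, privileged=False):
        """Return True for 'success' or False for 'retry'."""
        limit = self.size if privileged else self.threshold
        if len(self.entries) >= limit:
            return False
        self.entries.append(descriptor)
        return True


def submit(wq, descriptor, retries=3, run_on_cpu=None):
    """Try the non-privileged portal, then the privileged one, then a CPU fallback."""
    for _ in range(retries):
        if wq.enqueue(descriptor):
            return "accepted"
    # Stands in for asking the kernel-mode driver to submit on the client's behalf.
    if wq.enqueue(descriptor, privileged=True):
        return "accepted-via-driver"
    if run_on_cpu is not None:
        run_on_cpu(descriptor)      # perform the work without the accelerator
        return "ran-on-cpu"
    return "retry"
```

The capacity between `threshold` and `size` is what the text calls reserved SWQ capacity: only submissions through the privileged portal can use it.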

クライアント／アプリケーションは、処理アドレス空間ＩＤ（ＰＡＳＩＤ）と呼ばれる２０ビットのＩＤを使用してデバイスにより識別される。ＰＡＳＩＤは、デバイスＴＬＢ１７２２内のアドレスを検索して、アドレス変換又はページ要求をＩＯＭＭＵ１７１０に（例えば、マルチプロトコルリンク２８００を介して）送信するために、デバイスにより用いられる。共有モードにおいて、各記述子と共に用いられるＰＡＳＩＤは、記述子のＰＡＳＩＤフィールドに含まれる。一実施例において、ＥＮＱＣＭＤは、特定のレジスタ（例えば、ＰＡＳＩＤＭＳＲ）から現在のスレッドのＰＡＳＩＤを記述子にコピーする一方、ＥＮＱＣＭＤＳは、スーパバイザモードのソフトウェアが記述子にＰＡＳＩＤをコピーすることを可能にする。 Clients/applications are identified by the device using a 20-bit ID called the Process Address Space ID (PASID). The PASID is used by the device to look up addresses in the device TLB 1722 and send address translation or page requests to the IOMMU 1710 (e.g., over the multi-protocol link 2800). In shared mode, the PASID to be used with each descriptor is included in the PASID field of the descriptor. In one embodiment, ENQCMD copies the PASID of the current thread from a specific register (e.g., the PASID MSR) into the descriptor, while ENQCMDS allows supervisor mode software to copy the PASID into the descriptor.

「専用」モードにおいて、ＤＳＡクライアントは、ＭＯＶＤＩＲ６４Ｂ命令を用いて、記述子をデバイスのワークキューにサブミットしてよい。ＭＯＶＤＩＲ６４Ｂは、６４バイトのポステッド書き込みを用い、命令は、書き込み動作のポステッド特性に起因してより高速完了する。専用のワークキューに関し、ＤＳＡは、ワークキュー内のスロットの総数をさらし、ソフトウェアに依存してフロー制御を提供してよい。ソフトウェアは、ワークキューの満杯条件を検出するために、サブミットされ、完了した記述子の数をトラッキングする役割を担う。ワークキューに空きがないときに、ソフトウェアが、記述子を専用のＷＱに誤ってサブミットした場合、記述子はドロップされ、エラーが、（例えば、ソフトウェアエラーレジスタに）記録されてよい。 In "dedicated" mode, the DSA client may submit descriptors to the device's work queue using the MOVDIR64B instruction. MOVDIR64B uses 64-byte posted writes, which complete faster due to the posted nature of the write operation. With a dedicated work queue, the DSA may expose the total number of slots in the work queue and rely on software to provide flow control. Software is responsible for tracking the number of descriptors submitted and completed to detect work queue full conditions. If software erroneously submits a descriptor to a dedicated WQ when there is no room in the work queue, the descriptor may be dropped and an error may be logged (e.g., in a software error register).
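
Because a posted write returns no status, software using a dedicated work queue has to do its own slot accounting. The sketch below shows one way to track submitted and completed descriptors; the class name, method names, and the `post_descriptor` callable (a stand-in for the MOVDIR64B write) are hypothetical.

```python
class DedicatedQueueCredits:
    """Software-side slot accounting for one dedicated work queue (DWQ)."""

    def __init__(self, total_slots):
        self.total_slots = total_slots  # slot count the device exposes for this WQ
        self.in_flight = 0              # submitted but not yet completed

    def try_submit(self, post_descriptor, descriptor):
        """Post the descriptor only when software knows a slot is free."""
        if self.in_flight >= self.total_slots:
            return False                # posting now would risk a dropped descriptor
        post_descriptor(descriptor)     # stands in for the MOVDIR64B write
        self.in_flight += 1
        return True

    def on_completion(self):
        """Call once for every completion record or interrupt observed."""
        if self.in_flight == 0:
            raise RuntimeError("completion without a matching submission")
        self.in_flight -= 1
```

A caller that receives `False` from `try_submit` waits for a completion before trying again, which is the software-provided flow control the text refers to.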

ＭＯＶＤＩＲ６４Ｂ命令は、ＥＮＱＣＭＤ又はＥＮＱＣＭＤＳ命令が行うように、ＰＡＳＩＤを書き込まないので、専用モードにおいて、記述子内のＰＡＳＩＤフィールドを用いることができない。ＤＳＡは、専用のワークキューにサブミットされた記述子内のＰＡＳＩＤフィールドを無視してよく、代わりに、ＷＱ構成レジスタ３５００のＷＱＰＡＳＩＤフィールドを用いてアドレス変換を行う。一実施例において、ＷＱＰＡＳＩＤフィールドは、専用モードでワークキューを構成する場合、ＤＳＡドライバにより設定される。 The MOVDIR64B instruction does not write PASID as the ENQCMD or ENQCMDS instructions do, so the PASID field in the descriptor cannot be used in dedicated mode. The DSA may ignore the PASID field in descriptors submitted to a dedicated work queue, and instead uses the WQ PASID field of the WQ configuration register 3500 for address translation. In one embodiment, the WQ PASID field is set by the DSA driver when configuring a work queue in dedicated mode.

In dedicated mode, a single DWQ is not shared by multiple clients/applications, but a DSA device can be configured with multiple DWQs, each of which can be independently assigned to a client. Furthermore, the DWQs can be configured with the same or different QoS levels to provide different performance levels for different clients/applications.

In one embodiment, the Data Streaming Accelerator (DSA) includes two or more engines 3550 that process the descriptors submitted to the work queues 3511-3512. One embodiment of the DSA architecture includes four engines, numbered 0 through 3. Engines 0 and 1 can each utilize up to the full bandwidth of the device (e.g., 30 GB/s for reads and 30 GB/s for writes). Of course, the combined bandwidth of all engines is also limited to the maximum bandwidth available to the device.

一実施例において、ソフトウェアは、グループ構成レジスタを用いて、ＷＱ３５１１－３５１２及びエンジン３５５０をグループに構成する。それぞれのグループは、１又は複数のＷＱ及び１又は複数のエンジンを含む。ＤＳＡは、グループ内の任意のエンジンを用いて、グループ内の任意のＷＱにポステッドされた記述子を処理してよく、各ＷＱ及び各エンジンは、１つのグループのみにあってよい。グループの数はエンジンの数と同じであってよいので、各エンジンは別々のグループにあり得るが、任意のグループが１より多いエンジンを含む場合に、すべてのグループが用いられる必要があるわけではない。 In one embodiment, software configures WQs 3511-3512 and engines 3550 into groups using group configuration registers. Each group contains one or more WQs and one or more engines. The DSA may use any engine in a group to process a descriptor posted to any WQ in the group, and each WQ and each engine may be in only one group. The number of groups may be the same as the number of engines, so each engine may be in a separate group, but not all groups need to be used if any group contains more than one engine.
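
The grouping rules above (every group has at least one WQ and one engine, and each WQ and each engine belongs to only one group) lend themselves to a simple consistency check. The following sketch assumes a hypothetical in-memory representation of the groups; it does not model the group configuration registers themselves.

```python
def validate_groups(groups):
    """Check a proposed grouping of work queues and engines.

    groups -- list of (wq_ids, engine_ids) pairs, one pair per group in use
    Returns a list of problems; an empty list means the grouping is consistent.
    """
    problems = []
    seen_wqs, seen_engines = set(), set()
    for index, (wq_ids, engine_ids) in enumerate(groups):
        if not wq_ids or not engine_ids:
            problems.append(f"group {index} needs at least one WQ and one engine")
        for wq in wq_ids:
            if wq in seen_wqs:
                problems.append(f"WQ {wq} is in more than one group")
            seen_wqs.add(wq)
        for engine in engine_ids:
            if engine in seen_engines:
                problems.append(f"engine {engine} is in more than one group")
            seen_engines.add(engine)
    return problems
```

For example, `validate_groups([([0, 1], [0, 1]), ([2, 3], [2, 3])])` describes two groups with two WQs and two engines each and reports no problems.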

ＤＳＡアーキテクチャは、ワークキュー、グループ及びエンジンを構成するときに大きな柔軟性を可能にするが、ハードウェアは、特定の構成の使用のために狭く設計されてよい。エンジン０及び１は、ソフトウェアの要件に応じて、２つの異なる方式のうちの１つで構成されてよい。１つの推奨される構成は、同じグループ内にエンジン０及び１の両方を配置することである。ハードウェアは、グループ内の任意のワークキューから記述子を処理するエンジンのいずれか一方を用いる。この構成において、一方のエンジンが高レイテンシメモリアドレス変換又はページフォールトに起因してストールを有する場合、他方のエンジンは、動作を継続して、全体的なデバイスのスループットを最大化することができる。 The DSA architecture allows great flexibility when configuring work queues, groups, and engines, but the hardware may be narrowly designed for use with a particular configuration. Engines 0 and 1 may be configured in one of two different ways, depending on the requirements of the software. One recommended configuration is to place both engines 0 and 1 in the same group. The hardware will use either one of the engines to process descriptors from any work queue in the group. In this configuration, if one engine has a stall due to high latency memory address translation or a page fault, the other engine can continue to operate, maximizing overall device throughput.

図３６は、各グループ３６１１及び３６１２内の２つのワークキュー３６２１－３６２２及び３６２３－３６２４をそれぞれ示すが、サポートされるＷＱの最大数までの任意の数があってよい。グループ内のＷＱは、異なる優先度を有する共有のＷＱ、１つの共有のＷＱ及び他の専用のＷＱ、又は、同じ又は異なる優先度を有する複数の専用のＷＱであってよい。図示された例では、グループ３６１１は、エンジン０及び１３６０１によりサービス提供され、グループ３６１２は、エンジン２及び３３６０２によりサービス提供される。 Figure 36 shows two work queues 3621-3622 and 3623-3624 in each group 3611 and 3612, respectively, but there may be any number up to the maximum number of WQs supported. The WQs in a group may be shared WQs with different priorities, one shared WQ and another dedicated WQ, or multiple dedicated WQs with the same or different priorities. In the illustrated example, group 3611 is served by engines 0 and 1 3601, and group 3612 is served by engines 2 and 3 3602.

As shown in FIG. 37, another configuration places engine 0 3700 and engine 1 3701 in separate groups 3710 and 3711, respectively. Similarly, group 2 3712 is assigned to engine 2 3702 and group 3 3713 is assigned to engine 3 3703. Furthermore, group 0 3710 is composed of two work queues 3721 and 3722, group 1 3711 is composed of work queue 3723, group 2 3712 is composed of work queue 3724, and group 3 3713 is composed of work queue 3725.

Software may select this configuration when it wants to reduce the chance that latency-sensitive operations become blocked behind other operations. In this configuration, software submits latency-sensitive operations to work queue 3723, which is connected to engine 1 3701, and other operations to work queues 3721-3722, which are connected to engine 0 3700.

エンジン２３７０２及びエンジン３３７０３は、例えば、相変化メモリなどの高帯域幅の不揮発性メモリに書き込むために用いられてよい。これらのエンジンの帯域幅の機能は、このタイプのメモリの予期される書き込み帯域幅に一致するサイズであってよい。この利用に関し、エンジン構成レジスタのビット２及び３は、１に設定されるべきであり、仮想チャネル１（ＶＣ１）が、これらのエンジンからのトラフィックに用いられるべきであることを示す。 Engine 2 3702 and Engine 3 3703 may be used to write to high bandwidth non-volatile memory, such as, for example, phase change memory. The bandwidth capabilities of these engines may be sized to match the expected write bandwidth of this type of memory. For this usage, bits 2 and 3 of the Engine Configuration Register should be set to 1, indicating that Virtual Channel 1 (VC1) should be used for traffic from these engines.

高帯域幅の不揮発性メモリ（例えば、相変化メモリ）がないプラットフォームにおいて、又は、ＤＳＡデバイスがこのタイプのメモリに書き込むために用いられない場合、エンジン２及び３は、未使用であってよい。しかしながら、サブミットされたオペレーションが制限された帯域幅に耐えるという条件で、ソフトウェアが、追加の低レイテンシパスとしてこれらの使用を行うことが可能である。 In platforms that do not have high bandwidth non-volatile memory (e.g., phase change memory), or if DSA devices are not used to write to this type of memory, engines 2 and 3 may be unused. However, software can make use of them as additional low latency paths, provided the submitted operations can tolerate the limited bandwidth.

As each descriptor arrives at the head of its work queue, it may be removed by the scheduler/arbiter 3513 and forwarded to one of the engines in the group. For a batch descriptor 3515, which points to an array of work descriptors 3518 in memory, the engine fetches the array of work descriptors from memory (i.e., using the batch processing unit 3516).

一実施例において、各ワーク記述子３５１４について、エンジン３５５０は、完了記録アドレスのための変換をプリフェッチし、ワーク記述子処理ユニット３５３０にオペレーションを渡す。ワーク記述子処理ユニット３５３０はソース及び宛先アドレス変換のために、デバイスＴＬＢ１７２２及びＩＯＭＭＵ１７１０を用いて、ソースデータを読み出し、特定のオペレーションを実行し、宛先データをメモリに書き戻す。オペレーションが完了した場合、エンジンは、ワーク記述子により要求されている場合、予め変換された完了アドレスに完了記録を書き込み、割込みを生成する。 In one embodiment, for each work descriptor 3514, the engine 3550 prefetches the translation for the completion record address and passes the operation to the work descriptor processing unit 3530. The work descriptor processing unit 3530 uses the device TLB 1722 and IOMMU 1710 for source and destination address translation, reads the source data, performs the specific operation, and writes the destination data back to memory. When the operation is completed, the engine writes the completion record to the pre-translated completion address and generates an interrupt, if requested by the work descriptor.

In one embodiment, multiple work queues in the DSA may be used to provide multiple levels of quality of service (QoS). The priority of each WQ may be specified in the WQ configuration registers 3500. The priority of a WQ is relative to the other WQs in the same group (e.g., the priority level of a WQ that is alone in its group has no meaning). Work queues in a group may have the same or different priorities. However, there is no point in configuring multiple shared WQs with the same priority in the same group, because a single SWQ would serve the same purpose. The scheduler/arbiter 3513 dispatches work descriptors from the work queues 3511-3512 to the engines 3550 according to these priorities.

FIG. 38 shows one embodiment of a descriptor 3800 that includes an operation field 3801 specifying the operation to be performed, a number of flags 3802, a process address space identifier (PASID) field 3803, a completion record address field 3804, a source address field 3805, a destination address field 3806, a completion interrupt field 3807, a transfer size field 3808, and (potentially) one or more operation-specific fields 3809. In one embodiment, there are three flags: Completion Record Address Valid, Request Completion Record, and Request Completion Interrupt.

The common fields include both trusted and non-trusted fields. Trusted fields are always trusted by the DSA device because they are populated by the CPU or by privileged (ring 0 or VMM) software on the host. Non-trusted fields are supplied directly by DSA clients.

In one embodiment, the trusted fields include the PASID field 3803, the reserved field 3811, and the U/S (user/supervisor) field 3810 (i.e., the 4 bytes starting at offset 0). When a descriptor is submitted using the ENQCMD instruction, these fields in the source descriptor may be ignored. The value contained in an MSR (e.g., the PASID MSR) may be placed in these fields before the descriptor is sent to the device.

一実施例において、記述子がＥＮＱＣＭＤＳ命令を用いてサブミットされる場合、ソース記述子内のこれらのフィールドは、ソフトウェアにより初期化される。ＰＣＩエクスプレスＰＡＳＩＤ機能がイネーブルにされていない場合、Ｕ／Ｓフィールド３８１０は１に設定され、ＰＡＳＩＤフィールド３８０３は０に設定される。 In one embodiment, these fields in the source descriptor are initialized by software when the descriptor is submitted using the ENQCMDS instruction. If the PCI Express PASID feature is not enabled, the U/S field 3810 is set to 1 and the PASID field 3803 is set to 0.

記述子が、ＭＯＶＤＩＲ６４Ｂ命令を用いてサブミットされる場合、記述子内のこれらのフィールドは無視されてよい。デバイスは、代わりに、ＷＱコンフィグレジスタ３５００のＷＱＵ／Ｓ及びＷＱＰＡＳＩＤフィールドを用いる。 If the descriptor is submitted using the MOVDIR64B instruction, these fields in the descriptor may be ignored. The device uses the WQ U/S and WQ PASID fields of the WQ Config Register 3500 instead.

These fields may be ignored for any descriptor in a batch. The corresponding fields of the batch descriptor 3515 are used for every descriptor 3518 in the batch. Table Q provides a description and bit position for each of these trusted fields.
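
The preceding paragraphs describe four different sources for the PASID a descriptor ends up using. The sketch below collects them in one place as a summary aid; the function name and parameter names are hypothetical, and representing "this descriptor is part of a batch" by a non-`None` `batch_pasid` is an assumption made for the example.

```python
def effective_pasid(instruction, descriptor_pasid, pasid_msr=None,
                    wq_pasid=None, batch_pasid=None):
    """Return the PASID the device would use for address translation."""
    if batch_pasid is not None:
        return batch_pasid       # descriptors in a batch use the batch descriptor's value
    if instruction == "ENQCMD":
        return pasid_msr         # taken from the current thread's PASID MSR
    if instruction == "ENQCMDS":
        return descriptor_pasid  # initialized by supervisor-mode software
    if instruction == "MOVDIR64B":
        return wq_pasid          # taken from the WQ PASID field of the WQ configuration
    raise ValueError(f"unknown submission instruction: {instruction}")
```

In the ENQCMD and MOVDIR64B cases the value written by the client into the descriptor is not used, which is why these fields can be treated as trusted.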

Table R below lists the operations performed in one embodiment according to the operation field 3801 of the descriptor.

Table S below lists the flags used in one embodiment of the descriptor.

In one embodiment, the completion record address 3804 specifies the address of the completion record. The completion record may be 32 bytes, and the completion record address is aligned on a 32-byte boundary. If the Completion Record Address Valid flag is 0, this field is reserved. If the Request Completion Record flag is 1, a completion record is written to this address when the operation completes. If the Request Completion Record flag is 0, a completion record is written to this address only if there is a page fault or error.

比較などの結果をもたらす任意のオペレーションについて、完了記録アドレス有効及び要求完了記録フラグは、両方とも１であるべきであり、完了記録アドレスは有効であるべきである。 For any operation that produces a result such as a comparison, the completion record address valid and requested completion record flags should both be 1 and the completion record address should be valid.

仮想アドレスを用いる任意の処理について、完了記録アドレスは、要求完了記録フラグが設定されているか否かについて有効であるべきであり、その結果、完了記録は、ページフォールト又はエラーがある場合に書き込まれ得る。 For any operation that uses a virtual address, the completion record address should be valid whether or not the requested completion record flag is set, so that the completion record can be written in the event of a page fault or error.

For best results, this field should be valid in all descriptors, because it allows the device to report errors to the software that submitted the descriptor. If the Completion Record Address Valid flag is 0 and an unexpected error occurs, the error is reported in the SWERROR register, and the software that submitted the request may not be notified of the error.
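
The rules for the completion record address and its two flags can be checked before a descriptor is submitted. The sketch below encodes only the rules stated above; the function name, parameter names, and message strings are hypothetical, and whether a given operation "produces a result" or "uses virtual addresses" is left to the caller.

```python
def check_completion_fields(addr_valid, request_record, record_addr,
                            produces_result=False, uses_virtual_addresses=False):
    """Return a list of rule violations for the completion-related descriptor fields."""
    problems = []
    if addr_valid and record_addr % 32 != 0:
        problems.append("completion record address must be 32-byte aligned")
    if produces_result and not (addr_valid and request_record):
        problems.append("result-producing operations need both flags set")
    if uses_virtual_addresses and not addr_valid:
        problems.append("virtual-address operations need a valid completion record address")
    return problems
```

An empty list means the fields satisfy the stated rules. A caller following the "best results" advice would set the address-valid flag in every descriptor.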

If the completion queue enable flag is set in the batch descriptor, the completion record address field 3804 is ignored for the descriptors in the batch, and the completion queue address in the batch descriptor is used instead.

一実施例において、メモリからデータを読み出すオペレーションについて、ソースアドレスフィールド３８０５は、ソースデータのアドレスを規定する。ソースアドレスに対するアライメント要求はない。メモリにデータを書き込むオペレーションについて、宛先アドレスフィールド３８０６は、宛先バッファのアドレスを規定する。宛先アドレスに対するアライメント要求はない。いくつかのオペレーションタイプについて、このフィールドは、第２のソースバッファのアドレスとして用いられる。 In one embodiment, for operations that read data from memory, the source address field 3805 specifies the address of the source data. There are no alignment requirements for the source address. For operations that write data to memory, the destination address field 3806 specifies the address of the destination buffer. There are no alignment requirements for the destination address. For some operation types, this field is used as the address of a second source buffer.

In one embodiment, the transfer size field 3808 indicates the number of bytes to be read from the source address to perform the operation. The maximum value of this field may be 2^32 - 1, but the maximum allowed transfer size may be smaller and must be determined from the maximum transfer size field of the general capability register. The transfer size should not be 0. For most operation types, there is no alignment requirement for the transfer size. Exceptions are noted in the operation descriptions.

In one embodiment, if the Use Interrupt Message Storage flag is 1, the completion interrupt handle field 3807 specifies the interrupt message storage entry used to generate the completion interrupt. The value of this field should be less than the value of the interrupt message storage size field in GENCAP. In one embodiment, the completion interrupt handle field 3807 is reserved under any of the following conditions: the Use Interrupt Message Storage flag is 0, the Request Completion Interrupt flag is 0, the U/S bit is 0, the interrupt message storage support field of the general capability register is 0, or the descriptor was submitted through a guest portal.
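
The five "reserved" conditions and the range limit against GENCAP can be combined into a small check. This is an illustrative sketch only; the function names and parameter names are hypothetical, and returning `None` for a reserved field is a choice made for the example.

```python
def completion_interrupt_field_in_use(use_ims, request_interrupt, user_supervisor,
                                      ims_supported, via_guest_portal):
    """Return True when none of the listed 'reserved' conditions applies."""
    return bool(use_ims and request_interrupt and user_supervisor
                and ims_supported and not via_guest_portal)


def check_interrupt_entry(entry, ims_size, **conditions):
    """Return None if the field is reserved, else whether `entry` is in range."""
    if not completion_interrupt_field_in_use(**conditions):
        return None
    return 0 <= entry < ims_size  # must be less than the storage size reported in GENCAP
```

Here `ims_size` stands for the interrupt message storage size field read from GENCAP.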

図３９に示されるように、完了記録３９００の一実施例は、オペレーションが完了又はエラーに遭遇した場合にＤＳＡが書き込むメモリ内の３２バイト構造である。完了記録アドレスは、３２バイトアラインであるべきである。 As shown in FIG. 39, one embodiment of a completion record 3900 is a 32-byte structure in memory that the DSA writes to when an operation completes or encounters an error. The completion record address should be 32-byte aligned.

This section describes the fields of the completion record that are common to most operation types. The description of each operation type includes a completion record diagram if its format differs from this one. Additional operation-specific fields are described further below. The completion record 3900 may always be 32 bytes, even if not all of the fields are needed. The completion record 3900 contains enough information to continue the operation if it was only partially completed due to a page fault.

完了記録は、（記述子３８００の完了記録アドレス３８０４により識別される）メモリ内の３２バイトアライン構造として実装され得る。完了記録３９００は、オペレーションが完了したか否かを示す完了ステータスフィールド３９０４を含む。オペレーションの完了に成功した場合、完了記録は、もしあればオペレーションのタイプに応じたオペレーションの結果を含んでよい。オペレーションが完了に成功しなかった場合、完了記録は、障害又はエラー情報を含む。 The completion record may be implemented as a 32-byte aligned structure in memory (identified by completion record address 3804 of descriptor 3800). The completion record 3900 includes a completion status field 3904 that indicates whether the operation completed. If the operation completed successfully, the completion record may include the results of the operation, if any, depending on the type of operation. If the operation did not complete successfully, the completion record includes failure or error information.

In one embodiment, the status field 3904 reports the completion status of the descriptor. Software should initialize this field to 0 so that it can detect when the completion record has been written.

上記のテーブルＴは、様々なステータスコードを提供し、一実施例に関する説明に関連する。 Table T above provides various status codes and related explanations for one embodiment.

以下のテーブルＵは、障害アドレスが読み出されたか、書き込まれたかを示す第１のビット、及び、フォールトアクセスがユーザモードであったか、スーパバイザモードアクセスであったかを示す第２のビットを含む一実施例において利用可能な障害コード３９０３を示す。
Table U below shows the fault codes 3903 available in one embodiment, which include a first bit indicating whether the faulting address was read or written, and a second bit indicating whether the faulting access was a user mode or supervisor mode access.

一実施例において、この完了記録３９００がバッチの一部としてサブミットされた記述子のためのものであった場合、インデックスフィールド３９０２は、この完了記録を生成した記述子のバッチ内のインデックスを含む。バッチ記述子について、このフィールドは、０ｘｆｆであってよい。バッチの一部ではないその他の記述子について、このフィールドは、予約済であってよい。 In one embodiment, if this completion record 3900 was for a descriptor submitted as part of a batch, index field 3902 contains the index within the batch of the descriptor that generated this completion record. For batch descriptors, this field may be 0xff. For other descriptors that are not part of a batch, this field may be reserved.

In one embodiment, if an operation is partially completed due to a page fault, the bytes completed field 3901 contains the number of source bytes processed before the fault occurred. All of the source bytes represented by this count have been fully processed, and the results have been written to the destination address as appropriate for the operation type. For some operation types, this field may also be used when the operation stopped before completion for some reason other than a fault. If the operation completed fully, this field may be set to 0.

この値から出力サイズが容易に判断可能でないオペレーションタイプについて、完了記録は、宛先アドレスに書き込まれるバイト数も含む。 For operation types where the output size is not readily determinable from this value, the completion record also includes the number of bytes written to the destination address.

If the operation was partially completed due to a page fault, the fault address field of the completion record contains the address that caused the fault. As a general rule, all descriptors should have a valid completion record address 3804, and the completion record address valid flag should be 1. Some exceptions to this rule are described below.

In one embodiment, the first byte of the completion record is the status byte. All status values written by the device are non-zero. Software should initialize the status field of the completion record to 0 before submitting the descriptor so that it can tell when the device has written to the completion record. Initializing the completion record also ensures that it is mapped, so the device does not encounter a page fault when accessing it.

The request completion record flag tells the device to write the completion record even if the operation completes successfully. If this flag is not set, the device writes the completion record only if there is an error.

Descriptor completion can be detected by software using any of the following methods:

1. Poll the completion record and wait for the status field to become non-zero.

2. Use the UMONITOR/UMWAIT instructions (as described herein) on the completion record address to block until it is written or until a timeout occurs. Software should then check whether the status field is non-zero to determine whether the operation has completed.

3. For kernel mode descriptors, request an interrupt when the operation completes.

4. If the descriptor is in a batch, set the fence flag in a subsequent descriptor in the same batch. Completion of the descriptor with the fence, or of any subsequent descriptor in the same batch, indicates completion of all descriptors preceding the fence.

5. If the descriptor is in a batch, completion of the batch descriptor that initiated the batch indicates completion of all the descriptors in the batch.

6. Issue a drain descriptor or a drain command and wait for it to complete.
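Method 1 above can be sketched in a few lines. This is a minimal software-only illustration, not device code: the completion record is modeled as a 32-byte `bytearray`, a helper thread stands in for the device, and the names `wait_for_completion`, `fake_device`, and the status value `HYPOTHETICAL_STATUS_DONE` are assumptions made for the example.

```python
import threading
import time

COMPLETION_RECORD_SIZE = 32  # bytes, as stated in the text
STATUS_OFFSET = 0            # the status byte is the first byte of the record
HYPOTHETICAL_STATUS_DONE = 0x01  # illustrative value, not taken from the status table


def wait_for_completion(record, timeout_s=1.0, poll_interval_s=0.001):
    """Poll the status byte until it is non-zero; return it, or None on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        status = record[STATUS_OFFSET]
        if status != 0:
            return status
        time.sleep(poll_interval_s)
    return None


# Software zero-initializes the record before submitting the descriptor.
record = bytearray(COMPLETION_RECORD_SIZE)


def fake_device():
    """Stand-in for the device: writes a non-zero status a little later."""
    time.sleep(0.01)
    record[STATUS_OFFSET] = HYPOTHETICAL_STATUS_DONE


worker = threading.Thread(target=fake_device)
worker.start()
status = wait_for_completion(record)
worker.join()
```

Because every status value written by the device is non-zero, a zero status byte can only mean that the record has not been written yet, which is why the record must be cleared before submission.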

If the completion status indicates partial completion due to a page fault, the completion record indicates how much processing (if any) was completed before the fault was encountered, and the virtual address at which the fault was encountered. Software may choose to fix the fault (by touching the faulting address from the processor) and resubmit the rest of the work in a new descriptor, or complete the rest of the work in software. Faults on descriptor list and completion record addresses are handled differently and are described in more detail below.
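The resubmission option can be illustrated for a simple memory copy. The sketch below assumes an operation in which each source byte maps to one destination byte, so the bytes completed count advances both addresses equally; `CopyDescriptor` and `remainder_after_fault` are hypothetical names, not part of the described device interface.

```python
from dataclasses import dataclass


@dataclass
class CopyDescriptor:
    """Hypothetical, simplified view of a memory copy descriptor."""
    src: int
    dst: int
    size: int


def remainder_after_fault(desc, bytes_completed):
    """Build a descriptor covering the work left after a partial completion."""
    if not 0 <= bytes_completed <= desc.size:
        raise ValueError("bytes_completed is outside the original transfer")
    return CopyDescriptor(
        src=desc.src + bytes_completed,
        dst=desc.dst + bytes_completed,
        size=desc.size - bytes_completed,
    )


original = CopyDescriptor(src=0x1000, dst=0x8000, size=4096)
# Suppose the completion record reported 1024 bytes completed before the fault.
rest = remainder_after_fault(original, 1024)
```

Software would touch the faulting address before resubmitting `rest`, so that the page is present when the device retries.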

One embodiment of the DSA supports only message-signaled interrupts. The DSA provides two types of interrupt message storage: (a) an MSI-X table, enumerated through the MSI-X capability, which stores interrupt messages used by the host driver; and (b) a device-specific interrupt message storage (IMS) table, which stores interrupt messages used by guest drivers.

一実施例において、割込みは、３つのタイプのイベント、すなわち、（１）カーネルモード記述子の完了、（２）ドレイン又はアボートコマンドの完了、及び、（３）ソフトウェア又はハードウェアエラーレジスタにおいてポストされたエラーに対して生成され得る。イベントのタイプごとに、別々の割込みイネーブルがある。エラー及びアボート／ドレインコマンドの完了に起因する割込みは、ＭＳＩ－Ｘテーブル内のエントリ０を用いて生成される。割込み理由レジスタは、割込みの理由を判断するために、ソフトウェアにより読み出されてよい。 In one embodiment, interrupts may be generated for three types of events: (1) completion of a kernel mode descriptor, (2) completion of a drain or abort command, and (3) an error posted in the software or hardware error register. There is a separate interrupt enable for each type of event. Interrupts due to errors and completion of abort/drain commands are generated using entry 0 in the MSI-X table. The interrupt reason register may be read by software to determine the reason for the interrupt.

カーネルモード記述子の完了（例えば、Ｕ／Ｓフィールドが１である記述子）について、用いられる割込みメッセージは、どのように記述子がサブミットされたか、及び、記述子内の使用割込みメッセージストレージフラグに依存する。 For kernel mode descriptor completions (e.g., descriptors with a U/S field of 1), the interrupt message used depends on how the descriptor was submitted and the use interrupt message storage flag in the descriptor.

The completion interrupt message for a kernel mode descriptor submitted through a privileged portal is generally an entry in the MSI-X table, determined by the portal address. However, if the interrupt message storage support field in GENCAP is 1, a descriptor submitted through a privileged portal may override this behavior by setting the use interrupt message storage flag in the descriptor. In this case, the completion interrupt handling field in the descriptor is used as an index into the interrupt message storage.

ゲストポータルを介してサブミットされたカーネルモード記述子に対する完了割込みメッセージは、割込みメッセージストレージ内のエントリであり、ポータルアドレスにより判断される。 The completion interrupt message for a kernel mode descriptor submitted through a guest portal is an entry in the interrupt message storage, determined by the portal address.
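The selection rules for kernel mode descriptors can be summarized as a small decision function. This is a sketch under stated assumptions: the portal kinds are modeled as strings, `portal_entry` stands for whatever entry the portal address maps to, and the function name and return shape are invented for the example.

```python
def select_completion_interrupt(portal, use_ims_flag, ims_supported,
                                handle, portal_entry):
    """Return (table, index) for a kernel mode descriptor's completion interrupt.

    portal        -- "privileged" or "guest" (hypothetical encoding)
    use_ims_flag  -- use interrupt message storage flag in the descriptor
    ims_supported -- interrupt message storage support field in GENCAP
    handle        -- completion interrupt handling field of the descriptor
    portal_entry  -- entry implied by the portal address
    """
    if portal == "guest":
        # Guest portals always use the interrupt message storage.
        return ("IMS", portal_entry)
    if portal == "privileged":
        if use_ims_flag:
            if not ims_supported:
                # The flag is reserved when GENCAP reports no IMS support.
                raise ValueError("use-IMS flag set but IMS is not supported")
            return ("IMS", handle)
        return ("MSI-X", portal_entry)
    raise ValueError("unknown portal kind: %r" % (portal,))
```

For example, a privileged submission with the flag set and `handle=5` resolves to IMS entry 5, while the same submission without the flag falls back to the MSI-X entry implied by the portal.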

ＤＳＡにより生成された割込みは、カーネル又はＶＭＭソフトウェアにより構成されるように、割込み再マッピング及びポスティングハードウェアを通じて処理される。
Interrupts generated by the DSA are handled through interrupt remapping and posting hardware, as configured by the kernel or VMM software.

すでに述べたように、ＤＳＡは、一度に複数の記述子をサブミットすることをサポートする。バッチ記述子は、ホストメモリ内のワーク記述子のアレイのアドレス及び当該アレイの要素の数を含む。ワーク記述子のアレイは、「バッチ」と呼ばれる。バッチ記述子の使用は、ＤＳＡクライアントが、単一のＥＮＱＣＭＤ、ＥＮＱＣＭＤＳ又はＭＯＶＤＩＲ６４Ｂ命令を用いて複数のワーク記述子をサブミットすることを可能にし、全体的なスループットを潜在的に向上させることができる。ＤＳＡは、バッチ内のワーク記述子の数に対する制限を実行する。一般的な機能レジスタにおける最大バッチサイズフィールドにおいて制限が示される。 As mentioned above, the DSA supports submitting multiple descriptors at once. A batch descriptor contains the address of an array of work descriptors in host memory and the number of elements in that array. The array of work descriptors is called a "batch." The use of batch descriptors allows a DSA client to submit multiple work descriptors with a single ENQCMD, ENQCMDS, or MOVDIR64B instruction, potentially improving overall throughput. The DSA enforces a limit on the number of work descriptors in a batch. The limit is indicated in the maximum batch size field in the general capabilities register.

バッチ記述子は、他のワーク記述子と同じ方法で、ワークキューにサブミットされる。バッチ記述子がデバイスにより処理される場合、デバイスは、メモリからワーク記述子のアレイを読み出して、次に、ワーク記述子のそれぞれを処理する。ワーク記述子は、必ずしも順番通りに処理されるわけではない。 Batch descriptors are submitted to a work queue in the same way as other work descriptors. When a batch descriptor is processed by a device, the device reads the array of work descriptors from memory and then processes each of the work descriptors. The work descriptors are not necessarily processed in order.

バッチ記述子のＰＡＳＩＤ３８０３及びＵ／Ｓフラグは、バッチ内のすべての記述子に用いられる。バッチ内の記述子におけるＰＡＳＩＤ及びＵ／Ｓフィールド３８１０は無視される。バッチ内の各ワーク記述子は、ちょうど、直接サブミットされたワーク記述子と同様に、完了記録アドレス３８０４を特定できる。代替的に、バッチ記述子は、バッチからのすべてのワーク記述子の完了記録がデバイスにより書き込まれる「完了キュー」アドレスを特定できる。この場合、バッチ内の記述子における完了記録アドレスフィールド３８０４は無視される。完了キューは、記述子総数よりも１エントリ分大きくすべきであり、そのため、バッチ内のすべての記述子に対する完了記録とバッチ記述子とのための空間がある。完了記録は、記述子が完了した順序で生成され、それらが記述子アレイに現れる順序と同じでなくてよい。各完了記録は、その完了記録を生成したバッチ内の記述子のインデックスを含む。０ｘｆｆのインデックスは、バッチ記述子自体に用いられる。０のインデックスは、バッチ記述子以外の直接サブミットされた記述子に用いられる。バッチ内のいくつかの記述子は、それらが完了記録を要求せず、それらが完了に成功した場合、完了記録を生成しなくてよい。この場合完了キューに書き込まれる完了記録の数は、バッチ内の記述子の数より少ない可能性がある。バッチ記述子の完了記録（要求された場合）は、バッチ内のすべての記述子に対する完了記録の後に完了キューに書き込まれる。 The PASID 3803 and U/S flag in the batch descriptor are used for all the descriptors in the batch. The PASID and U/S fields 3810 in the descriptors in the batch are ignored. Each work descriptor in the batch can specify a completion record address 3804, just like a directly submitted work descriptor. Alternatively, the batch descriptor can specify a "completion queue" address where the completion records for all the work descriptors from the batch are written by the device. In this case, the completion record address field 3804 in the descriptors in the batch is ignored. The completion queue should be one entry larger than the total number of descriptors, so there is room for the batch descriptor and completion records for all the descriptors in the batch. Completion records are generated in the order that the descriptors are completed, which may not be the same order that they appear in the descriptor array. Each completion record contains the index of the descriptor in the batch that generated it. An index of 0xff is used for the batch descriptor itself. An index of 0 is used for directly submitted descriptors other than the batch descriptor. Some descriptors in a batch may not generate completion records if they did not request them and they complete successfully. 
In this case the number of completion records written to the completion queue may be less than the number of descriptors in the batch. The completion record for the batch descriptor (if requested) is written to the completion queue after the completion records for all descriptors in the batch.
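A short sketch of how software might size and read a completion queue follows. Completion records are reduced to `(index, status)` pairs, and the helper names are assumptions; only the 0xff index for the batch descriptor and the "one entry larger" sizing rule come from the text.

```python
BATCH_DESCRIPTOR_INDEX = 0xFF  # index reported for the batch descriptor itself


def completion_queue_entries(descriptor_count):
    """One slot per work descriptor plus one for the batch descriptor."""
    return descriptor_count + 1


def sort_completion_records(records):
    """Split (index, status) pairs, given in completion order, by origin.

    Returns (statuses keyed by batch index, batch descriptor status or None).
    Descriptors that wrote no record are simply absent from the mapping.
    """
    by_index = {}
    batch_status = None
    for index, status in records:
        if index == BATCH_DESCRIPTOR_INDEX:
            batch_status = status
        else:
            by_index[index] = status
    return by_index, batch_status


# Records arrive in completion order, not array order; descriptor 1 wrote none.
queue = [(2, 0x01), (0, 0x01), (BATCH_DESCRIPTOR_INDEX, 0x01)]
per_descriptor, batch_status = sort_completion_records(queue)
```

The status values here are placeholders; real values would come from the status code table.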

If the batch descriptor does not specify a completion queue, the completion record for the batch descriptor (if requested) is written to its own completion record address after all descriptors in the batch have completed. The completion record for the batch descriptor includes an indication of whether any of the descriptors in the batch completed with a status other than success. This allows software to examine only the completion record for the batch descriptor in the usual case where all descriptors in the batch completed successfully.

完了割込みは、必要に応じて、バッチ内の１又は複数のワーク記述子により要求されてもよい。バッチ記述子に対する完了記録（要求された場合）は、バッチ内のすべての記述子に対する完了記録及び完了割込みの後に書き込まれる。バッチ記述子に対する完了割込み（要求された場合）は、ちょうどその他の記述子と同様に、バッチ記述子に対する完了記録の後に生成される。 Completion interrupts may be requested by one or more work descriptors in a batch, as required. The completion record for the batch descriptor (if requested) is written after the completion records and completion interrupts for all descriptors in the batch. The completion interrupt for the batch descriptor (if requested) is generated after the completion record for the batch descriptor, just like any other descriptor.

A batch descriptor may not be included in a batch. Nested or chained descriptor arrays are not supported.

By default, the DSA does not guarantee any ordering while executing work descriptors. Descriptors can be dispatched and completed in any order the device sees fit to maximize throughput. Thus, if ordering is required, software must enforce it explicitly. For example, software can submit a descriptor, wait for the completion record or interrupt from that descriptor to ensure completion, and then submit the next descriptor.

Software can also specify ordering for descriptors within a batch specified by a batch descriptor. Each work descriptor has a fence flag. When set, the fence guarantees that processing of that descriptor does not begin until the previous descriptors in the same batch have completed. This allows a descriptor with a fence to consume data produced by a previous descriptor in the same batch.

A descriptor is complete after all writes generated by the operation are globally observable; after destination readback, if requested; after the write to the completion record is globally observable, if needed; and after the completion interrupt is generated, if requested.

If any descriptor in a batch completes with a status other than success, for example if it is partially completed due to a page fault, a subsequent descriptor with the fence flag equal to 1 and any descriptors following it in the batch are abandoned. The completion record for the batch descriptor that was used to submit the batch indicates how many descriptors were completed. Any descriptor that was partially completed and generated a completion record is counted as completed. Only the abandoned descriptors are considered not completed.
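The abandonment rule can be modeled with a small simulation. The sketch below makes a simplifying assumption that is not in the text: descriptors are walked strictly in array order, whereas the device may reorder descriptors between fences. Each descriptor is reduced to a `(succeeds, fence)` pair, and `run_batch` is a hypothetical name.

```python
def run_batch(descriptors):
    """Simulate fence-based abandonment for a batch.

    descriptors -- list of (succeeds, fence) booleans in array order.
    Returns (completed_count, abandoned_indices). A descriptor that fails or
    partially completes still counts as completed; only abandoned ones do not.
    """
    failure_seen = False
    for i, (succeeds, fence) in enumerate(descriptors):
        if fence and failure_seen:
            # This fenced descriptor and everything after it are abandoned.
            return i, list(range(i, len(descriptors)))
        if not succeeds:
            failure_seen = True
    return len(descriptors), []


batch = [
    (True, False),   # 0: succeeds
    (False, False),  # 1: partial completion, e.g. a page fault
    (True, False),   # 2: no fence, so it still runs
    (True, True),    # 3: fence after a failure -> abandoned
    (True, False),   # 4: follows the abandoned fence -> abandoned
]
completed, abandoned = run_batch(batch)
```

Here `completed` is the count the batch descriptor's completion record would report under this simplified model.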

Fences also ensure ordering for completion records and interrupts. For example, a No-op descriptor with the fence flag set and a completion interrupt requested causes the interrupt to be generated after all preceding descriptors in the batch have completed (and their completion records have been written, if needed). A completion record write is always ordered behind the data writes generated by the same work descriptor, and the completion interrupt (if requested) is always ordered behind the completion record write for the same work descriptor.

ドレインは、クライアントがそれ自体のＰＡＳＩＤに属するすべての記述子が完了するのを待つことを可能にする記述子である。それは、ＰＡＳＩＤ全体に対するフェンスオペレーションとして用いられることができる。ドレインオペレーションは、そのＰＡＳＩＤを有するすべての前の記述子が完了した場合に完了する。ドレイン記述子は、ソフトウェアにより用いられ、そのすべての記述子の完了に対する単一の完了記録又は割込みを要求できる。ドレインは、通常のワークキューにサブミットされる通常の記述子である。ドレイン記述子は、バッチに含まれてよい。（フェンスフラグは、バッチ内の前の記述子が完了するのを待つために、バッチ内で用いられてよい。） A drain is a descriptor that allows a client to wait for all descriptors belonging to its own PASID to complete. It can be used as a fence operation for an entire PASID. A drain operation completes when all previous descriptors with that PASID have completed. A drain descriptor can be used by software to request a single completion record or interrupt for the completion of all its descriptors. A drain is a normal descriptor submitted to the normal work queue. A drain descriptor may be included in a batch. (The fence flag may be used in a batch to wait for the previous descriptor in the batch to complete.)

ソフトウェアは、ドレイン記述子がサブミットされた後、かつ、それが完了する前に、デバイスにサブミットされる特定のＰＡＳＩＤを有する記述子がないことを確保しなければならい。追加の記述子がサブミットされた場合、ドレインオペレーションが追加の記述子が完了するのも待つか否かが指定されていない。これは、ドレインオペレーションに長い時間をかけることになり得る。たとえデバイスが、追加の記述子が完了するのを待たなかったとしても、追加の記述子のいくつかは、ドレインオペレーションが完了する前に完了し得る。このように、すべての前のオペレーションが完了するまで、開始する後続のオペレーションがないことをフェンスが確保するので、ドレインは、フェンスとは異なる。 The software must ensure that after a drain descriptor is submitted and before it is completed, no other descriptors with the particular PASID are submitted to the device. If additional descriptors are submitted, it is unspecified whether the drain operation also waits for the additional descriptors to complete. This could cause the drain operation to take a long time. Even if the device does not wait for the additional descriptors to complete, some of the additional descriptors may complete before the drain operation completes. Thus, draining differs from fencing because fencing ensures that no subsequent operations start until all previous operations have completed.

In one embodiment, abort/drain commands are submitted by privileged software (the OS kernel or VMM) by writing to the abort/drain register. On receiving one of these commands, the DSA waits for completion of certain descriptors (described below). When the command completes, software can be sure that there are no more descriptors in the specified category outstanding in the device.

In one embodiment, there are three types of drain commands: drain all, drain PASID, and drain WQ. Each command has an abort flag that tells the device that it may discard any outstanding descriptors rather than processing them to completion.

ドレインオールコマンドは、ドレインオールコマンドの前にサブミットされていたすべての記述子の完了を待機する。ドレインオールコマンドの後にサブミットされた記述子は、ドレインオールが完了したときに進行中であり得る。前の記述子が完了するのをドレインオールコマンドが待っている間に、デバイスは新たな記述子に対するワークを開始してよい。 The drain-all command waits for all descriptors submitted before the drain-all command to complete. Descriptors submitted after the drain-all command may be in progress when the drain-all command completes. The device may start work on new descriptors while the drain-all command is waiting for previous descriptors to complete.

The drain PASID command waits for all descriptors associated with the specified PASID. When the drain PASID command completes, there are no more descriptors for that PASID in the device. Software should ensure that no descriptors with the specified PASID are submitted to the device after the drain PASID command is submitted and before it completes; otherwise, the behavior is undefined.

The drain WQ command waits for all descriptors submitted to the specified work queue. Software should ensure that no descriptors are submitted to the WQ after the drain WQ command is submitted and before it completes.

ＤＳＡを使用しているアプリケーション又はＶＭが一時停止された場合、それは、ＤＳＡにサブミットされた未処理の記述子を有している可能性がある。このワークは、完了されなければならず、そのため、クライアントは、後で再開され得るコヒーレント状態にある。ドレインＰＡＳＩＤ及びドレインオールコマンドは、任意の未処理の記述子を待機するために、ＯＳ又はＶＭＭにより用いられる。ドレインＰＡＳＩＤコマンドは、単一のＰＡＳＩＤを使用していたアプリケーション又はＶＭに用いられる。ドレインオールコマンドは、複数のＰＡＳＩＤを使用するＶＭに用いられる。 When an application or VM using the DSA is suspended, it may have outstanding descriptors submitted to the DSA. This work must be completed so the clients are in a coherent state that can be resumed later. The drain PASID and drain all commands are used by the OS or VMM to wait for any outstanding descriptors. The drain PASID command is used for applications or VMs that were using a single PASID. The drain all command is used for VMs that use multiple PASIDs.

ＤＳＡを使用しているアプリケーションが抜け出た、又は、オペレーティングシステム（ＯＳ）により終了された場合、ＯＳは、アドレス空間、割り当てられたメモリ及びＰＡＳＩＤを解放又は再利用し得る前に、未処理の記述子がないことを確保する必要がある。任意の未処理の記述子を処分するために、ＯＳは、終了されるクライアントのＰＡＳＩＤを有するドレインＰＡＳＩＤコマンドを使用し、アボートフラグは１に設定される。このコマンドを受信したときに、ＤＳＡは、さらなる処理を行うことなく特定のＰＡＳＩＤに属するすべての記述子を破棄する。 When an application using the DSA exits or is terminated by the Operating System (OS), the OS must ensure that there are no outstanding descriptors before it can free or reclaim the address space, allocated memory, and PASID. To dispose of any outstanding descriptors, the OS uses the DRAIN PASID command with the PASID of the client being terminated, and the abort flag is set to 1. Upon receiving this command, the DSA discards all descriptors belonging to the particular PASID without further processing.

ＤＳＡの一実施例では、複数のＷＱからワークをディスパッチするためにサービス品質を特定するメカニズムを提供する。ＤＳＡは、ソフトウェアがＷＱ空間の合計を複数のＷＱに分割することを可能にする。各ＷＱは、ワークをディスパッチするために、異なる優先度が割り当てられ得る。一実施例において、ＤＳＡスケジューラ／アービタ３５１３は、より高い優先度のＷＱが、より低い優先度のＷＱより多くサービス提供されるように、ＷＱからワークをディスパッチする。しかしながら、ＤＳＡは、より高い優先度のＷＱが、より低い優先度のＷＱを枯渇させないことを確保する。すでに述べたように、様々な優先順位付けのスキームは、実装要件に基づいて採用されてよい。 One embodiment of the DSA provides a mechanism to specify quality of service for dispatching work from multiple WQs. DSA allows software to divide the total WQ space into multiple WQs. Each WQ can be assigned a different priority for dispatching work. In one embodiment, the DSA scheduler/arbiter 3513 dispatches work from WQs such that higher priority WQs are serviced more than lower priority WQs. However, the DSA ensures that higher priority WQs do not starve lower priority WQs. As already mentioned, various prioritization schemes may be adopted based on implementation requirements.

一実施例において、ＷＱ構成レジスタテーブルは、ＷＱを構成するために用いられる。ソフトウェアは、所望のＱｏＳレベルの数に一致するように、アクティブなＷＱの数を構成できる。ソフトウェアは、ＷＱサイズ及びいくつかの追加のパラメータをＷＱ構成レジスタテーブルにプログラミングすることにより各ＷＱを構成する。これは、ＷＱ空間全体をＷＱの所望の数に効果的に分割する。未使用のＷＱは、０のサイズを有する。 In one embodiment, a WQ configuration register table is used to configure the WQs. Software can configure the number of active WQs to match the number of desired QoS levels. Software configures each WQ by programming the WQ size and some additional parameters into the WQ configuration register table. This effectively divides the entire WQ space into the desired number of WQs. Unused WQs have a size of 0.

Errors fall broadly into two categories: (1) affiliated errors, which occur while processing the descriptors of a particular PASID, and (2) unaffiliated errors, which are global in nature and not specific to any PASID. The DSA attempts, as far as possible, to prevent errors from one PASID from stopping or affecting other PASIDs. PASID-specific errors are reported in the completion record of the respective descriptor, except when the error is on the completion record itself (for example, a page fault on the completion record address).

Errors in descriptor submission or on the completion record of a descriptor may be reported to the host driver through the software error register (SWERROR). Hardware errors may be reported through the hardware error register (HWERROR).

In one embodiment, the DSA performs the following checks when the enable bit in the device enable register is set to 1:
- Bus master enable is 1.
- The combination of PASID, ATS, and PRS capabilities is valid (see Table 6-3, Section 6.1.3).
- The sum of the WQ size fields of all WQCFG registers is not greater than the total WQ size.
- For each GRPCFG register, the WQ and engine fields are either both 0 or both non-zero.
- Each WQ whose size field in the WQCFG register is non-zero is in one group.
- Each WQ whose size field in the WQCFG register is 0 is not in any group.
- Each engine is in no more than one group.

If any of these checks fails, the device is not enabled and an error code is recorded in the error code field of the device enable register. These checks may be performed in any order, so an indication of one type of error does not mean that there are no other errors. The same configuration error may result in different error codes at different times or on different versions of the device. If none of the checks fails, the device is enabled and the enable field is set to 1.
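The device enable checks lend themselves to a software model, which a driver could also run before setting the enable bit. The sketch below is an assumption-laden illustration: groups are modeled as pairs of WQ and engine index sets, the check names are invented, "in one group" is interpreted as exactly one, and the PASID/ATS/PRS combination check is omitted because the referenced table is not reproduced in this section. Unlike the device, which records a single error code, the function returns every failed check.

```python
def check_device_config(bus_master_enable, wq_sizes, total_wq_size,
                        groups, engine_count):
    """Return the names of failed checks; an empty list means "enable".

    wq_sizes -- WQ size field per WQ index (0 means unused)
    groups   -- list of (wq_index_set, engine_index_set), one per GRPCFG
    """
    failures = []
    if not bus_master_enable:
        failures.append("bus_master_disabled")
    if sum(wq_sizes) > total_wq_size:
        failures.append("wq_sizes_exceed_total")
    # WQ and engine fields must be both empty or both non-empty.
    if any(bool(wqs) != bool(engines) for wqs, engines in groups):
        failures.append("group_wq_engine_mismatch")

    def wq_memberships(wq):
        return sum(1 for wqs, _ in groups if wq in wqs)

    if any(size != 0 and wq_memberships(wq) != 1
           for wq, size in enumerate(wq_sizes)):
        failures.append("active_wq_not_in_one_group")
    if any(size == 0 and wq_memberships(wq) != 0
           for wq, size in enumerate(wq_sizes)):
        failures.append("unused_wq_in_group")
    if any(sum(1 for _, engines in groups if e in engines) > 1
           for e in range(engine_count)):
        failures.append("engine_in_multiple_groups")
    return failures


good = check_device_config(True, [16, 16, 0], 64,
                           [({0}, {0}), ({1}, {1, 2})], 3)
bad = check_device_config(True, [48, 32], 64, [({0, 1}, set())], 1)
```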

The device performs the following checks when the WQ enable bit in a WQCFG register is set to 1:
- The device is enabled (that is, the enable field in the device enable register is 1).
- The WQ size field is non-zero.
- The WQ threshold is not greater than the WQ size field.
- The WQ mode field selects a supported mode. That is, if the shared mode support field in WQCAP is 0, WQ mode is 1; if the dedicated mode support field in WQCAP is 0, WQ mode is 0. If both the shared mode support and dedicated mode support fields are 1, either value of WQ mode is allowed.
- If the block-on-fault support bit in GENCAP is 0, the WQ block-on-fault enable field is 0.

If any of these checks fails, the WQ is not enabled and an error code is recorded in the WQ error code field of the WQ config register 3500. These checks may be performed in any order, so an indication of one type of error does not mean that there are no other errors. The same configuration error may result in different error codes at different times or on different versions of the device. If none of the checks fails, the WQ is enabled and the WQ enable field is set to 1.
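The WQ enable checks can be modeled the same way. In this sketch, WQ mode 0 is taken to mean shared and 1 to mean dedicated, which follows from the rule that mode must be 1 when shared mode is unsupported; the function and check names are hypothetical, and all failures are returned rather than a single error code.

```python
WQ_MODE_SHARED = 0     # inferred from the text: mode 0 requires shared support
WQ_MODE_DEDICATED = 1  # inferred from the text: mode 1 requires dedicated support


def check_wq_enable(device_enabled, wq_size, wq_threshold, wq_mode,
                    shared_supported, dedicated_supported,
                    block_on_fault_supported, block_on_fault_enable):
    """Return the names of failed WQ enable checks (empty list means OK)."""
    failures = []
    if not device_enabled:
        failures.append("device_not_enabled")
    if wq_size == 0:
        failures.append("zero_wq_size")
    if wq_threshold > wq_size:
        failures.append("threshold_exceeds_size")
    mode_ok = ((wq_mode == WQ_MODE_SHARED and shared_supported) or
               (wq_mode == WQ_MODE_DEDICATED and dedicated_supported))
    if not mode_ok:
        failures.append("unsupported_mode")
    if block_on_fault_enable and not block_on_fault_supported:
        failures.append("block_on_fault_unsupported")
    return failures


ok = check_wq_enable(True, 16, 8, WQ_MODE_SHARED, True, True, False, False)
broken = check_wq_enable(True, 16, 32, WQ_MODE_DEDICATED, True, False,
                         False, True)
```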

In one embodiment, the DSA performs the following checks when a descriptor is received:
- The WQ identified by the register address used to submit the descriptor is an active WQ (the size field in the WQCFG register is non-zero). If this check fails, an error is logged in the software error register (SWERROR).
- If the descriptor was submitted to a shared WQ:
  - It was submitted with ENQCMD or ENQCMDS. If this check fails, an error is logged in SWERROR.
  - If the descriptor was submitted through a non-privileged or guest portal, the current queue occupancy is not greater than the WQ threshold. If this check fails, a retry response is returned.
  - If the descriptor was submitted through a privileged portal, the current queue occupancy is less than the WQ size. If this check fails, a retry response is returned.
- If the descriptor was submitted to a dedicated WQ:
  - It was submitted with MOVDIR64B.
  - The queue occupancy is less than the WQ size.

If either of the dedicated WQ checks fails, an error is logged in SWERROR.
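The submission checks amount to a three-way outcome: accept, retry, or log an error in SWERROR. The sketch below encodes them as a pure function; the string encodings for mode, portal, and outcome are assumptions for illustration, and it treats a failed dedicated WQ check as a SWERROR outcome.

```python
def check_submission(wq_size, wq_mode, wq_threshold, occupancy,
                     instruction, portal):
    """Classify a submitted descriptor as "accept", "retry", or "swerror".

    wq_mode -- "shared" or "dedicated" (hypothetical encoding)
    portal  -- "privileged", "non-privileged", or "guest"
    """
    if wq_size == 0:
        return "swerror"  # the portal does not map to an active WQ
    if wq_mode == "shared":
        if instruction not in ("ENQCMD", "ENQCMDS"):
            return "swerror"
        if portal == "privileged":
            # Privileged submitters may fill the whole queue.
            return "accept" if occupancy < wq_size else "retry"
        # Non-privileged and guest submitters are held to the threshold.
        return "accept" if occupancy <= wq_threshold else "retry"
    # Dedicated WQ: both checks log an error on failure.
    if instruction != "MOVDIR64B" or occupancy >= wq_size:
        return "swerror"
    return "accept"
```

The threshold leaves headroom in a shared WQ: once occupancy exceeds it, unprivileged submitters receive a retry while privileged submitters can still enqueue until the queue is full.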

In one embodiment, the device performs the following checks on each descriptor as it is processed:
- The value in the operation code field corresponds to a supported operation. This includes checking that the operation is valid in the context in which it was submitted. For example, a batch descriptor inside a batch is treated as an invalid operation code.
- No reserved flags are set. This includes flags whose corresponding capability bits in the GENCAP register are 0.
- No unsupported flags are set. This includes flags that are reserved for use with certain operations; for example, the Fence bit is reserved in descriptors that are enqueued directly rather than as part of a batch. It also includes flags that are disabled in the configuration, such as the Block on Fault flag, which is reserved when the Block on Fault Enable field in the WQCFG register is 0.
- Required flags are set. For example, the Request Completion Record flag must be 1 in a descriptor for the compare operation.
- Reserved fields are 0. This includes any field that has no defined meaning for the specified operation. Some embodiments may not check all reserved fields, but software should take care to clear all unused fields for maximum compatibility.
- In a batch descriptor, the total number of descriptors field is not greater than the maximum batch size field in the GENCAP register.
- The transfer size, source size, maximum delta record size, delta record size, and maximum destination size (as applicable to the descriptor type) are not greater than the maximum transfer size field in the GENCAP register.
- In a memory copy with dual cast descriptor, bits 11:0 of the two destination addresses are the same.
- If the Use Interrupt Message Storage flag is set, the completion interrupt handle is less than the interrupt message storage size.
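A few of the checks listed above can be sketched as follows. This is a minimal model, not the device's logic: the opcode values, the field names, and the `Capabilities` structure standing in for the GENCAP register are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical opcode values, chosen only for illustration.
OP_BATCH, OP_MEMMOVE, OP_DUALCAST = 0x01, 0x03, 0x05
SUPPORTED_OPS = {OP_BATCH, OP_MEMMOVE, OP_DUALCAST}

@dataclass
class Descriptor:
    opcode: int
    flags: int = 0
    transfer_size: int = 0
    dest1: int = 0
    dest2: int = 0
    descriptor_count: int = 0
    in_batch: bool = False

@dataclass
class Capabilities:  # stands in for GENCAP
    max_transfer_size: int
    max_batch_size: int
    reserved_flag_mask: int  # flag bits that must be 0

def validate(d, cap):
    """Return the name of one failed check, or None if all modelled checks pass."""
    if d.opcode not in SUPPORTED_OPS or (d.opcode == OP_BATCH and d.in_batch):
        return "invalid opcode"
    if d.flags & cap.reserved_flag_mask:
        return "reserved flag set"
    if d.opcode == OP_BATCH and d.descriptor_count > cap.max_batch_size:
        return "batch too large"
    if d.transfer_size > cap.max_transfer_size:
        return "transfer too large"
    if d.opcode == OP_DUALCAST and (d.dest1 ^ d.dest2) & 0xFFF:
        return "dualcast destination bits 11:0 differ"
    return None
```

The sketch reports only the first failure it happens to find; the order of the checks here is arbitrary.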

In one embodiment, if the completion record address 3804 cannot be translated, the descriptor 3800 is discarded and an error is recorded in the software error register. Otherwise, if any of these checks fails, a completion record is written with a status field indicating the type of check that failed, and bytes completed is set to 0. A completion interrupt is generated if requested.

These checks may be performed in any order. Thus, an indication of one type of error in the completion record does not imply that there are no other errors. The same invalid descriptor may report different error codes at different times or on different versions of the device.

The reserved fields 3811 in a descriptor may be divided into three categories: fields that are always reserved, fields that are reserved under some conditions (e.g., based on a capability, a configuration field, how the descriptor was submitted, or the values of other fields in the descriptor itself), and fields that are reserved based on the operation type. The following table lists the conditions under which fields are reserved.

As mentioned above, the DSA supports the use of either physical or virtual addresses. The use of virtual addresses that are shared with processes running on a processor core is called Shared Virtual Memory (SVM). To support SVM, the device provides a PASID when performing address translation, and it handles page faults that occur when no translation exists for an address. However, the device itself does not distinguish between virtual and physical addresses. This distinction is controlled by the programming of the IOMMU 1710.

In one embodiment, the DSA supports the Address Translation Services (ATS) and Page Request Services (PRS) PCI Express capabilities, as shown in FIG. 28, which shows PCIe logic 2820 communicating with PCIe logic 2808 using PCDI to utilize the ATS. ATS describes the behavior of the device during address translation. When a descriptor enters the descriptor processing unit, the device 2801 may request translations for the addresses in the descriptor. If there is a hit in the device TLB 2822, the device uses the corresponding host physical address (HPA). If there is a miss or a permission fault, one embodiment of the DSA 2801 sends an address translation request to the IOMMU 2810 (i.e., across the multi-protocol link 2800). The IOMMU 2810 may then locate the translation by walking the respective page tables and return an address translation response that contains the translated address and the effective permissions. The device 2801 then stores the translation in the device TLB 2822 and uses the corresponding HPA for the operation. If the IOMMU 2810 is unable to locate the translation in the page tables, it may return an address translation response indicating that no translation is available. If the IOMMU 2810 response indicates that no translation is available, or indicates effective permissions that do not include the permission required by the operation, it is considered a page fault.
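The translation flow described above can be modelled in a few lines. This is a simplified sketch under stated assumptions: the page table is a plain dictionary, permissions are strings such as "rw", a 4 KB page size is assumed, and the class and method names are hypothetical.

```python
PAGE_SHIFT = 12  # assumption: 4 KB pages

class PageFault(Exception):
    pass

class Iommu:
    def __init__(self, page_table):
        self.page_table = page_table  # {virtual page: (physical page, permissions)}

    def translate(self, vpn):
        # Returns (physical page, permissions), or None if no translation exists.
        return self.page_table.get(vpn)

class Device:
    def __init__(self, iommu):
        self.iommu = iommu
        self.tlb = {}  # device TLB: virtual page -> (physical page, permissions)

    def translate(self, vaddr, needed):
        vpn = vaddr >> PAGE_SHIFT
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        entry = self.tlb.get(vpn)
        if entry is None or needed not in entry[1]:
            # Device TLB miss or permission fault: send a request to the IOMMU.
            entry = self.iommu.translate(vpn)
            if entry is None or needed not in entry[1]:
                raise PageFault(hex(vaddr))  # no translation or insufficient permission
            self.tlb[vpn] = entry  # cache the translation in the device TLB
        return (entry[0] << PAGE_SHIFT) | offset
```

For example, with `Iommu({0x11: (0x81, "rw")})`, translating address `0x11008` for a write returns `0x81008` and leaves the translation cached in the device TLB.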

The DSA device 2801 may encounter a page fault on one of: 1) the completion record address 3804, 2) the descriptor list address in a batch descriptor, or 3) a source or destination buffer address. The DSA device 2801 may block until the page fault is resolved, or may complete the descriptor early and return a partial completion to the client. In one embodiment, the DSA device 2801 always blocks on page faults on the completion record address 3804 and the descriptor list address.

When the DSA blocks on a page fault, it reports the fault as a Page Request Service (PRS) request to the IOMMU 2810 for servicing by the OS page fault handler. The IOMMU 2810 may notify the OS via an interrupt. The OS validates the address and, if the check succeeds, creates a mapping in the page table and returns a PRS response through the IOMMU 2810.

In one embodiment, each descriptor 3800 has a Block on Fault flag that indicates whether the DSA 2801 should return a partial completion or block when a page fault occurs on a source or destination buffer address. If the Block on Fault flag is 1 and a fault is encountered, the descriptor that encountered the fault is blocked until the PRS response is received. Other operations behind the faulting descriptor may also be blocked.

If Block on Fault is 0 and a page fault occurs on a source or destination buffer address, the device stops the operation and writes a partial completion status to the completion record along with the faulting address and progress information. When the client software receives a completion record indicating partial completion, it has the option to resolve the fault on the processor (e.g., by touching the page) and submit a new work descriptor with the remaining work.

Alternatively, software can complete the remaining work on the processor. The Block on Fault Support field in the General Capabilities Register (GENCAP) may indicate device support for this feature, and the Block on Fault Enable field in the Work Queue Configuration Register allows the VMM or kernel driver to control whether applications are allowed to use the feature.

Device page faults can be relatively expensive. In fact, the cost of servicing a device page fault may be higher than the cost of servicing a processor page fault. Even if the device completes the work partially on a fault instead of blocking on the fault, additional overhead is incurred because software intervention is required to service the page fault and resubmit the work. Thus, for best performance, it is preferable for software to minimize device page faults without incurring the overhead of pinning and unpinning.

The batch descriptor list and source data buffers are typically produced by software just before they are submitted to the device. Thus, because of temporal locality, these addresses are unlikely to incur faults. However, completion descriptors and destination data buffers are more likely to incur faults if software has not touched them before submitting to the device. Such faults can be minimized by software explicitly "write touching" these pages before submission.
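Write touching can be sketched as writing one byte in every page of the buffer before submission. The helper below is illustrative only: the function name is hypothetical, it assumes the buffer starts on a page boundary (true for the anonymous `mmap` used here), and whether a write actually makes a page resident depends on the operating system.

```python
import mmap

def write_touch(buf, page_size=mmap.PAGESIZE):
    """Write one byte per page, preserving contents; return the number of pages touched."""
    for offset in range(0, len(buf), page_size):
        buf[offset] = buf[offset]  # a write access, so the page is mapped writable
    return (len(buf) + page_size - 1) // page_size

# Example: an anonymous mapping standing in for a destination buffer.
dest = mmap.mmap(-1, 3 * mmap.PAGESIZE + 1)
pages = write_touch(dest)
```

The same helper would be applied to the completion record area, since that is also written by the device.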

During a device TLB invalidation request, if the address being invalidated is in use in the descriptor processing unit, the device waits for the engine to finish using the address before completing the invalidation request.

Additional Descriptor Types

In some embodiments, one or more of the following additional descriptor types may be used:

No-op

Figure 40 shows an example no-op descriptor 4000 and a no-op completion record 4001. A no-op operation 4005 performs no DMA operation. It may request a completion record and/or a completion interrupt. If it is in a batch, it may specify the Fence flag to ensure that the no-op descriptor completes after all previous descriptors in the batch have completed.

Batch

Figure 41 shows an example batch descriptor 4100 and batch completion record 4101. A batch operation 4108 queues multiple descriptors at once. The descriptor list address 4102 is the address of a contiguous array of work descriptors to be processed. In one embodiment, each descriptor in the array is 64 bytes. The descriptor list address 4102 is 64-byte aligned. The total number of descriptors 4103 is the number of descriptors in the array. The set of descriptors in the array is called a "batch." The maximum number of descriptors allowed in a batch is given in the maximum batch size field in GENCAP.

The PASID 4104 and U/S flag 4105 in the batch descriptor are used for all descriptors in the batch. The PASID 4104 and U/S flag 4105 fields in the descriptors in the batch are ignored. If the Completion Queue Enable flag in the batch descriptor 4100 is set, the Completion Record Address Valid flag must be 1, and the Completion Queue Address field 4106 contains the address of the completion queue to be used for all descriptors in the batch. In this case, the Completion Record Address field 4106 in the descriptors in the batch is ignored. If the Completion Queue Support field in the General Capabilities Register is 0, the Completion Queue Enable flag is reserved.

If the Completion Queue Enable flag in the batch descriptor is 0, the completion record for each descriptor in the batch is written to the completion record address 4106 in that descriptor. In this case, if the Request Completion Record flag in the batch descriptor is 1, the completion queue address field is used as the completion record address 4106 solely for the batch descriptor itself.

The status field 4110 of the batch completion record 4101 indicates success if all of the descriptors in the batch completed successfully; otherwise, it indicates that one or more descriptors completed with a status other than success. The descriptors completed field 4111 of the completion record contains the total number of descriptors in the batch that were processed, whether they were successful or not. Descriptors completed 4111 may be less than the total number of descriptors 4103 if there is a fence in the batch or if a page fault occurred while reading the batch.

Drain

Figure 42 shows an example drain descriptor 4200 and drain completion record 4201. The drain operation 4208 waits for the completion of all outstanding descriptors associated with the PASID 4202 in the work queue to which the drain descriptor 4200 was submitted. This descriptor may be used during normal shutdown by a process that has been using the device. To wait for all descriptors associated with the PASID 4202, software should submit a separate drain operation to every work queue on which the PASID 4202 was used. Software should ensure that no descriptors with the specified PASID 4202 are submitted to the work queue after the drain descriptor 4200 is submitted and before it completes.

A drain descriptor 4200 may not be included in a batch; if it is, it is treated as an unsupported operation type. A drain should specify Request Completion Record or Request Completion Interrupt. The completion notification is made after the other descriptors have completed.

Memory Move

Figure 43 shows an example memory move descriptor 4300 and memory move completion record 4301. A memory move operation 4308 copies memory from a source address 4302 to a destination address 4303. The number of bytes copied is given by the transfer size 4304. There are no alignment requirements for the memory addresses or the transfer size. If the source and destination regions overlap, the memory copy is performed as if the entire source buffer were copied to temporary space and then copied to the destination buffer. This may be implemented by reversing the direction of the copy when the beginning of the destination buffer overlaps the end of the source buffer.

If the operation is partially completed due to a page fault, the direction field 4310 of the completion record is 0 if the copy was performed starting at the beginning of the source and destination buffers, and 1 if the direction of the copy was reversed.

To resume the operation after a partial completion, if direction is 0, the source and destination address fields 4302-4303 in the continuation descriptor should be increased by bytes completed 4311, and the transfer size 4304 should be decreased by bytes completed 4311. If direction is 1, the transfer size 4304 should be decreased by bytes completed 4311, but the source and destination address fields 4302-4303 should be the same as in the original descriptor. Note that if a subsequent partial completion occurs, the direction field 4310 may not be the same as it was for the first partial completion.
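The overlap handling and the resume rules for a memory move can be sketched together. This is a software model only: `memory_move`, `continuation`, and the `fault_after` parameter (which simulates a page fault after a given number of bytes) are hypothetical names, and memory is modelled as a `bytearray`.

```python
def memory_move(mem, src, dst, size, fault_after=None):
    """Model a memory move; return (direction, bytes_completed)."""
    # Copy backward when the start of the destination overlaps the end of the source.
    direction = 1 if src < dst < src + size else 0
    n = size if fault_after is None else min(size, fault_after)
    if direction == 0:
        for i in range(n):  # forward, from the beginning of the buffers
            mem[dst + i] = mem[src + i]
    else:
        for i in range(size - 1, size - 1 - n, -1):  # backward, from the end
            mem[dst + i] = mem[src + i]
    return direction, n

def continuation(src, dst, size, direction, bytes_completed):
    """Return (src, dst, size) for the descriptor that resumes a partial move."""
    if direction == 0:
        return src + bytes_completed, dst + bytes_completed, size - bytes_completed
    return src, dst, size - bytes_completed  # reversed copy: addresses unchanged
```

With a reversed copy the unfinished bytes are at the start of the buffers, which is why only the transfer size changes in that case. The continuation may itself run in a different direction if the shortened regions no longer overlap.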

Fill

Figure 44 shows an example fill descriptor 4400. A memory fill operation 4408 fills memory at the destination address 4406 with the value in the pattern field 4405. The pattern size may be 8 bytes. To use a smaller pattern, software must replicate the pattern in the descriptor. The number of bytes written is given by the transfer size 4407. The transfer size does not need to be a multiple of the pattern size. There are no alignment requirements for the destination address or the transfer size. If the operation is partially completed due to a page fault, the bytes completed field of the completion record contains the number of bytes written to the destination before the fault occurred.
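The pattern replication that software performs, and a fill whose transfer size is not a multiple of the pattern size, can be sketched as follows. The function names are hypothetical, and the sketch assumes the pattern starts at its first byte at the destination address.

```python
def replicate_pattern(pattern):
    """Expand a 1-, 2-, or 4-byte pattern to fill the 8-byte pattern field."""
    if len(pattern) not in (1, 2, 4, 8):
        raise ValueError("pattern length must divide 8")
    return pattern * (8 // len(pattern))

def memory_fill(mem, dst, pattern8, size):
    """Write size bytes at dst, repeating the 8-byte pattern; the last repeat may be partial."""
    for i in range(size):
        mem[dst + i] = pattern8[i % 8]
```

For example, filling 11 bytes with the replicated 2-byte pattern `b"\xAB\xCD"` writes five full copies of the pattern followed by a single `0xAB` byte.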

Compare

Figure 45 shows an example compare descriptor 4500 and compare completion record 4501. A compare operation 4508 compares memory at the source 1 address 4504 with memory at the source 2 address 4505. The number of bytes compared is given by the transfer size 4506. There are no alignment requirements for the memory addresses or the transfer size 4506. The Completion Record Address Valid and Request Completion Record flags must be 1, and the completion record address must be valid. The result of the comparison is written to the result field 4510 of the completion record 4501. A value of 0 indicates that the two memory regions match, and a value of 1 indicates that they do not match. If the result 4510 is 1, the bytes completed field 4511 of the completion record indicates the byte offset of the first difference. If the operation is partially completed due to a page fault, the result is 0. If a difference has been detected, the difference is reported instead of the page fault.
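The result rules for a compare can be modelled as below. The function name, the status strings, and the `fault_at` parameter (the offset at which a page fault is simulated) are hypothetical; the value of bytes completed for a full match is not stated in the text, so the sketch returns `None` there.

```python
def compare(mem, src1, src2, size, fault_at=None):
    """Return (status, result, bytes_completed) for a modelled compare."""
    for i in range(size):
        if fault_at is not None and i == fault_at:
            return "partial", 0, i  # fault before any difference: result is 0
        if mem[src1 + i] != mem[src2 + i]:
            return "success", 1, i  # byte offset of the first difference
    return "success", 0, None  # full match; bytes completed not specified in the text
```

A difference located before the faulting offset is reported in preference to the fault, because the loop reaches it first.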

If the operation is successful and the check result flag is 1, the status field 4512 of the completion record is set according to the result and the expected result, as shown in the table below. This allows a subsequent descriptor in the same batch that has the Fence flag set to continue or stop execution of the batch based on the result of the comparison.

Compare Immediate

Figure 46 shows an example compare immediate descriptor 4600. The compare immediate operation 4608 compares memory at the source address 4601 with the value in the pattern field 4602. The pattern size is 8 bytes. To use a smaller pattern, software must replicate the pattern in the descriptor. The number of bytes compared is given by the transfer size 4603. The transfer size does not need to be a multiple of the pattern size. The Completion Record Address Valid and Request Completion Record flags must be 1. The completion record address 4604 must be valid. The result of the comparison is written to the result field of the completion record. A value of 0 indicates that the memory region matches the pattern, and a value of 1 indicates that it does not match. If the result is 1, the bytes completed field of the completion record indicates the location of the first difference. It may not be the exact byte location, but it is guaranteed to be no greater than the first difference. If the operation is partially completed due to a page fault, the result is 0. If a difference has been detected, the difference is reported instead of the page fault. In one embodiment, the completion record format for compare immediate and the behavior of check result and expected result are identical to those of compare.

Create Delta Record

Figure 47 shows an example create delta record descriptor 4700 and create delta record completion record 4701. A create delta record operation 4708 compares memory at the source 1 address 4705 with memory at the source 2 address 4702 and generates a delta record containing the information needed to update source 1 to match source 2. The number of bytes compared is given by the transfer size 4703. As explained below, the transfer size is limited by the maximum offset that can be stored in the delta record. There are no alignment requirements for the memory addresses or the transfer size. The Completion Record Address Valid and Request Completion Record flags must be 1, and the completion record address 4704 must be valid.

The maximum size of the delta record is given by the maximum delta record size 4709. The maximum delta record size 4709 should be a multiple of the delta size (10 bytes) and must not be greater than the maximum transfer size in GENCAP. The actual size of the delta record depends on the number of differences detected between source 1 and source 2; it is written to the delta record size field 4710 of the completion record. If the space needed for the delta record exceeds the maximum delta record size 4709 specified in the descriptor, the operation completes with a partial delta record.

The result of the comparison is written to the result field 4711 of the completion record 4701. If the two regions match exactly, the result is 0, the delta record size is 0, and bytes completed is 0. If the two regions do not match and the complete set of deltas was written to the delta record, the result is 1, the delta record size contains the total size of all the differences found, and bytes completed is 0. If the two regions do not match and the space needed to record all the deltas exceeds the maximum delta record size, the result is 2, the delta record size 4710 contains the size of the set of deltas written to the delta record (typically equal or nearly equal to the maximum delta record size specified in the descriptor), and bytes completed 4712 contains the number of bytes compared before the space in the delta record was exceeded.

If the operation is partially completed due to a page fault, the result 4711 is either 0 or 1, as described in the previous paragraph, bytes completed 4712 contains the number of bytes compared before the page fault occurred, and the delta record size contains the space used in the delta record before the page fault occurred.

The format of the delta record is shown in Figure 48. The delta record contains an array of deltas. Each delta contains a 2-byte offset 4801 and an 8-byte block of data 4802 from source 2 that differs from the corresponding 8 bytes in source 1. The total size of the delta record is a multiple of 10. Because the offset 4801 is a 16-bit field representing a multiple of 8 bytes, the maximum offset that can be expressed is 0x7FFF8, and therefore the maximum transfer size is 0x80000 bytes (512 KB).
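The delta record layout and the result values 0, 1, and 2 can be illustrated with a short software model. This is a sketch under stated assumptions: the function name is hypothetical, the offset is packed little-endian (the byte order is not given in the text), and the buffers are required to be equal in length and a multiple of 8 bytes for simplicity.

```python
import struct

DELTA_SIZE = 10  # 2-byte offset + 8 bytes of data

def create_delta_record(src1, src2, max_record_size):
    """Return (result, delta_record, bytes_completed) for two equal-length buffers."""
    if len(src1) != len(src2) or len(src1) % 8 or len(src1) > 0x80000:
        raise ValueError("unsupported buffer sizes for this sketch")
    record = bytearray()
    for off in range(0, len(src1), 8):
        block = src2[off:off + 8]
        if src1[off:off + 8] != block:
            if len(record) + DELTA_SIZE > max_record_size:
                return 2, bytes(record), off  # record full: partial delta record
            # The offset is stored in units of 8 bytes; "<H" (little-endian) is an assumption.
            record += struct.pack("<H", off // 8) + bytes(block)
    return (1 if record else 0), bytes(record), 0
```

The size limit follows from the offset encoding: the largest 16-bit value, 0xFFFF, multiplied by 8 gives 0x7FFF8, so the last addressable 8-byte block ends at 0x80000.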

If the operation is successful and the check result flag is 1, the status field of the completion record is set according to the result and the expected result, as shown in the table below. This allows a subsequent descriptor in the same batch that has the Fence flag set to continue or stop execution of the batch based on the result of the delta record creation. Bits 7:2 of the expected result are ignored.

Apply Delta Record

Figure 49 shows an example apply delta record descriptor 4901. The apply delta record operation 4902 applies a delta record to the contents of memory at the destination address 4903. The delta record address 4904 is the address of a delta record that was created by a create delta record operation 4902 that completed with a result equal to 1. The delta record size 4905 is the size of the delta record, as reported in the completion record of the create delta record operation 4902. The destination address 4903 is the address of a buffer that contains the same contents as the memory at the source 1 address when the delta record was created. The transfer size 4906 is the same as the transfer size used when the delta record was created. After the apply delta record operation 4902 completes, the memory at the destination address 4903 matches the contents that were in memory at the source 2 address when the delta record was created. There are no alignment requirements for the memory addresses or the transfer size.

If a page fault occurs during the apply delta record operation 4902, the bytes completed field of the completion record contains the number of bytes of the delta record that were successfully applied to the destination. If software chooses to submit another descriptor to resume the operation, the continuation descriptor should contain the same destination address 4903 as the original. The delta record address 4904 should be increased by bytes completed (so that it points to the first unapplied delta), and the delta record size 4905 should be decreased by bytes completed.
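Applying a delta record, and resuming after a partial completion, can be sketched as follows. The function name and the `fault_after` parameter (which simulates a page fault once that many record bytes have been applied) are hypothetical, and the little-endian offset encoding is an assumption.

```python
import struct

DELTA_SIZE = 10  # 2-byte offset + 8 bytes of data

def apply_delta_record(dest, record, fault_after=None):
    """Apply deltas to dest in place; return the number of record bytes applied."""
    applied = 0
    for pos in range(0, len(record), DELTA_SIZE):
        if fault_after is not None and applied >= fault_after:
            break  # simulated page fault: stop with a partial completion
        (index,) = struct.unpack_from("<H", record, pos)  # offset in 8-byte units
        dest[index * 8:index * 8 + 8] = record[pos + 2:pos + DELTA_SIZE]
        applied += DELTA_SIZE
    return applied
```

To resume in this model, the caller passes the same destination buffer together with `record[applied:]`, which corresponds to advancing the delta record address and reducing the delta record size by bytes completed.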

Figure 50 shows one embodiment of the use of the create delta record and apply delta record operations. First, a create delta record operation 5001 is performed. The create delta record operation 5001 reads the two source buffers (source 1 and source 2), writes the delta record 5010, and records the actual delta record size 5004 in its completion record 5003. The apply delta record operation 5005 takes the contents of the delta record written by the create delta record operation 5001, together with its size and a copy of the source 1 data, and updates the destination buffer 5015 to be a duplicate of the original source 2 buffer. The create delta record operation includes a maximum delta record size 5002.

Memory Copy with Dual Cast

Figure 51 shows an example memory copy with dual cast descriptor 5100 and memory copy with dual cast completion record 5102. A memory copy with dual cast operation 5104 copies memory from the source address 5105 to both the destination 1 address 5106 and the destination 2 address 5107. The number of bytes copied is given by the transfer size 5108. There are no alignment requirements for the source address or the transfer size. Bits 11:0 of the two destination addresses 5106-5107 should be the same.

ソース領域が宛先領域のいずれか一方と重複する場合、メモリコピーは、ソースバッファ全体が一時的な空間にコピーされ、次に、宛先バッファにコピーされたかのように行われる。これは、宛先バッファの先頭がソースバッファの終了と重複する場合にコピーの方向を反転することにより実施され得る。ソース領域が宛先領域の両方と重複する場合、又は、２つの宛先領域が重複する場合、それはエラーである。オペレーションが、ページフォールトに起因して部分的に完了した場合、コピーオペレーションは、宛先領域の両方に同じバイト数を書き込んだ後に停止し、完了記録の方向フィールド５１１０は、ソース及び宛先バッファの先頭から開始するコピーが実行される場合に０であり、方向フィールドは、コピーの方向が反転された場合に１である。 If the source area overlaps with either of the destination areas, the memory copy is performed as if the entire source buffer was copied to a temporary space and then copied to the destination buffer. This can be done by reversing the direction of the copy if the beginning of the destination buffer overlaps the end of the source buffer. It is an error if the source area overlaps both destination areas, or if the two destination areas overlap. If the operation is partially completed due to a page fault, the copy operation stops after writing the same number of bytes to both destination areas, and the direction field 5110 of the completion record is 0 if the copy is performed starting from the beginning of the source and destination buffers, and the direction field is 1 if the direction of the copy is reversed.
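
The overlap rules above can be sketched as a small Python check. This is a model under stated assumptions, not the device's algorithm: the function names are hypothetical, a mismatch in bits 11:0 of the destination addresses is treated as an error, and the direction is chosen with the usual "reverse when the destination starts inside the source" rule.

```python
def regions_overlap(a: int, b: int, size: int) -> bool:
    """True when [a, a+size) and [b, b+size) share at least one byte."""
    return size > 0 and a < b + size and b < a + size


def dualcast_copy_direction(source: int, dest1: int, dest2: int, size: int) -> int:
    """Return 0 for a front-to-back copy, 1 for a reversed copy."""
    if (dest1 ^ dest2) & 0xFFF:
        raise ValueError("bits 11:0 of the two destination addresses differ")
    overlaps1 = regions_overlap(source, dest1, size)
    overlaps2 = regions_overlap(source, dest2, size)
    if overlaps1 and overlaps2:
        raise ValueError("source overlaps both destinations")
    if regions_overlap(dest1, dest2, size):
        raise ValueError("destination regions overlap")
    for dest, overlaps in ((dest1, overlaps1), (dest2, overlaps2)):
        # Destination starts inside the source: copy from the end backwards.
        if overlaps and source < dest:
            return 1
    return 0
```

Copying backwards in the overlapping case gives the same result as staging the whole source in a temporary buffer, which is the behavior the text requires.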

部分的な完了後にオペレーションを再開するために、方向５１１０が０である場合、連続的な記述子内のソース５１０５及び両方の宛先アドレスフィールド５１０６－５１０７は、バイト完了５１１１により増加されるべきであり、転送サイズ５１０８は、バイト完了５１１１により低減される。方向が１である場合、転送サイズ５１０８は、バイト完了５１１１により低減されるべきであるが、ソース５１０５及び宛先５１０６－５１０７アドレスフィールドは、元の記述子と同じであるべきである。後続の部分的な完了が発生した場合、方向フィールド５１１０は、第１の部分的な完了に対するものと同じでなくてもよいことに留意する。 To resume the operation after a partial completion, if direction 5110 is 0, the source 5105 and both destination address fields 5106-5107 in the successive descriptor should be increased by bytes completed 5111, and the transfer size 5108 should be decreased by bytes completed 5111. If direction is 1, the transfer size 5108 should be decreased by bytes completed 5111, but the source 5105 and destination 5106-5107 address fields should be the same as in the original descriptor. Note that if a subsequent partial completion occurs, the direction field 5110 may differ from the one reported for the first partial completion.
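
The two resume cases can be written out as follows. This is a minimal sketch: `DualcastDescriptor` and `resume_dualcast` are hypothetical names, and only the fields mentioned in the text are modeled.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DualcastDescriptor:
    """Hypothetical model of the fields adjusted when resuming."""
    source: int         # 5105
    destination1: int   # 5106
    destination2: int   # 5107
    transfer_size: int  # 5108


def resume_dualcast(desc: DualcastDescriptor, bytes_completed: int, direction: int) -> DualcastDescriptor:
    """Build the descriptor that resumes a partially completed copy."""
    if direction not in (0, 1):
        raise ValueError("direction must be 0 or 1")
    if not 0 <= bytes_completed <= desc.transfer_size:
        raise ValueError("bytes_completed is outside the transfer")
    remaining = desc.transfer_size - bytes_completed
    if direction == 0:
        # Forward copy: the finished bytes are at the start, so skip past them.
        return replace(
            desc,
            source=desc.source + bytes_completed,
            destination1=desc.destination1 + bytes_completed,
            destination2=desc.destination2 + bytes_completed,
            transfer_size=remaining,
        )
    # Reversed copy: the finished bytes are at the end, so only the size shrinks.
    return replace(desc, transfer_size=remaining)
```

Because the direction reported by a later partial completion may differ, software would read the direction field from each completion record rather than reusing the first one.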

巡回冗長検査（ＣＲＣ）生成 Cyclic redundancy check (CRC) generation

図５２は、例示的なＣＲＣ生成記述子５２００及びＣＲＣ生成５２０１を示す。ＣＲＣ生成処理５２０４は、ソースアドレスにおけるメモリ上のＣＲＣを計算する。ＣＲＣ計算に用いられるバイト数は、転送サイズ５２０５により与えられる。メモリアドレス又は転送サイズ５２０５に対するアライメント要求はない。完了記録アドレス有効及び要求完了記録フラグは１でなければならず、完了記録アドレス５２０６は有効でなければならない。計算されたＣＲＣ値は、完了記録に書き込こまれる。 Figure 52 shows an example CRC generation descriptor 5200 and CRC generation 5201. The CRC generation process 5204 calculates the CRC in memory at the source address. The number of bytes used in the CRC calculation is given by the transfer size 5205. There is no alignment requirement for the memory address or transfer size 5205. The completion record address valid and requested completion record flags must be 1 and the completion record address 5206 must be valid. The calculated CRC value is written to the completion record.

オペレーションが、ページフォールトに起因して部分的に完了した場合、部分的なＣＲＣ結果は、ページフォールト情報と共に完了記録に書き込まれる。ソフトウェアが障害を是正し、オペレーションを再開する場合、連続的な記述子のＣＲＣシードフィールドへこの部分的な結果をコピーしなければならない。そうでなければ、ＣＲＣシードフィールドは０であるべきである。 If an operation is partially completed due to a page fault, the partial CRC result is written to the completion record along with the page fault information. When software corrects the fault and resumes the operation, it must copy this partial result into the CRC seed field of the successive descriptor. Otherwise, the CRC seed field should be 0.
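
The seed mechanism can be illustrated with Python's `zlib.crc32`, which accepts a starting value. This is an analogy only: the text does not state which CRC polynomial or seed convention the device uses, so CRC-32 as implemented by zlib is an assumption, and `crc_generate` is a hypothetical helper.

```python
import zlib


def crc_generate(data: bytes, seed: int = 0) -> int:
    """Stand-in for the CRC generation operation; a seed of 0 starts a fresh CRC."""
    return zlib.crc32(data, seed)


payload = bytes(range(256)) * 4

# The first descriptor faults after 300 bytes and reports a partial CRC.
partial_crc = crc_generate(payload[:300])

# The successive descriptor covers the remaining bytes, seeded with the partial result.
resumed_crc = crc_generate(payload[300:], seed=partial_crc)

# Resuming with the seed yields the same value as one uninterrupted pass.
assert resumed_crc == crc_generate(payload)
```

If the seed were left at 0 on the successive descriptor, the result would cover only the tail of the buffer, which is why the partial result has to be carried forward.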

ＣＲＣ生成を用いたコピー Copying using CRC generation

図５３は、ＣＲＣ生成記述子５３００を用いた例示的なコピーを示す。ＣＲＣ生成処理５３０５を用いたコピーは、ソースアドレス５３０２から宛先アドレス５３０３にメモリをコピーし、コピーされたデータに対するＣＲＣを計算する。コピーされたバイト数は、転送サイズ５３０４により与えられる。メモリアドレス又は転送サイズに対するアライメント要求はない。ソース及び宛先領域が重複する場合、それはエラーである。完了記録アドレス有効及び要求完了記録フラグは１でなければならず、完了記録アドレスは有効でなければならない。計算されたＣＲＣ値は、完了記録に書き込まれる。 Figure 53 shows an example copy with CRC generation descriptor 5300. The copy with CRC generation process 5305 copies memory from source address 5302 to destination address 5303 and calculates a CRC for the copied data. The number of bytes copied is given by transfer size 5304. There are no alignment requirements for memory addresses or transfer size. It is an error if the source and destination regions overlap. The completion record address valid and requested completion record flags must be 1 and the completion record address must be valid. The calculated CRC value is written to the completion record.

オペレーションが、ページフォールトに起因して部分的に完了した場合、部分的なＣＲＣ結果は、ページフォールト情報と共に完了記録に書き込まれる。ソフトウェアが障害を是正し、オペレーションを再開する場合、連続的な記述子のＣＲＣシードフィールドへこの部分的な結果をコピーしなければならない。そうでなければ、ＣＲＣシードフィールドは０とすべきである。一実施例において、ＣＲＣ生成を用いたコピー用の完了記録フォーマットは、ＣＲＣ生成用のフォーマットと同じである。 If an operation is partially completed due to a page fault, the partial CRC result is written to the completion record along with the page fault information. When software corrects the fault and resumes the operation, it must copy this partial result into the CRC seed field of the successive descriptor. Otherwise, the CRC seed field should be 0. In one embodiment, the completion record format for copying with CRC generation is the same as the format for CRC generation.

データ整合性フィールド（ＤＩＦ）挿入 Data Integrity Field (DIF) Insertion

図５４は、例示的なＤＩＦ挿入記述子５４００及びＤＩＦ挿入完了記録５４０１を示す。ＤＩＦ挿入オペレーション５４０５は、ソースアドレス５４０２から宛先アドレス５４０３にメモリをコピーし、ソースデータ上のデータ整合性フィールド（ＤＩＦ）を計算し、ＤＩＦを出力データに挿入する。コピーされたソースバイトの数は、転送サイズ５４０６により与えられる。ＤＩＦ計算は、例えば、５１２、５２０、４０９６又は４１０４バイトのソースデータの各ブロックで実行される。転送サイズは、ソースブロックサイズの倍数とすべきである。宛先に書き込まれたバイト数は、ソースブロックごとに、転送サイズに８バイトを加えた値である。メモリアドレスに対するアライメント要求はない。ソース及び宛先領域が重複する場合、それはエラーである。オペレーションが、ページフォールトに起因して部分的に完了した場合、参照タグ及びアプリケーションタグの更新された値は、ページフォールト情報と共に完了記録に書き込まれる。ソフトウェアが障害を是正し、オペレーションを再開する場合、連続的な記述子へこれらのフィールドをコピーし得る。 Figure 54 shows an exemplary DIF insert descriptor 5400 and DIF insert completion record 5401. A DIF insert operation 5405 copies memory from a source address 5402 to a destination address 5403, calculates a data integrity field (DIF) on the source data, and inserts the DIF into the output data. The number of source bytes copied is given by the transfer size 5406. The DIF calculation is performed on each block of source data, which is, for example, 512, 520, 4096, or 4104 bytes. The transfer size should be a multiple of the source block size. The number of bytes written to the destination is the transfer size plus 8 bytes for each source block. There are no alignment requirements for the memory addresses. It is an error if the source and destination regions overlap. If the operation is partially completed due to a page fault, the updated values of the reference tag and application tag are written to the completion record along with the page fault information. When software corrects the fault and resumes the operation, it may copy these fields into the successive descriptor.

ＤＩＦストリップ DIF strip

図５５は、例示的なＤＩＦストリップ記述子５５００及びＤＩＦストリップ完了記録５５０１を示す。ＤＩＦストリップオペレーション５５０５は、ソースアドレス５５０２から宛先アドレス５５０３にメモリをコピーし、ソースデータ上のデータ整合性フィールド（ＤＩＦ）を計算し、計算されたＤＩＦを、データに含まれるＤＩＦと比較する。読み出されるソースバイトの数は、転送サイズ５５０６により与えられる。ＤＩＦ計算は、５１２、５２０、４０９６又は４１０４バイトであり得るソースデータの各ブロックで実行される。転送サイズは、ソースブロックごとに、ソースブロックサイズの倍数に８バイトを加えた値とすべきである。宛先に書き込まれるバイト数は、ソースブロックごとに、転送サイズから８バイトを差し引いた値である。メモリアドレスに対するアライメント要求はない。ソース及び宛先領域が重複する場合、それはエラーである。オペレーションが、ページフォールトに起因して部分的に完了した場合、参照タグ及びアプリケーションタグの更新された値は、ページフォールト情報と共に完了記録に書き込まれる。ソフトウェアが障害を是正し、オペレーションを再開する場合、連続的な記述子へこれらのフィールドをコピーしてよい。 Figure 55 shows an exemplary DIF strip descriptor 5500 and DIF strip completion record 5501. A DIF strip operation 5505 copies memory from a source address 5502 to a destination address 5503, calculates a data integrity field (DIF) on the source data, and compares the calculated DIF to the DIF contained in the data. The number of source bytes read is given by the transfer size 5506. The DIF calculation is performed on each block of source data, which can be 512, 520, 4096, or 4104 bytes. The transfer size should be a multiple of the source block size plus 8 bytes per source block. The number of bytes written to the destination is the transfer size minus 8 bytes per source block. There is no alignment requirement on the memory address. It is an error if the source and destination regions overlap. If the operation is partially completed due to a page fault, the updated values of the reference tag and application tag are written to the completion record along with the page fault information. When the software corrects the fault and resumes operation, it may copy these fields into the successive descriptors.

ＤＩＦ更新 DIF Update

図５６は、例示的なＤＩＦ更新記述子５６００及びＤＩＦ更新完了記録５６０１を示す。ＤＩＦ更新オペレーション５６０５を用いたメモリ移動は、ソースアドレス５６０２から宛先アドレス５６０３にメモリをコピーし、ソースデータ上のデータ整合性フィールド（ＤＩＦ）を計算し、計算されたＤＩＦを、データに含まれるＤＩＦと比較する。記述子内の宛先ＤＩＦフィールドを用いてソースデータ上のＤＩＦを同時に計算し、計算されたＤＩＦを出力データに挿入する。読み出されるソースバイトの数は、転送サイズ５６０６により与えられる。ＤＩＦ計算は、５１２、５２０、４０９６又は４１０４バイトであり得るソースデータの各ブロックで実行される。転送サイズ５６０６は、ソースブロックごとに、ソースブロックサイズの倍数に８バイトを加えた値とすべきである。宛先に書き込まれるバイト数は、転送サイズ５６０６と同じである。メモリアドレスに対するアライメント要求はない。ソース及び宛先領域が重複する場合、それはエラーである。オペレーションが、ページフォールトに起因して部分的に完了した場合、ソース及び宛先参照タグ、及び、アプリケーションタグの更新された値は、ページフォールト情報と共に完了記録に書き込まれる。ソフトウェアが障害を是正し、オペレーションを再開する場合、連続的な記述子にこれらのフィールドをコピーしてよい。 Figure 56 shows an exemplary DIF update descriptor 5600 and DIF update completion record 5601. A memory move with DIF update operation 5605 copies memory from source address 5602 to destination address 5603, calculates the data integrity field (DIF) on the source data, and compares the calculated DIF with the DIF contained in the data. At the same time, it calculates a DIF on the source data using the destination DIF fields in the descriptor and inserts that DIF into the output data. The number of source bytes read is given by the transfer size 5606. The DIF calculation is performed on each block of source data, which can be 512, 520, 4096, or 4104 bytes. The transfer size 5606 should be a multiple of the source block size plus 8 bytes for each source block. The number of bytes written to the destination is equal to the transfer size 5606. There are no alignment requirements for the memory addresses. It is an error if the source and destination regions overlap. If the operation is partially completed due to a page fault, the updated values of the source and destination reference tags and application tags are written to the completion record along with the page fault information. When software corrects the fault and resumes the operation, it may copy these fields into the successive descriptor.
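
The size relationships for the three DIF operations can be checked with a short Python sketch. The function names are hypothetical, and the sketch reads "a multiple of the source block size plus 8 bytes for each source block" as "a multiple of (block size + 8)", which is an interpretation of the text.

```python
DIF_BYTES = 8
BLOCK_SIZES = (512, 520, 4096, 4104)


def _block_count(transfer_size: int, unit: int) -> int:
    """Number of whole units in the transfer; rejects partial units."""
    if transfer_size <= 0 or transfer_size % unit:
        raise ValueError(f"transfer size must be a positive multiple of {unit}")
    return transfer_size // unit


def dif_insert_destination_bytes(transfer_size: int, block_size: int) -> int:
    """DIF insert: the source holds bare blocks, the destination gains 8 bytes per block."""
    assert block_size in BLOCK_SIZES
    return transfer_size + DIF_BYTES * _block_count(transfer_size, block_size)


def dif_strip_destination_bytes(transfer_size: int, block_size: int) -> int:
    """DIF strip: the source holds block + DIF, the destination loses 8 bytes per block."""
    assert block_size in BLOCK_SIZES
    return transfer_size - DIF_BYTES * _block_count(transfer_size, block_size + DIF_BYTES)


def dif_update_destination_bytes(transfer_size: int, block_size: int) -> int:
    """DIF update: the DIF is replaced in place, so the sizes match."""
    assert block_size in BLOCK_SIZES
    _block_count(transfer_size, block_size + DIF_BYTES)
    return transfer_size
```

With 512-byte blocks, inserting DIFs into 4096 bytes (8 blocks) writes 4160 bytes, and stripping those 4160 bytes writes 4096 bytes again.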

以下のテーブルＡＡは、一実施例において用いられるＤＩＦフラグを示す。テーブルＢＢは、一実施例において用いられるソースＤＩＦフラグを示し、テーブルＣＣは、一実施例における宛先ＤＩＦフラグを示す。

Table AA below shows the DIF flags used in one embodiment, Table BB shows the source DIF flags used in one embodiment, and Table CC shows the destination DIF flags in one embodiment.

ソースＤＩＦフラグ

Source DIF Flag

宛先ＤＩＦフラグ

Destination DIF Flag

一実施例において、ＤＩＦ結果フィールドは、ＤＩＦオペレーションのステータスを報告する。このフィールドは、ＤＩＦストリップ及び更新オペレーションのためだけに、かつ、完了記録のステータスフィールドが成功又は誤った述部を伴う成功である場合のみ定義されてよい。以下のテーブルＤＤは、例示的なＤＩＦ結果フィールドコードを示す。

In one embodiment, the DIF result field reports the status of the DIF operation. This field may be defined only for DIF strip and DIF update operations, and only if the status field of the completion record is Success or Success with false predicate. Table DD below shows exemplary DIF result field codes.

Ｆ検出条件は、テーブルＥＥに以下に示されるうちの１つが真である場合に検出される。

The F detection condition is detected if one of the following in Table EE is true:

オペレーションが成功し、チェック結果フラグが１である場合、完了記録のステータスフィールドは、以下のテーブルＦＦに示されるように、ＤＩＦ結果に従って設定される。これは、フェンスフラグと同じバッチ内の後続の記述子が、オペレーションの結果に基づいてバッチの実行を継続又は停止することを可能にする。

If the operation is successful and the check result flag is 1, the status field of the completion record is set according to the DIF result, as shown in Table FF below. This allows a subsequent descriptor in the same batch that has the fence flag set to continue or stop execution of the batch based on the result of the operation.

キャッシュフラッシュ Cache flush

図５７は、例示的なキャッシュフラッシュ記述子５７００を示す。キャッシュフラッシュオペレーション５７０５は、宛先アドレスでプロセッサキャッシュをフラッシュする。フラッシュされるバイト数は、転送サイズ５７０２により与えられる。転送サイズは、キャッシュラインサイズの倍数である必要はない。宛先アドレス又は転送サイズに対するアライメント要求はない。宛先領域により部分的にカバーされる任意のキャッシュラインがフラッシュされる。 Figure 57 shows an example cache flush descriptor 5700. A cache flush operation 5705 flushes the processor cache at a destination address. The number of bytes flushed is given by the transfer size 5702. The transfer size does not need to be a multiple of the cache line size. There are no alignment requirements for the destination address or transfer size. Any cache lines that are partially covered by the destination region are flushed.
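
Since any cache line that the destination region partially covers is flushed, the set of affected lines follows from simple address arithmetic. The sketch below assumes a 64-byte cache line, which the text does not specify, and the function name is hypothetical.

```python
def cache_lines_to_flush(destination_address: int, transfer_size: int, line_size: int = 64) -> list[int]:
    """Return the base address of every cache line touched by the region."""
    if transfer_size <= 0:
        return []
    first_line = destination_address // line_size
    last_line = (destination_address + transfer_size - 1) // line_size
    return [line * line_size for line in range(first_line, last_line + 1)]
```

For instance, an 8-byte region starting at 0x103C straddles a line boundary, so both the line at 0x1000 and the line at 0x1040 are flushed even though only a few bytes of each are covered.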

宛先キャッシュフィルタグが０である場合、影響を受けるキャッシュラインは、キャッシュ階層の各レベルから無効にされ得る。キャッシュラインは、キャッシュ階層の任意のレベルで修正されたデータを含み、データは、メモリに書き戻される。これは、いくつかのプロセッサに実装されたＣＬＦＬＵＳＨ命令の挙動と同様である。 If the destination cache fill flag is 0, the affected cache lines may be invalidated from every level of the cache hierarchy. If a cache line contains modified data at any level of the cache hierarchy, the data is written back to memory. This is similar to the behavior of the CLFLUSH instruction implemented in some processors.

宛先キャッシュフィルタグが１である場合、修正されたキャッシュラインは、メインメモリに書き込まれるが、キャッシュから追い出されない。これは、いくつかのプロセッサにおけるＣＬＷＢ命令の挙動と同様である。 If the destination cache fill flag is 1, modified cache lines are written to main memory but are not evicted from the caches. This is similar to the behavior of the CLWB instruction in some processors.

期限アクセラレータ（ｔｅｒｍａｃｃｅｌｅｒａｔｏｒ）は、ホストプロセッサ上で実行中のソフトウェアにより用いられ得る緩く結合したエージェントを参照して、任意の種類の計算又はＩ／Ｏタスクをオフロード又は実行するために本明細書でときどき用いられる。アクセラレータ及び使用モデルのタイプに応じて、これらは、メモリ又はストレージ、計算、通信又はこれらの任意の組み合わせに対するデータ移動を実行するタスクであり得る。 The term "accelerator" is sometimes used herein to refer to a loosely coupled agent that software running on a host processor can use to offload or perform any kind of compute or I/O task. Depending on the type of accelerator and the usage model, these can be tasks that perform data movement to or from memory or storage, computation, communication, or any combination of these.

「緩く結合」とは、ホストソフトウェアにより、これらのアクセラレータがどのようにさらされ、アクセスされるかを指す。具体的には、これらは、プロセッサＩＳＡ拡張としてさらされることはなく、代わりに、プラットフォーム上のＰＣＩエクスプレス可算エンドポイントデバイスとしてさらされる。緩い結合は、これらのエージェントが、ホストソフトウェアからのワーク要求を受け入れ、ホストプロセッサに対して非同期的に動作することを可能にする。 "Loosely coupled" refers to how these accelerators are exposed to and accessed by host software. Specifically, they are not exposed as processor ISA extensions, but as PCI Express enumerable endpoint devices on the platform. Loose coupling allows these agents to accept work requests from host software and operate asynchronously to the host processor.

「アクセラレータ」は、プログラマブルエージェント（例えば、ＧＰＵ／ＧＰＧＰＵ）、固定機能エージェント（例えば、圧縮又は暗号化エンジン）、又は、フィールドプログラマブルゲートアレイ（ＦＰＧＡ）などの再構成可能エージェントであり得る。これらのいくつかは、計算オフロードに用いられ、一方、その他（例えば、ＲＤＭＡ又はホストファブリックインタフェース）は、パケット処理、通信、ストレージ又はメッセージパッシングオペレーションに用いられてよい。 An "accelerator" can be a programmable agent (e.g., GPU/GPGPU), a fixed function agent (e.g., compression or encryption engine), or a reconfigurable agent such as a field programmable gate array (FPGA). Some of these may be used for computation offload, while others (e.g., RDMA or host fabric interfaces) may be used for packet processing, communication, storage, or message passing operations.

アクセラレータデバイスは、オンダイ（すなわち、プロセッサと同じダイ）、オンパッケージ、チップセット、マザーボードを含む、異なるレベルで物理的に集積されてよい、又は、別個のＰＣＩｅ接続型デバイスであり得る。集積されたアクセラレータについて、たとえ、ＰＣＩエクスプレスエンドポイントデバイスとして列挙されるとしても、これらのアクセラレータのいくつかは、（オンダイコヒーレントファブリック又は外部コヒーレントインタフェースに）コヒーレントに取り付けられる一方、その他は、内部非コヒーレントインタフェース又は外部のＰＣＩエクスプレスインタフェースに取り付けられ得る。 Accelerator devices may be physically integrated at different levels, including on-die (i.e., on the same die as the processor), on-package, chipset, motherboard, or may be separate PCIe-attached devices. For integrated accelerators, even if they are listed as PCI Express endpoint devices, some of these accelerators may be coherently attached (to the on-die coherent fabric or to an external coherent interface), while others may be attached to an internal non-coherent interface or to an external PCI Express interface.

概念的なレベルでの、「アクセラレータ」及び高性能Ｉ／Ｏデバイスコントローラは同様である。これらを区別するものは、統合／共有仮想メモリ、ページング可能なメモリ上で動作する能力、ユーザモードワークサブミッション、タスクスケジューリング／プリエンプション及び低レイテンシ同期に対するサポートなどの機能である。その結果、アクセラレータは、新たな改良型の高性能Ｉ／Ｏデバイスのカテゴリとみなされ得る。 At a conceptual level, "accelerators" and high-performance I/O device controllers are similar. What distinguishes them are features such as unified/shared virtual memory, the ability to operate on pageable memory, support for user-mode work submission, task scheduling/preemption, and low-latency synchronization. As a result, accelerators can be considered a new and improved category of high-performance I/O devices.

オフロードオペレーションモデル Offload operation model

アクセラレータオフロードオペレーションモデルは、３つの利用カテゴリに広く分類され得る。 Accelerator offload operation models can be broadly categorized into three usage categories:

１．ストリーミング：ストリーミングオフロードモデルでは、小さい単位のワークが、高いレートでアクセラレータにストリーミングされる。この利用の典型的な例は、高いレートで、様々なタイプのパケット処理を実行するネットワークデータプレーンである。 1. Streaming: In the streaming offload model, small units of work are streamed to the accelerator at high rates. A typical example of this use case is a network data plane performing various types of packet processing at high rates.

２．低レイテンシ：いくつかのオフロード利用に関して、オフロードオペレーション（アクセラレータへのタスクのディスパッチ及びそれを実行するアクセラレータの両方）のレイテンシは重要な意味を持つ。この利用の例は、ホストファブリックを介した遠隔取得、プット及びアトミックオペレーションを含む低レイテンシメッセージパッシング構成モデルである。 2. Low latency: For some offload usages, the latency of the offload operation (both dispatching the task to the accelerator and the accelerator acting on it) is critical. An example of this usage is low-latency message-passing constructs, including remote get, put, and atomic operations across a host fabric.

３．スケーラブル：スケーラブルなオフロードは、デバイス上でサポートされる複数のワークキュー又は複数のドアベルなどのアクセラレータデバイスにより課される制約なしで、多数（限りない数）のクライアントアプリケーション（仮想マシン内及び仮想マシンを介して）に（例えば、リング３などの階層保護ドメイン内の最も高いリングから）計算アクセラレータのサービスが直接的にアクセス可能な利用を指す。本明細書で説明されるアクセラレータデバイス及びプロセッサ相互接続のいくつかのは、このカテゴリに含まれる。ＧＰＵ、ＧＰＧＰＵ、ＦＰＧＡ又は圧縮アクセラレータ、又は、メッセージパッシングなど、ワークのタイムシェアリング／スケジューリングをサポートする計算オフロードデバイスに適用されるそのようなスケーラビリティは、ロック無しオペレーションに関する大きなスケーラビリティ要件を有するエンタープライズデータベースなどを利用する。 3. Scalable: Scalable offload refers to usages where the services of a compute accelerator are directly accessible (e.g., from the highest ring in a hierarchical protection domain, such as ring 3) to a large (unbounded) number of client applications (within and across virtual machines), without constraints imposed by the accelerator device such as the number of work queues or the number of doorbells supported on the device. Several of the accelerator devices and processor interconnects described herein fall into this category. Such scalability applies to compute offload devices that support time-sharing/scheduling of work, such as GPU, GPGPU, FPGA, or compression accelerators, and to message-passing usages, such as enterprise databases with large scalability requirements for lock-less operation.

オフロードモデルにわたるワークディスパッチ Work dispatch across offload models

上記のオフロードオペレーションモデルのそれぞれは、以下に説明されるような独自のワークディスパッチの課題を負う。 Each of the above offload operation models presents its own work dispatch challenges, which are described below.

１．ストリーミングオフロードの利用のためのワークディスパッチ 1. Work dispatch for using streaming offload

ストリーミングの利用に関して、典型的なワークディスパッチモデルは、メモリに存在するワークキューを用いることである。具体的には、デバイスは、メモリ内のワークキューの位置及びサイズを構成される。ハードウェアは、新たなワーク要素をワークキューに加えた場合にソフトウェアにより更新されるドアベル（テールポインタ）レジスタを実装する。ハードウェアは、ワークキュー要素上の生産者－消費者フロー制御を強化するために、ソフトウェアに関する現在のヘッドポインタを報告する。ストリーミングの利用に関して、典型的なモデルは、（多くの場合、ソフトウェアによるＵＣＭＭＩＯ読み出しのオーバヘッドを回避するために、ハードウェアによりホストメモリ内で維持される）ソフトウェアにキャッシュされたヘッドポインタ及びテールポインタを調べることによりワークキュー内に空きがあるか否かをチェックし、新たなワーク要素をメモリに存在するワークキューに加え、デバイスへのドアベルレジスタ書き込みを用いてテールポインタを更新するソフトウェアのためのものである。 For streaming usage, the typical work dispatch model is to use a work queue that resides in memory. Specifically, the device is configured with the location and size of the work queue in memory. The hardware implements a doorbell (tail pointer) register that is updated by software when adding a new work element to the work queue. The hardware reports the current head pointer to the software to enforce producer-consumer flow control on the work queue elements. For streaming usage, the typical model is for the software to check if there is space in the work queue by looking at the head and tail pointers cached in software (often maintained in host memory by the hardware to avoid the overhead of a UC MMIO read by the software), add the new work element to the work queue that resides in memory, and update the tail pointer using a doorbell register write to the device.
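
The producer-consumer flow described above can be modeled in software as a ring buffer with a head and a tail index. This is a toy model, not driver code: the class and method names are hypothetical, the doorbell is a plain attribute standing in for the device's MMIO tail register, and leaving one slot unused to tell "full" from "empty" is a common convention assumed here rather than something the text specifies.

```python
class MemoryWorkQueue:
    """Toy model of a memory-resident work queue with a doorbell (tail) register."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.slots = [None] * size
        self.tail = 0         # producer index, owned by software
        self.cached_head = 0  # head pointer as last reported by hardware
        self.doorbell = 0     # stands in for the device's MMIO tail register

    def free_slots(self) -> int:
        # One slot stays empty so that a full queue differs from an empty one.
        return (self.cached_head - self.tail - 1) % self.size

    def submit(self, work) -> bool:
        """Software side: check space, write the element, then ring the doorbell."""
        if self.free_slots() == 0:
            return False
        self.slots[self.tail] = work
        self.tail = (self.tail + 1) % self.size
        self.doorbell = self.tail  # the doorbell write publishes the new tail
        return True

    def device_consume(self):
        """Hardware side: take one element and report the advanced head."""
        if self.cached_head == self.doorbell:
            return None
        work = self.slots[self.cached_head]
        self.cached_head = (self.cached_head + 1) % self.size
        return work
```

In this model the space check reads only the cached head and the software-owned tail, which mirrors the point that software avoids a UC MMIO read on the submission path.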

ドアベル書き込みは、典型的には、４バイト又は８バイトのキャッシュ不能な（ＵＣ）、ＭＭＩＯへの、書き込みである。いくつかのプロセッサにおいて、ＵＣ書き込みは、（生産者－消費者利用のために必要とされる）ＵＣ書き込みを発行する前に、より古い格納がグローバルに監視されることを確保するが、ＵＣ書き込みがプラットフォームによりポストされるまでに発行されてしまうことからプロセッサパイプライン内のすべての若い方の格納をブロックする直列化されたオペレーションである。Ｘｅｏｎサーバプロセッサ上でのＵＣ書き込み動作に対する典型的なレイテンシは、８０～１００ナノ秒のオーダにあり、すべての若い方の格納オペレーションがコアによりブロックされている時間の間、ストリーミングオフロード性能を制限する。 Doorbell writes are typically 4-byte or 8-byte uncacheable (UC) writes to MMIO. On some processors, a UC write is a serialized operation: it ensures that older stores are globally observed before the UC write is issued (which producer-consumer usages require), but it also blocks all younger stores in the processor pipeline from being issued until the UC write is posted by the platform. The typical latency of a UC write operation on a Xeon server processor is on the order of 80-100 nanoseconds, during which all younger store operations are blocked by the core, limiting streaming offload performance.

ＵＣドアベル書き込みに続いて若い方の格納の直列化に取り組む１つのアプローチは、ドアベル書き込みのためのライトコンバイニング（ＷＣ）格納オペレーション（ＷＣの弱いオーダリングに起因する）を用いることであるが、ドアベル書き込みのためにＷＣ格納を用いることは、いくつかの課題を負う。ドアベル書き込みサイズ（典型的には、ＤＷＯＲＤ又はＱＷＯＲＤ）は、キャッシュラインサイズより小さい。これらの部分的な書き込みは、潜在的なライトコンバイニング機会のためのそのライトコンバイニングバッファ（ＷＣＢ）において、プロセッサが部分的な書き込みを維持することに起因して、追加のレイテンシを発生させ、プロセッサから発行されるドアベル書き込みに関するレイテンシを発生させる。ソフトウェアは、これらに、明示的な格納フェンスを通じて発行されることができ、ＵＣドアベルと同様に、若い方の格納に同じ直列化を発生させる。 One approach to addressing the serialization of younger stores that follow a UC doorbell write is to use a write combining (WC) store operation for the doorbell write (taking advantage of WC weak ordering). However, using WC stores for doorbell writes raises several challenges. The doorbell write size (typically a DWORD or QWORD) is smaller than the cache line size. These partial writes incur additional latency because the processor holds them in its write combining buffers (WCBs) for potential write combining opportunities, which delays the doorbell write being issued from the processor. Software can force them to be issued with an explicit store fence, but that incurs the same serialization of younger stores as a UC doorbell.

ＷＣにマッピングされたＭＭＩＯに関する別の問題点は、（ＭＯＶＮＴＤＱＡを用いた）誤った予測及び推測的な読み出しが、（読み出側に影響を与え得るレジスタを有する）ＷＣにマッピングされたＭＭＩＯにさらされることである。ＵＣマッピングＭＭＩＯレジスタの残りよりも別々のページにおけるＷＣマッピングドアベルレジスタをデバイスがホストする必要があるので、これに対処することはデバイスにとって面倒である。これは、仮想化利用での課題も負うことになり、ＶＭＭソフトウェアは、ゲストメモリタイプを無視して、ＥＰＴページテーブルを用いてゲストにさらされる任意のデバイスＭＭＩＯに対するＵＣマッピングを強制することはもはやできない。 Another issue with WC-mapped MMIO is that mis-predicted and speculative reads (with MOVNTDQA) are exposed to the WC-mapped MMIO, which may contain registers with read side effects. Addressing this is cumbersome for devices, because it requires them to host the WC-mapped doorbell registers in pages separate from the rest of the UC-mapped MMIO registers. It also poses challenges in virtualized usages, where VMM software can no longer ignore the guest memory type and force UC mapping, using EPT page tables, for any device MMIO exposed to the guest.

本明細書で説明されるＭＯＶＤＩＲＩ命令は、これらのストリーミングオフロードの利用でのドアベル書き込みのためにＵＣ又はＷＣ格納を用いる上記の制限に対処する。 The MOVDIRI instruction described herein addresses the above limitations of using UC or WC stores for doorbell writes in these streaming offload usages.

２．低レイテンシオフロードの利用のためのワークディスパッチ 2. Work dispatch for low latency offloading

いくつかのタイプのアクセラレータデバイスは、最小のレイテンシで要求されたオペレーションを完了するために高度に最適化されている。（スループットを最適化された）ストリーミングアクセラレータとは異なって、これらのアクセラレータは、一般に、メモリホスト型ワークキューからワーク要素（及び、いくつかの場合では、同一のデータバッファ）をフェッチするためにＤＭＡの読み出しレイテンシを回避するために、（デバイスＭＭＩＯを通じてさらされる）デバイスホスト型ワークキューを実装する。代わりに、ホストソフトウェアは、ワーク記述子（及び、いくつかの場合では、データも）を直接的に書き込むことによりワークを、デバイスＭＭＩＯ通じてさらされたデバイスホスト型ワークキューにサブミットする。そのようなデバイスの例では、ホストファブリックコントローラ、リモートＤＭＡ（ＲＤＭＡ）デバイス、及び、不揮発性メモリ（ＮＶＭ）エクスプレスなどの新たなストレージコントローラを含む。デバイスホスト型ワークキューの利用は、既存ＩＳＡに関して課題を発生させることはほとんどない。 Some types of accelerator devices are highly optimized to complete the requested operation with minimal latency. Unlike streaming accelerators (which are optimized for throughput), these accelerators commonly implement device-hosted work queues (exposed through device MMIO) to avoid the DMA read latency of fetching work elements (and in some cases even data buffers) from memory-hosted work queues. Instead, host software submits work by writing work descriptors (and in some cases data as well) directly to the device-hosted work queues exposed through device MMIO. Examples of such devices include host fabric controllers, remote DMA (RDMA) devices, and newer storage controllers such as Non-Volatile Memory (NVM) Express. The use of device-hosted work queues poses a few challenges with existing ISAs.

ＵＣ書き込みの直列化オーバヘッドを回避するために、デバイスホスト型ワークキューのＭＭＩＯアドレスは、典型的には、ＷＣとしてマッピングされる。これは、ストリーミングアクセラレータのためにＷＣにマッピングされたドアベルと同じ課題をさらす。 To avoid the serialization overhead of UC writes, the MMIO addresses of device-hosted work queues are typically mapped as WC. This exposes the same challenges as WC-mapped doorbells for streaming accelerators.

さらに、デバイスホスト型ワークキューへのＷＣ格納を用いるには、デバイスがいくつかのプロセッサの書き込みアトミック性挙動を守る必要がある。例えば、最大で８バイトサイズの書き込み動作のアトミック性がいくつかのプロセッサは、キャッシュライン境界内（及び、ロックオペレーションのため）に書き込むことを保証するだけで、保証されたいずれの書き込み完了のアトミック性を定義するものではない。書き込み動作のアトミック性は、プロセッサ格納オペレーションが他のエージェントにより監視される粒度であり、プロセッサ命令セットアーキテクチャ及びコヒーレンシプロトコルの特性である。書き込み完了のアトミック性は、キャッシュ不可能な格納処理が受信機（メモリの場合では、メモリコントローラ、又は、ＭＭＩＯの場合ではデバイス）により監視される粒度である。書き込み完了のアトミック性は、書き込み動作のアトミック性より強く、プロセッサ命令セットアーキテクチャだけでなくプラットフォームの機能でもある。書き込み完了のアトミック性なしで、Ｎバイトのキャッシュ不可能な格納処理を実行するプロセッサ命令は、デバイスホスト型ワークキューによる複数の（トーン）書き込みトランザクションとして受信され得る。現在では、デバイスハードウェアは、デバイスホスト型ワークキューに書き込まれたワーク記述子の各ワード又はデータをトラッキングすることにより、そのようなトーン書き込みを守る必要がある。 Furthermore, using WC stores to device-hosted work queues requires devices to guard against the write atomicity behavior of some processors. For example, some processors guarantee write operation atomicity only for writes of up to 8 bytes within a cache line boundary (and for LOCK operations), and do not define any guaranteed write completion atomicity. Write operation atomicity is the granularity at which a processor store operation is observed by other agents, and is a property of the processor instruction set architecture and the coherency protocols. Write completion atomicity is the granularity at which a non-cacheable store operation is observed by the receiver (the memory controller in the case of memory, or the device in the case of MMIO). Write completion atomicity is stronger than write operation atomicity and is a function not only of the processor instruction set architecture but also of the platform. Without write completion atomicity, a processor instruction performing an N-byte non-cacheable store operation may be received by the device-hosted work queue as multiple (torn) write transactions. Currently, device hardware must guard against such torn writes by tracking each word of the work descriptor or data written to the device-hosted work queue.

本明細書で説明されるＭＯＶＤＩＲ６４Ｂ命令は、保証された６４バイト書き込み完了のアトミック性で、６４バイト書き込みをサポートすることにより、上記の制限に対処する。ＭＯＶＤＩＲ６４Ｂは、永続性メモリ（メモリコントローラに取り付けられたＮＶＭ）への書き込み、及び、非透過型ブリッジ（ＮＴＢ）を通じたシステムを介したデータの複製など、その他の利用にとっても有用である。 The MOVDIR64B instruction described herein addresses the above limitations by supporting 64-byte writes with guaranteed 64-byte write completion atomicity. MOVDIR64B is also useful for other uses, such as writing to persistent memory (NVM attached to a memory controller) and replicating data through the system through a non-transparent bridge (NTB).

３．スケーラブルなオフロードの利用のためのワークディスパッチ 3. Work dispatch for scalable offload utilization

アプリケーションからＩ／Ｏデバイスにワークをサブミットするための従来のアプローチは、Ｉ／Ｏコントローラデバイスに対してカーネルデバイスドライバを通じた要求を転送するカーネルＩ／Ｏスタックにシステムコールを行うことに関する。このアプローチは、スケーラブルである（任意の数のアプリケーションがデバイスのサービスを共有できる）一方、多くの場合、高性能デバイス及びアクセラレータに関する性能のボトルネックとなる直列化カーネルＩ／Ｏスタックのレイテンシ及びオーバヘッドを発生させる。 The traditional approach for submitting work from an application to an I/O device involves making a system call to the kernel I/O stack, which forwards the request through a kernel device driver to the device I/O controller. While this approach is scalable (any number of applications can share the services of the device), it introduces the latency and overhead of a serialized kernel I/O stack that often becomes a performance bottleneck for high performance devices and accelerators.

低オーバヘッドのワークディスパッチをサポートするために、いくつかの高性能デバイスは、デバイスに対する直接的なワークディスパッチを可能にし、ワーク完了をチェックするダイレクトリング３アクセスをサポートする。このモデルでは、デバイスいくつかのリソース（ドアベル、ワークキュー、完了キューなど）がアプリケーションの仮想アドレス空間に割り当てられ、マッピングされる。一旦マッピングされると、リング３ソフトウェア（ユーザモードのドライバ又はライブラリ）は、アクセラレータにワークを直接ディスパッチできる。共有仮想メモリ（ＳＶＭ）機能をサポートするデバイスに関して、ドアベル及びワークキューがマッピングされるアプリケーション処理の処理アドレス空間識別子（ＰＡＳＩＤ）を識別するために、ドアベル及びワークキューは、カーネルモードドライバによりセットアップされる。特定のワークキューを通じてディスパッチされたワークアイテムを処理する場合、デバイスは、Ｉ／Ｏメモリ管理ユニット（ＩＯＭＭＵ）を通じた物理アドレス変換に対して仮想的なそのワークキューに構成されるそれぞれのＰＡＳＩＤを用いる。 To support low-overhead work dispatch, some high-performance devices support direct ring 3 access, allowing work to be dispatched directly to the device and work completions to be checked from ring 3. In this model, some resources of the device (doorbell, work queue, completion queue, etc.) are allocated and mapped into the application's virtual address space. Once mapped, ring 3 software (e.g., a user-mode driver or library) can dispatch work directly to the accelerator. For devices that support shared virtual memory (SVM), the doorbell and work queues are set up by the kernel-mode driver to identify the process address space identifier (PASID) of the application process to which the doorbell and work queue are mapped. When processing a work item dispatched through a particular work queue, the device uses the PASID configured for that work queue for virtual-to-physical address translation through the I/O memory management unit (IOMMU).

ダイレクトリング３ワークサブミッションに関する課題の１つは、スケーラビリティ問題である。アクセラレータデバイスに直接的にワークをサブミットできるアプリケーションクライアントの数は、アクセラレータデバイスによりサポートされるキュー／ドアベル（又は、デバイスホスト型ワークキュー）の数に依存する。これは、ドアベル又はデバイスホスト型ワークキューが、アプリケーションクライアントに静的に割り当てられ／マッピングされ、アクセラレータデバイス設計によりサポートされるこれらのリソースの数が固定されるからである。いくつかのアクセラレータデバイスは、（アプリケーションに対する要求についてドアベルを動的に取り外して再度取り付けることにより）それらが有するドアベルのリソースを過度に収容することで、このスケーラビリティの課題に「対処する」しようとしているが、多くの場合、拡大することが煩雑で難しい。これらが異なる仮想マシンに割り当てられる様々な仮想機能（ＶＦ）にわたって区分化される必要があるので、Ｉ／Ｏ仮想化（例えば、単一のルートＩ／Ｏ仮想化（ＳＲ－ＩＯＶ））をサポートするデバイスに関して、制限されたドアベル／ワークキューのリソースは、さらに制約される。 One of the challenges with direct ring 3 work submission is scalability. The number of application clients that can submit work directly to an accelerator device depends on the number of queues/doorbells (or device-hosted work queues) supported by the accelerator device. This is because a doorbell or device-hosted work queue is statically allocated/mapped to an application client, and the number of these resources supported by the accelerator device design is fixed. Some accelerator devices attempt to work around this scalability challenge by overcommitting the doorbell resources they have (dynamically detaching and re-attaching doorbells on demand for an application), but this is often cumbersome and difficult to scale. For devices that support I/O virtualization (e.g., Single Root I/O Virtualization (SR-IOV)), the limited doorbell/work queue resources are constrained further, because they must be partitioned across the different virtual functions (VFs) assigned to different virtual machines.

スケーリング問題は、ロック無しオペレーション用のデータベースなどの企業アプリケーションにより用いられる（６４Ｋ～１ＭのキューペアをサポートするＲＤＭＡデバイスのいくつかを有する）高性能なメッセージパッシングアクセラレータにとって、及び、多数のクライアントからサブミットされたタスクにわたるアクセラレータのリソースを共有することをサポートする計算アクセラレータにとって、最も重要な意味を持つ。 The scaling issue is most important for high performance message passing accelerators used by enterprise applications such as databases for lock-free operations (with some RDMA devices supporting 64K-1M queue pairs) and for computational accelerators that support sharing of accelerator resources across tasks submitted by multiple clients.

本明細書で説明されるＥＮＱＣＭＤ／Ｓ命令は、アクセラレータ上のワークキューのリソースをサブスクライブし共有する限りない数のクライアントをイネーブルにするために、上記のスケーリングの制限に対処する。 The ENQCMD/S instructions described herein address the above scaling limitations to enable an unlimited number of clients to subscribe and share the resources of a work queue on an accelerator.

一実施例では、直接格納及びエンキュー格納を含むプロセッサコアによる新たなタイプの格納オペレーションを含む。 In one embodiment, new types of store operations by the processor core include direct stores and enqueue stores.

一実施例において、直接格納は、本明細書で説明されるＭＯＶＤＩＲＩ及びＭＯＶＤＩＲ６４Ｂ命令により生成される。 In one embodiment, direct stores are generated by the MOVDIRI and MOVDIR64B instructions described herein.

キャッシュ可能性：ＵＣ及びＷＣ格納と同様に、直接格納は、キャッシュ可能ではない。キャッシュされるアドレスに直接格納が発行された場合、直接格納前に、ラインは、キャッシュからライトバック（修正される場合）及び無効にされる。 Cacheability: Like UC and WC stores, direct stores are not cacheable. When a direct store is issued to an address that is cached, the line is written back (if modified) and invalidated from the cache before the direct store.

メモリオーダリング：ＷＣ格納と同様に、直接格納は、弱くオーダリングされる。具体的には、それらは、より古いＷＢ／ＷＣ／ＮＴ格納、ＣＬＦＬＵＳＨＯＰＴ及びＣＬＷＢに対して、異なるアドレスへのオーダリングは行われない。異なるアドレスへのより若いＷＢ／ＷＣ／ＮＴ格納、ＣＬＦＬＵＳＨＯＰＴ又はＣＬＷＢは、より古い直接格納をパスできる。同じアドレスに対する直接格納は、同じアドレスに対する（直接格納を含む）より古い格納を用いて常にオーダリングされる。直接格納は、格納フェンシング（例えば、ＳＦＥＮＣＥ、ＭＦＥＮＣＥ、ＵＣ／ＷＰ／ＷＴ格納、ロック、ＩＮ／ＯＵＴ命令など）を強化する任意のオペレーションによりフェンスされる。 Memory ordering: Like WC stores, direct stores are weakly ordered. Specifically, they are not ordered with respect to older WB/WC/NT stores, CLFLUSHOPT, and CLWB to different addresses. A younger WB/WC/NT store, CLFLUSHOPT, or CLWB to a different address can pass an older direct store. Direct stores to the same address are always ordered with older stores (including direct stores) to the same address. Direct stores are fenced by any operation that enforces store fencing (e.g., SFENCE, MFENCE, UC/WP/WT stores, lock, IN/OUT instructions, etc.).

ライトコンバイニング：直接格納は、通常のＷＣ格納とは異なるライトコンバイニングの挙動を有する。具体的には、直接格納は、ライトコンバインバッファからの即時エビクションの対象となり、ひいては、同じアドレスへの若い方の格納（直接格納を含む）と組み合わせられない。ライトコンバイニングバッファにおいて保持されるより古いＷＣ／ＮＴ格納は、同じアドレスへの若い方の直接格納と組み合わせられてよく、そのような組み合わせを回避する必要がある利用は、同じアドレスへの直接格納を実行する前にフェンスＷＣ／ＮＴ格納を明示的に格納しなければならない。 Write combining: Direct stores have different write-combining behavior than normal WC stores. Specifically, direct stores are subject to immediate eviction from the write-combining buffer and thus are not combined with younger stores (including direct stores) to the same address. Older WC/NT stores held in the write-combining buffer may be combined with younger direct stores to the same address; usages that need to avoid such combining must explicitly store-fence the WC/NT stores before executing a direct store to the same address.

アトミック性：直接格納は、直接格納を発行する命令の書き込みサイズに関する書き込み完了のアトミック性をサポートする。ＭＯＶＤＩＲＩの場合では、宛先が４バイトアライン（又は、８バイトアライン）の場合、書き込み完了のアトミック性は、４バイト（又は、８バイト）である。ＭＯＶＤＩＲ６４Ｂに関して、宛先は、６４バイトアラインに強制され、書き込み完了のアトミック性は、６４バイトである。書き込み完了のアトミック性は、メモリコントローラ又はルートコンプレックスにより処理されるような複数の書き込みトランザクションに直接格納が分裂（ｔｏｒｎ）されていないことを保証する。直接格納をサポートするプロセッサ上のルートコンプレックス実装は、単一の非トーン・ポステッド（ｎｏｎ－ｔｏｒｎｐｏｓｔｅｄ）書き込みトランザクションとして、外部のＰＣＩエクスプレスファブリック（及び、ＰＣＩエクスプレスのオーダリングに従うＳｏＣ内の内部Ｉ／Ｏファブリック）上で直接格納が転送されることを保証する。任意のエージェント（プロセッサ又は非プロセッサエージェント）からメモリ位置への読み出し処理は、直接格納オペレーションを発行する命令により書き込まれたデータのすべてか、そのいずれでもないかのいずれか一方を参照する。 Atomicity: Direct stores support atomicity of write completion with respect to the write size of the instruction issuing the direct store. In the case of MOVDIRI, if the destination is 4-byte aligned (respectively 8-byte aligned), the atomicity of the write completion is 4-byte (respectively 8-byte). For MOVDIR64B, the destination is forced to be 64-byte aligned and the atomicity of the write completion is 64-byte. The atomicity of the write completion ensures that the direct store is not torn into multiple write transactions as processed by the memory controller or root complex. The root complex implementation on processors that support direct stores ensures that the direct store is forwarded on the external PCI Express fabric (and the internal I/O fabric within the SoC which follows PCI Express ordering) as a single non-torn posted write transaction. A read operation from any agent (processor or non-processor agent) to a memory location sees either all or none of the data written by instructions issuing direct store operations.
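The alignment rules in the atomicity paragraph above can be summarized as a small check. The following is only a sketch: the function names are hypothetical, and returning 0 for a misaligned MOVDIRI destination is an assumption that simply means "no atomicity guarantee is stated in the text for this case".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write-completion atomicity (in bytes) that the text
 * states for a MOVDIRI direct store with a 4- or 8-byte operand.
 * Returns 0 when the destination is not aligned to the operand size
 * (assumption: the text gives no guarantee for that case). */
static size_t movdiri_atomicity(uintptr_t dest, size_t operand_size)
{
    if (operand_size != 4 && operand_size != 8)
        return 0;                       /* MOVDIRI moves 4 or 8 bytes only */
    return (dest % operand_size == 0) ? operand_size : 0;
}

/* Hypothetical helper: MOVDIR64B requires a 64-byte aligned destination,
 * for which the write-completion atomicity is 64 bytes. */
static int movdir64b_dest_ok(uintptr_t dest)
{
    return (dest & 63u) == 0;
}
```

A driver could use checks of this kind when validating a doorbell or work queue register address before issuing direct stores to it.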

宛先メモリタイプの無視：直接格納は、（ＵＣ／ＷＰタイプを含む）宛先アドレスメモリタイプを無視し、常に弱いオーダリングに従う。これは、マッピングされたＵＣのメモリタイプ毎のＵＣオーダリングに従う通常のＭＯＶオペレーションを用いて、厳密な直列化要件を有し得る他のレジスタにアクセスし続けている間、ソフトウェアが、直接格納命令（ＭＯＶＤＩＲＩ又はＭＯＶＤＩＲ６４Ｂ）を用いて、デバイスＭＭＩＯをＵＣとしてマッピングし、特定のレジスタ（例えば、ドアベル又はデバイスホスト型ワークキューレジスタ）にアクセスすることを可能にする。これは、直接格納命令をゲストソフトウェア内から動作させることも可能にする一方、（デバイスに固有の知識を有していない）仮想マシンモニタ（ＶＭＭ）ソフトウェアは、プロセッサ拡張ページテーブル（ＥＰＴ）内のＵＣとしてゲスト露出ＭＭＩＯをマッピングし、ゲストメモリタイプを無視する。 Ignoring destination memory type: Direct stores ignore the destination address memory type (including UC/WP types) and always follow weak ordering. This allows software to map device MMIO as UC and access certain registers (e.g. doorbell or device hosted work queue registers) using direct store instructions (MOVDIRI or MOVDIR64B) while continuing to access other registers that may have strict serialization requirements using normal MOV operations that follow the UC ordering per the memory type of the mapped UC. This also allows direct store instructions to work from within guest software, while virtual machine monitor (VMM) software (which has no device specific knowledge) maps guest exposed MMIO as UC in the processor extended page table (EPT) and ignores the guest memory type.

直接格納をサポートするＳｏＣは、以下のように、直接格納に関する書き込み完了のアトミック性を確保する必要がある。 SoCs that support direct store must ensure atomicity of write completion for direct store as follows:

メインメモリへの直接格納：メインメモリへの直接格納に関して、コヒーレントファブリック及びシステムエージェントは、直接格納内のすべてのデータバイトが、単一の（非トーン（ｎｏｎ－ｔｏｒｎ））書き込みトランザクションとして、メモリへの要求に対してホームエージェント又は他のグローバルな可観測性（ＧＯ）ポイントに発行されることを確保する。永続性メモリをサポートするプラットフォームに関して、ホームエージェント、メモリコントローラ、メモリ側キャッシュ、インラインメモリ暗号化エンジン、永続性メモリを取り付けるメモリバス（例えば、ＤＤＲ－Ｔ）、及び、永続性メモリコントローラの各自が、直接格納に対して同じ又はより高い粒度の書き込み完了のアトミック性をサポートしなければならない。したがって、ソフトウェアは、メモリ（揮発性又は永続性）へのＭＯＶＤＩＲ６４Ｂを用いた６４バイトの直接格納を実行でき、すべての６４バイト書き込みがすべてのエージェントによりアトミックに処理されることが保証され得る。永続性メモリへの通常の書き込みと同様に、ソフトウェアが永続性に明示的にコミットする必要がある場合、ソフトウェアは、フェンス／コミット／フェンスシーケンスで直接格納を行う。 Direct stores to main memory: For direct stores to main memory, the coherent fabric and system agent ensure that all data bytes in a direct store are issued to the home agent or other global observability (GO) point for requests to memory as a single (non-torn) write transaction. For platforms that support persistent memory, the home agent, memory controller, memory-side cache, in-line memory encryption engine, memory bus that attaches persistent memory (e.g., DDR-T), and persistent memory controller must each support the same or higher granularity of write-completion atomicity for direct stores. Thus, software can perform a 64-byte direct store to memory (volatile or persistent) using MOVDIR64B and be guaranteed that the entire 64-byte write is processed atomically by all agents. As with normal writes to persistent memory, if software needs to explicitly commit to persistence, it follows the direct store with a fence/commit/fence sequence.

メモリマッピングされたＩ／Ｏへの直接格納：メモリマッピングされたＩ／Ｏ（ＭＭＩＯ）への直接格納に関して、コヒーレントファブリック及びシステムエージェントは、直接格納におけるすべてのデータバイトが、単一の（非トーン（ｎｏｎ－ｔｏｒｎ））書き込みトランザクションとして、ルートコンプレックス（ＭＭＩＯへの要求に対してグローバルな可観測性ポイント）に発行されることを確保しなければならない。ルートコンプレックス実装は、ＰＣＩエクスプレスルートコンプレックス統合エンドポイント（ＲＣＩＥＰ）及びルートポート（ＲＰ）を取り付ける内部Ｉ／Ｏファブリック上で単一の（非トーン（ｎｏｎ－ｔｏｒｎ））ポステッド書き込みトランザクションとして、各直接格納が処理及び転送されることを確保しなければならない。ＰＣＩエクスプレスルートポート及びスイッチポートは、単一のポステッド書き込みトランザクションとして各直接格納を転送しなければならない。書き込み完了のアトミック性は、セカンダリブリッジ（例えば、レガシＰＣＩ、ＰＣＩ－Ｘブリッジ）又はセカンダリバス（例えば、ＵＳＢ、ＬＰＣなど）上、又は、その背後でデバイスをターゲットとする直接格納のために規定又は保証されていない。 Direct stores to memory-mapped I/O: For direct stores to memory-mapped I/O (MMIO), the coherent fabric and system agent must ensure that all data bytes in a direct store are issued to the root complex (the global observability point for requests to MMIO) as a single (non-torn) write transaction. The root complex implementation must ensure that each direct store is processed and forwarded as a single (non-torn) posted write transaction on the internal I/O fabric that attaches PCI Express root complex integrated endpoints (RCIEPs) and root ports (RPs). PCI Express root ports and switch ports must forward each direct store as a single posted write transaction. Write-completion atomicity is not specified or guaranteed for direct stores targeting devices on or behind secondary bridges (e.g., legacy PCI, PCI-X bridges) or secondary buses (e.g., USB, LPC, etc.).

いくつかのＳｏＣの実装は、ＷＣ書き込み要求に対する書き込み完了のアトミック性を既に保証していることに留意する。具体的には、部分的なラインＷＣ書き込み（ＷＣｉＬ）及び全ラインＷＣ書き込み（ＷＣｉＬＦ）は、システムエージェント、メモリコントローラ、ルートコンプレックス及びＩ／Ｏファブリックにより、書き込み完了のアトミック性で既に処理されている。そのような実施例に関して、プロセッサが直接書き込みをＷＣ書き込みと区別する必要はなく、直接格納とＷＣ格納との挙動の違いは、プロセッサコアの内部にある。したがって、直接書き込みのための内部又は外部ファブリック仕様に対して提案される変更はない。 Note that some SoC implementations already guarantee atomicity of write completion for WC write requests. In particular, partial line WC writes (WCiL) and full line WC writes (WCiLF) are already handled with atomicity of write completion by the system agent, memory controller, root complex, and I/O fabric. For such embodiments, there is no need for the processor to distinguish direct writes from WC writes; the behavioral difference between direct and WC stores is internal to the processor core. Therefore, no changes are proposed to the internal or external fabric specifications for direct writes.

ＰＣＩエクスプレスエンドポイント又はＲＣＩＥＰにより受信された直接書き込みの処理は、デバイス実装に固有のものである。デバイスのプログラミングインタフェースに応じて、デバイス及びそのドライバは、直接格納命令（例えば、ＭＯＶＤＩＲ６４Ｂ）を用いて常に書き込まれるそのレジスタのいくつか（例えば、ドアベルレジスタ又はデバイスホスト型ワークキューレジスタ）を必要とし、デバイス内でアトミックにこれらを処理してよい。デバイス上の他のレジスタへの書き込みは、アトミック性をなんら考慮又は期待することなく、デバイスにより処理され得る。ＲＣＩＥＰに関して、書き込みのアトミック性要件を有するレジスタがサイドバンド又はプライベートワイヤインタフェースを通じたアクセスのために実装されている場合、そのような実施例は、実装に固有の手段を通じて書き込みのアトミック性の特性を確保しなければならない。 Handling of direct writes received by a PCI Express endpoint or RCIEP is device implementation specific. Depending on the device's programming interface, the device and its drivers may require some of its registers (e.g., doorbell registers or device-hosted work queue registers) to always be written with a direct store instruction (e.g., MOVDIR64B) and handle these atomically within the device. Writes to other registers on the device may be handled by the device without any consideration or expectation of atomicity. With respect to the RCIEP, if registers with write atomicity requirements are implemented for access through a sideband or private wire interface, such embodiments must ensure the write atomicity property through implementation specific means.

一実施例におけるエンキュー格納は、本明細書で説明されるＥＮＱＣＭＤ及びＥＮＱＣＭＤＳ命令により生成される。エンキュー格納の指定されたターゲットは、アクセラレータデバイス上の共有のワークキュー（ＳＷＱ）である。一実施例において、エンキュー格納は、以下の特性を有する。 In one embodiment, an enqueue store is generated by the ENQCMD and ENQCMDS instructions described herein. The specified target of the enqueue store is a shared work queue (SWQ) on the accelerator device. In one embodiment, the enqueue store has the following characteristics:

ノンポステッド：エンキュー格納は、ターゲットアドレスへの６４バイトの非ポスト書き込みトランザクションを生成し、成功又はリトライステータスを示す完了応答を受信する。完了応答で返される成功／リトライステータスは、ＥＮＱＣＭＤ／Ｓ命令により（例えば、ゼロフラグにおいて）ソフトウェアに返されてよい。 Non-Posted: The enqueue store generates a 64-byte non-posted write transaction to the target address and receives a completion response indicating success or retry status. The success/retry status returned in the completion response may be returned to software via the ENQCMD/S instruction (e.g., in a zero flag).

キャッシュ可能性：一実施例において、エンキュー格納は、キャッシュ可能ではない。エンキュー格納をサポートするプラットフォームは、エンキュー・ノン・ポステッドライトが、これらの格納を受け入れるために、明示的に可能とされるアドレス（ＭＭＩＯ）範囲に転送されることのみに強制する。 Cacheability: In one embodiment, enqueue stores are not cacheable. Platforms that support enqueue stores enforce that enqueue non-posted writes are only forwarded to address (MMIO) ranges that are explicitly enabled to accept these stores.

メモリオーダリング：エンキュー格納は、ノンポステッド書き込み完了ステータスを有するアーキテクチャの状態（例えば、ゼロフラグ）を更新してよい。したがって、多くても１つのエンキュー格納は、所与の論理プロセッサから未処理となる可能性がある。その意味では、論理プロセッサからのエンキュー格納は、同じ論理プロセッサから発行される別のエンキュー格納を渡すことができない。エンキュー格納は、より古いＷＢ／ＷＣ／ＮＴ格納、ＣＬＦＬＵＳＨＯＰＴ又はＣＬＷＢに対して、異なるアドレスへのオーダリングは行わない。そのようなオーダリングを強制する必要があるソフトウェアは、そのような格納の後、かつ、エンキュー格納前に明示的な格納フェンシングを用いてよい。エンキュー格納は、常に、より古い格納で同じアドレスにオーダリングされる。 Memory ordering: An enqueue store may update architectural state (e.g., the zero flag) with the non-posted write completion status. Thus, at most one enqueue store can be outstanding from a given logical processor. In that sense, an enqueue store from a logical processor cannot pass another enqueue store issued from the same logical processor. Enqueue stores are not ordered with respect to older WB/WC/NT stores, CLFLUSHOPT, or CLWB to different addresses. Software that needs to enforce such ordering may use explicit store fencing after such stores and before the enqueue store. Enqueue stores are always ordered with older stores to the same address.

アライメント：ＥＮＱＣＭＤ／Ｓ命令は、エンキュー格納宛先アドレスが６４バイトアラインであることを強制する。 Alignment: The ENQCMD/S instruction forces the enqueue store destination address to be 64-byte aligned.

アトミック性：ＥＮＱＣＭＤ／Ｓ命令により生成されるエンキュー格納は、６４バイト書き込み完了のアトミック性をサポートする。書き込み完了のアトミック性は、ルートコンプレックスにより処理されるような複数のトランザクションにエンキュー格納が分裂（ｔｏｒｎ）されていなことを保証する。エンキュー格納をサポートするプロセッサ上のルートコンプレックス実装は、単一の（非トーン（ｎｏｎ－ｔｏｒｎ））６４バイトの非ポスト書き込みトランザクションとして、各エンキュー格納がエンドポイントデバイスに転送されることを保証する。 Atomicity: Enqueue stores generated by the ENQCMD/S instructions support 64-byte write-completion atomicity. Write-completion atomicity ensures that an enqueue store is not torn into multiple transactions as processed by the root complex. Root complex implementations on processors that support enqueue stores ensure that each enqueue store is forwarded to the endpoint device as a single (non-torn) 64-byte non-posted write transaction.

宛先メモリタイプの無視：直接格納と同様に、エンキュー格納は、宛先アドレスメモリタイプ（ＵＣ／ＷＰタイプを含む）を無視し、上記で説明されるようなオーダリングに常に従う。これは、通常のＭＯＶの命令を用いて、又は、直接格納（ＭＯＶＤＩＲＩ又はＭＯＶＤＩＲ６４Ｂ）命令を通じて、他のレジスタにアクセスし続けている間、ソフトウェアが、ＥＮＱＣＭＤ／Ｓ命令を用いて、デバイスＭＭＩＯをＵＣとしてマッピングし、共有のワークキュー（ＳＷＱ）レジスタにアクセスし続けることを可能にする。これは、エンキュー格納命令をゲストソフトウェア内から動作させることも可能にする一方、（デバイスに固有の知識を有していない）ＶＭＭソフトウェアは、プロセッサ拡張ページテーブル（ＥＰＴ）内のＵＣとしてゲスト露出ＭＭＩＯをマッピングし、ゲストメモリタイプを無視する。 Ignore destination memory type: Like direct stores, enqueue stores ignore the destination address memory type (including UC/WP type) and always follow the ordering as described above. This allows software to continue to map device MMIO as UC and access shared work queue (SWQ) registers with ENQCMD/S instructions while continuing to access other registers with normal MOV instructions or through direct store (MOVDIRI or MOVDIR64B) instructions. This also allows enqueue store instructions to operate from within guest software, while the VMM software (which has no device-specific knowledge) maps guest exposed MMIO as UC in the processor extended page table (EPT) and ignores the guest memory type.

エンキュー格納に対するプラットフォームの検討 Platform considerations for enqueue stores

いくつかの実施例について、プラットフォーム統合デバイスの特定のセットは、共有のワークキュー（ＳＷＱ）機能をサポートする。これらのデバイスは、内部Ｉ／Ｏファブリックを通じてルートコンプレックスに取り付けられてよい。これらのデバイスは、ＰＣＩエクスプレスルートコンプレックス統合エンドポイント（ＲＣＩＥＰ）、又は、仮想ルートポート（ＶＲＰ）の背後のＰＣＩエクスプレスエンドポイントデバイスのうちのいずれか一方としてホストソフトウェアにさらされ得る。 For some embodiments, a specific set of platform-integrated devices supports shared work queue (SWQ) functionality. These devices may be attached to the root complex through the internal I/O fabric. They may be exposed to host software either as PCI Express root complex integrated endpoints (RCIEPs) or as PCI Express endpoint devices behind virtual root ports (VRPs).

ＳＷＱを有する統合デバイスをサポートするプラットフォームは、そのようなデバイスのみに対する内部Ｉ／Ｏファブリック上でのエンキュー・ノン・ポステッドライト要求の転送を制限すべきである。これは、新たなトランザクションタイプ（エンキュー・ノン・ポステッドライト）が、エンキューが認識していないエンドポイントデバイスによる不正な形式のトランザクション層パケット（ＴＬＰ）として処理されないことを確保するためのものである。 Platforms that support integrated devices with SWQs should restrict the transmission of enqueue non-posted write requests on the internal I/O fabric to only such devices. This is to ensure that the new transaction type (enqueue non-posted write) is not treated as a malformed transaction layer packet (TLP) by enqueue-unaware endpoint devices.

（メインメモリアドレス範囲及びすべての他のメモリマップアドレス範囲を含む）すべての他のアドレスへのエンキュー格納は、プラットフォームにより終了し、通常の（エラーでない）応答が、リトライ完了ステータスと共に発行元のプロセッサに返される。特権のないソフトウェア（ＶＭＸ非ルートモードにおけるリング３ソフトウェア、又は、リング０ソフトウェア）が、ＥＮＱＣＭＤ／Ｓ命令を実行することによりエンキュー・ノンポステッド・書き込みトランザクションを生成できるので、そのようなエンキュー格納の終端上で生成されるプラットフォームのエラーはない。 Enqueue stores to all other addresses (including the main memory address range and all other memory-mapped address ranges) are terminated by the platform, and a normal (non-error) response is returned to the issuing processor with a retry completion status. Because unprivileged software (ring 3 software, or ring 0 software in VMX non-root mode) can generate enqueue non-posted write transactions by executing the ENQCMD/S instructions, no platform error is generated when such an enqueue store is terminated.

ルートコンプレックス実装は、ＳＷＱをサポートする統合デバイスに対する内部Ｉ／Ｏファブリック上での単一の（非トーン（ｎｏｎ－ｔｏｒｎ））ノンポステッド書き込みトランザクションとしてエンキュー格納が処理及び転送されることを確保すべきである。 The root complex implementation should ensure that enqueue stores are processed and forwarded as a single (non-torn) non-posted write transaction on the internal I/O fabric to integrated devices that support SWQs.

プラットフォーム性能の検討 Platform performance considerations

このセクションは、システムエージェント及びシステムエージェントによりエンキュー格納の処理におけるいくつかの性能の検討を説明する。 This section describes some performance considerations for the processing of enqueue stores by the system agent.

エンキュー格納のためのシステムエージェントトラッカー（ＴＯＲ）エントリ割り当てに対して緩和されたオーダリング。 Relaxed ordering for system agent tracker (TOR) entry allocation for enqueue stores.

メモリの整合性を維持するために、システムエージェント実装は、典型的には、コヒーレントメモリ及びＭＭＩＯに対するキャッシュラインアドレス（ＴＯＲエントリを割り当てる場合）への要求に対して厳密なオーダリングを強制する。これは、コヒーレントメモリアクセスに対して総合的なオーダリングをサポートするために必要とされる一方、エンキュー格納に対するこの厳密なオーダリングは、性能の問題を負う。これは、エンキュー格納が、デバイス上の共有のワークキュー（ＳＷＱ）をターゲットとしており、それによって、同じ宛先ＳＷＱアドレスを有する複数の論理プロセッサからエンキュー格納要求を発行させることが一般的だからである。また、システムエージェントにポストされた通常の格納とは異なり、エンキュー格納は、ノンポステッドであり、読み出しと同様のレイテンシを発生させる。共有のワークキューに対して未処理のエンキュー格納１つだけ許可するという条件を無効にするためには、システムエージェント実装は、同じアドレスへのエンキュー格納要求に対する厳密なオーダリングを緩和することが必要とされ、代わりに、同じアドレスに対する複数のインフライト（ｉｎ－ｆｌｉｇｈｔ）エンキュー格納のためのＴＯＲ割り当てを許可する。論理プロセッサは、同時に多くても１つのエンキュー格納だけを発行し得るので、システムエージェント／プラットフォームは、オーダリングを心配することなく独立に各エンキュー格納を処理できる。 To maintain memory consistency, system agent implementations typically enforce strict ordering on requests to cache line addresses (when allocating TOR entries) for coherent memory and MMIO. While this is required to support overall ordering for coherent memory accesses, this strict ordering on enqueue stores incurs performance issues because it is common for enqueue stores to target a shared work queue (SWQ) on the device, thereby causing enqueue store requests to be issued from multiple logical processors with the same destination SWQ address. Also, unlike regular stores posted to the system agent, enqueue stores are non-posted and incur similar latency to reads. To negate the requirement of allowing only one outstanding enqueue store to the shared work queue, a system agent implementation is required to relax the strict ordering on enqueue store requests to the same address, and instead allow TOR allocation for multiple in-flight enqueue stores to the same address. A logical processor can only issue at most one enqueue store at a time, so the system agent/platform can process each enqueue store independently without worrying about ordering.

Ｉ／Ｏブリッジエージェントにおける複数の未処理のエンキュー・ノン・ポステッドライトのサポート。 Support for multiple outstanding enqueue non-posted writes in I/O bridge agents.

Ｉ／Ｏブリッジ実装は、典型的には、少数への（多くの場合、単一の要求への）ダウンストリームパスにおいてサポートされるノンポステッド（読み出し）要求の数を制限する。これは、（通常ＵＣ読み出しである）ＭＭＩＯへのプロセッサからの読み出しは、ほとんどの利用にとって重大な性能ではなく、返されるデータの読み出しに必要なバッファのために大きなキューデプスをサポートしているからであり、ハードウェア費用を増大させる。エンキュー格納が、アクセラレータデバイスに対するワークディスパッチに通常用いられることが予期されるので、エンキュー・ノン・ポステッドライトに対するこの制限されたキューイングを適用してしまうと、性能に弊害をもたらす可能性がある。Ｉ／Ｏブリッジ実装は、改善されたエンキュー・ノン・ポステッドライト帯域幅のために、（論理プロセッサの数のいくつかの実際の割合、論理プロセッサは、一度に１つの未処理のエンキュー格納要求しか有することができないので）増加したキューデプスをサポートすることが推奨される。読み出し要求とは異なり、エンキュー格納は、エンキュー・ノン・ポステッドライト完了が単に完了ステータス（成功対リトライ）を返すだけでデータを返さないので、データバッファのハードウェア費用を発生させない。 I/O bridge implementations typically limit the number of non-posted (read) requests supported in the downstream path to a small number (often a single request). This is because reads from the processor to MMIO (which are usually UC reads) are not performance critical for most usages, and supporting a large queue depth requires buffers for the returned read data, which increases hardware cost. Because enqueue stores are expected to be used routinely for work dispatch to accelerator devices, applying this limited queuing to enqueue non-posted writes can be detrimental to performance. I/O bridge implementations are encouraged to support an increased queue depth (some practical fraction of the number of logical processors, since a logical processor can have only one outstanding enqueue store request at a time) for improved enqueue non-posted write bandwidth. Unlike read requests, enqueue stores do not incur the hardware cost of data buffers, because an enqueue non-posted write completion returns only a completion status (success versus retry) and no data.

エンキュー・ノン・ポステッドライトに対する仮想チャネルサポート Virtual channel support for enqueue non-posted writes

（例えば、ＰＣＩエクスプレストランザクションオーダリングにより特定される）生産者－消費者オーダリング要求を有するＩ／Ｏバス上の典型的なメモリ読み出し及び書き込み要求とは異なり、エンキュー・ノン・ポステッドライトは、Ｉ／Ｏバス上でのオーダリング要求を行わない。これは、エンキュー・ノン・ポステッドライトを発行し、それぞれの完了を返すために、非ＶＣ０仮想チャネルの使用を可能にする。非ＶＣ０チャネルを用いすることの利益は、エンキュー・ノン・ポステッドライト完了が、デバイスからホストにＶＣ０上のアップストリームポステッド書き込みの背後でオーダリングされることを回避することにより、より良好なレイテンシ（コアを遅延させるより少ないサイクル）を有することができる。実装では、統合デバイスの利用を慎重に考慮して、エンキュー・ノンポステッド完了レイテンシを最小化することが推奨される。 Unlike typical memory read and write requests on the I/O bus, which have producer-consumer ordering requirements (e.g., as specified by PCI Express transaction ordering), enqueue non-posted writes do not enforce ordering requirements on the I/O bus. This allows the use of a non-VC0 virtual channel to issue enqueue non-posted writes and return their respective completions. The benefit of using a non-VC0 channel is that enqueue non-posted write completions can have better latency (fewer cycles delaying the core) by avoiding being ordered behind upstream posted writes on VC0 from the device to the host. Implementations are encouraged to carefully consider their use of the integrated device to minimize enqueue non-posted completion latency.

エンキュー・ノン・ポステッドライトの中間停止 Intermediate termination of enqueue non-posted writes

高レイテンシな状況（例えば、内部リンクをウェイクアップさせる、又は、ロックフロー上での電源管理）で特定のフロー制御を処理するために、中間エージェント（システムエージェント、Ｉ／Ｏブリッジなど）は、正規のエンキュー格納要求をドロップして、発行したコアに、完了をリトライ応答と共に返すことを可能にする。エンキュー格納を発行するソフトウェアは、リトライ応答が、中間エージェント又はターゲットからのものである場合、又は、ソフトウェアにおいて、通常のリトライする（潜在的にいくつかのバックオフを伴う）場合、直接的な可視性を有していない。 To handle certain flow control in high-latency situations (e.g., power management to wake up an internal link, or a lock flow), an intermediate agent (system agent, I/O bridge, etc.) is allowed to drop a legitimate enqueue store request and return a completion with a retry response to the issuing core. Software issuing the enqueue store has no direct visibility into whether the retry response came from an intermediate agent or from the target, and would normally retry in software (potentially with some back-off).

そのような中間停止を実行する実装では、そのような挙動は、ＳＷＱを共有するソフトウェアクライアントにわたる任意のサービス妨害攻撃をさらすことができないことを確認するように、非常に注意しなければならない。 Implementations that perform such intermediate termination must take great care to ensure that this behavior cannot expose any denial-of-service attack across the software clients sharing a SWQ.
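The software-side handling of a retry response described above (retry, potentially with some back-off) might look like the following sketch. All names here are hypothetical: `enq_fn` stands in for whatever wrapper executes ENQCMD/S and reports the success/retry status, and `fake_enq` is a test double that accepts on the third attempt. The doubling back-off with a cap of 1024 spins is an illustrative choice, not something the text prescribes.

```c
#include <stdint.h>

enum enq_status { ENQ_SUCCESS = 0, ENQ_RETRY = 1 };

/* Hypothetical wrapper type: issues one enqueue store of a 64-byte
 * descriptor to an SWQ register and returns the completion status. */
typedef enum enq_status (*enq_fn)(void *swq_reg, const void *desc64);

struct submit_result {
    int      accepted;       /* 1 if the device accepted the descriptor */
    unsigned attempts;       /* number of enqueue stores issued */
    uint64_t backoff_spins;  /* total spin iterations spent backing off */
};

/* Retry with capped exponential back-off. The caller cannot tell whether a
 * retry came from the target or from an intermediate agent, so every retry
 * is treated the same way. */
static struct submit_result submit_with_backoff(enq_fn enq, void *swq_reg,
                                                const void *desc64,
                                                unsigned max_attempts)
{
    struct submit_result r = {0, 0, 0};
    uint64_t delay = 1;

    while (r.attempts < max_attempts) {
        r.attempts++;
        if (enq(swq_reg, desc64) == ENQ_SUCCESS) {
            r.accepted = 1;
            break;
        }
        for (volatile uint64_t i = 0; i < delay; i++) {
            /* busy-wait; real code might use a pause hint or yield */
        }
        r.backoff_spins += delay;
        if (delay < 1024)
            delay *= 2;
    }
    return r;
}

/* Test double (hypothetical): returns retry twice, then success. */
static unsigned fake_calls;
static enum enq_status fake_enq(void *swq_reg, const void *desc64)
{
    (void)swq_reg;
    (void)desc64;
    return (++fake_calls >= 3) ? ENQ_SUCCESS : ENQ_RETRY;
}
```

Bounding the number of attempts keeps a client from spinning indefinitely when the queue stays full; what to do after giving up (sleep, fall back, report an error) is left to the caller.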

エンドポイントデバイス上での共有のワークキューのサポート Support for shared work queues on endpoint devices

図３４は、共有のワークキュー（ＳＷＱ）の概念を示し、複数の非協同ソフトウェアエージェント（アプリケーション３４１０－３４１２）が、本明細書で説明されるＥＮＱＣＭＤ／Ｓ命令を利用して、共有のワークキュー３４０１を通じてワークをサブミットすることを可能にする。 FIG. 34 illustrates the concept of a shared work queue (SWQ), which allows multiple non-cooperating software agents (applications 3410-3412) to submit work through a shared work queue 3401 using the ENQCMD/S instructions described herein.

以下の検討は、共有のワークキュー（ＳＷＱ）を実装するエンドポイントデバイスに適用可能である。 The following considerations are applicable to endpoint devices that implement a shared work queue (SWQ).

ＳＷＱ及びその列挙：デバイス物理ファンクション（ＰＦ）は、１又は複数のＳＷＱをサポートしてよい。各ＳＷＱは、デバイスＭＭＩＯアドレス範囲内の６４バイトアライン、及び、サイズレジスタ（ここでは、ＳＷＱ＿ＲＥＧと称される）を通じてエンキュー・ノン・ポステッドライトがアクセス可能である。デバイス上のそのような各ＳＷＱ＿ＲＥＧは、一意的なシステムページサイズ（４ＫＢ）領域に配置されることが推奨される。デバイス用のデバイスドライバは、ＳＷＱ機能、サポートされるＳＷＱの数、及び、対応するＳＷＱ＿ＲＥＧアドレスを、適切なソフトウェアインタフェースを通じてソフトウェアに報告／列挙する役割を担う。ドライバは、（これは、機能の正確性にとって必須ではないが）ソフトウェアのチューニング又は情報の用途のためにサポートされるＳＷＱのデプスを選択的に報告してもよい。複数の物理ファンクションをサポートするデバイスについては、物理ファンクションごとに独立してＳＷＱをサポートすることが推奨される。 SWQs and their enumeration: A device physical function (PF) may support one or more SWQs. Each SWQ is accessible to enqueue non-posted writes through a 64-byte aligned and sized register (referred to here as SWQ_REG) in the device MMIO address range. It is recommended that each such SWQ_REG on a device be located in its own system-page-size (4 KB) region. The device driver for the device is responsible for reporting/enumerating the SWQ capability, the number of SWQs supported, and the corresponding SWQ_REG addresses to software through an appropriate software interface. The driver may optionally report the depth of the supported SWQs for software tuning or informational purposes (although this is not required for functional correctness). For devices that support multiple physical functions, it is recommended that SWQs be supported independently for each physical function.

単一のルートＩ／Ｏ仮想化（ＳＲ－ＩＯＶ）デバイス上でのＳＷＱサポート：ＳＲ－ＩＯＶをサポートするデバイスは、それぞれのＶＦベースアドレスレジスタ（ＢＡＲ）におけるＳＷＱ＿ＲＥＧを通じてさらされる仮想機能（ＶＦ）ごとに独立してＳＷＱをサポートしてよい。この設計のポイントは、ＶＦにわたるワークサブミッションに関する最大の性能分離を考慮している点にあり、少数から中程度の数のＶＦに適切し得る。多数のＶＦをサポートするデバイス（ＶＦ毎の独立したＳＷＱは実用的ではない）について、単一のＳＷＱは、複数のＶＦにわたって共有されてよい。たとえこの場合であっても、各ＶＦは、ＳＷＱを共有するＶＦにわたって共通のＳＷＱにより補助されることを除いて、そのＶＦＢＡＲに自体のプライベートなＳＷＱ＿ＲＥＧを有する。そのようなデバイス設計について、ＳＷＱを共有するＶＦは、ハードウェア設計により静的に決定されてよく、又は、ＳＷＱインスタンスに対する所与のＶＦのＳＷＱ＿ＲＥＧ間のマッピングは、物理ファンクション及びそのドライバを通じて動的にセットアップ／トーンダウンされてよい。ＶＦにわたってＳＷＱを共有するデバイス設計は、このセクションにおいて後で説明されるように、サービス妨害攻撃に対するＱｏＳ及び保護に特別な注意を払う必要がある。ＶＦにわたってＳＷＱを共有する場合、どのＶＦがＳＷＱに受け入れられたエンキュー要求を受信したかを識別するように、デバイス設計において注意払われなければなければならない。ＳＷＱからワーク要求をディスパッチする場合、デバイスは、アップストリーム要求が、（エンキュー要求ペイロード内で伝達されたＰＡＳＩＤに加えて）それぞれのＶＦのリクエスタＩＤ（バス／デバイス／機能＃）と適切にタグ付けされているかを確認すべきである。 SWQ support on Single Root I/O Virtualization (SR-IOV) devices: Devices that support SR-IOV may support independent SWQs for each virtual function (VF), exposed through SWQ_REGs in the respective VF base address registers (BARs). This design point allows maximum performance isolation for work submission across VFs and may be appropriate for a small to moderate number of VFs. For devices that support a large number of VFs (where an independent SWQ per VF is not practical), a single SWQ may be shared across multiple VFs. Even in this case, each VF has its own private SWQ_REG in its VF BAR, except that these registers are backed by a common SWQ across the VFs that share it. For such device designs, which VFs share a SWQ may be determined statically by the hardware design, or the mapping from a given VF's SWQ_REG to a SWQ instance may be dynamically set up and torn down through the physical function and its driver. Device designs that share SWQs across VFs need to pay special attention to QoS and protection against denial-of-service attacks, as described later in this section. When sharing SWQs across VFs, care must be taken in the device design to identify which VF each enqueue request accepted into the SWQ was received from. When dispatching work requests from the SWQ, the device should ensure that upstream requests are properly tagged with the requester ID (bus/device/function number) of the respective VF (in addition to the PASID carried in the enqueue request payload).
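As a rough illustration of the shared-SWQ case above, the sketch below records the submitting VF's requester ID alongside the PASID when an enqueue request is accepted, so that later upstream requests can be tagged correctly. The structure layout, sizes and function names are hypothetical; the 16-bit requester ID packing (8-bit bus, 5-bit device, 3-bit function) and the 20-bit PASID width follow the usual PCI Express conventions.

```c
#include <stdint.h>

#define NUM_VFS   4   /* hypothetical number of VFs sharing the SWQ */
#define SWQ_DEPTH 8   /* hypothetical queue depth */

struct work_item {
    uint16_t requester_id;  /* bus/device/function of the submitting VF */
    uint32_t pasid;         /* PASID carried in the enqueue payload */
};

struct shared_swq {
    struct work_item slot[SWQ_DEPTH];
    unsigned         count;
    uint16_t         vf_rid[NUM_VFS];  /* requester ID of each VF */
};

/* Pack bus (8 bits), device (5 bits), function (3 bits) into a requester ID. */
static uint16_t make_rid(unsigned bus, unsigned dev, unsigned fn)
{
    return (uint16_t)(((bus & 0xFFu) << 8) | ((dev & 0x1Fu) << 3) | (fn & 0x7u));
}

/* Accept an enqueue request that arrived through VF `vf`'s private SWQ_REG.
 * Returns 1 on success, 0 for retry (queue full or unknown VF). */
static int vf_enqueue(struct shared_swq *q, unsigned vf, uint32_t pasid)
{
    if (vf >= NUM_VFS || q->count >= SWQ_DEPTH)
        return 0;
    q->slot[q->count].requester_id = q->vf_rid[vf];
    q->slot[q->count].pasid = pasid & 0xFFFFFu;  /* PASID is 20 bits */
    q->count++;
    return 1;
}
```

Storing the requester ID per queue entry is one simple way to keep VF identity attached to the work; a real device could equally derive it from which SWQ_REG was written.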

エンキュー・ノン・ポステッドライトアドレス：ＳＷＱをサポートするエンドポイントデバイスは、これらのＰＦ又はＶＦメモリＢＡＲを通じて転送される任意のアドレスへのエンキュー・ノン・ポステッドライトを受け入れることが必要とされる。ＳＷＱ＿ＲＥＧアドレスではないアドレスへの、エンドポイントデバイスにより受信された任意のエンキュー・ノン・ポステッドライト要求について、デバイスは、エラー（例えば、不正な形式のＴＬＰなど）としてこれを処理しない代わりに、リトライの完了ステータス（ＭＲＳ）と共に完了を返すことが必要とされ得る。これは、ＳＷＱ可能なデバイス上の非ＳＷＱ＿ＲＥＧアドレスにエンキュー格納を誤って又は悪意をもって発行するＥＮＱＣＭＤ／Ｓ命令の非特権（リング３又はリング０ＶＭＸゲスト）ソフトウェアの使用は、プラットフォーム固有のエラー処理結果と共に致命的でないエラー又は致命的なエラーを報告するという結果をもたらすことができないことを確保するために行われてよい。 Enqueue Non-Posted Write Addresses: Endpoint devices that support SWQ are required to accept enqueue non-posted writes to any address forwarded through their PF or VF memory BARs. For any enqueue non-posted write request received by an endpoint device to an address that is not a SWQ_REG address, the device may be required not to treat this as an error (e.g., malformed TLP, etc.) but instead to return a completion with retry completion status (MRS). This may be done to ensure that non-privileged (ring 3 or ring 0 VMX guest) software use of the ENQCMD/S instruction that mistakenly or maliciously issues an enqueue store to a non-SWQ_REG address on a SWQ-capable device cannot result in reporting a non-fatal or fatal error with platform-specific error handling consequences.

ＳＷＱ＿ＲＥＧに対する非エンキュー要求処理：ＳＷＱをサポートするエンドポイントデバイスは、致命的又は致命的でないエラーとしてこれらを処理することなく、ＳＷＱ＿ＲＥＧアドレスに対する非エンキュー要求（通常のメモリ書き込み及び読み出し）を無許可でドロップしてよい。ＳＷＱ＿ＲＥＧアドレスに対する読み出し要求は、要求されたデータバイトに対してオール１の値を有する正常完了応答（ＵＲ又はＣＡとは対照的に）を返してよい。ＳＷＱ＿ＲＥＧアドレスへの通常のメモリ（ポステッド）書き込み要求は、エンドポイントデバイスによる動作なしで単にドロップされる。これは、特権のないソフトウェアが、プラットフォーム固有のエラー処理結果と共に致命的でないエラー又は致命的なエラーを誤って又は悪意をもって報告させるようにＳＷＱ＿ＲＥＧアドレスへの通常の読み出し及び書き込み要求を生成できないことを確保するために行われ得る。 Non-enqueue request handling for SWQ_REG: An endpoint device that supports SWQs may silently drop non-enqueue requests (normal memory writes and reads) to SWQ_REG addresses without treating them as fatal or non-fatal errors. Read requests to a SWQ_REG address may return a successful completion response (as opposed to UR or CA) with all ones for the requested data bytes. Normal memory (posted) write requests to a SWQ_REG address are simply dropped, with no action taken by the endpoint device. This may be done to ensure that unprivileged software cannot, by mistake or maliciously, generate normal read and write requests to a SWQ_REG address that cause non-fatal or fatal errors to be reported with platform-specific error handling consequences.
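The two rules above (enqueue non-posted writes to non-SWQ_REG addresses get a retry completion, and non-enqueue requests to SWQ_REG are handled without raising errors) can be captured in a small decision function. This is a behavioral sketch only: the enum names, the single-SWQ device model, and the "normal" outcome for ordinary registers are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

enum req_kind { REQ_ENQUEUE_NPW, REQ_MEM_WRITE, REQ_MEM_READ };

enum outcome {
    OUT_ENQ_SUCCESS,    /* enqueue accepted into the SWQ */
    OUT_ENQ_RETRY,      /* completion with retry status, never an error */
    OUT_DROPPED,        /* posted write to SWQ_REG silently dropped */
    OUT_READ_ALL_ONES,  /* successful read completion, data all ones */
    OUT_NORMAL          /* ordinary register access, handled as usual */
};

/* Hypothetical device model with a single SWQ. */
struct swq_dev {
    uint64_t swq_reg_addr;  /* MMIO address of SWQ_REG */
    unsigned depth;         /* queue capacity */
    unsigned used;          /* entries currently held */
};

static enum outcome dev_handle(struct swq_dev *d, enum req_kind kind,
                               uint64_t addr)
{
    bool is_swq_reg = (addr == d->swq_reg_addr);

    if (kind == REQ_ENQUEUE_NPW) {
        if (!is_swq_reg)
            return OUT_ENQ_RETRY;   /* wrong address: retry, not an error */
        if (d->used >= d->depth)
            return OUT_ENQ_RETRY;   /* queue full */
        d->used++;
        return OUT_ENQ_SUCCESS;
    }
    if (is_swq_reg)
        return (kind == REQ_MEM_READ) ? OUT_READ_ALL_ONES : OUT_DROPPED;
    return OUT_NORMAL;
}
```

Note that none of the paths reports a fatal or non-fatal error, which is the property the text asks for so that unprivileged software cannot trigger platform error handling.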

ＳＷＱキューデプス及びストレージ：ＳＷＱキューデプス及びストレージは、デバイス実装に固有のものである。デバイス設計は、デバイスの最大限の利用を実現するために、十分なキューデプスがＳＷＱにサポートされることを確保すべきである。ＳＷＱに対するストレージは、デバイス上に実装されてよい。ＳｏＣ上の統合デバイスは、ＳＷＱのバッファ溢れとして、スティールされたメインメモリ（デバイスの使用のために予約された非ＯＳ可視プライベートメモリ）を利用してよく、オンデバイスストレージを用いて実現するよりも、より大きなＳＷＱキューデプスを可能にする。そのような設計について、バッファ溢れの使用は、デバイスハードウェアが、（エンキュー要求をドロップして、リトライ完了ステータスを送信することと対比して）いつあふれさせ、コマンド実行に対するバッファ溢れからフェッチし、任意のコマンドに固有のオーダリング要求を維持するかを決定するので、ソフトウェアに対して透過的である。すべての用途に対して、そのようなバッファ溢れの利用は、ＳＷＱストレージに対してローカルデバイスが取り付けられたＤＲＡＭを用いる別個のデバイスと同等である。スティールされたメモリ（ｓｔｏｌｅｎｍｅｍｏｒｙ）内のバッファ溢れを伴うデバイス設計は、そのようなスティールされたメモリ（ｓｔｏｌｅｎｍｅｍｏｒｙ）が、割り当てられたデバイスによりバッファ溢れの読み出し及び書き込み以外の任意のアクセスから保護されることを確認するために非常に注意しなければならない。 SWQ queue depth and storage: SWQ queue depth and storage are device-implementation specific. The device design should ensure that sufficient queue depth is supported for the SWQ to achieve maximum utilization of the device. Storage for the SWQ may be implemented on the device. Integrated devices on a SoC may use stolen main memory (non-OS-visible private memory reserved for device use) as a spill buffer for the SWQ, allowing larger SWQ queue depths than are possible with on-device storage. For such designs, use of the spill buffer is transparent to software, because the device hardware decides when to spill (versus dropping the enqueue request and sending a retry completion status), fetches from the spill buffer for command execution, and maintains any command-specific ordering requirements. For all practical purposes, such spill buffer usage is equivalent to a discrete device using locally attached DRAM for SWQ storage. Device designs with a spill buffer in stolen memory must take great care to ensure that the stolen memory is protected from any access other than spill buffer reads and writes by the device to which it is allocated.

非ブロックＳＷＱの挙動：性能上の理由で、デバイス実装は、成功又はリトライ完了ステータスを有するエンキュー・ノン・ポステッドライト要求に迅速に応答すべきであり、要求を受け入れるために解放されるＳＷＱ容量のエンキュー完了をブロックすべきでない。ＳＷＱに対するエンキュー要求を受け入れ又は拒絶する決定は、容量、ＱｏＳ／占有率又はその他のポリシに基づき得る。いくつかの例示的なＱｏＳの検討が次に説明される。 Non-blocking SWQ behavior: For performance reasons, device implementations should respond quickly to enqueue non-posted write requests with a success or retry completion status, and should not block the enqueue completion while waiting for SWQ capacity to be freed up to accept the request. The decision to accept or reject an enqueue request to a SWQ may be based on capacity, QoS/occupancy, or other policies. Some example QoS considerations are described next.

SWQ QoS considerations: For an enqueue non-posted write targeting an SWQ_REG address, the endpoint device may apply admission control to decide whether to accept the request into the respective SWQ (and send a success completion status) or drop it (and send a retry completion status). Admission control may be device and usage specific, and the specific policies supported or enforced by the hardware may be exposed to software through the physical function (PF) driver interface. Because the SWQ is a shared resource with multiple producer clients, the device implementation must ensure adequate protection against denial-of-service attacks across producers. QoS for the SWQ refers only to the acceptance of work requests (through enqueue requests) into the SWQ; it is orthogonal to any QoS that the device hardware applies when sharing the device's execution resources among work requests submitted by different producers. Several example approaches for configuring an endpoint device to enforce admission policies for accepting enqueue requests to the SWQ are described below. These are documented for illustration only, and the exact implementation choice is device specific.
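As a minimal sketch of the accept-or-drop decision described above, the following Python model shows a queue that never blocks: every enqueue request is answered immediately with success or retry. The class name `SharedWorkQueue`, the `per_producer_limit` occupancy cap, and the string status values are hypothetical choices made for illustration; the text leaves the actual admission policy to the device.

```python
from collections import Counter, deque

SUCCESS, RETRY = "success", "retry"


class SharedWorkQueue:
    """Toy SWQ: accepts or drops enqueue requests, never blocks."""

    def __init__(self, depth, per_producer_limit):
        self.depth = depth                            # total queue slots
        self.per_producer_limit = per_producer_limit  # hypothetical occupancy cap
        self.entries = deque()                        # (producer, command), oldest first
        self.occupancy = Counter()                    # slots currently held per producer

    def enqueue(self, producer, command):
        """Return SUCCESS if the request is accepted, otherwise RETRY."""
        if len(self.entries) >= self.depth:
            return RETRY    # no capacity: drop the request and report retry
        if self.occupancy[producer] >= self.per_producer_limit:
            return RETRY    # producer is over its share: drop the request
        self.entries.append((producer, command))
        self.occupancy[producer] += 1
        return SUCCESS

    def dispatch(self):
        """Device side: remove the oldest command and free its slot."""
        producer, command = self.entries.popleft()
        self.occupancy[producer] -= 1
        return producer, command
```

The per-producer cap is one simple way to keep a single producer from filling the queue and starving the others. It governs only acceptance into the queue, not how execution resources are shared afterwards.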

In one embodiment, the MOVDIRI instruction moves the doubleword integer in the source operand (second operand) to the destination operand (first operand) using a direct-store operation. The source operand may be a general-purpose register. The destination operand may be a 32-bit memory location. In 64-bit mode, the default operation size of the instruction is 32 bits. MOVDIRI defines the destination to be doubleword or quadword aligned.

A direct store may be implemented using the write-combining (WC) memory type protocol for writing data. Under this protocol, the processor neither writes the data into the cache hierarchy nor fetches the corresponding cache line from memory into the cache hierarchy. If the destination address is cached, the line is written back (if modified) and invalidated from the cache before the direct store. Unlike stores with a non-temporal hint, for which the uncacheable (UC) and write-protected (WP) memory types of the destination override the hint, direct stores always follow the WC memory type protocol regardless of the destination address memory type (including the UC and WP types).

Unlike WC stores and stores with a non-temporal hint, direct stores are eligible for immediate eviction from the write-combining buffer and are therefore not combined with younger stores (including direct stores) to the same address. Older WC and non-temporal stores held in the write-combining buffer may be combined with a younger direct store to the same address.

Because the WC protocol used by direct stores follows a weakly ordered memory consistency model, a fencing operation should follow the MOVDIRI instruction to enforce ordering when needed.

A direct store issued by MOVDIRI to a destination aligned on a 4-byte boundary guarantees 4-byte write-completion atomicity. This means that the data arrives at the destination in a single non-torn 4-byte (or 8-byte) write transaction. If the destination is not aligned to the write size, the direct store issued by MOVDIRI is split and arrives at the destination in two parts. Each part of such a split direct store does not merge with younger stores, but the parts can arrive at the destination in any order.
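The alignment rule can be illustrated with a small model. The sketch below returns the write transactions that a single direct store would produce; the function name `direct_store_parts` is hypothetical, and the choice to split a misaligned store at the next size-aligned boundary is an assumption, since the text says only that the store arrives in two parts.

```python
def direct_store_parts(address, size):
    """Return the (address, length) write transactions for one direct store.

    size is 4 (doubleword) or 8 (quadword). An aligned store is one non-torn
    write. A misaligned store is modeled as two parts, split at the next
    size-aligned boundary (an assumption about where the split falls).
    """
    if size not in (4, 8):
        raise ValueError("this model covers 4-byte and 8-byte stores only")
    misalignment = address % size
    if misalignment == 0:
        return [(address, size)]
    first_len = size - misalignment
    return [(address, first_len), (address + first_len, size - first_len)]
```

A caller that relies on the data arriving as one undivided write would check that the returned list has exactly one element. When there are two, the parts may reach the destination in either order.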

Figure 59 illustrates an embodiment of a method performed by a processor to process a MOVDIRI instruction. For example, the hardware detailed herein is used.

At 5901, an instruction is fetched. For example, a MOVDIRI instruction is fetched. The MOVDIRI instruction includes an opcode (and, in some embodiments, a prefix), a destination field representing a destination operand, and a source field representing a source register operand.

At 5903, the fetched instruction is decoded. For example, the MOVDIRI instruction is decoded by decode circuitry such as that detailed herein.

At 5905, data values associated with the source operand of the decoded instruction are retrieved. Additionally, in some embodiments, the instruction is scheduled.

At 5907, the decoded instruction is executed by execution circuitry (hardware) such as that detailed herein to move doubleword-sized data from the source register operand to the destination register operand without caching the data.

In some embodiments, at 5909, the instruction is committed or retired.

The MOVDIR64B instruction moves 64 bytes from a source memory address to a destination memory address as a direct store with 64-byte write atomicity. The source operand is a normal memory operand. The destination operand is a memory location specified in a general-purpose register. The register content is interpreted as an offset into the ES segment without any segment override. In 64-bit mode, the register operand width is 64 bits (or 32 bits). Outside 64-bit mode, the register width is 32 bits or 16 bits. MOVDIR64B requires the destination address to be 64-byte aligned. No alignment restriction is enforced on the source operand.

MOVDIR64B reads 64 bytes from the source memory address and performs a 64-byte direct-store operation to the destination address. The load operation follows normal read ordering based on the memory type of the source address. The direct store is implemented using the write-combining (WC) memory type protocol for writing data. Under this protocol, the processor need not write the data into the cache hierarchy and need not fetch the corresponding cache line from memory into the cache hierarchy. If the destination address is cached, the line is written back (if modified) and invalidated from the cache before the direct store.

Unlike stores with a non-temporal hint, for which the UC/WP memory type of the destination overrides the hint, direct stores may follow the WC memory type protocol regardless of the destination address memory type (including the UC/WP types).

Unlike WC stores and stores with a non-temporal hint, direct stores are eligible for immediate eviction from the write-combining buffer and are therefore not combined with younger stores (including direct stores) to the same address. Older WC and non-temporal stores held in the write-combining buffer may be combined with a younger direct store to the same address.

Because the WC protocol used by direct stores follows a weakly ordered memory consistency model, a fencing operation should follow the MOVDIR64B instruction to enforce ordering when needed.

No atomicity guarantee is provided for the 64-byte load operation from the source address, and a processor implementation may use multiple load operations to read the 64 bytes. The 64-byte direct store issued by MOVDIR64B guarantees 64-byte write-completion atomicity. This means that the data arrives at the destination in a single non-torn 64-byte write transaction.

Figure 60 illustrates an embodiment of a method performed by a processor to process a MOVDIR64B instruction. For example, the hardware detailed herein is used.

At 6001, an instruction is fetched. For example, a MOVDIR64B instruction is fetched. The MOVDIR64B instruction includes an opcode (and, in some embodiments, a prefix), a destination field representing a destination operand, and a source field representing a source register operand.

At 6003, the fetched instruction is decoded. For example, the MOVDIR64B instruction is decoded by decode circuitry such as that detailed herein.

At 6005, data values associated with the source operand of the decoded instruction are retrieved. Additionally, in some embodiments, the instruction is scheduled.

At 6007, the decoded instruction is executed by execution circuitry (hardware) such as that detailed herein to move 64 bytes of data from the source register operand to the destination register operand without caching the data.

In some embodiments, at 6009, the instruction is committed or retired.

In one embodiment, the ENQCMD instruction enqueues a 64-byte command from a source memory address (second operand) to a device shared work queue (SWQ) memory address in the destination operand, using a non-posted write with 64-byte write atomicity. The source operand is a normal memory operand. The destination operand is a memory address specified in a general-purpose register. The register content is interpreted as an offset into the ES segment without any segment override. In 64-bit mode, the register operand width is 64 bits or 32 bits. Outside 64-bit mode, the register width is 32 bits or 16 bits. ENQCMD requires the destination address to be 64-byte aligned. No alignment restriction is enforced on the source operand.

In one embodiment, ENQCMD reads the 64-byte command from the source memory address, formats 64 bytes of enqueue store data, and performs a 64-byte enqueue-store operation of the store data to the destination address. The load operation follows normal read ordering based on the memory type of the source address. A general protection error may be raised if the low 4 bytes of the 64-byte command data read from the source memory address have a non-zero value, or if the PASID valid field bit is 0. Otherwise, the 64-byte enqueue store data is formatted as follows:
Enqueue store data [511:32] = command data [511:32]
Enqueue store data [31] = 0
Enqueue store data [30:20] = 0
Enqueue store data [19:0] = PASID MSR [19:0]
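The formatting rules above can be written out as a short Python sketch that treats the 64-byte command as a 512-bit little-endian integer. The function name `format_enqcmd_store_data`, the `GeneralProtectionError` exception, and the decision to pass the PASID value and its valid bit as separate arguments are illustrative assumptions; the layout of the PASID MSR itself is not modeled.

```python
PASID_MASK = (1 << 20) - 1    # bits 19:0
LOW32_MASK = (1 << 32) - 1    # bits 31:0


class GeneralProtectionError(Exception):
    """Stands in for the general protection error described in the text."""


def format_enqcmd_store_data(command_data, pasid, pasid_valid):
    """Build the 512-bit enqueue store data for ENQCMD as a Python int.

    command_data: the 64 bytes read from the source address (little-endian).
    pasid, pasid_valid: PASID value and valid bit taken from the PASID MSR.
    """
    if len(command_data) != 64:
        raise ValueError("command must be exactly 64 bytes")
    command = int.from_bytes(command_data, "little")
    if command & LOW32_MASK != 0 or not pasid_valid:
        raise GeneralProtectionError("low 4 bytes non-zero or PASID not valid")
    # Bits 511:32 come from the command; bit 31 and bits 30:20 stay 0;
    # bits 19:0 are filled from the PASID.
    return (command & ~LOW32_MASK) | (pasid & PASID_MASK)
```

Because the low 32 bits are supplied by the processor and not by the command, unprivileged software cannot choose its own PASID or set bit 31.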

In one embodiment, the 64-byte enqueue store data generated by ENQCMD has the format shown in FIG. 58. The upper 60 bytes of the command descriptor specify the target-device-specific command 5801. The PRIV field 5802 (bit 31) may be forced to 0 to convey user privilege for enqueue stores generated by the ENQCMD instruction. The PASID field (bits 19:0) 5804 conveys the process address space identifier (as programmed in the PASID MSR) assigned by system software to the software thread executing ENQCMD.

The enqueue-store operation uses a non-posted write protocol to write the 64 bytes of data. The non-posted write protocol need not write the data into the cache hierarchy and need not fetch the corresponding cache line into the cache hierarchy. An enqueue store always follows the non-posted write protocol regardless of the destination address memory type (including the UC/WP types).

The non-posted write protocol may return a completion response indicating success or retry status for the non-posted write. The ENQCMD instruction may return this completion status in the zero flag (0 indicates success, 1 indicates retry). A success status indicates that the non-posted write data (64 bytes) has been accepted by the target shared work queue (but not necessarily acted upon). A retry status indicates that the non-posted write was not accepted by the destination address because of capacity or other temporal reasons (or because the destination address is not a valid shared work queue address).
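On the software side, the zero-flag convention suggests a bounded retry loop. The sketch below is a model only: `enqueue` is a hypothetical callable standing in for one ENQCMD execution and returning the zero-flag value, and `make_fake_queue` is a test double, not a device interface.

```python
def submit_with_retry(enqueue, command, max_attempts):
    """Call enqueue(command) until it reports success or attempts run out.

    enqueue returns the zero-flag value of one modeled ENQCMD execution:
    0 for success, 1 for retry. Returns the number of attempts used, or
    None if every attempt was answered with retry.
    """
    for attempt in range(1, max_attempts + 1):
        zero_flag = enqueue(command)
        if zero_flag == 0:
            return attempt    # accepted by the SWQ, not necessarily executed yet
    return None


def make_fake_queue(retries_before_accept):
    """Test double: answers retry a fixed number of times, then success."""
    remaining = [retries_before_accept]

    def enqueue(command):
        if remaining[0] > 0:
            remaining[0] -= 1
            return 1
        return 0

    return enqueue
```

The attempt limit matters because a retry status can also mean that the destination is not a valid shared work queue address, in which case retrying forever would never succeed.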

In one embodiment, at most one enqueue store can be outstanding from a given logical processor. In that sense, an enqueue store cannot pass another enqueue store. Enqueue stores are not ordered against older WB stores, WC and non-temporal stores, CLFLUSHOPT, or CLWB to different addresses. Software that needs to enforce such ordering must use explicit store fencing after such stores and before the enqueue store. ENQCMD affects only shared work queue (SWQ) addresses, which are not affected by other stores.

No atomicity guarantee is provided for the 64-byte load operation from the source address, and a processor implementation may use multiple load operations to read the 64 bytes. The 64-byte enqueue store issued by ENQCMD guarantees 64-byte write-completion atomicity. The data may arrive at the destination as a single non-torn 64-byte non-posted write transaction.

In some embodiments, a PASID architectural MSR is used by the ENQCMD instruction.

Figure 61 illustrates an embodiment of a method performed by a processor to process an ENQCMD instruction. For example, the hardware detailed herein is used.

At 6101, an instruction is fetched. For example, an ENQCMD instruction is fetched. The ENQCMD instruction includes an opcode (and, in some embodiments, a prefix), a destination field representing a destination memory address operand, and a source field representing a source memory operand.

At 6103, the fetched instruction is decoded. For example, the ENQCMD instruction is decoded by decode circuitry such as that detailed herein.

At 6105, data values associated with the source operand of the decoded instruction are retrieved. Additionally, in some embodiments, the instruction is scheduled.

At 6107, the decoded instruction is executed by execution circuitry (hardware) such as that detailed herein to write the command (the retrieved data) to the destination memory address. In some embodiments, the destination memory address is a shared work queue address.

In some embodiments, at 6109, the instruction is committed or retired.

In one embodiment, the ENQCMDS instruction enqueues a 64-byte command from a source memory address (second operand) to a device shared work queue (SWQ) memory address in the destination operand, using a non-posted write with 64-byte write atomicity. The source operand is a normal memory operand. The destination operand is a memory address specified in a general-purpose register. The register content may be interpreted as an offset into the ES segment without any segment override. In 64-bit mode, the register operand width is 64 bits or 32 bits. Outside 64-bit mode, the register width is 32 bits or 16 bits. ENQCMDS requires the destination address to be 64-byte aligned. No alignment restriction is enforced on the source operand.

Unlike ENQCMD (which can be executed from any privilege level), ENQCMDS is a privileged instruction. When the processor is running in protected mode, the CPL must be 0 to execute this instruction. ENQCMDS reads the 64-byte command from the source memory address and uses this data to perform a 64-byte enqueue-store operation to the destination address. The load operation follows normal read ordering based on the memory type of the source address. The 64-byte enqueue store data is formatted as follows:
Enqueue store data [511:32] = command data [511:32]
Enqueue store data [31] = command data [31]
Enqueue store data [30:20] = 0
Enqueue store data [19:0] = command data [19:0]
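For comparison with the unprivileged form, the ENQCMDS layout above reduces to clearing bits 30:20 and passing everything else through. The following Python sketch shows this on a 512-bit little-endian integer; the function name `format_enqcmds_store_data` is hypothetical, and the privilege check (CPL 0) is outside the scope of the model.

```python
RESERVED_MASK = ((1 << 11) - 1) << 20    # bits 30:20


def format_enqcmds_store_data(command_data):
    """Build the 512-bit enqueue store data for ENQCMDS as a Python int.

    Bit 31 (PRIV) and bits 19:0 (PASID) are taken from the command itself,
    so the privileged caller chooses them; bits 30:20 are forced to 0.
    """
    if len(command_data) != 64:
        raise ValueError("command must be exactly 64 bytes")
    command = int.from_bytes(command_data, "little")
    return command & ~RESERVED_MASK
```

This is why the instruction is privileged: the submitter, not the processor, supplies the PASID and the privilege bit.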

The 64-byte enqueue store data generated by ENQCMDS may have the same format as that of ENQCMD. In one embodiment, ENQCMDS has the format shown in FIG. 62.

The upper 60 bytes of the command descriptor specify the target-device-specific command 6201. The PRIV field (bit 31) 6202 is specified by bit 31 of the command data at the source operand address, and conveys either user (0) or supervisor (1) privilege for enqueue stores generated by the ENQCMDS instruction. The PASID field (bits 19:0) 6204 conveys the process address space identifier as specified in bits 19:0 of the command data at the source operand address.

In one embodiment, the enqueue-store operation uses a non-posted write protocol to write the 64 bytes of data. The non-posted write protocol neither writes the data into the cache hierarchy nor fetches the corresponding cache line into the cache hierarchy. An enqueue store always follows the non-posted write protocol regardless of the destination address memory type (including the UC/WP types).

The non-posted write protocol returns a completion response indicating success or retry status for the non-posted write. The ENQCMDS instruction returns this completion status in the zero flag (0 indicates success, 1 indicates retry). A success status indicates that the non-posted write data (64 bytes) has been accepted by the target shared work queue (but not necessarily acted upon). A retry status indicates that the non-posted write was not accepted by the destination address because of capacity or other temporal reasons (or because the destination address is not a valid shared work queue address).

At most one enqueue store (ENQCMD or ENQCMDS) can be outstanding from a given logical processor. In that sense, an enqueue store cannot pass another enqueue store. Enqueue stores may not be ordered against older WB stores, WC and non-temporal stores, CLFLUSHOPT, or CLWB to different addresses. Software that needs to enforce such ordering may use explicit store fencing after such stores and before the enqueue store.

ENQCMDS affects only shared work queue (SWQ) addresses, which are not affected by other stores.

No atomicity guarantee is provided for the 64-byte load operation from the source address, and a processor implementation may use multiple load operations to read the 64 bytes. The 64-byte enqueue store issued by ENQCMDS guarantees 64-byte write-completion atomicity (that is, it arrives at the destination as a single non-torn 64-byte non-posted write transaction).

Figure 63 illustrates an embodiment of a method performed by a processor to process an ENQCMDS instruction. For example, the hardware detailed herein is used.

At 6301, an instruction is fetched. For example, an ENQCMDS instruction is fetched. The ENQCMDS instruction includes an opcode (and, in some embodiments, a prefix), a destination field representing a destination memory address operand, and a source field representing a source memory operand.

At 6303, the fetched instruction is decoded. For example, the ENQCMDS instruction is decoded by decode circuitry such as that detailed herein.

At 6305, data values associated with the source operand of the decoded instruction are retrieved. Additionally, in some embodiments, the instruction is scheduled.

At 6307, the decoded instruction is executed in privileged mode by execution circuitry (hardware) such as that detailed herein to write the command (the retrieved data) to the destination memory address. In some embodiments, the destination memory address is a shared work queue address.

In some embodiments, at 6309, the instruction is committed or retired.

One embodiment utilizes two instructions, UMONITOR and UMWAIT, to ensure efficient synchronization between an accelerator and a host processor. Briefly, the UMONITOR instruction arms address monitoring hardware using an address specified in a source register, and the UMWAIT instruction instructs the processor to enter an implementation-dependent optimized state while monitoring a range of addresses.

The UMONITOR instruction arms address monitoring hardware using the address specified in the r32/r64 source register (the address range that the monitoring hardware checks for store operations can be determined using the CPUID monitor leaf function). A store to an address within the specified address range triggers the monitoring hardware. The state of the monitor hardware is used by UMWAIT.

The following operand encodings are used in one embodiment of the UMONITOR instruction:

The content of the r32/r64 source register is an effective address (in 64-bit mode, r64 is used). By default, the DS segment is used to create the linear address that is monitored. Segment overrides can be used. The address range must use memory of the write-back type. Only write-back memory is guaranteed to correctly trigger the monitoring hardware.

The UMONITOR instruction is ordered as a load operation with respect to other memory transactions. The instruction is subject to the permission checking and faults associated with a byte load. Like a load, UMONITOR sets the A bit but not the D bit in the page tables.

UMONITOR and UMWAIT may be executed at any privilege level. The operation of the instructions is the same in non-64-bit modes and in 64-bit mode.

UMONITOR does not interoperate with the legacy MWAIT instruction. If UMONITOR was executed before MWAIT and after the most recent execution of the legacy MONITOR instruction, MWAIT may not enter an optimized state. Execution resumes at the instruction following MWAIT.

The UMONITOR instruction causes a transactional abort when used inside a transactional region.

UMONITOR sets up an address range for the monitor hardware using the content of the source register as an effective address and puts the monitor hardware in the armed state. A store to the specified address range triggers the monitor hardware.
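The arm-then-trigger behavior can be pictured as a small state machine. The Python sketch below is a toy model, not a description of the hardware: the class name `AddressMonitor`, the 64-byte `line_size` default, and the choice to align the monitored range to that size are all assumptions (the text says only that the range is reported by the CPUID monitor leaf).

```python
class AddressMonitor:
    """Toy address monitor: armed by UMONITOR, triggered by a store in range."""

    def __init__(self, line_size=64):
        # line_size is a placeholder value; real hardware reports the size of
        # the monitored range through the CPUID monitor leaf.
        self.line_size = line_size
        self.base = None
        self.armed = False
        self.triggered = False

    def umonitor(self, address):
        """Arm the monitor on the aligned range that contains address."""
        self.base = address - (address % self.line_size)
        self.armed = True
        self.triggered = False

    def store(self, address):
        """Record a store; it triggers the monitor if it hits the armed range."""
        if self.armed and self.base <= address < self.base + self.line_size:
            self.armed = False
            self.triggered = True
```

In an accelerator setting, the monitored address would typically be a completion record that the device writes when it finishes a command, so the device's store is what triggers the monitor.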

Figure 64 illustrates an embodiment of a method performed by a processor to process a UMONITOR instruction. For example, the hardware detailed herein is used.

At 6401, an instruction is fetched. For example, a UMONITOR instruction is fetched. The UMONITOR instruction includes an opcode (and, in some embodiments, a prefix) and an explicit source register operand.

At 6403, the fetched instruction is decoded. For example, the UMONITOR instruction is decoded by decode circuitry such as that detailed herein.

At 6405, data values associated with the source operand of the decoded instruction are retrieved. Additionally, in some embodiments, the instruction is scheduled.

At 6407, the decoded instruction is executed by execution circuitry (hardware) such as that detailed herein to arm the monitoring hardware for a store to the address defined by the retrieved source register data.

In some embodiments, at 6409, the instruction is committed or retired.

ＵＭＷＡＩＴは、アドレスの範囲をモニタリングしている間に、実装に依存して最適化された状態に入るようプロセッサに命令する。最適化された状態は、軽量電力／性能が最適化された状態又は改善された電力／性能が最適化された状態のいずれか一方であってよい。その２つの状態の選択は、明示的な入力レジスタビット［０］ソースオペランドにより統制される。

UMWAIT instructs the processor to enter an implementation-dependent optimized state while monitoring a range of addresses. The optimized state may be either a lightweight power/performance optimized state or an improved power/performance optimized state. The selection between the two states is controlled by an explicit input register bit[0] source operand.

ＵＭＷＡＩＴは、任意の特権レベルで実行されてよい。この命令のオペレーションは、非６４ビットモード及び６４ビットモードにおいて同じである。 UMWAIT may be executed at any privilege level. The operation of this instruction is the same in non-64-bit and 64-bit modes.

入力レジスタは、以下のテーブルにおいて説明さるように、プロセッサが入るべきである好ましく最適化された状態などの情報を含んでよい。ビット０以外のビットは、予約済みであり、ゼロ以外である場合、＃ＧＰを結果としてもたらす。

The input register may contain information such as the preferred optimized state the processor should enter, as described in the table below. Bits other than bit 0 are reserved and result in #GP if non-zero.

命令は、タイムスタンプカウンタが暗黙的な６４ビット入力値に達した又はこれを超えた場合（モニタリングハードウェアが予めトリガされていなかった場合）にウェイクアップする。 The instruction wakes up when the timestamp counter reaches or exceeds an implicit 64-bit input value (if the monitoring hardware has not been previously triggered).

ＵＭＷＡＩＴ命令を実行する前に、オペレーティングシステムは、２つの電力／性能が最適化された状態のいずれか一方を含み得るそのオペレーションをプロセッサが一時停止することを可能にする最大遅延を規定してよい。それは、以下の３２ビットＭＳＲにＴＳＣ量子値を書き込むことによりそうすることができる。
ＵＭＷＡＩＴ＿ＣＯＮＴＲＯＬ［３１：２］－Ｃ０．１又はＣ０．２のいずれか一方にプロセッサが存在し得るＴＳＣ量子における最大時間を判断する。ゼロの値は、ＯＳがプロセッサに対して課した制限がないことを示す。最大時間値は、上位３０ｂがこのフィールドから来ており、下位２ビットがゼロであると仮定される３２ｂの値である。
ＵＭＷＡＩＴ＿ＣＯＮＴＲＯＬ［１］－予約済。
ＵＭＷＡＩＴ＿ＣＯＮＴＲＯＬ［０］－Ｃ０．２はＯＳにより許可されていない。１の値は、すべてのＣ０．２要求がＣ０．１に戻ることを意味する。

Before executing the UMWAIT instruction, the operating system may specify the maximum delay for which the processor may suspend its operation, which may include either of the two power/performance optimized states. It can do so by writing a TSC-quanta value to the following 32-bit MSR:
UMWAIT_CONTROL[31:2] - Determines the maximum time, in TSC quanta, that the processor can reside in either C0.1 or C0.2. A value of zero indicates that the OS imposes no limit on the processor. The maximum time value is a 32-bit value whose upper 30 bits come from this field and whose lower 2 bits are assumed to be zero.
UMWAIT_CONTROL[1] - Reserved.
UMWAIT_CONTROL[0] - C0.2 is not allowed by the OS. A value of 1 means that all C0.2 requests revert to C0.1.
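
As an illustration of the field layout listed above, the following Python sketch splits a raw UMWAIT_CONTROL value into its three fields. The function name `decode_umwait_control` and the dictionary keys are hypothetical names chosen for this example; only the bit positions come from the description.

```python
def decode_umwait_control(msr_value: int) -> dict:
    """Split a 32-bit UMWAIT_CONTROL value into the fields described above."""
    if not 0 <= msr_value < 2**32:
        raise ValueError("UMWAIT_CONTROL is a 32-bit MSR")
    return {
        # Bits [31:2] supply the upper 30 bits; the lower 2 bits are taken as zero.
        "max_time_tsc_quanta": msr_value & 0xFFFF_FFFC,
        # Bit 1 is reserved.
        "reserved_bit1": (msr_value >> 1) & 1,
        # Bit 0 set means that every C0.2 request reverts to C0.1.
        "c02_disabled": bool(msr_value & 1),
    }


fields = decode_umwait_control(0x0001_86A1)
print(fields["max_time_tsc_quanta"], fields["c02_disabled"])  # 100000 True
```

A maximum time of zero in this model corresponds to "no OS-imposed limit", as stated in the field description.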

一実施例において、ＵＭＷＡＩＴ命令を実行したプロセッサがオペレーティングシステム時間制限の期限切れに起因して起きた場合、命令は、キャリーフラグを設定し、そうでなければ、そのフラグがクリアされる。 In one embodiment, if the processor that executed the UMWAIT instruction woke up due to the expiration of an operating system time limit, the instruction sets the carry flag; otherwise, the flag is cleared.
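
The interplay between the instruction deadline, the OS limit, and the carry flag can be sketched as a small software model. This is an illustration under stated assumptions, not the processor's actual logic: it assumes that no store or interrupt arrives, that the OS limit is counted from the timestamp counter value at entry (`start_tsc`, a hypothetical parameter), and that bit 0 of the input register equal to 0 requests C0.2 while 1 requests C0.1. The function name `umwait_outcome` is also hypothetical.

```python
def umwait_outcome(control_reg, deadline_tsc, start_tsc, umwait_control):
    """Return (state entered, TSC value at wake-up, carry flag) for a timed wait."""
    if control_reg & ~1:
        # Bits other than bit 0 are reserved and fault when non-zero.
        raise ValueError("#GP: reserved bits of the input register are non-zero")
    wants_c01 = bool(control_reg & 1)
    c02_disabled = bool(umwait_control & 1)        # UMWAIT_CONTROL[0]
    state = "C0.1" if (wants_c01 or c02_disabled) else "C0.2"

    os_limit = umwait_control & 0xFFFF_FFFC        # 0 means no OS-imposed limit
    if os_limit and start_tsc + os_limit < deadline_tsc:
        # Assumption: the OS limit expires before the requested deadline.
        return state, start_tsc + os_limit, True   # carry flag set
    return state, deadline_tsc, False              # carry flag cleared


print(umwait_outcome(0, deadline_tsc=5000, start_tsc=1000, umwait_control=0))
print(umwait_outcome(0, deadline_tsc=5000, start_tsc=1000, umwait_control=0x7D1))
```

In the second call the OS limit of 2000 quanta expires first and C0.2 is disallowed, so the model reports C0.1 with the carry flag set.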

ＵＭＷＡＩＴ命令は、トランザクション領域内で用いられる場合、トランザクションをアボートさせる。一実施例において、ＵＭＷＡＩＴ命令は、ＵＭＯＮＩＴＯＲ命令と共に動作する。２つの命令は、待機するアドレス（ＵＭＯＮＩＴＯＲ）の定義を可能にし、実装に依存して最適化されたオペレーションが待機アドレス（ＵＭＷＡＩＴ）で開始することを可能にする。ＵＭＷＡＩＴの実行は、ＵＭＯＮＩＴＯＲにより作動可能なアドレス範囲に対するイベント又は格納処理を待機している間に、実装に依存して最適化された状態に入ることができるプロセッサに対する暗示である。 The UMWAIT instruction aborts a transaction when used within a transactional region. In one embodiment, the UMWAIT instruction operates together with the UMONITOR instruction. The two instructions allow the definition of an address at which to wait (UMONITOR) and an implementation-dependent optimized operation to begin at the wait address (UMWAIT). Execution of UMWAIT is a hint to the processor that it can enter an implementation-dependent optimized state while waiting for an event or a store operation to the address range armed by UMONITOR.

次のような場合には、プロセッサに、実装に依存して最適化された状態を抜けさせてよい。ＵＭＯＮＩＴＯＲ命令により作動可能なアドレス範囲への格納、非マスク可能な割込み（ＮＭＩ）又はシステム管理割込み（ＳＭＩ）、デバッグ例外、マシンチェック例外、ＢＩＮＩＴ＃信号、ＩＮＩＴ＃信号及びＲＥＳＥＴ＃信号。他の実装に依存するイベントは、プロセッサに、実装に依存して最適化された状態を抜けさせてもよい。 The following may cause the processor to exit the implementation-dependent optimized state: a store to the address range armed by the UMONITOR instruction, a non-maskable interrupt (NMI) or system management interrupt (SMI), a debug exception, a machine check exception, the BINIT# signal, the INIT# signal, and the RESET# signal. Other implementation-dependent events may also cause the processor to exit the implementation-dependent optimized state.

さらに、外部の割込みは、マスク可能割込みが阻害されるか否かに関わらず、プロセッサに、実装に依存して最適化された状態を抜けさせてよい。 In addition, an external interrupt may cause the processor to exit the implementation-dependent optimized state regardless of whether maskable interrupts are inhibited.

実装に依存して最適化された状態からの抜け出しに続いて、制御が、ＵＭＷＡＩＴ命令に続く命令に渡される。マスクされていない未処理の割込み（ＮＭＩ又はＳＭＩを含む）は、命令の実行前に配信されてよい。 Following exit from the implementation-dependent optimized state, control passes to the instruction following the UMWAIT instruction. A pending interrupt that is not masked (including an NMI or SMI) may be delivered before execution of that instruction.

ＨＬＴ命令とは異なり、ＵＭＷＡＩＴ命令は、ＳＭＩの処理に続くＵＭＷＡＩＴ命令での再開をサポートしていない。先行するＵＭＯＮＩＴＯＲ命令が、アドレス範囲を正常に作動可能していなかった場合、又は、ＵＭＷＡＩＴを実行し、続けてレガシＭＯＮＩＴＯＲ命令の最新の実行の前に、ＵＭＯＮＩＴＯＲが実行されなかった（ＵＭＷＡＩＴがＭＯＮＩＴＯＲと同時に使用されていない）場合、プロセッサは、最適化された状態に入らない。実行は、ＵＭＷＡＩＴの後に続く命令において再開する。 Unlike the HLT instruction, the UMWAIT instruction does not support restarting at the UMWAIT instruction following the handling of an SMI. If the preceding UMONITOR instruction did not successfully arm an address range, or if UMONITOR was not executed before UMWAIT and after the most recent execution of the legacy MONITOR instruction (UMWAIT does not interoperate with MONITOR), the processor does not enter an optimized state. Execution resumes at the instruction following UMWAIT.

ＵＭＷＡＩＴが、Ｃ１より数値的に低いＣ０－サブ状態に入るために用いられ、従って、ＵＭＯＮＩＴＯＲ命令により作動可能なアドレス範囲に対する格納は、他のプロセッサエージェントにより格納が発せられた場合、又は、非プロセッサエージェントにより格納が発せられた場合のいずれか一方の場合、プロセッサにＵＭＷＡＩＴを終了させることに留意する。 Note that UMWAIT is used to enter C0 sub-states, which are numerically lower than C1; therefore, a store to the address range armed by the UMONITOR instruction causes the processor to exit UMWAIT whether the store originates from another processor agent or from a non-processor agent.

図６５は、ＵＭＷＡＩＴ命令を処理するために、プロセッサにより実行される方法の実施形態を示す。例えば、ここでは、ハードウェアの詳細が用いられる。 Figure 65 illustrates an embodiment of a method performed by a processor to process a UMWAIT instruction. For example, the hardware detailed herein is used.

６５０１において、命令がフェッチされる。例えば、ＵＭＷＡＩＴがフェッチされる。ＵＭＷＡＩＴ命令は、オペコード（及び、いくつかの実施形態において、プレフィックス）及び明示的なソースレジスタオペランドを含む。 At 6501, an instruction is fetched. For example, UMWAIT is fetched. The UMWAIT instruction includes an opcode (and, in some embodiments, a prefix) and an explicit source register operand.

６５０３において、フェッチされた命令がデコードされる。例えば、ＵＭＷＡＩＴ命令は、本明細書で詳細に説明されるようなデコード回路によりデコードされる。 At 6503, the fetched instruction is decoded. For example, the UMWAIT instruction is decoded by a decode circuit as described in detail herein.

６５０５において、デコードされた命令のソースオペランドと関連付けられたデータ値が取得される。さらに、いくつかの実施形態において、命令がスケジューリングされる。 At 6505, data values associated with the source operands of the decoded instruction are obtained. Further, in some embodiments, the instruction is scheduled.

６５０７において、デコードされた命令は、アドレスの範囲をモニタリングしている間に、プロセッサ（又は、コア）を、明示的なソースレジスタオペランドについてのデータにより規定される実装依存状態に入れる、本明細書で詳細に説明されるような、実行回路（ハードウェア）により実行される。 At 6507, the decoded instruction is executed by execution circuitry (hardware), as described in detail herein, which places the processor (or core) in an implementation-dependent state defined by the data for the explicit source register operands while monitoring the address range.

いくつかの実施形態では、６５０９において、命令がコミット又はリタイアされる。 In some embodiments, at 6509, the instruction is committed or retired.

ＴＰＡＵＳＥは、実装に依存して最適化された状態に入るようプロセッサに命令する。軽量電力／性能が最適化された状態、及び、改善された電力／性能が最適化された状態の中から選択するそのような２つの最適化された状態がある。当該２つの中からの選択は、明示的な入力レジスタビット［０］ソースオペランドにより統制される。

TPAUSE instructs the processor to enter an implementation-dependent optimized state. There are two such optimized states to choose from: a lightweight power/performance optimized state and an improved power/performance optimized state. The selection between the two is governed by the explicit input register bit[0] source operand.

ＴＰＡＵＳＥは、任意の特権レベルで実行されてよい。この命令のオペレーションは、非６４ビットモード及び６４ビットモードにおいて同じである。 TPAUSE may be executed at any privilege level. The operation of this instruction is the same in non-64-bit and 64-bit modes.

ＰＡＵＳＥとは異なり、ＴＰＡＵＳＥ命令は、トランザクション領域の内部で用いられるときにアボートを生じさせない。入力レジスタは、以下のテーブルで説明されるように、プロセッサが入る好ましく最適化された状態のような情報を含む。ビット０以外のビットは、予約済みであり、ゼロ以外である場合、＃ＧＰをもたらす。

Unlike PAUSE, the TPAUSE instruction does not cause an abort when used inside a transactional region. The input register contains information such as the preferred optimized state the processor should enter, as described in the table below. Bits other than bit 0 are reserved and result in #GP if non-zero.

命令は、タイムスタンプカウンタが暗黙的な６４ビット入力値に達した又はこれを超えた場合（モニタリングハードウェアが予めトリガされていなかった場合）にウェイクアップする。ＴＰＡＵＳＥ命令を実行する前に、オペレーティングシステムは、２つの電力／性能が最適化された状態のいずれか一方におけるそのオペレーションをプロセッサが一時停止することを可能にする最大遅延を規定してよい。それは、以下の３２ビットＭＳＲにＴＳＣ量子値を書き込むことによりそうすることができる。
ＵＭＷＡＩＴ＿ＣＯＮＴＲＯＬ［３１：２］－Ｃ０．１又はＣ０．２のいずれか一方にプロセッサが存在し得るＴＳＣ量子における最大時間を判断する。ゼロの値は、ＯＳがプロセッサに対して課した制限がないことを示す。最大時間値は、上位３０ｂがこのフィールドから来ており、下位２ビットがゼロであると仮定される３２ｂの値である。
ＵＭＷＡＩＴ＿ＣＯＮＴＲＯＬ［１］－予約済。
ＵＭＷＡＩＴ＿ＣＯＮＴＲＯＬ［０］－Ｃ０．２はＯＳにより許可されていない。１の値は、すべてのＣ０．２要求がＣ０．１に戻ることを意味する。

The instruction wakes up when the timestamp counter reaches or exceeds the implicit 64-bit input value (if the monitoring hardware has not already been triggered). Before executing the TPAUSE instruction, the operating system may specify the maximum delay for which the processor may suspend its operation in either of the two power/performance optimized states. It can do so by writing a TSC-quanta value to the following 32-bit MSR:
UMWAIT_CONTROL[31:2] - Determines the maximum time, in TSC quanta, that the processor can reside in either C0.1 or C0.2. A value of zero indicates that the OS imposes no limit on the processor. The maximum time value is a 32-bit value whose upper 30 bits come from this field and whose lower 2 bits are assumed to be zero.
UMWAIT_CONTROL[1] - Reserved.
UMWAIT_CONTROL[0] - C0.2 is not allowed by the OS. A value of 1 means that all C0.2 requests revert to C0.1.

ＯＳの時間制限の期限切れに起因するウェイクアップ理由は、キャリーフラグを設定することにより示されてよい。 A wakeup reason due to expiration of an OS time limit may be indicated by setting the carry flag.

ＴＰＡＵＳＥ命令を実行したプロセッサがオペレーティングシステム時間制限の期限切れに起因して起きた場合、命令は、キャリーフラグを設定し、そうでなければ、そのフラグがクリアされる。 If the processor that executed the TPAUSE instruction woke up due to the expiration of an operating system time limit, the instruction sets the carry flag; otherwise, the flag is cleared.

複数のアドレス範囲をモニタリングすることに関して、ＴＰＡＵＳＥ命令は、モニタするためのアドレスのセット及び後続のＴＰＡＵＳＥ命令から構成されるトランザクション領域内に置かれ得る。トランザクション領域は、待機するアドレスのセットの定義を可能にし、実装に依存して最適化されたオペレーションがＴＰＡＵＳＥ命令の実行時に開始することを可能にする。一実施例において、ＴＰＡＵＳＥの実行は、読み出しセットにより規定される範囲内のアドレスに対するイベント又は格納処理を待機している間に、実装に依存して最適化された状態に入るようプロセッサに指示する。 To monitor multiple address ranges, the TPAUSE instruction may be placed within a transactional region consisting of a set of addresses to monitor followed by a TPAUSE instruction. The transactional region allows the definition of a set of addresses at which to wait and an implementation-dependent optimized operation to begin upon execution of the TPAUSE instruction. In one embodiment, execution of TPAUSE directs the processor to enter an implementation-dependent optimized state while waiting for an event or a store operation to an address in the range defined by the read set.

トランザクションメモリ領域内でのＴＰＡＵＳＥの使用は、Ｃ０．１（軽量電力／性能が最適化された状態）に制限され得る。たとえ、ソフトウェアが、Ｃ０．２（改善された電力／性能が最適化された状態）に対するその優先度を示すべく、ビット［０］＝０を設定したとしても、プロセッサは、Ｃ０．１に入ってよい。 The use of TPAUSE within a transactional memory region may be limited to C0.1 (the lightweight power/performance optimized state). Even if software sets bit[0] = 0 to indicate its preference for C0.2 (the improved power/performance optimized state), the processor may enter C0.1.

次のような場合は、プロセッサに、実装に依存して最適化された状態を抜けさせてよい。トランザクション領域内の読み出しセット範囲への格納、ＮＭＩ又はＳＭＩ、デバッグ例外、マシンチェック例外、ＢＩＮＩＴ＃信号、ＩＮＩＴ＃信号及びＲＥＳＥＴ＃信号。すべてのこれらのイベントはまた、トランザクションをアボートする。 The following may cause the processor to exit the implementation-dependent optimized state: a store to the read-set range within the transactional region, an NMI or SMI, a debug exception, a machine check exception, the BINIT# signal, the INIT# signal, and the RESET# signal. All of these events also abort the transaction.

他の実装に依存するイベントは、プロセッサに、実装に依存して最適化された状態を抜けさせてよく、ＴＰＡＵＳＥに続く命令に進んで、アボートされていないトランザクション領域を結果としてもたらし得る。さらに、外部の割込みは、いくつかの実施形態において、マスク可能割込みが阻害されるか否かに関わらず、プロセッサに、実装に依存して最適化された状態を抜けさせてよい。マスク可能割込みが阻害される場合、実行はＴＰＡＵＳＥに続く命令に進み、一方、割込みがイネーブルにされたフラグが設定されている場合、トランザクション領域がアボートされることに留意されたい。 Other implementation-dependent events may cause the processor to leave the implementation-dependent optimized state and proceed to the instruction following TPAUSE, resulting in a transactional region that is not aborted. Additionally, an external interrupt may, in some embodiments, cause the processor to leave the implementation-dependent optimized state regardless of whether maskable interrupts are inhibited. Note that if maskable interrupts are inhibited, execution proceeds to the instruction following TPAUSE, whereas if the interrupt enabled flag is set, the transactional region is aborted.

図６６は、ＴＰＡＵＳＥ命令を処理するために、プロセッサにより実行される方法の実施形態を示す。例えば、ここでは、ハードウェアの詳細が用いられる。 Figure 66 illustrates an embodiment of a method performed by a processor to process a TPAUSE instruction. For example, the hardware detailed herein is used.

６６０１において、命令がフェッチされる。例えば、ＴＰＡＵＳＥがフェッチされる。ＴＰＡＵＳＥ命令は、オペコード（及び、いくつかの実施形態において、プレフィックス）及び明示的なソースレジスタオペランドを含む。 At 6601, an instruction is fetched. For example, TPAUSE is fetched. The TPAUSE instruction includes an opcode (and, in some embodiments, a prefix) and an explicit source register operand.

６６０３において、フェッチされた命令がデコードされる。例えば、ＴＰＡＵＳＥ命令は、本明細書で詳細に説明されるようなデコード回路によりデコードされる。 At 6603, the fetched instruction is decoded. For example, the TPAUSE instruction is decoded by a decode circuit as described in detail herein.

６６０５において、デコードされた命令のソースオペランドと関連付けられたデータ値が取得される。さらに、いくつかの実施形態において、命令がスケジューリングされる。 At 6605, data values associated with the source operands of the decoded instruction are obtained. Further, in some embodiments, the instruction is scheduled.

６６０７において、デコードされた命令は、明示的なソースレジスタオペランドについてのデータにより規定される実装ごとに決まる状態にプロセッサ（又は、コア）を入れる、本明細書で詳細に説明されるような、実行回路（ハードウェア）により実行される。 At 6607, the decoded instruction is executed by execution circuitry (hardware), as described in detail herein, which places the processor (or core) in an implementation-specific state defined by data for the explicit source register operands.

いくつかの実施形態では、６６０９において、命令がコミット又はリタイアされる。 In some embodiments, at 6609, the instruction is committed or retired.

図６７は、ＵＭＷＡＩＴ及びＵＭＯＮＩＴＯＲ命令を用いた実行の例を示す。 Figure 67 shows an example of execution using the UMWAIT and UMONITOR instructions.

６７０１において、ＵＭＯＮＩＴＯＲ命令は、モニタリングするためにアドレスの範囲を設定するように実行される。 At 6701, the UMONITOR instruction is executed to set the range of addresses to monitor.

６７０３において、ＵＭＷＡＩＴ命令は、モニタされているアドレスの範囲に対して、命令の明示的なソースレジスタオペランドについてのデータにより規定される実装依存状態に、命令を実行するコアを入れるように実行される。 At 6703, the UMWAIT instruction is executed to put the core executing the instruction into an implementation-dependent state, defined by the data of the instruction's explicit source register operand, for the range of addresses being monitored.

６７０５において、実装依存状態は、モニタリングされたアドレスへの格納、ＮＭＩ、ＳＭＩ、デバッグ例外、マシンチェック例外、ｉｎｉｔ信号又はリセット信号のうちの１つに応じて抜ける。 At 6705, the implementation-dependent state is exited in response to one of a store to a monitored address, an NMI, an SMI, a debug exception, a machine check exception, an init signal, or a reset signal.
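
The arm, wait, and wake sequence described for UMONITOR and UMWAIT can be mimicked by a toy software model. This is only a sketch for following the control flow, not a description of real hardware: the class `MonitorModel`, the 64-byte range size, and the event tuples are all assumptions introduced for this example.

```python
WAKE_EVENTS = {"NMI", "SMI", "debug exception", "machine check exception",
               "INIT", "RESET"}


class MonitorModel:
    """Toy model of arming an address range and waiting for a wake-up event."""

    def __init__(self, range_bytes=64):   # the range size is an assumption
        self.range_bytes = range_bytes
        self.armed = None                 # no address range armed yet

    def umonitor(self, address):
        """Arm the aligned address range that contains `address`."""
        base = address - address % self.range_bytes
        self.armed = range(base, base + self.range_bytes)

    def umwait(self, events):
        """Walk ("store", addr) or (event_name, None) tuples in arrival order."""
        if self.armed is None:
            return "not entered"          # nothing armed: the state is not entered
        for kind, addr in events:
            if kind == "store" and addr in self.armed:
                self.armed = None
                return "store to monitored range"
            if kind in WAKE_EVENTS:
                self.armed = None
                return kind
        return "still waiting"


m = MonitorModel()
m.umonitor(0x1010)
print(m.umwait([("store", 0x2000), ("store", 0x1020)]))
```

The store to 0x2000 falls outside the armed range and is ignored; the store to 0x1020 ends the wait.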

図６８は、ＴＰＡＵＳＥ及びＵＭＯＮＩＴＯＲ命令を用いた実行の例を示す。 Figure 68 shows an example of execution using the TPAUSE and UMONITOR instructions.

６８０１において、ＵＭＯＮＩＴＯＲ命令は、モニタリングするためにアドレスの範囲を設定するように実行される。 At 6801, the UMONITOR instruction is executed to set the range of addresses to monitor.

６８０３において、ＴＰＡＵＳＥ命令は、モニタリングされているアドレスの範囲に対して、命令の明示的なソースレジスタオペランドについてのデータにより規定される実装依存状態に、命令を実行するコアを入れるように実行される。 At 6803, the TPAUSE instruction is executed to put the core executing the instruction into an implementation-dependent state, defined by the data of the instruction's explicit source register operand, for the range of addresses being monitored.

６８０５において、実装依存状態は、モニタリングされたアドレスへの格納、ＮＭＩ、ＳＭＩ、デバッグ例外、マシンチェック例外、ｉｎｉｔ信号又はリセット信号のうちの１つに応じて抜ける。 At 6805, the implementation-dependent state is exited in response to one of a store to a monitored address, an NMI, an SMI, a debug exception, a machine check exception, an init signal, or a reset signal.

６８０７において、スレッドと関連付けられたトランザクションは、実装依存状態から抜け出たときにアボートされる。 At 6807, the transaction associated with the thread is aborted when the implementation-dependent state is exited.

いくつかの実施例では、アクセラレータは、特定のタイプのオペレーション、数例をあげると、グラフィックスオペレーション、機械学習オペレーション、パターン解析オペレーション、及び、（以下で詳細に説明されるような）疎行列乗算演算などを加速させるプロセッサコア又は他の処理要素に結合される。アクセラレータは、バス又は他の相互接続（例えば、ポイントツーポイント相互接続）を介してプロセッサ／コアに通信可能に結合されてよい、又は、プロセッサと同じチップ上に統合され、内部プロセッサバス／相互接続を介してコアに通信可能に結合されてよい。アクセラレータが接続される態様に関わらず、プロセッサコアは、これらのタスクを効率的に処理する専用の回路／論理を含むアクセラレータに、特定の処理タスク（例えば、命令又はＵＯＰのシーケンスの形式で）を割り当ててよい。 In some embodiments, an accelerator is coupled to a processor core or other processing element to accelerate certain types of operations, such as graphics operations, machine learning operations, pattern analysis operations, and sparse matrix multiplication operations (as described in more detail below), to name a few. The accelerator may be communicatively coupled to the processor/core via a bus or other interconnect (e.g., a point-to-point interconnect), or may be integrated on the same chip as the processor and communicatively coupled to the core via an internal processor bus/interconnect. Regardless of how the accelerator is connected, the processor core may assign specific processing tasks (e.g., in the form of a sequence of instructions or UOPs) to the accelerator, which contains dedicated circuitry/logic to efficiently process these tasks.

図６９は、アクセラレータ６９００が、キャッシュコヒーレントインタフェース６９３０を通じて複数のコア６９１０－６９１１に通信可能に結合される例示的な実装を示す。コア６９１０－６９１１のそれぞれは、仮想－物理アドレス変換を格納するためのトランスレーションルックアサイドバッファ６９１２－６９１３と、データ及び命令をキャッシングするための１又は複数のキャッシュ６９１４－６９１５（例えば、Ｌ１キャッシュ、Ｌ２キャッシュなど）とを含む。メモリ管理ユニット６９２０は、ダイナミックランダムアクセスメモリ（ＤＲＡＭ）であり得るシステムメモリ６９５０へのコア６９１０－６９１１によるアクセスを管理する。Ｌ３キャッシュなどの共有キャッシュ６９２６は、プロセッサコア６９１０－６９１１間で、及び、キャッシュコヒーレントインタフェース６９３０を介してアクセラレータ６９００と共有されてよい。一実施例において、コア６９１０－６９１１、ＭＭＵ６９２０及びキャッシュコヒーレントインタフェース６９３０は、シングルプロセッサチップ上に統合される。 Figure 69 illustrates an exemplary implementation in which an accelerator 6900 is communicatively coupled to multiple cores 6910-6911 through a cache coherent interface 6930. Each of the cores 6910-6911 includes a translation lookaside buffer 6912-6913 for storing virtual-to-physical address translations and one or more caches 6914-6915 (e.g., L1 cache, L2 cache, etc.) for caching data and instructions. A memory management unit 6920 manages access by the cores 6910-6911 to a system memory 6950, which may be dynamic random access memory (DRAM). A shared cache 6926, such as an L3 cache, may be shared among the processor cores 6910-6911 and with the accelerator 6900 through the cache coherent interface 6930. In one embodiment, the cores 6910-6911, the MMU 6920, and the cache coherent interface 6930 are integrated on a single processor chip.

図示されたアクセラレータ６９００は、キャッシュ６９０７及び複数の処理要素６９０１－６９０２、Ｎに対するスケジューリングオペレーションのためのスケジューラ６９０６を有するデータ管理ユニット６９０５を含む。例示された実施例において、各処理要素は、独自のローカルメモリ６９０３－６９０４、Ｎを有する。以下で詳細に説明されるように、各ローカルメモリ６９０３－６９０４、Ｎは、スタック型ＤＲＡＭとして実装されてよい。 The illustrated accelerator 6900 includes a data management unit 6905 having a cache 6907 and a scheduler 6906 for scheduling operations for a number of processing elements 6901-6902, N. In the illustrated embodiment, each processing element has its own local memory 6903-6904, N. As described in more detail below, each local memory 6903-6904, N may be implemented as a stacked DRAM.

一実施例において、キャッシュコヒーレントインタフェース６９３０は、コア６９１０－６９１１とアクセラレータ６９００との間にキャッシュコヒーレントな接続性をもたらし、実際には、アクセラレータをコア６９１０－６９１１のピアとして扱う。例えば、キャッシュコヒーレントインタフェース６９３０は、アクセラレータ６９００によりアクセス／修正され、かつ、アクセラレータキャッシュ６９０７及び／又はローカルメモリ６９０３－６９０４、Ｎに格納されるデータが、コアキャッシュ６９１４－６９１５、共有キャッシュ６９２６及びシステムメモリ６９５０に格納されるデータとコヒーレントであることを確保するために、キャッシュコヒーレンシプロトコルを実装してよい。例えば、キャッシュコヒーレントインタフェース６９３０は、共有キャッシュ６９２６及びローカルキャッシュ６９１４－６９１５内のキャッシュラインの状態を検出するために、コア６９１０－６９１１及びＭＭＵ６９２０により用いられるスヌーピングメカニズムに参加してよく、プロキシとして動作してよく、処理要素６９０１－６９０２、Ｎにより、キャッシュラインに対するアクセス及び試みた修正に応じてスヌープ更新を提供する。さらに、キャッシュラインが、処理要素６９０１－６９０２、Ｎにより修正された場合、キャッシュコヒーレントインタフェース６９３０は、共有キャッシュ６９２６又はローカルキャッシュ６９１４－６９１５内にそれらが格納されている場合にキャッシュラインのステータスを更新してよい。 In one embodiment, the cache coherent interface 6930 provides cache-coherent connectivity between the cores 6910-6911 and the accelerator 6900, in effect treating the accelerator as a peer of the cores 6910-6911. For example, the cache coherent interface 6930 may implement a cache coherency protocol to ensure that data accessed/modified by the accelerator 6900 and stored in the accelerator cache 6907 and/or the local memories 6903-6904, N is coherent with data stored in the core caches 6914-6915, the shared cache 6926, and the system memory 6950. For example, the cache coherent interface 6930 may participate in the snooping mechanisms used by the cores 6910-6911 and the MMU 6920 to detect the state of cache lines within the shared cache 6926 and the local caches 6914-6915, and may act as a proxy, providing snoop updates in response to accesses and attempted modifications to cache lines by the processing elements 6901-6902, N. In addition, when a cache line is modified by the processing elements 6901-6902, N, the cache coherent interface 6930 may update the status of that cache line if it is stored within the shared cache 6926 or the local caches 6914-6915.

一実施例において、データ管理ユニット６９０５は、システムメモリ６９５０及び共有キャッシュ６９２６へのアクセスをアクセラレータ６９００に提供するメモリ管理回路を含む。さらに、データ管理ユニット６９０５は、必要に応じて（例えば、キャッシュラインに対する状態の変化を判断するために）、キャッシュコヒーレントインタフェース６９３０への更新を提供、及び、キャッシュコヒーレントインタフェース６９３０から更新を受信する。例示された実施例において、データ管理ユニット６９０５は、処理要素６９０１－６９０２により実行される命令／処理をスケジューリングするためのスケジューラ６９０６を含む。そのスケジューリング処理を実行するために、スケジューラ６９０６は、命令／処理間の依存性を評価して、命令／処理がコヒーレントな順序で実行されることを確保する（例えば、第１の命令が、第１の命令からの結果に依存する第２の命令の前に実行することを確保する）。内部依存していない命令／処理は、処理要素６９０１－６９０２で並列に実行されてよい。 In one embodiment, the data management unit 6905 includes memory management circuitry that provides the accelerator 6900 with access to the system memory 6950 and the shared cache 6926. In addition, the data management unit 6905 provides updates to, and receives updates from, the cache coherent interface 6930 as needed (e.g., to determine state changes of cache lines). In the illustrated embodiment, the data management unit 6905 includes a scheduler 6906 for scheduling instructions/operations to be executed by the processing elements 6901-6902. To perform its scheduling operations, the scheduler 6906 evaluates dependencies between instructions/operations to ensure that they are executed in a coherent order (e.g., that a first instruction executes before a second instruction that depends on a result of the first instruction). Instructions/operations that are not interdependent may be executed in parallel on the processing elements 6901-6902.
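
The dependency rule described for the scheduler (an operation runs only after the operations it depends on, while independent operations may run in parallel) can be illustrated with a small Python sketch. It is a generic topological grouping, not the scheduler 6906 itself; the function name `schedule_waves` and the operation names are hypothetical.

```python
def schedule_waves(deps):
    """Group operations into waves; operations in one wave are independent.

    `deps` maps each operation to the set of operations whose results it needs.
    """
    remaining = {op: set(needed) for op, needed in deps.items()}
    waves = []
    while remaining:
        # An operation is ready once all of its dependencies have completed.
        ready = sorted(op for op, needed in remaining.items() if not needed)
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependency")
        waves.append(ready)
        for op in ready:
            del remaining[op]
        for needed in remaining.values():
            needed.difference_update(ready)
    return waves


deps = {"mul0": set(), "mul1": set(), "add": {"mul0", "mul1"}, "store": {"add"}}
print(schedule_waves(deps))  # [['mul0', 'mul1'], ['add'], ['store']]
```

Each inner list could be dispatched to different processing elements at the same time; successive lists must run in order.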

図７０は、データ管理ユニット６９０５、複数の処理要素６９０１－Ｎ及び（例えば、一実施例においてスタックされたローカルのＤＲＡＭを用いて実装された）高速オンチップストレージ７０００を含む前述のアクセラレータ６９００及び他のコンポーネントの別の図を示す。一実施例において、アクセラレータ６９００は、ハードウェアアクセラレータアーキテクチャであり、処理要素６９０１－Ｎは、疎／密行列に対する演算を含む行列×ベクトル及びベクトル×ベクトル演算を実行するための回路を含む。特に、処理要素６９０１－Ｎは、列及び行方向の行列処理に対するハードウェアサポートを含んでよく、機械学習（ＭＬ）アルゴリズムで用いられるような「スケール及び更新」オペレーションに対するマイクロアーキテクチャ上のサポートを含んでもよい。 Figure 70 shows another view of the accelerator 6900 and other components described above, including a data management unit 6905, multiple processing elements 6901-N, and fast on-chip storage 7000 (e.g., implemented using stacked local DRAM in one embodiment). In one embodiment, the accelerator 6900 is a hardware accelerator architecture, and the processing elements 6901-N include circuitry for performing matrix-by-vector and vector-by-vector operations, including operations on sparse and dense matrices. In particular, the processing elements 6901-N may include hardware support for column-wise and row-wise matrix processing, and may include microarchitectural support for "scale and update" operations such as those used in machine learning (ML) algorithms.

説明される実装は、高速オンチップストレージ７０００において、頻繁に用いられ、ランダムにアクセスされ、潜在的に疎な（例えば、ギャザー／スキャッタ）ベクトルデータを保持することにより、ストリーミング様式で、可能なときにはいつでもアクセスされるオフチップメモリ（例えば、システムメモリ６９５０）において、大きくて低い頻度で用いられる行列データを維持することにより、及び、スケールアップするためにイントラ／インター行列ブロック並列処理をさらすことにより、最適化される行列／ベクトル演算を実行する。 The described implementation performs matrix/vector operations that are optimized by keeping frequently used, randomly accessed, potentially sparse (e.g., gather/scatter) vector data in fast on-chip storage 7000, by maintaining large, less frequently used matrix data in off-chip memory (e.g., system memory 6950) that is accessed in a streaming manner whenever possible, and by exposing intra/inter matrix block parallelism to scale up.

処理要素６９０１－Ｎの実施例では、疎行列、密行列、疎ベクトル及び密ベクトルの様々な組み合わせを処理する。本明細書で用いられるように、「疎（ｓｐａｒｓｅ）」行列又はベクトルは、成分のほとんどがゼロである行列又はベクトルである。一方、「密（ｄｅｎｓｅ）」行列又はベクトルは、成分のほとんどがゼロ以外である行列又はベクトルである。行列／ベクトルの「まばら（ｓｐａｒｓｉｔｙ）」は、ゼロの値の成分の数を成分の総数（例えば、ｍ×ｎ行列に対してｍ×ｎ）で割ることに基づいて規定され得る。一実施例において、行列／ベクトルは、そのまばらさが特定の閾値を上回る場合、「疎（ｓｐａｒｓｅ）」であるとみなされる。 Embodiments of the processing elements 6901-N process various combinations of sparse matrices, dense matrices, sparse vectors, and dense vectors. As used herein, a "sparse" matrix or vector is one in which most of the elements are zero. By contrast, a "dense" matrix or vector is one in which most of the elements are non-zero. The "sparsity" of a matrix/vector may be defined as the number of zero-valued elements divided by the total number of elements (e.g., m×n for an m×n matrix). In one embodiment, a matrix/vector is considered "sparse" if its sparsity is above a specified threshold.
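
The sparsity definition above amounts to a single division. The following Python sketch computes it for a matrix given as a list of rows; the function names and the default threshold of 0.5 are assumptions chosen for illustration, since the description leaves the threshold unspecified.

```python
def sparsity(matrix):
    """Number of zero-valued elements divided by the total number of elements."""
    total = sum(len(row) for row in matrix)
    zeros = sum(1 for row in matrix for value in row if value == 0)
    return zeros / total


def is_sparse(matrix, threshold=0.5):
    """Treat the matrix as sparse when its sparsity exceeds the threshold."""
    return sparsity(matrix) > threshold


A = [[0, 0, 3, 0],
     [1, 0, 0, 2],
     [0, 4, 0, 0]]
print(sparsity(A), is_sparse(A))  # 8 of 12 elements are zero
```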

処理要素６９０１－Ｎにより実行される処理の例示的なセットが図７１内のテーブルに示される。特に、オペレーションタイプは、疎行列を用いる第１の乗算７１００、密行列を用いる第２の乗算７１０１、スケール及び更新演算７１０２及びドット積演算７１０３を含む。列には、第１の入力オペランド７１１０及び第２の入力オペランド７１１１（それぞれが、疎又は密行列／ベクトルを含み得る）、出力フォーマット７１１２（例えば、密ベクトル又はスカラ）、行列データフォーマット（例えば、圧縮された疎行、圧縮された疎列、行方向など）７１１３、及び、オペレーション識別子７１１４が規定されている。 An exemplary set of operations performed by processing elements 6901-N is shown in the table in FIG. 71. In particular, the operation types include a first multiplication with a sparse matrix 7100, a second multiplication with a dense matrix 7101, a scale and update operation 7102, and a dot product operation 7103. The columns specify a first input operand 7110 and a second input operand 7111 (each of which may include a sparse or dense matrix/vector), an output format 7112 (e.g., dense vector or scalar), a matrix data format (e.g., compressed sparse row, compressed sparse column, row-wise, etc.) 7113, and an operation identifier 7114.

いくつかの現在のワークロードにおいて得られるランタイムドメイン計算パターンは、行方向及び列方向の様式でベクトルに対する行列の乗算の変形を含む。周知の行列上のそれらの機能は、圧縮された疎行（ＣＳＲ）及び圧縮された疎列（ＣＳＣ）の形式を合わせる。図７２ａは、ベクトルｙを生成するために、ベクトルｘに対する疎行列間の乗算の例を図示する。図７２ｂは、各値が（値、行インデックス）ペアとして格納される行列ＡのＣＳＲ表現を示す。例えば、行０に対する（３、２）は、３の値が、行０の成分位置２に格納されていることを示す。図７２ｃは、（値、列インデックス）ペアを用いる行列ＡのＣＳＣ表現を示す。 The runtime-dominating compute patterns found in some current workloads include variations of matrix multiplication against a vector in row-oriented and column-oriented fashion. They operate on well-known matrix formats: compressed sparse row (CSR) and compressed sparse column (CSC). Figure 72a illustrates an example of multiplying a sparse matrix A against a vector x to produce a vector y. Figure 72b shows the CSR representation of matrix A, in which each value is stored as a (value, row index) pair; the index gives the element's position within its row. For example, (3, 2) for row 0 indicates that a value of 3 is stored in element position 2 of row 0. Figure 72c shows the CSC representation of matrix A, which uses (value, column index) pairs.
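
To make the two layouts concrete, the sketch below converts a small dense matrix into per-row and per-column lists of (value, index) pairs, following the pair-based description above. The matrix is a made-up example (Figure 72 is not reproduced here), chosen so that row 0 yields the pair (3, 2). In conventional CSR terminology the stored index is the element's column, and in CSC it is the element's row; production libraries usually store three flat arrays rather than lists of pairs.

```python
def to_csr(dense):
    """Per row: (value, position within the row) for each non-zero element."""
    return [[(v, j) for j, v in enumerate(row) if v != 0] for row in dense]


def to_csc(dense):
    """Per column: (value, position within the column) for each non-zero element."""
    n_rows, n_cols = len(dense), len(dense[0])
    return [[(dense[i][j], i) for i in range(n_rows) if dense[i][j] != 0]
            for j in range(n_cols)]


A = [[0, 0, 3, 0],
     [1, 0, 0, 2],
     [0, 4, 0, 0]]
print(to_csr(A))  # [[(3, 2)], [(1, 0), (2, 3)], [(4, 1)]]
print(to_csc(A))  # [[(1, 1)], [(4, 2)], [(3, 0)], [(2, 1)]]
```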

図７３ａ、図７３ｂ及び図７３ｃは、各計算パターンの擬似コードを示し、以下に詳細に説明される。特に、図７３ａは、行方向の疎行列・密ベクトル乗算（ｓｐＭｄＶ＿ｃｓｒ）を示し、図７３ｂは、列方向の疎行列・疎ベクトル乗算（ｓｐＭｓｐＶ＿ｃｓｃ）を示し、図７３ｃは、スケール及び更新演算（ｓｃａｌｅ＿ｕｐｄａｔｅ）を示す。 Figures 73a, 73b, and 73c show pseudocode for each compute pattern, described in detail below. In particular, Figure 73a shows row-wise sparse matrix-dense vector multiplication (spMdV_csr), Figure 73b shows column-wise sparse matrix-sparse vector multiplication (spMspV_csc), and Figure 73c shows the scale and update operation (scale_update).

Ａ．行方向の疎行列・密ベクトル乗算（ｓｐＭｄＶ＿ｃｓｒ） A. Row-wise sparse matrix-dense vector multiplication (spMdV_csr)

これは、高性能な計算など多くのアプリケーション分野で重要な周知の計算パターンである。ここで、行列Ａの各行に対して、ベクトルｘに対するその行のドット積が実行され、その結果が、行インデックスにより指し示されるｙベクトル成分に格納される。この計算は、サンプリングのセット（すなわち、行列の行）にわたって解析を実行する機械学習（ＭＬ）アルゴリズムにおいて用いられる。それは、「ミニバッチ」などの技術において用いられてもよい。例えば、学習アルゴリズムの確率論的な確率変数において、ＭＬアルゴリズムが、密ベクトルに対する疎ベクトルのドット積を単に実行する場合（すなわち、ｓｐＭｄＶ＿ｃｓｒループの反復）もある。 This is a well-known compute pattern that is important in many application domains, such as high-performance computing. Here, for each row of matrix A, a dot product of that row against vector x is performed, and the result is stored in the y vector element pointed to by the row index. This computation is used in machine learning (ML) algorithms that perform analysis across a set of samples (i.e., rows of the matrix). It may also be used in techniques such as "mini-batch". There are also cases, such as in stochastic variants of learning algorithms, in which an ML algorithm performs only a dot product of a sparse vector against a dense vector (i.e., a single iteration of the spMdV_csr loop).
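
The row-wise pattern described above can be written in a few lines. This is a minimal Python sketch of the described computation, not the pseudocode of Figure 73a; it assumes that each row of A is stored as a list of (value, column position) pairs and that x is a plain list.

```python
def spmdv_csr(csr_rows, x):
    """Row-wise sparse matrix times dense vector: y[i] = dot(row i of A, x)."""
    y = [0] * len(csr_rows)
    for i, row in enumerate(csr_rows):
        # Each non-zero of the row gathers one element of x.
        y[i] = sum(value * x[col] for value, col in row)
    return y


# A = [[0, 0, 3, 0], [1, 0, 0, 2], [0, 4, 0, 0]] stored row by row.
A_csr = [[(3, 2)], [(1, 0), (2, 3)], [(4, 1)]]
print(spmdv_csr(A_csr, [1, 2, 3, 4]))  # [9, 9, 8]
```

A single iteration of the outer loop is the sparse-vector-against-dense-vector dot product mentioned for stochastic learning variants.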

この計算の性能に影響を与え得る既知の要因は、ドット積計算において、疎ｘベクトル成分にランダムにアクセスする必要があることである。従来のサーバシステムに関して、ｘベクトルが大きい場合、これは、メモリ又はラストレベルキャッシュへの不規則なアクセス（収集）を結果としてもたらしたであろう。 A known factor that can affect the performance of this computation is the need to randomly access the sparse x-vector components in the dot product computation. For traditional server systems, if the x-vector is large, this would result in irregular accesses (gathering) to memory or last level caches.

これに対処するために、処理要素の一実施例では、行列Ａを列ブロックに、ｘベクトルを（行列Ａの列ブロックにそれぞれ対応する）複数のサブセットに分割する。ブロックサイズは、ｘベクトルのサブセットがチップに合致できるように選択され得る。よって、それへのランダムなアクセスは、局在化されたオンチップであり得る。 To address this, one embodiment of a processing element divides matrix A into column blocks and the x vector into multiple subsets (each corresponding to a column block of matrix A). The block size can be chosen so that an x vector subset fits on-chip; random accesses to it can therefore be localized on-chip.
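
The column-blocking idea can be sketched in software as follows. This is an illustration only: the slice `x_block` stands in for the on-chip copy of an x subset, `block_size` is a hypothetical parameter, and the sketch filters each row per block for brevity, whereas a real design would presumably store A already partitioned into column blocks.

```python
def spmdv_csr_blocked(csr_rows, x, block_size):
    """Row-wise sparse matrix times dense vector, one column block at a time."""
    y = [0] * len(csr_rows)
    for start in range(0, len(x), block_size):
        # Only this subset of x is touched while the block is processed.
        x_block = x[start:start + block_size]
        for i, row in enumerate(csr_rows):
            for value, col in row:
                if start <= col < start + block_size:
                    y[i] += value * x_block[col - start]
    return y


A_csr = [[(3, 2)], [(1, 0), (2, 3)], [(4, 1)]]
print(spmdv_csr_blocked(A_csr, [1, 2, 3, 4], block_size=2))  # [9, 9, 8]
```

The result matches the unblocked computation; only the order of accesses to x changes, so that random accesses stay within one block at a time.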

Ｂ．列方向の疎行列・疎ベクトル乗算（ｓｐＭｓｐＶ＿ｃｓｃ） B. Column-wise sparse matrix-sparse vector multiplication (spMspV_csc)

疎ベクトルに対して疎行列を乗算するこのパターンは、ｓｐＭｄＶ＿ｃｓｒほど周知ではない。しかしながら、いくつかのＭＬアルゴリズムにおいて重要である。それは、アルゴリズムが特徴のセットに作用する場合に用いられ、データセット内の行列の列として表される（よって列方向の行列アクセスが必要になる）。 This pattern, which multiplies a sparse matrix against a sparse vector, is less well known than spMdV_csr. However, it is important in some ML algorithms. It is used when an algorithm operates on a set of features, which are represented as matrix columns in the dataset (hence the need for column-wise matrix accesses).

この計算パターンでは、行列Ａの各列が、読み出されて、ベクトルｘの対応する非ゼロ成分に対して乗算される。その結果は、ｙベクトルで保持される部分的なドット積を更新するために用いられる。ゼロ以外のｘベクトル成分と関連付けられたすべての列が処理されると、ｙベクトルは、最終的なドット積を含むことになる。 In this computation pattern, each column of matrix A is read and multiplied against the corresponding non-zero components of vector x. The result is used to update the partial dot product held in the y vector. Once all columns associated with non-zero x vector components have been processed, the y vector will contain the final dot product.
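
A minimal Python sketch of this column-wise pattern follows. It is not the pseudocode of Figure 73b; it assumes that each column of A is stored as a list of (value, row position) pairs and that the sparse vector x is given as (value, index) pairs for its non-zero elements.

```python
def spmspv_csc(csc_cols, x_sparse, n_rows):
    """Column-wise sparse matrix times sparse vector."""
    y = [0] * n_rows
    for x_value, col in x_sparse:
        # Only columns that match a non-zero x element are read.
        for a_value, row in csc_cols[col]:
            y[row] += a_value * x_value   # update a partial dot product
    return y


# A = [[0, 0, 3, 0], [1, 0, 0, 2], [0, 4, 0, 0]] stored column by column.
A_csc = [[(1, 1)], [(4, 2)], [(3, 0)], [(2, 1)]]
x_sparse = [(5, 0), (2, 3)]               # x = [5, 0, 0, 2]
print(spmspv_csc(A_csc, x_sparse, n_rows=3))  # [0, 9, 0]
```

The writes to `y[row]` are the irregular accesses: which element is updated depends on the row position stored with each matrix value.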

行列Ａへのアクセス（すなわち、Ａの列におけるストリーム）が正常である一方、部分的なドット積を更新するｙベクトルへのアクセスは不規則である。アクセスするｙ成分は、処理されるＡベクトル成分の行インデックスに依存する。これに対処するために、行列Ａは、行ブロックに分割され得る。その結果、ベクトルｙは、これらのブロックに対応するサブセットに分割され得る。この方式では、行列の行ブロックを処理する場合に、そのｙベクトルサブセットに不規則にアクセス（ギャザー／スキャッタ）することのみが必要である。適切にブロックサイズを選択することにより、ｙベクトルサブセットは、オンチップで保持され得る。 While accesses to matrix A are regular (i.e., streaming through the columns of A), accesses to the y vector to update the partial dot products are irregular. The y element to access depends on the row index of the A element being processed. To address this, matrix A can be divided into row blocks. Consequently, vector y can be divided into subsets corresponding to these blocks. This way, when a matrix row block is processed, only that block's y vector subset needs to be accessed irregularly (gather/scatter). By choosing the block size appropriately, the y vector subset can be kept on-chip.

Ｃ．スケール及び更新（ｓｃａｌｅ＿ｕｐｄａｔｅ） C. Scale and update (scale_update)

このパターンは、典型的には、行列内の各サンプルにスケーリングファクタを適用するＭＬアルゴリズムにより用いられ、それぞれが特徴（すなわち、Ａ内の列）に対応する、それらが重みのセットへと低減される。ここで、ｘベクトルはスケーリングファクタを含む。（ＣＳＲフォーマットにおける）行列Ａの各行に対して、その行に対するスケーリングファクタは、ｘベクトルから読み出され、次に、その行におけるＡの各成分に適用される。その結果は、ｙベクトルの成分を更新するために用いられる。すべて行が処理されると、ｙベクトルは、低減された重みを含むことになる。 This pattern is typically used by ML algorithms, which apply a scaling factor to each sample in a matrix, reducing them to a set of weights, each corresponding to a feature (i.e., a column in A), where the x vector contains the scaling factors. For each row of matrix A (in CSR format), the scaling factor for that row is read from the x vector and then applied to each component of A in that row. The result is used to update the components of the y vector. Once all rows have been processed, the y vector will contain the reduced weights.

前の計算パターンと同様に、ｙベクトルに対する不規則なアクセスは、ｙが大きい場合の性能に影響を与え得る。行列Ａを列ブロックに、ｙベクトルをこれらのブロックに対応する複数のサブセットに分割することは、各ｙサブセット内に不規則なアクセスを局所化するのに役立てることができる。 Similar to the previous computation pattern, irregular accesses to the y vector can impact performance when y is large. Splitting the matrix A into column blocks and the y vector into subsets corresponding to these blocks can help localize irregular accesses within each y subset.
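A minimal software sketch of the scale_update pattern is given below, assuming a CSR matrix and a dense x vector holding one scaling factor per row. The function and array names (`scale_update`, `row_ptr`, `col_idx`, `vals`) are hypothetical, and the column blocking discussed above is omitted for brevity; it would restrict the irregular `y[col_idx[k]]` updates to one y subset at a time.

```python
def scale_update(n_cols, row_ptr, col_idx, vals, x):
    """For each row of a CSR matrix A, scale the row's elements by x[row]
    and accumulate them into y, which holds one reduced weight per column."""
    y = [0.0] * n_cols
    for row in range(len(row_ptr) - 1):
        scale = x[row]                          # scaling factor for this sample (row)
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[col_idx[k]] += scale * vals[k]    # irregular update of y by column index
    return y


# Example matrix (hypothetical): [[1, 0, 2], [0, 3, 0], [4, 0, 5]] stored in CSR form.
weights = scale_update(3, [0, 2, 3, 5], [0, 2, 1, 0, 2],
                       [1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0])
```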

一実施例では、上述した計算パターンを効率的に実行できるハードウェアアクセラレータを含む。アクセラレータは、汎用プロセッサと統合され得るハードウェアＩＰブロックである。一実施例において、アクセラレータ６９００は、プロセッサと共有される相互接続を通じてメモリ６９５０に独立にアクセスして、計算パターンを実行する。それは、オフチップメモリ内に存在するいずれか任意の大規模行列データセットをサポートする。 In one embodiment, the system includes a hardware accelerator that can efficiently execute the computational patterns described above. The accelerator is a hardware IP block that can be integrated with a general-purpose processor. In one embodiment, the accelerator 6900 executes the computational patterns by independently accessing memory 6950 through an interconnect shared with the processor. It supports any arbitrarily large matrix data set residing in off-chip memory.

図７４は、データ管理ユニット６９０５及び処理要素６９０１－６９０２の一実施例に関する処理フローを示す。この実施例において、データ管理ユニット６９０５は、処理要素スケジューラ７４０１、読み出しバッファ７４０２、書き込みバッファ７４０３及び低減ユニット７４０４を含む。各ＰＥ６９０１－６９０２は、入力バッファ７４０５－７４０６、乗算器７４０７－７４０８、加算器７４０９－７４１０、ローカルＲＡＭ７４２１－７４２２、合計レジスタ（ｓｕｍｒｅｇｉｓｔｅｒ）７４１１－７４１２及び出力バッファ７４１３－７４１４を含む。 Figure 74 shows the process flow for one embodiment of the data management unit 6905 and processing elements 6901-6902. In this embodiment, the data management unit 6905 includes a processing element scheduler 7401, a read buffer 7402, a write buffer 7403, and a reduction unit 7404. Each PE 6901-6902 includes an input buffer 7405-7406, multipliers 7407-7408, adders 7409-7410, local RAMs 7421-7422, sum registers 7411-7412, and an output buffer 7413-7414.

アクセラレータは、いずれか任意の大規模行列データをサポートするために、上述した行列ブロッキングスキーム（すなわち、行及び列のブロック化）をサポートする。アクセラレータは、行列データのブロックを処理するように設計される。各ブロックは、ＰＥ６９０１－６９０２により並列に処理されるサブブロックにさらに分割される。 The accelerator supports the matrix blocking scheme described above (i.e., row and column blocking) to support any arbitrarily large matrix data. The accelerator is designed to process blocks of matrix data. Each block is further divided into sub-blocks that are processed in parallel by the PEs 6901-6902.

動作中、データ管理ユニット６９０５は、メモリサブシステムからその読み出しバッファ７４０２に行列の行又は列を読み出し、次に、処理するためにＰＥ６９０１－６９０２にわたってＰＥスケジューラ７４０１により動的に分配される。それはまた、その書き込みバッファ７４０３からメモリに結果を書き込む。 In operation, the Data Management Unit 6905 reads rows or columns of a matrix from the memory subsystem into its read buffer 7402, which are then dynamically distributed by the PE Scheduler 7401 across the PEs 6901-6902 for processing. It also writes the results from its write buffer 7403 to memory.

各ＰＥ６９０１－６９０２は、行列のサブブロックを処理する役割を担う。ＰＥは、ランダムにアクセスされる必要があるベクトル（すなわち、上記で説明されたようなｘ又はｙベクトルのサブセット）を格納するオンチップＲＡＭ７４２１－７４２２を含む。また、乗算器７４０７－７４０８及び加算器７４０９－７４１０を含む浮動小数点積和（ＦＭＡ）ユニットと、入力データから行列成分を抽出する入力バッファ７４０５－７４０６内のアンパック論理と、累算されたＦＭＡ結果を保持する合計レジスタ（ｓｕｍｒｅｇｉｓｔｅｒ）７４１１－７４１２と含む。 Each PE 6901-6902 is responsible for processing a subblock of the matrix. The PE contains on-chip RAM 7421-7422 to store vectors that need to be randomly accessed (i.e., a subset of the x or y vectors as described above). It also contains a floating-point multiply-accumulate (FMA) unit that contains multipliers 7407-7408 and adders 7409-7410, unpack logic in input buffers 7405-7406 that extracts matrix elements from the input data, and sum registers 7411-7412 that hold the accumulated FMA results.

アクセラレータの一実施例は、（１）不規則にアクセス（ギャザー／スキャッタ）されるデータをオンチップＰＥＲＡＭ７４２１－７４２２に配置し、（２）ＰＥが十分に利用されることを確保するためにハードウェアＰＥスケジューラ７４０１を利用し、（３）汎用プロセッサを用いる場合とは異なり、アクセラレータが、疎行列演算に不可欠なハードウェアリソースのみからなるので、最高の効率性を実現する。全体的には、アクセラレータは、それに提供される利用可能なメモリ帯域幅を性能へと効率的に変換する。 One embodiment of the accelerator (1) places irregularly accessed (gathered/scattered) data in on-chip PE RAMs 7421-7422, (2) utilizes a hardware PE scheduler 7401 to ensure that the PEs are fully utilized, and (3) achieves maximum efficiency because, unlike with general-purpose processors, the accelerator consists only of hardware resources essential for sparse matrix operations. Overall, the accelerator efficiently translates the available memory bandwidth provided to it into performance.

性能のスケーリングは、１つのアクセラレータブロックにおいて多くのＰＥを使用して、並列に複数の行列のサブブロックを処理することにより、及び／又は、より多くのアクセラレータブロック（それぞれがＰＥのセットを有する）を使用して、並列に複数の行列ブロックを処理することにより行われ得る。これらのオプションの組み合わせが以下で検討される。ＰＥ及び／又はアクセラレータブロックの数は、メモリ帯域幅に一致するように調整されるべきである。 Performance scaling can be done by using many PEs in an accelerator block to process multiple matrix sub-blocks in parallel, and/or by using more accelerator blocks (each with a set of PEs) to process multiple matrix blocks in parallel. Combinations of these options are considered below. The number of PEs and/or accelerator blocks should be adjusted to match the memory bandwidth.

アクセラレータ６９００の一実施例では、ソフトウェアライブラリを通じてプログラミングされ得る。そのようなライブラリは、メモリに行列データを準備し、計算に関する情報（例えば、計算タイプ、行列データに対するメモリポインタ）と共にアクセラレータ６９００内に制御レジスタを設定し、アクセラレータを開始させる。次に、アクセラレータは、メモリ内の行列データに独立にアクセスして、計算を実行し、消費するソフトウェアに関する結果をメモリに書き戻す。 One embodiment of the accelerator 6900 can be programmed through a software library. Such a library prepares the matrix data in memory, sets control registers in the accelerator 6900 with information about the computation (e.g., computation type, memory pointer to the matrix data), and starts the accelerator. The accelerator then independently accesses the matrix data in memory, performs the computation, and writes the results back to memory for the software to consume.

アクセラレータは、図７５ａ～図７５ｂに図示されるように、適切なデータパス構成に対してそのＰＥを設定することにより、様々な計算パターンを処理する。特に、図７５ａは、ｓｐＭｓｐＶ＿ｃｓｃ及びｓｃａｌｅ＿ｕｐｄａｔｅ演算に関する（点線を用いて）パスを強調表示し、図７５ｂは、ｓｐＭｄＶ＿ｃｓｒ演算に関するパスを示す。各計算パターンを実行するアクセラレータのオペレーションが以下に詳細に説明される。 The accelerator processes various computation patterns by configuring its PEs for the appropriate datapath configuration, as illustrated in Figures 75a-75b. In particular, Figure 75a highlights (with dotted lines) the paths for the spMspV_csc and scale_update operations, while Figure 75b shows the paths for the spMdV_csr operation. The operation of the accelerator to perform each computation pattern is described in detail below.

ｓｐＭｓｐＶ＿ｃｓｃに関して、ＤＭＵ６９０５により、最初のｙベクトルサブセットがＰＥのＲＡＭ７４２１にロードされる。次に、メモリからｘベクトル成分を読み出す。各ｘ成分に対して、ＤＭＵ６９０５は、メモリから対応する行列の列の成分をストリームして、これらをＰＥ６９０１に供給する。各行列成分は、ＰＥのＲＡＭ７４２１から読み出すことをｙ成分に指し示す値（Ａ．ｖａｌ）及びインデックス（Ａ．ｉｄｘ）を含む。ＤＭＵ６９０５はまた、積和（ＦＭＡ）ユニットによりＡ．ｖａｌに対して乗算されるｘベクトル成分（ｘ．ｖａｌ）も提供する。結果は、Ａ．ｉｄｘにより指し示されるＰＥのＲＡＭ内のｙ成分を更新するために用いられる。たとえ、本願のワークロードにより用いられなかったとしても、アクセラレータは、サブセットのみの代わりにすべての行列の列を処理することにより、密ｘベクトル（ｓｐＭｄＶ＿ｃｓｃ）に対して列のような乗算をサポートすることに留意する（ｘが密なので）。 For spMspV_csc, the DMU 6905 loads the initial y vector subset into the PE's RAM 7421. It then reads the x vector elements from memory. For each x element, the DMU 6905 streams the elements of the corresponding matrix column from memory and supplies them to the PE 6901. Each matrix element contains a value (A.val) and an index (A.idx) that points to the y element to read from the PE's RAM 7421. The DMU 6905 also provides the x vector element (x.val), which the multiply-add (FMA) unit multiplies by A.val. The result is used to update the y element in the PE's RAM pointed to by A.idx. Note that, although it is not used by the workloads considered here, the accelerator also supports column-wise multiplication against a dense x vector (spMdV_csc) by processing all matrix columns instead of only a subset (since x is dense).

ｓｃａｌｅ＿ｕｐｄａｔｅオペレーションは、ＣＳＣフォーマットの代わりにＣＳＲフォーマットに表される行列Ａの行をＤＭＵ６９０５が読み出すことを除き、ｓｐＭｓｐＶ＿ｃｓｃと同様である。ｓｐＭｄＶ＿ｃｓｒに関して、ｘベクトルのサブセットは、ＰＥのＲＡＭ７４２１にロードされる。ＤＭＵ６９０５は、メモリから行列の行成分（すなわち、｛Ａ.ｖａｌ、Ａ.ｉｄｘ｝のペア）にストリームする。Ａ．ｉｄｘは、ＲＡＭ７４２１から適切なｘベクトル成分を読み出すために用いられ、ＦＭＡによりＡ．ｖａｌに対して乗算される。結果は、合計レジスタ（ｓｕｍｒｅｇｉｓｔｅｒ）７４１２へと累算される。合計レジスタ（ｓｕｍｒｅｇｉｓｔｅｒ）は、ＤＭＵ６９０５により供給される行の終わりを示すマーカをＰＥが参照する度に、出力バッファに書き込まれる。このようにして、各ＰＥは、それが担う行のサブブロックに対する合計を生成する。行についての最終的な合計を生成するために、すべてのＰＥにより生成されたサブブロックの合計は、ＤＭＵ内の低減ユニット７４０４によりまとめて加えられる（図７４を参照）。最終的な合計は、出力バッファ７４１３－７４１４に書き込まれ、その結果、ＤＭＵ６９０５はそれをメモリに書き込む。 The scale_update operation is similar to spMspV_csc, except that the DMU 6905 reads the rows of matrix A represented in CSR format instead of CSC format. For spMdV_csr, a subset of the x vectors is loaded into the PE's RAM 7421. The DMU 6905 streams the row components of the matrix (i.e., {A.val, A.idx} pairs) from memory. A.idx is used to read the appropriate x vector component from RAM 7421 and multiplied against A.val by the FMA. The results are accumulated in the sum register 7412. The sum register is written to the output buffer each time the PE sees an end-of-row marker provided by the DMU 6905. In this way, each PE generates a sum for its sub-block of rows. To generate the final sum for the row, the subblock sums generated by all the PEs are added together by the reduction unit 7404 in the DMU (see FIG. 74). The final sum is written to output buffers 7413-7414, which the DMU 6905 then writes to memory.
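The spMdV_csr data flow described above (per-PE sum registers followed by a reduction step at each end of row) can be modeled in software as follows. This is a sketch under stated assumptions: the function name `spmdv_csr`, the even split of the x vector across `num_pes` processing elements by column range, and the array names are all introduced here for illustration and are not taken from the specification.

```python
def spmdv_csr(row_ptr, col_idx, vals, x, num_pes):
    """Compute y = A * x for a CSR matrix A and dense x.

    Each modeled PE holds a contiguous slice of x and accumulates a partial
    sum for the row elements whose column index falls in that slice. At the
    end of each row the partial sums are added together, mimicking the
    reduction unit.
    """
    n_cols = len(x)
    width = -(-n_cols // num_pes)               # ceiling division: x elements per PE
    y = []
    for row in range(len(row_ptr) - 1):
        sum_regs = [0.0] * num_pes              # one sum register per PE
        for k in range(row_ptr[row], row_ptr[row + 1]):
            pe = col_idx[k] // width            # PE whose RAM holds x[col_idx[k]]
            sum_regs[pe] += vals[k] * x[col_idx[k]]
        y.append(sum(sum_regs))                 # end-of-row marker: reduce partial sums
    return y


# Example matrix (hypothetical): [[1, 0, 2], [0, 3, 0], [4, 0, 5]] stored in CSR form.
y_out = spmdv_csr([0, 2, 3, 5], [0, 2, 1, 0, 2],
                  [1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 0.0, 1.0], num_pes=2)
```

The number of modeled PEs changes only how the work is split, not the result, which is the property the reduction unit is there to preserve.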

グラフデータ処理 Graph data processing

一実施例において、本明細書で説明されるアクセラレータアーキテクチャは、グラフデータを処理するように構成される。グラフ分析は、グラフとして表されるデータ間の関係に関する知識を抽出するために、グラフアルゴリズムに依存する。グラフデータの拡散（ソーシャルメディアなどのソースから）は、グラフ分析に対する強い要求及び幅広い利用をもたらしてきた。その結果、できるだけ効率的にグラフ分析をできるようにすることが非常に重要である。 In one embodiment, the accelerator architecture described herein is configured to process graph data. Graph analysis relies on graph algorithms to extract knowledge about relationships between data represented as graphs. The proliferation of graph data (from sources such as social media) has led to strong demand and widespread use of graph analysis. As a result, it is critical to be able to perform graph analysis as efficiently as possible.

この要求に対処するために、一実施例では、所与の入力グラフアルゴリズムにカスタマイズされるハードウェアアクセラレータアーキテクチャ「テンプレート」にユーザ定義型のグラフアルゴリズムを自動的にマッピングする。アクセラレータは、上記で説明されるアーキテクチャを有してよく、ＦＰＧＡ／ＡＳＩＣとして実装されてよく、それは最高の効率性で実行できる。要約すると、一実施例では以下の構成を含む。 To address this need, one embodiment automatically maps user-defined graph algorithms to hardware accelerator architecture "templates" that are customized for a given input graph algorithm. The accelerator may have the architecture described above and may be implemented as an FPGA/ASIC that can perform with the highest efficiency. In summary, one embodiment includes the following components:

（１）汎用の疎行列ベクトル乗算（ＧＳＰＭＶ）アクセラレータに基づくハードウェアアクセラレータアーキテクチャテンプレート。グラフアルゴリズムが行列演算として定式化されることができることが示されているので、任意のグラフアルゴリズムをサポートする。 (1) A hardware accelerator architecture template based on the Generic Sparse Matrix Vector Multiplication (GSPMV) accelerator. It supports arbitrary graph algorithms since it has been shown that graph algorithms can be formulated as matrix operations.

（２）アーキテクチャテンプレートに対して、広く用いられている「頂点主体」グラフプログラミング抽象化をマッピング及び調整する自動アプローチ。 (2) An automated approach to map and tune the widely used "vertex-centric" graph programming abstraction to the architecture template.

既存の疎行列乗算ハードウェアアクセラレータがあるが、それらは、グラフアルゴリズムのマッピングを可能にするカスタマイズ性をサポートしていない。 There are existing sparse matrix multiplication hardware accelerators, but they do not support the customizability that would allow for mapping of graph algorithms.

設計フレームワークの一実施例では、以下のように動作する。 One embodiment of the design framework operates as follows:

（１）ユーザは、頂点－中心のグラフプログラミング抽象化に従う「頂点プログラム」としてグラフアルゴリズムを規定する。この抽象化は、その人気に起因して、ここでは例として選択される。頂点プログラムは、ハードウェアの詳細をさらすことはなく、そのため、ハードウェアの専門的知識（例えば、データ科学者）のないユーザでも、それを作成できる。 (1) A user specifies a graph algorithm as a "vertex program" that follows a vertex-centric graph programming abstraction. This abstraction is chosen as an example here due to its popularity. Vertex programs do not expose hardware details, so even users without hardware expertise (e.g., data scientists) can create them.

（２）（１）のグラフアルゴリズムと共に、フレームワークの一実施例では、以下の入力を受け入れる。 (2) Along with the graph algorithm in (1), one embodiment of the framework accepts the following inputs:

ａ．生成させるターゲットハードウェアアクセラレータのパラメータ（例えば、オンチップＲＡＭの最大量）。これらのパラメータは、ユーザにより提供されてよい、又は、既存のシステム（例えば、特定のＦＰＧＡボード）をターゲットとする場合、既知のパラメータの既存のライブラリから取得されてよい。 a. Parameters for the target hardware accelerator to be generated (e.g., maximum amount of on-chip RAM). These parameters may be provided by the user, or may be taken from an existing library of known parameters if targeting an existing system (e.g., a specific FPGA board).

ｂ．設計最適化目標（例えば、最大性能、最小エリア） b. Design optimization objectives (e.g., maximum performance, minimum area)

ｃ．ターゲットグラフデータの特性（例えば、グラフのタイプ）又はグラフデータ自体。これは選択的であり、自動チューニングを補助するために用いられる。 c. Characteristics of the target graph data (e.g., type of graph) or the graph data itself. This input is optional and is used to aid auto-tuning.

（３）上記の入力を前提として、フレームワークの一実施例では、自動チューニングを実行して、入力グラフアルゴリズムを最適化するために、ハードウェアテンプレートに適用するカスタマイズのセットを判断し、これらのパラメータをアーキテクチャテンプレート上にマッピングして、合成可能なＲＴＬにアクセラレータインスタンスを生成し、入力グラフアルゴリズム仕様から導き出される機能及び性能のソフトウェアモデルに対する、生成したＲＴＬの機能及び性能の検証を行う。 (3) Given the above inputs, one embodiment of the framework performs auto-tuning to determine a set of customizations to apply to the hardware template to optimize the input graph algorithm, maps these parameters onto the architecture template, generates an accelerator instance in synthesizable RTL, and performs functionality and performance validation of the generated RTL against a software model of functionality and performance derived from the input graph algorithm specification.

一実施例において、上記で説明されるアクセラレータアーキテクチャは、（１）それをカスタマイズ可能なハードウェアテンプレートにすること、（２）頂点プログラムにより必要とされる機能をサポートすることにより、頂点プログラムの実行をサポートするように拡張される。このテンプレートに基づいて、設計フレームワークは、ユーザ供給型の頂点プログラムをハードウェアテンプレートにマッピングして、頂点プログラムに対して最適化された合成可能なＲＴＬ（例えば、Ｖｅｒｉｌｏｇ）の実装インスタンスを生成するために説明される。フレームワークはまた、生成したＲＴＬが訂正及び最適化されることを確保するために、自動検証及びチューニングを実行する。このフレームワークに関しては、複数の使用事例がある。例えば、生成された合成可能なＲＴＬは、所与の頂点プログラムを効率的に実行するために、ＦＰＧＡプラットフォーム（例えば、Ｘｅｏｎ－ＦＰＧＡ）に配置され得る。又は、それは、ＡＳＩＣの実装を生成するように、さらに改良され得る。 In one embodiment, the accelerator architecture described above is extended to support vertex program execution by (1) making it a customizable hardware template and (2) supporting the functionality required by the vertex program. Based on this template, a design framework is described to map a user-supplied vertex program to the hardware template to generate an optimized synthesizable RTL (e.g., Verilog) implementation instance for the vertex program. The framework also performs automatic verification and tuning to ensure that the generated RTL is correct and optimized. There are multiple use cases for this framework. For example, the generated synthesizable RTL can be placed on an FPGA platform (e.g., Xeon-FPGA) to efficiently execute a given vertex program. Or, it can be further improved to generate an ASIC implementation.

グラフは、隣接行列として表されることができ、グラフ処理は、疎行列演算として定式化されることができる。図７６ａ～図７６ｂは、隣接行列としてのグラフを表す例を示す。行列内のそれぞれのゼロ以外は、グラフ内の２つのノード間のエッジを表す。例えば、０行２列における１は、ノードＡからＣのエッジを表す。 Graphs can be represented as adjacency matrices, and graph processing can be formulated as sparse matrix operations. Figures 76a-76b show an example of representing a graph as an adjacency matrix. Each non-zero in the matrix represents an edge between two nodes in the graph. For example, a 1 in row 0, column 2 represents an edge from node A to node C.

グラフデータの計算を表現するための最もポピュラーなモデルの１つは、頂点プログラミングモデルである。一実施例では、汎用の疎行列ベクトル乗算（ＧＳＰＭＶ）として、頂点プログラムを定式化するグラフマットソフトウェアフレームワークからの頂点プログラミングモデルの変形をサポートする。図７６ｃに示されるように、頂点プログラムは、（プログラムコードの最上部に示されるような）グラフ内のエッジ／頂点と関連付けられた複数のタイプのデータ（ｅデータ／ｖデータ）、グラフ内の頂点を介して送信されるメッセージ（ｍデータ）、及び、一時的なデータ（ｔデータ）、並びに、（プログラムコードの下部に示さるような）グラフデータを読み出して更新する予め定義されたＡＰＩを用いるステートレスなユーザ定義型の計算機能からなる。 One of the most popular models for expressing computations on graph data is the vertex programming model. In one embodiment, we support a variant of the vertex programming model from the GraphMat software framework that formulates vertex programs as generic sparse matrix vector multiplication (GSPMV). As shown in Figure 76c, a vertex program consists of multiple types of data (e-data/v-data) associated with edges/vertices in the graph (as shown at the top of the program code), messages (m-data) and temporary data (t-data) sent through the vertices in the graph, and stateless user-defined computation functions with predefined APIs to read and update the graph data (as shown at the bottom of the program code).

図７６ｄは、頂点プログラムを実行するための例示的なプログラムコードを示す。エッジデータは、（図７６ｂに示すように）隣接行列Ａとして、頂点データをベクトルｙとして、メッセージを疎ベクトルｘとして表される。図７６ｅは、ＧＳＰＭＶの策定を示し、ＳＰＭＶにおける乗算（）及び加算（）演算は、ユーザ定義型のＰＲＯＣＥＳＳ＿ＭＳＧ（）及びＲＥＤＵＣＥ（）により一般化される。 Figure 76d shows example program code for executing a vertex program. Edge data is represented as an adjacency matrix A (as shown in Figure 76b), vertex data as a vector y, and messages as a sparse vector x. Figure 76e shows the GSPMV formulation, in which the multiply() and add() operations of SPMV are generalized by the user-defined PROCESS_MSG() and REDUCE() functions.

ここでの１つの見解は、頂点プログラムを実行するのに必要とされるＧＳＰＭＶの変形が、疎ベクトルｘ（すなわち、メッセージ）に対する疎行列Ａ（すなわち、隣接行列）の列方向乗算を実行して、出力ベクトルｙ（すなわち、頂点データ）を生成することである。この演算は、（上記のアクセラレータに関して前述した）ｃｏｌ＿ｓｐＭｓｐＶと称される。 One observation here is that the GSPMV variant needed to execute a vertex program performs a column-wise multiplication of a sparse matrix A (i.e., the adjacency matrix) against a sparse vector x (i.e., the messages) to produce an output vector y (i.e., the vertex data). This operation is referred to as col_spMspV (described earlier with respect to the accelerator above).
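The generalization from SPMV to GSPMV can be illustrated with a short software sketch. The code below is an assumption-laden model, not the code of Figure 76e: the function name `gspmv`, its argument names, the dictionary form of the message vector, and the use of `None` to mark a y element that received no message are all hypothetical. Passing ordinary multiplication and addition recovers column-wise spMspV; passing other functions (here addition and `min`) yields a different graph computation over the same data path.

```python
import operator


def gspmv(n_rows, col_ptr, row_idx, edata, messages, process_msg, reduce_fn):
    """Column-wise generalized sparse matrix-vector multiply.

    For each message x[col], walk column `col` of the CSC matrix, combine the
    message with each edge value via process_msg, and fold the result into
    y[row] via reduce_fn.
    """
    y = [None] * n_rows                          # None: no message reached this row
    for col, msg in messages.items():
        for k in range(col_ptr[col], col_ptr[col + 1]):
            r = row_idx[k]
            t = process_msg(msg, edata[k])       # generalizes multiply()
            y[r] = t if y[r] is None else reduce_fn(y[r], t)  # generalizes add()
    return y


# Example matrix (hypothetical): [[1, 0, 2], [0, 3, 0], [4, 0, 5]] stored in CSC form.
COLS, ROWS, EDGES = [0, 2, 3, 5], [0, 2, 1, 0, 2], [1.0, 4.0, 3.0, 2.0, 5.0]
plain = gspmv(3, COLS, ROWS, EDGES, {0: 2.0, 2: 1.0}, operator.mul, operator.add)
shortest = gspmv(3, COLS, ROWS, EDGES, {0: 2.0, 2: 0.5}, operator.add, min)
```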

設計フレームワークテンプレートマッピングコンポーネント７７１１、検証コンポーネント７７１２及び自動チューニングコンポーネント７７１３を含むフレームワークの一実施例が図７７に示される。その材料は、ユーザ規定型の頂点プログラム７７０１、設計最適化目標７７０３（例えば、最大性能、最小エリア）及びターゲットハードウェア設計制約７７０２（例えば、オンチップＲＡＭの最大量、メモリインタフェース幅）である。自動チューニングを補助する選択的な材料として、フレームワークは、グラフデータ特性７７０４（例えば、タイプ＝自然グラフ）又はサンプリンググラフデータも許容する。 One embodiment of the design framework is shown in FIG. 77 and includes a template mapping component 7711, a validation component 7712, and an auto-tuning component 7713. Its inputs are a user-specified vertex program 7701, design optimization goals 7703 (e.g., maximum performance, minimum area), and target hardware design constraints 7702 (e.g., maximum amount of on-chip RAM, memory interface width). As an optional input to aid auto-tuning, the framework also accepts graph data properties 7704 (e.g., type=natural graph) or sample graph data.

これらの材料を前提として、フレームワークのテンプレートマッピングコンポーネント７７１１は、入力頂点プログラムをハードウェアアクセラレータアーキテクチャテンプレートにマッピングし、頂点プログラム７７０１を実行するために最適化されたアクセラレータインスタンスのＲＴＬ実装７７０５を生成する。自動チューニングコンポーネント７７１３は、自動チューニング７７１３を実行して、所与の設計目標のために、生成したＲＴＬを最適化しつつ、ハードウェア設計制約を満たす。さらに、検証コンポーネント７７１２は、当該材料から導き出された機能及び性能モデルに対して生成したＲＴＬを自動的に検証する。検証テストベンチ７７０６及びチューニング報告７７０７は、ＲＴＬと共に生成される。 Given these inputs, the template mapping component 7711 of the framework maps the input vertex program to the hardware accelerator architecture template and generates an RTL implementation 7705 of an accelerator instance optimized for executing the vertex program 7701. The auto-tuning component 7713 performs auto-tuning to optimize the generated RTL for the given design goals while meeting the hardware design constraints. In addition, the validation component 7712 automatically validates the generated RTL against functional and performance models derived from the inputs. A validation testbench 7706 and a tuning report 7707 are generated along with the RTL.

汎用の疎行列ベクトル乗算（ＧＳＰＭＶ）ハードウェアアーキテクチャテンプレート Generic sparse matrix vector multiplication (GSPMV) hardware architecture template

ＧＳＰＭＶに関するアーキテクチャテンプレートの一実施例が図７７に示され、それは、上記で説明したアクセラレータアーキテクチャに基づいている（例えば、図７４及び関連する文章を参照）。図７７に示されるコンポーネントの多くがカスタマイズ可能である。一実施例において、頂点プログラムの実行をサポートするアーキテクチャは、以下のように拡張されている。 One embodiment of an architecture template for GSPMV is shown in FIG. 77, which is based on the accelerator architecture described above (see, e.g., FIG. 74 and related text). Many of the components shown in FIG. 77 are customizable. In one embodiment, the architecture to support vertex program execution is extended as follows:

図７８に示されるように、カスタマイズ可能な論理ブロックが、頂点プログラムにより必要とされるＰＲＯＣＥＳＳ＿ＭＳＧ（）７８１０、ＲＥＤＵＣＥ（）７８１１、ＡＰＰＬＹ（）７８１２及びＳＥＮＤ＿ＭＳＧ（）７８１３をサポートするために、各ＰＥ内に提供される。さらに、一実施例では、ユーザ定義型のグラフデータ（すなわち、ｖデータ、ｅデータ、ｍデータ、ｔデータ）をサポートするカスタマイズ可能なオンチップストレージ構造及びパック／アンパック論理７８０５を提供する。図示されるデータ管理ユニット６９０５は、ＰＥスケジューラ７４０１（上記で説明したようなＰＥをスケジューリングするためのもの）、補助バッファ７８０１（アクティブな列、ｘデータを格納するためのもの）、読み出しバッファ７４０２、システムメモリへのアクセスを制御するためのメモリコントローラ７８０３、及び、書き込みバッファ７４０３を含む。さらに、図７８に示される実施例では、古い及び新しいｖデータ及びｔデータがローカルＰＥメモリ７４２１内に格納されている。様々な制御ステートマシンは、頂点プログラムを実行すること、図７６ｄ及び図７６ｅにおけるアルゴリズムにより規定される機能に対する不変性をサポートするために修正されてよい。 As shown in FIG. 78, customizable logic blocks are provided within each PE to support the PROCESS_MSG() 7810, REDUCE() 7811, APPLY() 7812, and SEND_MSG() 7813 functions required by the vertex program. In addition, one embodiment provides customizable on-chip storage structures and pack/unpack logic 7805 to support user-defined graph data (i.e., v-data, e-data, m-data, t-data). The illustrated data management unit 6905 includes a PE scheduler 7401 (for scheduling PEs as described above), an auxiliary buffer 7801 (for storing the active columns and x data), a read buffer 7402, a memory controller 7803 for controlling access to system memory, and a write buffer 7403. In the embodiment shown in FIG. 78, both old and new v-data and t-data are stored in the local PE memory 7421. The various control state machines may be modified to support executing vertex programs while abiding by the functionality specified by the algorithms in Figures 76d and 76e.

各アクセラレータタイルの処理が図７９に要約されている。７９０１において、ｙベクトル（ｖデータ）がＰＥＲＡＭ７４２１にロードされる。７９０２において、ｘベクトル及び列ポインタが補助バッファ７８０１にロードされる。７９０３において、ｘベクトル成分ごとに、列は（ｅデータ）にストリーミングされ、ＰＥは、ＰＲＯＣ＿ＭＳＧ（）７８１０及びＲＥＤＵＣＥ（）７８１１を実行する。７９０４において、ＰＥは、ＡＰＰＬＹ（）７８１２を実行する。７９０５において、ＰＥは、ＳＥＮＤ＿ＭＳＧ（）７８１３を実行してメッセージを生成し、データ管理ユニット６９０５は、これらをｘベクトルとしてメモリに書き込む。７９０６において、データ管理ユニット６９０５は、ＰＥＲＡＭ７４２１に格納された、更新されたｙベクトル（ｖデータ）をメモリに書き戻す。上記の技術は、図７６ｄ及び図７６ｅに示される頂点プログラム実行アルゴリズムに適合する。性能を高めるために、アーキテクチャは、設計において、タイル内のＰＥの数及び／又はタイルの数を増加させることを可能にする。この方式では、アーキテクチャは、（すなわち、（隣接行列の、又は、各サブグラフ内のブロックにわたる）サブグラフにわたる）グラフの並列処理の複数のレベルを利用する。図８０ａ内の表は、テンプレートの一実施例についてのカスタマイズ可能なパラメータを要約したものである。最適化用のタイル（例えば、別のタイルより多くのＰＥを有する１つのタイル）にわたって非対称なパラメータを割り当てることも可能である。 The processing of each accelerator tile is summarized in Figure 79. At 7901, the y vector (v-data) is loaded into the PE RAM 7421. At 7902, the x vector and column pointers are loaded into the auxiliary buffer 7801. At 7903, for each x vector element, the corresponding column of A (e-data) is streamed in, and the PEs execute PROC_MSG() 7810 and REDUCE() 7811. At 7904, the PEs execute APPLY() 7812. At 7905, the PEs execute SEND_MSG() 7813 to generate messages, and the data management unit 6905 writes them to memory as x vectors. At 7906, the data management unit 6905 writes the updated y vector (v-data) stored in the PE RAM 7421 back to memory. These steps conform to the vertex program execution algorithm shown in Figures 76d and 76e. To scale up performance, the architecture allows a design to increase the number of PEs in a tile and/or the number of tiles. In this way, the architecture can exploit multiple levels of parallelism in the graph (i.e., across subgraphs (across blocks of the adjacency matrix) or within each subgraph). The table in Figure 80a summarizes the customizable parameters for one embodiment of the template. It is also possible to assign asymmetric parameters across tiles for optimization (e.g., one tile with more PEs than another tile).
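The per-tile steps 7901 through 7906 can be mirrored in a small software model. Everything in this sketch is an illustrative assumption rather than the contents of Figures 76d, 76e, or 79: the function names, the convention that a matrix column is a source vertex and a row is a destination vertex, the rule that only vertices whose data changed send a message, and the single-source shortest-path example used to drive it.

```python
import math


def run_tile_iteration(col_ptr, row_idx, edata, vdata, x_msgs,
                       process_msg, reduce_fn, apply_fn, send_msg):
    pe_ram = list(vdata)                         # 7901: load y vector (v-data) into PE RAM
    aux = dict(x_msgs)                           # 7902: load x vector into the auxiliary buffer
    tdata = {}                                   # temporary data (t-data) per destination vertex
    for col, msg in aux.items():                 # 7903: stream each active column (e-data)
        for k in range(col_ptr[col], col_ptr[col + 1]):
            r = row_idx[k]
            t = process_msg(msg, edata[k])
            tdata[r] = t if r not in tdata else reduce_fn(tdata[r], t)
    new_msgs = {}
    for r, t in tdata.items():
        new_v = apply_fn(pe_ram[r], t)           # 7904: APPLY
        if new_v != pe_ram[r]:                   # assumption: only changed vertices send
            pe_ram[r] = new_v
            new_msgs[r] = send_msg(new_v)        # 7905: SEND_MSG builds the next x vector
    return pe_ram, new_msgs                      # 7906: updated v-data is written back


def sssp(col_ptr, row_idx, weights, n_vertices, source):
    """Single-source shortest paths (non-negative weights) as a vertex program."""
    dist = [math.inf] * n_vertices
    dist[source] = 0.0
    msgs = {source: 0.0}
    while msgs:                                  # iterate until no vertex sends a message
        dist, msgs = run_tile_iteration(
            col_ptr, row_idx, weights, dist, msgs,
            process_msg=lambda m, w: m + w, reduce_fn=min,
            apply_fn=min, send_msg=lambda v: v)
    return dist


# Hypothetical graph: A->B (weight 1), A->C (weight 4), B->C (weight 2).
distances = sssp([0, 2, 3, 3], [1, 2, 2], [1.0, 4.0, 2.0], 3, 0)
```

Swapping the four user functions changes the algorithm while the loop structure stays fixed, which is the property that lets a single hardware template serve many vertex programs.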

自動マッピング、検証及びチューニング Automatic mapping, validation and tuning

チューニング。入力に基づいて、フレームワークの一実施例では、入力頂点プログラム及び（選択的に）グラフデータに対してハードウェアアーキテクチャテンプレートを最適化するために、それをカスタマイズするために用いるように最良な設計パラメータを判断する自動チューニングを実行する。多くのチューニング検討事項があり、それらは図８０ｂ内のテーブルに要約されている。図示されるように、これらは、データの局所性、グラフデータサイズ、グラフ計算機能、グラフデータ構造、グラフデータアクセス属性、グラフデータタイプ及びグラフデータパターンを含む。 Tuning. Based on the inputs, one embodiment of the framework performs auto-tuning to determine the best design parameters with which to customize the hardware architecture template, so that it is optimized for the input vertex program and (optionally) the graph data. There are many tuning considerations, which are summarized in the table in Figure 80b. As shown, these include data locality, graph data size, graph computation functions, graph data structure, graph data access attributes, graph data types, and graph data patterns.

テンプレートマッピング。このフェーズでは、フレームワークは、チューニングフェーズにより判断されたテンプレートパラメータを取得し、テンプレートのカスタマイズ可能な部分において「フィル」することによりアクセラレータインスタンスを生成する。ユーザ定義型の計算機能（例えば、図７６ｃ）は、既存の高位合成（ＨＬＳ）ツールを用いて、入力仕様から適切なＰＥ計算ブロックにマッピングされてよい。ストレージ構造（例えば、ＲＡＭ、バッファ、キャッシュ）及びメモリンタフェースは、これらの対応する設計パラメータを用いてインスタンス化される。パック／アンパック論理は、データタイプ仕様（例えば、図７６ａ）から自動的に生成されてよい。制御有限ステートマシン（ＦＳＭ）の一部はまた、提供された設計パラメータ（例えば、ＰＥスケジューリングスキーム）に基づいて生成される。 Template Mapping. In this phase, the framework takes the template parameters determined by the tuning phase and generates the accelerator instance by "filling" in the customizable parts of the template. User-defined compute functions (e.g., Fig. 76c) may be mapped from the input specification to the appropriate PE compute blocks using existing high-level synthesis (HLS) tools. Storage structures (e.g., RAM, buffers, cache) and memory interfaces are instantiated using their corresponding design parameters. Pack/unpack logic may be automatically generated from the data type specification (e.g., Fig. 76a). Parts of the control finite state machine (FSM) are also generated based on the provided design parameters (e.g., PE scheduling scheme).

検証。一実施例において、テンプレートマッピングにより生成されたアクセラレータアーキテクチャインスタンス（合成可能なＲＴＬ）は、次に、自動的に検証される。これをするために、フレームワークの一実施例では、「ゴールデン」リファレンスとして用いられる頂点プログラムの関数型モデルを導出する。テストベンチは、アーキテクチャインスタンスのＲＴＬ実装のシミュレーションに対して、このゴールデンリファレンスの実行を比較するために生成される。フレームワークはまた、解析性能モデル及びサイクルが正確なソフトウェアシミュレータに対して、ＲＴＬシミュレーションを比較することにより、性能検証を実行する。それは、ランタイムの内訳を報告し、性能に影響を与える設計のボトルネックを特定する。 Verification. In one embodiment, the accelerator architecture instance (synthesizable RTL) generated by template mapping is then automatically verified. To do this, one embodiment of the framework derives a functional model of the vertex program that is used as a "golden" reference. A testbench is generated to compare the execution of this golden reference against a simulation of an RTL implementation of the architecture instance. The framework also performs performance verification by comparing the RTL simulation against an analytical performance model and a cycle-accurate software simulator. It reports a runtime breakdown and identifies design bottlenecks that impact performance.

疎データセットの計算－ほとんどの値がゼロであるベクトル又は行列－は、ますます増加する数の商業的に重要なアプリケーションにとって重大であるが、典型的には、今日のＣＰＵ上で実行した場合、ピークパフォーマンスのわずか数パーセントしか実現していない。科学コンピューティング分野において、疎行列計算は、数十年間、線形ソルバの重要なカーネルであった。近年では、機械学習及びグラフ分析の爆発的な成長が疎計算を主流へと移動させてきた。疎行列計算は、多くの機械学習アプリケーションの中核をなし、多くのグラフアルゴリズムのコアを形成する。 Computations on sparse data sets -- vectors or matrices where most of the values are zero -- are critical to an ever-increasing number of commercially important applications, yet typically achieve only a few percent of peak performance when run on today's CPUs. In scientific computing, sparse matrix computations have been the key kernel of linear solvers for decades. In recent years, the explosive growth of machine learning and graph analysis has moved sparse computations into the mainstream. Sparse matrix computations are at the heart of many machine learning applications and form the core of many graph algorithms.

疎行列計算は、計算上制限されるよりもむしろメモリ帯域幅上制限される傾向があり、それが、ＣＰＵの変更によってこれらの性能を向上させることを困難にしている。それらは、行列データ要素毎に演算をほとんど実行しておらず、多くの場合、任意のデータを再利用する前に行列全体にわたって反復するため、キャッシュを役立たせていない。さらに、多くの疎行列アルゴリズムは、かなりの数のデータに依存したギャザー及びスキャッタ、例えば、疎行列－ベクトル乗算において得られる、ｒｅｓｕｌｔ［ｒｏｗ］＋＝ｍａｔｒｉｘ［ｒｏｗ］［ｉ］.ｖａｌｕｅ*ｖｅｃｔｏｒ［ｍａｔｒｉｘ［ｒｏｗ］［ｉ］.ｉｎｄｅｘ］演算、を含み、それらは、プリフェッチャの有効性を予測し低減することが難しい。 Sparse matrix computations tend to be memory bandwidth-limited rather than computationally-limited, which makes it difficult to improve their performance with CPU changes. They perform few operations per matrix data element and often iterate over the entire matrix before reusing any data, thus not benefiting from caches. Furthermore, many sparse matrix algorithms contain a significant number of data-dependent gathers and scatters, e.g., the result[row]+=matrix[row][i].value*vector[matrix[row][i].index] operations that occur in sparse matrix-vector multiplication, which are difficult to predict and reduce the effectiveness of prefetchers.

従来のマイクロプロセッサより良好な疎行列の性能を実現させるためには、システムは、現在のＣＰＵよりもかなり高いメモリ帯域幅及び非常にエネルギー効率の良いコンピューティングアーキテクチャを提供しなければならない。メモリ帯域幅を増加させることで、性能を向上させることができるが、ＤＲＡＭアクセスの高いエネルギー／ビットコストが、その帯域幅を処理するために利用可能な電力量を制限する。エネルギー効率の良い計算アーキテクチャなしでは、システムは、そのパワーバジェットを超えることなく、高帯域幅のメモリシステムからのデータを処理することができない状況に置かれるかもしれない。 To achieve better sparse matrix performance than conventional microprocessors, the system must provide significantly higher memory bandwidth than current CPUs and a very energy-efficient computing architecture. Increasing memory bandwidth can improve performance, but the high energy/bit cost of DRAM access limits the amount of power available to process that bandwidth. Without an energy-efficient computing architecture, a system may be placed in a situation where it is unable to process data from a high-bandwidth memory system without exceeding its power budget.

一実施例では、スタック型ＤＲＡＭを用いて、エネルギー効率の良い方式でその帯域幅を処理するために、疎行列アルゴリズムがカスタム計算アーキテクチャと組み合わせられる必要がある帯域幅を提供する疎行列計算のためのアクセラレータを有する。 One embodiment comprises an accelerator for sparse matrix computations that uses stacked DRAM to provide the bandwidth that sparse matrix algorithms require, combined with a custom compute architecture that processes that bandwidth in an energy-efficient manner.

疎行列概要 Sparse matrix overview

多くのアプリケーションは、値の大部分がゼロであるデータセットを作成する。有限要素方法は、各ポイントの状態がメッシュ内のそれに近いポイントの状態の関数であるポイントのメッシュとしてオブジェクトをモデル化する。数学的には、これは、各行が１つのポイントの状態を表現し、行が表現するポイントの状態に直接的には影響を与えないポイントのすべてに対して行の値がゼロである行列として表される連立方程式になる。グラフは、隣接行列として表されることができ、行列内の各成分｛ｉ,ｊ｝は、グラフ内の頂点ｉとｊとの間のエッジについての重みを与える。多くの頂点は、グラフ内の他の頂点のごく一部だけを結びつけるので、隣接行列内の成分の大部分はゼロである。機械学習において、モデルは、典型的には、多くのサンプルからなるデータセットを用いてトレーニングされ、それぞれが、特徴のセット（システム又はオブジェクトの状態についての見解）及びその特徴のセットのモデルについての所望の出力を含む。サンプルの多くで、可能な機能の小さなサブセットだけを含めることはよくあることであり、例えば、機能がドキュメント内に存在し得る様々なワードを表す場合、値のほとんどがゼロであるデータセットを再び作成している。 Many applications create data sets in which the vast majority of the values are zero. Finite element methods model objects as a mesh of points, where the state of each point is a function of the states of the points near it in the mesh. Mathematically, this becomes a system of equations represented as a matrix in which each row describes the state of one point, and the values in the row are zero for all of the points that do not directly affect the state of the point the row describes. A graph can be represented as an adjacency matrix, where each element {i,j} of the matrix gives the weight of the edge between vertices i and j in the graph. Because most vertices connect to only a small fraction of the other vertices in the graph, the vast majority of the elements in the adjacency matrix are zero. In machine learning, models are typically trained using data sets that consist of many samples, each of which contains a set of features (observations of the state of a system or object) and the desired output of the model for that set of features. It is very common for most samples to contain only a small subset of the possible features, for example when the features represent the different words that might be present in a document, which again creates a data set in which most of the values are zero.

値のほとんどがゼロであるデータセットは、「疎（ｓｐａｒｓｅ）」として説明され、それは、これらの要素のうちの１％より少ない要素においてゼロ以外の値を有する、疎データセットが極めて疎であることはよくあることである。これらのデータセットは、多くの場合、行列として表され、行列内のゼロ以外の成分の値だけを規定するデータ構造を用いる。これは、各ゼロ以外の成分を表すのに必要とされる空間の量を増やす一方で、成分の位置及びその値の両方を規定する必要があるので、行列が十分に疎である場合、全体的な空間（メモリ）の節約はかなりのものとなる。例えば、疎行列の最も単純表現のうちの１つは、座標リスト（ＣＯＯ）表現であり、ゼロ以外のそれぞれは、｛行インデックス、列インデックス、値｝のタプルにより規定される。これは、ゼロ以外の値ごとに必要とされる記憶量を３倍にする一方、たった１％の行列内の成分がゼロ以外の値を有する場合、ＣＯＯ表現は、密な表現（行列内の各成分の値を表すもの）が取るであろう空間のたった３％しか引き上げない。 Data sets in which most of the values are zero are described as "sparse," and it is very common for sparse data sets to be extremely sparse, with non-zero values in fewer than 1% of their elements. These data sets are often represented as matrices, using data structures that specify only the values of the non-zero elements in the matrix. Although this increases the amount of space required to represent each non-zero element, since both the element's location and its value must be specified, the overall space (memory) savings can be substantial if the matrix is sparse enough. For example, one of the simplest representations of a sparse matrix is the coordinate list (COO) representation, in which each non-zero is specified by a {row index, column index, value} tuple. While this triples the amount of storage required for each non-zero value, if only 1% of the elements in a matrix have non-zero values, the COO representation takes up only 3% of the space that a dense representation (one that represents the value of every element in the matrix) would take.
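The COO representation and the 3% figure above can be checked with a short sketch. The helper names `to_coo` and `coo_storage_ratio` are introduced here for illustration, and the ratio assumes, as the text implicitly does, that a row index, a column index, and a value each occupy the same amount of storage.

```python
def to_coo(dense):
    """Return the non-zero elements of a dense matrix as (row, column, value) tuples."""
    return [(i, j, v)
            for i, row in enumerate(dense)
            for j, v in enumerate(row)
            if v != 0]


def coo_storage_ratio(n_rows, n_cols, nnz):
    """Storage of a COO representation relative to a dense one,
    counting every index and every value as one storage word."""
    return (3 * nnz) / (n_rows * n_cols)


coo = to_coo([[1, 0, 2], [0, 3, 0], [4, 0, 5]])
ratio = coo_storage_ratio(100, 100, 100)   # 1% of a 100 x 100 matrix is non-zero
```

With 100 non-zeros in a 100 x 100 matrix, COO stores 300 words against 10,000 for the dense form, which gives the 3% ratio stated above.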

Figure 81 illustrates one of the most common sparse matrix formats, the compressed row storage (CRS, sometimes abbreviated CSR) format. In CRS format, the matrix 8100 is described by three arrays: a values array 8101, which contains the values of the nonzero elements; an indices array 8102, which specifies the position of each nonzero element within its row of the matrix; and a row starts array 8103, which specifies where each row of the matrix starts in the lists of indices and values. Thus, the first nonzero element of the second row of the example matrix can be found at position 2 in the indices and values arrays and is described by the tuple {0, 7}, indicating that the element occurs at position 0 within the row and has value 7. Other commonly used sparse matrix formats include compressed sparse column (CSC), which is the column-major dual of CRS, and ELLPACK, which represents each row of the matrix as a fixed-width list of nonzero values and their indices, padding with explicit zeros when a row has fewer nonzero elements than the longest row in the matrix.
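The three CRS arrays can be built from a dense matrix as in the following sketch. The matrix is hypothetical (the contents of Figure 81 are not reproduced here); it is chosen only so that, as in the text, the second row begins at position 2 of the index and value arrays with the tuple {0, 7}. Appending a final end-of-data entry to the row-start array is a common convention assumed here, not something stated in the text.

```python
def dense_to_crs(dense):
    """Convert a dense matrix (list of rows) into CRS arrays.

    Returns (values, col_index, row_start). row_start has one extra
    trailing entry equal to the number of nonzeros (assumed convention),
    so row r occupies positions row_start[r] .. row_start[r + 1] - 1.
    """
    values, col_index, row_start = [], [], [0]
    for row in dense:
        for c, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_index.append(c)
        row_start.append(len(values))
    return values, col_index, row_start


# Hypothetical example matrix.
dense = [
    [0, 3, 0, 4],
    [7, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 2, 0, 0],
]
values, col_index, row_start = dense_to_crs(dense)

# First nonzero of the second row: its position and its {index, value} tuple.
pos = row_start[1]
print(pos, (col_index[pos], values[pos]))
```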

Computations on sparse matrices have the same structure as their dense-matrix counterparts, but the nature of sparse data tends to make them much more bandwidth-intensive than their dense-matrix counterparts. For example, both the sparse and dense variants of matrix-matrix multiplication find C = A · B by computing C_{i,j} = A_{i,*} · B_{*,j} for all i, j, that is, the dot product of row i of A with column j of B. In a dense matrix-matrix computation, this leads to substantial data reuse, because each element of A participates in N multiply-add operations (assuming N × N matrices), as does each element of B. As long as the matrix-matrix multiplication is blocked for cache locality, this reuse causes the computation to have a low bytes/op ratio and to be compute-limited. In the sparse variant, however, each element of A participates only in as many multiply-add operations as there are nonzero values in the corresponding row of B, while each element of B participates only in as many multiply-adds as there are nonzero elements in the corresponding column of A. As the sparseness of the matrices increases, so does the bytes/op ratio, making the performance of many sparse matrix-matrix computations limited by memory bandwidth in spite of the fact that dense matrix-matrix multiplication is one of the canonical compute-bound computations.

Four operations make up the bulk of the sparse matrix computations seen in today's applications: sparse matrix-dense vector multiplication (SpMV), sparse matrix-sparse vector multiplication, sparse matrix-sparse matrix multiplication, and relaxation/smoother operations, such as the Gauss-Seidel smoother used in implementations of the High-Performance Conjugate Gradient benchmark. These operations share two characteristics that make a sparse matrix accelerator practical. First, they are dominated by vector dot products, which makes it possible to implement simple hardware that can carry out all four key computations. For example, a matrix-vector multiplication is performed by taking the dot product of each row in the matrix with the vector, while a matrix-matrix multiplication takes the dot product of each row of one matrix with each column of the other. Second, applications generally perform multiple computations on the same matrix, such as the thousands of multiplications of the same matrix by different vectors that a support vector machine algorithm performs while training a model. This repeated use of the same matrix makes it practical to transfer matrices to/from the accelerator during program execution and/or to reformat the matrix in a way that simplifies the hardware's task, because the cost of data transfers/transformations can be amortized across many operations on each matrix.

疎行列計算は、典型的には、それらが実行するシステムのピークパフォーマンスのわずか数パーセントしか実現していない。なぜこれが発生するかを明らかにするために、図８２は、ＣＲＳデータフォーマットを用いた疎行列－密ベクトル乗算の実装に関する段階８２０１－８２０４を示す。第１に、８２０１において、行列の行を表すデータ構造がメモリから読み出され、通常、予測及びプリフェッチすることが容易であるシーケンシャルな読み出しのセットに関する。第２に、８２０２において、行列の行内のゼロ以外の成分のインデックスは、多数のデータ依存型の予測困難なメモリアクセス（ギャザーオペレーション）を必要とする、ベクトルの対応する成分を収集するために用いられる。さらに、これらのメモリアクセスは、多くの場合、各参照されるキャッシュライン内の１又は２ワードしか触れないので、ベクトルがキャッシュに適合していない場合、かなり多くの無駄な帯域幅をもたらす。 Sparse matrix computations typically achieve only a few percent of the peak performance of the systems on which they are executed. To clarify why this occurs, Figure 82 shows steps 8201-8204 for implementing sparse matrix-dense vector multiplication using the CRS data format. First, in 8201, a data structure representing a row of the matrix is read from memory, typically in a set of sequential reads that are easy to predict and prefetch. Second, in 8202, the index of a non-zero element in the row of the matrix is used to gather the corresponding element of the vector, which requires a large number of data-dependent and hard-to-predict memory accesses (gather operations). Furthermore, these memory accesses often only touch one or two words in each referenced cache line, resulting in a significant amount of wasted bandwidth if the vector does not fit in the cache.

Third, at 8203, the processor computes the dot product of the nonzero elements of the matrix row and the corresponding elements of the vector. Finally, at 8204, the result of the dot product is written into the result vector, which is also accessed sequentially, and the program proceeds to the next row of the matrix. Note that this is a conceptual/algorithmic view of the computation; the exact sequence of operations the program executes depends on the processor's ISA and vector width.
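The four steps 8201-8204 can be sketched in software as follows. This is only the conceptual view described above, not the instruction sequence of any particular processor; the function name, the example arrays, and the trailing end-of-data entry in the row-start array are assumptions made for illustration.

```python
def spmv_crs(values, col_index, row_start, x):
    """Sparse matrix-dense vector multiply over CRS arrays (conceptual)."""
    n_rows = len(row_start) - 1
    y = [0] * n_rows
    for r in range(n_rows):
        lo, hi = row_start[r], row_start[r + 1]
        # 8201: read the row's data structure (sequential, easy to prefetch).
        row_vals = values[lo:hi]
        row_cols = col_index[lo:hi]
        # 8202: gather the matching vector elements (data-dependent accesses).
        gathered = [x[c] for c in row_cols]
        # 8203: dot product of the row's nonzeros with the gathered elements.
        acc = 0
        for a, b in zip(row_vals, gathered):
            acc += a * b
        # 8204: write the result sequentially, then move to the next row.
        y[r] = acc
    return y


# Hypothetical 4x4 matrix in CRS form and a dense vector.
values = [3, 4, 7, 1, 2]
col_index = [1, 3, 0, 2, 1]
row_start = [0, 2, 4, 4, 5]
x = [1, 2, 3, 4]
y = spmv_crs(values, col_index, row_start, x)
print(y)  # [22, 10, 0, 4]
```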

この例は、疎行列計算の多数の重要な特性を示す。３２ビットデータタイプ、及び、行列もベクトルもキャッシュに合致していないと仮定して、出力される行の第１の成分を計算するには、ＤＲＡＭから３６バイトを読み出す必要があるが、７．２：１のバイト／ｏｐレートに対して、５つの計算命令（３つが乗算及び２つが加算）だけである。 This example illustrates a number of important properties of sparse matrix computations. Assuming 32-bit data types and that neither matrices nor vectors fit in the cache, computing the first component of the output row requires reading 36 bytes from DRAM, but only five computation instructions (three multiplications and two additions) for a byte/op rate of 7.2:1.
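One accounting that reproduces the 36-byte and 7.2:1 figures is sketched below. It assumes the row in question has three nonzeros (consistent with three multiplies and two adds) and that each nonzero requires one 32-bit matrix value, one 32-bit index, and one 32-bit vector element to be read; reads of the row-start array and the write of the result are not counted. This breakdown is an inference from the stated numbers, not something the text spells out.

```python
def bytes_per_op(nnz_in_row, word_bytes=4):
    """Return (bytes_read, ops, bytes/op) for one row's dot product.

    Assumed accounting: per nonzero, one value + one index + one vector
    element are read; ops are nnz multiplies plus (nnz - 1) additions.
    Requires nnz_in_row >= 1.
    """
    bytes_read = nnz_in_row * 3 * word_bytes
    ops = nnz_in_row + (nnz_in_row - 1)
    return bytes_read, ops, bytes_read / ops


bytes_read, ops, ratio = bytes_per_op(3)
print(bytes_read, ops, ratio)  # 36 5 7.2
```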

しかしながら、メモリ帯域幅は、高性能な疎行列計算に対する唯一の課題ではない。図８２が示すように、ＳｐＭＶにおけるベクトルへのアクセスは、データ依存しており、予測が困難であるので、アプリケーションへのベクトルアクセスのレイテンシをさらす。ベクトルがキャッシュに適合しない場合、ＳｐＭＶの性能は、たとえ、データを待機する多くのスレッドがストールされた場合であっても、ＤＲＡＭ帯域幅を飽和させるのに十分な並列性をプロセッサが提供しない限り、ＤＲＡＭレイテンシ並びに帯域幅に敏感になる。 However, memory bandwidth is not the only challenge for high performance sparse matrix computation. As Figure 82 shows, accesses to vectors in SpMV are data dependent and hard to predict, exposing vector access latencies to applications. If vectors do not fit in the cache, SpMV performance becomes sensitive to DRAM latency as well as bandwidth unless the processor provides enough parallelism to saturate the DRAM bandwidth, even if many threads are stalled waiting for data.

したがって、疎行列計算のアーキテクチャは、いくつかの事項を有効にしなければならない。疎計算についてのバイト／ｏｐの必要性を満たすように高いメモリ帯域幅を実現させなければならない。また、キャッシュに適合しない可能性がある大きなベクトルからの高帯域幅収集をサポートしなければならない。最後に、ＤＲＡＭ帯域幅に追従するために十分な算術演算／秒を実行することそれ自体が課題ではないとはいえ、アーキテクチャは、システムのパワーバジェット内に維持するために、エネルギー効率の良い方式で、それらのオペレーション及びそれらが必要とするメモリアクセスのすべてを実行しなければならない。 Therefore, an architecture for sparse matrix computation must enable several things: it must provide high memory bandwidth to meet the bytes/op need for sparse computations; and it must support high bandwidth collection from large vectors that may not fit in the cache. Finally, while performing enough arithmetic operations/second to keep up with DRAM bandwidth is not a challenge in itself, the architecture must perform those operations and all of the memory accesses they require in an energy-efficient manner to stay within the system's power budget.

一実施例では、高いメモリ帯域幅、大きなベクトルからの高帯域幅収集及びエネルギー効率の良い計算という、高い疎－行列性能に必要な３つの機能を提供するように設計されたアクセラレータを有する。図８３に示されるように、アクセラレータの一実施例は、アクセラレータ論理ダイ８３０５と、ＤＲＡＭダイの１又は複数のスタック８３０１－８３０４とを含む。以下により詳細に説明されるスタック型ＤＲＡＭは、低エネルギー／ビットで高いメモリ帯域幅を提供する。例えば、スタック型ＤＲＡＭは、２．５ｐＪ／ｂｉｔで２５６－５１２ＧＢ／秒を実現することが予期され、一方、ＬＰＤＤＲ４ＤＩＭＭは、たった６８ＧＢ／秒しか実現しないことが予期され、１２ｐＪ／ｂｉｔのエネルギーコストを有する。 In one embodiment, we have an accelerator designed to provide the three functions required for high sparse-matrix performance: high memory bandwidth, high bandwidth collection from large vectors, and energy-efficient computation. As shown in FIG. 83, one embodiment of the accelerator includes an accelerator logic die 8305 and one or more stacks of DRAM dies 8301-8304. Stacked DRAM, described in more detail below, provides high memory bandwidth with low energy/bit. For example, stacked DRAM is expected to achieve 256-512 GB/sec at 2.5 pJ/bit, while LPDDR4 DIMMs are expected to achieve only 68 GB/sec, with an energy cost of 12 pJ/bit.

アクセラレータスタックの最下層にあるアクセラレータ論理チップ８３０５は、疎行列計算の必要性に合わせてカスタマイズされ、ＤＲＡＭスタック８３０１－８３０４により提供される帯域幅を消費することができ、一方、エネルギー消費がスタックの帯域幅に比例するが、２～４ワットの電力しか費やしていない。本願の残りの部分については、２７３ＧＢ／秒のスタックの帯域幅（ＷＩＯ３スタックの予期される帯域幅）が想定される。より高い帯域幅のスタックに基づいた設計では、メモリ帯域幅を消費するために、より多くの並列性を組み込むであろう。 The accelerator logic chip 8305 at the bottom of the accelerator stack is customized for the needs of sparse matrix computations and is able to consume the bandwidth provided by the DRAM stacks 8301-8304 while consuming only 2-4 watts of power, with energy consumption proportional to the stack bandwidth. For the remainder of this application, a stack bandwidth of 273 GB/sec (the expected bandwidth of the WIO3 stack) is assumed. Designs based on higher bandwidth stacks would incorporate more parallelism to consume memory bandwidth.

Figure 84a illustrates one embodiment of the accelerator logic chip 8305, oriented from a top perspective looking through the stack of DRAM dies 8301-8304. The stacked DRAM channel blocks 8405 toward the center of the diagram represent the through-silicon vias that connect the logic chip 8305 to the DRAMs 8301-8304, while the memory controller blocks 7410 contain the logic that generates the control signals for the DRAM channels. Although eight DRAM channels 8405 are shown in the figure, the actual number of channels implemented on an accelerator chip will vary depending on the stacked DRAM used. Most of the stacked DRAM technologies being developed provide either four or eight channels.

The dot-product engines (DPEs) 8420 are the computing elements of the architecture. In the particular embodiment shown in Figures 84a-84b, each set of eight DPEs is associated with a vector cache 8415. Figure 85 provides a high-level overview of a DPE, which contains two buffers 8505-8506, two 64-bit multiply-add ALUs 8510, and control logic 8500. During a computation, the chip control unit 8500 streams chunks of the data being processed into the buffer memories 8505-8506. Once each buffer is full, the DPE's control logic sequences through the buffers, computing the dot products of the vectors they contain and writing the results to the DPE's result latch 8512, which is connected in a daisy chain with the result latches of the other DPEs so that the results of a computation can be written back to the stacked DRAM 8301-8304.

In one embodiment, the accelerator logic chip operates at approximately 1 GHz and 0.65 V to minimize power consumption (although the particular operating frequency and voltage may be modified for different applications). Analysis based on 14 nm design studies shows that 32-64 KB buffers meet this frequency specification at that voltage, although strong ECC may be required to prevent soft errors. The multiply-add unit may be operated at half of the base clock rate in order to meet timing with a 0.65 V supply voltage and a shallow pipeline. Having two ALUs provides a throughput of one double-precision multiply-add per cycle per DPE.

At 273 GB/second and a clock rate of 1.066 GHz, the DRAM stack 8301-8304 delivers 256 bytes of data per logic chip clock cycle. Assuming that array indices and values are at least 32-bit quantities, this translates to 32 sparse-matrix elements per cycle (4 bytes of index + 4 bytes of value = 8 bytes/element), requiring that the chip perform 32 multiply-adds per cycle to keep up. (This is for matrix-vector multiplication and assumes a high hit rate in the vector cache, so that 100% of the stacked DRAM bandwidth is used to fetch the matrix.) The 64 DPEs shown in Figures 84a and 84b provide 2-4x the required compute throughput, allowing the chip to process data at the peak stacked DRAM bandwidth even if the ALUs 8510 are not used 100% of the time.
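The throughput arithmetic above can be checked with a short sketch. The clock is taken as 1.066 GHz, which is the value that makes 273 GB/second work out to the stated 256 bytes per cycle. Reading the 2-4x range as the span between 32-bit and 64-bit indices and values is an interpretation, not something the text states; the function and constant names are hypothetical.

```python
STACK_BANDWIDTH = 273e9   # bytes per second (stated stack bandwidth)
CLOCK_HZ = 1.066e9        # assumed: 1.066 GHz, implied by 256 bytes/cycle
NUM_DPES = 64             # DPEs shown in Figures 84a-84b
MACS_PER_DPE = 1          # one multiply-add per cycle per DPE


def required_macs_per_cycle(bandwidth, clock_hz, index_bytes, value_bytes):
    """Sparse-matrix elements delivered per cycle, each needing one multiply-add."""
    bytes_per_cycle = round(bandwidth / clock_hz)
    return bytes_per_cycle // (index_bytes + value_bytes)


bytes_per_cycle = round(STACK_BANDWIDTH / CLOCK_HZ)
need_32bit = required_macs_per_cycle(STACK_BANDWIDTH, CLOCK_HZ, 4, 4)
need_64bit = required_macs_per_cycle(STACK_BANDWIDTH, CLOCK_HZ, 8, 8)
headroom_32bit = NUM_DPES * MACS_PER_DPE / need_32bit
headroom_64bit = NUM_DPES * MACS_PER_DPE / need_64bit
print(bytes_per_cycle, need_32bit, need_64bit, headroom_32bit, headroom_64bit)
```

Under these assumptions the 64 DPEs provide 2x the required multiply-add rate for 32-bit elements and 4x for 64-bit elements.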

一実施例において、ベクトルキャッシュ８４１５は、行列－ベクトル乗算内のベクトルの成分をキャッシュする。これは、以下で説明される行列－ブロッキングスキームの効率性を著しく向上させる。一実施例において、各ベクトルキャッシュブロックは、８つのチャネルアーキテクチャにおいて、２５６～５１２ＫＢの総容量に対して、キャッシュの３２～６４ＫＢを含む。 In one embodiment, the vector cache 8415 caches the components of vectors in matrix-vector multiplication. This significantly improves the efficiency of the matrix-blocking scheme described below. In one embodiment, each vector cache block contains 32-64 KB of cache, for a total capacity of 256-512 KB in an eight channel architecture.

チップ制御ユニット８４０１は、計算のフローを管理し、アクセラレータ内の他のスタック及びシステム内の他のソケットとの通信を処理する。複雑性及び電力消費を低減するために、ドット積エンジンが、メモリからデータを要求することは決してない。代わりに、チップ制御ユニット８４０１は、メモリシステムを管理し、データの適切なブロックをＤＰＥのそれぞれにプッシュする転送を開始する。 The chip control unit 8401 manages the flow of computations and handles communication with other stacks in the accelerator and other sockets in the system. To reduce complexity and power consumption, the dot product engine never requests data from memory. Instead, the chip control unit 8401 manages the memory system and initiates transfers that push the appropriate blocks of data to each of the DPEs.

一実施例において、マルチスタックアクセラレータ内のスタックは、図に示される隣接接続８４３１を用いて実装されるＫＴＩリンク８４３０のネットワークを介して互いに通信する。チップはまた、マルチソケットシステム内の他のソケットと通信するために用いられる３つの追加のＫＴＩリンクを提供する。マルチスタックアクセラレータにおいて、スタックのオフパッケージＫＴＩリンク８４３０のうちの１つだけがアクティブにされる。他のスタック上のメモリをターゲットとするＫＴＩトランザクションは、オンパッケージＫＴＩネットワークを介して適切なスタックに転送される。 In one embodiment, the stacks in a multi-stack accelerator communicate with each other via a network of KTI links 8430 implemented with adjacent connections 8431 as shown in the figure. The chip also provides three additional KTI links that are used to communicate with other sockets in a multi-socket system. In a multi-stack accelerator, only one of the off-package KTI links 8430 of a stack is activated. KTI transactions targeting memory on other stacks are forwarded to the appropriate stack via the on-package KTI network.

アクセラレータの一実施例についての疎行列－密ベクトル及び疎行列－疎ベクトル乗算を実装する技術及びハードウェアがここで説明される。これはまた、疎行列演算をサポートするアクセラレータを作成するために、行列－行列乗算、緩和演算及び他の機能をサポートするために拡張されることもできる。 Techniques and hardware for implementing sparse matrix-dense vector and sparse matrix-sparse vector multiplication for one embodiment of the accelerator are described herein. This can also be extended to support matrix-matrix multiplication, relaxation operations, and other functions to create accelerators that support sparse matrix operations.

疎－疎及び疎－密行列－ベクトル乗算が、（行列及びベクトル内の各行のドット積を取る）同じ基本アルゴリズムを実行する一方、ベクトルが密である場合と比較して、それが疎である場合に、どのようにこのアルゴリズムが実装されるかについて著しい差があり、それは、以下のテーブルに要約されている。

While sparse-sparse and sparse-dense matrix-vector multiplication perform the same basic algorithm (taking the dot product of each row in the matrix and the vector), there are significant differences in how this algorithm is implemented when the vector is sparse compared to when it is dense, which are summarized in the table below.

疎行列－密ベクトル乗算において、ベクトルのサイズは固定され、行列内の列の数に等しい。科学技術計算で得られる行列の多くが、１行あたりおよそ１０個の非ゼロ要素が平均であるので、疎行列－密ベクトル乗算内のベクトルが行列自体の５～１０％の空間を占めることは珍しくない。一方で、疎ベクトルは、多くの場合、かなり短く、行列の行に同様の数のゼロ以外の値を含んでおり、これらをンチップメモリ内にかなりキャッシュしやすくする。 In a sparse matrix-dense vector multiplication, the size of the vector is fixed and equal to the number of columns in the matrix. Since many matrices resulting from scientific computing have an average of around 10 nonzero elements per row, it is not uncommon for the vector in a sparse matrix-dense vector multiplication to occupy 5-10% of the space of the matrix itself. On the other hand, sparse vectors are often much shorter and contain a similar number of nonzero values to the matrix rows, making them much easier to cache in on-chip memory.

In sparse matrix-dense vector multiplication, the location of each element in the vector is determined by its index, making it feasible to gather the vector elements that correspond to the nonzero values in a region of the matrix and to pre-compute the set of vector elements that need to be gathered for any dense vector by which the matrix will be multiplied. The location of each element in a sparse vector, however, is unpredictable and depends on the distribution of nonzero elements in the vector. This makes it necessary to examine the nonzero elements of both the sparse vector and the matrix to determine which nonzeros in the matrix correspond to nonzero values in the vector.

Because the indices of the nonzero elements in the matrix and the vector must be compared, the number of instructions/operations required to compute a sparse matrix-sparse vector dot product is unpredictable and depends on the structure of the matrix and vector. For example, consider taking the dot product of a matrix row that has a single nonzero element with a vector that has many nonzero elements. If the row's nonzero has a lower index than any of the nonzeros in the vector, the dot product requires only one index comparison. If the row's nonzero has a higher index than any of the nonzeros in the vector, computing the dot product requires comparing the index of the row's nonzero with every index in the vector. This assumes a linear search through the vector, which is common practice. Other searches, such as binary search, would be faster in the worst case but would add significant overhead in the common case where the nonzeros in the row and the vector overlap. In contrast, the number of operations required to perform a sparse matrix-dense vector multiplication is fixed and determined by the number of nonzero values in the matrix, making it easy to predict the amount of time required for the computation.
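The dependence of the comparison count on the index structure can be seen in a minimal sketch of a sparse-sparse dot product using a linear merge over sorted index lists. The function name and data are hypothetical, and each loop iteration is counted as one index comparison; this is a software illustration, not the hardware described in this disclosure.

```python
def sparse_dot(row_idx, row_val, vec_idx, vec_val):
    """Dot product of two sparse vectors given as sorted (index, value) lists.

    Returns (result, comparisons), where comparisons counts the index
    comparisons performed by the linear merge.
    """
    i = j = 0
    total = 0
    comparisons = 0
    while i < len(row_idx) and j < len(vec_idx):
        comparisons += 1
        if row_idx[i] == vec_idx[j]:
            total += row_val[i] * vec_val[j]
            i += 1
            j += 1
        elif row_idx[i] < vec_idx[j]:
            i += 1   # row nonzero has no partner in the vector
        else:
            j += 1   # vector nonzero has no partner in the row
    return total, comparisons


vec_idx, vec_val = [3, 5, 8, 9], [1, 1, 1, 1]

# Row nonzero below every vector index: a single comparison suffices.
best = sparse_dot([0], [2], vec_idx, vec_val)
# Row nonzero above every vector index: compared against every vector index.
worst = sparse_dot([20], [2], vec_idx, vec_val)
# Overlapping nonzeros.
overlap = sparse_dot([2, 5, 6], [1, 2, 3], [2, 4, 5], [10, 20, 30])
print(best, worst, overlap)
```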

Because of these differences, one embodiment of the accelerator uses the same high-level algorithm to implement sparse matrix-dense vector and sparse matrix-sparse vector multiplication, with differences in how the vector is distributed across the dot-product engines and how the dot product is computed. Because the accelerator is intended for large sparse-matrix computations, it cannot be assumed that either the matrix or the vector will fit in on-chip memory. Instead, one embodiment uses the blocking scheme outlined in Figure 86.

In particular, in this embodiment, the accelerator divides matrices into fixed-size blocks of data 8601-8602, sized to fit in the on-chip memory, and multiplies the rows in a block by the vector to generate a chunk of the output vector before proceeding to the next block. This approach poses two challenges. First, the number of nonzeros in each row of a sparse matrix varies widely between data sets, from as low as one to as high as 46,000 in the data sets studied. This makes it impractical to assign one or even a fixed number of rows to each dot-product engine. Therefore, one embodiment assigns fixed-size chunks of matrix data to each dot-product engine and handles both the case where a chunk contains multiple matrix rows and the case where a single row is split across multiple chunks.

The second challenge is that fetching the entire vector from stacked DRAM for each block of the matrix has the potential to waste significant amounts of bandwidth (i.e., fetching vector elements for which there is no corresponding nonzero in the block). This is particularly an issue for sparse matrix-dense vector multiplication, where the vector can be a significant fraction of the size of the sparse matrix. To address this, one embodiment constructs a fetch list 8611-8612 for each block 8601-8602 in the matrix, which lists the set of elements of the vector 8610 that correspond to nonzero values in the block, and fetches only those elements when processing the block. Although the fetch lists must also be fetched from stacked DRAM, it has been determined that the fetch list for most blocks is a small fraction of the size of the block. Techniques such as run-length encoding may also be used to reduce the size of the fetch lists.
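Fetch-list construction can be sketched as follows. Several simplifications are assumed for illustration: a block is measured as a fixed number of nonzeros rather than a fixed number of bytes, the fetch list is a sorted list of distinct column indices, and the run-length encoding shown (runs of consecutive indices stored as start and length) is only one possible encoding; the text does not specify one. All names are hypothetical.

```python
def build_fetch_lists(col_index, block_nnz):
    """One fetch list per block: the distinct vector indices the block touches."""
    fetch_lists = []
    for start in range(0, len(col_index), block_nnz):
        block_cols = col_index[start:start + block_nnz]
        fetch_lists.append(sorted(set(block_cols)))
    return fetch_lists


def run_length_encode(sorted_indices):
    """Encode runs of consecutive indices as (start, length) pairs."""
    runs = []
    for idx in sorted_indices:
        if runs and idx == runs[-1][0] + runs[-1][1]:
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)
        else:
            runs.append((idx, 1))
    return runs


# Hypothetical column indices of a matrix's nonzeros, six nonzeros per block.
col_index = [2, 5, 6, 2, 4, 5, 0, 7]
fetch_lists = build_fetch_lists(col_index, block_nnz=6)
encoded = [run_length_encode(fl) for fl in fetch_lists]
print(fetch_lists)  # [[2, 4, 5, 6], [0, 7]]
print(encoded)      # [[(2, 1), (4, 3)], [(0, 1), (7, 1)]]
```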

したがって、アクセラレータ上の行列－ベクトル乗算は、オペレーションの以下のシーケンスに関する。 So matrix-vector multiplication on an accelerator involves the following sequence of operations:

１．ＤＲＡＭスタックから行列データのブロックをフェッチし、それをドット積エンジンにわたって分散する。 1. Fetch a block of matrix data from the DRAM stack and distribute it across the dot product engines.

２．行列データ内のゼロ以外の成分に基づいてフェッチリストを生成する。 2. Generate a fetch list based on the non-zero elements in the matrix data.

３．スタックＤＲＡＭからフェッチリスト内の各ベクトル成分をフェッチし、それをドット積エンジンに分散する。 3. Fetch each vector component in the fetch list from the stack DRAM and distribute it to the dot product engines.

４．ベクトルを有するブロック内の行のドット積を計算し、スタックＤＲＡＭに結果を書き込む。 4. Calculate the dot product of the rows in the block with the vector and write the results to stack DRAM.

５．計算と並列して、行列データの次のブロックをフェッチし、行列全体が処理されるまで繰り返す。 5. In parallel with the computation, fetch the next block of matrix data and repeat until the entire matrix has been processed.
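The five-step sequence above can be sketched in software as shown below. This is a sequential illustration only: the hardware distributes each block across dot-product engines and overlaps the fetch of the next block with computation, neither of which is modeled. Blocks are measured in nonzeros, a row may span two blocks, and all names are hypothetical.

```python
def blocked_spmv(values, col_index, row_start, x, block_nnz):
    """Blocked sparse matrix-dense vector multiply over CRS arrays."""
    n_rows = len(row_start) - 1
    y = [0] * n_rows
    # Row number of every nonzero, so that a block may begin mid-row.
    row_of = []
    for r in range(n_rows):
        row_of.extend([r] * (row_start[r + 1] - row_start[r]))

    for start in range(0, len(values), block_nnz):
        end = min(start + block_nnz, len(values))
        # Step 1: fetch a block of matrix data.
        blk_vals = values[start:end]
        blk_cols = col_index[start:end]
        blk_rows = row_of[start:end]
        # Step 2: generate the fetch list from the block's nonzeros.
        fetch_list = sorted(set(blk_cols))
        # Step 3: fetch only the vector elements named in the fetch list.
        vec_buf = {c: x[c] for c in fetch_list}
        # Step 4: accumulate the dot products; a row split across blocks
        # receives a partial sum from each block.
        for r, c, v in zip(blk_rows, blk_cols, blk_vals):
            y[r] += v * vec_buf[c]
        # Step 5: continue with the next block (overlapped in hardware).
    return y


values = [3, 4, 7, 1, 2]
col_index = [1, 3, 0, 2, 1]
row_start = [0, 2, 4, 4, 5]
x = [1, 2, 3, 4]
y = blocked_spmv(values, col_index, row_start, x, block_nnz=3)
print(y)  # [22, 10, 0, 4]
```

With three nonzeros per block, the second row of this example is split across two blocks and its result is the sum of the two partial dot products.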

If the accelerator contains multiple stacks, "partitions" of the matrix may be statically assigned to the different stacks, and the blocking algorithm may then be executed in parallel on each partition. This blocking and broadcast scheme has the advantage that all of the memory references originate from a central control unit, which greatly simplifies the design of the on-chip network, because the network does not have to route unpredictable requests and replies between the dot-product engines and the memory controllers. It also saves energy by issuing only one memory request for each vector element that a given block needs, as opposed to having individual dot-product engines issue memory requests for the vector elements they require to perform their portion of the computation. Finally, fetching vector elements from an organized list of indices makes it easy to schedule the memory requests those fetches require in a way that maximizes page hits in the stacked DRAM and thus bandwidth usage.

本明細書で説明されるアクセラレータの実装で疎行列－密ベクトル乗算を実施する場合の１つの課題は、各ドット積エンジンのバッファにおいて行列成分のインデックスにメモリからストリーミングされるベクトル成分をマッチングさせることである。一実施例において、ベクトルの２５６バイト（３２～６４成分）は、１サイクル毎にドット積エンジンに到達し、行列データの固定サイズのブロックが、各ドット積エンジンの行列バッファにフェッチされているので、各ベクトル成分は、ドット積エンジンの行列バッファ内の非ゼロのうちのいずれかに対応し得る。 One challenge in performing sparse matrix-dense vector multiplication in the accelerator implementation described herein is matching vector components streamed from memory to the matrix component indices in each dot-product engine's buffer. In one embodiment, 256 bytes (32-64 components) of the vector arrive at the dot-product engine every cycle, and a fixed-size block of matrix data is fetched into each dot-product engine's matrix buffer, so that each vector component can correspond to any of the nonzeros in the dot-product engine's matrix buffer.

Performing that many comparisons each cycle would be prohibitively expensive in area and power. Instead, one embodiment takes advantage of the fact that many sparse-matrix applications repeatedly multiply the same matrix by either the same or different vectors, and pre-computes the elements of the fetch list that each dot-product engine will need to process its chunk of the matrix, using the format shown in Figure 87. In the baseline CRS format, a matrix is described by an array of indices 8702 that define the position of each nonzero value within its row, an array containing the value of each nonzero 8703, and an array 8701 that indicates where each row starts in the index and value arrays. To that, one embodiment adds an array of block descriptors 8705 that identify which bursts of vector data each dot-product engine needs to capture in order to perform its fraction of the overall computation.

As shown in Figure 87, each block descriptor consists of eight 16-bit values and a list of burst descriptors. The first 16-bit value tells the hardware how many burst descriptors are in the block descriptor, while the remaining seven identify the start points within the burst descriptor list for all of the stacked DRAM data channels except the first. The number of these values will change depending on the number of data channels the stacked DRAM provides. Each burst descriptor contains a 24-bit burst count that tells the hardware which burst of data it needs to pay attention to and a "words needed" bit vector that identifies the words within the burst that contain values the dot-product engine needs.

一実施例に含まれる他のデータ構造は、行列バッファインデックス（ＭＢＩ）８７０４のアレイであり、行列内のゼロ以外毎に１つのＭＢＩである。各ＭＢＩは、ゼロ以外に対応する密ベクトル成分が関連するドット積エンジンのベクトル値バッファに格納される位置を与える（例えば、図８９を参照）。疎行列－密ベクトル乗算を実行する場合、元の行列インデックスではなくむしろ行列バッファインデックスは、ドット積エンジンの行列インデックスバッファ８７０４にロードされ、ドット積を計算する場合の対応するベクトル値を検索するために用いられるアドレスの代わりになる。 Another data structure included in one embodiment is an array of matrix buffer indexes (MBIs) 8704, one MBI for each non-zero in the matrix. Each MBI gives the location where the dense vector component corresponding to the non-zero is stored in the vector value buffer of the associated dot product engine (see, e.g., FIG. 89). When performing a sparse matrix-dense vector multiplication, the matrix buffer index, rather than the original matrix index, is loaded into the matrix index buffer 8704 of the dot product engine and takes the place of the address used to look up the corresponding vector value when computing the dot product.

Figure 88 illustrates how this works for a two-row matrix that fits within the buffers of a single dot-product engine, on a system with only one stacked DRAM data channel and four-word data bursts. The original CRS representation, including row start values 8801, matrix indices 8802, and matrix values 8803, is shown on the left of the figure. Since the two rows have nonzero elements in columns {2, 5, 6} and {2, 4, 5}, elements 2, 4, 5, and 6 of the vector are required to compute the dot products. The block descriptors reflect this, indicating that word 2 of the first four-word burst (element 2 of the vector) and words 0, 1, and 2 of the second four-word burst (elements 4-6 of the vector) are required. Since element 2 of the vector is the first word of the vector that the dot-product engine needs, it goes in location 0 in the vector value buffer. Element 4 of the vector goes in location 1, and so on.

The matrix buffer index array data 8804 holds the position within the vector value buffer where the hardware will find the value that corresponds to each non-zero in the matrix. Since the first entry in the matrix index array has the value "2", the first entry in the matrix buffer index array gets the value "0", corresponding to the position where component 2 of the vector is stored in the vector value buffer. Similarly, wherever a "4" appears in the matrix index array, a "1" appears in the matrix buffer index array; each "5" in the matrix index array has a corresponding "2" in the matrix buffer index array; and each "6" in the matrix index array corresponds to a "3" in the matrix buffer index array.
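The mapping from matrix indices to matrix buffer indices and block descriptor entries can be sketched in a few lines of Python. This is only an illustration of the precomputation described for FIG. 88, assuming a single data channel and four-word bursts; the function name `build_gather_metadata` and the dictionary layout of the descriptor are assumptions made for this sketch, not part of the described hardware.

```python
def build_gather_metadata(matrix_indices, words_per_burst=4):
    """Derive the vector value buffer layout for one chunk of a CRS matrix."""
    # Distinct vector components the chunk needs, in ascending order. The
    # position of a component in this list is its slot in the vector value buffer.
    needed = sorted(set(matrix_indices))
    slot_of = {column: slot for slot, column in enumerate(needed)}

    # Matrix buffer indices: one per non-zero, replacing the original column index.
    mbi = [slot_of[column] for column in matrix_indices]

    # Block descriptor (hypothetical layout): burst number -> words to capture.
    descriptor = {}
    for column in needed:
        burst, word = divmod(column, words_per_burst)
        descriptor.setdefault(burst, []).append(word)
    return needed, mbi, descriptor


# Matrix index array of the two-row example: columns {2, 5, 6} and {2, 4, 5}.
needed, mbi, descriptor = build_gather_metadata([2, 5, 6, 2, 4, 5])
print(needed)      # [2, 4, 5, 6]
print(mbi)         # [0, 2, 3, 0, 1, 2]
print(descriptor)  # {0: [2], 1: [0, 1, 2]}
```

The printed values agree with the text: word 2 of the first burst and words 0-2 of the second burst are captured, and every "4", "5", and "6" in the matrix index array becomes "1", "2", and "3" respectively.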

One embodiment of the invention performs the precomputation required to support fast gathers from dense vectors when a matrix is loaded onto the accelerator, taking advantage of the fact that the total bandwidth of a multi-stack accelerator is much greater than the bandwidth of the KTI links used to transfer data from the CPU to the accelerator. This precomputed information increases the amount of memory required to hold a matrix by up to 75%, depending on how often multiple copies of the same matrix index occur within the chunk of the matrix mapped onto a dot product engine. However, because the 16-bit matrix buffer index array is fetched instead of the matrix index array when a matrix-vector multiplication is performed, the amount of data fetched from the stacked DRAM is often less than in the original CRS representation, particularly for matrices that use 64-bit indices.

FIG. 89 illustrates one embodiment of the hardware in a dot product engine that uses this format. To perform a matrix-vector multiplication, the chunks of the matrix that make up a block are copied into the matrix index buffer 8903 and the matrix value buffer 8905 (copying the matrix buffer indices instead of the original matrix indices), and the relevant block descriptor is copied into the block descriptor buffer 8902. The fetch list is then used to load the required elements from the dense vector and broadcast them to the dot product engines. Each dot product engine counts the number of bursts of vector data that go by on each data channel. When the count on a given data channel matches the value specified in a burst descriptor, the match logic 8920 captures the specified words and stores them in its vector value buffer 8904.

FIG. 90 shows the contents of the match logic 8920 unit that performs this capture. A latch 9005 captures the value on the data channel's wires when the counter matches the value in the burst descriptor. A shifter 9006 extracts the required words 9002 out of the burst 9001 and routes them to the right location in a line buffer 9007 whose size matches the rows in the vector value buffer. A load signal is generated when the burst count 9001 is equal to the internal counter 9004. When the line buffer fills up, it is stored in the vector value buffer 8904 (through mux 9008). Assembling the words from multiple bursts into lines in this way reduces the number of writes per cycle that the vector value buffer needs to support, which reduces its size.
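As a rough software analogy of the capture path, the sketch below counts broadcast bursts, picks out the words named by the descriptor, and writes them to the vector value buffer one full line at a time. It assumes one data channel and reuses the hypothetical burst-to-words descriptor layout; `capture_words` and `line_width` are names invented for this sketch, and the latch, shifter, and mux are collapsed into ordinary list operations.

```python
def capture_words(bursts, descriptor, line_width=4):
    """Capture descriptor-selected words from a stream of broadcast bursts."""
    vector_value_buffer = []   # written one assembled line at a time
    line_buffer = []
    for burst_count, burst in enumerate(bursts):    # internal burst counter
        wanted_words = descriptor.get(burst_count)
        if wanted_words is None:
            continue                                # counter mismatch: no load signal
        for word in wanted_words:                   # shifter extracts the wanted words
            line_buffer.append(burst[word])
            if len(line_buffer) == line_width:      # line full: a single buffer write
                vector_value_buffer.append(line_buffer)
                line_buffer = []
    if line_buffer:                                 # flush a partially filled last line
        vector_value_buffer.append(line_buffer)
    return vector_value_buffer


# A dense vector of eight words delivered as two four-word bursts.
bursts = [[10, 11, 12, 13], [14, 15, 16, 17]]
print(capture_words(bursts, {0: [2], 1: [0, 1, 2]}))  # [[12, 14, 15, 16]]
```

Four captured words from two bursts result in one write to the vector value buffer, which is the write-rate reduction the line buffer is meant to provide.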

Once all of the required components of the vector have been captured in the vector value buffer, the dot product engine computes the required dot products using the ALU 8910. The control logic 8901 steps through the matrix index buffer 8903 and the matrix value buffer 8905 in sequence, one component per cycle. The output of the matrix index buffer 8903 is used as the read address for the vector value buffer 8904 on the next cycle, while the output of the matrix value buffer 8905 is latched so that it reaches the ALU 8910 at the same time as the corresponding value from the vector value buffer 8904. For example, using the matrix from FIG. 88, on the first cycle of the dot product computation the hardware would read the matrix buffer index "0" out of the matrix index buffer 8903 along with the value "13" from the matrix value buffer 8905. On the second cycle, the value "0" from the matrix index buffer 8903 acts as the address for the vector value buffer 8904, fetching the value of vector component "2", which is then multiplied by "13" on cycle 3.

The values in the row start bit vector 8901 tell the hardware when a row of the matrix ends and a new one begins. When the hardware reaches the end of a row, it places the accumulated dot product for the row in its output latch 8911 and begins accumulating the dot product for the next row. The dot product latches of each dot product engine are connected in a daisy chain that assembles the output vector for writeback.
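The sparse-dense dot product loop can be sketched functionally as follows. The sketch ignores the one-cycle pipelining between the index read, the vector value read, and the multiply, and it assumes the first buffer entry always starts a row. Only the leading matrix value 13 comes from the text; the remaining matrix values and the vector values are made up for illustration, as is the function name.

```python
def sparse_dense_dot(row_start_bits, mbi, matrix_values, vector_value_buffer):
    """Accumulate one dot product per matrix row using matrix buffer indices."""
    if not mbi:
        return []
    outputs = []      # stands in for the daisy-chained output latches
    accumulator = 0
    entries = zip(row_start_bits, mbi, matrix_values)
    for position, (starts_row, index, value) in enumerate(entries):
        if starts_row and position > 0:
            outputs.append(accumulator)   # previous row finished: latch its result
            accumulator = 0
        # The matrix buffer index is used directly as the read address.
        accumulator += value * vector_value_buffer[index]
    outputs.append(accumulator)           # latch the final row
    return outputs


row_start_bits = [1, 0, 0, 1, 0, 0]      # two rows of three non-zeros each
mbi = [0, 2, 3, 0, 1, 2]                 # from the FIG. 88 example
matrix_values = [13, 1, 2, 3, 4, 5]      # hypothetical except for the leading 13
vector_value_buffer = [2, 3, 4, 5]       # hypothetical values of components 2, 4, 5, 6
print(sparse_dense_dot(row_start_bits, mbi, matrix_values, vector_value_buffer))  # [40, 38]
```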

In sparse matrix-sparse vector multiplication, the vector tends to take up much less memory than in sparse matrix-dense vector multiplication but, because it is sparse, it is not possible to directly fetch the vector component that corresponds to a given index. Instead, the vector must be searched, which makes it impractical to route to each dot product engine only the components that it needs, and makes the time required to compute the dot products of the matrix data assigned to each dot product engine unpredictable. Because of this, the fetch list for a sparse matrix-sparse vector multiplication merely specifies the indices of the lowest and highest non-zero components in the matrix block, and all of the non-zero components of the vector between those points must be broadcast to the dot product engines.

FIG. 91 shows the details of a dot product engine design that supports sparse matrix-sparse vector multiplication. To process a block of matrix data, the indices (not the matrix buffer indices used in a sparse-dense multiplication) and values of the dot product engine's chunk of the matrix are written into the matrix index and value buffers, as are the indices and values of the region of the vector required to process the block. The dot product engine control logic 9140 then sequences through the index buffers 9102-9103, which output blocks of four indices to the 4x4 comparator 9120. The 4x4 comparator 9120 compares each of the indices from the vector 9102 to each of the indices from the matrix 9103 and outputs the buffer addresses of any matches into the matched index queue 9130. The outputs of the matched index queue 9130 drive the read address inputs of the matrix value buffer 9105 and the vector value buffer 9104, which output the values corresponding to the matches into the multiply-add ALU 9110. This hardware allows the dot product engine to consume at least four and as many as eight indices per cycle as long as the matched index queue 9130 has empty space, reducing the amount of time required to process a block of data when index matches are rare.

As with the sparse matrix-dense vector dot product engine, a bit vector of row starts 9101 identifies the entries in the matrix buffers 9102-9103 that start a new row of the matrix. When such an entry is encountered, the control logic 9140 resets to the beginning of the vector index buffer 9102 and starts examining vector indices from their lowest value, comparing them to the outputs of the matrix index buffer 9103. Similarly, if the end of the vector is reached, the control logic 9140 advances to the beginning of the next row in the matrix index buffer 9103 and resets to the beginning of the vector index buffer 9102. A "done" output informs the chip control unit when the dot product engine has finished processing a block of data or a region of the vector and is ready to proceed to the next one. To simplify one embodiment of the accelerator, the control logic 9140 does not proceed to the next block/region until all of the dot product engines have finished processing.
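The block-wise index matching for a single matrix row can be approximated in software as below. The text does not spell out how the control logic advances the two index streams, so the advance rule here (move past whichever block ends at the smaller index, or both when they are equal) is an assumption that happens to consume between four and eight indices per step. Both index lists are assumed to be sorted in ascending order without duplicates, and `match_indices` is a name invented for this sketch.

```python
def match_indices(matrix_idx, vector_idx, width=4):
    """Return (matrix_pos, vector_pos) pairs with equal indices, plus the step count."""
    matches, steps = [], 0
    m = v = 0
    while m < len(matrix_idx) and v < len(vector_idx):
        m_block = matrix_idx[m:m + width]
        v_block = vector_idx[v:v + width]
        steps += 1
        # width x width comparator: every matrix index against every vector index.
        for i, mi in enumerate(m_block):
            for j, vj in enumerate(v_block):
                if mi == vj:
                    matches.append((m + i, v + j))   # queue the matching buffer addresses
        # Assumed advance rule: drop the block that cannot match anything later.
        m_last, v_last = m_block[-1], v_block[-1]
        if m_last <= v_last:
            m += len(m_block)
        if v_last <= m_last:
            v += len(v_block)
    return matches, steps


matrix_idx, matrix_val = [1, 4, 7, 9, 12], [2, 3, 4, 5, 6]
vector_idx, vector_val = [0, 4, 5, 9, 10, 12, 20], [10, 20, 30, 40, 50, 60, 70]
matches, steps = match_indices(matrix_idx, vector_idx)
print(matches, steps)                                          # [(1, 1), (3, 3), (4, 5)] 2
print(sum(matrix_val[a] * vector_val[b] for a, b in matches))  # 620
```

The matched address pairs play the role of the matched index queue: each pair selects one matrix value and one vector value for the multiply-add ALU.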

In many cases, the vector buffers are large enough to hold all of the sparse vector that is required to process the block. In one embodiment, buffer space for 1,024 or 2,048 vector components is provided, depending on whether 32-bit or 64-bit values are used.

When the required components of the vector do not fit in the vector buffers, a multipass approach may be used. The control logic 9140 broadcasts a full buffer of the vector into each dot product engine, which begins iterating through the rows in its matrix buffers. When the dot product engine reaches the end of the vector buffer before reaching the end of a row, it sets a bit in the current row position bit vector 9111 to indicate where it should resume processing the row when the next region of the vector arrives, saves the partial dot product it has accumulated in the location of the matrix value buffer 9105 corresponding to the start of the row (unless the start of the row has a higher index value than any of the vector indices that have been processed so far, in which case that matrix value has not yet been consumed), and advances to the next row. After all of the rows in the matrix buffer have been processed, the dot product engine asserts its done signal to request the next region of the vector, and repeats the process until the entire vector has been read.

FIG. 92 shows an example using specific values. At the start of the computation, a four-component chunk of the matrix has been written into the matrix buffers 9103, 9105, and a four-component region of the vector has been written into the vector buffers 9102, 9104. The row start 9101 and current row position bit vector 9106 both have the value "1010", indicating that the dot product engine's chunk of the matrix contains two rows, one of which starts at the first component in the matrix buffer and one of which starts at the third.

When the first region is processed, the first row in the chunk sees an index match at index 3, computes the product of the corresponding elements of the matrix and vector buffers (4 x 1 = 4), and writes that value into the location of the matrix value buffer 9105 that corresponds to the start of the row. The second row sees one index match at index 1, computes the product of the corresponding components of the vector and matrix, and writes the result (6) into the matrix value buffer 9105 at the position corresponding to its start. The state of the current row position bit vector changes to "0101", indicating that the first component of each row has been processed and that the computation should resume with the second components. The dot product engine then asserts its done line to signal that it is ready for another region of the vector.

When the dot product engine processes the second region of the vector, it sees that row 1 has an index match at index 4, computes the product of the corresponding values of the matrix and vector (5 x 2 = 10), adds that value to the partial dot product that was saved after the first vector region was processed, and outputs the result (14). The second row finds a match at index 7 and outputs the result 38, as shown in the figure. Saving the partial dot products and the state of the computation in this way avoids redundant work on components of the matrix that cannot possibly match indices in later regions of the vector (because the vector is sorted with indices in ascending order), without requiring significant amounts of extra storage for partial products.
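The multipass walk-through can be reproduced with a short simulation. Because the figure itself is not reproduced in the text, only the products 4 x 1, 5 x 2 and the results 4, 6, 14, and 38 are taken from the description; the remaining matrix and vector values below are hypothetical and were chosen to reproduce those results. The sketch keeps partial sums in a separate list rather than overwriting the matrix value buffer, and tracks the resume position per row as an integer instead of a bit vector.

```python
def multipass_dot(row_starts, matrix_idx, matrix_val, vector_regions):
    """Process a sparse vector region by region, carrying partial dot products."""
    starts = [i for i, bit in enumerate(row_starts) if bit]
    ends = starts[1:] + [len(matrix_idx)]
    resume = list(starts)           # where each row resumes in the next region
    partial = [0] * len(starts)     # partial dot product saved per row
    history = []
    for region_idx, region_val in vector_regions:
        lookup = dict(zip(region_idx, region_val))
        region_max = region_idx[-1]             # vector indices are sorted ascending
        for row, end in enumerate(ends):
            pos = resume[row]
            # Consume matrix components that no later region can match.
            while pos < end and matrix_idx[pos] <= region_max:
                if matrix_idx[pos] in lookup:
                    partial[row] += matrix_val[pos] * lookup[matrix_idx[pos]]
                pos += 1
            resume[row] = pos
        history.append(list(partial))           # state after this region
    return partial, history


row_starts = [1, 0, 1, 0]                       # "1010": rows start at components 1 and 3
matrix_idx = [3, 4, 1, 7]
matrix_val = [4, 5, 2, 4]                       # row 2 values are hypothetical
regions = [([0, 1, 2, 3], [9, 3, 7, 1]),        # hypothetical first vector region
           ([4, 5, 7, 9], [2, 6, 8, 5])]        # hypothetical second vector region
result, history = multipass_dot(row_starts, matrix_idx, matrix_val, regions)
print(history[0])  # [4, 6]
print(result)      # [14, 38]
```

After the first region each row resumes at its second component, which corresponds to the "0101" state of the current row position bit vector.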

FIG. 93 shows how the sparse-dense and sparse-sparse dot product engines described above are combined to yield a dot product engine that can handle both types of computation. Given the similarity between the two designs, the only required changes are to instantiate both the sparse-dense dot product engine's match logic 9311 and the sparse-sparse dot product engine's comparator 9320 and matched index queue 9330, along with a set of multiplexers 9350 that determine which modules drive the read address and write data inputs of the buffers 9104-9105, and a multiplexer 9351 that selects whether the output of the matrix value buffer or the latched output of the matrix value buffer is sent to the multiply-add ALU 9110. In one embodiment, these multiplexers are controlled by a configuration bit in the control unit 9140 that is set at the beginning of a matrix-vector multiplication and remains in the same configuration throughout the operation.

A single accelerator stack delivers performance comparable to a server CPU on sparse matrix operations, which makes it an attractive accelerator for smartphones, tablets, and other mobile devices. For example, there are a number of proposals for machine learning applications that train models on one or more servers and then deploy those models on mobile devices to process incoming data streams. Since models tend to be much smaller than the data sets used to train them, the limited capacity of a single accelerator stack is less of a limitation in these applications, while the performance and power efficiency of the accelerator allow mobile devices to process much more complex models than would be feasible on their primary CPUs. Accelerators for non-mobile systems combine multiple stacks to deliver extremely high bandwidth and performance.

Two embodiments of a multi-stack implementation are shown in FIG. 94a and FIG. 94b. Both of these embodiments integrate several accelerator stacks onto a package that is pin-compatible with contemporary server CPUs. FIG. 94a shows a socket replacement implementation with 12 accelerator stacks 9401-9412, and FIG. 94b shows a multi-chip package (MCP) implementation with a processor/set of cores 9430 (e.g., a low-core-count Xeon) and eight stacks 9421-9424. The 12 accelerator stacks in FIG. 94a are placed in an array that fits under the 39 mm x 39 mm heat spreader used in current packages, while the embodiment in FIG. 94b incorporates eight stacks and a processor/set of cores within the same footprint. In one embodiment, the physical dimensions used for the stacks are those of an 8 GB WIO3 stack. Other DRAM technologies may have different dimensions, which may change the number of stacks that fit on a package.

Both of these implementations provide low-latency memory-based communication between the CPU and the accelerators via KTI links. The socket replacement design for Xeon implementations replaces one or more of the CPUs in a multi-socket system and provides a capacity of 96 GB and 3.2 TB/s of stacked DRAM bandwidth. The expected power consumption is 90 W, within the power budget of a Xeon socket. The MCP approach provides 64 GB of capacity and 2.2 TB/s of bandwidth while consuming 60 W of power in the accelerator. This leaves 90 W for the CPU, assuming a 150 W per-socket power budget, which is sufficient to support a mid-range Xeon CPU. If a detailed package design allowed more space for logic in the package, additional stacks or a more powerful CPU could be used, although this would require mechanisms such as the core parking techniques being investigated for the Xeon + FPGA hybrid part to keep total power consumption within the socket's power budget.

Both of these designs can be implemented without requiring silicon interposers or other sophisticated integration technologies. The organic substrates used in current packages allow approximately 300 signals per cm of die perimeter, sufficient to support the inter-stack KTI network and the off-package KTI links. Stacked DRAM designs can typically support logic chips that consume ~10 W of power before cooling becomes a problem, which is well above the estimated 2 W of logic die power for a stack that provides 256 GB/s of bandwidth. Finally, multi-chip packages require 1-2 mm of space between chips for wiring, which is consistent with current designs.

Embodiments may also be implemented on PCIe cards and/or using DDR4-T-based accelerators. Assuming a 300 W power limit for a PCIe card, the card can support 40 accelerator stacks, for a total capacity of 320 GB and a bandwidth of 11 TB/s. However, the long latency and limited bandwidth of the PCIe channel restrict a PCIe-based accelerator to large problems that require only infrequent interaction with the CPU.

Alternatively, accelerator stacks could be used to implement DDR-T DIMM-based accelerators 9501-9516, as shown in FIG. 95. DDR-T is a memory interface designed to be compatible with DDR4 sockets and motherboards. Using the same pinout and connector format as DDR4, DDR-T provides a transaction-based interface 9500 that allows the use of memory devices with different timing characteristics. In this embodiment, the accelerator stacks 9501-9516 act as simple memories when not being used to perform computations.

A DDR-T DIMM provides enough space for 16 accelerator stacks, or 32 if both sides of the card are used, giving a memory capacity of 126-256 GB and a total bandwidth of 4-8 TB/s. However, such a system would consume 120-240 W of power, far more than the ~10 W consumed by a DDR4 DIMM. This would require active cooling, which would be difficult to fit into the limited space allotted to each DIMM on a motherboard. Nevertheless, a DDR-T-based accelerator may be attractive for applications where the user is not willing to give up any CPU performance for acceleration and is willing to pay the cost of a custom motherboard design that includes enough space between accelerator DIMMs for fans or other cooling systems.

In one embodiment, the stacks in a multi-stack accelerator are separate, distinct KTI nodes and are managed as separate devices by the system software. The system firmware determines the routing table within a multi-stack accelerator statically at boot time, based on the number of accelerator stacks present, which should uniquely determine the topology.

In one embodiment, the low-level interface to the accelerator is implemented using Accelerator Abstraction Layer (AAL) software due to its suitability for socket-based accelerators. The accelerator may implement a caching agent as described by the Core Cache Interface specification (CCI), treating the stacked DRAM as private (non-coherent) memory for the accelerator that is not accessible by the host system (i.e., a caching agent + private cache memory configuration, such as CA + PCM). The CCI specification mandates a separate Config/Status Register (CSR) address space for each accelerator, which is used by the driver to control the accelerator. According to the specification, each accelerator communicates its status to the host via a Device Status Memory (DSM), a pinned memory region mapped to host memory that is used to indicate the status of the accelerator. Thus, in a 12-stack system, there are 12 distinct DSM regions managed by a single unified driver agent. These mechanisms may be used to create a command buffer for each stack. A command buffer is a pinned memory region mapped to system memory, implemented as a circular queue managed by the AAL driver. The driver writes commands into each stack's command buffer, and each stack consumes items from its dedicated command buffer. Command production and consumption are thus decoupled in this embodiment.

As an example, consider a system composed of a single accelerator stack connected to a host CPU. The user writes code to perform the following computation: $w_{n+1} = w_n - \alpha A w_n$, where $A$ is a matrix and each $w_x$ is a vector. The software framework and the AAL driver decompose this code into the following sequence of commands:

TRANSMIT - load a sequence of partitions ($w_{n+1}$, $w_n$, $\alpha$, $A$) into private cache memory.

MULTIPLY - multiply a sequence of partitions ($\mathit{tmp} = w_n \times \alpha \times A$).

SUBTRACT - subtract a sequence of partitions ($w_{n+1} = w_n - \mathit{tmp}$).

RECEIVE - store a sequence of partitions containing the result ($w_{n+1}$) to host memory.
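To make the four-command decomposition concrete, the following toy model runs one update step on a 2 x 2 matrix. It is a minimal sketch, not the AAL driver interface: the `ToyAccelerator` class, its method names, and the use of a dictionary as private cache memory are all assumptions made for illustration, and each operand is a single small object rather than a sequence of large partitions.

```python
def csr_matvec(row_ptr, col_idx, values, x):
    """Multiply a CSR matrix by a dense vector."""
    return [sum(values[k] * x[col_idx[k]] for k in range(row_ptr[r], row_ptr[r + 1]))
            for r in range(len(row_ptr) - 1)]


class ToyAccelerator:
    """Hypothetical stand-in for one accelerator stack."""

    def __init__(self):
        self.pcm = {}                        # private cache memory: name -> data

    def transmit(self, host, names):         # host memory -> PCM
        for name in names:
            self.pcm[name] = host[name]

    def multiply(self, dst, matrix, vector, scalar):
        row_ptr, col_idx, values = self.pcm[matrix]
        product = csr_matvec(row_ptr, col_idx, values, self.pcm[vector])
        self.pcm[dst] = [self.pcm[scalar] * y for y in product]

    def subtract(self, dst, left, right):
        self.pcm[dst] = [a - b for a, b in zip(self.pcm[left], self.pcm[right])]

    def receive(self, host, name):           # PCM -> host memory
        host[name] = list(self.pcm[name])


# A = [[2, 0], [1, 3]] in CSR form, w_n = [1, 2], alpha = 0.5.
host = {"A": ([0, 1, 3], [0, 0, 1], [2.0, 1.0, 3.0]),
        "w_n": [1.0, 2.0], "alpha": 0.5, "w_next": [0.0, 0.0]}
acc = ToyAccelerator()
acc.transmit(host, ["w_next", "w_n", "alpha", "A"])
acc.multiply("tmp", "A", "w_n", "alpha")     # tmp = alpha * A * w_n = [1.0, 3.5]
acc.subtract("w_next", "w_n", "tmp")         # w_next = w_n - tmp
acc.receive(host, "w_next")
print(host["w_next"])                        # [0.0, -1.5]
```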

These commands operate on "partitions", coarse-grained (approximately 16 MB-512 MB) units of data located in either host memory or private cache memory. Partitions are intended to map easily onto the blocks of data that MapReduce or Spark distributed computing systems use, to facilitate acceleration of distributed computations using the accelerator. The AAL driver is responsible for creating a static one-to-one mapping of partitions to host memory regions or accelerator stacks. The accelerator stacks each individually map their assigned partitions to their private cache memory (PCM) address space. A partition is described by a partition index, which is a unique identifier, plus (for partitions located in host memory) the corresponding memory region and data format. Partitions located in the PCM are managed by the central control unit, which determines the PCM address region for the partition.

In one embodiment, to initialize the PCM of the accelerator, the host directs the accelerator to load data from host memory. A TRANSMIT operation causes the accelerator to read host memory and store the data read into the accelerator's PCM. The data to be transmitted is described by a sequence of {partition index, host memory region, data format} tuples. To avoid the overhead of data marshaling by the host driver, the accelerator may implement System Protocol 2 (SPL2) shared virtual memory (SVM).

The data format in each tuple describes the layout of the partition in memory. Examples of formats the accelerator supports include compressed sparse row (CSR) and multi-dimensional dense array. For the example above, $A$ may be in CSR format while $w_n$ may be in array format. The command specification includes the necessary information and host memory addresses to direct the accelerator to load all partitions referenced by the TRANSMIT operation into PCM.

Each operation may reference a small number of operands in the form of sequences of partitions. For example, the MULTIPLY operation causes the accelerator to read stacked DRAM and perform a matrix-vector multiplication. In this example it therefore has four operands: the destination vector $\mathit{tmp}$, the multiplier $A$, the multiplicand $w_n$, and the scalar $\alpha$. The destination vector $\mathit{tmp}$ is accumulated into a sequence of partitions specified by the driver as part of the command containing the operation. The command directs the accelerator to initialize the sequence of partitions if required.

A RECEIVE operation causes the accelerator to read PCM and write host memory. This operation may be implemented as an optional field on every other operation, potentially fusing a command that performs an operation such as MULTIPLY with the directive to store the result in host memory. The destination operand of the RECEIVE operation is accumulated on-chip and then streamed into a partition in host memory, which must be pinned by the driver prior to dispatch of the command (unless the accelerator implements SPL2 SVM).

Command dispatch flow

In one embodiment, after inserting commands into the command buffer for a stack, the driver generates a CSR write to notify the stack of new commands to be consumed. The CSR write by the driver is consumed by the central control unit of the accelerator stack, which causes the control unit to generate a series of reads to the command buffer to read the commands dispatched by the driver to the stack. When an accelerator stack completes a command, it writes a status bit to its DSM. The AAL driver either polls or monitors these status bits to determine completion of the command. The output of a TRANSMIT or MULTIPLY operation to the DSM is a status bit indicating completion. For a RECEIVE operation, the output to the DSM is a status bit and a sequence of partitions written into host memory. The driver is responsible for identifying the region of memory to be written by the accelerator. The control unit on the stack is responsible for generating a sequence of read operations to the stacked DRAM and corresponding writes to the destination partitions in host memory.
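The dispatch and completion handshake described above can be illustrated with a toy software model. This sketch assumes that "CSR" in this paragraph denotes a control/status register acting as a doorbell, and it executes commands synchronously for simplicity. All class, method, and field names are hypothetical.

```python
from collections import deque


class AcceleratorStack:
    """Toy model of one accelerator stack (all names hypothetical)."""

    def __init__(self):
        self.command_buffer = deque()  # filled by the driver
        self.dsm = {}                  # status memory: command id -> status

    def doorbell(self):
        """Models the driver's CSR write: read and run the queued commands."""
        while self.command_buffer:
            cmd = self.command_buffer.popleft()
            status = {"done": True}
            if cmd["op"] == "RECEIVE":
                # RECEIVE also reports the partitions written into host memory.
                status["partitions"] = cmd["partitions"]
            self.dsm[cmd["id"]] = status


class Driver:
    def __init__(self, stack):
        self.stack = stack
        self.next_id = 0

    def dispatch(self, op, partitions=()):
        cmd_id = self.next_id
        self.next_id += 1
        self.stack.command_buffer.append(
            {"id": cmd_id, "op": op, "partitions": list(partitions)})
        self.stack.doorbell()  # notify the stack of the new command
        return cmd_id

    def poll(self, cmd_id):
        """Return the DSM status for cmd_id, or None if it has not completed."""
        return self.stack.dsm.get(cmd_id)


driver = Driver(AcceleratorStack())
mul_id = driver.dispatch("MULTIPLY")
rcv_id = driver.dispatch("RECEIVE", partitions=[3, 4])
```

In a real system the doorbell write and command execution would be asynchronous, which is why the driver polls or monitors the status bits rather than reading a return value.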

Software enablement

In one embodiment, users interact with the accelerator by calling a library of routines that move data onto the accelerator, perform sparse-matrix computations, and so on. The API for this library may be as similar as possible to existing sparse-matrix libraries, to reduce the effort required to modify existing applications to take advantage of the accelerator. Another advantage of a library-based interface is that it hides the details of the accelerator and its data formats, allowing programs to take advantage of different implementations by dynamically linking the correct version of the library at run time. Libraries may also be implemented to call the accelerator from distributed computing environments such as Spark.

The area and power consumption of an accelerator stack may be estimated by dividing the design into modules (memories, ALUs, etc.) and gathering data from 14 nm designs of similar structures. To scale to a 10 nm process, a 50% reduction in area may be assumed, along with a 25% reduction in Cdyn and a 20% reduction in leakage power. The area estimates include all on-chip memories and ALUs; wires are assumed to run above the logic and memories. The power estimates include active energy for ALUs and memories, leakage power for memories, and wire power on the major networks. A base clock rate of 1 GHz and a supply voltage of 0.65 V are assumed in both the 14 nm and 10 nm processes. As mentioned above, the ALUs may run at half of the base clock rate, and this is taken into account in the power projections. The KTI links and inter-stack networks are expected to be idle or nearly idle while the accelerator is performing computations, so they are not included in the power estimates. One embodiment tracks activity on these networks and includes them in the power estimates.

The estimates predict that an accelerator as described herein will occupy 17 mm^2 of chip area in a 14 nm process and 8.5 mm^2 in a 10 nm process, with the vast majority of the chip area being occupied by memories. FIG. 96 shows a potential layout for an accelerator intended to sit under a WIO3 DRAM stack, including 64 dot-product engines 8420, 8 vector caches 8415, and an integrated memory controller 8410. The size and placement of the DRAM stack I/O bumps 9601, 9602 shown are specified by the WIO3 standard, and the accelerator logic fits in the space between them. However, for ease of assembly, the logic die under a DRAM stack should be at least about as large as the DRAM die. An actual accelerator chip would therefore measure approximately 8 mm to 10 mm, although most of that area would be unused. In one embodiment, this unused area may be used for accelerators for different types of bandwidth-limited applications.
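The process-scaling assumptions stated above (50% area, 25% Cdyn, and 20% leakage reductions from 14 nm to 10 nm) can be checked with a short calculation. In this sketch, dynamic power is assumed to scale in proportion to Cdyn because supply voltage and clock rate are held constant. The power inputs are normalized to 1.0 and are not measured values.

```python
AREA_SCALE = 0.50      # 50% area reduction, 14 nm -> 10 nm
CDYN_SCALE = 0.75      # 25% Cdyn reduction
LEAKAGE_SCALE = 0.80   # 20% leakage power reduction


def scale_to_10nm(area_mm2, dynamic_power, leakage_power):
    """Scale 14 nm estimates to 10 nm at unchanged voltage and clock rate."""
    return (area_mm2 * AREA_SCALE,
            dynamic_power * CDYN_SCALE,     # P_dyn ~ Cdyn * V^2 * f, V and f fixed
            leakage_power * LEAKAGE_SCALE)


# 17 mm^2 comes from the text; the two power inputs are normalized placeholders.
area_10, dyn_10, leak_10 = scale_to_10nm(17.0, 1.0, 1.0)
```

The scaled area reproduces the 8.5 mm^2 figure quoted for the 10 nm process.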

Stacked DRAM is a memory technology that, as its name suggests, stacks multiple DRAM dies vertically in order to deliver higher bandwidth, tighter physical integration with the compute die, and lower energy per bit than conventional DRAM modules such as DDR4 DIMMs. The table in FIG. 97 compares seven DRAM technologies: non-stacked DDR4 and LPDDR4, pico-modules, the JEDEC-standard High-Bandwidth Memory (HBM2) and Wide I/O (WIO3) stacks, a stacked DRAM, and dis-integrated RAM (DiRAM).

Stacked DRAMs come in two varieties: on-die and beside-die. On-die stacks 8301-8304 connect directly to the logic die or SoC 8305 using through-silicon vias, as shown in FIG. 98a. In contrast, beside-die stacks 8301-8304, shown in FIG. 98b, are placed beside the logic/SoC die 8305 on a silicon interposer or bridge 9802, with the connections between the DRAM and the logic die running through the interposer 9802 and an interface layer 9801. On-die DRAM stacks have the advantage of allowing smaller packages than beside-die stacks, but the disadvantage that it is difficult to attach more than one stack to a logic die, which limits the amount of memory they can provide per die. In contrast, the use of a silicon interposer 9802 allows a logic die to communicate with multiple beside-die stacks, albeit at some cost in area.

Two important characteristics of a DRAM are its bandwidth per stack and its energy per bit, since they define the bandwidth that will fit in a package and the power required to consume that bandwidth. These characteristics make WIO3, ITRI, and DiRAM the most promising technologies for an accelerator as described herein, as pico-modules do not provide enough bandwidth and the energy per bit of HBM2 would significantly increase power consumption.

Of those three technologies, DiRAM has the highest bandwidth and capacity as well as the lowest latency, making it very attractive. WIO3 is another promising option, assuming it becomes a JEDEC standard, and provides good bandwidth and capacity. The ITRI memory has the lowest energy per bit of the three, allowing more bandwidth to fit in a given power budget. It also has low latency, and its SRAM-like interface would reduce the complexity of the accelerator's memory controller. However, the ITRI RAM has the lowest capacity of the three, as its design trades off capacity for performance.

The accelerator described herein is designed to tackle data analytics and machine learning algorithms built on a core sparse matrix-vector multiply (SpMV) primitive. While SpMV often dominates the runtime of these algorithms, other operations are required to implement them as well.

As an example, consider the breadth-first search (BFS) listing shown in FIG. 99. In this example, the bulk of the work is performed by the SpMV on line 4; however, there is also a vector-vector subtraction (line 8), an inner product operation (line 9), and a data-parallel map operation (line 6). Vector subtraction and inner product are relatively straightforward operations that are commonly supported in vector ISAs and need little explanation.

In contrast, the data-parallel map operation is far more interesting because it introduces programmability into a conceptually element-wise operation. The BFS example demonstrates the programmability provided by the map functionality of one embodiment. In particular, the lambda function in BFS (see line 6 in FIG. 99) is used to keep track of when a vertex was first visited. In one embodiment, this is done by passing two arrays and one scalar into the lambda function. The first array passed into the lambda function is the output of the SpMV operation and reflects which vertices are currently reachable. The second array has an entry for each vertex whose value is the iteration number on which the vertex was first seen, or 0 if the vertex has not yet been reached. The scalar passed into the lambda function is simply the loop iteration counter. In one embodiment, the lambda function is compiled into a sequence of scalar operations that is performed on each component of the input vector to generate the output vector.
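The role of the lambda described above can be illustrated with a small software sketch of BFS built from an SpMV step and an element-wise map. The listing of FIG. 99 is not reproduced here, so this is an independent sketch. The function names, the dense adjacency-matrix representation, and the convention that the source vertex is first seen on iteration 1 are assumptions.

```python
def spmv_reach(adj, frontier):
    """Boolean SpMV: v is reachable if some frontier vertex u has an edge u -> v."""
    n = len(adj)
    return [any(adj[u][v] and frontier[u] for u in range(n)) for v in range(n)]


def bfs_first_seen(adj, source):
    n = len(adj)
    first_seen = [0] * n        # 0 means "not reached yet"
    first_seen[source] = 1      # assumed convention: source seen on iteration 1
    frontier = [v == source for v in range(n)]
    iteration = 1
    # The map lambda takes two vector arguments (reach, seen) and one scalar (it).
    mark = lambda reach, seen, it: it if (reach and seen == 0) else seen
    while any(frontier):
        iteration += 1
        reachable = spmv_reach(adj, frontier)
        updated = [mark(r, s, iteration) for r, s in zip(reachable, first_seen)]
        frontier = [new != old for new, old in zip(updated, first_seen)]
        first_seen = updated
    return first_seen


# Directed path 0 -> 1 -> 2, plus an isolated vertex 3.
adj = [[0, 1, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]]
levels = bfs_first_seen(adj, 0)
```

The lambda touches only its own position in each vector, which is what makes the map data-parallel.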

An intermediate representation (IR) of the sequence of operations for BFS is shown in FIG. 99. The BFS lambda IR demonstrates several interesting characteristics. The generated lambda code is guaranteed to have only a single basic block. One embodiment prevents iterative constructs in a lambda function and performs if-conversion to avoid control flow. This constraint significantly reduces the complexity of the computation structure used to execute a lambda, since it does not need to support general control flow.
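If-conversion, as mentioned above, replaces a branch with a computed predicate and a select. The following sketch contrasts a branching form of a BFS-style marking function with an if-converted form. Both functions are illustrative and are not taken from FIG. 99. The arithmetic select stands in for a predicated or conditional-move instruction.

```python
def mark_branchy(reach, seen, it):
    # Form with control flow inside the lambda.
    if reach and seen == 0:
        return it
    return seen


def mark_if_converted(reach, seen, it):
    # Compare operations produce a single-bit predicate ...
    p = int(bool(reach)) & int(seen == 0)
    # ... which selects between the two candidate results without branching.
    return p * it + (1 - p) * seen


agree = all(
    mark_branchy(r, s, 5) == mark_if_converted(r, s, 5)
    for r in (0, 1) for s in (0, 3)
)
```

Because the converted form is straight-line code, it fits in a single basic block.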

All memory operations are performed at the beginning of the basic block (lines 2 through 4 of FIG. 99). When transformed to assembly, the memory operations are hoisted into the preamble of the codelet (lines 2 through 5).

An evaluation of statistics was performed for benchmarks implemented with the accelerator that make use of lambda functions. The number of instructions, the total number of registers, and the total number of loads were recorded to quantify the "complexity" of the various lambda functions of interest. In addition, the critical path length reflects the longest chain of dependent instructions in each lambda function. When the number of instructions is significantly larger than the critical path length, instruction-level parallelism techniques are an applicable way to increase performance. Some loads are invariant for a given invocation of a map or reduce call (all executions of the lambda function load the same value). This situation is referred to as a "lambda invariant load", and analysis is performed to detect it.

Based on the analysis results, a relatively small instruction store and register file are needed to support execution of lambda functions. Techniques that increase concurrency (interleaving execution of multiple lambda functions) add to the size and complexity of the register file; however, a baseline design could have as few as 16 entries. Moreover, a 2R1W register file should be sufficient for all operations if a single-bit predicate register file is also provided for use with comparison and conditional move operations.

As described below, lambda invariant loads are executed in the gather engines, so that they are performed only once per invocation of the lambda function. The values returned by these loads are passed to the processing elements so that they can be read into the lambda datapath's local register file as needed.

In one embodiment, execution of a lambda function is divided between the gather engines and the processing elements (PEs) (e.g., dot-product engines as described above) to exploit the different capabilities of each unit. Lambda functions have three types of arguments: constants, scalars, and vectors. Constants are arguments whose value can be determined at compile time. Scalar variables correspond to the lambda invariant loads described above: their value varies between invocations of the lambda function but remains constant across all of the elements that a given invocation operates on. Vector arguments are arrays of data that the lambda function processes, applying the instructions in the function to each element of the vector arguments.

In one embodiment, a lambda function is specified by a descriptor data structure that contains the code that implements the function, any constants that the function references, and pointers to its input and output variables. To execute a lambda function, the top-level controller sends a command to one or more gather engines that specifies the descriptor of the lambda function and the starting and ending indices of the portion of the function's vector arguments that the gather engine and its associated PE are to process.

When a gather engine receives a command to execute a lambda function, it fetches the function's descriptor from memory and passes the descriptor to its associated PE until it reaches the last section of the descriptor, which contains the addresses of the function's scalar variables. It then fetches each of the function's scalar variables from memory, replaces the address of each variable in the descriptor with its value, and passes the modified descriptor to the PE.

When a PE receives the beginning of the function descriptor from its gather engine, it copies the addresses of the function's vector inputs into control registers, and the PE's fetch hardware begins loading pages of the vector inputs into the PE's local buffers. It then decodes each of the instructions that implement the lambda function and stores the results in a small decoded instruction buffer. The PE then waits for the values of the function's scalar variables to arrive from its gather engine, and for the first page of each of the function's vector arguments to arrive from memory. Once the function's arguments have arrived, the PE begins applying the lambda function to each component in its range of the input vectors, relying on the PE's fetch and writeback hardware to fetch pages of input data and write back pages of output values as required. When the PE reaches the end of its assigned range of data, it signals the top-level controller that it is done and is ready to begin another operation.

FIG. 100 illustrates the format of the descriptors used to specify lambda functions according to one embodiment. In particular, FIG. 100 shows the lambda descriptor format in memory 10001 and the lambda descriptor format passed to a PE 10002. All fields in the descriptor except the instructions are 64-bit values. Instructions are 32-bit values, packed two to a 64-bit word. The descriptor is organized such that the scalar variables appear last, allowing the gather engine to pass everything but the scalar variables to the PE before it fetches the scalar variables from memory. This makes it possible for the PE to decode the function's instructions and begin fetching its vector arguments while waiting for the gather engine to fetch the scalar variables. The lambda function's descriptor and scalar variables are fetched through the vector caches to eliminate redundant DRAM accesses when a lambda function is distributed across multiple gather engine/PE pairs. As illustrated, the lambda descriptor format in memory 10001 may include pointers to the scalar variables 10003, while the gather engine fetches the values of the scalar variables 10004 for the lambda descriptor format as passed to the PE 10002.
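The two-phase handling of a descriptor by the gather engine, as described above, can be sketched as a generator that forwards the leading words unchanged and then substitutes scalar values for scalar addresses. The function name, the dictionary used as memory, and the placeholder descriptor words are hypothetical.

```python
def stream_descriptor(descriptor, n_scalars, memory):
    """Yield descriptor words in the order a PE would receive them.

    descriptor: list of 64-bit words whose last n_scalars entries are the
    addresses of scalar variables; memory: dict mapping address -> value.
    """
    split = len(descriptor) - n_scalars
    # Phase 1: forward everything before the scalar section unmodified, so the
    # PE can start decoding instructions and fetching vector pages early.
    for word in descriptor[:split]:
        yield word
    # Phase 2: fetch each scalar and send its value in place of its address.
    for address in descriptor[split:]:
        yield memory[address]


memory = {0x100: 7, 0x108: 42}                        # hypothetical scalar storage
descriptor = [0xAAAA, 0xBBBB, 0xCCCC, 0x100, 0x108]   # placeholder words + 2 addresses
received = list(stream_descriptor(descriptor, n_scalars=2, memory=memory))
```

Placing the scalar section last is what allows phase 1 to overlap with the scalar fetches.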

In one embodiment, the first word of each descriptor is a header that specifies the meaning of each word in the descriptor. As shown in FIG. 101, the low six bytes of the header word specify the number of vector arguments to the lambda function 10101, the number of constant arguments 10102, the number of vector and scalar outputs 10103-10104, the number of instructions in the function 10105, and the number of scalar variables in the function 10106 (ordered to match where each type of data appears in the descriptor). The seventh byte of the header word specifies the position of the loop start instruction 10107 within the function's code (i.e., the instruction where the hardware should start each iteration after the first). The high-order byte in the word is unused 10108. The remaining words contain the function's instructions, constants, and input and output addresses, in the order shown in the figure.
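The header word described above can be sketched as a simple byte-packing routine. The exact byte positions are assumed here to follow the order in which the fields are listed, with the number of vector arguments in the lowest byte. The field names are illustrative.

```python
FIELDS = ("n_vector_args", "n_constants", "n_vector_outputs",
          "n_scalar_outputs", "n_instructions", "n_scalars", "loop_start")


def pack_header(**counts):
    """Pack the seven one-byte fields into a 64-bit header word."""
    word = 0
    for byte_index, name in enumerate(FIELDS):
        value = counts.get(name, 0)
        if not 0 <= value <= 0xFF:
            raise ValueError(f"{name} does not fit in one byte")
        word |= value << (8 * byte_index)
    return word  # the high-order byte stays zero (unused)


def unpack_header(word):
    """Recover the field values from a header word."""
    return {name: (word >> (8 * i)) & 0xFF for i, name in enumerate(FIELDS)}


header = pack_header(n_vector_args=2, n_constants=1, n_vector_outputs=1,
                     n_instructions=6, n_scalars=1, loop_start=3)
```

With one byte per field, each count and the loop start position are limited to 255 in this sketch.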

No changes to the gather engine datapath are required to support lambda functions, since all necessary operations can be supported by modifying the control logic. When the gather engine fetches a lambda descriptor from memory, it copies lines of the descriptor into both the vector component line buffers and the column descriptor buffer. Descriptor lines that do not contain the addresses of scalar variables are passed to the PE unmodified, while those that do remain in the line buffers until the values of the scalar variables have been fetched from memory and inserted into the line buffers in place of their addresses. The existing gather and pending-reply buffer hardware can support this operation without changes.

Changes to the processing element to support lambda functions

In one embodiment, to support lambda functions, a separate datapath is added to the PE, as shown in FIG. 102, which also shows the matrix value buffer 9105, the matrix index buffer 9103, and the vector value buffer 9104 described above. While the PE's buffers remain the same, their names have been changed to Input Buffer 1, Input Buffer 2, and Input Buffer 3 to reflect their more general use in the present implementation. The SpMV datapath 9110 also remains unchanged from the base architecture. While it would be possible to implement SpMV as a lambda function, building dedicated hardware 10201 reduces power and improves performance on SpMV. Results from the SpMV datapath 9110 and the lambda datapath 10201 are sent to the output buffer 10202 and ultimately to system memory.

FIG. 103 illustrates details of one embodiment of the lambda datapath, which includes a predicate register file 10301, a register file 10302, decode logic 10303, and a decoded instruction buffer 10305, and which is centered around an in-order execution pipeline 10304 that implements a load-store ISA. If a single-issue execution pipeline fails to provide sufficient performance, the data parallelism inherent in lambda operations may be exploited by vectorizing the execution pipeline to process multiple vector components in parallel, which should be a more energy-efficient way to improve parallelism than exploiting the ILP in individual lambda functions. The execution pipeline reads its inputs from, and writes its results back to, a 16- to 32-entry register file 10302 with 64 bits per register. The hardware does not distinguish between integer and floating-point registers, and any register may hold data of any type. The predicate register file 10301 holds the outputs of compare operations, which are used to predicate instruction execution. In one embodiment, the lambda datapath 10304 does not support branch instructions, so any conditional execution must be done through predicated instructions.

At the start of each lambda function, the gather engine places the function's instructions in Input Buffer 3 9104 (the vector value buffer). The decode logic 10303 then decodes each instruction in sequence, placing the results in the 32-entry decoded instruction buffer 10305. This saves the energy cost of repeatedly decoding each instruction on every iteration of the loop.

The lambda datapath contains four special control registers 10306. The index counter register holds the index of the vector component that the lambda datapath is currently processing, and is automatically incremented at the end of each iteration of the lambda. The last index register holds the index of the last vector component that the PE is expected to process. The loop start register holds the location in the decoded instruction buffer of the first instruction in the repeated portion of the lambda function, while the loop end register holds the location of the last instruction in the lambda function.

Execution of a lambda function starts with the first instruction in the decoded instruction buffer and proceeds until the pipeline reaches the instruction pointed to by the loop end register. At that point, the pipeline compares the value of the index counter register to the value of the last index register and performs an implicit branch back to the instruction pointed to by the loop start register if the index counter is less than the last index. Since the index counter register is only incremented at the end of each iteration, this check can be done in advance to avoid bubbles in the pipeline.

This scheme makes it easy to include "preamble" instructions that only need to be executed on the first iteration of a lambda function. For example, a lambda function with two scalar inputs and one constant input might begin with three load instructions to fetch the values of those inputs into the register file, and set the loop start register to point at the fourth instruction in the decoded instruction buffer, so that the inputs are read only once rather than on each iteration of the function.
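The interaction of the four control registers and the preamble can be illustrated with a small interpreter loop. This is a sketch under stated assumptions: the loop end register is taken to be the last position in the decoded instruction buffer, and the comparison against the last index is modeled as happening before the index counter is incremented. The instruction labels are placeholders.

```python
def run_lambda(decoded, loop_start, first_index, last_index):
    """Return (index, instruction) pairs in the order they would execute."""
    loop_end = len(decoded) - 1      # position of the last instruction
    index_counter = first_index
    pc = 0                           # execution starts at the first instruction
    trace = []
    while True:
        trace.append((index_counter, decoded[pc]))
        if pc < loop_end:
            pc += 1
            continue
        # End of an iteration: branch back implicitly if components remain.
        if index_counter < last_index:
            index_counter += 1       # incremented at the end of each iteration
            pc = loop_start
        else:
            return trace


# Two preamble loads, then a three-instruction loop body.
program = ["ld_scalar", "ld_const", "ld_comp", "add", "st_comp"]
trace = run_lambda(program, loop_start=2, first_index=0, last_index=2)
```

The preamble instructions appear once in the trace, while the loop body repeats for each of the three components.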

In one embodiment, the lambda datapath executes a load-store ISA similar to that of many RISC processors. Lambda datapath load and store instructions reference locations in the PE's SRAM buffers. All transfers of data between the SRAM buffers and DRAM are managed by the PE's fetch and writeback hardware. The lambda datapath supports two types of load instructions: scalar and component. Scalar loads fetch the contents of the specified location in one of the SRAM buffers and place it in a register. It is expected that most of the scalar load instructions in a lambda function will occur in the function's preamble, although register pressure may occasionally require scalar loads to be placed in loop bodies.

Component loads fetch components of the lambda function's input vectors. The PE keeps a compute pointer for each buffer that points to the current component of the first input vector mapped into that buffer. A component load specifies a target buffer and an offset from the compute pointer. When a component load instruction is executed, the hardware adds the specified offset to the value of the compute pointer, modulo the size of the appropriate buffer, and loads the data from that location into a register. Component store instructions are similar, but write data to the appropriate address in the PE output buffer 10202.

This approach allows multiple input and output vectors to be supported with the PE's existing fetch hardware. Input vectors alternate between Input Buffer 1 9105 and Input Buffer 2 9103, in the order specified by the lambda function's descriptor, and the fetch hardware reads entire pages of each vector into the buffers at a time.

As an example, consider a function with three input vectors, A, B, and C. Input vector A is mapped onto the PE's Input Buffer 1 9105 at an offset of 0. Input B is mapped onto Input Buffer 2 9103, again at an offset of 0. Input C is mapped onto Input Buffer 1 9105 at an offset of 256 (assuming Tezzaron-style 256-byte pages). The PE's fetch hardware interleaves pages of inputs A and C into Input Buffer 1 9105, while Input Buffer 2 9103 is filled with pages of input B. Each iteration of the lambda function fetches the appropriate component of input A with a component load from Buffer 1 9105 at an offset of 0, the appropriate component of input B with a component load from Buffer 2 9103 at an offset of 0, and the appropriate component of input C with a component load from Buffer 1 9105 at an offset of 256. At the end of each iteration, the hardware increments the computation pointer to advance to the next component of each input vector. When the computation pointer reaches the end of a page, the hardware increments it by (page size × (number of vector inputs mapped onto the buffer − 1)) bytes to advance it to the first component of the next page of the buffer's first input vector. A similar scheme is used to handle lambda functions that generate multiple output vectors.
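The pointer arithmetic in this example can be sketched in a few lines of Python. This is an illustrative model only, not the PE hardware: the 8-byte element size, the 1024-byte buffer size, and the function names are assumptions chosen for the sketch, while the 256-byte page size and the offsets 0 and 256 come from the example above.

```python
PAGE_SIZE = 256  # bytes per page, as in the Tezzaron-style example


def component_address(compute_ptr, offset, buffer_size):
    """Address read by a component load: (pointer + offset) mod buffer size."""
    return (compute_ptr + offset) % buffer_size


def advance(compute_ptr, element_size, vectors_in_buffer, buffer_size):
    """Step to the next component. At a page boundary, skip the pages that
    belong to the other vectors interleaved in the same buffer."""
    compute_ptr += element_size
    if compute_ptr % PAGE_SIZE == 0:
        compute_ptr += PAGE_SIZE * (vectors_in_buffer - 1)
    return compute_ptr % buffer_size


def trace(iterations, element_size=8, vectors_in_buffer=2, buffer_size=1024):
    """Return the (A address, C address) pairs read from Input Buffer 1.

    element_size and buffer_size are assumed values for illustration.
    """
    ptr, seen = 0, []
    for _ in range(iterations):
        seen.append((component_address(ptr, 0, buffer_size),
                     component_address(ptr, PAGE_SIZE, buffer_size)))
        ptr = advance(ptr, element_size, vectors_in_buffer, buffer_size)
    return seen
```

With these assumed sizes, the first 32 iterations walk A's page at addresses 0 to 248 and C's page at 256 to 504; the pointer then jumps to 512, the start of A's next page, so C is read from 768 onward.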

As shown in FIG. 104, in one embodiment, 8 bits are dedicated to the opcode 10401. The remaining 24 bits are split among a single destination 10402 and three input operands 10403-10405, which results in 6-bit register specifiers. Because control flow instructions are not used in one embodiment and constants are sourced from an auxiliary register file, no bit-allocation acrobatics are needed to fit large immediates into the instruction word. In one embodiment, all instructions fit the instruction encoding presented in FIG. 104. Encodings for one particular set of instructions are shown in FIG. 105.
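As a rough illustration of this field layout, the following Python sketch packs and unpacks a 32-bit word made of an 8-bit opcode and four 6-bit register specifiers. The field order (opcode in the most significant bits, followed by the destination and the three inputs) and the function names are assumptions; the text above states only the field widths.

```python
OPCODE_BITS = 8
REG_BITS = 6
REG_MASK = (1 << REG_BITS) - 1


def encode(opcode, dest, src1, src2, src3):
    """Pack one 32-bit instruction word (field order is an assumption)."""
    if not 0 <= opcode < (1 << OPCODE_BITS):
        raise ValueError("opcode must fit in 8 bits")
    word = opcode
    for reg in (dest, src1, src2, src3):
        if not 0 <= reg <= REG_MASK:
            raise ValueError("register specifier must fit in 6 bits")
        word = (word << REG_BITS) | reg
    return word


def decode(word):
    """Unpack a 32-bit word into (opcode, dest, src1, src2, src3)."""
    src3 = word & REG_MASK
    src2 = (word >> 6) & REG_MASK
    src1 = (word >> 12) & REG_MASK
    dest = (word >> 18) & REG_MASK
    opcode = (word >> 24) & 0xFF
    return opcode, dest, src1, src2, src3
```

A 6-bit specifier addresses 64 registers, and 8 + 4 × 6 = 32 bits fills the word exactly, which is why no space remains for immediates.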

一実施例において、比較命令は、比較述語を用いる。例示的な比較述語のエンコーディングは、図１０６のテーブルに列挙されている。 In one embodiment, the compare instruction uses a compare predicate. Exemplary compare predicate encodings are listed in the table of FIG. 106.

As detailed above, in some instances it is advantageous to use an accelerator for a given task. However, there may be instances where doing so is not feasible and/or not advantageous: for example, an accelerator is not available, moving the data to the accelerator incurs too large a penalty, or the accelerator is slower than a processor core. Consequently, in some embodiments, additional instructions may provide performance and/or energy efficiency for some tasks.

行列の乗算の例が図１０９に示される。行列の乗算は、Ｃ［ｒｏｗｓＡ，ｃｏｌｓＢ］＋＝Ａ［ｒｏｗｓＡ，ｃｏｍｍ］×Ｂ［ｃｏｍｍ，ｃｏｌｓＢ］である。ＭＡＤＤ（積和演算命令）に関して本明細書で用いられるように、行列×ベクトル乗算命令は、ｃｏｌｓＢ＝１を設定することにより規定される。この命令は、行列入力Ａ、ベクトル入力Ｂ及びベクトル出力Ｃを取る。５１２ビットベクトルのコンテキストにおいて、倍精度についてｒｏｗｓＡ＝８であり、単精度について１６である。 An example of matrix multiplication is shown in Figure 109. Matrix multiplication is C[rowsA, colsB] += A[rowsA, comm] x B[comm, colsB]. As used herein for MADD (multiply and add instruction), a matrix x vector multiplication instruction is specified by setting colsB = 1. This instruction takes a matrix input A, a vector input B and a vector output C. In the context of 512 bit vectors, rowsA = 8 for double precision and 16 for single precision.

Many CPUs perform dense matrix multiplication via SIMD instructions that operate on one-dimensional vectors. Detailed herein are instructions (and underlying hardware) that extend the SIMD approach to include two-dimensional matrices (tiles) of size 8×4, 8×8, and larger. Using such an instruction, a small matrix can be multiplied by a vector and the result added to a destination vector. Because all of the operations are performed in a single instruction, the energy cost of fetching instructions and data is amortized over a large number of multiply-adds. In addition, some embodiments use a binary tree to perform the summation (reduction) and/or include a register file embedded in the multiplier array to hold the input matrix as a collection of registers.

For matrix multiplication, execution of an embodiment of the MADD instruction computes:
for (i = 0; i < N; i++)  // N = rowsA (e.g., 8), the packed data element size (e.g., vector length)
  for (k = 0; k < M; k++)  // comm = M
    C[i] += A[i,k] × B[k];
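The loop nest above can be written as a small reference model in Python. This is a sketch of the arithmetic only; the function name `madd` and the list-of-lists representation are illustrative choices, not part of the described instruction.

```python
def madd(c, a, b):
    """Return C + A x B for a matrix A (rowsA x comm) and a vector B (comm)."""
    n, m = len(a), len(b)
    if len(c) != n or any(len(row) != m for row in a):
        raise ValueError("shape mismatch")
    out = list(c)
    for i in range(n):        # rowsA
        for k in range(m):    # comm
            out[i] += a[i][k] * b[k]
    return out
```

For instance, `madd([1, 1], [[1, 2], [3, 4]], [10, 100])` accumulates 10 + 200 into the first element and 30 + 400 into the second.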

典型的には、「Ａ」オペランドは、８つのパックドデータレジスタに格納される。「Ｂ」オペランドは、１つのパックドデータレジスタに格納されてよい、又は、メモリから読み出されてよい。「Ｃ」オペランドは、１つのパックドデータレジスタに格納される。 Typically, the "A" operands are stored in eight packed data registers. The "B" operand may be stored in one packed data register or may be read from memory. The "C" operand is stored in one packed data register.

The remainder of the discussion of this instruction describes an "octoMADD" version, which multiplies eight packed data element sources (e.g., eight packed data registers) by one packed data element source (e.g., a single register). Expanding the inner loop gives the following execution for a sequential implementation (of the octoMADD instruction):
for (i = 0; i < 8; i++)
  C[i] += A[i,0] × B[0] +
          A[i,1] × B[1] +
          A[i,2] × B[2] +
          A[i,3] × B[3] +
          A[i,4] × B[4] +
          A[i,5] × B[5] +
          A[i,6] × B[6] +
          A[i,7] × B[7];

示されるように、「Ａ」及び「Ｂ」オペランドの対応するパックドデータ要素位置からのパックドデータ要素の各乗算の後に続いて加算がある。シーケンシャルな加算は、最小の一時なストレージを用いて複数のより簡単なオペレーションに分解される。 As shown, each multiplication of packed data elements from the corresponding packed data element positions of the "A" and "B" operands is followed by an addition. The sequential additions are decomposed into multiple simpler operations with minimal temporary storage.

いくつかの実施例では、２分木アプローチが用いられる。２分木は、並列に２つのサブツリーを合計して、次に、結果をまとめて加算することにより、レイテンシを最小化する。これは、２分木全体に再帰的に適用される。最終結果は、「Ｃ」宛先オペランドに追加される。 In some embodiments, a binary tree approach is used. A binary tree minimizes latency by summing two subtrees in parallel and then adding the results together. This is applied recursively through the binary tree. The final result is added to the "C" destination operand.

Expanding the inner loop gives the following execution for a binary-tree implementation (of the octoMADD instruction):
for (i = 0; i < 8; i++)
  C[i] += ((A[i,0] × B[0] + A[i,1] × B[1]) +
           (A[i,2] × B[2] + A[i,3] × B[3])) +
          ((A[i,4] × B[4] + A[i,5] × B[5]) +
           (A[i,6] × B[6] + A[i,7] × B[7]));
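To make the difference between the sequential and binary-tree groupings concrete, the following Python sketch reduces the eight products both ways and reports the number of adder levels the tree needs. The function names are hypothetical, and the sketch models only the order of additions, not circuit timing.

```python
def sequential_sum(products):
    """Left-to-right chain: one dependent addition per extra operand."""
    total = products[0]
    for p in products[1:]:
        total = total + p
    return total


def tree_sum(products):
    """Pairwise reduction; returns (sum, number of adder levels)."""
    level, depth = list(products), 0
    while len(level) > 1:
        nxt = [level[j] + level[j + 1] for j in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])   # odd element passes through unchanged
        level, depth = nxt, depth + 1
    return level[0], depth


def octomadd_lane(c_i, a_row, b):
    """One lane: C[i] + sum over k of A[i,k] * B[k], reduced with the tree."""
    products = [x * y for x, y in zip(a_row, b)]
    total, depth = tree_sum(products)
    return c_i + total, depth
```

For eight products the tree needs three levels of additions before the final accumulate into C, whereas the chain needs seven dependent additions. With floating-point data the two groupings can round differently, because floating-point addition is not associative.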

図１１０は、２分木低減ネットワークを用いたｏｃｔｏＭＡＤＤ命令処理を示す。図は、オペレーションの１つのベクトルレーンを示す。５１２ビットベクトルを用いて、倍精度のｏｃｔｏＭＡＤＤは、８つのレーンを有する一方、単精度のｏｃｔｏＭＡＤＤは、１６個のレーンを有する。 Figure 110 shows octoMADD instruction processing using a binary tree reduction network. The diagram shows one vector lane of the operation. With 512-bit vectors, double precision octoMADD has 8 lanes while single precision octoMADD has 16 lanes.

図示されるように、複数の乗算回路１１００１－１１０１５は、Ａ［ｉ，０］×Ｂ［０］、Ａ［ｉ，１］×Ｂ［１］、Ａ［ｉ，２］×Ｂ［２］、Ａ［ｉ，３］×Ｂ［３］、Ａ［ｉ，４］×Ｂ［４］、Ａ［ｉ，５］×Ｂ［５］、Ａ［ｉ，６］×Ｂ［６］及びＡ［ｉ，７］×Ｂ［７］のそれぞれについての乗算を実行する。この例において、ｉはＡレジスタである。典型的には、乗算は、並列で実行される。 As shown, multiple multiplication circuits 11001-11015 perform multiplications for A[i,0] x B[0], A[i,1] x B[1], A[i,2] x B[2], A[i,3] x B[3], A[i,4] x B[4], A[i,5] x B[5], A[i,6] x B[6], and A[i,7] x B[7], respectively. In this example, i is the A register. Typically, the multiplications are performed in parallel.

乗算回路１１００１－１１０１５に結合される加算回路１１０１７－１１０２３は、乗算回路１１００１－１１０１５の結果を加算する。例えば、加算回路は、Ａ［ｉ，０］×Ｂ［０］＋Ａ［ｉ，１］×Ｂ［１］、Ａ［ｉ，２］×Ｂ［２］＋Ａ［ｉ，３］×Ｂ［３］、Ａ［ｉ，４］×Ｂ［４］＋Ａ［ｉ，５］×Ｂ［５］、及び、Ａ［ｉ，６］×Ｂ［６］＋Ａ［ｉ，７］×Ｂ［７］を実行する。典型的には、総和は、並列で実行される。 Addition circuits 11017-11023 coupled to the multiplication circuits 11001-11015 add the results of the multiplication circuits 11001-11015. For example, the addition circuits perform A[i,0]xB[0]+A[i,1]xB[1], A[i,2]xB[2]+A[i,3]xB[3], A[i,4]xB[4]+A[i,5]xB[5], and A[i,6]xB[6]+A[i,7]xB[7]. Typically, the summations are performed in parallel.

The results of the initial summations are added together using addition circuit 11025. Addition circuit 11027 then adds the result of this addition to the original (old) value 11031 from the destination to produce a new value 11033, which is stored in the destination.

In most implementations, an instruction cannot specify eight independent source registers plus a register or memory operand for the other source and a register destination. Therefore, in some instances, the octoMADD instruction specifies a limited range of eight registers for the matrix operand. For example, the octoMADD matrix operand may be registers 0-7. In some embodiments, a first register is specified, and the registers consecutive to it provide the additional (e.g., seven) registers.

図１１１は、積和演算命令を処理するために、プロセッサにより実行される方法の実施形態を示す。 Figure 111 illustrates an embodiment of a method performed by a processor to process a multiply-accumulate instruction.

１１１０１において、命令がフェッチされる。例えば、積和演算命令がフェッチされる。積和演算命令は、オペコード、第１のパックドデータオペランド（メモリ又はレジスタのいずれか一方）のためのフィールド、第２から第Ｎのパックドデータソースオペランドのための１又は複数のフィールド、及び、パックドデータ宛先オペランドを含む。いくつかの実施形態において、積和演算命令は、書き込みマスクオペランドを含む。いくつかの実施形態において、命令は、命令キャッシュからフェッチされる。 At 11101, an instruction is fetched. For example, a multiply-and-accumulate instruction is fetched. The multiply-and-accumulate instruction includes an opcode, a field for a first packed data operand (either memory or register), one or more fields for the second through Nth packed data source operands, and a packed data destination operand. In some embodiments, the multiply-and-accumulate instruction includes a write mask operand. In some embodiments, the instruction is fetched from an instruction cache.

１１１０３において、フェッチされた命令がデコードされる。例えば、フェッチされた積和演算命令は、本明細書で詳細に説明されるようなデコード回路によりデコードされる。 At 11103, the fetched instruction is decoded. For example, the fetched multiply-accumulate instruction is decoded by a decode circuit as described in detail herein.

１１１０５において、デコードされた命令のソースオペランドと関連付けられたデータ値が取得される。一連の積和演算命令を実行する場合に、メインレジスタファイルから繰り返しこれらの値を読み出す必要性を回避するために、（以下に詳細に説明されるように）これらのレジスタのコピーが、乗算器－加算器のアレイ自体に構築される。当該コピーは、メインレジスタファイルのキャッシュとして維持される。 At 11105, data values associated with the source operands of the decoded instruction are obtained. To avoid the need to repeatedly read these values from the main register file when executing a series of multiply-accumulate instructions, copies of these registers are built in the multiplier-adder array itself (as described in more detail below). These copies are maintained as a cache of the main register file.

At 11107, the decoded instruction is executed by execution circuitry (hardware) as detailed herein to, for each packed data element position of the second through Nth packed data source operands: 1) multiply the data element at that position of that source operand by the data element at the corresponding position of the first source operand to generate a temporary result; 2) sum the temporary results; 3) add the sum of the temporary results to the data element at the corresponding position of the packed data destination operand; and 4) store the result of that addition into the corresponding packed data element position of the packed data destination operand. N is typically indicated by the opcode or a prefix. For example, for octoMADD, N is 9 (such that there are eight registers for A). The multiplications may be performed in parallel.

いくつかの実施形態では、１１１０９において、命令がコミット又はリタイアされる。 In some embodiments, at 11109, the instruction is committed or retired.

図１１２は、積和演算命令を処理するために、プロセッサにより実行される方法の実施形態を示す。 Figure 112 illustrates an embodiment of a method performed by a processor to process a multiply-accumulate instruction.

１１２０１において、命令がフェッチされる。例えば、積和演算命令がフェッチされる。融合積和演算命令は、オペコード、第１のパックドデータオペランド（メモリ又はレジスタのいずれか一方）のためのフィールド、第２から第Ｎのパックドデータソースオペランドのための１又は複数のフィールド、及び、パックドデータ宛先オペランドを含む。いくつかの実施形態において、融合積和演算命令は、書き込みマスクオペランドを含む。いくつかの実施形態において、命令は、命令キャッシュからフェッチされる。 At 11201, an instruction is fetched. For example, a multiply-and-accumulate instruction is fetched. The fused multiply-and-accumulate instruction includes an opcode, a field for a first packed data operand (either memory or register), one or more fields for the second through Nth packed data source operands, and a packed data destination operand. In some embodiments, the fused multiply-and-accumulate instruction includes a write mask operand. In some embodiments, the instruction is fetched from an instruction cache.

１１２０３において、フェッチされた命令がデコードされる。例えば、フェッチされた積和演算命令は、本明細書で詳細に説明されるようなデコード回路によりデコードされる。 At 11203, the fetched instruction is decoded. For example, the fetched multiply-accumulate instruction is decoded by a decode circuit as described in detail herein.

１１２０５において、デコードされた命令のソースオペランドと関連付けられたデータ値が取得される。一連の積和演算命令を実行する場合に、メインレジスタファイルから繰り返しこれらの値を読み出す必要性を回避するために、（以下に詳細に説明されるように）これらのレジスタのコピーが、乗算器－加算器のアレイ自体に構築される。当該コピーは、メインレジスタファイルのキャッシュとして維持される。 At 11205, data values associated with the source operands of the decoded instruction are obtained. To avoid the need to repeatedly read these values from the main register file when executing a series of multiply-accumulate instructions, copies of these registers are built in the multiplier-adder array itself (as described in more detail below). These copies are maintained as a cache of the main register file.

At 11207, the decoded instruction is executed by execution circuitry (hardware) as detailed herein to, for each packed data element position of the second through Nth packed data source operands: 1) multiply the data element at that position of that source operand by the data element at the corresponding position of the first source operand to generate a temporary result; 2) sum the temporary results pairwise; 3) add the sum of the temporary results to the data element at the corresponding position of the packed data destination operand; and 4) store the result of that addition into the corresponding packed data element position of the packed data destination operand. N is typically indicated by the opcode or a prefix. For example, for octoMADD, N is 9 (such that there are eight registers for A). The multiplications may be performed in parallel.

いくつかの実施形態では、１１２０９において、命令がコミット又はリタイアされる。 In some embodiments, at 11209, the instruction is committed or retired.

In some embodiments, when a MADD instruction is first encountered, the renamer synchronizes the cached copies with the main register file by injecting micro-operations that copy the main registers into the cache. Subsequent MADD instructions continue to use the cached copies for as long as the registers remain unchanged. Some embodiments anticipate the octoMADD instruction's use of a limited range of registers and broadcast writes to both the main register file and the cached copies as register values are produced.

Figures 113(A)-113(C) show exemplary hardware for executing a MADD instruction. Figure 113(A) shows the components that execute a MADD instruction. Figure 113(B) shows a subset of these components. In particular, a plurality of multiplication circuits 11323 multiply packed data elements of the source registers, with each multiplication circuit 11323 coupled to an addition circuit 11327. Each addition circuit feeds the next addition circuit 11327 in a chain. A selector 11321 selects between an external input and the feedback of an addition circuit. A register file is embedded within the multiplier-adder array as part of the register file and read multiplexer 11325. Specific registers are hardwired to each column of multiplier-adders.

Figure 113(B) shows the register file and read multiplexer 11325. The register file 11327 is a plurality of registers (e.g., four or eight registers) that store A as a cache. The correct register is selected using read mux 11329.

The expected use of the octoMADD instruction is as follows:
// Compute C += A × B
// A is loaded as an 8×8 tile in REG 0-7
// B is loaded from memory as 1×8 tiles
// C is loaded and stored as a 24×8 tile in REG 8-31
for (outer loop) {
  load [24,8] tile of C matrix into REG 8-31  // 24 loads
  for (middle loop) {
    load [8,8] tile of A matrix into REG 0-7  // 8 loads
    for (inner loop) {
      // 24 iterations
      REG [8-31 from inner loop] += REG 0-7 * memory[inner loop];  // 1 load
    }
  }
  store [24,8] tile of C matrix from REG 8-31  // 24 stores
}

The inner loop contains 24 octoMADD instructions. Each reads one "B" operand from memory and accumulates into one of the 24 "C" accumulators. The middle loop loads the eight "A" registers with a new tile. The outer loop loads and stores the 24 "C" accumulators. The inner loop may be unrolled, with prefetching added, to achieve high utilization (>90%) of the octoMADD hardware.
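The three-level loop can be modelled in plain Python to check that the tiling reproduces an untiled computation. This is a sketch under stated assumptions: the overall shapes (C is R × I, A is I × K, B is R × K), the way the outer loop walks two dimensions, the requirement that R be a multiple of 24 and I and K multiples of 8, and all function names are illustrative. The text above specifies only the tile sizes and the load and store counts.

```python
TILE_C = 24   # "C" accumulators held in REG 8-31
TILE_A = 8    # rows and columns of the "A" tile held in REG 0-7


def octomadd(c_vec, a_tile, b_vec):
    """c_vec[i] + sum over k of a_tile[i][k] * b_vec[k], for an 8x8 tile."""
    return [c_vec[i] + sum(a_tile[i][k] * b_vec[k] for k in range(TILE_A))
            for i in range(TILE_A)]


def tiled_madd(c, a, b):
    """Compute c[r][i] += sum over k of a[i][k] * b[r][k], tile by tile."""
    rows, size_i, comm = len(c), len(a), len(a[0])
    for r0 in range(0, rows, TILE_C):            # outer loop, accumulator tiles
        for i0 in range(0, size_i, TILE_A):      # outer loop, second dimension
            acc = [c[r0 + r][i0:i0 + TILE_A] for r in range(TILE_C)]  # 24 loads
            for k0 in range(0, comm, TILE_A):    # middle loop
                a_tile = [a[i0 + i][k0:k0 + TILE_A]
                          for i in range(TILE_A)]                     # 8 loads
                for r in range(TILE_C):          # inner loop: 24 octoMADDs
                    b_vec = b[r0 + r][k0:k0 + TILE_A]                 # 1 load
                    acc[r] = octomadd(acc[r], a_tile, b_vec)
            for r in range(TILE_C):              # 24 stores
                c[r0 + r][i0:i0 + TILE_A] = acc[r]
    return c
```

In this model each A tile is loaded once and reused by 24 octoMADD operations, and each C accumulator stays resident across the whole middle loop, which is the reuse pattern the load and store counts above describe.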

The following figures detail exemplary architectures and systems for implementing the embodiments described above. In particular, aspects (e.g., registers, pipelines, etc.) of the core types discussed above (e.g., out-of-order, scalar, SIMD) are described. Additionally, systems including coprocessors (e.g., accelerators, cores) and system-on-a-chip implementations are shown. In some embodiments, one or more of the hardware components and/or instructions described above are emulated, as detailed below, or implemented as software modules.
Exemplary Register Architecture

Figure 125 is a block diagram of a register architecture 12500 according to one embodiment of the invention. In the embodiment shown, there are 32 vector registers 12510 that are 512 bits wide; these registers are referenced as zmm0 through zmm31. The lower 256 bits of the lower 16 zmm registers are overlaid on registers ymm0-15. The lower 128 bits of the lower 16 zmm registers (the lower 128 bits of the ymm registers) are overlaid on registers xmm0-15. The instruction format QAC00 appropriate for a particular vector operates on these overlaid register files as shown in the table below.

That is, the vector length field QAB59B selects between a maximum length and one or more other shorter lengths, where each such shorter length is half the length of the preceding length, and instruction templates without the vector length field QAB59B operate on the maximum vector length. Furthermore, in one embodiment, the class B instruction templates of the instruction format QAC00 appropriate for a particular vector operate on packed or scalar single/double-precision floating point data and packed or scalar integer data. Scalar operations are operations performed on the lowest-order data element position in a zmm/ymm/xmm register. The higher-order data element positions are either left the same as they were prior to the instruction or zeroed, depending on the embodiment.
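The register overlay described above can be pictured with a minimal Python model in which each 512-bit register is an integer and the ymm and xmm views are simply its low 256 and 128 bits. The class and method names are hypothetical. Only reads through the narrower views are modelled, because whether a narrower write preserves or zeroes the upper bits depends on the embodiment.

```python
class VectorRegisterFile:
    """32 registers of 512 bits each, modelled as Python integers."""

    def __init__(self):
        self.zmm = [0] * 32

    def write_zmm(self, n, value):
        self.zmm[n] = value & ((1 << 512) - 1)

    def read_ymm(self, n):
        """Low 256 bits of zmm n (an overlay exists only for the lower 16)."""
        return self.zmm[self._low16(n)] & ((1 << 256) - 1)

    def read_xmm(self, n):
        """Low 128 bits of zmm n (an overlay exists only for the lower 16)."""
        return self.zmm[self._low16(n)] & ((1 << 128) - 1)

    @staticmethod
    def _low16(n):
        if not 0 <= n < 16:
            raise IndexError("only the lower 16 zmm registers are overlaid")
        return n
```

Writing a value to zmm3 and then reading ymm3 or xmm3 returns progressively shorter low-order slices of the same storage, which is what "overlaid" means here.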

書き込みマスクレジスタ１２５１５－図示された実施形態中では、８個の書き込みマスクレジスタ（ｋ０からｋ７）が存在し、各々６４ビットのサイズである。代替的な実施形態において、書き込みマスクレジスタ１２５１５は、１６ビットのサイズである。前述したように、本発明の一実施形態において、ベクトルマスクレジスタｋ０は、書き込みマスクとして用いられることができない。ｋ０を通常示すであろうエンコーディングが書き込みマスクに用いられる場合、それは、０ｘＦＦＦＦのハードワイヤ型書き込みマスクを選択し、その命令に対する書き込みマスキングを効果的に無効にする。 Write Mask Registers 12515 - In the illustrated embodiment, there are eight write mask registers (k0 through k7), each 64 bits in size. In an alternative embodiment, the write mask registers 12515 are 16 bits in size. As previously mentioned, in one embodiment of the invention, the vector mask register k0 cannot be used as a write mask. If an encoding that would normally indicate k0 is used for the write mask, it selects a hardwired write mask of 0xFFFF, effectively disabling write masking for that instruction.
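The k0 special case can be sketched as follows. This is a simplified model with hypothetical function names: it assumes 16 lanes (matching the 0xFFFF mask above), and the merge-versus-zero choice for masked-off lanes is an assumption of the sketch rather than something stated in this passage.

```python
def effective_mask(k_regs, kkk, lanes=16):
    """Mask selected by a 3-bit encoding; the k0 encoding means all lanes."""
    all_ones = (1 << lanes) - 1
    return all_ones if kkk == 0 else k_regs[kkk] & all_ones


def masked_write(dest, result, mask, zeroing=False):
    """Write result lanes whose mask bit is set; merge or zero the others."""
    out = []
    for lane, (old, new) in enumerate(zip(dest, result)):
        if (mask >> lane) & 1:
            out.append(new)
        else:
            out.append(0 if zeroing else old)
    return out
```

With the k0 encoding every lane is written, so the instruction behaves as if it were unmasked, regardless of what the k0 register itself holds.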

汎用レジスタ１２５２５－図示される実施形態において、メモリオペランドをアドレス指定する既存のｘ８６アドレッシングモードと共に用いられる１６個の６４ビット汎用レジスタが存在する。これらのレジスタは、ＲＡＸ、ＲＢＸ、ＲＣＸ、ＲＤＸ、ＲＢＰ、ＲＳＩ、ＲＤＩ、ＲＳＰ及びＲ８からＲ１５という名称により参照される。 General Purpose Registers 12525 - In the illustrated embodiment, there are sixteen 64-bit general purpose registers that are used in conjunction with existing x86 addressing modes to address memory operands. These registers are referenced by the names RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP, and R8 through R15.

ＭＭＸパックド整数フラットレジスタファイル１２５５０がエイリアスされる、スカラ浮動小数点スタックレジスタファイル（ｘ８７スタック）１２５４５－図示される実施形態では、ｘ８７スタックは、ｘ８７命令セット拡張を用いて３２／６４／８０ビットの浮動小数点データに対するスカラ浮動小数点演算を実行するために用いられる８要素スタックである。一方、ＭＭＸレジスタは、６４ビットのパックド整数データに対する演算を実行するために、並びに、ＭＭＸ及びＸＭＭレジスタ間で実行されるいくつかの演算用にオペランドを保持するために用いられる。 Scalar floating point stack register file (x87 stack) 12545, into which the MMX packed integer flat register file 12550 is aliased - in the illustrated embodiment, the x87 stack is an 8-element stack used to perform scalar floating point operations on 32/64/80 bit floating point data using the x87 instruction set extensions, while the MMX registers are used to perform operations on 64 bit packed integer data, as well as to hold operands for some operations performed between the MMX and XMM registers.

Alternative embodiments of the invention may use wider or narrower registers. Additionally, alternative embodiments of the invention may use more, fewer, or different register files and registers.
Exemplary Core Architectures, Processors and Computer Architectures

Processor cores may be implemented in different ways, for different purposes, and in different processors. For example, implementations of such cores may include: 1) a general purpose in-order core intended for general-purpose computing; 2) a high performance general purpose out-of-order core intended for general-purpose computing; 3) a special purpose core intended primarily for graphics and/or scientific (throughput) computing. Implementations of different processors may include: 1) a CPU including one or more general purpose in-order cores intended for general-purpose computing and/or one or more general purpose out-of-order cores intended for general-purpose computing; and 2) a coprocessor including one or more special purpose cores intended primarily for graphics and/or scientific (throughput) computing. Such different processors lead to different computer system architectures, which may include: 1) the coprocessor on a separate chip from the CPU; 2) the coprocessor on a separate die in the same package as the CPU; 3) the coprocessor on the same die as the CPU (in which case such a coprocessor is sometimes referred to as special purpose logic, such as integrated graphics and/or scientific (throughput) logic, or as special purpose cores); and 4) a system on a chip that may include on the same die the described CPU (sometimes referred to as the application core(s) or application processor(s)), the above-described coprocessor, and additional functionality. Exemplary core architectures are described next, followed by descriptions of exemplary processors and computer architectures.
Exemplary Core Architectures
In-Order and Out-of-Order Core Block Diagram

図１２６Ａは、本発明の実施形態に係る、例示的なインオーダパイプライン及び例示的なレジスタリネーミング・アウトオブオーダ発行／実行パイプラインの両方を示すブロック図である。図１２６Ｂは、本発明の実施形態に係るプロセッサに含まれるインオーダアーキテクチャコアの例示的な実施形態及び例示的なレジスタリネーミング・アウトオブオーダ発行／実行アーキテクチャコアの両方を示すブロック図である。図１２６Ａ～図１２６Ｂの実線の枠は、インオーダパイプライン及びインオーダコアを示し、一方、破線の枠の選択的な追加部分は、レジスタリネーミング・アウトオブオーダ発行／実行パイプライン及びコアを示す。インオーダ態様がアウトオブオーダ態様のサブセットであることを前提に、アウトオブオーダ態様が説明される。 Figure 126A is a block diagram illustrating both an exemplary in-order pipeline and an exemplary register renaming, out-of-order issue/execution pipeline according to an embodiment of the present invention. Figure 126B is a block diagram illustrating both an exemplary embodiment of an in-order architecture core and an exemplary register renaming, out-of-order issue/execution architecture core included in a processor according to an embodiment of the present invention. The solid boxes in Figures 126A-B show the in-order pipeline and the in-order core, while the optional addition of dashed boxes shows the register renaming, out-of-order issue/execution pipeline and core. The out-of-order aspect is described with the understanding that the in-order aspect is a subset of the out-of-order aspect.

図１２６Ａにおいて、プロセッサパイプライン１２６００は、フェッチステージ１２６０２、長さデコードステージ１２６０４、デコードステージ１２６０６、割り当てステージ１２６０８、リネーミングステージ１２６１０、スケジューリング（ディスパッチ又は発行としても知られている）ステージ１２６１２、レジスタ読み出し／メモリ読み出しステージ１２６１４、実行ステージ１２６１６、ライトバック／メモリ書き込みステージ１２６１８、例外処理ステージ１２６２２、及び、コミットステージ１２６２４を含む。 In FIG. 126A, the processor pipeline 12600 includes a fetch stage 12602, a length decode stage 12604, a decode stage 12606, an allocation stage 12608, a renaming stage 12610, a scheduling (also known as dispatch or issue) stage 12612, a register read/memory read stage 12614, an execution stage 12616, a writeback/memory write stage 12618, an exception handling stage 12622, and a commit stage 12624.

Figure 126B shows a processor core 12690 including a front end unit 12630 coupled to an execution engine unit 12650, both of which are coupled to a memory unit 12670. The core 12690 may be a reduced instruction set computing (RISC) core, a complex instruction set computing (CISC) core, a very long instruction word (VLIW) core, or a hybrid or alternative core type. As yet another option, the core 12690 may be a special purpose core such as, for example, a network or communication core, a compression engine, a coprocessor core, a general purpose computing graphics processing unit (GPGPU) core, or a graphics core.

フロントエンドユニット１２６３０は、命令キャッシュユニット１２６３４に結合される分岐予測ユニット１２６３２を含み、命令キャッシュユニット１２６３４は、命令トランスレーションルックアサイドバッファ（ＴＬＢ）１２６３６に結合され、命令トランスレーションルックアサイドバッファ（ＴＬＢ）１２６３６は、命令フェッチユニット１２６３８に結合され、命令フェッチユニット１２６３８は、デコードユニット１２６４０に結合される。デコードユニット１２６４０（又は、デコーダ）は、命令をデコードし、１又は複数のマイクロオペレーション、マイクロコードエントリポイント、マイクロ命令、他の命令又は他の制御信号を出力として生成してよく、これらは、元の命令からデコードされる、又は、そうでなければ元の命令を反映する、又は、元の命令から導き出される。デコードユニット１２６４０は、様々な異なるメカニズムを用いて実装されてよい。好適なメカニズムの例では、限定されることはないが、ルックアップテーブル、ハードウェア実装、プログラマブル論理アレイ（ＰＬＡ）、マイクロコードリードオンリメモリ（ＲＯＭ）などを含む。一実施形態において、コア１２６９０は、特定のマクロ命令に関するマイクロコードを（例えば、デコードユニット１２６４０内、そうでなければ、フロントエンドユニット１２６３０内に）格納するマイクロコードＲＯＭ又は他の媒体を含む。デコードユニット１２６４０は、実行エンジンユニット１２６５０内のリネーム／アロケータユニット１２６５２に結合される。 The front-end unit 12630 includes a branch prediction unit 12632 coupled to an instruction cache unit 12634, which is coupled to an instruction translation lookaside buffer (TLB) 12636, which is coupled to an instruction fetch unit 12638, which is coupled to a decode unit 12640. The decode unit 12640 (or decoder) may decode instructions and generate as output one or more micro-operations, microcode entry points, microinstructions, other instructions, or other control signals that are decoded from or otherwise reflect or are derived from the original instruction. The decode unit 12640 may be implemented using a variety of different mechanisms. Examples of suitable mechanisms include, but are not limited to, a lookup table, a hardware implementation, a programmable logic array (PLA), a microcode read only memory (ROM), etc. In one embodiment, the core 12690 includes a microcode ROM or other medium that stores microcode for particular macro-instructions (e.g., in the decode unit 12640 or otherwise in the front-end unit 12630). The decode unit 12640 is coupled to a rename/allocator unit 12652 in the execution engine unit 12650.

The execution engine unit 12650 includes the rename/allocator unit 12652 coupled to a retirement unit 12654 and a set of one or more scheduler units 12656. The scheduler units 12656 represent any number of different schedulers, including reservation stations, a central instruction window, and the like. The scheduler units 12656 are coupled to the physical register file units 12658. Each of the physical register file units 12658 represents one or more physical register files, different ones of which store one or more different data types, such as scalar integer, scalar floating point, packed integer, packed floating point, vector integer, vector floating point, and status (e.g., an instruction pointer that is the address of the next instruction to be executed). In one embodiment, the physical register file units 12658 comprise a vector register unit, a write mask register unit, and a scalar register unit. These register units may provide architectural vector registers, vector mask registers, and general-purpose registers. The physical register file units 12658 are overlapped by the retirement unit 12654 to illustrate the various ways in which register renaming and out-of-order execution may be implemented (e.g., using a reorder buffer and a retirement register file; using a future file, a history buffer, and a retirement register file; or using register maps and a pool of registers). The retirement unit 12654 and the physical register file units 12658 are coupled to the execution clusters 12660. Each execution cluster 12660 includes a set of one or more execution units 12662 and a set of one or more memory access units 12664. The execution units 12662 may perform various operations (e.g., shifts, addition, subtraction, multiplication) on various types of data (e.g., scalar floating point, packed integer, packed floating point, vector integer, vector floating point). While some embodiments may include a number of execution units dedicated to specific functions or sets of functions, other embodiments may include only one execution unit or multiple execution units that each perform all functions. The scheduler units 12656, physical register file units 12658, and execution clusters 12660 are shown as being possibly plural because certain embodiments create separate pipelines for certain types of data or operations (e.g., a scalar integer pipeline, a scalar floating point/packed integer/packed floating point/vector integer/vector floating point pipeline, and/or a memory access pipeline, each having its own scheduler unit, physical register file unit, and/or execution cluster; in the case of a separate memory access pipeline, certain embodiments are implemented in which only the execution cluster of this pipeline has the memory access units 12664). It should also be understood that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order.

The set of memory access units 12664 is coupled to the memory unit 12670, which includes a data TLB unit 12672 coupled to a data cache unit 12674, which is in turn coupled to a level 2 (L2) cache unit 12676. In one exemplary embodiment, the memory access units 12664 may include a load unit, a store address unit, and a store data unit, each of which is coupled to the data TLB unit 12672 in the memory unit 12670. The instruction cache unit 12634 is further coupled to the level 2 (L2) cache unit 12676 in the memory unit 12670. The L2 cache unit 12676 is coupled to one or more other levels of cache and eventually to main memory.

By way of example, the exemplary register renaming, out-of-order issue/execution core architecture may implement the pipeline 12600 as follows: 1) the instruction fetch unit 12638 performs the fetch and length decoding stages 12602 and 12604; 2) the decode unit 12640 performs the decode stage 12606; 3) the rename/allocator unit 12652 performs the allocation stage 12608 and the renaming stage 12610; 4) the scheduler units 12656 perform the scheduling stage 12612; 5) the physical register file units 12658 and the memory unit 12670 perform the register read/memory read stage 12614, and the execution cluster 12660 performs the execute stage 12616; 6) the memory unit 12670 and the physical register file units 12658 perform the write back/memory write stage 12618; 7) various units may be involved in the exception handling stage 12622; and 8) the retirement unit 12654 and the physical register file units 12658 perform the commit stage 12624.
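The stage-to-unit assignment enumerated above can be written down as a small table. The Python sketch below only transcribes the stages and units named in that paragraph; the data layout and the `stages_for` helper are illustrative assumptions, not part of the disclosure.

```python
# Stage-to-unit mapping for pipeline 12600, transcribed from the paragraph above.
# Each entry: (stage reference numeral, stage name, units that perform it).
PIPELINE_12600 = [
    (12602, "fetch", ["instruction fetch unit 12638"]),
    (12604, "length decode", ["instruction fetch unit 12638"]),
    (12606, "decode", ["decode unit 12640"]),
    (12608, "allocation", ["rename/allocator unit 12652"]),
    (12610, "renaming", ["rename/allocator unit 12652"]),
    (12612, "scheduling", ["scheduler units 12656"]),
    (12614, "register read/memory read",
     ["physical register file units 12658", "memory unit 12670"]),
    (12616, "execute", ["execution cluster 12660"]),
    (12618, "write back/memory write",
     ["memory unit 12670", "physical register file units 12658"]),
    (12622, "exception handling", ["various units"]),
    (12624, "commit",
     ["retirement unit 12654", "physical register file units 12658"]),
]


def stages_for(unit):
    """Hypothetical helper: names of the stages a given unit takes part in, in pipeline order."""
    return [name for _, name, units in PIPELINE_12600 if unit in units]
```

Querying `stages_for("physical register file units 12658")` shows that the register file takes part in the read, write-back, and commit stages, which is why it appears both before and after the execution cluster in the flow.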

The core 12690 may support one or more instruction sets, including the instructions described herein (e.g., the x86 instruction set (with some extensions that have been added in newer versions); the MIPS instruction set of MIPS Technologies of Sunnyvale, Calif.; and the ARM instruction set (with optional additional extensions such as NEON) of ARM Holdings of Sunnyvale, Calif.). In one embodiment, the core 12690 includes logic to support a packed data instruction set extension (e.g., AVX1, AVX2), thereby allowing the operations used by many multimedia applications to be performed using packed data.

It should be understood that the core may support multithreading (executing two or more parallel sets of operations or threads), and may do so in a variety of ways, including time-sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that the physical core is simultaneously multithreading), or a combination thereof (e.g., time-sliced fetching and decoding followed by simultaneous multithreading, as in Intel Hyper-Threading Technology).

While register renaming is described in the context of out-of-order execution, it should be understood that register renaming may also be used in an in-order architecture. The illustrated embodiment of the processor includes separate instruction and data cache units 12634/12674 and a shared L2 cache unit 12676, whereas alternative embodiments may have a single internal cache for both instructions and data, such as a level 1 (L1) internal cache, or multiple levels of internal cache. In some embodiments, the system may include a combination of an internal cache and an external cache that is external to the core and/or the processor. Alternatively, all of the cache may be external to the core and/or the processor.
Specific Exemplary In-Order Core Architecture

FIGS. 127A-127B illustrate block diagrams of a more specific exemplary in-order core architecture, in which the core would be one of several logic blocks (including other cores of the same type and/or different types) in a chip. Depending on the application, the logic blocks communicate through a high-bandwidth interconnect network (e.g., a ring network) with some fixed-function logic, memory I/O interfaces, and other necessary I/O logic.

FIG. 127A is a block diagram of a single processor core, along with its connection to the on-die interconnect network 12702 and its local subset of the level 2 (L2) cache 12704, according to embodiments of the present invention. In one embodiment, an instruction decoder 12700 supports the x86 instruction set with a packed data instruction set extension. An L1 cache 12706 allows low-latency access to cache memory for the scalar and vector units. In one embodiment (to simplify the design), a scalar unit 12708 and a vector unit 12710 use separate register sets (scalar registers 12712 and vector registers 12714, respectively), and data transferred between them is written to memory and then read back in from the level 1 (L1) cache 12706. Alternative embodiments of the present invention may use a different approach (e.g., use a single register set, or include a communication path that allows data to be transferred between the two register files without being written and read back).

The local subset of the L2 cache 12704 is part of a global L2 cache that is divided into separate local subsets, one per processor core. Each processor core has a direct access path to its own local subset of the L2 cache 12704. Data read by a processor core is stored in its L2 cache subset 12704 and can be accessed quickly, in parallel with other processor cores accessing their own local L2 cache subsets. Data written by a processor core is stored in its own L2 cache subset 12704 and is flushed from other subsets, if necessary. The ring network ensures coherency for shared data. The ring network is bidirectional to allow agents such as processor cores, L2 caches, and other logic blocks to communicate with each other within the chip. Each ring data path is 1012 bits wide per direction.
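The read and write behavior of the per-core L2 subsets can be sketched with a toy model. The Python code below is a minimal illustration under stated assumptions: the class name `SlicedL2` is hypothetical, writes are modeled as write-through to a backing `memory` dictionary, and a "flush" from another subset is modeled as simply dropping that subset's copy. The disclosure does not specify the coherency protocol, and the ring network is not modeled.

```python
class SlicedL2:
    """Toy model: one local L2 subset per core, each a dict of address -> value."""

    def __init__(self, num_cores):
        self.subsets = [dict() for _ in range(num_cores)]

    def read(self, core, addr, memory):
        # Data read by a core is stored in that core's own local subset.
        subset = self.subsets[core]
        if addr not in subset:
            subset[addr] = memory[addr]
        return subset[addr]

    def write(self, core, addr, value, memory):
        # Data written by a core is stored in its own subset...
        self.subsets[core][addr] = value
        # Assumption: write-through to backing memory, to keep the toy simple.
        memory[addr] = value
        # ...and flushed from the other subsets so that no stale copy remains.
        for other, subset in enumerate(self.subsets):
            if other != core:
                subset.pop(addr, None)
```

After core 0 writes an address that core 1 had previously read, core 1's copy is gone and its next read fetches the new value, which is the coherency property the paragraph attributes to the ring network.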

FIG. 127B is an expanded view of part of the processor core in FIG. 127A, according to embodiments of the present invention. FIG. 127B includes an L1 data cache 12706A portion of the L1 cache 12704, as well as more detail regarding the vector unit 12710 and the vector registers 12714. Specifically, the vector unit 12710 is a 16-wide vector processing unit (VPU) (see the 16-wide ALU 12728), which executes one or more of integer, single-precision floating-point, and double-precision floating-point instructions. The VPU supports swizzling of the register inputs with a swizzle unit 12720, numeric conversion with numeric conversion units 12722A-B, and replication of the memory input with a replication unit 12724. Write mask registers 12726 allow predication of the resulting vector writes.
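Predicating a vector write with a write mask can be illustrated as follows. This Python sketch assumes merging behavior (lanes whose mask bit is clear keep the destination's previous value); the disclosure states only that the write mask registers allow predication, so that choice, along with the function name `masked_add`, is an assumption made for illustration.

```python
VECTOR_WIDTH = 16  # the text describes a 16-wide VPU


def masked_add(dest, a, b, write_mask):
    """Element-wise a + b across 16 lanes.

    Lane i of the result takes a[i] + b[i] only if bit i of write_mask is set;
    otherwise it keeps dest[i] (merging behavior, assumed here).
    """
    assert len(dest) == len(a) == len(b) == VECTOR_WIDTH
    return [
        (a[i] + b[i]) if (write_mask >> i) & 1 else dest[i]
        for i in range(VECTOR_WIDTH)
    ]
```

With `write_mask = 0b101`, only lanes 0 and 2 are updated; the remaining fourteen lanes of the destination are left unchanged.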

FIG. 128 is a block diagram of a processor 12800 that may have more than one core, may have an integrated memory controller, and may have integrated graphics, according to embodiments of the present invention. The solid-lined boxes in FIG. 128 illustrate a processor 12800 with a single core 12802A, a system agent 12810, and a set of one or more bus controller units 12816, while the optional addition of the dashed-lined boxes illustrates an alternative processor 12800 with multiple cores 12802A-N, a set of one or more integrated memory controller units 12814 in the system agent unit 12810, and special-purpose logic 12808.

Thus, different implementations of the processor 12800 may include: 1) a CPU with the special-purpose logic 12808 being integrated graphics and/or scientific (throughput) logic (which may include one or more cores), and the cores 12802A-N being one or more general-purpose cores (e.g., general-purpose in-order cores, general-purpose out-of-order cores, or a combination of the two); 2) a coprocessor with the cores 12802A-N being a large number of special-purpose cores intended primarily for graphics and/or scientific (throughput) workloads; and 3) a coprocessor with the cores 12802A-N being a large number of general-purpose in-order cores. Thus, the processor 12800 may be a general-purpose processor, a coprocessor, or a special-purpose processor, such as, for example, a network or communication processor, a compression engine, a graphics processor, a GPGPU (general-purpose graphics processing unit), a high-throughput many integrated core (MIC) coprocessor (including 30 or more cores), or an embedded processor. The processor may be implemented on one or more chips. The processor 12800 may be a part of, and/or may be implemented on, one or more substrates using any of a number of process technologies, such as, for example, BiCMOS, CMOS, or NMOS.

The memory hierarchy includes one or more levels of cache within the cores, a set of one or more shared cache units 12806, and external memory (not shown) coupled to the set of integrated memory controller units 12814. The set of shared cache units 12806 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof. While in one embodiment a ring-based interconnect unit 12812 interconnects the integrated graphics logic 12808 (which is an example of, and is also referred to herein as, special-purpose logic), the set of shared cache units 12806, and the system agent unit 12810/integrated memory controller units 12814, alternative embodiments may use any number of well-known techniques for interconnecting such units. In one embodiment, coherency is maintained between the one or more cache units 12806 and the cores 12802A-N.

In some embodiments, one or more of the cores 12802A-N are capable of multithreading. The system agent 12810 includes those components that coordinate and operate the cores 12802A-N. The system agent unit 12810 may include, for example, a power control unit (PCU) and a display unit. The PCU may be or may include the logic and components needed for regulating the power state of the cores 12802A-N and the integrated graphics logic 12808. The display unit is for driving one or more externally connected displays.

The cores 12802A-N may be homogeneous or heterogeneous in terms of architectural instruction set; that is, two or more of the cores 12802A-N may be capable of executing the same instruction set, while others may be capable of executing only a subset of that instruction set or a different instruction set.
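One consequence of cores that are heterogeneous in instruction set is that software (or a scheduler) must place work only on cores that support the instructions it needs. The Python sketch below illustrates this with set containment; the core names, the instruction-set labels `"base"` and `"vector"`, and the `eligible_cores` function are hypothetical and are not taken from the disclosure.

```python
def eligible_cores(core_isas, required):
    """Return, in sorted order, the names of cores whose instruction set
    covers every instruction-set feature in `required`."""
    return sorted(name for name, isa in core_isas.items() if required <= isa)


# Hypothetical instruction-set labels; not from the disclosure.
CORES = {
    "core_A": {"base", "vector"},
    "core_B": {"base", "vector"},  # same instruction set as core_A
    "core_N": {"base"},            # supports only a subset of that instruction set
}
```

Here a thread that needs only `"base"` may run on any of the three cores, while a thread that also needs `"vector"` is limited to `core_A` and `core_B`.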
Exemplary Computer Architecture

FIGS. 129-132 are block diagrams of exemplary computer architectures. Other system designs and configurations known in the art for laptops, desktops, handheld PCs, personal digital assistants, engineering workstations, servers, network devices, network hubs, switches, embedded processors, digital signal processors (DSPs), graphics devices, video game devices, set-top boxes, microcontrollers, mobile phones, portable media players, handheld devices, and various other electronic devices are also suitable. In general, a wide variety of systems or electronic devices capable of incorporating a processor and/or other execution logic as disclosed herein are suitable.

Referring now to FIG. 129, shown is a block diagram of a system 12900 in accordance with one embodiment of the present invention. The system 12900 may include one or more processors 12910, 12915, which are coupled to a controller hub 12920. In one embodiment, the controller hub 12920 includes a graphics memory controller hub (GMCH) 12990 and an input/output hub (IOH) 12950 (which may be on separate chips); the GMCH 12990 includes memory and graphics controllers to which are coupled a memory 12940 and a coprocessor 12945; and the IOH 12950 couples input/output (I/O) devices 12960 to the GMCH 12990. Alternatively, one or both of the memory and graphics controllers may be integrated within the processor (as described herein), with the memory 12940 and the coprocessor 12945 coupled directly to the processor 12910 and to the controller hub 12920, which is in a single chip with the IOH 12950.

The optional nature of the additional processors 12915 is denoted in FIG. 129 with broken lines. Each processor 12910, 12915 may include one or more of the processing cores described herein and may be some version of the processor 12800.

The memory 12940 may be, for example, dynamic random access memory (DRAM), phase change memory (PCM), or a combination of the two. For at least one embodiment, the controller hub 12920 communicates with the processors 12910, 12915 via a multi-drop bus such as a front side bus (FSB), a point-to-point interface such as QuickPath Interconnect (QPI), or a similar connection 12995.

In one embodiment, the coprocessor 12945 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, a compression engine, a graphics processor, a GPGPU, or an embedded processor. In one embodiment, the controller hub 12920 may include an integrated graphics accelerator.

There can be a variety of differences between the physical resources 12910, 12915 in terms of a spectrum of metrics of merit, including architectural, microarchitectural, thermal, and power consumption characteristics, and the like.

In one embodiment, the processor 12910 executes instructions that control data processing operations of a general type. Coprocessor instructions may be embedded within these instructions. The processor 12910 recognizes these coprocessor instructions as being of a type that should be executed by the attached coprocessor 12945. As appropriate, the processor 12910 issues these coprocessor instructions (or control signals representing the coprocessor instructions) to the coprocessor 12945 over a coprocessor bus or other interconnect. The coprocessor 12945 accepts and executes the received coprocessor instructions.
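The recognize-and-issue flow described above can be sketched in a few lines. In this Python illustration the instruction strings, the opcode `"matmul"`, and the function `run` are hypothetical; the sketch only shows a host processor separating coprocessor-type instructions from its own instruction stream and handing them off, not any particular bus or encoding.

```python
def run(program, coprocessor_opcodes):
    """Walk an instruction stream in order.

    Instructions whose opcode is recognized as a coprocessor type are issued
    to the coprocessor; all others are executed by the host processor.
    Returns (host_executed, issued_to_coprocessor).
    """
    host_executed, issued_to_coprocessor = [], []
    for instr in program:
        opcode = instr.split()[0]  # first token is treated as the opcode
        if opcode in coprocessor_opcodes:
            issued_to_coprocessor.append(instr)
        else:
            host_executed.append(instr)
    return host_executed, issued_to_coprocessor
```

Given the stream `["add r1 r2", "matmul t0 t1", "sub r3 r1"]` and the coprocessor opcode set `{"matmul"}`, the host keeps the `add` and `sub` instructions and the `matmul` instruction is issued to the coprocessor.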

Referring now to FIG. 130, shown is a block diagram of a first more specific exemplary system 13000 in accordance with an embodiment of the present invention. As shown in FIG. 130, the multiprocessor system 13000 is a point-to-point interconnect system and includes a first processor 13070 and a second processor 13080 coupled via a point-to-point interconnect 13050. Each of the processors 13070 and 13080 may be some version of the processor 12800. In one embodiment of the present invention, the processors 13070 and 13080 are the processors 12910 and 12915, respectively, while a coprocessor 13038 is the coprocessor 12945. In another embodiment, the processors 13070 and 13080 are the processor 12910 and the coprocessor 12945, respectively.

The processors 13070 and 13080 are shown including integrated memory controller (IMC) units 13072 and 13082, respectively. The processor 13070 also includes, as part of its bus controller units, point-to-point (P-P) interfaces 13076 and 13078; similarly, the second processor 13080 includes P-P interfaces 13086 and 13088. The processors 13070, 13080 may exchange information via a point-to-point (P-P) interface 13050 using the P-P interface circuits 13078, 13088. As shown in FIG. 130, the IMCs 13072 and 13082 couple the processors to respective memories, namely a memory 13032 and a memory 13034, which may be portions of main memory locally attached to the respective processors.

The processors 13070, 13080 may each exchange information with a chipset 13090 via individual P-P interfaces 13052, 13054 using the point-to-point interface circuits 13076, 13094, 13086, 13098. The chipset 13090 may optionally exchange information with the coprocessor 13038 via a high-performance interface 13092. In one embodiment, the coprocessor 13038 is a special-purpose processor, such as, for example, a high-throughput MIC processor, a network or communication processor, a compression engine, a graphics processor, a GPGPU, or an embedded processor.

A shared cache (not shown) may be included in either processor or outside of both processors, yet connected with the processors via the P-P interconnect, such that the local cache information of either or both processors may be stored in the shared cache if a processor is placed into a low-power mode.

The chipset 13090 may be coupled to a first bus 13016 via an interface 13096. In one embodiment, the first bus 13016 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third-generation I/O interconnect bus, although the scope of the present invention is not so limited.

As shown in FIG. 130, various I/O devices 13014 may be coupled to the first bus 13016, along with a bus bridge 13018 that couples the first bus 13016 to a second bus 13020. In one embodiment, one or more additional processors 13015, such as coprocessors, high-throughput MIC processors, GPGPU accelerators (e.g., graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processor, are coupled to the first bus 13016. In one embodiment, the second bus 13020 may be a low pin count (LPC) bus. In one embodiment, various devices may be coupled to the second bus 13020, including, for example, a keyboard and/or mouse 13022, communication devices 13027, and a storage unit 13028, such as a disk drive or other mass storage device, which may include instructions/code and data 13030. Further, an audio I/O 13024 may be coupled to the second bus 13020. Note that other architectures are possible. For example, instead of the point-to-point architecture of FIG. 130, a system may implement a multi-drop bus or other such architecture.

Referring now to FIG. 131, shown is a block diagram of a second more specific exemplary system 13100 in accordance with an embodiment of the present invention. Like elements in FIGS. 130 and 131 bear like reference numerals, and certain aspects of FIG. 130 have been omitted from FIG. 131 in order to avoid obscuring other aspects of FIG. 131.

FIG. 131 illustrates that the processors 13070, 13080 may include integrated memory and I/O control logic ("CL") 13072 and 13082, respectively. Thus, the CL 13072, 13082 include integrated memory controller units and include I/O control logic. FIG. 131 illustrates that not only are the memories 13032, 13034 coupled to the CL 13072, 13082, but I/O devices 13114 are also coupled to the control logic 13072, 13082. Legacy I/O devices 13115 are coupled to the chipset 13090.

Referring now to FIG. 132, shown is a block diagram of an SoC 13200 in accordance with an embodiment of the present invention. Elements similar to those in FIG. 128 bear like reference numerals. Also, dashed-lined boxes are optional features on more advanced SoCs. In FIG. 132, an interconnect unit 13202 is coupled to: an application processor 13210, which includes a set of one or more cores 12802A-N, including cache units 12804A-N, and shared cache units 12806; a system agent unit 12810; bus controller units 12816; integrated memory controller units 12814; a set of one or more coprocessors 13220, which may include integrated graphics logic, an image processor, an audio processor, and a video processor; a static random access memory (SRAM) unit 13230; a direct memory access (DMA) unit 13232; and a display unit 13240 for coupling to one or more external displays. In one embodiment, the coprocessors 13220 include a special-purpose processor, such as, for example, a network or communication processor, a compression engine, a GPGPU, a high-throughput MIC processor, or an embedded processor.

本明細書で開示されるメカニズムについての実施形態は、ハードウェア、ソフトウェア、ファームウェア又はそのような実装アプローチの組み合わせで実装されてよい。本発明の実施形態は、少なくとも１つのプロセッサ、ストレージシステム（揮発性及び不揮発性メモリ、及び／又は、ストレージエレメントを含む）、少なくとも１つの入力デバイス及び少なくとも１つの出力デバイスを有するプログラマブルシステム上で実行するコンピュータプログラム又はプログラムコードとして実装されてよい。 Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of such implementation approaches. Embodiments of the invention may be implemented as a computer program or program code running on a programmable system having at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.

プログラムコード、例えば、図１３０に示されるコード１３０３０は、本明細書で説明される機能を実行し、出力情報を生成するための入力命令に適用されてよい。出力情報は、既知の様式で、１又は複数の出力デバイスに適用されてよい。本願の目的のために、処理システムは、例えば、デジタル信号プロセッサ（ＤＳＰ）、マイクロコントローラ、特定用途向け集積回路（ＡＳＩＣ）又はマイクロプロセッサなどのプロセッサを有する任意のシステムを含む。 Program code, such as code 13030 shown in FIG. 130, may be applied to the input instructions to perform functions described herein and to generate output information. The output information may be applied to one or more output devices, in known fashion. For purposes of this application, a processing system includes any system having a processor, such as, for example, a digital signal processor (DSP), a microcontroller, an application specific integrated circuit (ASIC), or a microprocessor.

プログラムコードは、処理システムと通信するために、高水準手続き型又はオブジェクト指向プログラミング言語で実装されてよい。プログラムコードは、必要に応じて、アセンブリ言語又は機械語で実装されてもよい。実際には、本明細書で説明されるメカニズムは、任意の特定のプログラミング言語の範囲に限定されない。いずれの場合でも、言語は、コンパイル型言語又はインタープリタ型言語であってよい。 The program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. The program code may also be implemented in assembly or machine language, if desired. Indeed, the mechanisms described herein are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.

少なくとも１つの実施形態のうちの１又は複数の態様では、マシンにより読み出される場合、本明細書で説明される技術を実行するために、マシンに論理を構築させるプロセッサ内の様々な論理を表す機械可読媒体上に格納された代表的な命令により実施されてよい。「ＩＰコア」として知られるそのような表現は、有形の機械可読媒体上に格納され、論理又はプロセッサを実際に作る製造マシンにロードするために、様々な顧客又は製造施設に供給されてよい。 In one or more aspects of at least one embodiment, the techniques may be implemented by representative instructions stored on a machine-readable medium that represent various logic within a processor that, when read by a machine, causes the machine to construct the logic to perform the techniques described herein. Such representations, known as "IP cores," may be stored on tangible machine-readable media and supplied to various customers or manufacturing facilities for loading into manufacturing machines that actually create the logic or processors.

そのような機械可読記憶媒体は、制限なく、マシン又はデバイスにより製造又は形成される非一時的な有形の構成をした物品を含んでよく、ハードディスク、フロッピーディスクを含むその他のタイプのディスク、光ディスク、コンパクトディスクリードオンリメモリ（ＣＤ－ＲＯＭ）、コンパクトディスクリライタブル（ＣＤ－ＲＷ）、及び、磁気－光ディスクなどの記憶媒体と、リードオンリメモリ（ＲＯＭ）、ダイナミックランダムアクセスメモリ（ＤＲＡＭ）などのランダムアクセスメモリ（ＲＡＭ）、スタティックランダムアクセスメモリ（ＳＲＡＭ）、消去可能プログラマブルリードオンリメモリ（ＥＰＲＯＭ）、フラッシュメモリ、電気的消去可能プログラマブルリードオンリメモリ（ＥＥＰＲＯＭ）、相変化メモリ（ＰＣＭ）、磁気又は光カードなどの半導体デバイスと、又は、電子的命令を格納するのに適したその他のタイプの媒体とを含む。 Such machine-readable storage media may include, without limitation, non-transitory, tangible arrangements of articles manufactured or formed by a machine or device, including: storage media such as hard disks and any other type of disk, including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks; semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs) and static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), and phase change memories (PCMs); magnetic or optical cards; or any other type of media suitable for storing electronic instructions.

状況に応じて、本発明の実施形態は、ハードウェア記述言語（ＨＤＬ）などの命令を含む又は設計データを含む非一時的な有形の機械可読媒体を含み、本明細書で説明される構造、回路、装置、プロセッサ及び／又はシステム機能を定義する。そのような実施形態では、プログラム製品とも称されてよい。 Depending on the circumstances, embodiments of the invention may also include a non-transitory, tangible machine-readable medium that contains instructions or design data, such as hardware description language (HDL), that defines the structures, circuits, devices, processors and/or system features described herein. Such embodiments may also be referred to as program products.

エミュレーション（バイナリ変換、コードモーフィングなどを含む） Emulation (including binary translation, code morphing, etc.)

いくつかの場合では、ソース命令セットからターゲット命令セットに命令を変換するために、命令変換器が用いられてよい。例えば、命令変換器は、コアにより処理されるために、命令を１又は複数の他の命令に、変換（例えば、静的なバイナリ変換、動的なコンパイルを含む動的なバイナリ変換を用いる）、モーフィング、エミュレート又は別の方法でコンバートしてよい。命令変換器は、ソフトウェア、ハードウェア、ファームウェア又はそれらの組み合わせで実装されてよい。命令変換器は、プロセッサ上、プロセッサ外、又は、プロセッサ上の一部及びプロセッサ外の一部にあってよい。 In some cases, an instruction converter may be used to convert instructions from a source instruction set to a target instruction set. For example, the instruction converter may translate (e.g., using static binary translation, dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert instructions into one or more other instructions for processing by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on-processor, off-processor, or some on-processor and some off-processor.
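
The conversion of one source instruction into one or more target instructions can be illustrated with a minimal sketch. It assumes a hypothetical source instruction set whose `ADDM` instruction adds a memory operand directly to a register, and a hypothetical load/store target instruction set that has no such instruction. The mnemonics, the tuple encoding, and the scratch register `t0` are illustrative assumptions and are not taken from the disclosure.

```python
# Illustrative static translation sketch; both instruction sets are hypothetical.
# Source ISA: ("MOVI", reg, imm), ("ADD", dst, src), ("ADDM", reg, addr)
# Target ISA: ("LI", reg, imm),   ("ADD", dst, src), ("LOAD", reg, addr)

SCRATCH = "t0"  # assumed scratch register reserved for the converter


def convert_instruction(insn):
    """Convert one source instruction into a list of target instructions."""
    op = insn[0]
    if op == "MOVI":
        _, reg, imm = insn
        return [("LI", reg, imm)]          # one-to-one mapping
    if op == "ADD":
        return [insn]                      # identical in both sets
    if op == "ADDM":
        _, reg, addr = insn
        # One source instruction becomes two target instructions.
        return [("LOAD", SCRATCH, addr), ("ADD", reg, SCRATCH)]
    raise ValueError(f"unsupported source instruction: {op}")


def convert_program(source):
    """Statically convert a whole source program, instruction by instruction."""
    target = []
    for insn in source:
        target.extend(convert_instruction(insn))
    return target
```

A dynamic converter would apply the same mapping at run time, typically to the code that is about to execute rather than to the whole program in advance.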

図１３３は、本発明の実施形態に係る、ソース命令セット内のバイナリ命令をターゲット命令セット内のバイナリ命令に変換するソフトウェア命令変換器の利用を対比するブロック図である。例示された実施形態では、命令変換器はソフトウェア命令変換器であるが、代替的に、命令変換器は、ソフトウェア、ファームウェア、ハードウェア又はそれらの様々な組み合わせで実装されてよい。図１３３は、少なくとも１つのｘ８６命令セットコア１３３１６を有するプロセッサにより実質的に実行され得るｘ８６バイナリコード１３３０６を生成するために、高水準言語１３３０２におけるプログラムがｘ８６コンパイラ１３３０４を用いてコンパイルされ得ることを示す。少なくとも１つのｘ８６命令セットコア１３３１６を有するプロセッサは、少なくとも１つのｘ８６命令セットコアを有するインテル（登録商標）プロセッサと実質的に同じ結果を実現するために、（１）インテル（登録商標）ｘ８６命令セットコアの命令セットの実質的な部分、又は、（２）少なくとも１つのｘ８６命令セットコアを有するインテル（登録商標）プロセッサ上で実行することを目的としたアプリケーション又は他のソフトウェアのオブジェクトコードのバージョンを、互換性のある状態で実行する又は別の方法で処理することにより、少なくとも１つのｘ８６命令セットコアを有するインテル（登録商標）プロセッサと同じ機能を実質的に実行できる任意のプロセッサを表す。ｘ８６コンパイラ１３３０４は、追加のリンケージ処理を用いて又はこれを用いることなく、少なくとも１つのｘ８６命令セットコア１３３１６を有するプロセッサ上で実行され得るｘ８６バイナリコード１３３０６（例えば、オブジェクトコード）を生成するように動作可能なコンパイラを表す。同様に、図１３３は、高水準言語１３３０２におけるプログラムが、少なくとも１つのｘ８６命令セットコア１３３１４（例えば、カリフォルニア州サニーベールの命令セットＭＩＰＳ技術のＭＩＰＳを実行する、及び／又は、カリフォルニア州サニーベールのＡＲＭホールディングスのＡＲＭ命令セットを実行するコアを有するプロセッサ）なしのプロセッサによりネイティブに実行され得る代替的な命令セットのバイナリコード１３３１０を生成するために、代替的な命令セットのコンパイラ１３３０８を用いてコンパイルされてよいことを示す。命令変換器１３３１２は、ｘ８６命令セットコア１３３１４なしのプロセッサによりネイティブに実行され得るコードにｘ８６バイナリコード１３３０６を変換するために用いられる。この変換済みコードは、これができる命令変換器が作成するのが難しいので、代替的な命令セットのバイナリコード１３３１０と同じである可能性が低い。しかしながら、変換済みコードは、一般的なオペレーションを実現し、代替的な命令セットからの命令で構成される。したがって、命令変換器１３３１２は、エミュレーション、シミュレーション又はその他の処理を通じてｘ８６命令セットプロセッサ又はコアを有していないプロセッサ又は他の電子デバイスが、ｘ８６バイナリコード１３３０６を実行することを可能にするソフトウェア、ファームウェア、ハードウェア又はそれらの組み合わせを表す。 Figure 133 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set, according to an embodiment of the present invention. In the illustrated embodiment, the instruction converter is a software instruction converter, but alternatively, the instruction converter may be implemented in software, firmware, hardware, or various combinations thereof. 
Figure 133 illustrates that a program in a high-level language 13302 may be compiled using an x86 compiler 13304 to generate x86 binary code 13306 that may be substantially executed by a processor having at least one x86 instruction set core 13316. The processor having at least one x86 instruction set core 13316 represents any processor that can perform substantially the same functions as an Intel processor having at least one x86 instruction set core by compatibly executing or otherwise processing (1) a substantial portion of the instruction set of the Intel x86 instruction set core, or (2) object code versions of applications or other software intended to run on an Intel processor having at least one x86 instruction set core, in order to achieve substantially the same result as an Intel processor having at least one x86 instruction set core. The x86 compiler 13304 represents a compiler operable to generate x86 binary code 13306 (e.g., object code) that can, with or without additional linkage processing, be executed on the processor having at least one x86 instruction set core 13316. Similarly, Figure 133 shows that the program in the high-level language 13302 may be compiled using an alternative instruction set compiler 13308 to generate alternative instruction set binary code 13310 that may be natively executed by a processor without at least one x86 instruction set core 13314 (e.g., a processor having cores that execute the MIPS instruction set of MIPS Technologies of Sunnyvale, Calif. and/or that execute the ARM instruction set of ARM Holdings of Sunnyvale, Calif.). The instruction converter 13312 is used to convert the x86 binary code 13306 into code that may be natively executed by the processor without an x86 instruction set core 13314. This converted code is unlikely to be the same as the alternative instruction set binary code 13310, because an instruction converter capable of producing identical code is difficult to make. However, the converted code accomplishes the general operation and is made up of instructions from the alternative instruction set. Thus, the instruction converter 13312 represents software, firmware, hardware, or a combination thereof that, through emulation, simulation, or any other process, allows a processor or other electronic device that does not have an x86 instruction set processor or core to execute the x86 binary code 13306.
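
The point that converted code need not match natively compiled code, yet still accomplishes the same operation, can be shown with a toy model. The target machine, its opcodes, and both instruction sequences below are hypothetical and serve only as a sketch; they do not correspond to x86, MIPS, ARM, or any sequence in the disclosure.

```python
# Illustrative only: a toy load/store target machine and two target programs
# that both compute memory[0x10] + 5 into register r1.

def run_target(program, memory):
    """Execute a toy target program and return the final register file."""
    regs = {}
    for op, a, b in program:
        if op == "LI":        # a = immediate b
            regs[a] = b
        elif op == "LOAD":    # a = memory[b]
            regs[a] = memory[b]
        elif op == "ADD":     # a = a + register b
            regs[a] = regs[a] + regs[b]
        elif op == "ADDI":    # a = a + immediate b
            regs[a] = regs[a] + b
        else:
            raise ValueError(f"unknown target opcode: {op}")
    return regs


# What a compiler for the alternative instruction set might emit directly.
native_code = [("LOAD", "r1", 0x10), ("ADDI", "r1", 5)]

# What a converter might emit when translating source-ISA code that first
# loads the immediate and then adds a memory operand.
converted_code = [("LI", "r1", 5), ("LOAD", "t0", 0x10), ("ADD", "r1", "t0")]

memory = {0x10: 37}
native_result = run_target(native_code, memory)["r1"]
converted_result = run_target(converted_code, memory)["r1"]
```

The two sequences differ in length and in register use, but both leave the same value in `r1`.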

機能及び態様の例示的な実装、実施形態及び特定の組み合わせが以下に詳細に説明される。これらの例は、有益なものであるが限定するものではない。 Exemplary implementations, embodiments, and specific combinations of features and aspects are described in detail below. These examples are provided for illustrative purposes only and are not limiting.

例１：複数のヘテロジニアス処理要素と、複数のヘテロジニアス処理要素のうちの１又は複数の実行のために命令のディスパッチを行うハードウェアヘテロジニアススケジューラであって、命令は、複数のヘテロジニアス処理要素のうちの１又は複数により処理されるコードフラグメントに対応し、命令は、複数のヘテロジニアス処理要素の１又は複数のうちの少なくとも１つに対するネイティブ命令である、ハードウェアヘテロジニアススケジューラとを含むシステム。 Example 1: A system comprising a plurality of heterogeneous processing elements and a hardware heterogeneous scheduler that dispatches instructions for execution on one or more of the plurality of heterogeneous processing elements, wherein the instructions correspond to a code fragment to be processed by the one or more of the plurality of heterogeneous processing elements, and the instructions are native instructions for at least one of the one or more of the plurality of heterogeneous processing elements.

例２：複数のヘテロジニアス処理要素は、インオーダプロセッサコア、アウトオブオーダプロセッサコア及びパックドデータプロセッサコアを有する、例１に記載のシステム。 Example 2: The system described in Example 1, in which the multiple heterogeneous processing elements include an in-order processor core, an out-of-order processor core, and a packed data processor core.

例３：複数のヘテロジニアス処理要素は、アクセラレータをさらに有する、例２に記載のシステム。 Example 3: The system of Example 2, wherein the plurality of heterogeneous processing elements further comprises an accelerator.

例４：ハードウェアヘテロジニアススケジューラは、コードフラグメントのプログラムフェーズを検出するプログラムフェーズ検出器をさらに含み、複数のヘテロジニアス処理要素は、第１のマイクロアーキテクチャを有する第１の処理要素、及び、第１のマイクロアーキテクチャとは異なる第２のマイクロアーキテクチャを有する第２の処理要素を含み、プログラムフェーズは、第１のフェーズ及び第２のフェーズを含む複数のプログラムフェーズのうちの１つであり、命令のディスパッチは、検出されたプログラムフェーズに部分的に基づいており、第１の処理要素によるコードフラグメントの処理は、第２の処理要素によるコードフラグメントの処理と比較してワット性能特性を改善する、例１－３のいずれかに記載のシステム。 Example 4: The system of any of Examples 1-3, wherein the hardware heterogeneous scheduler further includes a program phase detector that detects a program phase of the code fragment, the plurality of heterogeneous processing elements including a first processing element having a first microarchitecture and a second processing element having a second microarchitecture different from the first microarchitecture, the program phase being one of a plurality of program phases including a first phase and a second phase, the dispatching of instructions being based in part on the detected program phase, and the processing of the code fragment by the first processing element improves performance per watt characteristics compared to the processing of the code fragment by the second processing element.

例５：ハードウェアヘテロジニアススケジューラは、受信したコードフラグメントを実行するために、複数の処理要素についての処理要素のタイプを選択し、ディスパッチを用いて、複数の処理要素のうち選択されたタイプの処理要素にコードフラグメントをスケジューリングするセレクタをさらに備える、例１－４のいずれかに記載のシステム。 Example 5: The system of any of Examples 1-4, wherein the hardware heterogeneous scheduler further comprises a selector that selects a processing element type for the plurality of processing elements to execute the received code fragment and schedules the code fragment to a processing element of the selected type among the plurality of processing elements using dispatch.

例６：コードフラグメントは、ソフトウェアスレッドと関連付けられた１又は複数の命令である、例１に記載のシステム。 Example 6: The system of example 1, wherein the code fragment is one or more instructions associated with a software thread.

例７：データ並列プログラムフェーズの場合、処理要素の選択されたタイプは、単一命令複数データ（ＳＩＭＤ）命令を実行する処理コアである、例５－６のいずれかに記載のシステム。 Example 7: The system of any of Examples 5-6, wherein for a data parallel program phase, the selected type of processing element is a processing core that executes single instruction multiple data (SIMD) instructions.

例８：データ並列プログラムフェーズの場合、処理要素の選択されたタイプは、密な算術のプリミティブをサポートする回路である、例５－７のいずれかに記載のシステム。 Example 8: The system of any of Examples 5-7, wherein for a data parallel program phase, the selected type of processing element is a circuit that supports dense arithmetic primitives.

例９：データ並列プログラムフェーズの場合、処理要素の選択されたタイプは、アクセラレータである、例５－７のいずれかに記載のシステム。 Example 9: A system as described in any of Examples 5-7, where in the case of a data parallel program phase, the selected type of processing element is an accelerator.

例１０：データ並列プログラムフェーズは、同じ制御フローを同時に用いて処理されるデータ要素を有する、例５－９のいずれかに記載のシステム。 Example 10: The system of any of Examples 5-9, wherein the data parallel program phases have data elements that are processed simultaneously using the same control flow.

例１１：スレッド並列プログラムフェーズの場合、処理要素の選択されたタイプは、スカラ処理コアである、例５－１０のいずれかに記載のシステム。 Example 11: A system as described in any of Examples 5-10, in which for a thread-parallel program phase, the selected type of processing element is a scalar processing core.

例１２：スレッド並列プログラムフェーズは、一意的な制御フローを用いるデータ依存の分岐を有する、例５－１１のいずれかに記載のシステム。 Example 12: A system as described in any of Examples 5-11, in which a thread-parallel program phase has a data-dependent branch with a unique control flow.

例１３：直列プログラムフェーズの場合、処理要素の選択されたタイプは、アウトオブオーダコアである、例２－１２のいずれかに記載のシステム。 Example 13: A system as described in any of Examples 2-12, wherein in the case of a serial program phase, the selected type of processing element is an out-of-order core.

例１４：データ並列プログラムフェーズの場合、処理要素の選択されたタイプは、単一命令複数データ（ＳＩＭＤ）命令を実行する処理コアである、例２－１３のいずれかに記載のシステム。 Example 14: The system of any of Examples 2-13, wherein for a data parallel program phase, the selected type of processing element is a processing core that executes single instruction multiple data (SIMD) instructions.

例１５：ハードウェアヘテロジニアススケジューラは、コンパイル、組み込み関数（ｉｎｔｒｉｎｓｉｃ）、アセンブリ、ライブラリ、中間、オフロード及びデバイスを含む複数のコードタイプをサポートする、例１－１４のいずれかに記載のシステム。 Example 15: A system as described in any of Examples 1-14, in which the hardware heterogeneous scheduler supports multiple code types including compiled, intrinsic, assembly, library, intermediate, offload and device.

例１６：ハードウェアヘテロジニアススケジューラは、選択されたタイプの処理要素がコードフラグメントをネイティブに処理できない場合、機能をエミュレートする、例５－１５のいずれかに記載のシステム。 Example 16: A system according to any of Examples 5-15, in which the hardware heterogeneous scheduler emulates functionality if a processing element of a selected type cannot process the code fragment natively.

例１７：ハードウェアヘテロジニアススケジューラは、利用可能なハードウェアスレッドの数がオーバサブスクライブされている場合、機能をエミュレートする、例１－１５のいずれかに記載のシステム。 Example 17: A system according to any of Examples 1-15, in which the hardware heterogeneous scheduler emulates functionality when the number of available hardware threads is oversubscribed.

例１８：ハードウェアヘテロジニアススケジューラは、選択されたタイプの処理要素がコードフラグメントをネイティブに処理できない場合、機能をエミュレートする、例５－１５のいずれかに記載のシステム。 Example 18: A system according to any of Examples 5-15, in which the hardware heterogeneous scheduler emulates functionality if a processing element of a selected type cannot process the code fragment natively.

例１９：複数のヘテロジニアス処理要素についての処理要素のタイプの選択は、ユーザに対して透過的である、例５－１８のいずれかに記載のシステム。 Example 19: A system as described in any of Examples 5-18, in which the selection of the processing element type for the plurality of heterogeneous processing elements is transparent to a user.

例２０：複数のヘテロジニアス処理要素についての処理要素のタイプの選択は、オペレーティングシステムに対して透過的である、例５－１９のいずれかに記載のシステム。 Example 20: A system as described in any of Examples 5-19, in which the selection of the processing element type for the plurality of heterogeneous processing elements is transparent to the operating system.

例２１：ハードウェアヘテロジニアススケジューラは、各スレッドがスカラコア上で実行中であるかのようにプログラマに見えるようにするべく、ホモジニアスマルチプロセッサプログラミングモデルを提示する、例１－２０のいずれかに記載のシステム。 Example 21: A system as described in any of Examples 1-20, in which the hardware heterogeneous scheduler presents a homogeneous multiprocessor programming model so that each thread appears to the programmer as if it is running on a scalar core.

例２２：提示されたホモジニアスマルチプロセッサプログラミングモデルは、完全な命令セットに対するサポートの出現を提示する、例２１に記載のシステム。 Example 22: The system of Example 21, wherein the presented homogeneous multiprocessor programming model presents the appearance of support for a full instruction set.

例２３：複数のヘテロジニアス処理要素は、メモリアドレス空間を共有する、例１－２２のいずれかに記載のシステム。 Example 23: A system as described in any of Examples 1-22, in which multiple heterogeneous processing elements share a memory address space.

例２４：ハードウェアヘテロジニアススケジューラは、複数のヘテロジニアス処理要素のうちの１つで実行されるバイナリトランスレータを含む、例１－２３のいずれかに記載のシステム。 Example 24: A system as described in any of Examples 1-23, wherein the hardware heterogeneous scheduler includes a binary translator executing on one of the heterogeneous processing elements.

例２５：複数のヘテロジニアス処理要素についての処理要素のタイプのデフォルト選択は、レイテンシが最適化されたコアである、例５－２４のいずれかに記載のシステム。 Example 25: A system as described in any of Examples 5-24, in which the default selection of processing element type for multiple heterogeneous processing elements is a latency optimized core.

例２６：ヘテロジニアスハードウェアスケジューラは、ディスパッチされた命令に対してマルチプロトコルインタフェースで用いるプロトコルを選択する、例１－２５のいずれかに記載のシステム。 Example 26: A system according to any of Examples 1-25, in which the heterogeneous hardware scheduler selects a protocol to use in the multi-protocol interface for a dispatched instruction.

例２７：マルチプロトコルバスインタフェースによりサポートされている第１のプロトコルは、システムメモリアドレス空間にアクセスするために用いられるメモリインタフェースプロトコルを有する、例２６に記載のシステム。 Example 27: The system of Example 26, wherein the first protocol supported by the multi-protocol bus interface comprises a memory interface protocol used to access a system memory address space.

例２８：マルチプロトコルバスインタフェースによりサポートされる第２のプロトコルは、アクセラレータのローカルメモリに格納されるデータと、ホストキャッシュ階層及びシステムメモリを含むホストプロセッサのメモリサブシステムとの間のコヒーレンシを維持するキャッシュコヒーレンシプロトコルを有する、例２６－２７のいずれかに記載のシステム。 Example 28: A system as described in any of Examples 26-27, wherein the second protocol supported by the multi-protocol bus interface includes a cache coherency protocol that maintains coherency between data stored in the accelerator's local memory and the host processor's memory subsystem, including the host cache hierarchy and the system memory.

例２９：マルチプロトコルバスインタフェースによりサポートされる第３のプロトコルは、デバイス発見、レジスタアクセス、構成、初期化、割込み、ダイレクトメモリアクセス及びアドレス変換サービスをサポートする直列リンクプロトコルを有する、例２６－２８のいずれかに記載のシステム。 Example 29: The system of any of Examples 26-28, wherein the third protocol supported by the multiprotocol bus interface includes a serial link protocol supporting device discovery, register access, configuration, initialization, interrupts, direct memory access, and address translation services.

例３０：第３のプロトコルは、ペリフェラルコンポーネントインターコネクトエクスプレス（ＰＣＩｅ）プロトコルを有する、例２９に記載のシステム。 Example 30: The system of Example 29, wherein the third protocol comprises a Peripheral Component Interconnect Express (PCIe) protocol.
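
The phase-based selection recited in Examples 5, 7, 11, 13, 16 and 25 can be modeled in software as a minimal sketch. The phase names, the string identifiers for processing element types, and the function signature are assumptions made for illustration; the examples describe a hardware scheduler, and this sketch only mirrors its selection logic.

```python
from enum import Enum, auto


class Phase(Enum):
    """Program phases named in the examples (identifiers are illustrative)."""
    DATA_PARALLEL = auto()
    THREAD_PARALLEL = auto()
    SERIAL = auto()


# Mapping follows Examples 7, 11 and 13; the type names are hypothetical.
PHASE_TO_TYPE = {
    Phase.DATA_PARALLEL: "simd_core",
    Phase.THREAD_PARALLEL: "scalar_core",
    Phase.SERIAL: "out_of_order_core",
}

# Example 25: the default selection is a latency-optimized core.
DEFAULT_TYPE = "latency_optimized_core"


def select_and_dispatch(phase, native_types):
    """Pick a processing element type for a code fragment.

    phase:        the detected Phase, or None if no phase was detected
    native_types: set of element types able to process the fragment natively
    Returns (element_type, mode), where mode is "native" or "emulate".
    """
    chosen = PHASE_TO_TYPE.get(phase, DEFAULT_TYPE)
    # Example 16: emulate when the selected type cannot process the
    # fragment natively.
    mode = "native" if chosen in native_types else "emulate"
    return chosen, mode
```

For instance, `select_and_dispatch(Phase.SERIAL, {"simd_core"})` selects the out-of-order core type and reports that the fragment would have to be emulated there.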

例３１：アクセラレータを含むヘテロジニアスプロセッサ内の複数のヘテロジニアス処理要素と、ヘテロジニアスプロセッサ内の複数のヘテロジニアス処理要素のうちの少なくとも１つにより実行可能なプログラムコードを格納するメモリとを含み、プログラムコードは、複数のヘテロジニアス処理要素のうちの１又は複数の実行のために命令のディスパッチを行うヘテロジニアススケジューラであって、命令は、複数のヘテロジニアス処理要素のうちの１又は複数により処理されるコードフラグメントに対応し、命令は、複数のヘテロジニアス処理要素の１又は複数のうちの少なくとも１つに対するネイティブ命令である、ヘテロジニアススケジューラとを含むシステム。 Example 31: A system comprising: a plurality of heterogeneous processing elements in a heterogeneous processor including an accelerator; and a memory storing program code executable by at least one of the plurality of heterogeneous processing elements in the heterogeneous processor, the program code including a heterogeneous scheduler that dispatches instructions for execution on one or more of the plurality of heterogeneous processing elements, wherein the instructions correspond to a code fragment to be processed by the one or more of the plurality of heterogeneous processing elements, and the instructions are native instructions for at least one of the one or more of the plurality of heterogeneous processing elements.

例３２：複数のヘテロジニアス処理要素は、インオーダプロセッサコア、アウトオブオーダプロセッサコア及びパックドデータプロセッサコアを有する、例３１に記載のシステム。 Example 32: The system of Example 31, wherein the plurality of heterogeneous processing elements includes an in-order processor core, an out-of-order processor core, and a packed data processor core.

例３３：複数のヘテロジニアス処理要素は、アクセラレータをさらに有する、例３２に記載のシステム。 Example 33: The system of Example 32, wherein the plurality of heterogeneous processing elements further comprises an accelerator.

例３４：ヘテロジニアススケジューラは、コードフラグメントのプログラムフェーズを検出するプログラムフェーズ検出器をさらに含み、複数のヘテロジニアス処理要素は、第１のマイクロアーキテクチャを有する第１の処理要素、及び、第１のマイクロアーキテクチャとは異なる第２のマイクロアーキテクチャを有する第２の処理要素を含み、プログラムフェーズは、第１のフェーズ及び第２のフェーズを含む複数のプログラムフェーズのうちの１つであり、命令のディスパッチは、検出したプログラムフェーズに部分的に基づいており、第１の処理要素によるコードフラグメントの処理は、第２の処理要素によるコードフラグメントの処理と比較してワット性能特性を改善する、例３１－３３のいずれかに記載のシステム。 Example 34: The system of any of Examples 31-33, wherein the heterogeneous scheduler further includes a program phase detector that detects a program phase of the code fragment, the plurality of heterogeneous processing elements including a first processing element having a first microarchitecture and a second processing element having a second microarchitecture different from the first microarchitecture, the program phase being one of a plurality of program phases including a first phase and a second phase, the dispatching of instructions being based in part on the detected program phase, and the processing of the code fragment by the first processing element improves performance per watt characteristics compared to the processing of the code fragment by the second processing element.

例３５：ヘテロジニアススケジューラは、受信したコードフラグメントを実行するために、複数の処理要素についての処理要素のタイプを選択し、ディスパッチを用いて、複数の処理要素のうち選択されたタイプの処理要素にコードフラグメントをスケジューリングするセレクタをさらに備える、例３１－３４のいずれかに記載のシステム。 Example 35: The system of any of Examples 31-34, wherein the heterogeneous scheduler further comprises a selector for selecting a processing element type for the plurality of processing elements to execute the received code fragment, and using dispatch to schedule the code fragment to a processing element of the selected type among the plurality of processing elements.

例３６：コードフラグメントは、ソフトウェアスレッドと関連付けられた１又は複数の命令である、例３１－３５のいずれかに記載のシステム。 Example 36: The system of any of Examples 31-35, wherein the code fragment is one or more instructions associated with a software thread.

例３７：データ並列プログラムフェーズの場合、処理要素の選択されたタイプは、単一命令複数データ（ＳＩＭＤ）命令を実行する処理コアである、例３４－３６のいずれかに記載のシステム。 Example 37: The system of any of Examples 34-36, wherein for a data parallel program phase, the selected type of processing element is a processing core that executes single instruction multiple data (SIMD) instructions.

例３８：データ並列プログラムフェーズの場合、処理要素の選択されたタイプは、密な算術のプリミティブをサポートする回路である、例３４－３７のいずれかに記載のシステム。 Example 38: The system of any of Examples 34-37, wherein for a data parallel program phase, the selected type of processing element is a circuit that supports dense arithmetic primitives.

例３９：データ並列プログラムフェーズの場合、処理要素の選択されたタイプは、アクセラレータである、例３４－３８のいずれかに記載のシステム。 Example 39: The system of any of Examples 34-38, wherein for a data parallel program phase, the selected type of processing element is an accelerator.

例４０：データ並列プログラムフェーズは、同じ制御フローを同時に用いて処理されるデータ要素を有する、例３４－３９のいずれかに記載のシステム。 Example 40: The system of any of Examples 34-39, wherein the data parallel program phases have data elements that are processed simultaneously using the same control flow.

例４１：スレッド並列プログラムフェーズの場合、処理要素の選択されたタイプは、スカラ処理コアである、例３０－３５のいずれかに記載のシステム。 Example 41: A system as described in any of Examples 30-35, wherein for a thread-parallel program phase, the selected type of processing element is a scalar processing core.

例４２：スレッド並列プログラムフェーズは、一意的な制御フローを用いるデータ依存の分岐を有する、例３０－３６のいずれかに記載のシステム。 Example 42: A system as described in any of Examples 30-36, in which a thread-parallel program phase has a data-dependent branch with a unique control flow.

例４３：直列プログラムフェーズの場合、処理要素の選択されたタイプは、アウトオブオーダコアである、例３０－３７のいずれかに記載のシステム。 Example 43: A system as described in any of Examples 30-37, wherein in the serial program phase, the selected type of processing element is an out-of-order core.

例４４：データ並列プログラムフェーズの場合、処理要素の選択されたタイプは、単一命令複数データ（ＳＩＭＤ）命令を実行する処理コアである、例３０－３８のいずれかに記載のシステム。 Example 44: The system of any of Examples 30-38, wherein for a data parallel program phase, the selected type of processing element is a processing core that executes single instruction multiple data (SIMD) instructions.

例４５：ヘテロジニアススケジューラは、コンパイル、組み込み関数（ｉｎｔｒｉｎｓｉｃ）、アセンブリ、ライブラリ、中間、オフロード及びデバイスを含む複数のコードタイプをサポートする、例３１－４４のいずれかに記載のシステム。 Example 45: A system as described in any of Examples 31-44, in which the heterogeneous scheduler supports multiple code types including compiled, intrinsic, assembly, library, intermediate, offload and device.

例４６：ヘテロジニアススケジューラは、選択されたタイプの処理要素がコードフラグメントをネイティブに処理できない場合、機能をエミュレートする、例３１－４５のいずれかに記載のシステム。 Example 46: A system according to any of Examples 31-45, in which the heterogeneous scheduler emulates functionality if a processing element of a selected type cannot process the code fragment natively.

例４７：ヘテロジニアススケジューラは、利用可能なハードウェアスレッドの数がオーバサブスクライブされている場合、機能をエミュレートする、例３１－４６のいずれかに記載のシステム。 Example 47: A system according to any of Examples 31-46, in which the heterogeneous scheduler emulates functionality when the number of available hardware threads is oversubscribed.

例４８：ヘテロジニアススケジューラは、選択されたタイプの処理要素がコードフラグメントをネイティブに処理できない場合、機能をエミュレートする、例３１－４７のいずれかに記載のシステム。 Example 48: A system according to any of Examples 31-47, in which the heterogeneous scheduler emulates functionality if a processing element of a selected type cannot process the code fragment natively.

例５０：複数のヘテロジニアス処理要素についての処理要素のタイプの選択は、ユーザに対して透過的である、例３１－４９のいずれかに記載のシステム。 Example 50: A system as described in any of Examples 31-49, in which the selection of the processing element type for a plurality of heterogeneous processing elements is transparent to a user.

例５１：複数のヘテロジニアス処理要素についての処理要素のタイプの選択は、オペレーティングシステムに対して透過的である、例３１－５０のいずれかに記載のシステム。 Example 51: A system as described in any of Examples 31-50, in which the selection of the processing element type for the plurality of heterogeneous processing elements is transparent to the operating system.

例５２：ヘテロジニアススケジューラは、各スレッドがスカラコア上で実行中であるかのようにプログラマに見えるようにするべく、ホモジニアスプログラミングモデルを提示する、例３１－５１のいずれかに記載のシステム。 Example 52: A system as described in any of Examples 31-51, in which the heterogeneous scheduler presents a homogeneous programming model so that each thread appears to the programmer as if it is running on a scalar core.

例５３：提示されたホモジニアスマルチプロセッサプログラミングモデルは、完全な命令セットに対するサポートの出現を提示する、例５２に記載のシステム。 Example 53: The system of Example 52, wherein the presented homogeneous multiprocessor programming model presents the appearance of support for a full instruction set.

例５４ａ：複数のヘテロジニアス処理要素は、メモリアドレス空間を共有する、例３１－５３のいずれかに記載のシステム。 Example 54a: A system as described in any of Examples 31-53, in which multiple heterogeneous processing elements share a memory address space.

例５４ｂ：ヘテロジニアススケジューラは、複数のヘテロジニアス処理要素のうちの１つで実行されるバイナリトランスレータを含む、例３１－５３のいずれかに記載のシステム。 Example 54b: A system as described in any of Examples 31-53, wherein the heterogeneous scheduler includes a binary translator executing on one of the plurality of heterogeneous processing elements.

例５５：複数のヘテロジニアス処理要素についての処理要素のタイプのデフォルト選択は、レイテンシが最適化されたコアである、例３１－５４のいずれかに記載のシステム。 Example 55: A system as described in any of Examples 31-54, in which the default selection of processing element type for multiple heterogeneous processing elements is a latency optimized core.

例５６：ヘテロジニアスソフトウェアスケジューラは、ディスパッチされた命令に対してマルチプロトコルインタフェースで用いるプロトコルを選択する、例３１－５５のいずれかに記載のシステム。 Example 56: A system according to any of Examples 31-55, in which the heterogeneous software scheduler selects a protocol to use in the multi-protocol interface for a dispatched instruction.

例５７：マルチプロトコルバスインタフェースによりサポートされる第１のプロトコルは、システムメモリアドレス空間にアクセスするために用いられるメモリインタフェースプロトコルを有する、例５６に記載のシステム。 Example 57: The system of Example 56, wherein the first protocol supported by the multi-protocol bus interface comprises a memory interface protocol used to access a system memory address space.

例５８：マルチプロトコルバスインタフェースによりサポートされる第２のプロトコルは、アクセラレータのローカルメモリに格納されるデータと、ホストキャッシュ階層及びシステムメモリを含むホストプロセッサのメモリサブシステムとの間のコヒーレンシを維持するキャッシュコヒーレンシプロトコルを有する、例５６－５７のいずれかに記載のシステム。 Example 58: A system as described in any of Examples 56-57, wherein the second protocol supported by the multi-protocol bus interface comprises a cache coherency protocol that maintains coherency between data stored in the accelerator's local memory and the host processor's memory subsystem, including the host cache hierarchy and the system memory.

例５９：マルチプロトコルバスインタフェースによりサポートされる第３のプロトコルは、デバイス発見、レジスタアクセス、構成、初期化、割込み、ダイレクトメモリアクセス及びアドレス変換サービスをサポートする直列リンクプロトコルを有する、例５６－５８のいずれかに記載のシステム。 Example 59: A system as described in any of Examples 56-58, wherein the third protocol supported by the multiprotocol bus interface includes a serial link protocol supporting device discovery, register access, configuration, initialization, interrupts, direct memory access, and address translation services.

例６０：第３のプロトコルは、ペリフェラルコンポーネントインターコネクトエクスプレス（ＰＣＩｅ）プロトコルを有する、例５９に記載のシステム。 Example 60: The system of Example 59, wherein the third protocol comprises a Peripheral Component Interconnect Express (PCIe) protocol.

例６１：複数の命令を受信する段階と、複数のヘテロジニアス処理要素のうちの１又は複数の実行のために、受信した複数の命令をディスパッチする段階であって、受信した複数の命令は、複数のヘテロジニアス処理要素のうちの１又は複数により処理されるコードフラグメントに対応し、その結果、複数の命令は、複数のヘテロジニアス処理要素の１又は複数のうちの少なくとも１つに対するネイティブ命令である、段階とを含む方法。 Example 61: A method comprising: receiving a plurality of instructions; and dispatching the received plurality of instructions for execution on one or more of a plurality of heterogeneous processing elements, where the received plurality of instructions correspond to a code fragment to be processed by one or more of the plurality of heterogeneous processing elements, such that the plurality of instructions are native instructions for at least one of the one or more of the plurality of heterogeneous processing elements.

例６２：複数のヘテロジニアス処理要素は、インオーダプロセッサコア、アウトオブオーダプロセッサコア及びパックドデータプロセッサコアを有する、例６１に記載の方法。 Example 62: The method of example 61, wherein the plurality of heterogeneous processing elements includes an in-order processor core, an out-of-order processor core, and a packed data processor core.

例６３：複数のヘテロジニアス処理要素は、アクセラレータをさらに有する、例６２に記載の方法。 Example 63: The method of example 62, wherein the plurality of heterogeneous processing elements further comprises an accelerator.

例６４：コードフラグメントのプログラムフェーズを検出する段階をさらに含み、複数のヘテロジニアス処理要素は、第１のマイクロアーキテクチャを有する第１の処理要素、及び、第１のマイクロアーキテクチャとは異なる第２のマイクロアーキテクチャを有する第２の処理要素を含み、プログラムフェーズは、第１のフェーズ及び第２のフェーズを含む複数のプログラムフェーズのうちの１つであり、第１の処理要素によりコードフラグメントの処理は、第２の処理要素によるコードフラグメントの処理と比較してワット性能特性を改善する、例６１－６３のいずれかに記載の方法。 Example 64: The method of any of Examples 61-63, further comprising detecting a program phase of the code fragment, the plurality of heterogeneous processing elements including a first processing element having a first microarchitecture and a second processing element having a second microarchitecture different from the first microarchitecture, the program phase being one of a plurality of program phases including a first phase and a second phase, and processing of the code fragment by the first processing element improves performance per watt characteristics compared to processing of the code fragment by the second processing element.

例６５：受信したコードフラグメントを実行するために、複数の処理要素についての処理要素のタイプを選択し、複数の処理要素のうち選択されたタイプの処理要素にコードフラグメントをスケジューリングする段階をさらに含む、例６１－６４のいずれかに記載の方法。 Example 65: The method of any of Examples 61-64, further comprising selecting a processing element type for the plurality of processing elements for executing the received code fragment, and scheduling the code fragment to a processing element of the selected type among the plurality of processing elements.

例６６：コードフラグメントは、ソフトウェアスレッドと関連付けられた１又は複数の命令である、例６１－６３のいずれかに記載の方法。 Example 66: The method of any of Examples 61-63, wherein the code fragment is one or more instructions associated with a software thread.

例６７：データ並列プログラムフェーズの場合、処理要素の選択されたタイプは、単一命令複数データ（ＳＩＭＤ）命令を実行する処理コアである、例６４－６６のいずれかに記載の方法。 Example 67: The method of any of Examples 64-66, wherein for a data parallel program phase, the selected type of processing element is a processing core that executes single instruction multiple data (SIMD) instructions.

例６８：データ並列プログラムフェーズの場合、処理要素の選択されたタイプは、密な算術のプリミティブをサポートする回路である、例６４－６６のいずれかに記載の方法。 Example 68: The method of any of Examples 64-66, in which, for a data parallel program phase, the selected type of processing element is a circuit that supports dense arithmetic primitives.

例６９：データ並列プログラムフェーズの場合、処理要素の選択されたタイプは、アクセラレータである、例６４－６８のいずれかに記載の方法。 Example 69: The method of any of Examples 64-68, in which, for a data parallel program phase, the selected type of processing element is an accelerator.

例７０：データ並列プログラムフェーズは、同じ制御フローを同時に用いて処理されるデータ要素により特徴付けられる、例６４－６９のいずれかに記載の方法。 Example 70: The method of any of Examples 64-69, wherein a data-parallel program phase is characterized by data elements that are processed simultaneously using the same control flow.

例７１：スレッド並列プログラムフェーズの場合、処理要素の選択されたタイプは、スカラ処理コアである、例６４－７０のいずれかに記載の方法。 Example 71: The method of any of Examples 64-70, in which, for a thread-parallel program phase, the selected type of processing element is a scalar processing core.

例７２：スレッド並列プログラムフェーズは、一意的な制御フローを用いるデータ依存の分岐により特徴付けられる、例６４－７１のいずれかに記載の方法。 Example 72: The method of any of Examples 64-71, in which a thread-parallel program phase is characterized by data-dependent branching with unique control flow.

例７３：直列プログラムフェーズの場合、処理要素の選択されたタイプは、アウトオブオーダコアである、例６４－７２のいずれかに記載の方法。 Example 73: The method of any of Examples 64-72, wherein for a serial program phase, the selected type of processing element is an out-of-order core.

例７４：データ並列プログラムフェーズの場合、処理要素の選択されたタイプは、単一命令複数データ（ＳＩＭＤ）命令を実行する処理コアである、例６４－７３のいずれかに記載の方法。 Example 74: The method of any of Examples 64-73, wherein for a data parallel program phase, the selected type of processing element is a processing core that executes single instruction multiple data (SIMD) instructions.

例７５：選択されたタイプの処理要素がコードフラグメントをネイティブに処理できない場合、機能をエミュレートする段階をさらに含む、例６１－７４のいずれかに記載の方法。 Example 75: The method of any of Examples 61-74, further comprising emulating the functionality if the selected type of processing element cannot process the code fragment natively.

例７６：利用可能なハードウェアスレッドの数がオーバサブスクライブされている場合、機能をエミュレートする段階をさらに含む、例６１－７４のいずれかに記載の方法。 Example 76: The method of any of Examples 61-74, further comprising emulating functionality if the number of available hardware threads is oversubscribed.

例７７：複数のヘテロジニアス処理要素についての処理要素のタイプの選択は、ユーザに対して透過的である、例６１－７６のいずれかに記載の方法。 Example 77: A method according to any of Examples 61-76, in which the selection of the processing element type for a plurality of heterogeneous processing elements is transparent to a user.

例７８：複数のヘテロジニアス処理要素についての処理要素のタイプの選択は、オペレーティングシステムに対して透過的である、例６１－７７のいずれかに記載の方法。 Example 78: A method according to any of examples 61-77, in which the selection of the processing element type for a plurality of heterogeneous processing elements is transparent to the operating system.

例７９：各スレッドがスカラコア上で実行中であるかのように見えるようにするべく、ホモジニアスマルチプロセッサプログラミングモデルを提示する段階をさらに含む、例６１－７４のいずれかに記載の方法。 Example 79: The method of any of Examples 61-74, further comprising presenting a homogeneous multiprocessor programming model such that each thread appears to be executing on a scalar core.

例８０：提示されたホモジニアスマルチプロセッサプログラミングモデルは、完全な命令セットに対するサポートの出現を提示する、例７９に記載の方法。 Example 80: The method of example 79, wherein the presented homogeneous multiprocessor programming model presents the appearance of support for a complete instruction set.

例８１：複数のヘテロジニアス処理要素は、メモリアドレス空間を共有する、例６１－７９のいずれかに記載の方法。 Example 81: A method according to any of examples 61-79, in which multiple heterogeneous processing elements share a memory address space.

例８２：複数のヘテロジニアス処理要素のうちの１つで実行されるコードフラグメントをバイナリ変換する段階をさらに含む、例６１－８１のいずれかに記載の方法。 Example 82: The method of any of Examples 61-81, further comprising binary translating a code fragment to be executed on one of the plurality of heterogeneous processing elements.

例８３：複数のヘテロジニアス処理要素についての処理要素のタイプのデフォルト選択は、レイテンシが最適化されたコアである、例６１－８２のいずれかに記載の方法。 Example 83: The method of any of Examples 61-82, in which the default selection of processing element type for multiple heterogeneous processing elements is a latency optimized core.

例８４：ハードウェアにより実行される場合に、プロセッサが、例５１－８３のうちの１つに記載の方法を実行する命令を格納する非一時的な機械可読媒体。 Example 84: A non-transitory machine-readable medium storing instructions that, when executed by hardware, cause a processor to perform a method according to one of examples 51-83.

例８５：ヘテロジニアススケジューラにおいてコードフラグメントを受信する段階と、コードフラグメントが並列フェーズにあるか否かを判断する段階と、コードフラグメントが並列フェーズにない場合、レイテンシに敏感なオペレーション要素を選択してコードフラグメントを実行する段階と、コードフラグメントが並列フェーズにある場合、並列性のタイプを判断し、及びスレッド並列コードフラグメントに関して、スカラ処理要素を選択してコードフラグメントを実行する段階と、データ並列コードフラグメントに関して、データ並列コードフラグメントのデータレイアウトを判断する段階と、パックドデータレイアウトに関して、単一命令複数データ（ＳＩＭＤ）処理要素及び算術プリミティブ処理要素のうちの１つを選択し、ランダムデータレイアウトに関して、ギャザー命令、空間計算アレイ、又は、複数のスカラコアのアレイから１つのスカラコアを用いるＳＩＭＤ処理要素のうちの１つを選択する段階と、実行のために処理要素にコードフラグメントを送信する段階とを含む方法。 Example 85: A method including receiving a code fragment in a heterogeneous scheduler; determining whether the code fragment is in a parallel phase; if the code fragment is not in a parallel phase, selecting a latency-sensitive operation element to execute the code fragment; if the code fragment is in a parallel phase, determining a type of parallelism and, for a thread-parallel code fragment, selecting a scalar processing element to execute the code fragment; for a data-parallel code fragment, determining a data layout of the data-parallel code fragment; for a packed data layout, selecting one of a single instruction multiple data (SIMD) processing element and an arithmetic primitive processing element, and for a random data layout, selecting one of a SIMD processing element that uses gather instructions, a spatial computation array, or a scalar core from an array of multiple scalar cores; and sending the code fragment to the processing element for execution.
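The selection flow of Example 85 can be sketched as a small decision function. This is only an illustrative sketch: the names `Phase`, `Layout`, and `select_processing_element`, and the string labels for the processing element kinds, are hypothetical and do not appear in the disclosure. The sketch reads the random-layout options as a SIMD element using gather instructions, a spatial computation array, or a scalar core taken from an array of scalar cores.

```python
from enum import Enum, auto


class Phase(Enum):
    SERIAL = auto()           # not in a parallel phase
    THREAD_PARALLEL = auto()
    DATA_PARALLEL = auto()


class Layout(Enum):
    PACKED = auto()
    RANDOM = auto()


def select_processing_element(phase, layout=None):
    """Return the candidate processing element kinds for a code fragment.

    The labels are hypothetical; the example only says that "one of"
    the listed kinds is selected, so the candidates are returned as a list.
    """
    if phase is Phase.SERIAL:
        return ["latency_sensitive"]
    if phase is Phase.THREAD_PARALLEL:
        return ["scalar"]
    # Data-parallel fragment: the choice depends on the data layout.
    if layout is Layout.PACKED:
        return ["simd", "arithmetic_primitive"]
    if layout is Layout.RANDOM:
        return ["simd_with_gather", "spatial_array", "scalar_from_array"]
    raise ValueError("a data-parallel fragment needs a data layout")
```

How a scheduler picks among the returned candidates is not stated in this example, so the sketch leaves that choice to the caller.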

例８６：コードフラグメントが並列フェーズにあるか否かを判断する段階の前に、コードフラグメントがアクセラレータへのオフロードの対象となるときを判断する段階と、コードフラグメントがオフロードの対象となったときに、アクセラレータにコードフラグメントを送信する段階とをさらに含む、例８５に記載の方法。 Example 86: The method of example 85, further comprising, prior to the step of determining whether the code fragment is in a parallel phase, determining when the code fragment is eligible for offloading to an accelerator, and sending the code fragment to the accelerator when the code fragment is eligible for offloading.

例８７：コードフラグメントが並列フェーズにあるか否かを判断する段階は、検出されたデータの依存性、命令タイプ及び制御フロー命令のうちの１又は複数に基づいている、例８５－８６のいずれかに記載の方法。 Example 87: The method of any of Examples 85-86, wherein determining whether the code fragment is in a parallel phase is based on one or more of detected data dependencies, instruction types, and control flow instructions.

例８８：単一の命令、複数のデータ命令についての命令のタイプは、並列フェーズを示す、例８７に記載の方法。 Example 88: The method of example 87, in which an instruction type of single instruction, multiple data (SIMD) instruction indicates a parallel phase.

例８９：ヘテロジニアススケジューラにより処理される各オペレーティングシステムスレッドは、論理スレッド識別子が割り当てられる、例８５－８８のいずれかに記載の方法。 Example 89: A method as described in any of Examples 85-88, in which each operating system thread processed by the heterogeneous scheduler is assigned a logical thread identifier.

例９０：ヘテロジニアススケジューラは、処理要素タイプ、処理要素識別子及びスレッド識別子から成るタプルに各論理スレッド識別子がマッピングされるように、論理スレッド識別子の縞模様マッピングを利用する、例８９に記載の方法。 Example 90: The method of example 89, in which the heterogeneous scheduler utilizes striped mapping of logical thread identifiers such that each logical thread identifier is mapped to a tuple consisting of a processing element type, a processing element identifier, and a thread identifier.

例９１：論理スレッド識別子から処理要素識別子及びスレッド識別子へのマッピングは、除算及びモジュロを用いて計算される、例９０に記載の方法。 Example 91: The method of example 90, in which the mapping from logical thread identifiers to processing element identifiers and thread identifiers is calculated using division and modulo.
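As a minimal sketch of the division-and-modulo mapping in Example 91, the function below splits a logical thread identifier into a processing element identifier and a thread identifier. The names `Slot`, `map_logical_thread`, and `threads_per_pe` are hypothetical, and the assignment of the quotient to the processing element and the remainder to the thread is an assumption; the example does not say which result feeds which field.

```python
from typing import NamedTuple


class Slot(NamedTuple):
    pe_id: int      # processing element identifier
    thread_id: int  # hardware thread identifier within that element


def map_logical_thread(logical_id: int, threads_per_pe: int) -> Slot:
    """Map a logical thread ID to (processing element ID, thread ID).

    Assumption: quotient selects the element, remainder selects the thread.
    """
    if threads_per_pe <= 0:
        raise ValueError("threads_per_pe must be positive")
    pe_id, thread_id = divmod(logical_id, threads_per_pe)
    return Slot(pe_id, thread_id)
```

Because the function depends only on its inputs, a given logical thread identifier always lands on the same slot, which is consistent with the fixed mapping described in Example 92.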

例９２：論理スレッド識別子から処理要素識別子及びスレッド識別子へのマッピングは、スレッドの共通性を保つために固定される、例９１に記載の方法。 Example 92: The method of example 91, in which the mapping from logical thread identifiers to processing element identifiers and thread identifiers is fixed to preserve thread affinity.

例９３：論理スレッド識別子から処理要素タイプへのマッピングは、ヘテロジニアススケジューラにより実行される、例９０に記載の方法。 Example 93: The method of example 90, wherein the mapping from logical thread identifiers to processing element types is performed by a heterogeneous scheduler.

例９４：論理スレッド識別子から処理要素タイプへのマッピングは、将来の処理要素タイプに順応するように柔軟である、例９３に記載の方法。 Example 94: The method of example 93, in which the mapping from logical thread identifiers to processing element types is flexible to accommodate future processing element types.

例９５：ヘテロジニアススケジューラは、少なくとも１つのアウトオブオーダタプル、及び、同じアウトオブオーダタプルに論理スレッド識別子がマッピングするスカラ及びＳＩＭＤタプルを複数のコアグループのうちの少なくとも１つが有するように、複数のコアグループを利用する、例９１に記載の方法。 Example 95: The method of example 91, in which the heterogeneous scheduler utilizes multiple core groups such that at least one of the multiple core groups has at least one out-of-order tuple, and scalar and SIMD tuples whose logical thread identifiers map to the same out-of-order tuple.

例９６：複数のコアグループのうちの１つに属するスレッド間で一意的なページディレクトリベースレジスタ値を有するスレッドにより、非並列フェーズが判断される、例９５に記載の方法。 Example 96: The method of example 95, in which a non-parallel phase is determined by a thread having a page directory base register value that is unique among threads belonging to one of a plurality of core groups.

例９７：処理に属するスレッドは、同じアドレス空間、ページテーブル及びページディレクトリベースレジスタ値を共有する、例９６に記載の方法。 Example 97: The method of example 96, in which threads belonging to a process share the same address space, page table, and page directory base register value.

例９８：イベントを検出する段階であって、イベントは、スレッドウェイクアップコマンド、ページディレクトリベースレジスタへの書き込み、スリープコマンド、スレッドのフェーズ変更、異なるコアへの所望の再割り当てを示す１又は複数の命令のうちの１つである、段階をさらに含む、例８５－９７のいずれかに記載の方法。 Example 98: The method of any of Examples 85-97, further comprising detecting an event, the event being one of a thread wake-up command, a write to a page directory base register, a sleep command, a phase change of a thread, and one or more instructions indicating a desired reallocation to a different core.

例９９：イベントがスレッドウェイクアップコマンドである場合、コードフラグメントが並列フェーズにあると判断して、ウェイクアップしたスレッドと同じページテーブルベースポインタを共有する処理要素の数をカウントする段階と、カウントされた処理要素の数が１より大きいか否かを判断する段階であって、ウェイクアップしたスレッドと同じページテーブルベースポインタを共有する処理要素の数のカウントが１である場合、当該スレッドは直列フェーズであり、ウェイクアップしたスレッドと同じページテーブルベースポインタを共有する処理要素の数のカウントが１ではない場合、当該スレッドは並列フェーズにある、段階をさらに含む、例９８に記載の方法。 Example 99: The method of Example 98, further comprising the steps of: if the event is a thread wakeup command, determining that the code fragment is in a parallel phase, counting the number of processing elements that share the same page table base pointer as the woken up thread, and determining whether the number of counted processing elements is greater than one, where if the count of the number of processing elements that share the same page table base pointer as the woken up thread is one, the thread is in a serial phase, and if the count of the number of processing elements that share the same page table base pointer as the woken up thread is not one, the thread is in a parallel phase.
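The counting rule in Example 99 can be illustrated with a short sketch. The function name `phase_on_wakeup` and the representation of page table base pointers as plain integers are hypothetical; the sketch also assumes that the list of active pointers includes the processing element running the woken thread itself, so a count of one means no other element shares the address space.

```python
def phase_on_wakeup(woken_ptbr, active_ptbrs):
    """Classify a woken thread as 'serial' or 'parallel'.

    woken_ptbr:   page table base pointer of the woken thread
    active_ptbrs: page table base pointers of the processing elements,
                  assumed to include the woken thread's own element
    """
    sharing = sum(1 for ptbr in active_ptbrs if ptbr == woken_ptbr)
    # A count of exactly one means a serial phase; any other count
    # is treated as a parallel phase, following the wording of the example.
    return "serial" if sharing == 1 else "parallel"
```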

例１００：イベントがスレッドスリープコマンドである場合、スレッドと関連付けられた実行フラグをクリアする段階と、影響を受けるスレッドと同じページテーブルベースポインタを共有する処理要素のスレッドの数をカウントする段階と、アウトオブオーダ処理要素がアイドルであるか否かを判断する段階とをさらに含み、ページテーブルベースポインタがコアグループ内のちょうど１つのスレッドにより共有されている場合、その共有しているスレッドがアウトオブオーダ処理要素から移動され、ページテーブルベースポインタが１つより多くのスレッドにより共有されている場合、コアグループの第１の実行スレッドがアウトオブオーダ処理要素に移行される、例９８に記載の方法。 Example 100: The method of example 98, further comprising, if the event is a thread sleep command, clearing an execution flag associated with the thread, counting the number of threads of the processing element that share the same page table base pointer as the affected thread, and determining whether the out-of-order processing element is idle, where if the page table base pointer is shared by exactly one thread in the core group, the sharing thread is moved out of the out-of-order processing element, and if the page table base pointer is shared by more than one thread, the first executing thread of the core group is migrated to the out-of-order processing element.

例１０１：スレッドスリープコマンドは、停止、待ちエントリ及びタイムアウト又は一時停止コマンドのうちの１つである、例１００に記載の方法。 Example 101: The method of example 100, wherein the thread sleep command is one of a halt, a wait entry and timeout, or a pause command.

例１０２：イベントがフェーズ変更である場合、スカラ処理要素上でスレッドが実行中であり、かつ、ＳＩＭＤ命令があることをスレッドの論理スレッド識別子が示す場合、当該スレッドをＳＩＭＤ処理要素に移行する段階と、ＳＩＭＤ処理要素上でスレッドが実行中であり、かつ、ＳＩＭＤ命令がないことをスレッドの論理スレッド識別子が示す場合、当該スレッドをスカラ処理要素に移行する段階とをさらに含む、例９８に記載の方法。 Example 102: The method of example 98, further comprising, if the event is a phase change, migrating the thread to a SIMD processing element if the thread is executing on a scalar processing element and the logical thread identifier of the thread indicates that there are SIMD instructions, and migrating the thread to a scalar processing element if the thread is executing on a SIMD processing element and the logical thread identifier of the thread indicates that there are no SIMD instructions.
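The migration rule for a phase-change event in Example 102 amounts to a two-way switch, sketched below. The function name `migrate_on_phase_change` and the string labels `"scalar"` and `"simd"` are hypothetical; the boolean `has_simd` stands in for whatever indication of SIMD instructions the scheduler derives for the thread.

```python
def migrate_on_phase_change(current_pe, has_simd):
    """Return the processing element kind a thread should run on next.

    current_pe: "scalar" or "simd" (hypothetical labels)
    has_simd:   True if SIMD instructions are indicated for the thread
    """
    if current_pe == "scalar" and has_simd:
        return "simd"    # scalar thread now has SIMD work: migrate to SIMD
    if current_pe == "simd" and not has_simd:
        return "scalar"  # SIMD thread has no SIMD work left: migrate back
    return current_pe    # otherwise the thread stays where it is
```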

例１０３：コードフラグメントを送信する前に、選択された処理要素をより良く適合させるようにコードフラグメントを変換する段階をさらに含む、例８５－１０２のいずれかに記載の方法。 Example 103: The method of any of Examples 85-102, further comprising transforming the code fragment to better suit the selected processing element before transmitting the code fragment.

例１０４：ヘテロジニアススケジューラは、変換を実行するバイナリトランスレータを含む、例１０３に記載の方法。 Example 104: The method of example 103, wherein the heterogeneous scheduler includes a binary translator that performs the translation.

例１０５：ヘテロジニアススケジューラは、変換を実行するＪＩＴコンパイラを含む、例１０３に記載の方法。 Example 105: The method of example 103, wherein the heterogeneous scheduler includes a JIT compiler that performs the transformation.

例１０６：方法は、例６１－８３についての方法の例のうちのいずれかの方法の段階をさらに備える、例８５－１０５のいずれかに記載の方法。 Example 106: The method of any of Examples 85-105, further comprising the method steps of any of the method examples for Examples 61-83.

例１０７：複数のヘテロジニアス処理要素と、コードフラグメントのフェーズを判断して、判断されたフェーズに少なくとも部分的に基づく実行のために複数のヘテロジニアス処理要素のうちの１つにコードフラグメントを送信するヘテロジニアススケジューラとを含むシステム。 Example 107: A system including a plurality of heterogeneous processing elements and a heterogeneous scheduler that determines a phase of a code fragment and sends the code fragment to one of the plurality of heterogeneous processing elements for execution based at least in part on the determined phase.

例１０８：ヘテロジニアススケジューラは、コードフラグメントが並列フェーズにあるか否かを判断し、コードフラグメントが並列フェーズにない場合、レイテンシに敏感なオペレーション要素を選択してコードフラグメントを実行し、コードフラグメントが並列フェーズにある場合、並列性のタイプを判断し、スレッド並列コードフラグメントに関して、スカラ処理要素を選択してコードフラグメントを実行し、データ並列コードフラグメントに関して、データ並列コードフラグメントのデータレイアウトを判断し、パックドデータレイアウトに関して、単一命令複数データ（ＳＩＭＤ）処理要素及び算術プリミティブ処理要素のうちの１つを選択し、ランダムデータレイアウトに関して、ギャザー命令、空間計算アレイ、又は、複数のスカラコアのアレイから１つのスカラコアを用いるＳＩＭＤ処理要素のうちの１つを選択する、例１０７に記載のシステム。 Example 108: The system of Example 107, in which the heterogeneous scheduler determines whether the code fragment is in a parallel phase, selects a latency-sensitive operation element to execute the code fragment if the code fragment is not in a parallel phase, determines a type of parallelism if the code fragment is in a parallel phase, and for a thread-parallel code fragment, selects a scalar processing element to execute the code fragment, for a data-parallel code fragment, determines a data layout for the data-parallel code fragment, selects one of a single instruction multiple data (SIMD) processing element and an arithmetic primitive processing element for a packed data layout, and for a random data layout, selects one of a SIMD processing element using a gather instruction, a spatial computation array, or one scalar core from an array of multiple scalar cores.

例１０９：ヘテロジニアススケジューラは、さらに、コードフラグメントが並列フェーズにあるか否かを判断する前に、いつコードフラグメントがアクセラレータへのオフロードの対象になるかを判断し、コードフラグメントがオフロードの対象になったときに、アクセラレータにコードフラグメントを送信する、例１０８に記載のシステム。 Example 109: The system of Example 108, wherein the heterogeneous scheduler further determines when the code fragment is eligible for offloading to an accelerator before determining whether the code fragment is in a parallel phase, and sends the code fragment to the accelerator when the code fragment is eligible for offloading.

例１１０：ヘテロジニアススケジューラは、さらに、検出されたデータの依存性、命令タイプ及び制御フロー命令のうちの１又は複数に基づいて、コードフラグメントが並列フェーズにあるか否かを判断する、例１０８－１０９のいずれかに記載のシステム。 Example 110: The system of any of Examples 108-109, wherein the heterogeneous scheduler further determines whether the code fragment is in a parallel phase based on one or more of the detected data dependencies, instruction types, and control flow instructions.

例１１１：単一の命令、複数のデータ命令についての命令のタイプは、並列フェーズを示す、例１１０に記載のシステム。 Example 111: The system of example 110, in which an instruction type of single instruction, multiple data (SIMD) instruction indicates a parallel phase.

例１１２：ヘテロジニアススケジューラにより処理される各オペレーティングシステムスレッドは、論理スレッド識別子が割り当てられる、例１０８－１１１のいずれかに記載のシステム。 Example 112: A system as described in any of Examples 108-111, in which each operating system thread processed by the heterogeneous scheduler is assigned a logical thread identifier.

例１１３：ヘテロジニアススケジューラは、処理要素タイプ、処理要素識別子及びスレッド識別子から成るタプルに各論理スレッド識別子がマッピングされるように、論理スレッド識別子の縞模様マッピングを利用する、例１１２に記載のシステム。 Example 113: The system of Example 112, in which the heterogeneous scheduler utilizes striped mapping of logical thread identifiers such that each logical thread identifier is mapped to a tuple consisting of a processing element type, a processing element identifier, and a thread identifier.

例１１４：論理スレッド識別子から処理要素識別子及びスレッド識別子へのマッピングは、除算及びモジュロを用いて計算される、例１１２に記載のシステム。 Example 114: The system of example 112, wherein the mapping from logical thread identifiers to processing element identifiers and thread identifiers is calculated using division and modulo.

例１１５：論理スレッド識別子から処理要素識別子及びスレッド識別子へのマッピングは、スレッドの共通性を保つために固定される、例１１４に記載のシステム。 Example 115: The system of Example 114, in which the mapping from logical thread identifiers to processing element identifiers and thread identifiers is fixed to preserve thread affinity.

例１１６：論理スレッド識別子から処理要素タイプへのマッピングは、ヘテロジニアススケジューラにより実行される、例１１５に記載のシステム。 Example 116: The system of Example 115, wherein the mapping from logical thread identifiers to processing element types is performed by a heterogeneous scheduler.

例１１７：論理スレッド識別子から処理要素タイプへのマッピングは、将来の処理要素タイプに順応するように柔軟である、例１１６に記載のシステム。 Example 117: The system of Example 116, in which the mapping from logical thread identifiers to processing element types is flexible to accommodate future processing element types.

例１１８：ヘテロジニアススケジューラは、少なくとも１つのアウトオブオーダタプル、及び、同じアウトオブオーダタプルに論理スレッド識別子がマッピングするスカラ及びＳＩＭＤタプルをコアグループが有するように、コアグループを利用する、例１０８－１１７のいずれかに記載のシステム。 Example 118: The system of any of Examples 108-117, in which the heterogeneous scheduler utilizes core groups such that a core group has at least one out-of-order tuple, and scalar and SIMD tuples whose logical thread identifiers map to the same out-of-order tuple.

例１１９：複数のコアグループのうちの１つに属するスレッド間で一意的なページディレクトリベースレジスタ値を有するスレッドにより、非並列フェーズが判断される、例１１８に記載のシステム。 Example 119: The system of Example 118, in which a non-parallel phase is determined by a thread having a page directory base register value that is unique among threads belonging to one of a plurality of core groups.

例１２０：処理に属するスレッドは、同じアドレス空間、ページテーブル及びページディレクトリベースレジスタ値を共有する、例１１９に記載のシステム。 Example 120: The system described in Example 119, in which threads belonging to a process share the same address space, page table, and page directory base register value.

例１２１：ヘテロジニアススケジューラは、イベントを検出し、当該イベントは、スレッドウェイクアップコマンド、ページディレクトリベースレジスタへの書き込み、スリープコマンド、スレッドのフェーズ変更及び所望の再割り当てを示す１又は複数の命令のうちの１つである、例１０８－１２０のいずれかに記載のシステム。 Example 121: The system of any of Examples 108-120, wherein the heterogeneous scheduler detects an event, the event being one of a thread wakeup command, a write to a page directory base register, a sleep command, a thread phase change, and one or more instructions indicating a desired reallocation.

例１２２：ヘテロジニアススケジューラは、イベントがスレッドウェイクアップコマンドである場合、コードフラグメントが並列フェーズにあると判断して、ウェイクアップしたスレッドと同じページテーブルベースポインタを共有する処理要素の数をカウントし、カウントされた処理要素の数が１より大きいか否かを判断し、ウェイクアップしたスレッドと同じページテーブルベースポインタを共有する処理要素の数のカウントが１である場合、当該スレッドは直列フェーズにあり、ウェイクアップしたスレッドと同じページテーブルベースポインタを共有する処理要素の数のカウントが１ではない場合、当該スレッドは並列フェーズにある、例１２１に記載のシステム。 Example 122: The system of Example 121, in which the heterogeneous scheduler determines that the code fragment is in a parallel phase if the event is a thread wakeup command, counts the number of processing elements that share the same page table base pointer as the woken up thread, and determines whether the counted number of processing elements is greater than one, and if the count of the number of processing elements that share the same page table base pointer as the woken up thread is one, the thread is in a serial phase, and if the count of the number of processing elements that share the same page table base pointer as the woken up thread is not one, the thread is in a parallel phase.

例１２３：ヘテロジニアススケジューラは、イベントがスレッドスリープコマンドである場合、スレッドと関連付けられている実行フラグをクリアし、影響を受けるスレッドと同じページテーブルベースポインタを共有する処理要素のスレッドの数をカウントし、アウトオブオーダ処理要素がアイドルであるか否かを判断し、ページテーブルベースポインタがコアグループ内のちょうど１つのスレッドにより共有されている場合、その共有しているスレッドがアウトオブオーダ処理要素から移動され、ページテーブルベースポインタが１つより多くのスレッドにより共有されている場合、グループの第１の実行スレッドがアウトオブオーダ処理要素に移行される、例１２１に記載のシステム。 Example 123: The system of Example 121, in which the heterogeneous scheduler clears the execution flag associated with the thread if the event is a thread sleep command, counts the number of threads of the processing element that share the same page table base pointer as the affected thread, determines whether the out-of-order processing element is idle, and if the page table base pointer is shared by exactly one thread in the core group, the sharing thread is moved out of the out-of-order processing element, and if the page table base pointer is shared by more than one thread, the first executing thread of the group is migrated to the out-of-order processing element.

例１２４：スレッドスリープコマンドは、停止、待ちエントリ及びタイムアウト又は一時停止コマンドのうちの１つである、例１２３に記載のシステム。 Example 124: The system of Example 123, wherein the thread sleep command is one of a halt, a wait entry and timeout, or a pause command.

例１２５：ヘテロジニアススケジューラは、イベントがフェーズ変更である場合、スカラ処理要素上でスレッドが実行中であり、かつ、ＳＩＭＤ命令があることをスレッドの論理スレッド識別子が示す場合、当該スレッドをＳＩＭＤ処理要素に移行し、ＳＩＭＤ処理要素上でスレッドが実行中であり、かつ、ＳＩＭＤ命令がないことをスレッドの論理スレッド識別子が示す場合、当該スレッドをスカラ処理要素に移行する、例１２１に記載のシステム。 Example 125: The system of Example 121, in which the heterogeneous scheduler migrates a thread to a SIMD processing element if the event is a phase change, if the thread is running on a scalar processing element and the logical thread identifier of the thread indicates that there is a SIMD instruction, and migrates the thread to a scalar processing element if the thread is running on a SIMD processing element and the logical thread identifier of the thread indicates that there is no SIMD instruction.

例１２６：ヘテロジニアススケジューラは、コードフラグメントを送信する前に、選択された処理要素をより良く適合させるようにコードフラグメントを変換する、例１０８－１２５のいずれかに記載のシステム。 Example 126: A system as described in any of Examples 108-125, in which the heterogeneous scheduler transforms the code fragment to better suit the selected processing element before transmitting the code fragment.

例１２７：ヘテロジニアススケジューラは、実行されると変換を実行するために、非一時的な機械可読媒体に格納されるバイナリトランスレータを含む、例１２６に記載のシステム。 Example 127: The system of Example 126, wherein the heterogeneous scheduler includes a binary translator stored in a non-transitory machine-readable medium to perform the translation when executed.

例１２８：ヘテロジニアススケジューラは、実行されると変換を実行するために、非一時的な機械可読媒体に格納されるＪＩＴコンパイラを含む、例１２６に記載のシステム。 Example 128: The system of Example 126, wherein the heterogeneous scheduler includes a JIT compiler stored on a non-transitory machine-readable medium to perform the transformation when executed.

例１２９：ヘテロジニアススケジューラを提供するヘテロジニアスプロセッサ内の複数のヘテロジニアス処理要素のうちの少なくとも１つにより実行可能なプログラムコードを格納するメモリをさらに含む、例１０８－１２８のいずれかに記載のシステム。 Example 129: The system of any of Examples 108-128, further comprising a memory storing program code executable by at least one of the plurality of heterogeneous processing elements in a heterogeneous processor to provide the heterogeneous scheduler.

例１３０：ヘテロジニアススケジューラは回路を有する、例１０８－１２８のいずれかに記載のシステム。 Example 130: A system as described in any of Examples 108-128, in which the heterogeneous scheduler has circuitry.

例１３１：プロセッサコアを含み、プロセッサコアは、プロセッサコアに対してネイティブな少なくとも１つの命令をデコードするデコーダと、少なくとも１つのデコードされた命令を実行する１又は複数の実行ユニットであって、少なくとも１つのデコードされた命令は加速開始命令に対応し、加速開始命令はアクセラレータにオフロードされるコードの領域の開始を示す、１又は複数の実行ユニットとを含むプロセッサ。 Example 131: A processor including a processor core, the processor core including a decoder that decodes at least one instruction native to the processor core, and one or more execution units that execute the at least one decoded instruction, the at least one decoded instruction corresponding to an acceleration start instruction, the acceleration start instruction indicating the start of a region of code to be offloaded to an accelerator.

例１３２：コードの領域は、ターゲットアクセラレータがプロセッサコアに結合され、コードの領域を処理するために利用可能であるか否かに基づいてオフロードされ、コードの領域を処理するプロセッサコアにターゲットアクセラレータが結合されていない場合、コードの領域は、プロセッサコアにより処理される、例１３１に記載のプロセッサ。 Example 132: The processor of Example 131, wherein a region of code is offloaded based on whether a target accelerator is coupled to a processor core and available to process the region of code, and if a target accelerator is not coupled to a processor core that processes the region of code, the region of code is processed by the processor core.
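The fallback behavior in Example 132 can be sketched as a simple choice of executor for a region marked by the acceleration start instruction. The function name `choose_executor` and its boolean parameters are hypothetical. The example states only that the core processes the region when no target accelerator is coupled to it; treating a coupled but unavailable accelerator the same way is an assumption of this sketch.

```python
def choose_executor(accelerator_coupled, accelerator_available):
    """Decide where a marked code region runs.

    Returns "accelerator" only when a target accelerator is both coupled
    to the core and available; otherwise the core runs the region itself.
    (The unavailable-but-coupled case falling back to the core is assumed.)
    """
    if accelerator_coupled and accelerator_available:
        return "accelerator"
    return "processor_core"
```

Under this reading, the same marked region produces correct results on a part with or without the accelerator, and only the place of execution changes.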

例１３３：加速開始命令に対応する少なくとも１つのデコードされた命令の実行に応じて、プロセッサコアは、実行の第１のモードから実行の第２のモードに遷移する、例１３１に記載のプロセッサ。 Example 133: The processor of Example 131, in which the processor core transitions from a first mode of execution to a second mode of execution in response to execution of at least one decoded instruction corresponding to an acceleration start instruction.

例１３４：実行の第１のモードにおいて、プロセッサコアは、自己書き換えコードをチェックし、実行の第２のモードにおいて、プロセッサコアは、自己書き換えコードに対するチェックをディセーブルにする、例１３３に記載のプロセッサ。 Example 134: The processor of example 133, wherein in a first mode of execution, the processor core checks for self-modifying code, and in a second mode of execution, the processor core disables checking for self-modifying code.

例１３５：自己書き換えコードチェックをディセーブルにするために、自己書き換えコード検出回路がディセーブルにされる、例１３４に記載のプロセッサ。 Example 135: The processor of Example 134, wherein the self-modifying code detection circuit is disabled to disable the self-modifying code check.

例１３６：実行の第１のモードにおいて、メモリ一貫性モデル制限は、メモリオーダリング要求を緩和することにより弱められる、例１３３－１３５のいずれか１つに記載のプロセッサ。 Example 136: A processor according to any one of Examples 133-135, in which in a first mode of execution, memory consistency model restrictions are relaxed by relaxing memory ordering requirements.

例１３７：実行の第１のモードにおいて、浮動小数セマンティクスは、浮動小数点制御ワードレジスタを設定することにより変更される、例１３３－１３６のいずれか１つに記載のプロセッサ。 Example 137: A processor according to any one of Examples 133-136, wherein in a first mode of execution, floating point semantics are changed by setting a floating point control word register.

例１３８：プロセッサコアに対してネイティブな命令をデコーディングする段階と、加速開始命令に対応するデコードされた命令を実行する段階であって、加速開始命令は、アクセラレータにオフロードされるコードの領域の開始を示す、段階とを含む方法。 Example 138: A method including: decoding instructions native to a processor core; and executing the decoded instructions corresponding to an acceleration start instruction, the acceleration start instruction indicating the start of a region of code to be offloaded to an accelerator.

例１３９：コードの領域は、ターゲットアクセラレータがプロセッサコアに結合され、コードの領域を処理するために利用可能であるか否かに基づいてオフロードされ、コードの領域を処理するプロセッサコアにターゲットアクセラレータが結合されていない場合、コードの領域は、プロセッサコアにより処理される、例１３８に記載の方法。 Example 139: The method of example 138, wherein the region of code is offloaded based on whether a target accelerator is coupled to a processor core and available to process the region of code, and if the target accelerator is not coupled to the processor core that processes the region of code, the region of code is processed by the processor core.

例１４０：加速開始命令に対応するデコードされた命令の実行に応じて、プロセッサコアは、実行の第１のモードから実行の第２のモードに遷移する、例１３８に記載の方法。 Example 140: The method of example 138, in which the processor core transitions from a first mode of execution to a second mode of execution in response to execution of a decoded instruction corresponding to the acceleration start instruction.

例１４１：実行の第１のモードにおいて、プロセッサコアは、自己書き換えコードをチェックし、実行の第２のモードにおいて、プロセッサコアは、自己書き換えコードに対するチェックをディセーブルにする、例１４０に記載の方法。 Example 141: The method of example 140, in which in a first mode of execution, the processor core checks for self-modifying code and in a second mode of execution, the processor core disables checking for self-modifying code.

例１４２：自己書き換えコードチェックをディセーブルにするために、自己書き換えコード検出回路がディセーブルにされる、例１４１に記載の方法。 Example 142: The method of example 141, in which the self-modifying code detection circuit is disabled to disable the self-modifying code check.

例１４３：実行の第１のモードにおいて、メモリ一貫性モデル制限は、メモリオーダリング要求を緩和することにより弱められる、例１４０－１４２のいずれか１つに記載の方法。 Example 143: The method of any one of Examples 140-142, in which in a first mode of execution, memory consistency model restrictions are relaxed by relaxing memory ordering requirements.

例１４４：実行の第１のモードにおいて、浮動小数セマンティクスは、浮動小数点制御ワードレジスタを設定することにより変更される、例１４０－１４３のいずれか１つに記載の方法。 Example 144: The method of any one of Examples 140-143, wherein in a first mode of execution, floating point semantics are changed by setting a floating point control word register.

例１４５：プロセッサにより実行されるときに、プロセッサに方法を実行させる命令を格納する非一時的な機械可読媒体であって、方法は、プロセッサコアに対してネイティブな命令をデコーディングする段階と、加速開始命令に対応するデコードされた命令を実行する段階であって、加速開始命令は、アクセラレータにオフロードされるコードの領域の開始を示す、段階とを含む、非一時的な機械可読媒体。 Example 145: A non-transitory machine-readable medium storing instructions that, when executed by a processor, cause the processor to perform a method, the method including: decoding instructions native to a processor core; and executing the decoded instructions corresponding to an acceleration start instruction, the acceleration start instruction indicating the start of a region of code to be offloaded to an accelerator.

例１４６：コードの領域は、ターゲットアクセラレータがプロセッサコアに結合され、コードの領域を処理するために利用可能であるか否かに基づいてオフロードされ、コードの領域を処理するプロセッサコアにターゲットアクセラレータが結合されていない場合、コードの領域は、プロセッサコアにより処理される、例１４５に記載の方法。 Example 146: The method of example 145, in which the region of code is offloaded based on whether a target accelerator is coupled to a processor core and available to process the region of code, and if the target accelerator is not coupled to a processor core that processes the region of code, the region of code is processed by the processor core.

例１４７：加速開始命令に対応するデコードされた命令の実行に応じて、プロセッサコアは、実行の第１のモードから実行の第２のモードに遷移する、例１４５に記載の方法。 Example 147: The method of example 145, in which the processor core transitions from a first mode of execution to a second mode of execution in response to execution of a decoded instruction corresponding to the acceleration start instruction.

例１４８：実行の第１のモードにおいて、プロセッサコアは、自己書き換えコードをチェックし、実行の第２のモードにおいて、プロセッサコアは、自己書き換えコードに対するチェックをディセーブルにする、例１４７に記載の方法。 Example 148: The method of example 147, in which in a first mode of execution, the processor core checks for self-modifying code and in a second mode of execution, the processor core disables checking for self-modifying code.

例１４９：自己書き換えコードチェックをディセーブルにするために、自己書き換えコード検出回路がディセーブルにされる、例１４８に記載の方法。 Example 149: The method of example 148, in which the self-modifying code detection circuit is disabled to disable the self-modifying code check.

例１５０：実行の第１のモードにおいて、メモリ一貫性モデル制限は、メモリオーダリング要求を緩和することにより弱められる、例１４８－１４９のいずれか１つに記載の方法。 Example 150: The method of any one of Examples 148-149, in which in a first mode of execution, memory consistency model restrictions are relaxed by relaxing memory ordering requirements.

例１５１：実行の第１のモードにおいて、浮動小数セマンティクスは、浮動小数点制御ワードレジスタを設定することにより変更される、例１４８－１５０のいずれか１つに記載の方法。 Example 151: The method of any one of Examples 148-150, wherein in a first mode of execution, floating point semantics are changed by setting a floating point control word register.

例１５２：プロセッサコアを含み、プロセッサコアは、プロセッサコアに対してネイティブな少なくとも１つの命令をデコードするデコーダと、少なくとも１つのデコードされた命令を実行する１又は複数の実行ユニットであって、少なくとも１つのデコードされた命令は加速開始命令に対応し、加速開始命令は、アクセラレータにオフロードされるコードの領域の開始を示す、１又は複数の実行ユニットとを含む、システム。 Example 152: A system including a processor core, the processor core including a decoder that decodes at least one instruction native to the processor core, and one or more execution units that execute the at least one decoded instruction, the at least one decoded instruction corresponding to an acceleration start instruction, the acceleration start instruction indicating the start of a region of code to be offloaded to an accelerator.

例１５３：コードの領域は、ターゲットアクセラレータがプロセッサコアに結合され、コードの領域を処理するために利用可能であるか否かに基づいてオフロードされ、コードの領域を処理するプロセッサコアにターゲットアクセラレータが結合されていない場合、コードの領域は、プロセッサコアにより処理される、例１５２に記載のシステム。 Example 153: The system of Example 152, in which the region of code is offloaded based on whether a target accelerator is coupled to a processor core and available to process the region of code, and if the target accelerator is not coupled to the processor core that processes the region of code, the region of code is processed by the processor core.

例１５４：加速開始命令に対応する少なくとも１つのデコードされた命令の実行に応じて、プロセッサコアは、実行の第１のモードから実行の第２のモードに遷移する、例１５２に記載のシステム。 Example 154: The system of Example 152, in which the processor core transitions from a first mode of execution to a second mode of execution in response to execution of at least one decoded instruction corresponding to the acceleration start instruction.

例１５５：実行の第１のモードにおいて、プロセッサコアは、自己書き換えコードをチェックし、実行の第２のモードにおいて、プロセッサコアは、自己書き換えコードに対するチェックをディセーブルにする、例１５４に記載のシステム。 Example 155: The system of example 154, wherein in a first mode of execution, the processor core checks for self-modifying code, and in a second mode of execution, the processor core disables checking for self-modifying code.

例１５６：自己書き換えコードチェックをディセーブルにするために、自己書き換えコード検出回路がディセーブルにされる、例１５５に記載のシステム。 Example 156: The system of example 155, in which the self-modifying code detection circuit is disabled to disable the self-modifying code check.

例１５７：実行の第１のモードにおいて、メモリ一貫性モデル制限は、メモリオーダリング要求を緩和することにより弱められる、例１５２－１５６のいずれか１つに記載のプロセッサ。 Example 157: The processor of any one of Examples 152-156, wherein in a first mode of execution, memory consistency model restrictions are relaxed by relaxing memory ordering requirements.

例１５８：実行の第１のモードにおいて、浮動小数セマンティクスは、浮動小数点制御ワードレジスタを設定することにより変更される、例１５２－１５７のいずれか１つに記載のプロセッサ。 Example 158: A processor according to any one of Examples 152-157, wherein in a first mode of execution, floating point semantics are changed by setting a floating point control word register.

例１５９：プロセッサコアを含み、プロセッサコアは、プロセッサコアに対してネイティブな命令をデコードするデコーダと、加速終了命令に対応するデコードされた命令を実行する１又は複数の実行ユニットであって、加速終了命令は、アクセラレータにオフロードされるコードの領域の終了を示す、１又は複数の実行ユニットとを含むプロセッサ。 Example 159: A processor including a processor core, the processor core including a decoder that decodes instructions native to the processor core, and one or more execution units that execute decoded instructions corresponding to an acceleration end instruction, the acceleration end instruction indicating the end of a region of code to be offloaded to an accelerator.

例１６０：コードの領域は、ターゲットアクセラレータがプロセッサコアに結合され、コードの領域を処理するために利用可能であるか否かに基づいてオフロードされ、コードの領域を受信及び処理するプロセッサコアにターゲットアクセラレータが結合されていない場合、コードの領域は、プロセッサコアにより処理される、例１５９に記載のプロセッサ。 Example 160: The processor of example 159, wherein a region of code is offloaded based on whether a target accelerator is coupled to a processor core and available to process the region of code, and if a target accelerator is not coupled to a processor core that receives and processes the region of code, the region of code is processed by the processor core.

例１６１：実行の第１のモードから実行の第２のモードにプロセッサコアを遷移させる加速開始命令に対応するデコードされた命令の実行により、コードの領域が記述される、例１５９に記載のプロセッサ。 Example 161: The processor of example 159, wherein the region of code is described by execution of decoded instructions corresponding to an acceleration start instruction that transitions the processor core from a first mode of execution to a second mode of execution.

例１６２：実行の第１のモードにおいて、プロセッサは、自己書き換えコードをチェックし、実行の第２のモードにおいて、プロセッサは、自己書き換えコードに対するチェックをディセーブルにする、例１６１に記載のプロセッサ。 Example 162: The processor of example 161, wherein in a first mode of execution, the processor checks for self-modifying code, and in a second mode of execution, the processor disables checking for self-modifying code.

例１６３：自己書き換えコードチェックをディセーブルにするために、自己書き換えコード検出回路がディセーブルにされる、例１６２に記載のプロセッサ。 Example 163: The processor of Example 162, in which the self-modifying code detection circuit is disabled to disable the self-modifying code check.

例１６４：実行の第１のモードにおいて、メモリ一貫性モデル制限が弱められる、例１６１－１６３のいずれか１つに記載のプロセッサ。 Example 164: A processor according to any one of Examples 161-163, in which in a first mode of execution, memory consistency model restrictions are relaxed.

例１６５：実行の第１のモードにおいて、浮動小数セマンティクスは、浮動小数点制御ワードレジスタを設定することにより変更される、例１６１－１６４のいずれか１つに記載のプロセッサ。 Example 165: A processor according to any one of Examples 161-164, wherein in a first mode of execution, floating point semantics are changed by setting a floating point control word register.

例１６６：アクセラレータ開始命令の実行は、アクセラレータ終了命令が実行されるまで、プロセッサコア上でコードの領域の実行をゲート制御する、例１５９－１６５のいずれか１つに記載のプロセッサ。 Example 166: A processor according to any one of Examples 159-165, in which execution of an accelerator start instruction gates execution of a region of code on a processor core until an accelerator end instruction is executed.

例１６７：プロセッサコアに対してネイティブな命令をデコーディングする段階と、加速終了命令に対応するデコードされた命令を実行する段階であって、加速終了命令は、アクセラレータにオフロードされるコードの領域の終了を示す、段階とを含む方法。 Example 167: A method including: decoding instructions native to a processor core; and executing the decoded instructions corresponding to an acceleration end instruction, the acceleration end instruction indicating an end of a region of code to be offloaded to an accelerator.

例１６８：コードの領域は、ターゲットアクセラレータがプロセッサコアに結合され、コードの領域を処理するために利用可能であるか否かに基づいてオフロードされ、コードの領域を受信及び処理するプロセッサコアにターゲットアクセラレータが結合されていない場合、コードの領域は、プロセッサコアにより処理される、例１６７に記載の方法。 Example 168: The method of example 167, in which the region of code is offloaded based on whether a target accelerator is coupled to a processor core and available to process the region of code, and if a target accelerator is not coupled to a processor core that receives and processes the region of code, the region of code is processed by the processor core.

例１６９：実行の第１のモードから実行の第２のモードにプロセッサコアを遷移させる加速開始命令に対応するデコードされた命令の実行により、コードの領域が記述される、例１６７に記載の方法。 Example 169: The method of example 167, in which the region of code is described by execution of decoded instructions corresponding to an acceleration start instruction that transitions the processor core from a first mode of execution to a second mode of execution.

例１７０：実行の第１のモードにおいて、プロセッサは、自己書き換えコードをチェックし、実行の第２のモードにおいて、プロセッサは、自己書き換えコードに対するチェックをディセーブルにする、例１６９に記載の方法。 Example 170: The method of example 169, in which in a first mode of execution, the processor checks for self-modifying code and in a second mode of execution, the processor disables checking for self-modifying code.

例１７１：自己書き換えコードチェックをディセーブルにするために、自己書き換えコード検出回路がディセーブルにされる、例１７０に記載の方法。 Example 171: The method of example 170, in which the self-modifying code detection circuit is disabled to disable the self-modifying code check.

例１７２：実行の第１のモードにおいて、メモリ一貫性モデル制限が弱められる、例１６９－１７１のいずれか１つに記載の方法。 Example 172: The method of any one of Examples 169-171, in which in a first mode of execution, memory consistency model restrictions are relaxed.

例１７３：実行の第１のモードにおいて、浮動小数セマンティクスは、浮動小数点制御ワードレジスタを設定することにより変更される、例１６９－１７２のいずれか１つに記載の方法。 Example 173: The method of any one of Examples 169-172, wherein in a first mode of execution, floating point semantics are changed by setting a floating point control word register.

例１７４：アクセラレータ開始命令の実行は、アクセラレータ終了命令が実行されるまで、プロセッサコア上でコードの領域の実行をゲート制御する、例１６７－１７３のいずれか１つに記載の方法。 Example 174: The method of any one of Examples 167-173, wherein execution of an accelerator start instruction gates execution of a region of code on a processor core until an accelerator end instruction is executed.

例１７５：プロセッサにより実行されるときに、プロセッサに方法を実行させる命令を格納する非一時的な機械可読媒体であって、方法は、プロセッサコアに対してネイティブな命令をデコーディングする段階と、加速終了命令に対応するデコードされた命令を実行する段階であって、加速終了命令は、アクセラレータにオフロードされるコードの領域の終了を示す、段階とを含む、非一時的な機械可読媒体。 Example 175: A non-transitory machine-readable medium storing instructions that, when executed by a processor, cause the processor to perform a method, the method including: decoding instructions native to a processor core; and executing the decoded instructions corresponding to an acceleration end instruction, the acceleration end instruction indicating an end of a region of code to be offloaded to an accelerator.

例１７６：コードの領域は、ターゲットアクセラレータがプロセッサコアに結合され、コードの領域を処理するために利用可能であるか否かに基づいてオフロードされ、コードの領域を受信及び処理するプロセッサコアにターゲットアクセラレータが結合されていない場合、コードの領域は、プロセッサコアにより処理される、例１７５に記載の非一時的な機械可読媒体。 Example 176: The non-transitory machine-readable medium of Example 175, wherein the region of code is offloaded based on whether a target accelerator is coupled to a processor core and available to process the region of code, and if a target accelerator is not coupled to a processor core that receives and processes the region of code, the region of code is processed by the processor core.

例１７７：実行の第１のモードから実行の第２のモードにプロセッサコアを遷移させる加速開始命令に対応するデコードされた命令の実行により、コードの領域が記述される、例１７５に記載の非一時的な機械可読媒体。 Example 177: The non-transitory machine-readable medium of Example 175, in which a region of code is described by execution of decoded instructions corresponding to an acceleration start instruction that transitions a processor core from a first mode of execution to a second mode of execution.

例１７８：実行の第１のモードにおいて、プロセッサは、自己書き換えコードをチェックし、実行の第２のモードにおいて、プロセッサは、自己書き換えコードに対するチェックをディセーブルにする、例１７７に記載の非一時的な機械可読媒体。 Example 178: The non-transitory machine-readable medium of example 177, wherein in a first mode of execution, the processor checks for self-modifying code and in a second mode of execution, the processor disables checking for self-modifying code.

例１７９：自己書き換えコードチェックをディセーブルにするために、自己書き換えコード検出回路がディセーブルにされる、例１７８に記載の非一時的な機械可読媒体。 Example 179: The non-transitory machine-readable medium of Example 178, wherein the self-modifying code detection circuit is disabled to disable the self-modifying code check.

例１８０：実行の第１のモードにおいて、メモリ一貫性モデル制限が弱められる、例１７７－１７９のいずれか１つに記載の非一時的な機械可読媒体。 Example 180: The non-transitory machine-readable medium of any one of Examples 177-179, in which in a first mode of execution, memory consistency model restrictions are relaxed.

例１８１：実行の第１のモードにおいて、浮動小数セマンティクスは、浮動小数点制御ワードレジスタを設定することにより変更される、例１７７－１８０のいずれか１つに記載の非一時的な機械可読媒体。 Example 181: The non-transitory machine-readable medium of any one of Examples 177-180, wherein in a first mode of execution, floating point semantics are changed by setting a floating point control word register.

例１８２：アクセラレータ開始命令の実行は、アクセラレータ終了命令が実行されるまで、プロセッサコア上でコードの領域の実行をゲート制御する、例１７５－１８１のいずれか１つに記載の非一時的な機械可読媒体。 Example 182: The non-transitory machine-readable medium of any one of Examples 175-181, wherein execution of an accelerator start instruction gates execution of a region of code on a processor core until an accelerator end instruction is executed.

例１８３：プロセッサコアを含み、プロセッサコアは、プロセッサコアに対してネイティブな命令をデコードするデコーダと、加速終了命令に対応するデコードされた命令を実行する１又は複数の実行ユニットであって、加速終了命令は、アクセラレータにオフロードされるコードの領域の終了を示す、１又は複数の実行ユニットと、オフロードされた命令を実行するアクセラレータとを含むシステム。 Example 183: A system including a processor core, the processor core including a decoder that decodes instructions native to the processor core, one or more execution units that execute decoded instructions corresponding to an acceleration end instruction, the acceleration end instruction indicating an end of a region of code that is offloaded to an accelerator, and an accelerator that executes the offloaded instructions.

例１８４：コードの領域は、ターゲットアクセラレータがプロセッサコアに結合され、コードの領域を処理するために利用可能であるか否かに基づいてオフロードされ、コードの領域を受信及び処理するプロセッサコアにターゲットアクセラレータが結合されていない場合、コードの領域は、プロセッサコアにより処理される、例１８３に記載のシステム。 Example 184: The system of Example 183, in which the region of code is offloaded based on whether a target accelerator is coupled to a processor core and available to process the region of code, and if a target accelerator is not coupled to a processor core that receives and processes the region of code, the region of code is processed by the processor core.

例１８５：実行の第１のモードから実行の第２のモードにプロセッサコアを遷移させる加速開始命令に対応するデコードされた命令の実行により、コードの領域が記述される、例１８４に記載のシステム。 Example 185: The system of Example 184, wherein the region of code is described by execution of decoded instructions corresponding to an acceleration start instruction that transitions the processor core from a first mode of execution to a second mode of execution.

例１８６：実行の第１のモードにおいて、プロセッサは、自己書き換えコードをチェックし、実行の第２のモードにおいて、プロセッサは、自己書き換えコードに対するチェックをディセーブルにする、例１８５に記載のシステム。 Example 186: The system of example 185, in which in a first mode of execution, the processor checks for self-modifying code and in a second mode of execution, the processor disables checking for self-modifying code.

例１８７：自己書き換えコードチェックをディセーブルにするために、自己書き換えコード検出回路がディセーブルにされる、例１８６に記載のシステム。 Example 187: The system of Example 186, in which the self-modifying code detection circuit is disabled to disable the self-modifying code check.

例１８８：実行の第１のモードにおいて、メモリ一貫性モデル制限が弱められる、例１８５－１８７のいずれか１つに記載のシステム。 Example 188: A system as described in any one of Examples 185-187, in which in a first mode of execution, memory consistency model restrictions are relaxed.

例１８９：実行の第１のモードにおいて、浮動小数セマンティクスは、浮動小数点制御ワードレジスタを設定することにより変更される、例１８５－１８８のいずれか１つに記載のシステム。 Example 189: The system of any one of Examples 185-188, wherein in a first mode of execution, floating point semantics are changed by setting a floating point control word register.

例１９０：アクセラレータ開始命令の実行は、アクセラレータ終了命令が実行されるまで、プロセッサコア上でコードの領域の実行をゲート制御する、例１８３－１９０のいずれか１つに記載のシステム。 Example 190: The system of any one of Examples 183-190, wherein execution of an accelerator start instruction gates execution of a region of code on a processor core until an accelerator end instruction is executed.

例１９１：スレッドを実行するアクセラレータを含むシステム。システムは、プロセッサコアと、ヘテロジニアススケジューラを実装するソフトウェアを内部に格納したメモリとを含み、プロセッサコアにより実行されるときに、ヘテロジニアススケジューラは、アクセラレータ上で可能な実行に適したスレッドにおいてコードシーケンスを検出し、検出されたコードシーケンスを実行するアクセラレータを選択し、検出されたコードシーケンスを選択されたアクセラレータに送信する。 Example 191: A system including an accelerator that executes threads, a processor core, and a memory having stored therein software that implements a heterogeneous scheduler. When executed by the processor core, the heterogeneous scheduler detects a code sequence in a thread that is suitable for potential execution on the accelerator, selects an accelerator to execute the detected code sequence, and transmits the detected code sequence to the selected accelerator.

例１９２：アクセラレータによる実行に適していないスレッドのプログラムフェーズを実行する複数のヘテロジニアス処理要素をさらに含む、例１９１に記載のシステム。 Example 192: The system of Example 191, further comprising a plurality of heterogeneous processing elements that execute program phases of threads that are not suitable for execution by the accelerator.

例１９３：ヘテロジニアススケジューラは、コードシーケンスをパターンの予め決定されたセットと比較することにより、コードシーケンスを認識するパターンマッチャをさらに有する、例１９１－１９２のいずれかに記載のシステム。 Example 193: The system of any of Examples 191-192, wherein the heterogeneous scheduler further includes a pattern matcher that recognizes code sequences by comparing the code sequences to a predetermined set of patterns.

例１９４：パターンの予め決定されたセットは、メモリに格納される、例１９３に記載のシステム。 Example 194: The system of Example 193, wherein the predetermined set of patterns is stored in memory.

例１９５：ヘテロジニアススケジューラは、パターンマッチを有するコードを認識し、自己書き換えコードが無視されること、メモリ一貫性モデル制限を弱め、浮動小数セマンティクスを変更すること、パフォーマンスモニタリングを変更すること、アーキテクチャフラグの利用を変更することのうちの１又は複数を行うプロセッサコアを構成することによりスレッドと関連付けられた動作モードを調整するパフォーマンスモニタリングを用いる、例１９１－１９４のいずれかに記載のシステム。 Example 195: The system of any of Examples 191-194, wherein the heterogeneous scheduler recognizes code having a pattern match and uses performance monitoring to adjust an operating mode associated with a thread by configuring the processor core to do one or more of the following: ignore self-modifying code, relax memory consistency model restrictions, change floating point semantics, change performance monitoring, and change utilization of architectural flags.

例１９６：ヘテロジニアススケジューラは、認識されたコードを、実行するアクセラレータに対するアクセラレータコードに変換する変換モジュールをさらに有する、例１９１－１９５のいずれかに記載のシステム。 Example 196: The system of any of Examples 191-195, wherein the heterogeneous scheduler further includes a conversion module that converts the recognized code into accelerator code for execution on the accelerator.

例１９７：プロセッサコアは、格納されたパターンを用いて、スレッド内のコードシーケンスを検出するパターンマッチング回路を有する、例１９１－１９６のいずれかに記載のシステム。 Example 197: A system according to any of Examples 191-196, wherein the processor core has pattern matching circuitry that detects code sequences within a thread using stored patterns.

例１９８：プロセッサコアは、システムにおいて実行している各スレッドの実行ステータスを維持する、例１９１－１９７のいずれかに記載のシステム。 Example 198: A system according to any of Examples 191-197, in which the processor core maintains the execution status of each thread executing in the system.

例１９９：ヘテロジニアススケジューラは、システムにおいて実行している各スレッドのステータスを維持する、例１９１－１９７のいずれかに記載のシステム。 Example 199: A system as described in any of Examples 191-197, in which the heterogeneous scheduler maintains the status of each thread executing in the system.

例２００：ヘテロジニアススケジューラは、プロセッサ要素情報、追跡されたスレッド及び検出されたコードシーケンスのうちの１又は複数に基づいて、アクセラレータを選択する、例１９１－１９９のいずれかに記載のシステム。 Example 200: A system as described in any of Examples 191-199, in which the heterogeneous scheduler selects an accelerator based on one or more of processor element information, tracked threads, and detected code sequences.
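By way of illustration only, the detect, select, and transmit flow of Examples 191-200 can be sketched as follows. The pattern table, the instruction mnemonics, and the accelerator kinds ("matrix", "graph") are assumptions made for this sketch and do not appear in the disclosure; a real heterogeneous scheduler would also translate the detected sequence into accelerator code (Example 196) before transmitting it.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical predetermined pattern set: instruction sequence -> accelerator kind.
PATTERNS: Dict[Tuple[str, ...], str] = {
    ("load", "mul", "add", "store"): "matrix",
    ("load", "cmp", "branch"): "graph",
}


def detect(thread_code: List[str]) -> Optional[Tuple[int, Tuple[str, ...]]]:
    """Pattern matcher: return (start index, pattern) of the earliest match."""
    for start in range(len(thread_code)):
        for pattern in PATTERNS:
            if tuple(thread_code[start:start + len(pattern)]) == pattern:
                return start, pattern
    return None


def select_accelerator(pattern: Tuple[str, ...],
                       present: Dict[str, bool]) -> Optional[str]:
    """Select an accelerator of the kind the pattern maps to, if one is present."""
    kind = PATTERNS[pattern]
    return kind if present.get(kind, False) else None


def schedule(thread_code: List[str],
             present: Dict[str, bool]) -> Tuple[str, List[str]]:
    """Return (target, code sent to target); fall back to the core otherwise."""
    hit = detect(thread_code)
    if hit is None:
        return "core", thread_code
    _, pattern = hit
    target = select_accelerator(pattern, present)
    if target is None:
        return "core", thread_code
    return target, list(pattern)
```

In this sketch the selection depends only on the detected pattern and on which accelerators are present; Example 200 additionally allows processing element information and tracked thread state to inform the choice.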

例２０１：複数のヘテロジニアス処理要素と、複数の処理要素に結合されるヘテロジニアススケジューラ回路とを含み、ヘテロジニアススケジューラ回路は、実行中の各スレッド及び各処理要素の実行ステータスを維持するスレッド及び処理要素トラッカテーブルと、コードフラグメントを処理する複数のヘテロジニアス処理要素についての処理要素のタイプを選択して、スレッド及び処理要素トラッカからのステータス及び処理要素情報に基づいて、実行のために複数のヘテロジニアス処理要素のうちの１つ上でコードフラグメントをスケジューリングするセレクタとを含む、システム。 Example 201: A system including a plurality of heterogeneous processing elements and a heterogeneous scheduler circuit coupled to the plurality of processing elements, the heterogeneous scheduler circuit including a thread and processing element tracker table that maintains an execution status of each executing thread and each processing element, and a selector that selects a processing element type for the plurality of heterogeneous processing elements that processes a code fragment and schedules the code fragment on one of the plurality of heterogeneous processing elements for execution based on the status and processing element information from the thread and processing element trackers.

例２０２：プロセッサコアにより実行可能なソフトウェアを格納するメモリをさらに含み、ソフトウェアは、ヘテロジニアススケジューラ回路に結合される複数のヘテロジニアス処理要素のうちの１つであるアクセラレータ上で可能な実行に対するスレッドにおけるコードシーケンスを検出する、例２０１に記載のシステム。 Example 202: The system of Example 201, further including a memory storing software executable by a processor core, the software detecting code sequences in a thread for possible execution on an accelerator, the accelerator being one of the plurality of heterogeneous processing elements coupled to the heterogeneous scheduler circuit.

例２０３：ソフトウェアパターンマッチャは、格納されたパターンからコードシーケンスを認識する、例２０２に記載のシステム。 Example 203: The system of example 202, in which a software pattern matcher recognizes code sequences from stored patterns.

例２０４：ヘテロジニアススケジューラは、認識されたコードをアクセラレータコードに変換する、例２０１－２０３のいずれかに記載のシステム。 Example 204: A system as described in any of Examples 201-203, in which the heterogeneous scheduler converts the recognized code into accelerator code.

例２０５：セレクタは、ヘテロジニアススケジューラ回路により実行される有限ステートマシンである、例２０１－２０４のいずれかに記載のシステム。 Example 205: A system according to any of Examples 201-204, in which the selector is a finite state machine executed by a heterogeneous scheduler circuit.
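The tracker table and selector of Examples 201-205 can be pictured, purely as an illustrative software model, as follows. The class names, the `busy` flag, and the first-idle-element policy are assumptions of this sketch; the disclosure describes a scheduler circuit whose selector may be a finite state machine and does not prescribe this policy.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class PEEntry:
    """One row of a hypothetical thread and processing element tracker table."""
    pe_type: str
    busy: bool = False
    thread_id: Optional[int] = None


class TrackerTable:
    def __init__(self, elements: Dict[str, str]) -> None:
        # elements maps a processing element name to its type.
        self.entries = {name: PEEntry(pe_type) for name, pe_type in elements.items()}

    def select(self, wanted_type: str, thread_id: int) -> Optional[str]:
        """Selector: pick the first idle element of the wanted type and
        record that the code fragment of thread_id is scheduled on it."""
        for name, entry in self.entries.items():
            if entry.pe_type == wanted_type and not entry.busy:
                entry.busy = True
                entry.thread_id = thread_id
                return name
        return None

    def release(self, name: str) -> None:
        """Mark an element idle again once its fragment has completed."""
        entry = self.entries[name]
        entry.busy = False
        entry.thread_id = None
```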

例２０６：スレッドを実行する段階と、実行中のスレッド内のパターンを検出する段階と、認識されたパターンをアクセラレータコードに変換する段階と、変換されたパターンを実行のために利用可能なアクセラレータに転送する段階とを含む方法。 Example 206: A method including executing a thread, detecting a pattern in the executing thread, converting the recognized pattern into accelerator code, and forwarding the converted pattern to an available accelerator for execution.

例２０７：パターンは、ソフトウェアパターンマッチャを用いて認識される、例２０６に記載の方法。 Example 207: The method of example 206, wherein the pattern is recognized using a software pattern matcher.

例２０８：パターンは、ハードウェアパターンマッチ回路を用いて認識される、例２０６に記載の方法。 Example 208: The method of example 206, in which the pattern is recognized using a hardware pattern matching circuit.

例２０９：スレッドを実行する段階と、実行中のスレッド内のパターンを検出する段階と、パターンに基づいた緩和要求を用いるために、スレッドと関連付けられた動作モードを調整する段階とを含む方法。 Example 209: A method including executing a thread, detecting a pattern in the executing thread, and adjusting an operating mode associated with the thread to use relaxed requirements based on the pattern.

例２１０：パターンは、ソフトウェアパターンマッチャを用いて認識される、例２０９に記載の方法。 Example 210: The method of example 209, wherein the pattern is recognized using a software pattern matcher.

例２１１：パターンは、ハードウェアパターンマッチ回路を用いて認識される、例２０９に記載の方法。 Example 211: The method of example 209, in which the pattern is recognized using a hardware pattern matching circuit.

例２１２：調整された動作モードにおいて、自己書き換えコードが無視されること、メモリ一貫性モデル制限が弱められることと、浮動小数セマンティクスが変更されることと、パフォーマンスモニタリングが変更されることと、アーキテクチャフラグの利用が変更されることとのうちの、１又は複数が適用される、例２０９に記載の方法。 Example 212: The method of example 209, in which in the adjusted mode of operation, one or more of the following are applied: self-modifying code is ignored; memory consistency model restrictions are relaxed; floating point semantics are altered; performance monitoring is altered; and architectural flag usage is altered.

例２１３：プロセッサコアに対してネイティブな命令をデコードするデコーダと、デコードされた命令を実行する１又は複数の実行ユニットであって、デコードされた命令の１又は複数は、加速開始命令に対応し、加速開始命令は、同じスレッド内の加速開始命令に従う命令に対する実行の異なるモードにエントリさせる、１又は複数の実行ユニットとを含むシステム。 Example 213: A system including a decoder that decodes instructions native to a processor core, and one or more execution units that execute the decoded instructions, where one or more of the decoded instructions correspond to an acceleration start instruction, and the acceleration start instruction causes entry into a different mode of execution for instructions that follow the acceleration start instruction within the same thread.

例２１４：加速開始命令は、メモリデータブロックに対するポインタを規定するフィールドを含み、メモリデータブロックのフォーマットは、割込み前の進み具合を示すシーケンス番号フィールドを含む、例２１３に記載のシステム。 Example 214: The system of Example 213, wherein the acceleration start instruction includes a field that specifies a pointer to a memory data block, and the format of the memory data block includes a sequence number field that indicates progress made before an interrupt.

例２１５：加速開始命令は、メモリに格納されたコードの予め定義された変換を規定するブロッククラス識別子フィールドを含む、例２１３－２１４のいずれかに記載のシステム。 Example 215: A system as described in any of Examples 213-214, wherein the acceleration start instruction includes a block class identifier field that specifies a predefined transformation of code stored in memory.

例２１６：加速開始命令は、実行のために用いられるハードウェアのタイプを示す実装識別子フィールドを含む、例２１３－２１５のいずれかに記載のシステム。 Example 216: A system as described in any of Examples 213-215, wherein the acceleration start instruction includes an implementation identifier field indicating the type of hardware used for execution.

例２１７：加速開始命令は、加速開始命令が実行した後に修正されるレジスタを格納する状態保存エリアのサイズ及びフォーマットを示す保存状態エリアサイズフィールドを含む、例２１３－２１６のいずれかに記載のシステム。 Example 217: A system as described in any of Examples 213-216, in which the acceleration start instruction includes a saved state area size field indicating the size and format of a state save area that stores registers that are modified after the acceleration start instruction executes.

例２１８：加速開始命令は、ローカルストレージエリアサイズ用のフィールドを含み、ローカルストレージエリアは、レジスタを超えたストレージ（storage beyond registers）を提供する、例２１３－２１７のいずれかに記載のシステム。 Example 218: The system of any of Examples 213-217, wherein the acceleration start instruction includes a field for a local storage area size, and the local storage area provides storage beyond registers.

例２１９：ローカルストレージエリアサイズは、加速開始命令の即値オペランドにより規定される、例２１８に記載のシステム。 Example 219: The system of example 218, wherein the local storage area size is specified by an immediate operand of the acceleration start instruction.

例２２０：ローカルストレージエリアは、加速開始命令に続く命令を除いてアクセスされない、例２１８に記載のシステム。 Example 220: The system of example 218, wherein the local storage area is accessed only by instructions that follow the acceleration start instruction.

例２２１：実行の異なるモード内の命令の場合、メモリ依存性タイプが定義可能である、例２１３－２２０のいずれかに記載のシステム。 Example 221: A system as described in any of Examples 213-220, in which memory dependency types are definable for instructions in different modes of execution.

例２２２：定義可能なメモリ依存性タイプは、ストア－ロード及びストア－ストア依存性が存在しないことが保証されている独立タイプと、ローカルストレージエリアへのロード及びストアが互いに依存し得るが、他のロード及びストアからは独立しているローカルストレージエリアへの潜在的に依存したアクセスタイプと、ハードウェアが命令間の依存性を動的にチェックして強化する潜在的に依存するタイプと、ロード及びストアがそれらの間で依存しており、メモリがアトミックに更新されるアトミック性タイプとのうちの１つを有する、例２２１に記載のシステム。 Example 222: The system of Example 221, in which the definable memory dependency types include one of an independent type, in which store-load and store-store dependencies are guaranteed not to exist, a potentially dependent access type to a local storage area, in which loads and stores to a local storage area may depend on each other but are independent of other loads and stores, a potentially dependent type, in which hardware dynamically checks and enforces dependencies between instructions, and an atomic type, in which loads and stores are dependent between themselves and memory is updated atomically.
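For illustration, the four memory dependency types of Example 222 can be written as an enumeration with a small helper that states, for each type, whether two memory accesses have to be treated as possibly dependent. The names `MemDependency` and `may_depend`, and the reduction of each type to a single boolean, are simplifying assumptions of this sketch and are not part of the disclosure.

```python
from enum import Enum, auto


class MemDependency(Enum):
    INDEPENDENT = auto()              # no store-load or store-store dependencies
    LOCAL_STORAGE_DEPENDENT = auto()  # only local storage accesses may depend
    POTENTIALLY_DEPENDENT = auto()    # hardware checks and enforces dynamically
    ATOMIC = auto()                   # accesses dependent; memory updated atomically


def may_depend(dep: MemDependency, a_is_local: bool, b_is_local: bool) -> bool:
    """Return True if two accesses must be treated as possibly dependent.

    a_is_local / b_is_local say whether each access targets the local
    storage area (a hypothetical flag used only in this model).
    """
    if dep is MemDependency.INDEPENDENT:
        return False
    if dep is MemDependency.LOCAL_STORAGE_DEPENDENT:
        # Local storage accesses may depend on each other but are
        # independent of all other loads and stores.
        return a_is_local and b_is_local
    # POTENTIALLY_DEPENDENT and ATOMIC: assume a dependency may exist.
    return True
```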

例２２３：使用対象のレジスタを含む保存状態、更新されるフラグ、実装仕様情報を格納するメモリと、レジスタを超える実行（execution beyond registers）の間に用いられるローカルストレージとをさらに含む、例２１３－２２２のうちのいずれかに記載のシステム。 Example 223: The system of any of Examples 213-222, further comprising a memory that stores a saved state including registers to be used, flags to be updated, and implementation-specific information, and local storage that is used during execution beyond registers.

例２２４：並列実行の各インスタンスは、独自のローカルストレージを取得する、例２２３に記載のシステム。 Example 224: A system as described in Example 223, in which each instance of parallel execution gets its own local storage.

例２２５：スレッドに対する実行についての異なる緩和モードに入る段階と、異なる緩和モードの実行中、スレッドの実行中に使用対象のレジスタを保存状態エリアに書き込む段階と、異なる緩和モードの実行中に、スレッド内の並列実行毎に用いられるローカルストレージを予約する段階と、スレッドのブロックを実行して、実行の異なる緩和モード内の命令を追跡する段階と、実行の異なるモードの終了が、アクセラレータ終了命令の実行に基づいて到達したか否かを判断する段階と、実行の異なるモードの終了が到達した場合、保存状態エリアからレジスタ及びフラグを元の状態に戻す段階と、実行の異なるモードの終了が到達していない場合、中間結果を用いてローカルストレージを更新する段階とを含む方法。 Example 225: A method including: entering a different, relaxed mode of execution for a thread; writing registers to be used during execution of the thread to a saved state area during the different, relaxed mode of execution; reserving local storage to be used by each parallel execution within the thread during the different, relaxed mode of execution; executing a block of the thread and tracking instructions within the different, relaxed mode of execution; determining whether an end of the different mode of execution has been reached based on execution of an accelerator end instruction; if the end of the different mode of execution has been reached, restoring registers and flags from the saved state area to their original state; and if the end of the different mode of execution has not been reached, updating the local storage with intermediate results.

例２２６：異なる緩和モード実行の間、自己書き換えコードが無視されることと、メモリ一貫性モデル制限が弱められることと、浮動小数セマンティクスが変更されることと、パフォーマンスモニタリングが変更されることと、アーキテクチャフラグの利用が変更されることとのうちの１又は複数が発生する、例２２５に記載の方法。 Example 226: The method of example 225, in which during different relaxed mode execution, one or more of self-modifying code is ignored, memory consistency model restrictions are relaxed, floating point semantics are altered, performance monitoring is altered, and architectural flag usage is altered.

例２２７：アクセラレータ開始命令の実行に基づいて、実行の異なるモードに入る、例２２５又は２２６に記載の方法。 Example 227: The method of example 225 or 226, in which the different mode of execution is entered based on execution of an accelerator start instruction.

例２２８：判断されたパターンに基づいて、実行の異なるモードに入る、例２２５に記載の方法。 Example 228: The method of example 225, in which the different mode of execution is entered based on a determined pattern.

例２２９：アクセラレータ開始命令が実行した後に修正されるレジスタを格納する状態保存エリアのサイズ及びフォーマットは、アクセラレータ開始命令により指し示されるメモリブロックに規定される、例２２５－２２８のいずれかに記載の方法。 Example 229: The method of any of Examples 225-228, in which the size and format of a state save area that stores registers that are modified after an accelerator start instruction executes is specified in a memory block pointed to by the accelerator start instruction.

例２３０：実行前に、スレッド又はその一部を変換する段階をさらに含む、例２２５－２２９のいずれかに記載の方法。 Example 230: The method of any of Examples 225-229, further comprising transforming the thread or a portion thereof prior to execution.

例２３１：スレッド又はその一部は、アクセラレータコードに変換される、例２３０に記載の方法。 Example 231: The method of example 230, in which the thread or a portion thereof is converted into accelerator code.

例２３２：変換されたスレッド又は変換されたスレッドの一部は、アクセラレータにより実行される、例２３０又は２３１に記載の方法。 Example 232: The method of example 230 or 231, in which the transformed thread or a portion of the transformed thread is executed by an accelerator.

例２３３：ブロックの命令は、スレッドの上記ブロックと関連付けられるメモリブロック内のシーケンス番号を更新することにより追跡される、例２１３－２３２のいずれかに記載の方法。 Example 233: The method of any of Examples 213-232, in which instructions of a block are tracked by updating a sequence number in a memory block associated with the block of a thread.

例２３４：命令が実行に成功して、リタイアしたときに、スレッドのブロックのシーケンス番号が更新される、例２２３－２３３のいずれかに記載の方法。 Example 234: A method according to any of Examples 223-233, in which the sequence number of a block of threads is updated when an instruction is successfully executed and retired.
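As a rough, non-authoritative model of the flow described in Examples 225, 233, and 234, the sketch below saves registers on entry, reserves local storage, advances a sequence number as each instruction of the block retires, and restores the saved registers only when the end of the block is reached. The `BlockState` structure, the `run_block` function, and the `stop_after` parameter (which simulates an interrupt) are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# A block instruction is modeled as a function of (registers, local storage).
Instr = Callable[[Dict[str, int], List[int]], None]


@dataclass
class BlockState:
    saved_regs: Dict[str, int] = field(default_factory=dict)
    local_storage: List[int] = field(default_factory=list)
    sequence_number: int = 0   # progress: number of retired instructions
    finished: bool = False


def run_block(regs: Dict[str, int], block: List[Instr],
              state: Optional[BlockState] = None, local_size: int = 4,
              stop_after: Optional[int] = None) -> BlockState:
    if state is None:
        # Entering the relaxed mode: save registers, reserve local storage.
        state = BlockState(saved_regs=dict(regs), local_storage=[0] * local_size)
    executed = 0
    while state.sequence_number < len(block):
        if stop_after is not None and executed == stop_after:
            # End not reached: intermediate results stay in local storage.
            return state
        block[state.sequence_number](regs, state.local_storage)
        state.sequence_number += 1   # updated when the instruction retires
        executed += 1
    # End reached: restore registers from the saved state area.
    regs.clear()
    regs.update(state.saved_regs)
    state.finished = True
    return state
```

Calling `run_block` again with the returned `state` resumes from the recorded sequence number, which is one way to picture how progress made before an interrupt can be reused.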

例２３５：アクセラレータ終了命令が実行し、リタイアした場合、実行の異なるモードの終了に到達しない、例２２３－２３４のいずれかに記載の方法。 Example 235: A method as in any of Examples 223-234, in which the end of the different mode of execution is not reached when an accelerator end instruction executes and retires.

例２３６：アクセラレータ終了命令実行により判断されたときに、実行の異なるモードの終了に到達しなかった場合、中間結果を用いてブロックの一部を実行しようと試みる、例２２３－２３５のいずれかに記載の方法。 Example 236: The method of any of Examples 223-235, in which, if the end of the different mode of execution has not been reached as determined by accelerator end instruction execution, execution of a portion of the block is attempted using the intermediate results.

例２３７：非アクセラレータ処理要素は、例外又は割込み後に中間結果と共に実行するために用いられる、例２３６の方法。 Example 237: The method of example 236, wherein the non-accelerator processing element is used to execute with the intermediate results after an exception or interrupt.

例２３８：実行の異なるモードの終了に到達しなかった場合、アクセラレータの利用が開始したポイントに実行をロールバックする、例２２３－２３７のいずれかに記載の方法。 Example 238: A method according to any of Examples 223-237, where if the end of a different mode of execution has not been reached, the execution is rolled back to the point where the use of the accelerator began.

例２３９：オペコード、第１のパックドデータソースオペランド用のフィールド、第２から第Ｎのパックドデータソースオペランド用の１又は複数のフィールド、及び、パックドデータ宛先オペランド用のフィールドを有する命令をデコードするデコーダと、第２から第Ｎのパックドデータソースオペランドのパックドデータ要素の位置ごとに、１）そのパックドデータソースオペランドのそのパックドデータ要素の位置のデータ要素に、第１のパックドデータソースオペランドの対応するパックドデータ要素位置のデータ要素を掛けて、一時的な結果を生成し、２）一時的な結果を合計し、３）一時的な結果の合計をパックドデータ宛先オペランドの対応するパックドデータ要素位置のデータ要素に加え、４）パックドデータ宛先オペランドの対応するパックドデータ要素位置のデータ要素に対する一時的な結果の合計を、パックドデータ宛先オペランドの対応するパックドデータ要素位置に格納するように、デコードされた命令を実行する実行回路とを含むシステム。 Example 239: A system including a decoder that decodes an instruction having an opcode, a field for a first packed data source operand, one or more fields for second through Nth packed data source operands, and a field for a packed data destination operand, and an execution circuit that executes the decoded instruction to, for each packed data element position of the second through Nth packed data source operands, 1) multiply the data element at that packed data element position of that packed data source operand by the data element at the corresponding packed data element position of the first packed data source operand to generate a temporary result, 2) sum the temporary results, 3) add the sum of the temporary results to the data element at the corresponding packed data element position of the packed data destination operand, and 4) store the sum of the temporary results and the data element at the corresponding packed data element position of the packed data destination operand in the corresponding packed data element position of the packed data destination operand.
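The per-element arithmetic recited in Example 239 can be illustrated with a short software model. The function name `packed_multiply_accumulate` and the use of Python lists for packed operands are assumptions of this sketch; the disclosure describes an instruction executed by execution circuitry, not a library routine.

```python
from typing import List, Sequence


def packed_multiply_accumulate(dest: Sequence[int],
                               src1: Sequence[int],
                               others: Sequence[Sequence[int]]) -> List[int]:
    """Model of the instruction: `others` holds source operands 2..N.

    For each element position j:
      1) multiply others[i][j] by src1[j] for every i (temporary results),
      2) sum the temporary results,
      3) add the sum to dest[j],
      4) store the result at position j of the destination.
    """
    width = len(dest)
    if len(src1) != width or any(len(s) != width for s in others):
        raise ValueError("all packed operands must have the same element count")
    result = []
    for j in range(width):
        temporaries = [s[j] * src1[j] for s in others]   # step 1
        total = sum(temporaries)                          # step 2
        result.append(dest[j] + total)                    # steps 3 and 4
    return result
```

For example, with `dest = [1, 1]`, `src1 = [2, 3]`, and two further sources `[1, 1]` and `[10, 10]`, position 0 yields 1 + (1·2 + 10·2) = 23 and position 1 yields 1 + (1·3 + 10·3) = 34.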

例２４０：Ｎはオペコードにより示される、例２３９に記載のシステム。 Example 240: The system described in Example 239, where N is indicated by an opcode.

例２４１：ソースオペランドの値は、乗算加算器アレイのレジスタにコピーされる、例２３９－２４０のいずれかに記載のシステム。 Example 241: A system as described in any of Examples 239-240, in which the value of the source operand is copied to a register of the multiplier-adder array.

例２４２：実行回路は２分木低減ネットワークを含む、例２３９－２４１のいずれかに記載のシステム。 Example 242: A system as described in any of Examples 239-241, wherein the execution circuitry includes a binary tree reduction network.

例２４３：実行回路はアクセラレータの一部である、例２４２のいずれかに記載のシステム。 Example 243: The system of Example 242, wherein the execution circuitry is part of an accelerator.

例２４４：２分木低減ネットワークは、対で加算回路の第１セットに結合される複数の乗算回路を有し、加算回路の第１セットは、パックドデータ宛先オペランドの対応するパックドデータ要素位置のデータ要素にも結合される加算回路の第３セットに結合される加算回路の第２セットに結合される、例２４２に記載のシステム。 Example 244: The system of Example 242, wherein the binary tree reduction network has a plurality of multiplier circuits coupled in pairs to a first set of adder circuits, the first set of adder circuits coupled to a second set of adder circuits coupled to a third set of adder circuits that are also coupled to data elements in corresponding packed data element positions of the packed data destination operand.

例２４５：各乗算は並列に処理される、例２４４に記載のシステム。 Example 245: The system described in Example 244, in which each multiplication is processed in parallel.
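
The reduction structure of Examples 242 to 245 can be modelled as a pairwise adder tree. The sketch below is an assumption-laden illustration: the name `tree_reduce_accumulate` is hypothetical, zero-padding of an odd level is a choice made here, and the final adder stage is assumed to take the destination element as an extra input. With eight products it yields four sums (first adder set), then two sums (second adder set), then one final addition that also folds in the destination element (third adder set).

```python
def tree_reduce_accumulate(products, accumulator):
    """Pairwise (binary tree) reduction of products, then accumulate (hypothetical model)."""
    level = list(products)
    if not level:
        return accumulator
    # Each loop iteration models one set of adders that halves the number of values.
    while len(level) > 2:
        if len(level) % 2:
            level.append(0)  # pad an odd level so that every adder has two inputs
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    # Final adder stage: the remaining partial sums plus the destination element.
    return sum(level) + accumulator
```

In hardware the multiplications feeding the tree would run in parallel; the list passed as `products` stands in for the outputs of those multiplier circuits.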

例２４６：パックドデータ要素は、１又は複数の行列の成分に対応する、例２３９－２４５のいずれかに記載のシステム。 Example 246: A system according to any of Examples 239-245, in which the packed data elements correspond to elements of one or more matrices.

例２４７：オペコード、第１のパックドデータソースオペランド用のフィールド、第２から第Ｎのパックドデータソースオペランド用の１又は複数のフィールド及びパックドデータ宛先オペランド用のフィールドを有する命令をデコーディングする段階と、第２から第Ｎのパックドデータソースオペランドのパックドデータ要素の位置ごとに、１）そのパックドデータソースオペランドのそのパックドデータ要素の位置のデータ要素に、第１のパックドデータソースオペランドの対応するパックドデータ要素位置のデータ要素を掛けて、一時的な結果を生成し、２）一時的な結果を合計し、３）パックドデータ宛先オペランドの対応するパックドデータ要素位置のデータ要素に一時的な結果の合計を加え、４）パックドデータ宛先オペランドの対応するパックドデータ要素位置のデータ要素に対する一時的な結果の合計を、パックドデータ宛先オペランドの対応するパックドデータ要素位置に格納するように、デコードされた命令を実行する段階とを含む方法。 Example 247: A method comprising: decoding an instruction having an opcode, a field for a first packed data source operand, one or more fields for the second through Nth packed data source operands, and a field for a packed data destination operand; and executing the decoded instruction to, for each packed data element position of the second through Nth packed data source operands, 1) multiply a data element for that packed data element position of the packed data source operand by a data element for a corresponding packed data element position of the first packed data source operand to generate a temporary result; 2) sum the temporary results; 3) add the sum of the temporary results to a data element for the corresponding packed data element position of the packed data destination operand; and 4) store the sum of the temporary results for the data element for the corresponding packed data element position of the packed data destination operand in the corresponding packed data element position of the packed data destination operand.

例２４８：Ｎはオペコードにより示される、例２４７に記載の方法。 Example 248: The method of example 247, where N is indicated by an opcode.

例２４９：ソースオペランドの値は、乗算加算器アレイのレジスタにコピーされる、例２４７－２４８のいずれかに記載の方法。 Example 249: The method of any of Examples 247-248, in which the value of the source operand is copied to a register of the multiplier-adder array.

例２５０：実行回路は２分木低減ネットワークを含む、例２４７－２４９のいずれかに記載の方法。 Example 250: The method of any of Examples 247-249, wherein the execution circuitry includes a binary tree reduction network.

例２５１：２分木低減ネットワークは、対で加算回路の第１セットに結合される複数の乗算回路を有し、加算回路の第１セットは、パックドデータ宛先オペランドの対応するパックドデータ要素位置のデータ要素にも結合される加算回路の第３セットに結合される加算回路の第２セットに結合される、例２４７に記載の方法。 Example 251: The method of example 247, wherein the binary tree reduction network has a plurality of multiplier circuits coupled in pairs to a first set of adder circuits, the first set of adder circuits coupled to a second set of adder circuits coupled to a third set of adder circuits that are also coupled to data elements in corresponding packed data element positions of the packed data destination operand.

例２５２：各パックドデータオペランドは、８つのパックドデータ要素を有する、例２５１に記載の方法。 Example 252: The method of example 251, wherein each packed data operand has eight packed data elements.

例２５３：各乗算は並列に処理される、例２５１に記載の方法。 Example 253: The method of example 251, in which each multiplication is processed in parallel.

例２５４：パックドデータ要素は、１又は複数の行列の成分に対応する、例２４７－２５３のいずれかに記載の方法。 Example 254: A method according to any of Examples 247-253, in which the packed data elements correspond to elements of one or more matrices.

例２５５：プロセッサにより実行されるときに、プロセッサに方法を実行させる命令を格納する非一時的な機械可読媒体であって、方法は、オペコード、第１のパックドデータソースオペランド用のフィールド、第２から第Ｎのパックドデータソースオペランド用の１又は複数のフィールド、及び、パックドデータ宛先オペランド用のフィールドを有する命令をデコーディングする段階と、第２から第Ｎのパックドデータソースオペランドのパックドデータ要素の位置ごとに、１）そのパックドデータソースオペランドのそのパックドデータ要素の位置のデータ要素に、第１のパックドデータソースオペランドの対応するパックドデータ要素位置のデータ要素を掛けて一時的な結果を生成し、２）一時的な結果を合計し、３）一時的な結果の合計をパックドデータ宛先オペランドの対応するパックドデータ要素位置のデータ要素に加え、４）パックドデータ宛先オペランドの対応するパックドデータ要素位置のデータ要素に対する一時的な結果の合計をパックドデータ宛先オペランドの対応するパックドデータ要素位置に格納するように、デコードされた命令を実行する段階とを含む、非一時的な機械可読媒体。 Example 255: A non-transitory machine-readable medium storing instructions that, when executed by a processor, cause the processor to perform a method, the method including: decoding an instruction having an opcode, a field for a first packed data source operand, one or more fields for the second through Nth packed data source operands, and a field for a packed data destination operand; and executing the decoded instructions to, for each packed data element position of the second through Nth packed data source operands, 1) multiply a data element for that packed data element position of the packed data source operand by a data element for a corresponding packed data element position of the first packed data source operand to generate a temporary result; 2) sum the temporary results; 3) add the sum of the temporary result to a data element for the corresponding packed data element position of the packed data destination operand; and 4) store the sum of the temporary result for the data element for the corresponding packed data element position of the packed data destination operand in the corresponding packed data element position of the packed data destination operand.

例２５６：Ｎはオペコードにより示される、例２５５に記載の非一時的な機械可読媒体。 Example 256: A non-transitory machine-readable medium as described in Example 255, wherein N is indicated by an opcode.

例２５７：ソースオペランドの値は、乗算加算器アレイのレジスタにコピーされる、例２５５－２５６のいずれかに記載の非一時的な機械可読媒体。 Example 257: The non-transitory machine-readable medium of any of Examples 255-256, wherein the value of the source operand is copied to a register of a multiplier-adder array.

例２５８：実行回路は２分木低減ネットワークを含む、例２５５－２５７のいずれかに記載の非一時的な機械可読媒体。 Example 258: A non-transitory machine-readable medium according to any of Examples 255-257, wherein the execution circuitry includes a binary tree reduction network.

例２５９：２分木低減ネットワークは、対で加算回路の第１セットに結合される複数の乗算回路を有し、加算回路の第１セットは、パックドデータ宛先オペランドの対応するパックドデータ要素位置のデータ要素にも結合される加算回路の第３セットに結合される加算回路の第２セットに結合される、例２５８に記載の非一時的な機械可読媒体。 Example 259: The non-transitory machine-readable medium of Example 258, wherein the binary tree reduction network has a plurality of multiplier circuits coupled in pairs to a first set of adder circuits, the first set of adder circuits coupled to a second set of adder circuits coupled to a third set of adder circuits that are also coupled to data elements at corresponding packed data element positions of a packed data destination operand.

例２６０：各パックドデータオペランドは、８つのパックドデータ要素を有する、例２５９に記載の非一時的な機械可読媒体。 Example 260: The non-transitory machine-readable medium of example 259, wherein each packed data operand has eight packed data elements.

例２６１：各乗算は並列に処理される、例２５９に記載の非一時的な機械可読媒体。 Example 261: A non-transitory machine-readable medium as described in example 259, wherein each multiplication is processed in parallel.

例２６２：パックドデータ要素は、１又は複数の行列の成分に対応する、例２５５－２６１のいずれかに記載の非一時的な機械可読媒体。 Example 262: A non-transitory machine-readable medium according to any of Examples 255-261, in which packed data elements correspond to elements of one or more matrices.

例２６３：オペコード、第１のパックドデータソースオペランド用のフィールド、第２から第Ｎのパックドデータソースレジスタオペランド用の１又は複数のフィールド、及び、パックドデータ宛先オペランド用のフィールドを有する命令をデコーディングする段階と、第２から第Ｎのパックドデータソースオペランドのパックドデータ要素の位置ごとに、１）そのパックドデータソースオペランドのそのパックドデータ要素の位置のデータ要素に、第１のパックドデータソースオペランドの対応するパックドデータ要素位置のデータ要素を掛けて、一時的な結果を生成し、２）一時的な結果を合計し、３）一時的な結果の合計をパックドデータ宛先オペランドの対応するパックドデータ要素位置のデータ要素に加え、４）パックドデータ宛先オペランドの対応するパックドデータ要素位置のデータ要素に対する一時的な結果の合計を格納するように、デコードされた命令を実行する段階とを含む方法。 Example 263: A method comprising: decoding an instruction having an opcode, a field for a first packed data source operand, one or more fields for second through Nth packed data source register operands, and a field for a packed data destination operand; and executing the decoded instruction to, for each packed data element position of the second through Nth packed data source operands, 1) multiply a data element for that packed data element position of the packed data source operand by a data element for a corresponding packed data element position of the first packed data source operand to generate a temporary result; 2) sum the temporary results; 3) add the sum of the temporary results to a data element for the corresponding packed data element position of the packed data destination operand; and 4) store the sum of the temporary result for the data element for the corresponding packed data element position of the packed data destination operand.

例２６４：Ｎはオペコードにより示される、例２６３に記載の方法。 Example 264: The method of example 263, where N is indicated by an opcode.

例２６５：ソースオペランドの値は、乗算加算器アレイのレジスタにコピーされる、例２６３－２６４のいずれかに記載の方法。 Example 265: The method of any of Examples 263-264, in which the value of the source operand is copied to a register of the multiplier-adder array.

例２６６：実行回路は２分木低減ネットワークである、例２６５に記載の方法。 Example 266: The method of example 265, wherein the execution circuit is a binary tree reduction network.

例２６７：２分木低減ネットワークは、対で加算回路の第１セットに結合される複数の乗算回路を有し、加算回路の第１セットは、パックドデータ宛先オペランドの対応するパックドデータ要素位置のデータ要素にも結合される加算回路の第３セットに結合される加算回路の第２セットに結合される、例２６６に記載の方法。 Example 267: The method of example 266, wherein the binary tree reduction network has a plurality of multiplier circuits coupled in pairs to a first set of adder circuits, the first set of adder circuits coupled to a second set of adder circuits coupled to a third set of adder circuits that are also coupled to data elements in corresponding packed data element positions of the packed data destination operand.

例２６８：各パックドデータオペランドは、８つのパックドデータ要素を有する、例２６３－２６７のいずれかに記載の方法。 Example 268: A method as described in any of Examples 263-267, wherein each packed data operand has eight packed data elements.

例２６９：各乗算は並列に処理される、例２６８－２６８のいずれかに記載の方法。 Example 269: The method of Example 268, in which each multiplication is processed in parallel.

例２７０：プロセッサにより実行されるときに、プロセッサに方法を実行させる命令を格納する非一時的な機械可読媒体であって、方法は、オペコード、第１のパックドデータソースオペランド用のフィールド、第２から第Ｎのパックドデータソースレジスタオペランド用の１又は複数のフィールド、及び、パックドデータ宛先オペランド用のフィールドを有する命令をデコーディングする段階と、第２から第Ｎのパックドデータソースオペランドのパックドデータ要素の位置ごとに、１）そのパックドデータソースオペランドのそのパックドデータ要素の位置のデータ要素に、第１のパックドデータソースオペランドの対応するパックドデータ要素位置のデータ要素を掛けて、一時的な結果を生成し、２）一時的な結果を合計し、３）一時的な結果の合計をパックドデータ宛先オペランドの対応するパックドデータ要素位置のデータ要素に加え、４）パックドデータ宛先オペランドの対応するパックドデータ要素位置のデータ要素に対する一時的な結果の合計を格納するように、デコードされた命令を実行する段階とを含む、非一時的な機械可読媒体。 Example 270: A non-transitory machine-readable medium storing instructions that, when executed by a processor, cause the processor to perform a method, the method including: decoding an instruction having an opcode, a field for a first packed data source operand, one or more fields for second through Nth packed data source register operands, and a field for a packed data destination operand; and executing the decoded instructions to, for each packed data element position of the second through Nth packed data source operands, 1) multiply a data element for that packed data element position of the packed data source operand by a data element for a corresponding packed data element position of the first packed data source operand to generate a temporary result; 2) sum the temporary results; 3) add the sum of the temporary result to a data element for the corresponding packed data element position of the packed data destination operand; and 4) store the sum of the temporary result for the data element for the corresponding packed data element position of the packed data destination operand.

例２７１：Ｎはオペコードにより示される、例２７０に記載の非一時的な機械可読媒体。 Example 271: A non-transitory machine-readable medium as described in Example 270, where N is indicated by an opcode.

例２７２：ソースオペランドの値は、乗算加算器アレイのレジスタにコピーされる、例２７０－２７１のいずれかに記載の非一時的な機械可読媒体。 Example 272: The non-transitory machine-readable medium of any of Examples 270-271, wherein the value of the source operand is copied to a register of a multiplier-adder array.

例２７３：実行回路は２分木低減ネットワークである、例２７２に記載の非一時的な機械可読媒体。 Example 273: The non-transitory machine-readable medium of Example 272, wherein the execution circuit is a binary tree reduction network.

例２７４：２分木低減ネットワークは、対で加算回路の第１セットに結合される複数の乗算回路を有し、加算回路の第１セットは、パックドデータ宛先オペランドの対応するパックドデータ要素位置のデータ要素にも結合される加算回路の第３セットに結合される加算回路の第２セットに結合される、例２７２に記載の非一時的な機械可読媒体。 Example 274: The non-transitory machine-readable medium of Example 272, wherein the binary tree reduction network has a plurality of multiplier circuits coupled in pairs to a first set of adder circuits, the first set of adder circuits coupled to a second set of adder circuits coupled to a third set of adder circuits that are also coupled to data elements in corresponding packed data element positions of a packed data destination operand.

例２７５：各パックドデータオペランドは、８つのパックドデータ要素を有する、例２７０－２７４のいずれかに記載の非一時的な機械可読媒体。 Example 275: A non-transitory machine-readable medium according to any of Examples 270-274, wherein each packed data operand has eight packed data elements.

例２７６：各乗算は並列に処理される、例２７０－２７５のいずれかに記載の非一時的な機械可読媒体。 Example 276: A non-transitory machine-readable medium according to any of Examples 270-275, wherein each multiplication is processed in parallel.

例２７７：オペコード、第１のパックドデータソースオペランド用のフィールド、第２から第Ｎのパックドデータソースレジスタオペランド用の１又は複数のフィールド、及び、パックドデータ宛先オペランド用のフィールドを有する命令をデコードするデコーダと、第２から第Ｎのパックドデータソースオペランドのパックドデータ要素の位置ごとに、１）そのパックドデータソースオペランドのそのパックドデータ要素の位置のデータ要素に、第１のパックドデータソースオペランドの対応するパックドデータ要素位置のデータ要素を掛けて一時的な結果を生成し、２）対で一時的な結果を合計し、３）一時的な結果の合計をパックドデータ宛先オペランドの対応するパックドデータ要素位置のデータ要素に加え、４）パックドデータ宛先オペランドの対応するパックドデータ要素位置のデータ要素に対する一時的な結果の合計を、パックドデータ宛先オペランドの対応するパックドデータ要素位置に格納するように、デコードされた命令を実行する実行回路とを含むシステム。 Example 277: A system including a decoder for decoding an instruction having an opcode, a field for a first packed data source operand, one or more fields for the second through Nth packed data source register operands, and a field for a packed data destination operand, and an execution circuit for executing the decoded instruction to, for each packed data element position of the second through Nth packed data source operands, 1) multiply a data element at that packed data element position of the packed data source operand by a data element at a corresponding packed data element position of the first packed data source operand to generate a temporary result, 2) sum the temporary results in pairs, 3) add the sum of the temporary result to a data element at the corresponding packed data element position of the packed data destination operand, and 4) store the sum of the temporary result for the data element at the corresponding packed data element position of the packed data destination operand in the corresponding packed data element position of the packed data destination operand.

例２７８：Ｎはオペコードにより示される、例２７７に記載のシステム。 Example 278: The system described in Example 277, where N is indicated by an opcode.

例２７９：ソースオペランドの値は、乗算加算器アレイのレジスタにコピーされる、例２７７－２７８のいずれかに記載のシステム。 Example 279: A system as described in any of Examples 277-278, in which the value of the source operand is copied to a register of the multiplier-adder array.

例２８０：実行回路は２分木低減ネットワークである、例２７９に記載のシステム。 Example 280: The system described in Example 279, wherein the execution circuit is a binary tree reduction network.

例２８１：２分木低減ネットワークは、対で加算回路の第１セットに結合される複数の乗算回路を有し、加算回路の第１セットは、パックドデータ宛先オペランドの対応するパックドデータ要素位置のデータ要素にも結合される加算回路の第３セットに結合される加算回路の第２セットに結合される、例２７９に記載のシステム。 Example 281: The system of Example 279, wherein the binary tree reduction network has a plurality of multiplier circuits coupled in pairs to a first set of adder circuits, the first set of adder circuits coupled to a second set of adder circuits coupled to a third set of adder circuits that are also coupled to data elements in corresponding packed data element positions of the packed data destination operand.

例２８２：各パックドデータオペランドは、８つのパックドデータ要素を有する、例２７７－２８１のいずれかに記載のシステム。 Example 282: A system as described in any of Examples 277-281, wherein each packed data operand has eight packed data elements.

例２８３：各乗算は並列に処理される、例２７７－２８２のいずれかに記載のシステム。 Example 283: A system as described in any of Examples 277-282, in which each multiplication is processed in parallel.

例２８４：ホストプロセッサにアクセラレータを結合するマルチプロトコルバスインタフェースを含むアクセラレータであって、コマンドを処理する１又は複数の処理要素を含む、アクセラレータと、複数のクライアントによりサブミットされるワーク記述子を格納する複数のエントリを含む共有のワークキューであって、ワーク記述子は、ワーク記述子と、１又は複数の処理要素により実行される少なくとも１つのコマンドと、アドレッシング情報とをサブミットしたクライアントを識別する識別コードを含む、共有のワークキューと、特定のアービトレーションポリシに従って、共有のワークキューから１又は複数の処理要素にワーク記述子をディスパッチするアービタとを含み、１又は複数の処理要素のそれぞれは、アービタからディスパッチされたワーク記述子を受信し、ソース及び宛先アドレス変換を実行し、ソースアドレス変換により識別されたソースデータを読み出し、少なくとも１つのコマンドを実行して宛先データを生成し、宛先アドレス変換を用いてメモリに宛先データを書き込む、システム。 Example 284: A system including an accelerator including a multi-protocol bus interface coupling the accelerator to a host processor, the accelerator including one or more processing elements for processing commands; a shared work queue including a plurality of entries for storing work descriptors submitted by a plurality of clients, the work descriptor including an identification code identifying a client that submitted the work descriptor, at least one command to be executed by the one or more processing elements, and addressing information; and an arbiter that dispatches the work descriptors from the shared work queue to one or more processing elements according to a particular arbitration policy, each of the one or more processing elements receiving the work descriptors dispatched from the arbiter, performing source and destination address translations, reading source data identified by the source address translation, executing at least one command to generate destination data, and writing the destination data to a memory using the destination address translation.

例２８５：複数のクライアントは、直接ユーザモード入力／出力（ＩＯ）要求をアクセラレータにサブミットするユーザモードアプリケーション、アクセラレータを共有する仮想マシン（ＶＭ）において実行するカーネルモードドライバ、及び／又は、複数のコンテナにおいて実行するソフトウェアエージェントのうちの１又は複数を有する、例２８４に記載のシステム。 Example 285: The system of Example 284, wherein the multiple clients include one or more of user mode applications that directly submit user mode input/output (IO) requests to the accelerator, kernel mode drivers that execute in virtual machines (VMs) that share the accelerator, and/or software agents that execute in multiple containers.

例２８６：複数のクライアントのうちの少なくとも１つのクライアントは、ＶＭ内で実行されるユーザモードアプリケーション又はコンテナを有する、例２８５に記載のシステム。 Example 286: The system of example 285, wherein at least one of the multiple clients has a user-mode application or container running within a VM.

例２８７：クライアントは、ピア入力／出力（ＩＯ）エージェント及び／又はソフトウェアチェーンオフロード要求のうちの１又は複数を有する、例２８４－２８６のいずれかに記載のシステム。 Example 287: The system of any of Examples 284-286, wherein the client has one or more of a peer input/output (IO) agent and/or a software chain offload request.

例２８８：ピアＩＯエージェントのうちの少なくとも１つは、ネットワークインタフェースコントローラ（ＮＩＣ）を有する、例２８７に記載のシステム。 Example 288: The system of Example 287, wherein at least one of the peer IO agents has a network interface controller (NIC).

例２８９：１又は複数の処理要素により使用可能な仮想－物理アドレス変換を格納するアドレス変換キャッシュをさらに含む、例２８４－２８８のいずれかに記載のシステム。 Example 289: The system of any of Examples 284-288, further including an address translation cache that stores virtual-to-physical address translations usable by one or more processing elements.
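
The processing-element flow of Example 284, together with the address translation cache of Example 289, can be sketched as below. Everything named here (`AddressTranslationCache`, `process_descriptor`, the 4096-byte page size, and the dictionary used as a page table) is a hypothetical stand-in chosen for illustration; the sketch also assumes a transfer never crosses a page boundary.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class AddressTranslationCache:
    """Caches virtual-page to physical-page translations (hypothetical model)."""

    def __init__(self, page_table):
        self.page_table = page_table  # virtual page number -> physical page number
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        if vpage in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            self.cache[vpage] = self.page_table[vpage]  # model of a page-table walk
        return self.cache[vpage] * PAGE_SIZE + offset


def process_descriptor(memory, atc, src_vaddr, dst_vaddr, length, command):
    """Translate both addresses, read the source, run the command, write the result."""
    src = atc.translate(src_vaddr)
    dst = atc.translate(dst_vaddr)
    source_data = bytes(memory[src:src + length])
    destination_data = command(source_data)
    memory[dst:dst + len(destination_data)] = destination_data
    return dst
```

A second descriptor that touches the same pages would be served from the cache, which is the benefit the address translation cache is meant to provide.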

例２９０：特定のアービトレーションポリシは、先入先出ポリシを有する、例２８４－２８９のいずれかに記載のシステム。 Example 290: A system as described in any of Examples 284-289, in which the particular arbitration policy is a first-in, first-out policy.

例２９１：特定のアービトレーションポリシは、第１のクライアントのワーク記述子が第２のクライアントのワーク記述子を上回る優先度が与えられるサービス品質（ＱｏＳ）ポリシを有する、例２８４－２９０のいずれかに記載のシステム。 Example 291: The system of any of Examples 284-290, wherein the particular arbitration policy comprises a quality of service (QoS) policy in which a work descriptor of a first client is given priority over a work descriptor of a second client.

例２９２：たとえ第２のクライアントのワーク記述子が、第１のクライアントのワーク記述子の前に共有のワークキューに受信されていたとしても、第１のクライアントのワーク記述子は、第２のクライアントのワーク記述子の前に１又は複数の処理要素にディスパッチされる、例２９１に記載のシステム。 Example 292: The system of example 291, in which a work descriptor of a first client is dispatched to one or more processing elements before a work descriptor of a second client, even if the work descriptor of the second client is received in the shared work queue before the work descriptor of the first client.
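
The difference between the first-in, first-out policy of Example 290 and the QoS policy of Examples 291 and 292 can be illustrated with a small queue model. The class and field names below (`WorkDescriptor`, `SharedWorkQueue`, `Arbiter`, `priority_by_pasid`) are assumptions made for this sketch, as is the representation of a QoS policy as a per-client priority number.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class WorkDescriptor:
    pasid: int      # identification code of the submitting client
    command: str    # command to be executed by a processing element
    src_addr: int   # addressing information
    dst_addr: int


class SharedWorkQueue:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries: List[WorkDescriptor] = []

    def submit(self, descriptor: WorkDescriptor) -> bool:
        """Accept a descriptor from any client; refuse it when the queue is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(descriptor)
        return True


class Arbiter:
    def __init__(self, queue: SharedWorkQueue,
                 priority_by_pasid: Optional[Dict[int, int]] = None):
        self.queue = queue
        self.priority_by_pasid = priority_by_pasid  # None selects plain FIFO

    def dispatch(self) -> Optional[WorkDescriptor]:
        entries = self.queue.entries
        if not entries:
            return None
        if self.priority_by_pasid is None:
            index = 0  # FIFO: oldest entry first
        else:
            # QoS: highest client priority first, arrival order breaks ties
            index = min(
                range(len(entries)),
                key=lambda i: (-self.priority_by_pasid.get(entries[i].pasid, 0), i),
            )
        return entries.pop(index)
```

Under the QoS policy a descriptor from a higher-priority client is dispatched ahead of an earlier descriptor from a lower-priority client, which is the behaviour Example 292 describes.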

例２９３：識別コードは、クライアントに割り当てられるシステムメモリ内のアドレス空間を識別する処理アドレス空間識別子（ＰＡＳＩＤ）を有する、例２８４－２９２のいずれかに記載のシステム。 Example 293: A system as described in any of Examples 284-292, wherein the identification code includes a process address space identifier (PASID) that identifies an address space in system memory that is allocated to the client.

例２９４：１又は複数の専用のワークキューをさらに含み、各専用のワークキューは、専用のワークキューと関連付けられた単一のクライアントによりサブミットされたワーク記述子を格納する複数のエントリを含む、例２８４－２９３のいずれかに記載のシステム。 Example 294: The system of any of Examples 284-293, further including one or more dedicated work queues, each dedicated work queue including a plurality of entries storing work descriptors submitted by a single client associated with the dedicated work queue.

例２９５：グループ内の専用のワークキュー及び／又は共有のワークキューのうちの２又はそれより多くを組み合わせるためにプログラミングされるグループ構成レジスタをさらに含み、グループは、複数の処理要素のうちの１又は複数と関連付けられる、例２９４のシステム。 Example 295: The system of Example 294, further including a group configuration register programmed to combine two or more of the dedicated and/or shared work queues in a group, the group being associated with one or more of the plurality of processing elements.

例２９６：１又は複数の処理要素は、グループ内の専用のワークキュー及び／又は共有のワークキューからのワーク記述子を処理する、例２９５に記載のシステム。 Example 296: The system of example 295, wherein one or more processing elements process work descriptors from dedicated work queues and/or shared work queues within the group.

例２９７：マルチプロトコルバスインタフェースによりサポートされる第１のプロトコルは、システムメモリアドレス空間にアクセスするために用いられるメモリインタフェースプロトコルを有する、例２８４－２９６のいずれかに記載のシステム。 Example 297: A system as described in any of Examples 284-296, wherein a first protocol supported by the multi-protocol bus interface includes a memory interface protocol used to access a system memory address space.

例２９８：マルチプロトコルバスインタフェースによりサポートされる第２のプロトコルは、アクセラレータのローカルメモリに格納されるデータと、ホストキャッシュ階層及びシステムメモリを含むホストプロセッサのメモリサブシステムとの間のコヒーレンシを維持するキャッシュコヒーレンシプロトコルを有する、例２８４－２９７のいずれかに記載のシステム。 Example 298: A system as described in any of Examples 284-297, wherein the second protocol supported by the multi-protocol bus interface comprises a cache coherency protocol that maintains coherency between data stored in the accelerator's local memory and the host processor's memory subsystem, including the host cache hierarchy and the system memory.

例２９９：マルチプロトコルバスインタフェースによりサポートされる第３のプロトコルは、デバイス発見、レジスタアクセス、構成、初期化、割込み、ダイレクトメモリアクセス及びアドレス変換サービスをサポートする直列リンクプロトコルを有する、例２８４－２９８のいずれかに記載のシステム。 Example 299: The system of any of Examples 284-298, wherein the third protocol supported by the multiprotocol bus interface includes a serial link protocol supporting device discovery, register access, configuration, initialization, interrupts, direct memory access, and address translation services.

例３００：第３のプロトコルは、ペリフェラルコンポーネントインタフェースエクスプレス（ＰＣＩｅ）プロトコルを有する、例２９９に記載のシステム。 Example 300: The system of Example 299, wherein the third protocol comprises a Peripheral Component Interconnect Express (PCIe) protocol.

例３０１：処理要素により処理されるソースデータを格納し、１又は複数の処理要素による処理から生じた宛先データを格納するアクセラレータメモリをさらに含む、例２８４－３００のいずれかに記載のシステム。 Example 301: The system of any of Examples 284-300, further including an accelerator memory that stores source data to be processed by the processing elements and stores destination data resulting from processing by one or more processing elements.

例３０２：アクセラレータメモリは、高帯域幅メモリ（ＨＢＭ）を有する、例３０１に記載のシステム。 Example 302: The system of example 301, wherein the accelerator memory comprises high bandwidth memory (HBM).

例３０３：アクセラレータメモリは、ホストプロセッサにより用いられるシステムメモリアドレス空間の第１の部分に割り当てられる、例３０１に記載のシステム。 Example 303: The system of example 301, wherein the accelerator memory is allocated in a first portion of the system memory address space used by the host processor.

例３０４：システムメモリアドレス空間の第２の部分に割り当てられるホストメモリをさらに含む、例３０３に記載のシステム。 Example 304: The system of example 303, further including host memory allocated to a second portion of the system memory address space.

例３０５：システムメモリアドレス空間に格納されたデータのブロックごとに、ブロック内に含まれるデータがアクセラレータに向けてバイアスがかけられているか否かを示すバイアス回路及び／又は論理をさらに含む、例３０４に記載のシステム。 Example 305: The system of example 304, further comprising bias circuitry and/or logic for indicating, for each block of data stored in the system memory address space, whether the data contained within the block is biased towards the accelerator.

例３０６：データの各ブロックはメモリページを有する、例３０５に記載のシステム。 Example 306: The system of example 305, wherein each block of data has a memory page.

例３０７：ホストは、まずアクセラレータに要求を送信することなく、アクセラレータに向けてバイアスがかけられているデータを処理することを控える、例３０５に記載のシステム。 Example 307: The system of example 305, wherein the host refrains from processing data that is biased toward the accelerator without first sending a request to the accelerator.

例３０８：バイアス回路及び／又は論理は、アクセラレータに向けたバイアスを示すために、データの固定サイズのブロック毎に設定される１ビットを含むバイアステーブルを含む、例３０７に記載のシステム。 Example 308: The system of example 307, wherein the bias circuitry and/or logic includes a bias table that includes one bit that is set for each fixed-size block of data to indicate a bias toward the accelerator.

例３０９：アクセラレータは、アクセラレータメモリに格納されるデータと関連付けられた１又は複数のデータコヒーレンシなトランザクションを実行するホストプロセッサのコヒーレンスコントローラと通信するメモリコントローラを有する、例３０１－３０８のいずれかに記載のシステム。 Example 309: A system as described in any of Examples 301-308, in which the accelerator has a memory controller in communication with a coherency controller of a host processor that executes one or more data coherent transactions associated with data stored in the accelerator memory.

例３１０：メモリコントローラは、アクセラレータに向けられたバイアスに設定されるアクセラレータメモリに格納されるデータのブロックにアクセスするデバイスバイアスモードで動作し、デバイスバイアスモードにある場合、メモリコントローラは、ホストプロセッサのキャッシュコヒーレンスコントローラに問い合わせることなく、アクセラレータメモリにアクセスする、例３０９に記載のシステム。 Example 310: The system of Example 309, wherein the memory controller operates in a device bias mode to access blocks of data, stored in the accelerator memory, that are set to a bias toward the accelerator, and when in the device bias mode, the memory controller accesses the accelerator memory without consulting a cache coherence controller of the host processor.

例３１１：メモリコントローラは、ホストプロセッサに向けたバイアスに設定されるデータのブロックにアクセスするホストバイアスモードで動作し、ホストバイアスモードにある場合、メモリコントローラは、ホストプロセッサ内のキャッシュコヒーレンスコントローラを通じてアクセラレータメモリにすべての要求を送信する、例３０９に記載のシステム。 Example 311: The system of example 309, wherein the memory controller operates in a host bias mode to access blocks of data that are biased toward the host processor, and when in the host bias mode, the memory controller sends all requests to the accelerator memory through a cache coherence controller in the host processor.
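
The one-bit-per-block bias table of Example 308 and the two access modes of Examples 310 and 311 can be modelled as follows. This is a minimal sketch under stated assumptions: the names `BiasTable` and `route_accelerator_access`, the 4096-byte block size, and the string labels for the two routes are all introduced here for illustration.

```python
class BiasTable:
    """One bit per fixed-size block: 1 = device (accelerator) bias, 0 = host bias."""

    def __init__(self, num_pages, page_size=4096):
        self.page_size = page_size
        self.bits = bytearray((num_pages + 7) // 8)

    def set_device_bias(self, page):
        self.bits[page // 8] |= 1 << (page % 8)

    def set_host_bias(self, page):
        self.bits[page // 8] &= ~(1 << (page % 8)) & 0xFF

    def is_device_biased(self, addr):
        page = addr // self.page_size
        return bool((self.bits[page // 8] >> (page % 8)) & 1)


def route_accelerator_access(table, addr):
    """Choose how the accelerator's memory controller reaches a given address."""
    if table.is_device_biased(addr):
        return "local"  # device bias: access accelerator memory directly
    return "via_host_coherence_controller"  # host bias: go through the host
```

Storing one bit per page keeps the table small; for example, 16 pages fit in two bytes.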

例３１２：共有のワークキューは、ワーク記述子のバッチを識別する少なくとも１つのバッチ記述子を格納する、例２８４－３１１のいずれかに記載のシステム。 Example 312: A system as described in any of examples 284-311, wherein the shared work queue stores at least one batch descriptor that identifies a batch of work descriptors.

例３１３：メモリからワーク記述子のバッチを読み出すことにより、バッチ記述子を処理するバッチ処理回路をさらに含む、例３１２に記載のシステム。 Example 313: The system of Example 312, further comprising a batch processing circuit for processing the batch descriptors by reading the batch of work descriptors from the memory.

例３１４：ワーク記述子は、命令の第１のタイプを実行するホストプロセッサに対応する専用のワークキューに追加され、ワーク記述子は、命令の第２のタイプを実行するホストプロセッサに対応する共有のワークキューに追加される、例２９２に記載のシステム。 Example 314: The system of example 292, wherein a work descriptor is added to a dedicated work queue corresponding to a host processor executing a first type of instruction, and a work descriptor is added to a shared work queue corresponding to a host processor executing a second type of instruction.

例３１５：デバイスバイアスでメモリページの第１セットを配置する段階と、ホストプロセッサに結合されるアクセラレータデバイスのローカルメモリからメモリページの第１セットを割り当てる段階と、ホストプロセッサのコア、又は、入力／出力エージェントから割り当てられたページにオペランドデータを転送する段階と、ローカルメモリを用いてアクセラレータデバイスによりオペランドを処理して、結果を生成する段階と、メモリページの第１セットをデバイスバイアスからホストバイアスに変換する段階とを含む方法。 Example 315: A method including arranging a first set of memory pages with a device bias; allocating the first set of memory pages from a local memory of an accelerator device coupled to a host processor; transferring operand data from a core or an input/output agent of the host processor to the allocated pages; processing the operands by the accelerator device using the local memory to generate results; and converting the first set of memory pages from the device bias to the host bias.

例３１６：デバイスバイアスでメモリページの第１セットを配置する段階は、ページがアクセラレータデバイスバイアスにあることを示すために、バイアステーブル内のメモリページの第１セットを更新する、例３１５に記載の方法。 Example 316: The method of example 315, wherein placing the first set of memory pages in the device bias includes updating the first set of memory pages in a bias table to indicate that the pages are in the accelerator device bias.

例３１７：エントリを更新する段階は、メモリページの第１セット内の各ページと関連付けられたビットを設定する段階を有する、例３１５－３１６のいずれかに記載の方法。 Example 317: The method of any of Examples 315-316, wherein updating the entry includes setting a bit associated with each page in the first set of memory pages.

例３１８：デバイスバイアスに設定されると、メモリページの第１セットは、ホストキャッシュメモリにキャッシュされないことが保証される、例３１５－３１７のいずれかに記載の方法。 Example 318: A method as described in any of examples 315-317, in which when set to device bias, the first set of memory pages is guaranteed not to be cached in the host cache memory.

例３１９：メモリページの第１セットを割り当てることは、ドライバ又はアプリケーションプログラミングインタフェース（ＡＰＩ）コールを開始する段階を有する、例３１５－３１８のいずれかに記載の方法。 Example 319: The method of any of Examples 315-318, wherein allocating the first set of memory pages includes initiating a driver or application programming interface (API) call.

例３２０：オペランドを処理するために、アクセラレータデバイスは、コマンドを実行して、そのローカルメモリから直接データを処理する、例３１５－３１９のいずれかに記載の方法。 Example 320: The method of any of Examples 315-319, in which, to process the operands, the accelerator device executes commands to process data directly from its local memory.

例３２１：割り当てられたページにオペランドデータを転送する段階は、アクセラレータデバイスに１又は複数のワーク記述子をサブミットする段階を有し、ワーク記述子は、オペランドを識別する又は含む、例３１５－３２０のいずれかに記載の方法。 Example 321: The method of any of Examples 315-320, wherein transferring operand data to the allocated pages includes submitting one or more work descriptors to the accelerator device, the work descriptors identifying or including the operands.

例３２２：１又は複数のワーク記述子は、割り当てられたページに、コマンドでホストプロセッサキャッシュからフラッシュさせてよい、例３２１に記載の方法。 Example 322: The method of example 321, in which one or more work descriptors may cause allocated pages to be flushed from the host processor cache upon command.

例３２３：ホストプロセッサは、メモリページの第１セットがホストバイアスに設定されている場合、結果にアクセスし、結果をキャッシュし、結果を共有することが許可されている、例３１５－３２２のいずれかに記載の方法。 Example 323: The method of any of Examples 315-322, wherein the host processor is permitted to access the results, cache the results, and share the results if the first set of memory pages is set to host bias.
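
The offload sequence of Examples 315 to 323 can be sketched as a small state model. The names `Bias`, `AcceleratorMemory`, and `offload` are hypothetical, pages hold a single Python value each for simplicity, and the sketch allocates pages before marking them device-biased, which is an ordering assumption made here rather than something the examples prescribe.

```python
from enum import Enum


class Bias(Enum):
    HOST = "host"
    DEVICE = "device"


class AcceleratorMemory:
    """Toy model of accelerator-local pages with a per-page bias setting."""

    def __init__(self, num_pages):
        self.bias = [Bias.HOST] * num_pages
        self.data = [None] * num_pages
        self.free = list(range(num_pages))

    def allocate(self, count):
        if count > len(self.free):
            raise MemoryError("not enough free pages in accelerator memory")
        return [self.free.pop(0) for _ in range(count)]

    def set_bias(self, pages, bias):
        for page in pages:
            self.bias[page] = bias


def offload(mem, operands, kernel):
    """Allocate pages, set device bias, load operands, compute, flip to host bias."""
    pages = mem.allocate(len(operands))
    mem.set_bias(pages, Bias.DEVICE)
    for page, operand in zip(pages, operands):
        mem.data[page] = operand  # transfer operand data into the allocated pages
    for page in pages:
        mem.data[page] = kernel(mem.data[page])  # accelerator works on local memory
    mem.set_bias(pages, Bias.HOST)  # the host may now access and cache the results
    return pages
```

After `offload` returns, every page it used is back in host bias, matching the condition under which Example 323 lets the host access, cache, and share the results.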

Claims

An apparatus comprising:
an accelerator;
a local memory having a plurality of stacked DRAM dies; and
a silicon bridge coupling the accelerator to the plurality of stacked DRAM dies,
wherein the accelerator and the plurality of stacked DRAM dies are connected through the silicon bridge,
wherein the accelerator comprises:
a plurality of processing elements for performing tasks assigned to them by an external processor;
a cache coherent interface coupling the accelerator to the external processor, the cache coherent interface ensuring that data stored in the local memory and/or accelerator cache is coherent with data stored in the external processor's cache and system memory; and
logic for mapping a virtual memory space onto a heterogeneous physical system memory including the local memory and the system memory,
wherein the accelerator and the external processor each access corresponding portions of the local memory and the system memory using the virtual memory space,
wherein the accelerator maintains a table for tracking memory portions that are biased to the external processor, and
wherein the table has an entry for each fixed-size memory block, the entry being set to indicate whether or not the memory block is biased to the external processor.

The apparatus of claim 1, wherein the cache coherent interface provides cache coherent connectivity between the accelerator and a core of the external processor.

The apparatus of claim 1 or 2, wherein the cache coherent interface performs snooping to detect the state of cache lines in the external processor's cache.

The apparatus of any one of claims 1 to 3, wherein the cache coherent interface provides snoop updates in response to accesses to, and attempts to modify, cache lines by the plurality of processing elements.

The apparatus of claim 1, wherein the accelerator accesses the table from the local memory or an accelerator cache, and the table indicates which memory portions are associated with the external processor.

The apparatus of any one of claims 1 to 5, wherein the table includes a plurality of entries, each entry indicating whether a corresponding memory portion is associated with the external processor.

The apparatus of claim 6, wherein each entry has at least one bit programmed to a first value to identify the accelerator and to a second value to identify the external processor.

The apparatus of claim 7, wherein when the at least one bit is programmed to the first value, the accelerator biases a memory page associated with a corresponding entry to the accelerator, and when the at least one bit is programmed to the second value, the accelerator biases the memory page to the external processor.

The apparatus of claim 8, wherein to bias the memory page to the accelerator, the accelerator is provided with access to the memory page from the local memory without first communicating with the external processor.

The apparatus of any one of claims 1 to 9, wherein the accelerator has a scheduler for scheduling operations on the plurality of processing elements.

The apparatus of any one of claims 1 to 10, wherein each of the plurality of processing elements includes hardware support for column- and row-wise matrix processing.

The apparatus of any one of claims 1 to 11, wherein the plurality of processing elements process different combinations of sparse matrices, dense matrices, sparse vectors, and dense vectors.

The apparatus of any one of claims 1 to 12, wherein the plurality of processing elements perform a matrix multiplication operation using matrix data elements.

The apparatus of any one of claims 1 to 13, wherein the accelerator executes a single instruction to perform a two-dimensional matrix multiplication operation.
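The bias table recited in the claims (one entry per fixed-size memory block, with at least one bit per entry) can be sketched in software as a packed bit array. This is a minimal illustration, not the claimed hardware. The 4 KiB block size, the encoding (0 for the accelerator, 1 for the external processor), and the names `BiasTable`, `set_bias`, `is_host_biased` and `accelerator_access` are all assumptions made for the example.

```python
BLOCK_SIZE = 4096   # assumed fixed block size in bytes; the claims do not give a value
ACCEL, HOST = 0, 1  # assumed encoding of the "first value" and "second value"


class BiasTable:
    """One bit per fixed-size memory block, packed eight entries to a byte."""

    def __init__(self, memory_bytes, block_size=BLOCK_SIZE):
        self.block_size = block_size
        self.num_blocks = (memory_bytes + block_size - 1) // block_size
        # All bits start at 0, so every block begins biased to the accelerator.
        self.bits = bytearray((self.num_blocks + 7) // 8)

    def _locate(self, addr):
        """Map a byte address to (byte index, bit index) within the table."""
        block = addr // self.block_size
        if not 0 <= block < self.num_blocks:
            raise IndexError("address outside tracked memory")
        return block // 8, block % 8

    def set_bias(self, addr, owner):
        """Program the entry for the block containing addr."""
        byte, bit = self._locate(addr)
        if owner == HOST:
            self.bits[byte] |= 1 << bit
        else:
            self.bits[byte] &= ~(1 << bit) & 0xFF

    def is_host_biased(self, addr):
        byte, bit = self._locate(addr)
        return bool((self.bits[byte] >> bit) & 1)


def accelerator_access(table, addr):
    """Choose the access path for an accelerator request to local memory."""
    if table.is_host_biased(addr):
        # The external processor may hold the line, so consult it first.
        return "ask-host-first"
    # Accelerator-biased: access local memory without contacting the host.
    return "local-direct"
```

With one bit per 4 KiB block, tracking 1 GiB of local memory would take 32 KiB of table storage in this layout, which gives a sense of why such a table could be small enough to be read from local memory or an accelerator cache.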