JP7052170B2 - Processor and system

Info

Publication number: JP7052170B2
Application number: JP2018218328A
Authority: JP
Inventors: シファー、エラン; ハゴグ、モスタファ; チュリエル、エリヤフ
Original assignee: Intel Corp
Current assignee: Intel Corp
Priority date: 2012-09-27
Filing date: 2018-11-21
Publication date: 2022-04-12
Anticipated expiration: 2033-06-12
Also published as: GB2568816A8; GB201500450D0; DE112013004751T5; US20190114176A1; GB2520852A; GB2568816B8; US10963263B2; CN109375949B; JP6240964B2; JP2017224342A; GB201816776D0; KR20160141001A; US11494194B2; US20190012178A1; US20170147353A1; WO2014051736A1; US9582287B2; CN115858017A; CN104603748A; JP2015532990A

Description

Embodiments relate to processors. In particular, embodiments relate to processors having multiple cores.

FIG. 1 is a block diagram of a prior art processor 100. The processor has multiple cores 101. In particular, the illustrated processor has core 0 101-0, core 1 101-1, through core M 101-M. By way of example, there may be two, four, seven, ten, sixteen, or any other appropriate number of cores. Each core includes corresponding single instruction, multiple data (SIMD) execution logic 102. In particular, core 0 includes SIMD execution logic 102-0, core 1 includes SIMD execution logic 102-1, and core M includes SIMD execution logic 102-M. That is, the SIMD execution logic is replicated per core. Each SIMD execution logic is operable to process SIMD, vector, or packed data operands. Each operand may have multiple smaller data elements, such as 8-bit, 16-bit, 32-bit, or 64-bit data elements, that are packed together in the operand and processed in parallel by the SIMD execution logic.
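The packed data operands described above can be illustrated in software. The following is a minimal sketch, not the SIMD execution logic of any figure: the function names `unpack_lanes`, `pack_lanes`, and `simd_add` are hypothetical, and wrapping (modular) addition is assumed only for illustration. It shows that each data element is processed independently, so a carry out of one element does not affect its neighbor.

```python
def unpack_lanes(operand: int, operand_bits: int, lane_bits: int) -> list[int]:
    """Split a packed operand into its data elements, lowest element first."""
    if operand_bits % lane_bits != 0:
        raise ValueError("operand width must be a multiple of the element width")
    mask = (1 << lane_bits) - 1
    return [(operand >> (i * lane_bits)) & mask
            for i in range(operand_bits // lane_bits)]


def pack_lanes(lanes: list[int], lane_bits: int) -> int:
    """Pack data elements back into one operand, lowest element first."""
    mask = (1 << lane_bits) - 1
    operand = 0
    for i, lane in enumerate(lanes):
        operand |= (lane & mask) << (i * lane_bits)
    return operand


def simd_add(a: int, b: int, operand_bits: int, lane_bits: int) -> int:
    """Element-wise wrapping add of two packed operands (illustrative only)."""
    lanes_a = unpack_lanes(a, operand_bits, lane_bits)
    lanes_b = unpack_lanes(b, operand_bits, lane_bits)
    return pack_lanes([x + y for x, y in zip(lanes_a, lanes_b)], lane_bits)


# Same 32-bit operands, interpreted as two 16-bit or four 8-bit elements.
print(hex(simd_add(0x00FF0001, 0x00010001, 32, 16)))  # 0x1000002
print(hex(simd_add(0x00FF0001, 0x00010001, 32, 8)))   # 0x2
```

With 16-bit elements the sum 0xFF + 0x01 fits in its element; with 8-bit elements the same sum wraps to zero without disturbing the adjacent element.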

In some processors, each SIMD execution logic may represent a relatively large amount of logic. For example, this may be the case when each SIMD execution logic processes wide SIMD operands. Some processors are able to process relatively wide vector or packed data operands, such as 128-bit operands, 256-bit operands, 512-bit operands, 1024-bit operands, and so on. In general, the SIMD execution logic needed to process such wide operands tends to be relatively large, to consume a relatively large amount of die area (which increases the manufacturing cost of the processor), and to consume a relatively large amount of power during use. Replicating relatively large SIMD execution logic for each core tends to exacerbate these problems. Moreover, in many applications or workload scenarios, the SIMD execution logic replicated per core tends to be underutilized at least some of the time. If the number of cores continues to increase in the future, these problems may become even more significant.

Further, in the prior art processor of FIG. 1, each of the cores also has conventional flow control logic. In particular, core 0 has flow control logic 103-0, core 1 has flow control logic 103-1, and core M has flow control logic 103-M. In general, the flow control logic may be designed or optimized to cover a wide range of usage models, for example by introducing speculative execution. However, this generally tends to bring relatively small benefit for SIMD and various other high-throughput computations, while tending to be accompanied by relatively high power consumption.

The invention may best be understood by referring to the following description and the accompanying drawings, which are used to illustrate embodiments of the invention. The drawings are as follows.

A block diagram of a prior art processor.

A block diagram of an embodiment of a system having an embodiment of a processor and an embodiment of a memory.

A block diagram of an embodiment of a processor having core 0, which includes an embodiment of shared core extension interface logic, and an embodiment of shared core extension logic, which includes an embodiment of core interface logic.

A block flow diagram of an embodiment of a method of processing an embodiment of a shared core extension call instruction.

A block diagram of an example embodiment of a shared core extension command register.

A block flow diagram of an embodiment of a method of processing an embodiment of a shared core extension read instruction.

A block flow diagram of an embodiment of a method of processing an embodiment of a shared core extension stop instruction.

A block diagram showing both an exemplary in-order pipeline and an exemplary register renaming, out-of-order issue/execution pipeline according to embodiments of the invention.

A block diagram showing both an exemplary embodiment of an in-order architecture core and an exemplary register renaming, out-of-order issue/execution architecture core to be included in a processor according to embodiments of the invention.

A block diagram of a single processor core according to embodiments of the invention, shown together with its connection to the on-die interconnect network and its local subset of the level 2 (L2) cache 904.

An enlarged view of part of the processor core of FIG. 9A according to embodiments of the invention.

A block diagram of a processor according to embodiments of the invention that may have more than one core, may have an integrated memory controller, and may have integrated graphics.

A block diagram of a system according to one embodiment of the invention.

A block diagram of a first more specific exemplary system according to an embodiment of the invention.

A block diagram of a second more specific exemplary system according to an embodiment of the invention.

A block diagram of an SoC according to an embodiment of the invention.

A block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set, according to embodiments of the invention.

Disclosed herein are embodiments of processors having multiple cores and shared core extension logic that is shared by the cores (e.g., is operable to perform data processing for each of the cores). Also disclosed herein are shared core extension use instructions, processors that execute the shared core extension use instructions, methods performed by the processors when processing or executing the shared core extension use instructions, and systems incorporating one or more processors to process or execute the shared core extension use instructions.

In the following description, numerous specific details are set forth, such as particular microarchitectural details, particular command register formats, particular shared core extension use instruction functionalities, particular groups of shared core extension use instructions, particular types and interrelationships of system components, and particular logic partitioning/integration details. However, it is understood that embodiments of the invention may be practiced without these specific details. In other instances, well-known circuits, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.

FIG. 2 is a block diagram of an embodiment of a system 209 having an embodiment of a processor 210 and an embodiment of a memory 218. The processor and the memory are coupled with, or otherwise in communication with, one another through one or more buses or other interconnects 219. In various embodiments, the system 209 may represent a desktop computer system, a laptop computer system, a server computer system, a network element, a cellular phone, or another type of electronic device having a multi-core processor and a memory.

The processor 210 has multiple cores 211. The illustrated processor has core 0 211-0 through core M 211-M. By way of example, there may be two, four, seven, ten, sixteen, thirty-two, sixty-four, one hundred twenty-eight, or more cores, or any other reasonably appropriate number of cores desired for the particular implementation. In some embodiments, each of the cores may be able to operate substantially independently of the other cores. Each of the cores is able to process at least one thread. As shown, core 0 has thread 0 212-0 and may optionally include up to thread P 212-P. Similarly, core M has thread 0 212-0 and may optionally include up to thread P 212-P. The number of threads P may be any reasonably appropriate number of threads. The scope of the invention is not limited to any known number of cores, or to any known number of threads that those cores are able to process.

The processor may be any of various complex instruction set computing (CISC) processors, various reduced instruction set computing (RISC) processors, various very long instruction word (VLIW) processors, various hybrids thereof, or other types of processors entirely. In some embodiments, the cores may be general-purpose cores of a general-purpose processor of the type used in desktop, laptop, server, and similar computer systems. In some embodiments, the cores may be special-purpose cores. Examples of suitable special-purpose cores include, but are not limited to, graphics processor cores, digital signal processor (DSP) cores, and network processor cores, to name just a few. In some embodiments, the processor may be a system-on-chip (SoC) having multiple general-purpose or special-purpose cores and one or more of a graphics unit, a media block, and system memory integrated on chip with the cores.

The processor further includes an embodiment of shared core extension logic 214. The shared core extension logic is shared by each of the cores 211 (e.g., is operable to perform data processing for each of the cores). The shared core extension logic includes shared data processing logic 216 that is operable to perform the data processing for each of the cores. The shared core extension logic and the cores are coupled with one another by one or more buses or other interconnects 217 of the processor. The cores and the shared core extension logic include corresponding interface logic 213, 215 that allows one or more physical threads on each of the cores and the shared core extension logic to interface or interact with one another (e.g., for the threads of the cores to call the shared core extension logic to have data processing performed, to check on the status of the data processing, to stop the data processing, to synchronize virtual memory attributes on context switches, to route page faults occurring during the data processing, and so on). The computational tasks executed by the shared core extension logic on behalf of each physical thread may run under the logical process of that specific physical thread. As will be described further below, the context used for the interface may be provided per physical thread.

In particular, core 0 includes an embodiment of shared core extension interface logic 213-0 that includes at least some logic specific to thread 0 on core 0 and at least some logic specific to thread P on core 0. Likewise, core M includes an embodiment of shared core extension interface logic 213-M that includes at least some logic specific to thread 0 on core M and at least some logic specific to thread P on core M. Each of the other cores (if any) similarly includes such shared core extension interface logic. The shared core extension logic 214 includes an embodiment of corresponding core interface logic 215. Each core 211 may interface or interact with the core interface logic 215 of the shared core extension logic 214 through its corresponding shared core extension interface logic 213. In some embodiments, the shared core extension interface logic 213 and the core interface logic 215 may provide an architectural interface (e.g., new architectural macroinstructions and new architectural registers), as well as a microarchitectural interface or hardware mechanism (e.g., data processing scheduling logic, memory management unit (MMU) synchronization logic, page fault routing logic, etc.), that allows the cores to share the shared core extension logic (e.g., to share data processing by the shared data processing logic 216). Detailed example embodiments of the shared core extension interface logic 213 and the core interface logic 215 are described further below.

The shared data processing logic 216 may represent different types of data processing logic in different embodiments. As discussed above in the background section, certain types of data processing logic (e.g., certain wide SIMD execution units) have conventionally been replicated per core. As mentioned before, this replicated logic often tends to be relatively large. Moreover, this replicated logic is often underutilized, at least some of the time, in many common workload scenarios. The replication of such logic generally tends to consume a relatively large amount of die area, which increases manufacturing cost, and to consume a relatively large amount of power. In some embodiments, such relatively large and/or commonly underutilized data processing logic, which is conventionally replicated per core, may be extracted from the cores into the shared core extension logic as a single shared copy of the data processing logic. Moreover, in contrast to being designed or optimized to cover a wide range of usage models, for example by introducing speculative execution, as is the case for the conventional flow control logic of the cores of FIG. 1, the shared core extension logic 214 may employ flow control logic that is desired for, or optimized for, high throughput. This generally tends to provide a higher level of power-performance efficiency for throughput-oriented algorithms.

In various embodiments, the shared data processing logic may represent throughput-oriented hardware computation function logic, a high-throughput computation engine, matrix multiplication logic, matrix transpose logic, finite filter logic, sum of absolute differences logic, histogram computation logic, gather-scatter instruction implementation logic, transcendental vector execution logic, or the like. In some embodiments, the shared data processing logic may include execution units, such as, for example, SIMD execution units (e.g., potentially relatively wide SIMD execution units). In some embodiments, the shared core extension logic may interact with shared core extension data structures 208 (e.g., matrices, tables, etc.), for example in the memory 218.

Advantageously, as compared to replicating logic, the shared core extension logic may help to reduce one or more of the overall die area needed to implement the logic, the cost of manufacturing the logic, and/or the power consumed by the logic. That is, the shared core extension logic may allow multiple cores to share common data processing function evaluation hardware resources without incurring the generally high integration costs of replicating such resources per core. For clarity, it is not required that the particular shared data processing logic be large, although the greatest size, cost, and power savings are often achieved when relatively large logic is shared by multiple cores instead of being replicated per core. Moreover, the greatest benefits are often achieved when the shared logic would otherwise, had it been replicated per core, have been relatively underutilized, since sharing may tend to increase the utilization of the logic, so that underutilized or unnecessary logic may be consolidated to reduce die area and manufacturing cost. As a further potential advantage, the shared core extension logic may also be used to allow the cores to be customized or optimized for one type of processing (e.g., for scalar workload performance, power, and area), while the shared core extension logic is customized or optimized for another type of processing (e.g., for throughput-oriented workload performance, power, and area).

FIG. 3 is a block diagram of an embodiment of a processor 310 having core 0 311-0, which includes an example embodiment of shared core extension interface logic 313, and an embodiment of shared core extension logic 314, which includes an example embodiment of core interface logic 315. As previously described, the processor may also include one or more other cores through core M (not shown). In addition to the shared core extension interface logic 313, core 0 also has conventional logic 334 of the type conventionally found in cores (e.g., one or more execution units, architectural registers, one or more caches, microarchitectural logic, etc.). The scope of the invention is not limited to any known such conventional logic. In addition to the core interface logic 315, the shared core extension logic 314 also has shared data processing logic 316 and a scheduler 344 that schedules data processing or tasks from the multiple cores on the shared data processing logic.
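The role of the scheduler 344 can be pictured with a small software model. The following is only a sketch under stated assumptions: the text above does not specify a scheduling policy, so round-robin arbitration among per-core queues is assumed here purely for illustration, and the class name `SharedExtensionModel` and its methods are hypothetical.

```python
from collections import deque


class SharedExtensionModel:
    """Toy model: one shared data processing resource serving several cores."""

    def __init__(self, num_cores: int):
        # One pending-task queue per core (an assumption of this sketch).
        self.queues = [deque() for _ in range(num_cores)]
        self.next_core = 0  # where the round-robin search starts

    def submit(self, core_id: int, task) -> None:
        """A core hands a task (a zero-argument callable) to the shared logic."""
        self.queues[core_id].append(task)

    def run_one(self):
        """Pick the next pending task in round-robin order and execute it.

        Returns (core_id, result), or None when nothing is pending.
        """
        n = len(self.queues)
        for offset in range(n):
            core = (self.next_core + offset) % n
            if self.queues[core]:
                task = self.queues[core].popleft()
                self.next_core = (core + 1) % n
                return core, task()
        return None


model = SharedExtensionModel(num_cores=2)
model.submit(0, lambda: "core0-task-a")
model.submit(0, lambda: "core0-task-b")
model.submit(1, lambda: "core1-task-a")
print(model.run_one())  # (0, 'core0-task-a')
print(model.run_one())  # (1, 'core1-task-a')
```

The point of the sketch is only that a single copy of the data processing resource serves requests from every core, so some arbitration among cores is needed; an actual implementation could use any policy.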

Each of the one or more physical threads running on core 0 311-0 may use the shared core extension interface logic 313 to interface with the shared core extension logic 314. The shared core extension interface logic 313 includes shared core extension use instructions 323 of an instruction set 322 of core 0. The instruction set is part of the instruction set architecture (ISA) of the core. The ISA represents the part of the architecture of the core that relates to programming. The ISA commonly includes the native instructions, architectural registers, data types, addressing modes, memory architecture, interrupt and exception handling, and the like, of the processor. The ISA is generally distinguished from the microarchitecture, which represents the particular design techniques selected to implement the ISA. Processors or cores with different microarchitectures may share a common ISA. The instructions of the instruction set, including the shared core extension use instructions 323, represent machine instructions, macroinstructions, or higher-level instructions (e.g., instructions provided to the core for execution), as opposed to microinstructions, micro-ops, or lower-level instructions (e.g., those that result from decode logic decoding machine instructions or macroinstructions).

The shared core extension interface logic 313 also includes a core 0, thread 0 set of shared core extension command registers (SCERCRs) 328. Each physical thread may have a set of SCERCR registers associated with it as part of its context, to be saved and restored irrespective of the progress of other threads. In some embodiments, for core 0 there may be multiple sets of SCERCRs provided per thread, one for each of the one or more physical threads that run on core 0. For example, in the illustrated embodiment, the core 0, thread 0 SCERCRs may belong to thread 0. Similarly, each physical thread running on core 0 may have a set of core 0, thread-specific SCERCRs to interface with the shared core extension logic 314. Alternatively, there may be a single set of core 0 SCERCRs for core 0. In such cases, the SCERCRs may be time-shared among the physical threads at the hardware level. Context may be swapped out of the core 0 SCERCRs, and saved and restored, on context switches.
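The alternative in which a single set of command registers is time-shared among physical threads can be sketched as follows. This is a simplified software model under assumptions of this example: the class name `TimeSharedCommandRegisters` and the `switch_to` method are hypothetical, registers are modeled as plain integers, and the actual save/restore mechanism is not specified in the text above.

```python
class TimeSharedCommandRegisters:
    """Toy model of one physical register set shared by several threads."""

    def __init__(self, num_registers: int):
        self.regs = [0] * num_registers  # the single physical register set
        self.saved = {}                  # per-thread saved context
        self.current = None              # thread currently owning the registers

    def switch_to(self, thread_id: int) -> None:
        """Save the outgoing thread's registers and restore the incoming ones."""
        if self.current is not None:
            self.saved[self.current] = list(self.regs)
        # A thread seen for the first time starts with cleared registers.
        self.regs = list(self.saved.get(thread_id, [0] * len(self.regs)))
        self.current = thread_id


bank = TimeSharedCommandRegisters(num_registers=4)
bank.switch_to(0)
bank.regs[1] = 0xAB      # thread 0 writes its command register 1
bank.switch_to(1)
bank.regs[1] = 0xCD      # thread 1 sees its own, separate context
bank.switch_to(0)
print(hex(bank.regs[1]))  # 0xab
```

With per-thread register sets, by contrast, no swapping between threads of the same core would be needed, at the cost of more register storage.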

In the illustration, SCERCR 0 328-0 through SCERCR N 328-N are shown. That is, there are N+1 registers. The number N+1 may be any desired number, such as 2, 4, 8, 16, 32, 64, or some other number. There is no requirement that N+1 be a power of two, although a power of two generally tends to provide efficient register addressing. A given one of these registers is generically represented herein as SCECRx, where x may represent any one of the registers SCERCR 0 through SCERCR N.
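One way to see why a power-of-two register count tends to address efficiently is to count the bits a register index field would need. The following sketch assumes a plain binary index field, which is an assumption of this example and not a statement about the actual instruction encoding; the function names are hypothetical.

```python
def index_bits(num_registers: int) -> int:
    """Bits needed for a binary field that can name every register."""
    if num_registers < 1:
        raise ValueError("need at least one register")
    return (num_registers - 1).bit_length()


def unused_encodings(num_registers: int) -> int:
    """Index values that the field can express but that name no register."""
    return (1 << index_bits(num_registers)) - num_registers


for count in (2, 4, 6, 8, 16):
    print(count, index_bits(count), unused_encodings(count))
```

For a power-of-two count every value of the index field names a register, whereas a count such as 6 needs the same 3-bit field as 8 registers and leaves two encodings unused.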

In some embodiments, the shared core extension command registers may be architecturally visible registers of the ISA of the core and/or processor. Architectural registers generally represent on-die processor storage locations. The architectural registers may also be referred to herein simply as registers. Unless otherwise specified or apparent, the phrases architectural registers and registers are used herein to refer to registers that are visible to software and/or a programmer (e.g., software-visible) and/or registers that are specified by macroinstructions. These registers are contrasted with other non-architectural or non-architecturally visible registers in a given microarchitecture (e.g., temporary registers used by instructions, reorder buffers, retirement registers, etc.).

The shared core extension use instructions 323 are used to submit, monitor, and stop calls to the shared core extension logic 314 for data processing to be performed. By way of example, the shared core extension use instructions may be used for parallel programming and may be included in an instruction set (e.g., as an extension of the instruction set) to increase the efficiency and/or throughput of parallel programming workloads. The shared core extension use instructions may explicitly specify (e.g., through bits or one or more fields) or otherwise indicate (e.g., implicitly indicate) a shared core extension command register (SCECRx) of the core 0 shared core extension command registers 328. The shared core extension registers may provide an architectural hardware interface of the processor to the shared core extension logic.

図示された実施形態では、共有コア拡張使用命令３２３は、フォーマットＳＣＥ呼び出し（ＳＣＥＣＲｘ、複数のパラメータ）を有する共有コア拡張（ＳＣＥ）呼び出し命令３２４を含む。ＳＣＥＣＲｘは、コア０の複数の共有コア拡張コマンドレジスタ３２８の１つを示し、複数のパラメータは、さらに後述される呼び出しと関連付けられる１つまたは複数のパラメータを示す。図示された複数の共有コア拡張使用命令は、フォーマットＳＣＥ読み出し（ＳＣＥＣＲｘ）を有するＳＣＥ読み出し命令３２５をさらに含む。他の複数の共有コア拡張使用命令は、フォーマットＳＣＥ停止（ＳＣＥＣＲｘ）を有するＳＣＥ停止命令３２６である。さらに他の複数の共有コア拡張使用命令は、フォーマットＳＣＥ待機（ＳＣＥＣＲｘ）を有するＳＣＥ待ち命令３２７である。これらの複数の命令の各々は、実行されるべき命令及び／またはオペレーションを特定するように動作可能なオペレーションコードまたはオペコード（例えば、複数のビットまたは１つまたは複数のフィールド）を含んでもよい。これらの例示的な複数の共有コア拡張使用命令の各々の機能性は、さらに後述される。 In the illustrated embodiment, the shared core extension use instructions 323 include a shared core extension (SCE) call instruction 324 having the format SCE call (SCECRx, parameters). SCECRx indicates one of the core 0 shared core extension command registers 328, and the parameters indicate one or more parameters associated with the call, which are described further below. The illustrated shared core extension use instructions also include an SCE read instruction 325 having the format SCE read (SCECRx). Another shared core extension use instruction is an SCE stop instruction 326 having the format SCE stop (SCECRx). Yet another shared core extension use instruction is an SCE wait instruction 327 having the format SCE wait (SCECRx). Each of these instructions may include an operation code or opcode (e.g., bits or one or more fields) operable to identify the instruction and/or the operation to be performed. The functionality of each of these illustrative shared core extension use instructions is described further below.

これは、複数の共有コア拡張使用命令の適したセットの１つの例示に過ぎないことが理解されよう。例えば、他の複数の実施形態では、図示された複数の命令のいくつかは、選択的に省略されてもよく、及び／またはさらなる複数の命令が、複数の共有コア拡張使用命令に選択的に追加されてもよい。さらに、他の複数の共有コア拡張使用命令及びこれらの複数のセットは、考慮され、当業者に明らかであり、本開示の利益を有する。 It is to be appreciated that this is just one illustrative example of a suitable set of shared core extension use instructions. For example, in other embodiments, some of the illustrated instructions may optionally be omitted, and/or additional instructions may optionally be added to the shared core extension use instructions. Moreover, other shared core extension use instructions and sets thereof are contemplated and will be apparent to those skilled in the art having the benefit of the present disclosure.

コア０３１１－０上で実行する複数の物理的スレッドの１つは、共有コア拡張使用命令３２３の１つを発行してもよい。当該スレッドによって発行された共有コア拡張使用命令は、適切なコア０の複数の共有コア拡張コマンドレジスタ３２８を示してもよい。適切なコア０の複数の共有コア拡張コマンドレジスタは、スレッド（例えば、スレッド０）に対応し、スレッド毎にコンテキストを提供してもよい。 One of the plurality of physical threads running on core 0 311-0 may issue one of the shared core extended use instructions 323. The shared core extended use instruction issued by the thread may indicate a plurality of shared core extended command registers 328 of the appropriate core 0. A plurality of shared core extended command registers of appropriate core 0 may correspond to threads (eg, thread 0) and provide context for each thread.

再び図３を参照すると、コア０は、復号ロジック３４８を含む。復号ロジックは、デコーダまたは復号ユニットとさらに称されてもよい。復号ロジックは、より高レベルの複数の機械命令または複数のマクロ命令を受信及び復号し、１つまたは複数のより低レベルのマイクロオペレーション、複数のマイクロコードエントリポイント、複数のマイクロ命令または他のより低レベルの複数の命令、もしくは元のより高レベルの命令を反映し、及び／またはこれから導出される複数の制御信号を出力してもよい。１つまたは複数のより低レベルの複数の制御信号は、１つまたは複数のより低レベルの（例えば、回路レベルまたはハードウェアレベルの）オペレーションを介して、より高レベルの命令のオペレーションを実装してもよい。デコーダは、限定的ではないが、複数のマイクロコード読み出し専用メモリ（ＲＯＭ）、複数のルックアップテーブル、複数のハードウェア実装、複数のプログラマブルロジックアレイ（ＰＬＡ）、及び命令の復号を実行するために用いられる、当技術分野で公知の他の複数のメカニズムを含む、様々な複数の異なるメカニズムを用いて実装されてもよい。さらに、いくつかの実施形態では、命令エミュレータ、トランスレータ、モーファ、インタプリタまたは他の命令変換ロジックは、復号ロジックの代わりに、及び／またはこれに加えて、用いられてもよい。 Referring again to FIG. 3, core 0 includes decoding logic 348. The decoding logic may be further referred to as a decoder or decoding unit. Decoding logic receives and decodes higher level machine instructions or multiple macro instructions, one or more lower level microoperations, multiple microcode entry points, multiple microinstructions or more. Multiple low-level instructions, or multiple control signals derived from and / or derived from the original higher-level instruction, may be output. Multiple lower-level control signals, one or more, implement higher-level instructional operations via one or more lower-level (eg, circuit-level or hardware-level) operations. You may. Decoders are used to perform, but are not limited to, multiple microcode read-only memories (ROMs), multiple look-up tables, multiple hardware implementations, multiple programmable logic arrays (PLAs), and instruction decoding. It may be implemented using a variety of different mechanisms, including a plurality of other mechanisms known in the art used. Further, in some embodiments, an instruction emulator, translator, morpher, interpreter or other instruction conversion logic may be used in place of and / or in addition to the decoding logic.

ＳＣＥ命令実行ロジック３３０は、復号ロジック３４８及びコア０の複数の共有コア拡張コマンドレジスタ３２８と連結される。共有コア拡張命令実行ロジックは、１つまたは複数のマイクロオペレーション、複数のマイクロコードエントリポイント、複数のマイクロ命令、他の複数の命令、もしくは複数の共有コア拡張使用命令を反映し、またはこれから導出される他の複数の制御信号を、デコーダから受信してもよい。共有コア拡張命令実行ロジックは、複数の共有コア拡張使用命令に応答して及び／またはこれに指定されたように（例えば、デコーダからの複数の制御信号に応答して）、複数の動作を実行するように動作可能である。いくつかの実施形態では、共有コア拡張命令実行ロジック及び／またはプロセッサは、複数の共有コア拡張使用命令を実行及び／または処理し、複数の共有コア拡張使用命令に応答して及び／またはこれに指定されたように複数の動作を実行するように動作可能な、具体的なまたは特定のロジック（例えば、回路またはソフトウェア及び／またはファームウェアと潜在的に組み合わせられる他のハードウェア）を含んでもよい。 The SCE instruction execution logic 330 is concatenated with the decoding logic 348 and the plurality of shared core extension command registers 328 of the core 0. Shared core extended instruction execution logic reflects or is derived from one or more microoperations, multiple microcode entry points, multiple microinstructions, multiple other instructions, or multiple shared core extended instruction usage instructions. Other plurality of control signals may be received from the decoder. The shared core extension instruction execution logic performs multiple actions in response to and / or as specified to it (eg, in response to multiple control signals from the decoder). It is possible to operate as if. In some embodiments, the shared core extended instruction execution logic and / or processor executes and / or processes a plurality of shared core extended instruction and responds to and / or responds to the plurality of shared core extended instruction. It may include specific or specific logic (eg, circuits or software and / or other hardware potentially combined with firmware) that can operate to perform multiple operations as specified.

図示された実施形態では、共有コア拡張命令実行ロジックは、共有コア拡張制御ロジック３２９内に含まれる。共有コア拡張制御ロジックは、複数の共有コア拡張コマンドレジスタ３２８、復号ロジック３４８、及びさらに後述されるメモリ管理ユニット３３１と連結される。共有コア拡張制御ロジックは、共有コア拡張インターフェースロジック３１３の様々な制御、管理、調整、タイミング及び関連する複数の実装態様を支援してもよい。 In the illustrated embodiment, the shared core extension instruction execution logic is included within shared core extension control logic 329. The shared core extension control logic is coupled with the shared core extension command registers 328, the decoding logic 348, and a memory management unit 331, which is described further below. The shared core extension control logic may assist with various control, management, coordination, timing, and related implementation aspects of the shared core extension interface logic 313.

上述したように、コア０の命令セットは、ＳＣＥ呼び出し命令３２４を含む。ＳＣＥ呼び出し命令は、コアの代わりに（例えば、コア上でスレッドを実行する代わりに）データ処理を実行させるべく、共有コア拡張ロジック３１４に呼び出しを送信するために用いられてもよい。例として、コア０上で実行する物理的または論理的スレッドは、データ処理が実行されるよう、共有コア拡張ロジックに呼び出しまたはコマンドを送信するために、ＳＣＥ呼び出し命令を発行してもよい。いくつかの実施形態では、呼び出しまたはコマンドは、複数の共有コア拡張コマンドレジスタ３２８の１つまたは複数を介して、共有コア拡張ロジックに渡されてもよい。例えば、実施形態の共有コア拡張呼び出し命令は、コア０の複数の共有コア拡張コマンドレジスタ３２８（例えば、ＳＣＥＣＲｘ）の１つを指定または示してもよい。すなわち、複数の共有コア拡張コマンドレジスタは、新たなＳＣＥ呼び出しマクロ命令を用いて、複数のコア上のスレッドからアクセス可能であってもよい。いくつかの実施形態では、ＳＣＥ呼び出し命令は、実行されるべきデータ処理をさらに指定、認定または定義するために、より多くのパラメータの１つをさらに指定または示してもよい。データは、ＳＣＥ呼び出し命令に基づいて（例えば、ＳＣＥ呼び出し命令の１つまたは複数のパラメータに基づいて）、示された共有コア拡張コマンドレジスタ（例えば、ＳＣＥＣＲｘ）に書き込まれまたは格納されてもよい。現在のＳＣＥ呼び出しが、既に前のＳＣＥ呼び出しの専用であるまたはこれに占められている共有コア拡張コマンドレジスタに対してなされる場合、現在のＳＣＥ呼び出しは、占められた共有コア拡張コマンドレジスタが解放される（例えば、関連する呼び出しが完了するまたは停止されるとき）まで、ブロックされてもよい。次に、共有コア拡張ロジックは、書き込まれまたは格納されたデータを含む、示された共有コア拡張コマンドレジスタ（例えば、ＳＣＥＣＲｘ）にアクセスしてもよく、（例えば、要求されたデータ処理を実行する）呼び出しまたはコマンドを実装してもよい。 As mentioned above, the core 0 instruction set includes the SCE call instruction 324. The SCE call instruction may be used to send a call to the shared core extension logic 314 to perform data processing on behalf of the core (eg, instead of running a thread on the core). As an example, a physical or logical thread running on core 0 may issue an SCE call instruction to call or send a command to the shared core extension logic so that data processing is performed. In some embodiments, the call or command may be passed to the shared core extension logic via one or more of the plurality of shared core extended command registers 328. For example, the shared core extended call instruction of the embodiment may specify or indicate one of a plurality of shared core extended command registers 328 (eg, SCECRx) of core 0. That is, the plurality of shared core extended command registers may be accessible by threads on the plurality of cores using the new SCE call macroinstruction. 
In some embodiments, the SCE call instruction may further specify or indicate one or more parameters to further specify, qualify, or define the data processing to be performed. Data may be written or stored in the indicated shared core extension command register (e.g., SCECRx) based on the SCE call instruction (e.g., based on its one or more parameters). If the current SCE call is made to a shared core extension command register that is already dedicated to or occupied by a previous SCE call, the current SCE call may be blocked until the occupied shared core extension command register is released (e.g., when the associated call completes or is stopped). The shared core extension logic may then access the indicated shared core extension command register (e.g., SCECRx), including the written or stored data, and may implement the call or command (e.g., perform the requested data processing).
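The blocking behavior described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the names `CommandRegister`, `sce_call`, and `release` are hypothetical, the parameters are arbitrary example values, and a failed submission returns `False` where the actual instruction would block until the register is released.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CommandRegister:
    """Hypothetical software model of one SCECRx register."""
    payload: Optional[dict] = None  # data written by the pending call, if any

    @property
    def occupied(self) -> bool:
        return self.payload is not None


def sce_call(reg: CommandRegister, **params) -> bool:
    """Try to submit a call; return False if the register is still occupied."""
    if reg.occupied:
        return False  # the instruction itself would block here until release
    reg.payload = dict(params)  # write data derived from the call parameters
    return True


def release(reg: CommandRegister) -> None:
    """Free the register once its call has completed or been stopped."""
    reg.payload = None


scecr0 = CommandRegister()
first = sce_call(scecr0, command="transpose", src=0x1000, dst=0x2000)
second = sce_call(scecr0, command="histogram", src=0x3000, dst=0x4000)
release(scecr0)  # the first call completes, freeing the register
third = sce_call(scecr0, command="histogram", src=0x3000, dst=0x4000)
```

In this model the second submission is refused because the register still holds the first call, and it succeeds only after the register has been released.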

図４は、ＳＣＥ呼び出し命令の実施形態を処理する方法４５０の実施形態のブロックフロー図である。複数の実施形態では、方法は、プロセッサ、コアまたは他のタイプの命令処理装置によって実行されてもよい。いくつかの実施形態では、方法４５０は、図２のプロセッサ２１０または図３のコア０３１１－０、もしくは同様のプロセッサまたはコアによって実行されてもよい。代替的に、方法４５０は、完全に異なるプロセッサ、コアまたは命令処理装置によって実行されてもよい。さらに、プロセッサ２１０及びコア３１１－０は、方法４５０と同じ、同様の、または異なる、複数のオペレーション及び複数の方法の複数の実施形態を実行してもよい。 FIG. 4 is a block flow diagram of an embodiment of a method 450 of processing an embodiment of an SCE call instruction. In embodiments, the method may be performed by a processor, a core, or another type of instruction processing apparatus. In some embodiments, the method 450 may be performed by the processor 210 of FIG. 2, by core 0 311-0 of FIG. 3, or by a similar processor or core. Alternatively, the method 450 may be performed by an entirely different processor, core, or instruction processing apparatus. Moreover, the processor 210 and the core 311-0 may perform embodiments of operations and methods that are the same as, similar to, or different from those of the method 450.

ブロック４５１において、ＳＣＥ呼び出し命令は、複数のコアを有するプロセッサのコア内で受信される。様々な複数の態様では、ＳＣＥ呼び出し命令は、オフコアソースから（例えば、メインメモリ、ディスク、もしくはバスまたはインターコネクトから）コアで受信されてもよく、または、コア内の他のロジック（例えば、命令キャッシュ、キュー、スケジューリングロジック等）からコアの一部で（例えば、復号ロジック、スケジューリングロジック等で）受信されてもよい。ＳＣＥ呼び出し命令は、データ処理を実行させるために、コアにより、共有コア拡張ロジックを呼び出させるものである。共有コア拡張ロジックは、複数のコアにより共有される。ＳＣＥ呼び出し命令は、共有コア拡張コマンドレジスタを示し、１つまたは複数のパラメータをさらに示す。１つまたは複数のパラメータは、共有コア拡張ロジックによって実行されるべきデータ処理を指定する。 At block 451 the SCE call instruction is received within the core of a processor having a plurality of cores. In various aspects, the SCE call instruction may be received in the core from an off-core source (eg, from main memory, disk, or bus or interconnect), or other logic within the core (eg, instruction). It may be received from a part of the core (eg, in the decoding logic, scheduling logic, etc.) from the cache, queue, scheduling logic, etc.). The SCE call instruction causes the core to call the shared core extension logic in order to execute data processing. Shared core extension logic is shared by multiple cores. The SCE call instruction indicates a shared core extended command register and further indicates one or more parameters. One or more parameters specify the data processing to be performed by the shared core extension logic.

いくつかの実施形態では、１つまたは複数のパラメータは、ポインタの１つまたは複数（例えば、複数の明示的な仮想メモリポインタ）を、呼び出しに関連付けられた複数のコマンド属性を有するメモリにおけるコマンド属性データ構造に、１つまたは複数のポインタ（例えば、１つまたは複数の明示的な仮想メモリポインタ）を、データ処理が実行されるべきメモリにおける１つまたは複数の入力データオペランドに、１つまたは複数のポインタ（例えば、１つまたは複数の明示的な仮想メモリポインタ）を、データ処理の複数の結果が格納されるべきメモリにおける１つまたは複数の出力データオペランドに、提供してもよい。例えば、いくつかの実施形態では、１つまたは複数のパラメータは、さらに後述される図５の複数のフィールドに格納されるべき、及び／またはこれを抽出するために用いられるべき情報を提供してもよい。代替的に、他の複数の実施形態では、１つまたは複数のフィールドは、複数のメモリポインタの代わりに、複数のオペコード及び複数の引数の直接符号化を有してもよい。 In some embodiments, the one or more parameters may provide one or more of: a pointer (e.g., an explicit virtual memory pointer) to a command attribute data structure in memory that holds command attributes associated with the call; one or more pointers (e.g., one or more explicit virtual memory pointers) to one or more input data operands in memory on which the data processing is to be performed; and one or more pointers (e.g., one or more explicit virtual memory pointers) to one or more output data operands in memory where results of the data processing are to be stored. For example, in some embodiments, the one or more parameters may provide information to be stored in, and/or used to derive, the fields of FIG. 5, which are described further below. Alternatively, in other embodiments, one or more fields may hold direct encodings of opcodes and arguments instead of memory pointers.

ブロック４５２において、共有コア拡張ロジックは、データ処理が実行されるように、ＳＣＥ呼び出し命令に応答して呼び出される。いくつかの実施形態では、共有コア拡張ロジックの呼び出しは、命令によって示される１つまたは複数のパラメータに基づいて、命令によって示される共有コア拡張コマンドレジスタにデータを書き込むまたは格納することを含んでもよい。 At block 452, the shared core extension logic is called in response to an SCE call instruction so that data processing is performed. In some embodiments, the call to the shared core extension logic may include writing or storing data in the shared core extension command register indicated by the instruction, based on one or more parameters indicated by the instruction. ..

図５は、共有コア拡張コマンドレジスタ５２８の実施形態の例のブロック図である。共有コア拡張コマンドレジスタは、多数のフィールドを有する。図示された実施形態では、これらの複数のフィールドは、左から右に向かって、ステータスフィールド５５３、進捗度フィールド５５４、コマンドポインタフィールド５５５、入力データオペランドポインタフィールド５５６及び出力データオペランドポインタフィールド５５７を含む。これらの複数のフィールドの各々は、特定の実装に対して所望の情報を伝えるために十分な多数のビットを含んでもよい。 FIG. 5 is a block diagram of an example embodiment of a shared core extension command register 528. The shared core extension command register has a number of fields. In the illustrated embodiment, these fields include, from left to right, a status field 553, a progress field 554, a command pointer field 555, an input data operand pointer field 556, and an output data operand pointer field 557. Each of these fields may include a number of bits sufficient to convey the information desired for the particular implementation.

ステータスフィールド５５３は、共有コア拡張コマンドレジスタに対応する呼び出しのステータスを提供するために用いられてもよい。かかるステータスの複数の例は、限定的ではないが、呼び出しが有効（例えば、進行中）、呼び出し完了、呼び出しエラー等を含む。例として、２つのビットは、前述した３つのステータス条件のいずれかを指定するために用いられてもよい。他の例では、単一のビットは、有効及び無効のような２つのステータス条件のいずれかを符号化するために用いられてもよい。有効は、呼び出しが現在進行中であることを表してもよい。無効は、エラーが発生したことを示してもよい。 The status field 553 may be used to provide the status of the call corresponding to the shared core extended command register. A plurality of examples of such status include, but are not limited to, call is valid (eg, in progress), call complete, call error, and the like. As an example, the two bits may be used to specify any of the three status conditions described above. In another example, a single bit may be used to encode one of two status conditions, such as valid and invalid. Valid may indicate that the call is currently in progress. Invalid may indicate that an error has occurred.
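The two-bit status encoding mentioned above can be sketched as follows. The specific bit values, the placement of the status field in the two least significant bits of the register value, and the names `CallStatus`, `set_status`, and `get_status` are assumptions made for illustration; the description does not fix any of them.

```python
from enum import IntEnum


class CallStatus(IntEnum):
    """Hypothetical two-bit encoding of the three status conditions."""
    VALID = 0b00     # call in progress
    COMPLETE = 0b01  # call completed
    ERROR = 0b10     # call encountered an error


STATUS_BITS = 2
STATUS_MASK = (1 << STATUS_BITS) - 1  # assumed position: lowest two bits


def set_status(reg_value: int, status: CallStatus) -> int:
    """Return the register value with only its status bits replaced."""
    return (reg_value & ~STATUS_MASK) | int(status)


def get_status(reg_value: int) -> CallStatus:
    """Decode the status bits of a register value."""
    return CallStatus(reg_value & STATUS_MASK)


register_value = set_status(0xABC0, CallStatus.ERROR)
decoded = get_status(register_value)
```

Two bits leave one encoding unused; a single-bit variant would distinguish only two conditions, as the text notes.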

進捗度フィールド５５４は、共有コア拡張コマンドレジスタに対応する呼び出しの進捗度を提供するために用いられてもよい。進捗度は、完了進捗度のレベルまたは呼び出しまたはコマンドが完了に向けて進捗した程度を表してもよい。進捗度フィールドは、呼び出しの実行において、これまでに完了した作業量をカウントする種類のカウンタを有効に実装してもよい。いくつかの実施形態では、進捗度は、複数のアトミックなコミット点によって表されてもよい。例えば、アトミックなサブオペレーションがＳＣＥロジックによって完了した場合はいつでも、カウンタは、インクリメントされてもよい。アトミックなサブオペレーションは、１つのタイプのデータ処理から他のタイプまで（例えば、１つの例では、特定の数のデータのキャッシュラインが処理された場合）、異なってもよい。いくつかの実施形態では、進捗度フィールドは、共有コア拡張ロジックのデータ処理に関する進捗度のアトミック性、及び共有コア拡張ロジック上におけるコマンドの実行をプレエンプション及びリスケジューリングする能力を提供するために用いられてもよい。呼び出しの実行が割り込まれた場合（例えば、１つのスレッドから他へのコンテキストスイッチで、またはフォールトで）、進捗度フィールドは、保存されてもよい。その後、進捗度フィールドは復元されてもよく、呼び出しに関連付けられたデータ処理が再開した（例えば、スレッドが再送信する場合）。進捗度フィールドの復元により、データ処理は、中断したところで再開可能であってもよい。これは、ＳＣＥロジックにより実行されるべきデータ処理量が比較的大きい及び／または完了するために比較的大量の時間がかかる場合に、特に有用である。 The progress field 554 may be used to provide the progress of the call corresponding to the shared core extended command register. Progress may represent the level of completion progress or the degree to which a call or command has progressed towards completion. The progress field may effectively implement a type of counter that counts the amount of work completed so far in the execution of the call. In some embodiments, progress may be represented by multiple atomic commit points. For example, the counter may be incremented whenever an atomic suboperation is completed by SCE logic. Atomic sub-operations may vary from one type of data processing to another (eg, in one example, when a cache line of a certain number of data is processed). In some embodiments, the progress field is used to provide the atomicity of progress with respect to the data processing of the shared core extension logic and the ability to preempt and reschedul the execution of commands on the shared core extension logic. May be done. If the call execution is interrupted (eg, at a context switch from one thread to another, or at a fault), the progress field may be saved. 
The progress field may subsequently be restored and the data processing associated with the call resumed (e.g., when the thread is resubmitted). Restoring the progress field may allow the data processing to resume where it left off. This is especially useful when the amount of data processing to be performed by the SCE logic is relatively large and/or takes a relatively long time to complete.
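The role of the progress field as a resumable counter of atomic commit points can be illustrated with a toy model. The chunk size, the doubling operation, and the names `CHUNK` and `run_call` are assumptions chosen for the example; the description leaves the granularity of an atomic sub-operation to the implementation.

```python
from typing import List, Optional

CHUNK = 4  # elements per atomic sub-operation (assumed granularity)


def run_call(src: List[int], dst: List[int], progress: int,
             max_chunks: Optional[int] = None) -> int:
    """Double src into dst chunk by chunk, starting at commit point `progress`.

    Returns the updated progress counter. `max_chunks` emulates an
    interruption such as a context switch after that many chunks.
    """
    total_chunks = (len(src) + CHUNK - 1) // CHUNK
    done_now = 0
    while progress < total_chunks:
        if max_chunks is not None and done_now >= max_chunks:
            break  # interrupted before the call finished
        lo = progress * CHUNK
        hi = min(lo + CHUNK, len(src))
        dst[lo:hi] = [2 * x for x in src[lo:hi]]
        progress += 1  # commit point: this chunk is fully processed
        done_now += 1
    return progress


src = list(range(10))
dst = [0] * len(src)
saved = run_call(src, dst, progress=0, max_chunks=1)  # interrupted early
partial = list(dst)                                   # state at the interruption
final = run_call(src, dst, progress=saved)            # restored and resumed
```

Because the counter only advances after a chunk is fully processed, resuming from the saved value neither repeats nor skips work.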

コマンドポインタフィールド５５５は、呼び出しまたは共有コア拡張コマンドレジスタに対応する呼び出しのコマンド属性情報５５８を指すポインタを提供するために用いられてもよい。いくつかの実施形態では、呼び出し属性情報は、呼び出し属性データ構造に含まれてもよい。いくつかの実施形態では、呼び出し属性情報は、メモリ５１８における１つまたは複数のメモリ位置に格納されてもよい。いくつかの実施形態では、ポインタは、明示的な仮想メモリポインタであってもよい。呼び出し属性情報は、呼び出しの複数の属性をさらに指定、認定、定義または特徴付けしてもよい。例えば、呼び出し属性情報は、共有コア拡張ロジックによって実行されるべき正確なタイプのデータ処理をさらに指定、認定、定義または特徴付けしてもよい。いくつかの実施形態では、複数のコマンド属性は、例えば、行列を転置する複数のオペレーション、ヒストグラムを生成する複数のオペレーション、フィルタリングを実行する複数のオペレーション等のような比較的簡単なまたは短い複数の処理ルーチンまたは複数の機能を表す処理を説明してもよい。複数のコマンド属性は、１つまたは複数の出力データオペランド（例えば、１つまたは複数の出力データ構造）を生成するために、１つまたは複数の入力データオペランド（例えば、１つまたは複数の入力データ構造）に対して実行する複数のオペレーションのシーケンスを説明してもよい。いくつかの実施形態では、これらは、様々なかかる比較的簡単なアルゴリズムもしくは典型的には複数のハードウェアアクセラレータまたは複数のグラフィクス処理ユニット等で実行される複数のルーチンのいずれかであってもよい。 The command pointer field 555 may be used to provide a pointer to the command attribute information 558 of the call corresponding to the call or shared core extended command register. In some embodiments, the call attribute information may be included in the call attribute data structure. In some embodiments, the call attribute information may be stored in one or more memory locations in memory 518. In some embodiments, the pointer may be an explicit virtual memory pointer. The call attribute information may further specify, authorize, define, or characterize multiple attributes of the call. For example, call attribute information may further specify, authorize, define or characterize the exact type of data processing to be performed by the shared core extension logic. In some embodiments, the plurality of command attributes are relatively simple or short, such as multiple operations that transpose a matrix, multiple operations that generate a histogram, multiple operations that perform filtering, and so on. Processing routines or processing representing a plurality of functions may be described. 
The command attributes may describe a sequence of operations to perform on one or more input data operands (e.g., one or more input data structures) to produce one or more output data operands (e.g., one or more output data structures). In some embodiments, these may be any of various such relatively simple algorithms or routines of the kind typically performed by hardware accelerators, graphics processing units, or the like.
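As a rough illustration of how command attributes might select a short processing routine, the sketch below dispatches on an operation name. The attribute layout (a dictionary with an `"operation"` key) and the names `ROUTINES` and `execute` are hypothetical; the description does not define the format of the command attribute data structure.

```python
from collections import Counter
from typing import Callable, Dict, List


def transpose(matrix: List[List[int]]) -> List[List[int]]:
    """Transpose a rectangular matrix given as a list of rows."""
    return [list(column) for column in zip(*matrix)]


def histogram(values: List[int]) -> Dict[int, int]:
    """Count how often each value occurs."""
    return dict(Counter(values))


# Hypothetical table of routines the shared logic could offer.
ROUTINES: Dict[str, Callable] = {
    "transpose": transpose,
    "histogram": histogram,
}


def execute(command_attributes: dict, input_operand):
    """Run the routine named by the command attributes on the input operand."""
    routine = ROUTINES[command_attributes["operation"]]
    return routine(input_operand)


transposed = execute({"operation": "transpose"}, [[1, 2, 3], [4, 5, 6]])
counts = execute({"operation": "histogram"}, [1, 1, 2])
```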

入力データオペランドポインタフィールド５５６は、１つまたは複数の入力データオペランドを指す１つまたは複数のポインタを提供するために用いられてもよい。複数の入力データオペランドは、データ処理が共有コア拡張ロジックによって実行されるべきものである。いくつかの実施形態では、１つまたは複数の入力データオペランドは、例えば、複数の行列、複数のテーブル等のような１つまたは複数のデータ構造を表してもよい。示されるように、いくつかの実施形態では、ポインタは、メモリ５１８のメモリ位置における入力データオペランドを指してもよい。いくつかの実施形態では、ポインタは、明示的な仮想メモリポインタであってもよい。他の複数の実施形態では、複数のポインタは、１つまたは複数のレジスタまたは他の複数の格納位置における１つまたは複数の入力データオペランドを指してもよい。 The input data operand pointer field 556 may be used to provide one or more pointers to one or more input data operands. Multiple input data operands are those whose data processing should be performed by shared core extension logic. In some embodiments, the one or more input data operands may represent one or more data structures such as, for example, a plurality of matrices, a plurality of tables, and the like. As shown, in some embodiments, the pointer may point to an input data operand at the memory location of memory 518. In some embodiments, the pointer may be an explicit virtual memory pointer. In other embodiments, the pointer may point to one or more input data operands at one or more registers or other storage locations.

出力データオペランドポインタフィールド５５７は、１つまたは複数の出力データオペランドを指す１つまたは複数のポインタを提供するために用いられてもよい。複数の出力データオペランドは、呼び出し完了時に、共有コア拡張ロジックによって実行されたデータ処理の複数の結果を伝達するためのものである。いくつかの実施形態では、１つまたは複数の出力データオペランドは、例えば、複数の行列、複数のテーブル等のような１つまたは複数のデータ構造を表してもよい。示されるように、いくつかの実施形態では、ポインタは、メモリのメモリ位置における出力データオペランドを指してもよい。いくつかの実施形態では、ポインタは、明示的な仮想メモリポインタであってもよい。他の複数の実施形態では、複数のポインタは、１つまたは複数のレジスタまたは他の複数の格納位置における１つまたは複数の出力データオペランドを指してもよい。 The output data operand pointer field 557 may be used to provide one or more pointers to one or more output data operands. Multiple output data operands are intended to convey multiple results of data processing performed by the shared core extension logic upon completion of the call. In some embodiments, the one or more output data operands may represent one or more data structures, such as, for example, a plurality of matrices, a plurality of tables, and the like. As shown, in some embodiments, the pointer may point to an output data operand at a memory location in memory. In some embodiments, the pointer may be an explicit virtual memory pointer. In other embodiments, the pointer may point to one or more output data operands at one or more registers or other storage locations.

これは、共有コア拡張コマンドレジスタに適したフォーマットの実施形態の一例に過ぎないことが理解されよう。別の複数の実施形態は、図示された複数のフィールドのいくつかを省略してもよく、またはさらなる複数のフィールドを追加してもよい。例えば、複数のフィールドの１つまたは複数は、共有コア拡張コマンドレジスタにおいて明示的に指定される必要のない暗示的な位置を介して提供されてもよい。他の例として、入力データオペランド格納位置は、これが二度指定される必要がないように、出力データオペランド格納位置として再使用されてもよいが、複数の仕様の１つは暗示的であってもよい。さらに他の例として、１つまたは複数のフィールドは、複数のメモリポインタの代わりに、複数のオペコード及び複数の引数の直接符号化を有してもよい。さらに、複数のフィールドの図示された順序／構成は、必要ではなく、むしろ、複数のフィールドは、再構成されてもよい。さらに、複数のフィールドは、複数のビットの連続的な複数のシーケンスを（図中で示唆されるように）含む必要がなく、むしろ、不連続または分離された複数のビットからなってもよい。 It is to be appreciated that this is just one example embodiment of a suitable format for a shared core extension command register. Alternate embodiments may omit some of the illustrated fields or may add further fields. For example, one or more of the fields may be provided through an implicit location that need not be explicitly specified in the shared core extension command register. As another example, an input data operand storage location may be reused as an output data operand storage location so that it need not be specified twice; one of the specifications may then be implicit. As yet another example, one or more fields may hold direct encodings of opcodes and arguments instead of memory pointers. Moreover, the illustrated order/arrangement of the fields is not required; the fields may be rearranged. Furthermore, a field need not consist of a contiguous sequence of bits (as the illustration suggests) and may instead be composed of non-contiguous or separated bits.

再び図３を参照すると、ＳＣＥ呼び出し命令の実行後、ＳＣＥ呼び出し命令（例えば、ＳＣＥＣＲｘ）によって示される共有コア拡張コマンドレジスタは、ＳＣＥ呼び出し命令に対応するデータを格納してもよい。スレッドまたはコアがタスクまたは呼び出しを送信した後、スレッドまたはコアは、先に送信された複数の呼び出しまたは複数のタスクが完了する前に、さらなる複数の呼び出しまたは複数のタスクの準備及び共有コア拡張ロジックへの送信に進んでもよい。さらに、スレッドまたはコアは、前に送信された複数の呼び出しまたは複数のタスクが完了する間に、他の処理の実行に進んでもよい。複数の共有コア拡張コマンドレジスタは、スケジューラ（さらに後述される）と共に、共有コア拡張ロジックに対する複数のタスクまたは複数の呼び出し完了までの間、及びこれらが完了するまで、複数のスレッド及び／または複数のコアが複数のタスクまたは複数の呼び出しを送信し、次に他の複数のタスクまたは複数の呼び出しの送信または他の処理の実行に進むことを可能とする、きめ細かい制御フローの提供を助けてもよい。 Referring again to FIG. 3, after the SCE call instruction has been executed, the shared core extension command register indicated by the SCE call instruction (e.g., SCECRx) may store data corresponding to the SCE call instruction. After a thread or core submits a task or call, it may go on to prepare and submit further calls or tasks to the shared core extension logic before the previously submitted calls or tasks complete. The thread or core may also go on to perform other processing while the previously submitted calls or tasks complete. The shared core extension command registers, together with a scheduler (described further below), may help provide fine-grained control flow in which threads and/or cores submit tasks or calls to the shared core extension logic and then, while and until those tasks or calls complete, go on to submit other tasks or calls or perform other processing.

共有コア拡張ロジック３１４は、複数のコア０共有コア拡張コマンドレジスタ３２８にアクセスするコアインターフェースロジック３１５を含む。コアインターフェースロジックは、コアＭ及び任意の他の複数のコア（もしあれば）の複数の共有コア拡張コマンドレジスタ３４０にアクセスするためにさらに用いられてもよい。すなわち、いくつかの実施形態では、共有コア拡張ロジック及び／またはコアインターフェースロジックは、複数のコアの各々に対する複数の共有コア拡張コマンドレジスタの個別のセットにアクセスしてもよい。 The shared core expansion logic 314 includes a core interface logic 315 that accesses a plurality of core 0 shared core expansion command registers 328. Core interface logic may be further used to access multiple shared core extension command registers 340 of core M and any other multiple cores (if any). That is, in some embodiments, the shared core extension logic and / or the core interface logic may access a separate set of multiple shared core extension command registers for each of the plurality of cores.

共有コア拡張ロジックは、複数の共有コア拡張コマンドレジスタ３２８を用いてもよい。例えば、共有コア拡張ロジックは、コマンドフィールド（例えば、フィールド５５５）に指されるコマンド属性情報にアクセスしてもよく、複数の入力データオペランドフィールド（例えば、フィールド５５６）に指される複数の入力データオペランドにアクセスしてもよく、進捗度フィールド（例えば、フィールド５５４）におけるデータ処理の結果として進捗度を更新してもよく、オペレーションが行われ、またはエラーに直面した場合は、完了またはエラーを反映するためにステータスフィールド（例えば、フィールド５５３）を更新してもよく、エラーなしで完了した場合は、複数の出力データオペランドフィールド（例えば、フィールド５５７）のポインタを介して、複数の出力データオペランドにアクセスしてもよい。 The shared core extension logic may use the shared core extension command registers 328. For example, the shared core extension logic may access the command attribute information pointed to by the command pointer field (e.g., field 555), may access the input data operands pointed to by the input data operand pointer field (e.g., field 556), may update the progress in the progress field (e.g., field 554) as the data processing advances, may update the status field (e.g., field 553) to reflect completion or an error when the operation finishes or encounters an error, and, on completion without error, may access the output data operands through the pointer in the output data operand pointer field (e.g., field 557).
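The sequence of register accesses described above can be sketched as a small simulation. Everything here is an assumption made for illustration: memory is modeled as a dictionary keyed by pointer value, the command attributes hold a single hypothetical `"scale"` factor, and the names `CommandRegister` and `service` are invented for the example.

```python
from dataclasses import dataclass

VALID, COMPLETE, ERROR = "valid", "complete", "error"


@dataclass
class CommandRegister:
    """Hypothetical model of the five fields shown in FIG. 5."""
    status: str
    progress: int
    command_ptr: int
    input_ptr: int
    output_ptr: int


def service(reg: CommandRegister, memory: dict) -> None:
    """Process one call the way the shared logic is described as doing."""
    try:
        attrs = memory[reg.command_ptr]   # follow the command pointer
        data = memory[reg.input_ptr]      # follow the input operand pointer
        result = []
        for value in data:
            result.append(attrs["scale"] * value)
            reg.progress += 1             # record progress as work is done
        memory[reg.output_ptr] = result   # write output only on success
        reg.status = COMPLETE
    except (KeyError, TypeError):
        reg.status = ERROR                # reflect the error in the status


memory = {0x10: {"scale": 3}, 0x20: [1, 2, 3]}
good = CommandRegister(VALID, 0, command_ptr=0x10, input_ptr=0x20, output_ptr=0x30)
service(good, memory)

bad = CommandRegister(VALID, 0, command_ptr=0x10, input_ptr=0x99, output_ptr=0x40)
service(bad, memory)  # input pointer does not resolve, so the call fails
```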

説明を容易にするために、共有コア拡張ロジックは、コア０、スレッド０の複数の共有コア拡張コマンドレジスタのコピーを有するものとして示される。しかしながら、共有コア拡張ロジックの複数の共有コア拡張コマンドレジスタは、実際にはコア０、スレッド０の複数の共有コア拡張コマンドレジスタの２つのセットがない場合があることを示すために、複数の破線で示される。むしろ、コア０及び共有コア拡張ロジックの両方は、論理的に、コア０、スレッド０の複数の共有コア拡張コマンドレジスタの同じセットを見てもよい。同様に、共有コア拡張ロジックは、潜在的には、コアＭ、スレッドＰのセット３４０までの他の複数のプロセッサの他の複数のスレッドの対応する複数の共有コア拡張コマンドレジスタを見てもよい。さらに明確性のために、物理的コア０、スレッド０の複数の共有コア拡張コマンドレジスタは、コア０に、共有コア拡張ロジックに、コア０の外側かつ共有コア拡張ロジックの外側の位置に、または複数の異なる位置の組み合わせに位置してもよい。 For ease of description, the shared core extension logic is shown as having a copy of the core 0, thread 0 shared core extension command registers. However, the shared core extension command registers of the shared core extension logic are drawn with dashed lines to indicate that there may not actually be two sets of core 0, thread 0 shared core extension command registers. Rather, core 0 and the shared core extension logic may both logically view the same set of core 0, thread 0 shared core extension command registers. Similarly, the shared core extension logic may view the corresponding shared core extension command registers of other threads of other processors, potentially through the set 340 for core M, thread P. Also for clarity, the physical core 0, thread 0 shared core extension command registers may be located in core 0, in the shared core extension logic, at a location outside both core 0 and the shared core extension logic, or in a combination of different locations.

共有コア拡張ロジック３１４は、スケジューラ３４４の実施形態を含む。スケジューラは、ハードウェア、ソフトウェア、ファームウェアまたはいくつかの組み合わせで実装されてもよい。一態様では、スケジューラは、ハードウェアスケジューラであってもよい。スケジューラは、コア０の共有コア拡張コマンドレジスタ３２８からコアＭの複数の共有コア拡張コマンドレジスタ３４０にアクセスし、共有データ処理ロジック３１６上でこれらの複数のレジスタを介して伝達される複数の呼び出しに関連付けられたデータ処理をスケジューリングするように動作可能であってもよい。いくつかの実施形態では、スケジューラは、プログラマブルスケジューリングアルゴリズムまたは目的に従って複数のコアに対してデータ処理をスケジューリングする、プログラマブルハードウェアスケジューラまたはプログラマブルハードウェアスケジューリングロジックを表してもよい。いくつかの実施形態では、ハードウェアスケジューラは、コマンド複数のレジスタ間及び複数の物理的スレッド間で循環するように動作可能な状態機械として実装されてもよい。複数のアービトレーションポリシは、潜在的には、複数の機械固有レジスタ（ＭＳＲ）のセットを介して、ソフトウェアにさらされてもよい。他の複数の実施形態では、ハードウェアスケジューラは、例えば、固定の読み出し専用メモリ（ＲＯＭ）及びパッチ可能なランダムアクセスメモリ（ＲＡＭ）のドメインの両方を組み込んだファームウェアブロックとして実装されてもよい。これにより、潜在的には、ハードウェアスケジューラは、オペレーティングシステムの複数の指示、複数のアプリケーションプログラミングインターフェース（ＡＰＩ）、ランタイムコンパイラの複数の指示、複数のリアルタイムハードウェアシグナルまたはかかる複数の制御の組み合わせに依存し得るより複雑なスケジューリングアルゴリズムを用いることができる。例として、スケジューリングは、公平なスケジューリングアルゴリズム、複数のコアのいくつかに対して他よりも重みづけされたスケジューリングアルゴリズム（例えば、コア負荷、処理されているスレッドまたはデータの臨界時間、スレッド優先度に基づいて、または他の複数の目的に従って）であってもよい。当技術分野で公知の多くの異なるタイプのスケジューリングアルゴリズムは、それらの複数の実装の特定の複数の目的に応じた異なる複数の実装に適している。スケジューラは、共有データ処理ロジックに対してスケジューリングされた複数の呼び出しまたは複数のタスクの完了をさらにモニタリングしてもよい。 The shared core extension logic 314 includes an embodiment of the scheduler 344. The scheduler may be implemented in hardware, software, firmware or some combination. In one aspect, the scheduler may be a hardware scheduler. The scheduler accesses the multiple shared core extension command registers 340 of the core M from the shared core extension command register 328 of the core 0, and makes multiple calls transmitted through these multiple registers on the shared data processing logic 316. It may be operational to schedule the associated data processing. In some embodiments, the scheduler may represent a programmable hardware scheduler or programmable hardware scheduling logic that schedules data processing for multiple cores according to a programmable scheduling algorithm or purpose. 
In some embodiments, the hardware scheduler may be implemented as a state machine operable to rotate among the command registers and among the physical threads. Arbitration policies may potentially be exposed to software through a set of machine specific registers (MSRs). In other embodiments, the hardware scheduler may be implemented as a firmware block, for example one incorporating both fixed read-only memory (ROM) and patchable random access memory (RAM) domains. This may potentially allow the hardware scheduler to use more elaborate scheduling algorithms that rely on operating system directives, application programming interfaces (APIs), run-time compiler directives, real-time hardware signals, or a combination of such controls. By way of example, the scheduling may use a fair scheduling algorithm or a weighted scheduling algorithm that favors some cores over others (e.g., based on core load, the time criticality of the thread or data being processed, thread priority, or other objectives). Many different types of scheduling algorithms known in the art are suitable for different implementations, depending on the particular objectives of those implementations. The scheduler may also monitor the completion of the calls or tasks scheduled on the shared data processing logic.
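A state machine that rotates among command registers and physical threads can be approximated in software by a round-robin scan. The sketch below is one possible fair policy, not the scheduler of the embodiments; the slot tuple `(core, thread, register index)` and the names `RoundRobinScheduler`, `submit`, and `next_call` are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

Slot = Tuple[int, int, int]  # (core, thread, command register index), assumed


class RoundRobinScheduler:
    """Toy fair scheduler that rotates over all command register slots."""

    def __init__(self, slots: List[Slot]):
        self.slots = slots
        self.pending: Dict[Slot, str] = {}  # slot -> call waiting to run
        self.cursor = 0                     # next slot to examine

    def submit(self, slot: Slot, call: str) -> None:
        """Record a call written to the given slot's command register."""
        self.pending[slot] = call

    def next_call(self) -> Optional[Tuple[Slot, str]]:
        """Pick the next pending call, scanning from the cursor position."""
        for step in range(len(self.slots)):
            index = (self.cursor + step) % len(self.slots)
            slot = self.slots[index]
            if slot in self.pending:
                self.cursor = (index + 1) % len(self.slots)
                return slot, self.pending.pop(slot)
        return None  # nothing is waiting


scheduler = RoundRobinScheduler([(0, 0, 0), (0, 1, 0), (1, 0, 0)])
scheduler.submit((1, 0, 0), "c")
scheduler.submit((0, 0, 0), "a")
order = [scheduler.next_call()]
scheduler.submit((0, 0, 0), "a2")  # core 0 resubmits right away
order.append(scheduler.next_call())
order.append(scheduler.next_call())
order.append(scheduler.next_call())
```

Advancing the cursor past the slot just served keeps a core that resubmits immediately from starving the others; a weighted policy would replace the simple scan with one that consults per-core priorities.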

共有コア拡張ロジック３１４は、ステータス及び／または進捗度更新ロジック３４９をさらに含む。ステータス及び／または進捗度更新ロジックは、共有データ処理ロジック３１６によって処理されている複数の呼び出しのステータス及び／または進捗度をモニタリングしてもよい。ステータス及び／または進捗度更新ロジックは、モニタリングされたステータス及び／または進捗度に基づいて複数の呼び出しに対応する複数の共有コア拡張コマンドレジスタをさらに更新してもよい。例えば、図５のステータスフィールド５５３及び進捗度フィールド５５４は、更新されてもよい。例として、共有コア拡張ロジックに対する呼び出しが完了した場合、ステータスは、完了されたことを反映するために更新されてもよく、または共有コア拡張ロジックに対する呼び出し処理がエラーに直面した場合、ステータスは、エラー状態を反映するために更新されてもよい。他の例として、呼び出しに関連付けられたデータ処理にわたって、ステータス及び／または進捗度更新ロジックは、呼び出し完了の進捗度を更新してもよい（例えば、進捗度フィールド５５４において、アトミックな複数のコミット点を更新してもよい）。 The shared core extension logic 314 further includes status and/or progress update logic 349. The status and/or progress update logic may monitor the status and/or progress of the calls being processed by the shared data processing logic 316. It may also update the shared core extension command registers corresponding to those calls based on the monitored status and/or progress. For example, the status field 553 and the progress field 554 of FIG. 5 may be updated. By way of example, when a call to the shared core extension logic completes, the status may be updated to reflect completion, and when processing of a call encounters an error, the status may be updated to reflect the error condition. As another example, throughout the data processing associated with a call, the status and/or progress update logic may update the progress toward completion of the call (e.g., by updating atomic commit points in the progress field 554).
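The update behavior described above can be illustrated with a small model of a command register that has a status field and a progress field. The sketch below is an assumption-laden illustration, not the patented hardware: `Status`, `CommandRegister`, and `update_register` are hypothetical names, and the progress field is modeled as a plain count of commit points reached.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    VALID = "valid"          # call accepted and still in flight
    COMPLETED = "completed"
    ERROR = "error"


@dataclass
class CommandRegister:
    status: Status = Status.VALID
    progress: int = 0        # number of atomic commit points reached so far


def update_register(reg: CommandRegister, done: int, total: int,
                    error: bool = False) -> None:
    """Record monitored progress, then reflect completion or an error."""
    reg.progress = done
    if error:
        reg.status = Status.ERROR
    elif done >= total:
        reg.status = Status.COMPLETED


reg = CommandRegister()
update_register(reg, done=2, total=4)
midway = (reg.status, reg.progress)
update_register(reg, done=4, total=4)

failed = CommandRegister()
update_register(failed, done=1, total=4, error=True)
```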

いくつかの実施形態では、オペレーティングシステムは、複数のコンテキストスイッチにおける共有コア拡張コマンドレジスタの状態を管理するために、状態保存／状態復元機能性（例えば、インテルアーキテクチャではｘｓａｖｅ／ｘｒｅｓｔｏｒｅ）を用いてもよい。共有コア拡張ロジックによってまだ完了していない複数の呼び出しまたは複数のコマンドは、コンテキストスイッチにおいて、物理的スレッドによって保存され、次に復元及び再開されてもよい。いくつかの実施形態では、コンテキストスイッチ及びオペレーティングシステムのプレエンプションをサポートするために、複数の共有コア拡張コマンドレジスタは、上述の進捗度フィールドを有することにより、共有コア拡張ロジックによって処理されているデータ処理タスク（の例えばアトミックな進捗度）を記録してもよい。進捗度フィールドは、コンテキストスイッチにおいて、スレッドコンテキストの一部として保存され、オペレーティングシステムがスレッドをリスケジュールした場合にタスクを再開するために用いられてもよい。 In some embodiments, the operating system may use state save/state restore functionality (e.g., xsave/xrestore in Intel Architecture) to manage the state of the shared core extension command registers across context switches. Calls or commands that the shared core extension logic has not yet completed may be saved by the physical thread on a context switch and later restored and resumed. In some embodiments, to support context switches and operating system preemption, the shared core extension command registers may have the progress field described above, which records the (e.g., atomic) progress of the data processing task being handled by the shared core extension logic. The progress field may be saved on a context switch as part of the thread context and used to resume the task when the operating system reschedules the thread.
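How a saved progress field lets an interrupted task resume can be sketched as follows. This is a toy model under stated assumptions: the task (summing a list), the `budget` parameter that simulates preemption, and the names `CommandRegister` and `process` are all hypothetical, and copying the dataclass stands in for the state save/restore that the text attributes to the operating system.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass
class CommandRegister:
    progress: int = 0  # elements committed so far
    result: int = 0    # running output of the task


def process(reg: CommandRegister, data: List[int], budget: int) -> None:
    """Commit up to `budget` elements, advancing progress after each one."""
    end = min(len(data), reg.progress + budget)
    for i in range(reg.progress, end):
        reg.result += data[i]
        reg.progress = i + 1


data = [1, 2, 3, 4, 5]
reg = CommandRegister()
process(reg, data, budget=2)          # preempted after two elements
saved = replace(reg)                  # context switch: save state with the thread
reg = CommandRegister()               # register is reused by another thread
reg = replace(saved)                  # thread rescheduled: restore state
process(reg, data, budget=len(data))  # resume from the saved progress
```

Because work restarts at `progress` rather than at zero, elements committed before the switch are not processed twice.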

共有コア拡張ロジック３１４は、共有コア拡張制御ロジック３４３をさらに含む。 The shared core extension logic 314 further includes shared core extension control logic 343.

共有コア拡張制御ロジックは、スケジューラ３４４、共有データ処理ロジック３１６、ステータス/進捗度更新ロジック３４９、コア０－Ｍの複数の共有コア拡張コマンドレジスタ３２８、３４０、さらに後述される共有コア拡張メモリ管理ユニット（ＭＭＵ）３４１と連結される。共有コア拡張制御ロジックは、様々な制御、管理、調整、タイミング及び共有コア拡張ロジック３１４が関連する複数の実装態様を支援してもよい。 The shared core extension control logic is coupled with the scheduler 344, the shared data processing logic 316, the status/progress update logic 349, the shared core extension command registers 328, 340 of cores 0 to M, and the shared core extension memory management unit (MMU) 341, which is described further below. The shared core extension control logic may assist with various control, management, coordination, timing, and related implementation aspects of the shared core extension logic 314.

再び図３のＳＣＥ呼び出し命令３２４及び／または図４の方法のＳＣＥ呼び出し命令を参照すると、いくつかの実施形態では、ＳＣＥ呼び出し命令は、非ブロックＳＣＥ呼び出し命令であってもよい。いくつかの実施形態では、非ブロックＳＣＥ呼び出し命令は、スレッド（例えば、物理的スレッド）から非投機的に送信されてもよく、非ブロックＳＣＥ呼び出し命令が共有コア拡張ロジックにおいて実行のために受け入れられた後で、発行スレッドが実行しているコアにおいてリタイアしてもよい（例えば、ＳＣＥコマンドレジスタに格納される）。 Referring again to the SCE call instruction 324 of FIG. 3 and/or the SCE call instruction of the method of FIG. 4, in some embodiments the SCE call instruction may be a non-blocking SCE call instruction. In some embodiments, a non-blocking SCE call instruction may be sent non-speculatively from a thread (e.g., a physical thread), and may retire at the core on which the issuing thread is running once the shared core extension logic has accepted it for execution (e.g., once it has been stored in the SCE command register).

他の複数の実施形態では、ＳＣＥ呼び出し命令は、ブロックＳＣＥ呼び出し命令であってもよい。 In other embodiments, the SCE call instruction may be a blocking SCE call instruction.

いくつかの実施形態では、ブロックＳＣＥ呼び出し命令は、スレッド（例えば、物理的スレッド）から非投機的に送信されてもよく、呼び出しまたはタスクの実行が共有コア拡張ロジックにおいて完了した後で（例えば、共有コア拡張コマンドレジスタのステータスフィールドが、完了されたことを反映するために更新された場合）、発行スレッドが実行しているコアにおいてリタイアしてもよい。いくつかの実施形態では、複数のＳＣＥ呼び出し命令の非ブロック及びブロックの違いの両方は、命令セットに含まれてもよい。 In some embodiments, a blocking SCE call instruction may be sent non-speculatively from a thread (e.g., a physical thread), and may retire at the core on which the issuing thread is running after execution of the call or task has completed at the shared core extension logic (e.g., when the status field of the shared core extension command register has been updated to reflect completion). In some embodiments, both non-blocking and blocking variants of the SCE call instruction may be included in the instruction set.

いくつかの実施形態では、ブロックＳＣＥ呼び出し命令は、共有コア拡張コマンドレジスタの解放を待つためのタイムアウト値（例えば、サイクル数）を指定または示してもよい。例えば、このサイクル数または他のタイムアウト値は、ＳＣＥ呼び出し命令の複数のパラメータの１つに指定されてもよい。いくつかの実施形態では、フェール、フォールト、エラー等は、共有コア拡張コマンドレジスタが解放されることなくタイムアウト値に達した場合には、呼び出しに応答して戻されてもよい。 In some embodiments, a blocking SCE call instruction may specify or otherwise indicate a timeout value (e.g., a number of cycles) to wait for a shared core extension command register to be released. For example, this number of cycles or other timeout value may be specified in one of the parameters of the SCE call instruction. In some embodiments, a failure, fault, error, or the like may be returned in response to the call if the timeout value is reached without the shared core extension command register being released.
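The timeout behavior of a blocking call can be sketched as a loop that checks once per simulated cycle whether the command register has been released. The function name `blocking_sce_call`, the `register_is_free` callback, and the string return values are hypothetical choices for illustration, not an instruction encoding from the patent.

```python
from typing import Callable


def blocking_sce_call(register_is_free: Callable[[int], bool],
                      timeout_cycles: int) -> str:
    """Check once per simulated cycle and give up when the timeout is reached."""
    for cycle in range(timeout_cycles):
        if register_is_free(cycle):
            return "accepted"
    return "fail"  # timeout reached without the register being released


RELEASED_AT = 5  # cycle at which the modeled register becomes free


def is_free(cycle: int) -> bool:
    return cycle >= RELEASED_AT


long_wait = blocking_sce_call(is_free, timeout_cycles=10)
short_wait = blocking_sce_call(is_free, timeout_cycles=3)
```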

ＳＣＥ呼び出し命令のリタイアを受けて、共有コア拡張ロジックは、割り当てられたタスクまたは呼び出しに従って、メモリ状態を修正してもよい。マルチスレッディング環境では、共有コア拡張を用い、複数の共有オペランドを有し得る複数の論理スレッド間でキャッシュのコヒーレンシ及びメモリの順序を維持するためにソフトウェアの同期が実行されてもよい。代替的に、ハードウェアの同期も、選択的に実行されてもよい。 Upon retirement of the SCE call instruction, the shared core extension logic may modify the memory state according to the assigned task or call. In a multithreading environment, shared core extensions may be used to perform software synchronization to maintain cache coherency and memory order across multiple logical threads that may have multiple shared operands. Alternatively, hardware synchronization may also be performed selectively.

図６は、ＳＣＥ読み出し命令の実施形態を処理する方法６６２の実施形態のブロックフロー図である。複数の実施形態では、方法は、プロセッサ、コアまたは他のタイプの命令処理装置によって実行されてもよい。いくつかの実施形態では、方法６６２は、図２のプロセッサ２１０または図３のコア０３１１－０、または同様のプロセッサまたはコアによって実行されてもよい。代替的に、方法６６２は、完全に異なるプロセッサ、コアまたは命令処理装置によって実行されてもよい。さらに、プロセッサ２１０及びコア３１１－０は、方法６６２と同じ、同様の、または異なる、複数のオペレーション及び複数の方法の複数の実施形態を実行してもよい。 FIG. 6 is a block flow diagram of an embodiment of a method 662 of processing an embodiment of an SCE read instruction. In embodiments, the method may be performed by a processor, a core, or another type of instruction processing apparatus. In some embodiments, the method 662 may be performed by the processor 210 of FIG. 2, by core 0 311-0 of FIG. 3, or by a similar processor or core. Alternatively, the method 662 may be performed by an entirely different processor, core, or instruction processing apparatus. Moreover, the processor 210 and the core 311-0 may perform embodiments of operations and methods that are the same as, similar to, or different from those of the method 662.

共有コア拡張（ＳＣＥ）読み出し命令は、ブロック６６３において、複数のコアを有するプロセッサのコア内で受信される。様々な複数の態様では、ＳＣＥ読み出し命令は、オフコアソースから（例えば、メインメモリ、ディスク、もしくはバスまたはインターコネクトから）コアで受信されてもよく、または、コア内の他のロジック（例えば、命令キャッシュ、キュー、スケジューリングロジック等）からコアの一部で（例えば、復号ロジック、スケジューリングロジック等で）受信されてもよい。ＳＣＥ読み出し命令は、コアに、共有コア拡張ロジックに対して前になされた呼び出しのステータスの読み出しを実行させる。共有コア拡張ロジックは、複数のコアにより共有される。ＳＣＥ読み出し命令は、共有コア拡張コマンドレジスタを示す。 The shared core extension (SCE) read instruction is received in block 663 within the core of a processor having a plurality of cores. In various aspects, the SCE read instruction may be received in the core from an off-core source (eg, from main memory, disk, or bus or interconnect), or other logic within the core (eg, instruction). It may be received from a part of the core (eg, in the decoding logic, scheduling logic, etc.) from the cache, queue, scheduling logic, etc.). The SCE read instruction causes the core to read the status of the previous call to the shared core extension logic. Shared core extension logic is shared by multiple cores. The SCE read instruction indicates a shared core extended command register.

共有コア拡張ロジックに対して前になされた呼び出しのステータスは、ブロック６６４において、ＳＣＥ読み出し命令に応答して読み出される。いくつかの実施形態では、ステータスの読み出しは、命令によって示される共有コア拡張コマンドレジスタからのデータの読み出しを含んでもよい。いくつかの実施形態では、ステータスは、完了ステータスを含んでもよい。例えば、ステータスフィールド（例えば、図５のステータスフィールド５５３）は、読み出されてもよい。いくつかの実施形態では、読み出しステータスは、完了、エラー、有効から選択されてもよいが、本発明の範囲は、このように限定されない。 The status of the previous call to the shared core extension logic is read in block 664 in response to the SCE read instruction. In some embodiments, reading the status may include reading data from the shared core extended command register indicated by the instruction. In some embodiments, the status may include a completion status. For example, the status field (eg, status field 553 in FIG. 5) may be read. In some embodiments, the read status may be selected from Complete, Error, Valid, but the scope of the invention is not thus limited.

他の複数の実施形態では、ＳＣＥ読み出し命令は、示された共有コア拡張コマンドレジスタからの他の情報を読み出してもよい。かかる情報の複数の例は、限定的ではないが、（例えば、図５の進捗度フィールド５５４からの）進捗度、（例えば、フィールド５５７によって示されたように）出力データオペランドまたその一部及び（例えば、フィールド５５５によって示されたように）コマンド属性情報を含む。いくつかの実施形態では、コアがＳＣＥ読み出し命令を受信する代わりにデータ処理が実行されるよう、共有コア拡張コマンドレジスタは、共有コア拡張ロジックに対する前の呼び出しに対応する。 In other embodiments, the SCE read instruction may read other information from the indicated shared core extension command register. Examples of such information include, but are not limited to, the progress (e.g., from the progress field 554 of FIG. 5), an output data operand or a portion thereof (e.g., as indicated by field 557), and command attribute information (e.g., as indicated by field 555). In some embodiments, the shared core extension command register corresponds to a previous call to the shared core extension logic that requested data processing to be performed on behalf of the core receiving the SCE read instruction.

図７は、ＳＣＥ停止命令の実施形態を処理する方法７６６の実施形態のブロックフロー図である。複数の実施形態では、方法は、プロセッサ、コアまたは他のタイプの命令処理装置によって実行されてもよい。いくつかの実施形態では、方法７６６は、図２のプロセッサ２１０または図３のコア０３１１－０、または同様のプロセッサまたはコアによって実行されてもよい。代替的に、方法７６６は、完全に異なるプロセッサ、コアまたは命令処理装置によって実行されてもよい。さらに、プロセッサ２１０及びコア３１１－０は、方法７６６と同じ、同様の、または異なる、複数のオペレーション及び複数の方法の複数の実施形態を実行してもよい。 FIG. 7 is a block flow diagram of an embodiment of a method 766 of processing an embodiment of an SCE stop instruction. In embodiments, the method may be performed by a processor, a core, or another type of instruction processing apparatus. In some embodiments, the method 766 may be performed by the processor 210 of FIG. 2, by core 0 311-0 of FIG. 3, or by a similar processor or core. Alternatively, the method 766 may be performed by an entirely different processor, core, or instruction processing apparatus. Moreover, the processor 210 and the core 311-0 may perform embodiments of operations and methods that are the same as, similar to, or different from those of the method 766.

共有コア拡張（ＳＣＥ）停止命令は、ブロック７６７において、複数のコアを有するプロセッサのコア内で受信される。様々な複数の態様では、ＳＣＥ停止命令は、オフコアソースから（例えば、メインメモリ、ディスク、もしくはバスまたはインターコネクトから）コアで受信されてもよく、または、コア内の他のロジック（例えば、命令キャッシュ、キュー、スケジューリングロジック等）からコアの一部で（例えば、復号ロジック、スケジューリングロジック等で）受信されてもよい。ＳＣＥ停止命令は、コアに、共有コア拡張ロジックに対して前になされた呼び出しを停止させるためのものである。共有コア拡張ロジックは、複数のコアにより共有される。ＳＣＥ停止命令は、共有コア拡張コマンドレジスタを示す。 The shared core extension (SCE) stop instruction is received in block 767 within the core of a processor having multiple cores. In various aspects, the SCE stop instruction may be received in the core from an off-core source (eg, from main memory, disk, or bus or interconnect), or other logic within the core (eg, instruction). It may be received from a part of the core (eg, in the decoding logic, scheduling logic, etc.) from the cache, queue, scheduling logic, etc. The SCE stop instruction is for the core to stop the previous call to the shared core extension logic. Shared core extension logic is shared by multiple cores. The SCE stop instruction indicates a shared core extended command register.

ブロック７６８において、ＳＣＥ停止命令に応答して、共有コア拡張ロジックに対して前になされた呼び出しは、停止される。いくつかの実施形態では、呼び出しの停止は、前になされた呼び出しに対応する、及び／または示された共有コア拡張コマンドレジスタに対応する共有コア拡張ロジックによるデータ処理の中止を含んでもよい。いくつかの実施形態では、呼び出しの停止は、ＳＣＥ停止命令によって示された、占められた共有コア拡張コマンドレジスタの解放をさらに含んでもよい。 At block 768, the previous call to the shared core extension logic is stopped in response to the SCE stop instruction. In some embodiments, stopping the call may include stopping data processing by the shared core extension logic corresponding to the previously made call and / or the indicated shared core extension command register. In some embodiments, stopping the call may further include releasing the occupied shared core extended command register indicated by the SCE stop instruction.
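The two effects of the stop instruction described above, halting the data processing and releasing the occupied command register, can be modeled in a few lines. The names `CommandRegister`, `sce_call`, and `sce_stop` and the two boolean fields are hypothetical and serve only to illustrate the described behavior.

```python
from dataclasses import dataclass


@dataclass
class CommandRegister:
    occupied: bool = False    # a call currently owns this register
    processing: bool = False  # the shared logic is working on that call


def sce_call(reg: CommandRegister) -> bool:
    """Accept a call only if the register is free."""
    if reg.occupied:
        return False
    reg.occupied = True
    reg.processing = True
    return True


def sce_stop(reg: CommandRegister) -> None:
    """Halt the data processing for the call and release the register."""
    reg.processing = False
    reg.occupied = False


reg = CommandRegister()
first = sce_call(reg)    # accepted
second = sce_call(reg)   # rejected: register still occupied
sce_stop(reg)
third = sce_call(reg)    # accepted again after the release
```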

いくつかの実施形態では、ブロックＳＣＥ呼び出し命令は、ＳＣＥＲＣＲの解放を待つためのタイムアウト値（例えば、サイクル数）を指定または示してもよく、呼び出しは、タイムアウトが経過した場合、フェールを戻してもよいフェールは、解放なくタイムアウトに達した場合、及び／またはタイムアウト経過前に完了していないコマンド実行の進行中にタイムアウトに達した場合のいずれかに発生し得る。非ブロック呼び出しでは、ＳＣＥ待ち命令は、共有コア拡張実行に対してブロックするために用いられてもよい。ＳＣＥ待ち命令は、同様に、共有コア拡張コマンドレジスタの解放を待つためのタイムアウト値（例えば、サイクル数）を含んでもよい。共有コア拡張コマンドレジスタの解放なしでタイムアウトが経過した場合、フェール、エラー等は、戻されてもよい。いくつかの実施形態では、ブロックＳＣＥ呼び出し命令及び／またはＳＣＥ待ち命令のタイムアウト値は、命令が指定し得る可変パラメータとして符号化されてもよい。他の複数の実施形態では、タイムアウトは、固定された暗示的な値であってもよい。いくつかの実施形態では、ＳＣＥ待ち命令は、電力消費を削減するために、非ブロックＳＣＥ呼び出し命令と共に用いられてもよい。例えば、ブロックＳＣＥ呼び出し命令がブロックした場合、及び／またはＳＣＥ待ち命令がブロックした場合、物理的スレッドは、（行われることが望ましい他の作業がないと仮定して）選択的に中断され、関連するＳＣＥ呼び出しの完了時に共有コア拡張ロジックがこれを起こすまで、スリープ状態にされてもよい。しかしながら、これは選択的であり、必要ではない。さらに、ブロックＳＣＥ呼び出し命令及び／またはＳＣＥ待ち命令を介してタイムアウト値を示す上述のアプローチの他に、予期されないまたは望ましくない長期間にわたり実行する呼び出しまたはコマンドを停止するための他の複数の方法も、考慮される。 In some embodiments, a blocking SCE call instruction may specify or otherwise indicate a timeout value (e.g., a number of cycles) to wait for the SCERCR to be released, and the call may return a failure if the timeout elapses. The failure may occur if the timeout is reached without a release, and/or if the timeout is reached while execution of a command is still in progress and has not completed before the timeout elapses. For non-blocking calls, an SCE wait instruction may be used to block on shared core extension execution. The SCE wait instruction may likewise include a timeout value (e.g., a number of cycles) to wait for the shared core extension command register to be released. If the timeout elapses without the shared core extension command register being released, a failure, error, or the like may be returned. In some embodiments, the timeout value of the blocking SCE call instruction and/or the SCE wait instruction may be encoded as a variable parameter that the instruction may specify. In other embodiments, the timeout may be a fixed implicit value. In some embodiments, the SCE wait instruction may be used together with a non-blocking SCE call instruction to reduce power consumption. 
For example, when a blocking SCE call instruction blocks and/or when an SCE wait instruction blocks, the physical thread may optionally be halted and put to sleep (assuming there is no other work that it is desirable to do) until the shared core extension logic wakes it when the relevant SCE call completes. However, this is optional and not required. Moreover, besides the approach described above of indicating a timeout value through a blocking SCE call instruction and/or an SCE wait instruction, other ways of stopping a call or command that runs for an unexpectedly or undesirably long time are also contemplated.

いくつかの実施形態では、ＳＣＥロジックは、コア０と同じ仮想メモリ上で動作してもよい。再び図３を参照すると、コア０３１１－０は、メモリ管理ユニット（ＭＭＵ）３３１を有する。ＭＭＵは、共有コア拡張ＭＭＵインターフェースロジック３３２を含む。ＭＭＵ３３１は、共有コア拡張ＭＭＵインターフェースロジック３３２を除き、実質的に従来のものであってもよい。共有コア拡張ロジック３１４は、共有コア拡張ＭＭＵ３４１を有する。ＳＣＥＭＭＵは、コア０のページマッピングを、維持してもよい（例えば、仮想のまたはリニアなメモリから、コア０によってキャッシュまたは保持されるシステムメモリへの複数の変換をキャッシュまたは保持する）。コア０のＴＬＢのものに対応する複数のＴＬＢエントリの維持に加えて、ＳＣＥＭＭＵは、複数のｏコアの各々に対する複数のＴＬＢエントリをさらに維持してもよい。共有コア拡張ＭＭＵは、コアＭＭＵインターフェースロジック３４２を有する。共有コア拡張ＭＭＵインターフェースロジック３３２及びコアＭＭＵインターフェースロジック３４２は、ＭＭＵ３３１及び共有コア拡張ＭＭＵ３４１の間で同期３４６を実行するために、互いにインターフェースをとる。いくつかの実施形態では、共有コア拡張ＭＭＵインターフェースロジック３３２及びコアＭＭＵインターフェースロジック３４２は、ハードウェアメカニズム、またはＭＭＵ３３１及び共有コア拡張ＭＭＵ３４１の同期に対するハードウェアサポートを表してもよい。 In some embodiments, the SCE logic may operate on the same virtual memory as core 0. Referring again to FIG. 3, core 0 311-0 has a memory management unit (MMU) 331. The MMU includes a shared core extended MMU interface logic 332. The MMU 331 may be substantially conventional, except for the shared core extended MMU interface logic 332. The shared core extension logic 314 has a shared core extension MMU341. The SCE MMU may maintain the page mapping of core 0 (eg, cache or retain multiple conversions from virtual or linear memory to system memory cached or held by core 0). In addition to maintaining multiple TLB entries corresponding to those of the core 0 TLB, the SCE MMU may further maintain multiple TLB entries for each of the plurality of o-cores. The shared core extension MMU has a core MMU interface logic 342. The shared core extended MMU interface logic 332 and the core MMU interface logic 342 interface with each other to perform synchronization 346 between the MMU 331 and the shared core extended MMU 341. In some embodiments, the shared core extended MMU interface logic 332 and core MMU interface logic 342 may represent a hardware mechanism or hardware support for synchronization of the MMU 331 and shared core extended MMU 341.

いくつかの実施形態では、ＭＭＵ及びＳＣＥＭＭＵ間の同期は、このページマッピングにおける整合を維持するために、実行されてもよい。例えば、ページがコア０によって無効化された場合、コア０は、コア０ＭＭＵの対応するＴＬＢエントリを無効化してもよい。いくつかの実施形態では、同期は、ＳＣＥロジックのＳＣＥＭＭＵに対する対応するＴＬＢエントリも対応して無効化され得るコア０及びＳＣＥロジックの間でも実行されてもよい。例として、コア０上で実行する物理的スレッドは、ＳＣＥロジックにシグナルを送るために共有コア拡張ＭＭＵインターフェースロジック３３２及びコアＭＭＵインターフェースロジック３４２によって提供されるハードウェアインターフェースを用いることにより、プロセッサ上の複数のバスサイクルを介して対応するＳＣＥＭＭＵのＴＬＢを無効化してもよい。すなわち、いくつかの実施形態では、共有コア拡張ＭＭＵ３４１の同期は、コア０上で実行する物理的スレッド内からのハードウェアによって実行されてもよい。他の例として、スレッドがオペレーティングシステムによってスワップされた場合（例えば、コンテキストスイッチ）、ＳＣＥロジックは、スレッドと関連付けられたコンテキストが後で復元可能なように保存され得るように、コンテキストスイッチのシグナル及び／または通知を受けてもよい。いくつかの実施形態では、かかる同期シグナリングは、ハードウェアレベルで（例えば、ハードウェアメカニズムを介した複数のバスサイクルまたは複数のバストランザクションを介して）あってもよい。すなわち、同期は、ソフトウェアの関与を介して（例えば、オペレーティングシステムの関与なしで）ではなく、ハードウェアレベルで（例えば、ＭＭＵ及びＳＣＥＭＭＵのハードウェア、ならびに複数のバストランザクションを介して）実行されてもよい。 In some embodiments, synchronization between the MMU and SCE MMU may be performed to maintain consistency in this page mapping. For example, if the page is invalidated by core 0, core 0 may invalidate the corresponding TLB entry in the core 0 MMU. In some embodiments, synchronization may also be performed between core 0 and SCE logic, where the corresponding TLB entry for the SCE logic's SCE MMU may also be correspondingly invalidated. As an example, a physical thread running on core 0 on a processor uses the hardware interface provided by the shared core extended MMU interface logic 332 and core MMU interface logic 342 to signal the SCE logic. The TLB of the corresponding SCE MMU may be disabled via multiple bus cycles. That is, in some embodiments, synchronization of the shared core extended MMU341 may be performed by hardware from within a physical thread running on core 0. As another example, if a thread is swapped by the operating system (eg, a context switch), the SCE logic will be able to save the context associated with the thread so that it can be restored later with the context switch signals and / Or you may be notified. 
In some embodiments, such synchronization signaling may occur at the hardware level (e.g., through bus cycles or bus transactions carried by a hardware mechanism). That is, the synchronization may be performed at the hardware level (e.g., through the hardware of the MMU and the SCE MMU and through bus transactions) rather than through software involvement (e.g., without involvement of the operating system).
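The invalidation synchronization described above can be sketched with two small classes. This is a software analogy only, under stated assumptions: `CoreMmu`, `SceMmu`, and their methods are hypothetical names, the TLBs are modeled as dictionaries, and the direct method call in `CoreMmu.invalidate` stands in for the hardware signal sent over bus cycles.

```python
from typing import Dict, Tuple


class SceMmu:
    """Holds translations for every core, keyed by (core id, virtual page)."""

    def __init__(self) -> None:
        self.tlb: Dict[Tuple[int, int], int] = {}

    def invalidate(self, core_id: int, vpage: int) -> None:
        self.tlb.pop((core_id, vpage), None)


class CoreMmu:
    def __init__(self, core_id: int, sce_mmu: SceMmu) -> None:
        self.core_id = core_id
        self.sce_mmu = sce_mmu
        self.tlb: Dict[int, int] = {}

    def map(self, vpage: int, frame: int) -> None:
        self.tlb[vpage] = frame
        self.sce_mmu.tlb[(self.core_id, vpage)] = frame

    def invalidate(self, vpage: int) -> None:
        self.tlb.pop(vpage, None)
        # Stands in for the hardware signal that keeps the SCE MMU consistent.
        self.sce_mmu.invalidate(self.core_id, vpage)


sce = SceMmu()
core0, core1 = CoreMmu(0, sce), CoreMmu(1, sce)
core0.map(0x10, 7)
core1.map(0x10, 9)
core0.invalidate(0x10)
```

Keying the shared TLB by core means that an invalidation from core 0 leaves the entry of core 1 for the same virtual page untouched.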

いくつかの実施形態では、ＭＭＵ３３１及び共有コア拡張ＭＭＵ３４１は、共有コア拡張ロジックがコア０に対する複数の呼び出しを処理する場合に発生する複数のページフォルトに対してルーティングまたは通信を行うために、インターフェースロジック３３２、３４２を介してさらに連携してもよい。いくつかの実施形態では、共有コア拡張ＭＭＵは、コア０からの呼び出しを処理している間に発生したページフォルトをオペレーティングシステムに通知するために、コア０を用いてもよい。同様に、共有コア拡張ＭＭＵは、他の複数のコアからの複数の呼び出しを処理している間に発生した複数のページフォルトを、これらの他の複数のコアに通知してもよい。複数のコアは、複数のページフォルトをオペレーティングシステムに通知してもよい。オペレーティングシステムは、ページフォルトが、ページフォルトを与えたコアにおいてではなく、実際にＳＣＥロジックにおいて発生したことを知る根拠を何ら有さない場合がある。いくつかの実施形態では、非ブロックＳＣＥ呼び出し命令に対して、コアに対してフォールトを指定する命令ポインタは、任意であってもよい。いくつかの実施形態では、ブロックＳＣＥ呼び出し命令に対して、フォールトした共有コア拡張ロジックのための命令ポインタは、呼び出したスレッドに対してフォールトされた呼び出しに対応するＳＣＥ呼び出し命令を指してもよい。 In some embodiments, the MMU331 and the shared core extension MMU341 are interface logics for routing or communicating for multiple page faults that occur when the shared core extension logic handles multiple calls to core 0. Further cooperation may be made via 332 and 342. In some embodiments, the shared core extended MMU may use core 0 to notify the operating system of page faults that occur while processing a call from core 0. Similarly, the shared core extended MMU may notify these other cores of multiple page faults that occur while processing multiple calls from the other cores. Multiple cores may notify the operating system of multiple page faults. The operating system may have no basis for knowing that the page fault actually occurred in the SCE logic, not in the core that gave the page fault. In some embodiments, the instruction pointer that specifies a fault for the core for the non-blocking SCE call instruction may be arbitrary. In some embodiments, for a block SCE call instruction, the instruction pointer for the faulted shared core extension logic may point to the SCE call instruction corresponding to the faulted call to the calling thread.
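The page-fault routing described above can be illustrated with a short model in which the shared MMU forwards a fault to the core that made the call, and that core reports it to the operating system. All names (`Core`, `SceMmu`, `access`, `report_page_fault`, `os_faults`) are hypothetical, and the operating system is reduced to a list that records which core reported each fault.

```python
from typing import Dict, List, Set, Tuple


class Core:
    def __init__(self, core_id: int, os_faults: List[Tuple[int, int]]) -> None:
        self.core_id = core_id
        self.os_faults = os_faults

    def report_page_fault(self, vpage: int) -> None:
        # The OS record names only the reporting core, not the SCE logic.
        self.os_faults.append((self.core_id, vpage))


class SceMmu:
    def __init__(self, cores: Dict[int, Core]) -> None:
        self.cores = cores
        self.mapped: Set[Tuple[int, int]] = set()  # (core id, virtual page)

    def access(self, calling_core: int, vpage: int) -> bool:
        """Return True on a hit; on a miss, route the fault to the calling core."""
        if (calling_core, vpage) in self.mapped:
            return True
        self.cores[calling_core].report_page_fault(vpage)
        return False


os_faults: List[Tuple[int, int]] = []
cores = {i: Core(i, os_faults) for i in range(2)}
sce_mmu = SceMmu(cores)
sce_mmu.mapped.add((0, 0x10))
hit = sce_mmu.access(0, 0x10)   # mapped for core 0
miss = sce_mmu.access(1, 0x20)  # faults while serving a call from core 1
```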

共有コア拡張ロジックは、処理をオフロードする当技術分野で公知の他の複数のアプローチよりも多数の利点を提供する。従来、複数のハードウェアアクセラレータ（例えば、複数のグラフィクス処理ユニット）等により、ソフトウェアベースのパラダイムが、複数のハードウェアアクセラレータと連携するために用いられる。複数のハードウェアアクセラレータは、一般に、複数のソフトウェアデバイスドライバにより管理される。システム複数の呼び出しは、複数のハードウェアアクセラレータの処理を用いるために、複数のアプリケーションが用いられる。ソフトウェア（例えば、オペレーティングシステム）の介在は、複数のコア上で実行する複数の異なるスレッドによるハードウェアアクセラレータの公平な利用を提供するために必要とされることが多い。かかる複数のハードウェアアクセラレータと比較して、共有コア拡張ロジックによれば、ドライバベースのハードウェアアクセラレータアクセスというソフトウェアパラダイムにシフトすることなく、共有コア拡張ロジック（例えば、複数の汎用コア）を用いる複数のコアの従来的なプログラミングパラダイムが可能となり得る。さらに、関連付けられた複数の物理的スレッドとしてＳＣＥロジックが同じ仮想メモリ上で動作する複数の実施形態では、データコピー及び／またはデータマーシャリングに伴うオーバヘッドなしで、これが用いられることができる。さらに、ハードウェアアクセラレータと比較して、共有コア拡張ロジックは、概して、進捗度を進めるために、少量のオープンページを伴う。さらに、ハードウェアアクセラレータと比較して、共有コア拡張ロジックは、概して、コマンドをおおよそ非投機的なコアバスサイクルのレイテンシに実質的に送信するレイテンシオーバヘッドを低減する傾向がある。また、ＳＣＥロジックは、ソフトウェア（例えば、オペレーティングシステム）の介在によってではなく、複数のコアで実行する複数の異なるスレッド間で公平かつ分散された利用を提供するために、ハードウェアまたは他のオンプロセッサのロジックにおけるスケジューリングユニットを用いてもよい。 The shared core extension logic offers a number of advantages over other approaches known in the art for offloading processing. Conventionally, with hardware accelerators (e.g., graphics processing units) and the like, a software-based paradigm is used to interact with the accelerators. Hardware accelerators are commonly managed by software device drivers. Applications use system calls to make use of the processing of the hardware accelerators. Intervention by software (e.g., the operating system) is often needed to provide fair use of a hardware accelerator by the different threads running on the cores. Compared with such hardware accelerators, the shared core extension logic may allow the cores that use it (e.g., general-purpose cores) to keep their traditional programming paradigm, without a shift to the software paradigm of driver-based hardware accelerator access. 
Moreover, in embodiments where the SCE logic operates on the same virtual memory as the associated physical threads, it can be used without the overhead that accompanies data copying and/or data marshaling. In addition, compared with a hardware accelerator, the shared core extension logic generally involves only a small number of open pages in order to make forward progress. Furthermore, compared with a hardware accelerator, the shared core extension logic generally tends to reduce the latency overhead of submitting a command to approximately the latency of a non-speculative core bus cycle. The SCE logic may also use a scheduling unit in hardware or other on-processor logic, rather than intervention by software (e.g., the operating system), to provide fair and distributed use among the different threads running on the cores.

上述の説明では、図及び説明の単純さのために、複数の実施形態は、共有コア拡張ロジックの単一の例（例えば、ロジック２１４、ロジック３１４等）を示し、説明した。しかしながら、いくつかの実施形態では、１つより多くの共有コア拡張ロジックがあってもよい。各共有コア拡張ロジックか、同じ複数のコアまたは異なる複数のコアのいずれかであり得るとともに、複数のコアの全てまたは複数のコアのいくつかのいずれかであり得る複数のコアに共有されてもよい。 In the description above, for simplicity of illustration and description, the embodiments have shown and described a single instance of shared core extension logic (e.g., logic 214, logic 314, etc.). In some embodiments, however, there may be more than one shared core extension logic. Each shared core extension logic may be shared by multiple cores, which may be either the same cores or different cores, and which may be either all of the cores or only some of them.

いくつかの実施形態では、複数の異なるタイプの共有コア拡張ロジック（例えば、複数の異なるタイプのデータ処理を実行する）は、複数のコアに含まれ、これらの間で共有されてもよい。他の複数の場合、同じ汎用的なタイプの共有コア拡張ロジックの複数の例は、複数のコアの全て（例えば、これらの複数のスレッド）に含まれ、その間で共有されてもよく、または各共有コア拡張ロジックは、複数のコアの合計数のサブセット（例えば、異なるサブセット）に共有されてもよい。当業者及び本開示の利益を有する者によれば理解されるように、様々な複数の構成が考慮される。 In some embodiments, a plurality of different types of shared core extension logic (eg, performing a plurality of different types of data processing) may be included in and shared among the plurality of cores. In other cases, multiple examples of the same generic type of shared core extension logic may be contained in all of the multiple cores (eg, these multiple threads) and may be shared between them, or each. Shared core extension logic may be shared by a subset of the total number of cores (eg, different subsets). Various configurations are considered, as will be appreciated by those skilled in the art and those who have the benefit of the present disclosure.

図５について説明された複数のコンポーネント、複数の特徴及び具体的な複数の詳細は、図３、４または６のものとともに用いられ得る。本明細書で説明された装置の複数の特徴及び／または複数の詳細も、装置により、及び／またはこれと共に実行される本明細書で説明された複数の方法に選択的に適用される。図３について説明された、例えば、複数のコンポーネント、複数の特徴及び具体的な複数の詳細は、図４または６のものと共に、選択的に用いられ得る。 The plurality of components, the plurality of features and the plurality of specific details described with respect to FIG. 5 may be used together with those of FIGS. 3, 4 or 6. The features and / or details of the device described herein are also selectively applied to the methods described herein by and / or with the device. For example, the plurality of components, the plurality of features, and the plurality of specific details described with respect to FIG. 3 may be selectively used together with those of FIG. 4 or 6.

［例示的な複数のコアアーキテクチャ、複数のプロセッサ及び複数のコンピュータアーキテクチャ］ [Exemplary multiple core architectures, multiple processors and multiple computer architectures]

複数のプロセッサコアは、異なる複数の態様で、異なる複数の目的で、及び複数の異なるプロセッサで、実装されてもよい。例えば、かかる複数のコアの複数の実装は、１）汎用コンピューティング向け汎用インオーダコア、２）汎用コンピューティング向け高性能汎用アウトオブオーダコア、３）主にグラフィクス及び／または科学的（スループット）コンピューティング向け特別用途コアを含んでもよい。複数の異なるプロセッサの複数の実装は、１）汎用コンピューティング向けの１つまたは複数の汎用インオーダコア及び／または汎用コンピューティング向けの１つまたは複数の汎用アウトオブオーダコアを含むＣＰＵ、ならびに２）主にグラフィクス及び／または科学的（スループット）用の１つまたは複数の特別用途コアを含むコプロセッサを含んでもよい。かかる複数の異なるプロセッサにより、１）ＣＰＵからの個別のチップ上のコプロセッサ、２）ＣＰＵと同じパッケージ内の個別のダイ上のコプロセッサ、３）ＣＰＵと同じダイ上のコプロセッサ（この場合、かかるコプロセッサは、場合により、集中画像表示及び／または科学的（スループット）ロジックのような特別用途ロジック、または特別用途コアと称される）、及び４）説明されたＣＰＵ（場合によりアプリケーションコアまたはアプリケーションプロセッサと称される）、上述のコプロセッサ及びさらなる機能性を同じダイ上に含み得るチップ上のシステムを含み得る、異なる複数のコンピュータシステムアーキテクチャが構成される。次に、例示的な複数のコアアーキテクチャが説明され、続いて、例示的な複数のプロセッサ及び複数のコンピュータアーキテクチャが説明される。 The plurality of processor cores may be implemented in different embodiments, for different purposes, and in different processors. For example, multiple implementations of such multiple cores are: 1) general purpose in-order core for general-purpose computing, 2) high-performance general-purpose out-of-order core for general-purpose computing, and 3) mainly graphics and / or scientific (throughput) computing. May include special purpose cores for. Multiple implementations of multiple different processors are 1) a CPU containing one or more general purpose in-order cores for general purpose computing and / or one or more general purpose out of order cores for general purpose computing, and 2) main. May include a coprocessor containing one or more special purpose cores for graphics and / or scientific (throughput). With such different processors, 1) a coprocessor on a separate chip from the CPU, 2) a coprocessor on a separate die in the same package as the CPU, and 3) a coprocessor on the same die as the CPU (in this case). 
Such a coprocessor is sometimes referred to as special-purpose logic, such as integrated graphics and/or scientific (throughput) logic, or as a special-purpose core. A further arrangement is 4) a system on a chip that may include, on the same die, the described CPU (sometimes referred to as the application core or application processor), the coprocessor described above, and additional functionality. These arrangements lead to different computer system architectures. Exemplary core architectures are described next, followed by descriptions of exemplary processors and computer architectures.

［例示的な複数のコアアーキテクチャ］
［インオーダ及びアウトオブオーダコアのブロック図］図８Ａは、本発明の複数の実施形態に係る例示的なインオーダパイプライン及び例示的なレジスタリネーミング、アウトオブオーダ発行／実行パイプラインの両方を示すブロック図である。図８Ｂは、本発明の複数の実施形態に係るプロセッサに含まれるべきインオーダアーキテクチャコアの例示的な実施形態及び例示的なレジスタリネーミング、アウトオブオーダ発行／実行アーキテクチャコアの両方を示すブロック図である。図面８Ａ－Ｂの複数の実線のボックスは、インオーダパイプライン及びインオーダコアを示し、選択的に追加された複数の破線のボックスは、レジスタリネーミング、アウトオブオーダ発行／実行パイプライン及びコアを示す。インオーダ態様がアウトオブオーダ態様のサブセットであるため、アウトオブオーダ態様について説明する。 [Exemplary multiple core architectures]
[Block diagram of in-order and out-of-order cores] FIG. 8A is a block diagram illustrating both an exemplary in-order pipeline and an exemplary register renaming, out-of-order issue/execution pipeline according to embodiments of the invention. FIG. 8B is a block diagram illustrating both an exemplary embodiment of an in-order architecture core and an exemplary register renaming, out-of-order issue/execution architecture core to be included in a processor according to embodiments of the invention. The solid-lined boxes in FIGS. 8A-B illustrate the in-order pipeline and in-order core, while the optionally added dashed-lined boxes illustrate the register renaming, out-of-order issue/execution pipeline and core. Because the in-order aspect is a subset of the out-of-order aspect, the out-of-order aspect is described.

図８Ａでは、プロセッサパイプライン８００は、フェッチステージ８０２、長さ復号ステージ８０４、復号ステージ８０６、アロケーションステージ８０８、リネームステージ８１０、スケジューリング（ディスパッチまたは発行としても知られる）ステージ８１２、レジスタ読み出し／メモリ読み出しステージ８１４、実行ステージ８１６、ライトバック／メモリ書き込みステージ８１８、例外ハンドリングステージ８２２及びコミットステージ８２４を含む。 In FIG. 8A, processor pipeline 800 includes fetch stage 802, length decode stage 804, decode stage 806, allocation stage 808, rename stage 810, scheduling (also known as dispatch or issue) stage 812, register read / memory read. Includes stage 814, execution stage 816, writeback / memory write stage 818, exception handling stage 822 and commit stage 824.

図８Ｂは、実行エンジンユニット８５０に連結されるフロントエンドユニット８３０を含むプロセッサコア８９０を示し、両方ともメモリユニット８７０に連結される。コア８９０は、縮小命令セットコンピュータ（ＲＩＳＣ）コア、複合命令セットコンピュータ（ＣＩＳＣ）コア、超長命令語（ＶＬＩＷ）コアもしくはハイブリッドまたは代替的なコアタイプであってもよい。さらに他のオプションとして、コア８９０は、例えば、ネットワークまたは通信コア、圧縮エンジン、コプロセッサコア、汎用演算グラフィクス処理ユニット（ＧＰＧＰＵ）コア、グラフィクスコアなどのような特別用途コアであってもよい。 FIG. 8B shows a processor core 890 that includes a front-end unit 830 coupled to an execution engine unit 850, both of which are coupled to a memory unit 870. The core 890 may be a reduced instruction set computing (RISC) core, a complex instruction set computing (CISC) core, a very long instruction word (VLIW) core, or a hybrid or alternative core type. As yet another option, the core 890 may be a special-purpose core, such as a network or communication core, a compression engine, a coprocessor core, a general-purpose computing graphics processing unit (GPGPU) core, or a graphics core.

The front-end unit 830 includes a branch prediction unit 832 coupled to an instruction cache unit 834, which is coupled to an instruction translation lookaside buffer (TLB) 836, which is coupled to an instruction fetch unit 838, which is coupled to a decode unit 840. The decode unit 840 (or decoder) may decode instructions and generate as an output one or more micro-operations, microcode entry points, microinstructions, other instructions, or other control signals, which are decoded from, otherwise reflect, or are derived from the original instructions. The decode unit 840 may be implemented using various different mechanisms. Examples of suitable mechanisms include, but are not limited to, look-up tables, hardware implementations, programmable logic arrays (PLAs), microcode read-only memories (ROMs), and the like. In one embodiment, the core 890 includes a microcode ROM or other medium that stores microcode for certain macroinstructions (e.g., in the decode unit 840 or otherwise within the front-end unit 830). The decode unit 840 is coupled to a rename/allocator unit 852 in the execution engine unit 850.

The execution engine unit 850 includes the rename/allocator unit 852 coupled to a retirement unit 854 and a set of one or more scheduler units 856. The scheduler unit 856 represents any number of different schedulers, including reservation stations, a central instruction window, and the like. The scheduler unit 856 is coupled to the physical register file unit 858. Each physical register file unit 858 represents one or more physical register files, different ones of which store one or more different data types, such as scalar integer, scalar floating point, packed integer, packed floating point, vector integer, vector floating point, and status (e.g., an instruction pointer that is the address of the next instruction to be executed). In one embodiment, the physical register file unit 858 comprises a vector register unit, a write mask register unit, and a scalar register unit. These register units may provide architectural vector registers, vector mask registers, and general-purpose registers. The physical register file unit 858 is overlapped by the retirement unit 854 to illustrate the various ways in which register renaming and out-of-order execution may be implemented (e.g., using a reorder buffer and a retirement register file; using a future file, a history buffer, and a retirement register file; using a register map and a pool of registers; etc.). The retirement unit 854 and the physical register file unit 858 are coupled to the execution cluster 860. The execution cluster 860 includes a set of one or more execution units 862 and a set of one or more memory access units 864. The execution units 862 may perform various operations (e.g., shifts, addition, subtraction, multiplication) on various types of data (e.g., scalar floating point, packed integer, packed floating point, vector integer, vector floating point). While some embodiments may include a number of execution units dedicated to specific functions or sets of functions, other embodiments may include only one execution unit or multiple execution units that all perform all functions. The scheduler unit 856, the physical register file unit 858, and the execution cluster 860 are shown as being possibly plural because certain embodiments create separate pipelines for certain types of data/operations (e.g., a scalar integer pipeline, a scalar floating point/packed integer/packed floating point/vector integer/vector floating point pipeline, and/or a memory access pipeline, each having its own scheduler unit, physical register file unit, and/or execution cluster; in the case of a separate memory access pipeline, certain embodiments are implemented in which only the execution cluster of that pipeline has the memory access unit 864). It should also be understood that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order.
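One of the renaming options named above, a register map combined with a pool of registers, can be illustrated with a minimal Python sketch. This is a toy model under stated assumptions, not the implementation of the rename/allocator unit 852: the class name `RenameTable`, its methods, and the policy of recycling the previous mapping at retirement are hypothetical choices made for illustration.

```python
from collections import deque


class RenameTable:
    """Toy renamer: architectural-to-physical register map plus a free pool.

    Hypothetical model for illustration; not taken from the patent text.
    """

    def __init__(self, arch_regs, num_physical):
        if num_physical < len(arch_regs):
            raise ValueError("need at least one physical register per architectural register")
        # Initially each architectural register maps to its own physical register.
        self.mapping = {name: i for i, name in enumerate(arch_regs)}
        # Remaining physical registers form the pool of free registers.
        self.free = deque(range(len(arch_regs), num_physical))

    def rename(self, dest, sources):
        """Rename one instruction; returns (new_dest, source_regs, old_dest)."""
        # Sources are looked up before the destination is remapped.
        src_phys = [self.mapping[s] for s in sources]
        if not self.free:
            raise RuntimeError("no free physical register; a real core would stall")
        new_phys = self.free.popleft()
        old_phys = self.mapping[dest]
        self.mapping[dest] = new_phys
        return new_phys, src_phys, old_phys

    def retire(self, old_phys):
        """Assumed policy: the previous destination mapping is recycled at retirement."""
        self.free.append(old_phys)


table = RenameTable(["r0", "r1", "r2"], num_physical=6)
first = table.rename("r1", ["r0", "r2"])   # r1 <- r0 op r2
second = table.rename("r1", ["r1"])        # r1 <- op r1, reads the renamed r1
```

Because each write to `r1` receives a fresh physical register, the second instruction depends only on the value it actually reads, which is what lets independent instructions issue out of order.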

The set of memory access units 864 is coupled to the memory unit 870, which includes a data TLB unit 872 coupled to a data cache unit 874, which is in turn coupled to a level 2 (L2) cache unit 876. In one exemplary embodiment, the memory access units 864 may include a load unit, a store address unit, and a store data unit, each of which is coupled to the data TLB unit 872 in the memory unit 870. The instruction cache unit 834 is further coupled to the level 2 (L2) cache unit 876 in the memory unit 870. The L2 cache unit 876 is coupled to one or more other levels of cache and eventually to a main memory.

By way of example, the exemplary register renaming, out-of-order issue/execution core architecture may implement the pipeline 800 as follows: 1) the instruction fetch unit 838 performs the fetch stage 802 and the length decode stage 804; 2) the decode unit 840 performs the decode stage 806; 3) the rename/allocator unit 852 performs the allocation stage 808 and the renaming stage 810; 4) the scheduler unit 856 performs the scheduling stage 812; 5) the physical register file unit 858 and the memory unit 870 perform the register read/memory read stage 814, and the execution cluster 860 performs the execute stage 816; 6) the memory unit 870 and the physical register file unit 858 perform the write back/memory write stage 818; 7) various units may be involved in the exception handling stage 822; and 8) the retirement unit 854 and the physical register file unit 858 perform the commit stage 824.
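The stage-to-unit assignment above can be written down as a small lookup table. The Python sketch below only transcribes that assignment; the string labels and the helper `stages_using` are naming choices made for this illustration, and the units involved in exception handling are left empty because the text does not enumerate them.

```python
# Stage-to-unit assignment for pipeline 800, transcribed from the description.
# The labels are plain strings chosen for this sketch.
PIPELINE_800 = [
    ("fetch 802", ["instruction fetch unit 838"]),
    ("length decode 804", ["instruction fetch unit 838"]),
    ("decode 806", ["decode unit 840"]),
    ("allocation 808", ["rename/allocator unit 852"]),
    ("renaming 810", ["rename/allocator unit 852"]),
    ("scheduling 812", ["scheduler unit 856"]),
    ("register read/memory read 814",
     ["physical register file unit 858", "memory unit 870"]),
    ("execute 816", ["execution cluster 860"]),
    ("write back/memory write 818",
     ["memory unit 870", "physical register file unit 858"]),
    ("exception handling 822", []),  # "various units"; not enumerated in the text
    ("commit 824", ["retirement unit 854", "physical register file unit 858"]),
]


def stages_using(unit):
    """Return, in pipeline order, the stages in which the given unit takes part."""
    return [stage for stage, units in PIPELINE_800 if unit in units]


for stage, units in PIPELINE_800:
    print(f"{stage:32s} {', '.join(units) or '(various units)'}")
```

Querying the table by unit makes it easy to see, for example, that the physical register file unit 858 participates in the read, write back, and commit stages.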

The core 890 may support one or more instruction sets (e.g., the x86 instruction set (with some extensions that have been added with newer versions); the MIPS instruction set of MIPS Technologies of Sunnyvale, California; the ARM instruction set (with optional additional extensions such as NEON) of ARM Holdings of Sunnyvale, California), including the instructions described herein. In one embodiment, the core 890 includes logic to support a packed data instruction set extension (e.g., AVX1, AVX2), thereby allowing the operations used by many multimedia applications to be performed using packed data.

It should be understood that the core may support multithreading (executing two or more parallel sets of operations or threads), and may do so in a variety of ways, including time-sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that the physical core is simultaneously multithreading), or a combination thereof (e.g., time-sliced fetching and decoding followed by simultaneous multithreading, as in Intel (registered trademark) Hyper-Threading Technology).

While register renaming is described in the context of out-of-order execution, it should be understood that register renaming may also be used in an in-order architecture. While the illustrated embodiment of the processor includes separate instruction and data cache units 834 and 874 and a shared L2 cache unit 876, alternative embodiments may have a single internal cache for both instructions and data, such as a level 1 (L1) internal cache, or multiple levels of internal cache. In some embodiments, the system may include a combination of an internal cache and an external cache that is external to the core and/or the processor. Alternatively, all of the cache may be external to the core and/or the processor.

[Specific Exemplary In-Order Core Architecture]
FIGS. 9A-B show a block diagram of a more specific exemplary in-order core architecture, in which the core would be one of several logic blocks (including other cores of the same type and/or different types) in a chip. The logic blocks communicate through a high-bandwidth interconnect network (e.g., a ring network) with some fixed-function logic, memory I/O interfaces, and other necessary I/O logic, depending on the application.

FIG. 9A is a block diagram of a single processor core, along with its connection to the on-die interconnect network 902 and its local subset 904 of the level 2 (L2) cache, according to embodiments of the invention. In one embodiment, an instruction decoder 900 supports the x86 instruction set with a packed data instruction set extension. An L1 cache 906 allows low-latency accesses to cache memory by the scalar and vector units. While in one embodiment (to simplify the design) a scalar unit 908 and a vector unit 910 use separate register sets (scalar registers 912 and vector registers 914, respectively) and data transferred between them is written to memory and then read back in from the level 1 (L1) cache 906, alternative embodiments of the invention may use a different approach (e.g., use a single register set, or include a communication path that allows data to be transferred between the two register files without being written and read back).

The local subset 904 of the L2 cache is part of a global L2 cache that is divided into separate local subsets, one per processor core. Each processor core has a direct access path to its own local subset 904 of the L2 cache. Data read by a processor core is stored in its L2 cache subset 904 and can be accessed quickly, in parallel with other processor cores accessing their own local L2 cache subsets. Data written by a processor core is stored in its own L2 cache subset 904 and is flushed from other subsets, if necessary. The ring network ensures coherency for shared data. The ring network is bi-directional, allowing agents such as processor cores, L2 caches, and other logic blocks to communicate with each other within the chip. Each ring data path is 1012 bits wide per direction.
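The read and write behavior of the per-core L2 subsets can be sketched as a small Python model. This is a deliberately simplified illustration: the class `SlicedL2`, the write-through update of backing memory, and the unconditional flush of other subsets on every write are assumptions made for this sketch, and the ring network and its coherency protocol are not modeled.

```python
class SlicedL2:
    """Toy model of a global L2 cache split into one local subset per core."""

    def __init__(self, num_cores):
        # One dictionary (address -> value) per processor core.
        self.subsets = [dict() for _ in range(num_cores)]

    def read(self, core, addr, memory):
        """Data read by a core is stored in that core's own subset."""
        local = self.subsets[core]
        if addr not in local:
            local[addr] = memory[addr]
        return local[addr]

    def write(self, core, addr, value, memory):
        """Data written by a core is stored locally and flushed from other subsets."""
        for other, subset in enumerate(self.subsets):
            if other != core:
                subset.pop(addr, None)  # drop any stale copy
        self.subsets[core][addr] = value
        memory[addr] = value  # assumption: write-through, to keep the sketch short


memory = {0x40: 7}
l2 = SlicedL2(num_cores=2)
l2.read(0, 0x40, memory)       # core 0 caches the line
l2.read(1, 0x40, memory)       # core 1 caches the same line
l2.write(0, 0x40, 9, memory)   # core 1's copy is flushed
print(l2.read(1, 0x40, memory))
```

After the write, core 1 no longer holds the old value and re-reads the updated one, which is the observable effect the flush is meant to guarantee.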

FIG. 9B is an expanded view of part of the processor core of FIG. 9A according to embodiments of the invention. FIG. 9B includes an L1 data cache 906A, which is part of the L1 cache 906, as well as more detail regarding the vector unit 910 and the vector registers 914. Specifically, the vector unit 910 is a 16-wide vector processing unit (VPU) (see the 16-wide ALU 928) that executes one or more of integer, single-precision floating point, and double-precision floating point instructions. The VPU supports swizzling of the register inputs with a swizzle unit 920, numeric conversion with numeric convert units 922A-B, and replication of the memory input with a replication unit 924. Write mask registers 926 allow the resulting vector writes to be predicated.
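A write mask on a 16-wide vector operation can be illustrated with a short Python sketch. The function names, the bit-per-lane mask encoding, and the merging behavior (masked-off lanes keep the old destination value) are assumptions made for illustration; the text states only that the write mask registers 926 control the resulting vector writes.

```python
VECTOR_WIDTH = 16  # the VPU described above is 16 lanes wide


def replicate(scalar):
    """Broadcast one memory value across all lanes (cf. replication unit 924)."""
    return [scalar] * VECTOR_WIDTH


def masked_add(dest, a, b, write_mask):
    """Lane-wise a + b; lane i is written only if bit i of write_mask is set.

    Assumption: lanes whose mask bit is clear keep the old value from dest.
    """
    if not all(len(v) == VECTOR_WIDTH for v in (dest, a, b)):
        raise ValueError(f"all operands must have {VECTOR_WIDTH} lanes")
    return [
        x + y if (write_mask >> lane) & 1 else old
        for lane, (old, x, y) in enumerate(zip(dest, a, b))
    ]


old = [0] * VECTOR_WIDTH
result = masked_add(old, list(range(VECTOR_WIDTH)), replicate(1), write_mask=0b0101)
print(result[:4])
```

Only lanes 0 and 2 are enabled by the mask `0b0101`, so only those two lanes of the destination change.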

[Processor with Integrated Memory Controller and Graphics]
FIG. 10 is a block diagram of a processor 1000 that may have more than one core, may have an integrated memory controller, and may have integrated graphics according to embodiments of the invention. The solid-lined boxes in FIG. 10 illustrate a processor 1000 with a single core 1002A, a system agent 1010, and a set of one or more bus controller units 1016, while the optional addition of the dashed-lined boxes illustrates an alternative processor 1000 with multiple cores 1002A-N, a set of one or more integrated memory controller units 1014 in the system agent unit 1010, and special-purpose logic 1008.

Thus, different implementations of the processor 1000 may include: 1) a CPU with the special-purpose logic 1008 being integrated graphics and/or scientific (throughput) logic (which may include one or more cores), and the cores 1002A-N being one or more general-purpose cores (e.g., general-purpose in-order cores, general-purpose out-of-order cores, or a combination of the two); 2) a coprocessor with the cores 1002A-N being a large number of special-purpose cores intended primarily for graphics and/or scientific (throughput) work; and 3) a coprocessor with the cores 1002A-N being a large number of general-purpose in-order cores. Thus, the processor 1000 may be a general-purpose processor, a coprocessor, or a special-purpose processor, such as a network or communication processor, a compression engine, a graphics processor, a GPGPU (general-purpose graphics processing unit), a high-throughput many integrated core (MIC) coprocessor (including 30 or more cores), an embedded processor, or the like. The processor may be implemented on one or more chips. The processor 1000 may be a part of and/or may be implemented on one or more substrates using any of a number of process technologies, such as BiCMOS, CMOS, or NMOS.

The memory hierarchy includes one or more levels of cache within the cores, a set of one or more shared cache units 1006, and external memory (not shown) coupled to the set of integrated memory controller units 1014. The set of shared cache units 1006 may include one or more mid-level caches, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, a last level cache (LLC), and/or combinations thereof. While in one embodiment a ring-based interconnect unit 1012 interconnects the integrated graphics logic 1008, the set of shared cache units 1006, and the system agent unit 1010/integrated memory controller units 1014, alternative embodiments may use any number of well-known techniques for interconnecting such units. In one embodiment, coherency is maintained between the one or more cache units 1006 and the cores 1002A-N.
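How a request walks such a hierarchy, from the cache levels nearest the core out to external memory, can be sketched in a few lines of Python. The function `lookup`, the level names, and the policy of filling every nearer level on a miss are assumptions made for this illustration; the text does not specify a fill or inclusion policy.

```python
def lookup(addr, levels, memory):
    """Search cache levels from nearest to farthest, then external memory.

    levels is a list of (name, dict) pairs ordered from the core outward.
    Returns (name_of_level_that_hit, value).
    """
    hit_index = len(levels)  # default: missed every cache level
    hit_name = "memory"
    for index, (name, cache) in enumerate(levels):
        if addr in cache:
            hit_index, hit_name = index, name
            break
    value = levels[hit_index][1][addr] if hit_index < len(levels) else memory[addr]
    # Assumed policy: copy the value into every level nearer than the hit.
    for _, cache in levels[:hit_index]:
        cache[addr] = value
    return hit_name, value


levels = [("L1", {}), ("L2", {}), ("LLC", {0x80: 42})]
memory = {0x80: 42, 0xC0: 5}
print(lookup(0x80, levels, memory))  # served by the last level cache
print(lookup(0x80, levels, memory))  # now served by L1
print(lookup(0xC0, levels, memory))  # misses all caches, served by memory
```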

In some embodiments, one or more of the cores 1002A-N are capable of multithreading. The system agent 1010 includes those components that coordinate and operate the cores 1002A-N. The system agent unit 1010 may include, for example, a power control unit (PCU) and a display unit. The PCU may be or may include the logic and components needed for regulating the power state of the cores 1002A-N and the integrated graphics logic 1008. The display unit is for driving one or more externally connected displays.

The cores 1002A-N may be homogeneous or heterogeneous in terms of architectural instruction set; that is, two or more of the cores 1002A-N may be capable of executing the same instruction set, while others may be capable of executing only a subset of that instruction set or a different instruction set.
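The practical consequence of heterogeneous cores is that not every core can run every piece of code. A minimal Python sketch of that check follows; the core names, the feature labels `"base"` and `"packed_data"`, and the helper functions are hypothetical and serve only to illustrate the subset relationship described above.

```python
def can_execute(core_isa, required_features):
    """A core can run the code only if it supports every required feature."""
    return set(required_features) <= set(core_isa)


def eligible_cores(cores, required_features):
    """Return the names of the cores able to execute the given code."""
    return [name for name, isa in cores.items()
            if can_execute(isa, required_features)]


# Hypothetical configuration: two cores share the full instruction set,
# while a third supports only a subset of it.
cores = {
    "core_a": {"base", "packed_data"},
    "core_b": {"base", "packed_data"},
    "core_n": {"base"},
}

print(eligible_cores(cores, {"base"}))
print(eligible_cores(cores, {"base", "packed_data"}))
```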

[Exemplary Computer Architectures]
FIGS. 11-14 are block diagrams of exemplary computer architectures. Other system designs and configurations known in the art for laptops, desktops, handheld PCs, personal digital assistants, engineering workstations, servers, network devices, network hubs, switches, embedded processors, digital signal processors (DSPs), graphics devices, video game devices, set-top boxes, microcontrollers, cell phones, portable media players, handheld devices, and various other electronic devices are also suitable. In general, a wide variety of systems or electronic devices capable of incorporating a processor and/or other execution logic as disclosed herein are suitable.

Referring now to FIG. 11, shown is a block diagram of a system 1100 in accordance with one embodiment of the present invention. The system 1100 may include one or more processors 1110, 1115, which are coupled to a controller hub 1120. In one embodiment, the controller hub 1120 includes a graphics memory controller hub (GMCH) 1190 and an input/output hub (IOH) 1150 (which may be on separate chips); the GMCH 1190 includes memory and graphics controllers to which a memory 1140 and a coprocessor 1145 are coupled; and the IOH 1150 couples input/output (I/O) devices 1160 to the GMCH 1190. Alternatively, one or both of the memory and graphics controllers are integrated within the processor (as described herein), the memory 1140 and the coprocessor 1145 are coupled directly to the processor 1110, and the controller hub 1120 is in a single chip with the IOH 1150.

The optional nature of the additional processor 1115 is denoted in FIG. 11 with broken lines. Each processor 1110, 1115 may include one or more of the processor cores described herein and may be some version of the processor 1000.

The memory 1140 may be, for example, dynamic random access memory (DRAM), phase change memory (PCM), or a combination of the two. For at least one embodiment, the controller hub 1120 communicates with the processors 1110, 1115 via a multi-drop bus such as a front-side bus (FSB), a point-to-point interface such as QuickPath Interconnect (QPI), or a similar connection 1195.

In one embodiment, the coprocessor 1145 is a special-purpose processor, such as a high-throughput MIC processor, a network or communication processor, a compression engine, a graphics processor, a GPGPU, an embedded processor, or the like. In one embodiment, the controller hub 1120 may include an integrated graphics accelerator.

There can be a variety of differences between the physical resources 1110, 1115 in terms of a spectrum of metrics of merit, including architectural, microarchitectural, thermal, and power consumption characteristics, and the like.

In one embodiment, the processor 1110 executes instructions that control data processing operations of a general type. Coprocessor instructions may be embedded within these instructions. The processor 1110 recognizes these coprocessor instructions as being of a type that should be executed by the attached coprocessor 1145. Accordingly, the processor 1110 issues these coprocessor instructions (or control signals representing the coprocessor instructions) on a coprocessor bus or other interconnect to the coprocessor 1145. The coprocessor 1145 accepts and executes the received coprocessor instructions.
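The dispatch behavior described above, in which the host processor recognizes embedded coprocessor instructions and forwards them, can be sketched in Python. Everything named here is hypothetical: the classes `HostProcessor` and `Coprocessor`, the opcode `vmul`, and the use of the first word of a text line as the opcode are illustrative assumptions, not details from the patent.

```python
from dataclasses import dataclass, field


@dataclass
class Coprocessor:
    """Stand-in for the attached coprocessor 1145: records what it executes."""
    executed: list = field(default_factory=list)

    def accept(self, instruction):
        self.executed.append(instruction)


@dataclass
class HostProcessor:
    """Stand-in for processor 1110: runs general instructions, forwards the rest."""
    coprocessor: Coprocessor
    coprocessor_opcodes: frozenset
    executed: list = field(default_factory=list)

    def run(self, program):
        for instruction in program:
            opcode = instruction.split()[0]
            if opcode in self.coprocessor_opcodes:
                # Recognized as a coprocessor instruction: issue it over the
                # coprocessor bus or other interconnect (modeled as a call).
                self.coprocessor.accept(instruction)
            else:
                self.executed.append(instruction)


coprocessor = Coprocessor()
host = HostProcessor(coprocessor, frozenset({"vmul"}))
host.run(["add r1 r2", "vmul v0 v1", "sub r3 r4"])
print(host.executed, coprocessor.executed)
```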

Referring now to FIG. 12, shown is a block diagram of a first more specific exemplary system 1200 in accordance with an embodiment of the present invention. As shown in FIG. 12, the multiprocessor system 1200 is a point-to-point interconnect system and includes a first processor 1270 and a second processor 1280 coupled via a point-to-point interconnect 1250.

Each of the processors 1270 and 1280 may be some version of the processor 1000. In one embodiment of the invention, the processors 1270 and 1280 are the processors 1110 and 1115, respectively, while the coprocessor 1238 is the coprocessor 1145. In another embodiment, the processors 1270 and 1280 are the processor 1110 and the coprocessor 1145, respectively.

The processors 1270 and 1280 are shown including integrated memory controller (IMC) units 1272 and 1282, respectively. The processor 1270 also includes, as part of its bus controller units, point-to-point (P-P) interfaces 1276 and 1278; similarly, the second processor 1280 includes P-P interfaces 1286 and 1288. The processors 1270, 1280 may exchange information via the point-to-point (P-P) interface 1250 using the P-P interface circuits 1278, 1288. As shown in FIG. 12, the IMCs 1272 and 1282 couple the processors to respective memories, namely a memory 1232 and a memory 1234, which may be portions of main memory locally attached to the respective processors.

The processors 1270, 1280 may each exchange information with a chipset 1290 via individual P-P interfaces 1252, 1254 using point-to-point interface circuits 1276, 1294, 1286, 1298. The chipset 1290 may optionally exchange information with the coprocessor 1238 via a high-performance interface 1239. In one embodiment, the coprocessor 1238 is a special-purpose processor, such as a high-throughput MIC processor, a network or communication processor, a compression engine, a graphics processor, a GPGPU, an embedded processor, or the like.

A shared cache (not shown) may be included in either processor or outside of both processors, yet connected with the processors via the P-P interconnect, such that the local cache information of either or both processors may be stored in the shared cache if a processor is placed into a low-power mode.

The chipset 1290 may be coupled to a first bus 1216 via an interface 1296. In one embodiment, the first bus 1216 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third-generation I/O interconnect bus, although the scope of the present invention is not so limited.

As shown in FIG. 12, various I/O devices 1214 may be coupled to the first bus 1216, along with a bus bridge 1218 that couples the first bus 1216 to a second bus 1220. In one embodiment, one or more additional processors 1215, such as coprocessors, high-throughput MIC processors, GPGPUs, accelerators (e.g., graphics accelerators or digital signal processing (DSP) units), field programmable gate arrays, or any other processor, are coupled to the first bus 1216. In one embodiment, the second bus 1220 may be a low pin count (LPC) bus. Various devices may be coupled to the second bus 1220 including, for example, a keyboard and/or mouse 1222, communication devices 1227, and a storage unit 1228, such as a disk drive or other mass storage device, which may include instructions/code and data 1230, in one embodiment. Further, an audio I/O 1224 may be coupled to the second bus 1220. Note that other architectures are possible. For example, instead of the point-to-point architecture of FIG. 12, a system may implement a multi-drop bus or other such architecture.

Referring now to FIG. 13, shown is a block diagram of a second, more specific exemplary system 1300 in accordance with an embodiment of the present invention. Like elements in FIGS. 12 and 13 bear like reference numerals, and certain aspects of FIG. 12 have been omitted from FIG. 13 so as not to obscure other aspects of FIG. 13.

FIG. 13 illustrates that the processors 1270, 1280 may include integrated memory and I/O control logic ("CL") 1272 and 1282, respectively. Thus, the CL 1272, 1282 include integrated memory controller units and include I/O control logic. FIG. 13 illustrates that not only are the memories 1232, 1234 coupled to the CL 1272, 1282, but I/O devices 1314 are also coupled to the control logic 1272, 1282. Legacy I/O devices 1315 are coupled to the chipset 1290.

Referring now to FIG. 14, shown is a block diagram of a SoC 1400 in accordance with an embodiment of the present invention. Similar elements in FIG. 10 bear like reference numerals. Also, dashed-line boxes are optional features on more advanced SoCs. In FIG. 14, an interconnect unit 1402 is coupled to: an application processor 1410 that includes a set of one or more cores 202A-N and a shared cache unit 1006; a system agent unit 1010; a bus controller unit 1016; an integrated memory controller unit 1014; a set of one or more coprocessors 1420 that may include integrated graphics logic, an image processor, an audio processor, and a video processor; a static random access memory (SRAM) unit 1430; a direct memory access (DMA) unit 1432; and a display unit 1440 for coupling to one or more external displays. In one embodiment, the coprocessor 1420 includes a special-purpose processor such as, for example, a network or communication processor, a compression engine, a GPGPU, a high-throughput MIC processor, an embedded processor, or the like.

Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of such implementation approaches. Embodiments of the invention may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.

Program code, such as the code 1230 illustrated in FIG. 12, may be applied to input instructions to perform the functions described herein and to generate output information. The output information may be applied to one or more output devices in a known manner. For purposes of this application, a processing system includes any system that has a processor such as, for example, a digital signal processor (DSP), a microcontroller, an application specific integrated circuit (ASIC), or a microprocessor.

The program code may be implemented in a high-level procedural or object-oriented programming language to communicate with a processing system. The program code may also be implemented in assembly or machine language, if desired. In fact, the mechanisms described herein are not limited in scope to any particular programming language. In any case, the language may be a compiled or an interpreted language.

One or more aspects of at least one embodiment may be implemented by representative instructions stored on a machine-readable medium that represent various logic within the processor and that, when read by a machine, cause the machine to fabricate logic to perform the techniques described herein. Such representations, known as "IP cores," may be stored on a tangible machine-readable medium and supplied to various customers or manufacturing facilities to be loaded into the fabrication machines that actually make the logic or processor.

Such machine-readable storage media may include, without limitation, non-transitory, tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as: hard disks; any other type of disk, including floppy (registered trademark) disks, optical disks, compact disk read-only memories (CD-ROMs), compact disk rewritables (CD-RWs), and magneto-optical disks; semiconductor devices such as read-only memories (ROMs), random access memories (RAMs) such as dynamic random access memories (DRAMs) and static random access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), and phase change memory (PCM); magnetic or optical cards; or any other type of media suitable for storing electronic instructions.

Accordingly, embodiments of the invention also include non-transitory, tangible machine-readable media that contain instructions or that contain design data, such as Hardware Description Language (HDL), which defines the structures, circuits, apparatuses, processors, and/or system features described herein. Such embodiments may also be referred to as program products.

[Emulation (including binary translation, code morphing, etc.)]
In some cases, an instruction converter may be used to convert an instruction from a source instruction set to a target instruction set. For example, the instruction converter may translate (e.g., using static binary translation, or dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction into one or more other instructions to be processed by the core. The instruction converter may be implemented in software, hardware, firmware, or a combination thereof. The instruction converter may be on the processor, off the processor, or partly on and partly off the processor.

FIG. 15 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set to binary instructions in a target instruction set, according to embodiments of the invention. In the illustrated embodiment, the instruction converter is a software instruction converter, although alternatively the instruction converter may be implemented in software, firmware, hardware, or various combinations thereof. FIG. 15 shows that a program in a high-level language 1502 may be compiled using an x86 compiler 1504 to generate x86 binary code 1506 that may be natively executed by a processor with at least one x86 instruction set core 1516. The processor with at least one x86 instruction set core 1516 represents any processor that can perform substantially the same functions as an Intel processor with at least one x86 instruction set core by compatibly executing or otherwise processing (1) a substantial portion of the instruction set of the Intel x86 instruction set core, or (2) object code versions of applications or other software targeted to run on an Intel processor with at least one x86 instruction set core, in order to achieve substantially the same result as an Intel processor with at least one x86 instruction set core. The x86 compiler 1504 represents a compiler that is operable to generate x86 binary code 1506 (e.g., object code) that can, with or without additional linkage processing, be executed on the processor with at least one x86 instruction set core 1516. Similarly, FIG. 15 shows that the program in the high-level language 1502 may be compiled using an alternative instruction set compiler 1508 to generate alternative instruction set binary code 1510 that may be natively executed by a processor without at least one x86 instruction set core 1514 (e.g., a processor with cores that execute the MIPS instruction set of MIPS Technologies of Sunnyvale, California, and/or that execute the ARM instruction set of ARM Holdings of Sunnyvale, California). The instruction converter 1512 is used to convert the x86 binary code 1506 into code that may be natively executed by the processor without an x86 instruction set core 1514. This converted code is not likely to be the same as the alternative instruction set binary code 1510, because an instruction converter capable of this is difficult to make; however, the converted code will accomplish the general operation and be made up of instructions from the alternative instruction set. Thus, the instruction converter 1512 represents software, firmware, hardware, or a combination thereof that, through emulation, simulation, or any other process, allows a processor or other electronic device that does not have an x86 instruction set processor or core to execute the x86 binary code 1506.
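The conversion step can be illustrated with a minimal sketch of static translation between two toy instruction sets. Both instruction sets here (a stack-style source set with PUSH/ADD/MUL and a register-style target set with LOADI/SPUSH/SPOP/ADDR/MULR) are hypothetical and invented for illustration; they are not x86, MIPS, or ARM, and the sketch is not the converter 1512 itself. It only shows that each source instruction may expand into one or more target instructions that accomplish the same general operation.

```python
def convert(source_program):
    """Statically translate toy source-ISA instructions into toy target-ISA ones."""
    target = []
    for opcode, *operands in source_program:
        if opcode == "PUSH":
            # One source instruction becomes two target instructions.
            target.append(("LOADI", "r0", operands[0]))
            target.append(("SPUSH", "r0"))
        elif opcode in ("ADD", "MUL"):
            # One source instruction becomes four target instructions.
            target.append(("SPOP", "r1"))
            target.append(("SPOP", "r0"))
            target.append(("ADDR" if opcode == "ADD" else "MULR", "r0", "r1"))
            target.append(("SPUSH", "r0"))
        else:
            raise ValueError(f"unsupported source instruction: {opcode}")
    return target


def run_target(target_program):
    """Execute toy target-ISA instructions on a minimal two-register machine."""
    regs = {"r0": 0, "r1": 0}
    stack = []
    for opcode, *operands in target_program:
        if opcode == "LOADI":
            regs[operands[0]] = operands[1]
        elif opcode == "SPUSH":
            stack.append(regs[operands[0]])
        elif opcode == "SPOP":
            regs[operands[0]] = stack.pop()
        elif opcode == "ADDR":
            regs[operands[0]] += regs[operands[1]]
        elif opcode == "MULR":
            regs[operands[0]] *= regs[operands[1]]
        else:
            raise ValueError(f"unsupported target instruction: {opcode}")
    return stack[-1]


# Source program computing (2 + 3) * 4 in the toy stack-style ISA.
source = [("PUSH", 2), ("PUSH", 3), ("ADD",), ("PUSH", 4), ("MUL",)]
translated = convert(source)
result = run_target(translated)
```

A dynamic binary translator would perform the same kind of mapping at run time, typically one block of code at a time, rather than over the whole program in advance.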

In the description and claims, the terms "coupled" and/or "connected," along with their derivatives, have been used. It should be understood that these terms are not intended as synonyms for each other. Rather, in particular embodiments, "connected" may be used to indicate that two or more elements are in direct physical or electrical contact with each other. "Coupled" may mean that two or more elements are in direct physical or electrical contact. However, "coupled" may also mean that two or more elements are not in direct contact with each other, but still cooperate or interact with each other. For example, an execution unit may be coupled with a register or a decoder through one or more intervening components. In the figures, arrows are used to show couplings and/or connections.

In the description and claims, the term "logic" may have been used. As used herein, the term logic may include hardware, firmware, software, or various combinations thereof. Examples of logic include integrated circuitry, application specific integrated circuits, analog circuits, digital circuits, programmed logic devices, memory devices including instructions, and the like. In some embodiments, the logic may include transistors and/or gates, potentially along with other circuitry components (e.g., embedded in semiconductor materials).

In the description above, specific details have been set forth in order to provide a thorough understanding of the embodiments. However, other embodiments may be practiced without some of these specific details. The scope of the invention is not to be determined by the specific examples provided above but only by the claims below. All equivalent relationships to those illustrated in the drawings and described in the specification are encompassed within embodiments. In other instances, well-known circuits, structures, devices, and operations have been shown in block diagram form or without detail in order to avoid obscuring the understanding of the description. Where multiple components have been shown, in some cases they may be integrated into a single component. Where a single component has been shown and described, in some cases this single component may be separated into two or more components.

Certain methods disclosed herein have been shown and described in a basic form, although operations may optionally be added to and/or removed from the methods. In addition, although a particular order of the operations may have been shown and/or described, alternative embodiments may perform certain operations in a different order, combine certain operations, overlap certain operations, and so on.

Certain operations may be performed by hardware components, and/or may be embodied in machine-executable instructions that may be used to cause and/or result in a hardware component (e.g., a processor, a portion of a processor, etc.) programmed with the instructions performing the operations. The hardware component may include a general-purpose or special-purpose hardware component. The operations may be performed by a combination of hardware, software, and/or firmware. The hardware component may include specific or particular logic (e.g., circuitry potentially combined with software and/or firmware) that is operable to execute and/or process the instruction and to perform an action in response to the instruction (e.g., in response to one or more microinstructions or other control signals derived from the instruction).

Reference throughout this specification to "one embodiment," "an embodiment," "one or more embodiments," or "some embodiments," for example, indicates that a particular feature may be included in the practice of the invention but is not necessarily required to be. Similarly, in the description, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. Thus, the claims following the Detailed Description are hereby expressly incorporated into this Detailed Description, with each claim standing on its own as a separate embodiment.

The following clauses and/or examples pertain to further embodiments. Specifics in the clauses and/or examples may be used anywhere in one or more embodiments.

In one embodiment, a first apparatus includes a plurality of cores and shared core extension logic coupled with each of the plurality of cores. The shared core extension logic has shared data processing logic that is shared by each of the plurality of cores. The first apparatus also includes instruction execution logic, for each of the cores, that calls the shared core extension logic in response to a shared core extension call instruction. The call causes the shared data processing logic to perform data processing on behalf of the corresponding core.

Embodiments include the first apparatus further including a plurality of shared core extension command registers coupled with the instruction execution logic and the shared core extension logic, where the shared core extension call instruction indicates one of the shared core extension command registers and a plurality of parameters.

Embodiments include any of the above first apparatus in which the instruction execution logic, in response to the shared core extension call instruction, stores data in the indicated shared core extension command register based on the indicated parameters.

Embodiments include any of the above first apparatus in which the instruction execution logic, in response to the shared core extension call instruction, stores in the indicated shared core extension command register: a pointer in a call attribute pointer field that points to call attribute information; a pointer in an input data operand pointer field that points to an input data operand; and a pointer in an output data operand pointer field that points to an output data operand.

Embodiments include any of the above first apparatus in which the shared core extension logic stores in the indicated shared core extension command register, based on data processing associated with the call: a status field that provides a status of the call; and a progress field that provides a progress of the call.
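The command register fields named in the preceding clauses can be pictured with a small software model. This is only a sketch under stated assumptions: the field widths, the `CallStatus` values, the helper names `sce_call` and `report_progress`, and the use of an element count as the progress measure are all hypothetical, since the text specifies only that three pointer fields, a status field, and a progress field exist.

```python
from dataclasses import dataclass
from enum import Enum


class CallStatus(Enum):
    # Hypothetical status values; the text only says a status field exists.
    FREE = "free"
    RUNNING = "running"
    COMPLETED = "completed"


@dataclass
class CommandRegister:
    """Software model of one shared core extension command register."""
    call_attribute_ptr: int = 0   # points to call attribute information
    input_operand_ptr: int = 0    # points to the input data operand
    output_operand_ptr: int = 0   # points to the output data operand
    status: CallStatus = CallStatus.FREE
    progress: int = 0


def sce_call(reg, attr_ptr, input_ptr, output_ptr):
    """Core side: store the call parameters in the indicated register."""
    reg.call_attribute_ptr = attr_ptr
    reg.input_operand_ptr = input_ptr
    reg.output_operand_ptr = output_ptr
    reg.status = CallStatus.RUNNING
    reg.progress = 0


def report_progress(reg, elements_done, elements_total):
    """Shared logic side: update the progress and status fields."""
    reg.progress = elements_done
    if elements_done >= elements_total:
        reg.status = CallStatus.COMPLETED


reg = CommandRegister()
sce_call(reg, attr_ptr=0x1000, input_ptr=0x2000, output_ptr=0x3000)
report_progress(reg, elements_done=8, elements_total=16)
halfway = (reg.status, reg.progress)
report_progress(reg, elements_done=16, elements_total=16)
```

In this model the core writes the pointer fields and the shared logic writes the status and progress fields, which mirrors the division of responsibility described in the two clauses above.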

Embodiments include any of the above first apparatus in which the shared core extension call instruction includes a macroinstruction of an instruction set of the cores.

Embodiments include any of the above first apparatus in which the shared data processing logic includes at least one vector execution unit.

Embodiments include any of the above first apparatus in which the shared data processing logic includes data processing logic that is not found in the plurality of cores.

Embodiments include any of the above first apparatus in which the instruction execution logic, in response to the shared core extension call instruction, calls the shared core extension logic to have data processing performed on at least one input data structure in memory according to a routine that produces at least one output data structure in memory.

Embodiments include any of the above first apparatus further including: a memory management unit (MMU) of a first core of the plurality of cores; a shared core extension MMU of the shared core extension logic; and a hardware interface between the MMU of the first core and the shared core extension MMU that exchanges synchronization signals in hardware to synchronize the MMU of the first core and the shared core extension MMU.

Embodiments include any of the above first apparatus further including: a memory management unit (MMU) of a first core of the plurality of cores; a shared core extension MMU of the shared core extension logic; and an interface between the MMU of the first core and the shared core extension MMU that routes a page fault corresponding to a call from the first core from the shared core extension MMU to the MMU of the first core.

Embodiments include any of the above first apparatus further including hardware scheduling logic, on die with the shared core extension logic, that schedules calls from the plurality of cores on the shared data processing logic.
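As a rough software analogy for scheduling calls from several cores onto one shared data processing resource, the sketch below queues calls in arrival order and services them one at a time. The first-in, first-out policy, the class name `SharedCoreExtension`, and the choice of element-wise vector addition as the shared operation are all assumptions made for illustration; the text does not specify a scheduling policy, and real hardware scheduling logic could use priorities or other criteria.

```python
from collections import deque


class SharedCoreExtension:
    """Toy model: one shared vector unit servicing calls from many cores."""

    def __init__(self):
        self.pending = deque()   # calls waiting for the shared logic
        self.results = {}        # core_id -> output data of its latest call

    def call(self, core_id, a, b):
        """A core submits work; the call waits until it is scheduled."""
        self.pending.append((core_id, a, b))

    def run(self):
        """Service queued calls in arrival order (assumed FIFO policy)."""
        order = []
        while self.pending:
            core_id, a, b = self.pending.popleft()
            # The shared logic performs the data processing for the core.
            self.results[core_id] = [x + y for x, y in zip(a, b)]
            order.append(core_id)
        return order


sce = SharedCoreExtension()
sce.call(1, [1, 2], [3, 4])   # core 1 calls first
sce.call(0, [10], [5])        # core 0 calls second
service_order = sce.run()
```

Because the vector unit exists once rather than once per core, calls that arrive while it is busy simply wait in the queue, which is the trade-off the shared design accepts.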

In one embodiment, a first method includes receiving, within a core of a processor having a plurality of cores, a shared core extension call instruction. The shared core extension call instruction causes the core to call shared core extension logic that is shared by the plurality of cores. The call is to have data processing performed. The shared core extension call instruction indicates a shared core extension command register and indicates a plurality of parameters that specify the data processing to be performed. The shared core extension logic is called, in response to the shared core extension call instruction, so that the data processing is performed. Calling the shared core extension logic includes storing data in the shared core extension command register indicated by the instruction based on the plurality of parameters indicated by the instruction.

Embodiments include the first method in which receiving the instruction includes receiving a non-blocking shared core extension call instruction, and further including retiring the non-blocking shared core extension call instruction at the core after the shared core extension logic has accepted the data processing to be performed.

Embodiments include the first method in which receiving the instruction includes receiving a blocking shared core extension call instruction, and further including retiring the blocking shared core extension call instruction at the core after the shared core extension logic has completed the data processing.

Embodiments include the first method in which receiving the instruction includes receiving a blocking shared core extension call instruction, where the blocking shared core extension call instruction indicates a timeout value for releasing the indicated shared core extension command register.
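The difference between the non-blocking and blocking variants comes down to when the call instruction retires at the core. The following sketch expresses that timing in abstract "ticks". The function name `retire_tick`, the tick model, and the interpretation that an expired timeout releases the command register before the work finishes are assumptions for illustration only; the text states only that the blocking form may indicate a timeout value for releasing the register.

```python
def retire_tick(kind, work_ticks, timeout=None):
    """Return (tick at which the call instruction retires, outcome).

    kind        -- "non-blocking" or "blocking"
    work_ticks  -- ticks the shared logic needs to finish the data processing
    timeout     -- optional tick limit for a blocking call (assumed semantics)
    """
    if kind == "non-blocking":
        # Retires as soon as the shared logic has accepted the work.
        return 0, "accepted"
    if kind != "blocking":
        raise ValueError(f"unknown call kind: {kind}")
    if timeout is not None and work_ticks > timeout:
        # Assumed behavior: the timeout expires and the register is released.
        return timeout, "timed out, register released"
    # Retires only after the shared logic has completed the work.
    return work_ticks, "completed"


non_blocking = retire_tick("non-blocking", work_ticks=100)
blocking = retire_tick("blocking", work_ticks=100)
blocking_with_timeout = retire_tick("blocking", work_ticks=100, timeout=40)
```

With a non-blocking call the core can continue with other instructions while the shared logic works, and would later check for completion; with a blocking call the core waits, and a timeout bounds that wait.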

Embodiments include any of the above first methods in which the shared core extension call instruction includes a macroinstruction of an instruction set of the core, and in which the shared core extension command register comprises an architectural register.

Embodiments include any of the above first methods in which storing the data in the indicated shared core extension command register based on the parameters includes: storing a pointer in a call attribute pointer field to point to call attribute information; storing a pointer in an input data operand pointer field to point to an input data operand; and storing a pointer in an output data operand pointer field to point to an output data operand.

Embodiments include any of the above first methods further including the shared core extension logic storing data in the indicated shared core extension command register based on data processing associated with the call, where storing the data includes: storing a status in a status field of the indicated register to provide a status of the call; and storing a progress in a progress field of the indicated register to provide a progress of the call.

Embodiments include any of the above first methods in which calling includes calling the shared core extension logic to have data processing performed on at least one input data structure in memory according to a routine that produces at least one output data structure in memory.

Embodiments include any of the above first methods further including synchronizing a memory management unit (MMU) of the core and a shared core extension MMU of the shared core extension logic by exchanging synchronization signals in hardware between the MMU and the shared core extension MMU.

Embodiments include any of the above first methods further including routing a page fault corresponding to the call from a shared core extension memory management unit (MMU) to an MMU of the core.
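The page-fault routing described above can be sketched as follows: the shared core extension MMU does not resolve a fault itself but hands it to the MMU of the core that made the call, then retries the translation. The class names, the 4096-byte page size, the "next free frame" allocation policy, and the direct sharing of the core's page table are all simplifying assumptions made for this illustration, not details given in the text.

```python
PAGE_SIZE = 4096  # assumed page size for the sketch


class CoreMMU:
    """Toy per-core MMU that owns the page table and handles faults."""

    def __init__(self):
        self.page_table = {}      # virtual page number -> physical frame
        self.faults_handled = 0

    def handle_page_fault(self, vpage):
        # Hypothetical policy: map the page to the next unused frame.
        self.page_table[vpage] = len(self.page_table)
        self.faults_handled += 1


class SharedCoreExtensionMMU:
    """Toy shared MMU that routes faults back to the calling core's MMU."""

    def __init__(self, core_mmus):
        self.core_mmus = core_mmus   # core_id -> CoreMMU

    def translate(self, core_id, vaddr):
        mmu = self.core_mmus[core_id]
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        if vpage not in mmu.page_table:
            # Route the fault to the MMU of the core that issued the call.
            mmu.handle_page_fault(vpage)
        return mmu.page_table[vpage] * PAGE_SIZE + offset


core0 = CoreMMU()
sce_mmu = SharedCoreExtensionMMU({0: core0})
first = sce_mmu.translate(0, 5 * PAGE_SIZE + 12)   # faults, then resolves
second = sce_mmu.translate(0, 5 * PAGE_SIZE + 20)  # same page, no new fault
third = sce_mmu.translate(0, 7 * PAGE_SIZE)        # new page, faults again
```

Routing the fault to the originating core lets that core's existing fault handling service it, so the shared logic does not need its own copy of that machinery.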

Embodiments include any of the above first methods further including, before receiving the shared core extension call instruction: receiving a shared core extension abort instruction that indicates the shared core extension command register; and, in response to the shared core extension abort instruction, stopping the data processing that corresponds to the shared core extension command register indicated by the shared core extension abort instruction and releasing the shared core extension command register.

Embodiments include any of the first methods described above, further including, after receiving the shared core extension call instruction: receiving a shared core extension read instruction that indicates the shared core extension command register; and, in response to the shared core extension read instruction, reading a data processing completion status from the shared core extension command register indicated by the shared core extension read instruction.
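
The interplay of the call, stop, and read instructions on a command register can be sketched as a small state model. Everything below is an assumption made for illustration: the class and method names, the number of registers, and the `finish` helper that stands in for the shared logic completing its work. In particular, the model keeps a register owned until a stop releases it, which is a simplification rather than behavior stated in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class CommandRegister:
    busy: bool = False               # a call currently owns this register
    done: bool = False               # data processing completion status
    params: Optional[Tuple] = None   # parameters supplied by the call


@dataclass
class SharedCoreExtensionModel:
    """Toy software model; names and register count are assumptions."""
    num_registers: int = 4
    regs: Dict[int, CommandRegister] = field(init=False)

    def __post_init__(self) -> None:
        self.regs = {i: CommandRegister() for i in range(self.num_registers)}

    def call(self, reg: int, *params) -> None:
        """Model of the call instruction: claim a register and start work."""
        r = self.regs[reg]
        if r.busy:
            raise RuntimeError(f"command register {reg} is still in use")
        r.busy, r.done, r.params = True, False, params

    def finish(self, reg: int) -> None:
        """Stand-in for the shared logic completing the requested work."""
        if self.regs[reg].busy:
            self.regs[reg].done = True

    def read(self, reg: int) -> bool:
        """Model of the read instruction: return the completion status."""
        return self.regs[reg].done

    def stop(self, reg: int) -> None:
        """Model of the stop instruction: cancel work, release the register."""
        self.regs[reg] = CommandRegister()
```

A core would issue `stop` on a register before reusing it for a new `call`, and issue `read` after a `call` to check whether the work has completed.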

In one embodiment, a machine-readable storage medium stores one or more instructions that, when executed by a machine, cause the machine to perform any of the first methods described above.

In one embodiment, an apparatus is configured or operable to perform any of the first methods described above.

Embodiments include a first system including a processor and a dynamic random access memory (DRAM) coupled to the processor. The processor includes a plurality of cores and shared core extension logic coupled to each of the cores. The shared core extension logic has shared data processing logic that is shared by each of the cores. The processor further includes, for each of the cores, instruction execution logic that calls the shared core extension logic in response to a shared core extension call instruction. The call causes the shared data processing logic to perform data processing on behalf of the corresponding core.

Embodiments include the first system in which the shared core extension call instruction includes a macroinstruction of an instruction set of the cores.

Embodiments include any of the first systems described above, further including a plurality of architectural shared core extension command registers coupled to the instruction execution logic and the shared core extension logic, in which the shared core extension call instruction indicates one of the shared core extension command registers and a plurality of parameters.

Claims

A processor comprising:
a plurality of digital signal processor (DSP) cores;
a shared vector processing circuit connected to the plurality of DSP cores and shared by the plurality of DSP cores; and
a memory management circuit for translating a virtual address, in a virtual address space shared by the plurality of DSP cores and the shared vector processing circuit, into a physical address of a system memory,
wherein each DSP core has:
a decoder for decoding instructions of a first thread and a second thread;
a core execution circuit for executing one or more instructions of the first thread and the second thread; and
a core register file for storing context data for the first thread and context data for the second thread, and
wherein the shared vector processing circuit has:
a first plurality of registers for storing a copy of the context data for the first thread;
a second plurality of registers for storing a copy of the context data for the second thread; and
a single instruction, multiple data (SIMD) execution circuit for executing processing for the first thread and the second thread.

The processor according to claim 1, wherein the memory management circuit has a translation lookaside buffer (TLB) for caching a translation from a virtual address to a physical address.

The processor according to claim 1 or 2, wherein the DSP core stores, in the core register file, the context data for the first thread as part of a context of the first thread and the context data for the second thread as part of a context of the second thread.

The processor according to any one of claims 1 to 3, wherein the processing performed by the SIMD execution circuit is based on an extension to an instruction set architecture (ISA).

The processor according to any one of claims 1 to 4, wherein the first plurality of registers and the second plurality of registers are architecturally visible registers of an instruction set architecture (ISA).

The processor according to any one of claims 1 to 5, wherein the shared vector processing circuit has a histogram circuit for performing a histogram calculation.

The processor according to any one of claims 1 to 6, wherein the shared vector processing circuit has a matrix multiplication circuit.

The processor according to any one of claims 1 to 7, wherein the shared vector processing circuit has a sum of absolute differences circuit.

The processor according to any one of claims 1 to 8, further comprising a first interconnect for connecting the plurality of DSP cores to the shared vector processing circuit.

The processor of claim 9, further comprising a second interconnect for connecting the shared vector processing circuit to a memory subsystem that includes one or more levels of shared memory.

A system comprising:
a system memory for storing instructions and data; and
the processor according to any one of claims 1 to 10, which is connected to the system memory.

The system of claim 11, further comprising a storage device connected to the processor to store instructions and data.

The system of claim 11 or 12, further comprising an input/output (I/O) interconnect for connecting the processor to one or more I/O devices.

The system according to any one of claims 11 to 13, wherein the system memory has a dynamic random access memory (DRAM).

The system according to any one of claims 11 to 14, further comprising a graphics processor connected to the processor to perform a graphics processing operation.

The system according to any one of claims 11 to 15, further comprising a network processor connected to the processor.

The system according to any one of claims 11 to 16, further comprising an audio input/output device connected to the processor.

The system according to any one of claims 11 to 17, further comprising a second processor connected to the DSP cores of the processor, wherein the processor is a first processor, and the first processor recognizes a type of instruction to be processed by the second processor.

The system of claim 18, further comprising a shared cache shared by the first processor and the second processor.

The system according to any one of claims 11 to 19, further comprising a compression circuit connected to the processor.