JP5372929B2

JP5372929B2 - Multi-core processor with hierarchical microcode store

Info

Publication number: JP5372929B2
Application number: JP2010517025A
Authority: JP
Inventors: Gene W. Shen; Bruce R. Holloway; Sean Lie; Michael G. Butler
Original assignee: Advanced Micro Devices Inc
Current assignee: Advanced Micro Devices Inc
Priority date: 2007-07-18
Filing date: 2008-07-18
Publication date: 2013-12-18
Anticipated expiration: 2028-07-18
Also published as: KR101493017B1; TW200912738A; US20090024836A1; KR20100063024A; CN101855614A; EP2171574B1; JP2010533920A; US7743232B2; WO2009011913A1; EP2171574A1; TWI433032B

Abstract

A multiple-core processor having a hierarchical microcode store. A processor may include multiple processor cores, each configured to independently execute instructions defined according to a programmer-visible instruction set architecture (ISA). Each core may include a respective local microcode unit configured to store microcode entries. The processor may also include a remote microcode unit accessible by each of the processor cores. Any given one of the processor cores may be configured to generate a given microcode entrypoint corresponding to a particular microcode entry including one or more operations to be executed by the given processor core, and to determine whether the particular microcode entry is stored within the respective local microcode unit of the given core. In response to determining that the particular microcode entry is not stored within the respective local microcode unit, the given core may convey a request for the particular microcode entry to the remote microcode unit.

Description

The present invention relates to processors and, more particularly, to microcode implementation within a processor that includes multiple processor cores.

As processor implementation has advanced, it has become increasingly common to attempt to increase processor performance by replicating individual processor cores within a processor. Such cores may be capable of independent instruction execution, increasing the coarse-grained parallelism available for application execution at lower cost and lower design complexity than may be required by alternative strategies for increasing performance, such as increasing fine-grained parallelism or execution frequency.

However, multiple-core processor implementations present certain design challenges of their own. Certain processor core resources, such as storage resources for microcode routines, may be included in critical timing paths, so that the proximity of the resource to a given core may directly affect the operating frequency of that core. Consequently, sharing a single instance of such a storage resource among multiple cores may increase the latency of access to the shared instance and degrade core performance. On the other hand, replicating the storage resource so that each core includes its own instance may be costly in terms of design area, power, and/or other design figures of merit.

Various embodiments of a multiple-core processor having a hierarchical microcode store are disclosed. According to one embodiment, a processor may include a plurality of processor cores, each configured to independently execute instructions defined according to a programmer-visible instruction set architecture (ISA). Each of the processor cores may include a respective local microcode unit configured to store microcode entries. The processor may also include a remote microcode unit that is accessible by each of the processor cores and that includes a remote microcode store configured to store microcode entries. Any given one of the processor cores may be configured to generate a given microcode entrypoint corresponding to a particular microcode entry that includes one or more operations to be executed by the given processor core, and to determine whether the particular microcode entry is stored within the respective local microcode unit of the given processor core. In response to determining that the particular microcode entry is not stored within the respective local microcode unit, the given core may convey a request for the particular microcode entry to the remote microcode unit.
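The local-then-remote lookup summarized above can be sketched in software as follows. This is only an illustrative model under stated assumptions: the class names, the dictionary-based stores, the entrypoint values, and the operation names are all hypothetical and are not taken from the disclosure, which describes hardware units.

```python
from dataclasses import dataclass, field


@dataclass
class RemoteMicrocodeUnit:
    """Shared store reachable by every core (contents are hypothetical)."""
    store: dict = field(default_factory=dict)
    requests: int = 0  # counts requests conveyed by cores

    def fetch(self, entrypoint: int) -> list:
        self.requests += 1
        return self.store[entrypoint]


@dataclass
class Core:
    """One core with its own local microcode unit."""
    local: dict
    remote: RemoteMicrocodeUnit

    def microcode_entry(self, entrypoint: int) -> list:
        # Check the local unit first; only on a miss ask the remote unit.
        if entrypoint in self.local:
            return self.local[entrypoint]
        return self.remote.fetch(entrypoint)


# Hypothetical entrypoints and operation names, for illustration only.
remote = RemoteMicrocodeUnit(store={0x200: ["test_privilege", "write_control_reg"]})
cores = [Core(local={0x010: ["shift", "or"]}, remote=remote) for _ in range(2)]
```

In this sketch, entries expected to be used often would sit in each core's local dictionary, while the single remote instance holds the rest, mirroring the latency-versus-area trade-off described in the background.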

While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and are described herein in detail. It should be understood, however, that the drawings and detailed description are not intended to limit the invention to the particular form disclosed; on the contrary, the invention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention as defined by the appended claims.

FIG. 1 is a block diagram illustrating one embodiment of a processor core.
FIG. 2 is a block diagram illustrating one embodiment of a processor that includes multiple processor cores.
FIG. 3 is a block diagram illustrating the organization of one embodiment of a microcode control store.
FIG. 4 is a block diagram illustrating one embodiment of a hierarchical microcode store that includes local and remote microcode units.
FIG. 5 is a block diagram illustrating one embodiment of a local microcode unit.
FIG. 6 is a block diagram illustrating one embodiment of a remote microcode unit.
FIG. 7 is a flow diagram illustrating one embodiment of a method of retrieving microcode entries in a processor having a hierarchical microcode store.
FIG. 8 is a block diagram illustrating one embodiment of an exemplary computer system.

Overview of Processor Core

FIG. 1 illustrates one embodiment of a processor core 100. Generally speaking, core 100 may be configured to execute instructions that may be stored in a system memory directly or indirectly coupled to core 100. Such instructions may be defined according to a particular instruction set architecture (ISA). For example, core 100 may be configured to implement a version of the x86 ISA, although in other embodiments core 100 may implement a different ISA or a combination of ISAs.

In the illustrated embodiment, core 100 may include an instruction cache (IC) 110 coupled to an instruction fetch unit (IFU) 120. IFU 120 may be coupled to a branch prediction unit (BPU) 130 and to an instruction decode unit (DEC) 140. Decode unit 140 may be coupled to provide operations to a plurality of integer execution clusters 150a-b as well as to a floating point unit (FPU) 160. Each of clusters 150a-b may include a respective cluster scheduler 152a-b coupled to a respective plurality of integer execution units 154a-b. In various embodiments, core 100 may include data caches (not shown) either within or shared by clusters 150a-b. In the illustrated embodiment, FPU 160 may be coupled to receive operations from an FP scheduler 162. Clusters 150a-b, FPU 160, and instruction cache 110 may additionally be coupled to a core interface unit (CIU) 170, which may in turn be coupled to an L2 cache 180 as well as to a system interface unit (SIU) that is external to core 100 (shown in FIG. 2 and described below). It is noted that although FIG. 1 reflects certain instruction and data flow paths among the various units, additional paths or directions for data or instruction flow not specifically shown in FIG. 1 may be provided. Additional configurations of core 100 are also possible, for example configurations that vary the number of instances of clusters 150, FPU 160, and L2 cache 180, or that vary how such units interact with one another.

As described in greater detail below, core 100 may be configured for multithreaded execution, in which instructions from distinct threads of execution may execute concurrently. In various embodiments, it is contemplated that differing numbers of threads may be supported for concurrent execution, and that differing numbers of clusters 150 and FPUs 160 may be provided. Additionally, other shared or multithreaded units may be added, including specialized media processors or other types of accelerators.

Instruction cache 110 may be configured to store instructions before they are fetched, decoded, and issued for execution. In various embodiments, instruction cache 110 may be configured as a direct-mapped, set-associative, or fully associative cache of any suitable size and/or degree of associativity. Instruction cache 110 may be physically addressed, virtually addressed, or a combination of the two. In some embodiments, instruction cache 110 may also include translation lookaside buffer (TLB) logic configured to cache virtual-to-physical translations for instruction fetch addresses, although TLB and translation logic may be included elsewhere within core 100.
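As a rough illustration of how the cache organizations named above differ, the following sketch splits a fetch address into tag, set index, and line offset. The function name, the line size, and the set counts are assumptions chosen for the example; the disclosure does not specify any particular geometry.

```python
def split_address(addr: int, line_bytes: int, num_sets: int) -> tuple:
    """Return (tag, set_index, offset) for a cache with the given geometry.

    A direct-mapped cache has one line per set, a fully associative cache
    has a single set (num_sets == 1), and a set-associative cache lies in
    between. The geometry values used below are hypothetical.
    """
    offset = addr % line_bytes                    # byte within the cache line
    set_index = (addr // line_bytes) % num_sets   # which set to search
    tag = addr // (line_bytes * num_sets)         # compared against stored tags
    return tag, set_index, offset


# 64-byte lines, 128 sets (for example, a hypothetical 2-way 16 KB cache).
example = split_address(0x12345, line_bytes=64, num_sets=128)
# The same address in a fully associative cache: every line shares set 0.
fully_assoc = split_address(0x12345, line_bytes=64, num_sets=1)
```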

Instruction fetch accesses to instruction cache 110 may be coordinated by IFU 120. For example, IFU 120 may track the current program counter status of the various executing threads and may issue fetches to instruction cache 110 in order to retrieve additional instructions for execution. In the case of an instruction cache miss, either instruction cache 110 or IFU 120 may coordinate the retrieval of instruction data from L2 cache 180. In some embodiments, IFU 120 may also coordinate prefetching of instructions from other levels of the memory hierarchy in advance of their expected use, in order to mitigate the effects of memory latency. For example, successful instruction prefetching may increase the likelihood that instructions are present in instruction cache 110 when they are needed, thereby avoiding the latency effects of cache misses at possibly multiple levels of the memory hierarchy.

Various types of branches (e.g., conditional or unconditional jumps, call/return instructions, etc.) may alter the flow of execution of a particular thread. Branch prediction unit 130 may generally be configured to predict future fetch addresses for use by IFU 120. In some embodiments, BPU 130 may include any suitable structure configured to store information about branch instructions. For example, in some embodiments BPU 130 may include one or more different types of predictors (e.g., local, global, or hybrid predictors) configured to predict the outcome of conditional branches.
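One common textbook form of local predictor is a table of 2-bit saturating counters indexed by branch address. The sketch below illustrates that general idea only; the disclosure does not state which predictor BPU 130 uses, and the class name, table size, and initial counter value are assumptions.

```python
class TwoBitPredictor:
    """Table of 2-bit saturating counters (a textbook scheme, assumed here)."""

    def __init__(self, entries: int = 16):
        # Counter values 0-1 predict not taken, 2-3 predict taken.
        self.counters = [1] * entries

    def predict(self, pc: int) -> bool:
        return self.counters[pc % len(self.counters)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = pc % len(self.counters)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


predictor = TwoBitPredictor()
history = []
for outcome in [True, True, False, False]:
    history.append(predictor.predict(0x40))  # prediction made before the outcome is known
    predictor.update(0x40, outcome)
```

Because each counter saturates, a single contrary outcome after two taken branches does not immediately flip the prediction.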

As a result of fetching, IFU 120 may be configured to produce sequences of instruction bytes, which may also be referred to as fetch packets. A fetch packet may be any suitable number of bytes in length. In some embodiments, particularly for ISAs that implement variable-length instructions, a variable number of valid instructions may be aligned on arbitrary boundaries within a given fetch packet, and in some instances an instruction may span different fetch packets. Generally speaking, decode unit 140 may be configured to identify instruction boundaries within fetch packets, to decode or otherwise transform instructions into operations suitable for execution by clusters 150 or FPU 160, and to dispatch such operations for execution.

In one embodiment, DEC 140 may be configured to first determine the lengths of possible instructions within a given window of bytes drawn from one or more fetch packets. For example, for an x86-compatible ISA, DEC 140 may be configured to identify valid sequences of prefix, opcode, "mod/rm", and "SIB" bytes beginning at each byte position within the given fetch packet. Pick logic within DEC 140 may then be configured, in one embodiment, to identify the boundaries of multiple valid instructions within the window. In one embodiment, multiple fetch packets and multiple groups of instruction pointers identifying instruction boundaries may be queued within DEC 140.
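The two-step approach described above (estimate a length at every byte position, then pick the real boundaries) can be sketched as follows. This is a simplified model under stated assumptions: the function name is hypothetical, and the per-byte lengths are supplied as an input list rather than derived from actual x86 prefix, opcode, mod/rm, and SIB decoding.

```python
def pick_boundaries(lengths: list, start: int = 0) -> list:
    """Return start offsets of instructions wholly contained in the window.

    lengths[i] is the length an instruction would have if one began at
    byte i of the window (0 means no valid instruction starts there).
    These values are assumed inputs, not real x86 length decoding.
    """
    starts = []
    pos = start
    while (pos < len(lengths) and lengths[pos] > 0
           and pos + lengths[pos] <= len(lengths)):
        starts.append(pos)
        pos += lengths[pos]  # the next instruction begins right after this one
    return starts


# An 8-byte window; the instruction starting at byte 7 would spill into the
# next fetch packet, so it is not picked from this window.
window_lengths = [1, 3, 2, 1, 2, 5, 1, 4]
picked = pick_boundaries(window_lengths)
```

Computing a candidate length at every byte position allows the length estimates to be produced in parallel, leaving only the chained selection step to the pick logic.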

Instructions may then be steered from fetch packet storage into one of several instruction decoders within DEC 140. In one embodiment, DEC 140 may be configured to provide as many independent instruction decoders as there are instructions to be dispatched per execution cycle, although other configurations are possible and contemplated. In embodiments where core 100 supports microcoded instructions, each instruction decoder may be configured to determine whether a given instruction is microcoded and, if so, may invoke the operation of a microcode engine to convert the instruction into a sequence of operations. Otherwise, the instruction decoder may convert the instruction into one operation (or possibly several operations, in some embodiments) suitable for execution by clusters 150 or FPU 160. The resulting operations may also be referred to as micro-operations, micro-ops, or uops, and may be stored within one or more queues to await dispatch for execution. In some embodiments, microcode operations and non-microcode (or "fastpath") operations may be stored in separate queues. Embodiments of microcode implementation within core 100 are described in greater detail below.
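The decoder's routing decision described above can be modeled as follows. The table of microcoded instructions, the operation names, and the class name are hypothetical; the sketch only illustrates the split between a microcode queue and a fastpath queue.

```python
from collections import deque

# Hypothetical table of microcoded instructions and their operation sequences.
MICROCODE_ROUTINES = {"RCL": ["extract_top_bit", "shift_left", "merge_carry"]}


class Decoder:
    """Routes each instruction to the microcode queue or the fastpath queue."""

    def __init__(self):
        self.fast_path_queue = deque()
        self.microcode_queue = deque()

    def decode(self, mnemonic: str) -> None:
        routine = MICROCODE_ROUTINES.get(mnemonic)
        if routine is not None:
            # Microcoded: expand into the routine's sequence of operations.
            self.microcode_queue.extend(routine)
        else:
            # Fastpath: the instruction maps to a single operation here.
            self.fast_path_queue.append(mnemonic.lower())


decoder = Decoder()
for mnemonic in ["ADD", "RCL", "SUB"]:
    decoder.decode(mnemonic)
```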

Dispatch logic within DEC 140 may be configured to examine the state of queued operations awaiting dispatch, in conjunction with the state of execution resources and dispatch rules, in order to attempt to assemble dispatch parcels. For example, DEC 140 may take into account the availability of operations queued for dispatch, the number of operations queued and awaiting execution within clusters 150 and/or FPU 160, and any resource constraints that may apply to the operations to be dispatched. In one embodiment, DEC 140 may be configured to dispatch a parcel of operations to one of clusters 150 or FPU 160 during a given execution cycle.

In one embodiment, DEC 140 may be configured to decode and dispatch operations for only one thread during a given execution cycle. It is noted, however, that IFU 120 and DEC 140 need not operate on the same thread concurrently. Various types of thread-switching policies are contemplated for use during instruction fetch and decode. For example, IFU 120 and DEC 140 may be configured to select a different thread for processing every N cycles in a round-robin fashion (where N may be as few as 1). Alternatively, thread switching may be influenced by dynamic conditions that occur during operation.
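The round-robin policy mentioned above reduces to simple arithmetic on the cycle count. The following is a minimal sketch; the function name and the thread counts are assumptions, and dynamic thread-switching policies are not modeled.

```python
def thread_for_cycle(cycle: int, num_threads: int, n: int = 1) -> int:
    """Thread selected in a given cycle when switching round-robin every n cycles."""
    return (cycle // n) % num_threads


# Two threads, switching every cycle (N = 1).
every_cycle = [thread_for_cycle(c, num_threads=2, n=1) for c in range(4)]
# Three threads, switching every two cycles (N = 2).
every_two = [thread_for_cycle(c, num_threads=3, n=2) for c in range(8)]
```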

Generally speaking, clusters 150 may be configured to implement integer arithmetic and logic operations as well as to perform load/store operations. In one embodiment, each of clusters 150a-b may be dedicated to the execution of operations for a respective thread. Each cluster 150 may include its own scheduler 152, which may be configured to manage the issuance for execution of operations previously dispatched to the cluster. Each cluster 150 may further include its own copy of the integer physical register file as well as its own completion logic (e.g., a reorder buffer or other structure for managing operation completion and retirement).

Within each cluster 150, execution units 154 may support the concurrent execution of various different types of operations. For example, in one embodiment execution units 154 may support concurrent load/store address generation (AGU) operations and arithmetic/logic (ALU) operations. Execution units 154 may support additional operations such as integer multiply and divide, although in various embodiments clusters 150 may implement scheduling restrictions on the throughput and concurrency of such additional operations with other ALU/AGU operations. In various embodiments, clusters 150 may include or share data caches that may be organized differently from instruction cache 110.

FPU 160 may include an FP scheduler 162 that, like cluster schedulers 152, may be configured to receive, queue, and issue operations for execution within FP execution units 164. FPU 160 may also include a floating-point physical register file configured to manage floating-point operands. FP execution units 164 may be configured to implement various types of floating-point operations, such as may be defined by the ISA. In various embodiments, FPU 160 may support the concurrent execution of certain different types of floating-point operations and may also support different degrees of precision (e.g., 64-bit operands, 128-bit operands, etc.). In various embodiments, FPU 160 may include a data cache or may be configured to access data caches located in other units.

Instruction cache 110 and data caches 156 may be configured to access L2 cache 180 via core interface unit 170. In one embodiment, CIU 170 may provide a general interface between core 100 and other cores 100 within a system, as well as to external system memory, peripherals, and the like. Typically, L2 cache 180 is substantially larger in capacity than the first-level instruction and data caches.

In some embodiments, core 100 may support in-order operation. In other embodiments, core 100 may support out-of-order execution of operations, including load and store operations. That is, the order of execution of operations within clusters 150 and FPU 160 may differ from the original program order of the instructions to which the operations correspond. Such relaxed execution ordering may facilitate more efficient scheduling of execution resources, which may improve overall execution performance.

Additionally, core 100 may implement a variety of control and data speculation techniques. As described above, core 100 may implement various branch prediction and speculative prefetch techniques in order to attempt to predict the direction in which the flow of execution control of a thread will proceed. Such control speculation techniques may generally attempt to provide a consistent flow of instructions before it is known for certain whether the instructions will be usable or whether a misspeculation has occurred (e.g., due to a branch misprediction). If control misspeculation occurs, core 100 may be configured to discard operations and data along the misspeculated path and to redirect execution control to the correct path. For example, in one embodiment clusters 150 may be configured to execute conditional branch instructions and to determine whether the branch outcome agrees with the predicted outcome. If not, clusters 150 may be configured to redirect IFU 120 to begin fetching along the correct path.

Separately, core 100 may implement various data speculation techniques that attempt to provide a data value for use in further execution before it is known whether the value is correct. Using such data speculatively may relax the timing constraints that would otherwise require all conditions bearing on the validity of the data to be evaluated before the data is used, enabling faster core operation.

In various embodiments, a processor implementation may include multiple instances of core 100 fabricated as part of a single monolithic integrated circuit chip along with other structures. One such embodiment of a processor is illustrated in FIG. 2. As shown, processor 200 includes four instances of core 100a-d, each of which may be configured as described above. In the illustrated embodiment, each of cores 100 may be coupled to an L3 cache 220 and to a memory controller/peripheral interface unit (MCU) 230 via a system interface unit (SIU) 210. In one embodiment, L3 cache 220 may be configured as a unified cache, implemented using any suitable organization, that operates as an intermediate cache between the L2 caches 180 of cores 100 and relatively slow system memory 240.

MCU 230 may be configured to interface processor 200 directly with system memory 240. For example, MCU 230 may be configured to generate the signals necessary to support one or more different types of random access memory (RAM), such as Dual Data Rate Synchronous Dynamic RAM (DDR SDRAM), DDR-2 SDRAM, Fully Buffered Dual Inline Memory Modules (FB-DIMM), or another suitable type of memory that may be used to implement system memory 240. System memory 240 may be configured to store instructions and data that may be operated on by the various cores 100 of processor 200, and the contents of system memory 240 may be cached by various ones of the caches described above.

Additionally, MCU 230 may support other types of interfaces to processor 200. For example, MCU 230 may implement a dedicated graphics processor interface, such as a version of the Accelerated/Advanced Graphics Port (AGP) interface, which may be used to interface processor 200 to a graphics-processing subsystem that may include a separate graphics processor, graphics memory, and/or other components. MCU 230 may also be configured to implement one or more types of peripheral interfaces, e.g., a version of the PCI-Express bus standard, through which processor 200 may interface with peripherals such as storage devices, graphics devices, networking devices, and the like. In some embodiments, a secondary bus bridge (e.g., a "south bridge") external to processor 200 may be used to couple processor 200 to other peripheral devices via other types of buses or interconnects. It is noted that while memory controller and peripheral interface functions are shown integrated within processor 200 via MCU 230, in other embodiments these functions may be implemented externally to processor 200 via a conventional "north bridge" arrangement. For example, various functions of MCU 230 may be implemented via a separate chipset rather than being integrated within processor 200.

Local and Remote Microcode Units

As mentioned above, in some embodiments core 100 may support microcoded operations. Generally speaking, microcode may encompass a processor implementation technique in which individual operations to be executed by a given core 100 are sourced from a control and data store that is separate from the usual, programmer-visible path of instruction flow, such as may be provided by IFU 120. As an example, a given ISA such as the x86 ISA may include instructions, as well as other defined processor behavior (e.g., reset functionality, interrupt/trap/exception functionality, etc.), that vary significantly in complexity. A simple shift or rotate instruction with register operands may be quite simple to implement as a single operation directly executable by execution units 154. A rotate instruction that rotates an operand through the carry flag, by contrast, may be more readily implemented as a combination of different operations executable within execution units 154. Still more complex instructions, such as certain complex control transfer instructions, instructions that modify system registers such as control registers, instructions involving virtual memory, and instructions that execute within the context of a privilege or protection model, may involve additional operations, such as tests of the processor state (e.g., privilege state) on which instruction execution is conditioned.
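To make the rotate-through-carry example concrete, the sketch below expresses a one-bit rotate left through carry as three simpler operations. It is an illustration only: the decomposition, the function name, and the 8-bit default width are assumptions, not the microcode sequence actually used by any processor.

```python
def rotate_left_through_carry(value: int, carry: int, width: int = 8) -> tuple:
    """Rotate value left by one bit through the carry flag.

    Returns (result, carry_out). The three steps stand in for the simpler
    operations a microcode routine might issue; the split is hypothetical.
    """
    mask = (1 << width) - 1
    carry_out = (value >> (width - 1)) & 1  # op 1: extract the top bit
    shifted = (value << 1) & mask           # op 2: plain left shift
    result = shifted | (carry & 1)          # op 3: merge the old carry into bit 0
    return result, carry_out


rotated = rotate_left_through_carry(0b10010110, carry=1)
```

Each step is an operation of the kind a simple execution unit can perform directly, which is the point of implementing the composite instruction as a sequence.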

Some instructions within an ISA may map directly to individual corresponding operations. In some embodiments, for those complex instructions of the ISA that do not map directly to a single operation executable within an execution unit 154, microcode may be used to convert a given complex instruction into a sequence of simpler operations that may be dispatched for execution. The set of operations available to microcode may generally include those operations that correspond to directly executable ISA instructions. However, in some embodiments, microcode operations may include operations executable by execution units 154 that are not programmer-visible. Further, microcode may be used to generate a sequence of executable operations in response to architecturally defined processor events that do not correspond to particular instructions within the implemented ISA. For example, a reset microcode routine may include a sequence of operations configured to place core 100 in a consistent operating state following a software or hardware reset event (e.g., by initializing caches, storing particular values to particular architectural and non-architectural registers, causing instruction fetch to begin from a predetermined address, etc.). In some embodiments, microcode may also be used to generate routines that implement non-architectural features of core 100 (that is, features not generally visible to or accessible by a programmer during normal operation of core 100). For example, microcode may be used to implement hardware test or debug routines for use during product manufacturing, field analysis, power-on self testing, or other suitable applications.

Microcode routines may typically be stored within any suitable type of control store, such as a read-only memory (ROM) or, as described in greater detail below, a writable memory such as a random access memory (RAM). One exemplary organization of a microcode control store is shown in FIG. 3. In the illustrated embodiment, microcode control store 300 includes a number of entries 310, each of which includes a number of operation fields 320 and a sequence control field 330. In one embodiment, each of entries 310 may correspond to a respective entry point within a microcode address space. For example, a 12-bit microcode address space may allow for up to 4,096 (4K) distinct entry point values. It is noted that in some embodiments, the microcode address space may be configured such that there are more possible entry point values than actual entries 310, for example where sparse entry point addressing is employed.
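By way of non-limiting illustration, the organization of FIG. 3 may be modeled in software. The following Python sketch assumes the 12-bit address space mentioned above and a sparsely populated store. The names `Entry`, `ControlStore`, `ops`, and `seq_ctl`, as well as the operation and sequence control encodings, are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass

ADDR_BITS = 12                     # example address width from the text
MAX_ENTRY_POINTS = 1 << ADDR_BITS  # 4,096 possible entry point values


@dataclass(frozen=True)
class Entry:
    """Illustrative stand-in for an entry 310."""
    ops: tuple    # operation fields 320, one encoded operation each
    seq_ctl: str  # sequence control field 330 (hypothetical encoding)


class ControlStore:
    """Illustrative stand-in for control store 300 with sparse addressing."""

    def __init__(self):
        # Maps entry point -> Entry; not every entry point need be populated.
        self._entries = {}

    def store(self, entry_point, entry):
        if not 0 <= entry_point < MAX_ENTRY_POINTS:
            raise ValueError("entry point outside the microcode address space")
        self._entries[entry_point] = entry

    def lookup(self, entry_point):
        # Returns None when no entry exists at this entry point.
        return self._entries.get(entry_point)


store = ControlStore()
store.store(0x010, Entry(ops=("alu_add", "agu_load"), seq_ctl="next"))
store.store(0x011, Entry(ops=("alu_rot",), seq_ctl="exit"))
```

In this sketch a dictionary stands in for the ROM, so an unpopulated entry point simply yields no entry, which mirrors the case in which there are more possible entry point values than actual entries 310.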

In one embodiment, each operation field 320 of an entry 310 may be configured to store information encoding a single operation executable by one of execution units 154. For example, an operation field 320 may include opcode bits configured to identify the type of operation to be executed, as well as operand bits configured to identify the operand sources to be used by the operation, such as registers or immediate data. In some embodiments, the operand encoding within an operation field 320 may specify that operands should be taken from the macroinstruction (e.g., a programmer-visible ISA instruction fetched by IFU 120) in response to which the microcode entry 310 is executed. In some embodiments, different ones of operation fields 320 may correspond to respective ones of execution units 154 within a cluster 150. Thus, for example, if execution units 154 include two ALUs and two AGUs, two of operation fields 320 may correspond to the ALUs and two may correspond to the AGUs. That is, each operation field 320 of a given entry 310 may correspond to a respective issue slot of the cluster, such that the given entry 310 as a whole may be dispatched to cluster 150 as a unit. In some instances, one operation within a given entry 310 may have a dependency on another operation within the same entry, while in other instances each operation within an entry 310 may be independent of the others. It is noted that in other embodiments, there may be no issue slot restrictions on operation fields 320 within a microcode entry 310. For example, in one such embodiment, an operation may be issued from any field 320 to any execution unit.

Sequence control field 330 may be configured to govern the sequencing behavior of a microcode routine. For example, sequence control field 330 may be configured to indicate the exit point of a routine (i.e., the entry 310 at which a particular routine terminates), or may indicate a nonsequential change in microcode flow control, such as a conditional or unconditional branch or jump to a different entry point. In some embodiments, microcode sequence control may be pipelined such that the sequence control field 330 that pertains to a given entry 310 is actually stored in association with a different, sequentially earlier entry 310. For example, in an embodiment having a one- or two-cycle delay or bubble between the fetch of a microcode entry 310 and the execution of the sequence control field 330 associated with that entry 310, the sequence control field 330 that affects the behavior of the entry 310 at entry point N may be stored in association with the entry 310 at entry point N-1 or N-2, respectively.
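As a non-limiting illustration of the pipelined arrangement described above, the following Python sketch shifts each sequence control value earlier by the pipeline delay, so that the control governing the entry at position N is stored at position N minus the delay. The function name `skew_sequence_control`, the string encodings, and the use of a default "next" filler are assumptions made for illustration. The sketch further assumes that the first few entries of a routine (as many as the delay) use default sequential control, since their controls have no earlier entry in which to be stored.

```python
def skew_sequence_control(logical, delay, filler="next"):
    """Return the stored layout of sequence control values.

    logical[n] is the control that governs entry n. In the stored layout,
    that control is placed at position n - delay. Controls for the first
    `delay` entries are assumed to be the default and are not stored.
    """
    if delay < 0:
        raise ValueError("delay must be non-negative")
    stored = [filler] * len(logical)
    for n, control in enumerate(logical):
        if n >= delay:
            stored[n - delay] = control
    return stored


# A four-entry routine whose last entry is the exit point.
logical = ["next", "next", "next", "exit"]
one_cycle = skew_sequence_control(logical, delay=1)  # exit stored at N-1
two_cycle = skew_sequence_control(logical, delay=2)  # exit stored at N-2
```

With a one-cycle delay, the exit indication for the fourth entry is held with the third entry, and with a two-cycle delay it is held with the second entry.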

In some embodiments, instruction decode unit 140 may include functionality related to accessing and sequencing microcode entries 310. For example, DEC 140 may be configured to detect instructions and other processor events for which microcode execution is needed, and to request entries 310 from a microcode store accordingly. In some embodiments, DEC 140 may be configured to compute an entry point dependent upon the instruction or other event for which microcode is needed, for example by mapping an instruction opcode to an entry point value within the microcode address space. DEC 140 may then submit the computed entry point to a microcode unit, which may be included within DEC 140 or provided as a distinct functional unit within core 100, and as a result may retrieve and dispatch for execution the operations indicated in one or more microcode entries 310. Microcode entry points may also be generated in response to events other than instruction decode. For example, entry points may be generated by the execution clusters or other units in response to the detection of an exception or interrupt.

In embodiments of processor 200 that include multiple cores 100, microcode implementation may warrant particular consideration. Typically, each core 100 will reference the same microcode content (e.g., as may be reflected within entries 310). That is, in some embodiments, each core 100 may implement the same ISA, and code may generally be expected to execute with the same functional behavior regardless of the core 100 on which it executes. Accordingly, cores 100 could be configured to share a single, common instance of a microcode store that includes all of the microcode entries 310 defined for an embodiment of processor 200. However, implementing a single control store instance, such as a single ROM structure shared by all cores 100, may lengthen the distance that microcode data must travel from the control store to the core 100 that requested it, which may increase the overall latency required to execute operations that involve microcode. This, in turn, may degrade the execution performance of cores 100.

Additionally, sharing a single control store among multiple cores 100 may give rise to resource contention if multiple cores 100 attempt to access the shared control store concurrently. Increasing the number of concurrent accesses supported by the control store (e.g., by increasing the number of read buses, the amount of entry point decode logic, etc.) may increase the complexity and cost of the control store and may degrade its timing performance, while serializing concurrent access requests may increase the microcode access latency experienced by cores 100 as each core waits its turn to be serviced.

By contrast, instances of the microcode control store could be replicated such that each core 100 includes a complete copy of the microcode. Such replication may ameliorate the routing, latency, and resource contention issues described above for the case in which cores 100 share a single control store instance. For example, each replicated instance may be placed in relatively close proximity to its respective core 100, reducing overall routing distances. However, replicating the control store in its entirety in this fashion may increase the design area and complexity of each of cores 100, which may increase design and manufacturing cost as well as the power consumption of processor 200.

In some embodiments, processor 200 may be configured to employ a hierarchical microcode control store approach that includes aspects of both replication and sharing. As shown in FIG. 4, in one embodiment each of a number of cores 100a-d of processor 200 includes a respective instance of a local microcode unit 400a-d (or simply, local units 400a-d) and may be configured to access a shared remote microcode unit 410 (or simply, remote unit 410) via system interface unit 210. It is noted that, although omitted for clarity, other elements of processor 200 shown in FIG. 2 may be included in the embodiment of FIG. 4. Further, in various embodiments, the number of cores 100 and of the respective local microcode units 400 included therein may vary. In some embodiments, local unit 400 may be included within other units of its respective core 100. For example, local unit 400 may be implemented within DEC 140. Although in the illustrated embodiment every one of cores 100 includes a respective local microcode unit 400, it is contemplated that in some embodiments only multiple ones of cores 100, rather than all of them, may include respective local microcode units 400. That is, it is not required that every core 100 be identically configured with respect to local units 400, although in some embodiments this may be the case.
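The hierarchical arrangement of FIG. 4 may be sketched, purely for illustration, as a lookup that first consults a per-core local store and falls back to a single shared remote store. In the following Python sketch, the class names `Core` and `RemoteUnit`, the entry contents, and the use of dictionary membership to decide whether an entry is local are all assumptions. The sketch does not model the system interface unit or any timing.

```python
class RemoteUnit:
    """Illustrative stand-in for shared remote unit 410."""

    def __init__(self, entries):
        self._entries = dict(entries)
        self.requests = 0  # number of requests received from all cores

    def fetch(self, entry_point):
        self.requests += 1
        return self._entries[entry_point]


class Core:
    """Illustrative stand-in for a core 100 containing a local unit 400."""

    def __init__(self, local_entries, remote):
        self._local = dict(local_entries)  # replicated per core
        self._remote = remote              # shared by all cores

    def fetch(self, entry_point):
        # Try the local unit first; on a miss, request the remote unit.
        if entry_point in self._local:
            return self._local[entry_point], "local"
        return self._remote.fetch(entry_point), "remote"


remote = RemoteUnit({0x200: "reset_step_0"})
cores = [Core({0x010: "rotate_through_carry"}, remote) for _ in range(4)]

local_result = cores[0].fetch(0x010)
remote_result = cores[3].fetch(0x200)
```

Here every core holds its own copy of the entry at 0x010, while the entry at 0x200 exists only once, in the shared remote store.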

Generally speaking, each instance of local microcode unit 400 may be configured to store microcode routines that are determined to be more performance-sensitive to the operation of core 100, while remote microcode unit 410 may be configured to store microcode routines determined to be less performance-sensitive. For example, a given microcode routine may be selected for storage within local units 400 if it satisfies a performance sensitivity threshold requirement, and may otherwise be allocated for storage within remote unit 410.

In various embodiments, the performance sensitivity of a given microcode routine may be determined using different criteria. For example, the length of a microcode routine may be used as a criterion of performance sensitivity, such that routines of a given length at or below a threshold length (e.g., routines consisting of a single entry 310) are included within local units 400, while routines longer than the threshold length are included within remote unit 410. In some cases, routine execution latency may be used as a proxy for length. In other embodiments, frequency of routine execution may be used as the selection criterion, such that routines executed with at least a threshold frequency or probability (e.g., as may be predicted through simulation of expected programming workloads) are included within local units 400, while routines having a lower frequency or probability of execution are included within remote unit 410. In still other embodiments, both execution frequency/probability and routine length may be considered, possibly along with other criteria, in determining where a given routine is to be stored. For example, a reset microcode routine may be quite long and infrequently executed, and is therefore a likely candidate for inclusion within remote microcode unit 410. In another case, a microcode routine that handles virtual memory page misses may be executed more frequently than the reset routine. However, a page miss routine may be likely to perform a number of long-latency memory accesses as a consequence of the page miss. Thus, the latency of accessing the page miss routine may be small relative to the latency of executing the routine, making it a possible candidate for inclusion within remote microcode unit 410. Generally speaking, the performance sensitivity of a given microcode routine, for the purposes of the threshold requirement, may be any suitable function of one or more of the above factors (e.g., length, latency, frequency of execution) or other relevant factors.
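One possible placement policy of the kind described above may be sketched as follows. This Python example is an assumption-laden illustration and not a policy stated in the specification: the scoring function `sensitivity`, the `remote_penalty` value, the units of `frequency` and `exec_latency`, and all of the numbers are hypothetical. Execution latency is used here as a proxy for routine length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Routine:
    name: str
    frequency: float   # assumed unit: executions per million instructions
    exec_latency: int  # assumed unit: cycles to execute the routine once


def sensitivity(routine, remote_penalty=20):
    """Hypothetical score: execution frequency weighted by the share of the
    routine's total time that a remote access (remote_penalty cycles) adds."""
    share = remote_penalty / (routine.exec_latency + remote_penalty)
    return routine.frequency * share


def place(routines, threshold):
    """Assign each routine to the local or remote unit by threshold."""
    return {
        r.name: "local" if sensitivity(r) >= threshold else "remote"
        for r in routines
    }


routines = [
    Routine("rotate_through_carry", frequency=500.0, exec_latency=2),
    Routine("page_miss", frequency=50.0, exec_latency=2000),
    Routine("reset", frequency=0.001, exec_latency=3000),
]
placement = place(routines, threshold=1.0)
```

Under these assumed numbers, the short and frequent rotate routine is placed locally, while the page miss and reset routines fall below the threshold because the remote access penalty is small relative to their execution latency or because they execute rarely.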

One embodiment of an instance of local microcode unit 400 is shown in FIG. 5. In the illustrated embodiment, local unit 400 includes a control store 500 and a sequencer 510. Generally speaking, control store 500 may be an example of an instance of microcode control store 300 that is configured to store a number of microcode entries 310 corresponding to those microcode routines selected for inclusion within local unit 400. In one embodiment, control store 500 may include a ROM configured according to any suitable organization. For example, control store 500 may be organized as a single large ROM bank, as multiple banks segregated according to the fields of entries 310, or in another suitable fashion. Generally speaking, a ROM may refer to any type of addressable data structure configured to receive an index value (e.g., an entry point) as an input and to responsively provide an output value (e.g., an entry 310) corresponding to the supplied input value. Such data structures may include memory arrays as well as gate arrays or other suitable arrangements of logic devices. In some embodiments, control store 500 may include writable memory elements, such as RAM or nonvolatile memory, in addition to or instead of ROM.

In one embodiment, sequencer 510 may be configured to access control store 500 according to microcode requests received from DEC 140 and the sequence control information included within entries 310. In response to receiving a particular microcode entry point from DEC 140, sequencer 510 may be configured to access control store 500 to retrieve the entry 310 corresponding to that entry point. The operations specified within the retrieved entry 310 may then be returned to DEC 140 for dispatch and execution. Additionally, sequencer 510 may be configured to evaluate the sequence control field 330 of the retrieved entry 310 to determine whether to retrieve another entry 310 that sequentially follows the previously retrieved entry 310, to retrieve another entry 310 located at a nonsequential entry point (e.g., an entry point specified in the sequence control field 330), to terminate retrieval of microcode entries 310, or to take some other defined action. In some embodiments, microcode entries 310 may include branch operations that may be predicted and/or executed within execution units 154 or elsewhere within core 100. In some such embodiments, sequencer 510 may be configured to respond to changes in microcode sequence control that result from predicted or executed branch operations. For example, such a branch operation may cause sequencer 510 to redirect microcode retrieval from the current entry point to an entry point specified by the branch operation.
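The basic sequencing behavior described above (continue sequentially, jump to a nonsequential entry point, or terminate) may be illustrated with the following Python sketch. The function name `run_routine`, the tuple encoding of entries, and the action names "next", "jump", and "exit" are hypothetical. The sketch assumes non-pipelined sequence control and does not model branch operations executed elsewhere in the core.

```python
def run_routine(control_store, entry_point, max_steps=1000):
    """Return the operations dispatched for one microcode routine.

    control_store maps entry point -> (ops, (action, target)), where action
    is "next", "jump", or "exit" (an assumed encoding of field 330).
    """
    dispatched = []
    for _ in range(max_steps):
        ops, (action, target) = control_store[entry_point]
        dispatched.extend(ops)       # hand operations back for dispatch
        if action == "exit":         # routine terminates at this entry
            return dispatched
        if action == "jump":         # nonsequential change of flow
            entry_point = target
        elif action == "next":       # sequentially following entry
            entry_point += 1
        else:
            raise ValueError(f"unknown sequence control action: {action}")
    raise RuntimeError("routine did not terminate within max_steps")


control_store = {
    0x10: (("op_a", "op_b"), ("next", None)),
    0x11: (("op_c",), ("jump", 0x20)),
    0x20: (("op_d",), ("exit", None)),
}
operations = run_routine(control_store, 0x10)
```

The `max_steps` bound is a safeguard of the sketch only, so that a malformed routine without an exit point raises an error rather than looping indefinitely.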

In some embodiments, each of local microcode units 400 may be configured to map the same set of microcode entry points to its respective control store 500, such that each of cores 100 accesses the same entry points as local entry points. While in some embodiments each instance of control store 500 may have the same contents as every other instance within the various cores 100 of processor 200, it is contemplated that in other embodiments the contents of each control store 500 need not be exactly identical to the others. For example, manufacturing defects may result in functional differences among instances of control store 500, which may be correctable through patch techniques or by reference to entries 310 stored in a location other than a given control store 500.

FIG. 6 illustrates one embodiment of remote microcode unit 410. In the illustrated embodiment, remote unit 410 includes a remote microcode unit interface 610, which may be configured to communicate with cores 100 via SIU 210 or via any other suitable type of interface. Interface 610 may be coupled to a request queue 620 and a transmit queue 630. Request queue 620 may be configured to convey to remote control store 640 requests for the microcode entries stored within that control store. In the illustrated embodiment, remote control store 640 may include an arbitrary number of microcode ROM banks 650a-n and, optionally, a microcode patch RAM 660. In some embodiments, request queue 620 may also be configured to convey microcode requests to an optional dynamic microcode RAM array 670. Transmit queue 630 may be configured to queue entries retrieved from remote control store 640 and, in some embodiments, from dynamic microcode RAM array 670 for transmission to their requesting cores 100.

As described above with respect to FIG. 4, remote unit 410 may be configured to store various microcode routines on behalf of cores 100. Such routines may include, for example, routines that are lengthy, infrequently executed, or otherwise deemed less likely to affect processor performance than the routines selected for inclusion within local units 400. Generally speaking, remote unit 410 may be configured to receive requests for various microcode entry points from different ones of cores 100, and may respond to each request with one or more corresponding entries. Depending on the configuration of remote microcode unit interface 610 and the interface protocol employed between cores 100 and remote unit 410, microcode requests from multiple cores 100 may be received concurrently or serially.

Remote control store 640 may be an exemplary instance of microcode control store 300 configured to store a number of microcode entries 310 corresponding to the microcode routines selected for inclusion within remote unit 410. In some embodiments, the format of an entry 310 stored within remote control store 640 may be similar to that of an entry 310 stored within control store 500 of local unit 400. In the illustrated embodiment, entries 310 may be distributed across a number of microcode ROM banks 650 according to any suitable partitioning or organization. It is noted that in other embodiments, the number of banks 650 may vary. It is also contemplated that, like control store 500, in some embodiments remote control store 640 may include writable memory and/or data storage elements other than memory arrays.

Generally, remote microcode unit interface 610, request queue 620, and transmit queue 630 may be collectively configured to manage the processing of microcode requests received from cores 100. Request queue 620 may be configured to store incoming microcode requests, which in one embodiment may minimally include an entry point and an indication of the requesting core 100, until they can be processed within remote control store 640. Similarly, transmit queue 630 may be configured to store entries 310 retrieved from remote control store 640 or, in some embodiments, from dynamic microcode RAM array 670 until those entries 310 can be transmitted to their requesting cores 100. In one embodiment, remote microcode unit interface 610 may be configured to control the receipt of requests from cores 100 and the transmission of resulting entries to cores 100, including the management of request queue 620 and transmit queue 630. Interface 610 may also include logic configured to coordinate the processing of individual requests stored in request queue 620, for example to select particular requests for processing according to a sequencing control or arbitration scheme, and to coordinate the storage within transmit queue 630 of entries retrieved during request processing. In other embodiments, the selection of requests for processing from request queue 620 and the storage of results within transmit queue 630 may be performed by logic external to interface 610.
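For illustration only, the interplay of the request queue and the transmit queue may be sketched as two first-in, first-out queues around a store. In the following Python sketch, the class name `RemoteUnitModel`, its method names, and the entry contents are hypothetical, and oldest-first selection is merely one of the selection schemes that could be used.

```python
from collections import deque


class RemoteUnitModel:
    """Illustrative model of request queue 620 and transmit queue 630."""

    def __init__(self, store):
        self.store = store            # entry point -> entry
        self.request_queue = deque()  # holds (core_id, entry_point)
        self.transmit_queue = deque() # holds (core_id, entry)

    def receive(self, core_id, entry_point):
        # A request minimally carries an entry point and the requesting core.
        self.request_queue.append((core_id, entry_point))

    def process_one(self):
        """Service the oldest pending request, if any."""
        if not self.request_queue:
            return False
        core_id, entry_point = self.request_queue.popleft()
        self.transmit_queue.append((core_id, self.store[entry_point]))
        return True

    def transmit(self):
        """Return the next (core_id, entry) to send, or None if empty."""
        return self.transmit_queue.popleft() if self.transmit_queue else None


unit = RemoteUnitModel({0x200: "entry_200", 0x201: "entry_201"})
unit.receive(core_id=2, entry_point=0x200)
unit.receive(core_id=0, entry_point=0x201)
while unit.process_one():
    pass
first = unit.transmit()
second = unit.transmit()
```

Keeping the core identifier with each queued item is what allows a retrieved entry to be routed back to the core that requested it.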

Numerous different request management configurations of remote unit 410 are possible and contemplated. In one embodiment, remote unit 410 may omit sequencing functionality. In such an embodiment, remote microcode unit interface 610 may be configured to receive from a given core 100 a request corresponding to each specific entry point that the given core 100 wishes to receive. Thus, for example, the sequencer 510 of a given local unit 400 may be configured to determine which entry points need to be retrieved for a given routine stored within remote unit 410, and to generate requests directed to those entry points in a manner similar to that used to access control store 500.

While omitting sequencing functionality from remote unit 410 may simplify its design, it may also increase request traffic and processing latency between cores 100 and remote unit 410. In other embodiments, remote unit 410 may support varying degrees of autonomous sequencing. For example, in one embodiment, a received microcode request may specify a starting entry point along with either an ending entry point or a number of entries to be retrieved. In such an embodiment, remote unit 410 may be configured to retrieve multiple entries 310 corresponding to the received request, beginning with the starting entry point and continuing sequentially until the ending entry point or the number of entries to be retrieved is reached. The multiple entries 310 thus retrieved may be returned to the requesting core 100 in the order in which they were retrieved.
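The range-style request described above may be sketched as follows, as a non-limiting illustration. The function name `fetch_range` and its parameters are hypothetical, and the sketch assumes that an ending entry point is inclusive, which the text does not specify.

```python
def fetch_range(store, start, end=None, count=None):
    """Retrieve consecutive entries beginning at `start`.

    Exactly one of `end` (assumed inclusive) or `count` must be given.
    Entries are returned in the order in which they are retrieved.
    """
    if (end is None) == (count is None):
        raise ValueError("specify exactly one of end or count")
    if count is not None:
        if count < 1:
            raise ValueError("count must be at least 1")
        end = start + count - 1
    if end < start:
        raise ValueError("end must not precede start")
    return [store[entry_point] for entry_point in range(start, end + 1)]


store = {0x300 + i: f"entry_{i}" for i in range(5)}
by_count = fetch_range(store, 0x300, count=3)
by_end = fetch_range(store, 0x301, end=0x303)
```

A single such request replaces what would otherwise be one request per entry point, which is the traffic reduction the paragraph above describes.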

In another embodiment, remote unit 410 may support more sophisticated sequencing functionality. For example, remote unit 410 may be configured to include a sequencer that is functionally similar to sequencer 510. Such a sequencer may be configured to support some or all of the values of sequence control field 330 that are supported by sequencer 510. Alternatively, in one embodiment, entries 310 stored within remote unit 410 may be configured to include an additional sequence control field distinct from sequence control field 330, such that the additional sequence control field is processed by sequencing functionality within remote unit 410, while sequence control field 330 is processed by sequencer 510 within a particular local unit 400. In such an embodiment, the task of sequencing multiple-entry routines may be divided between remote unit 410 and local unit 400 as appropriate.

Because remote unit 410 may be shared by a number of different cores 100, remote unit 410 may be configured to employ various types of arbitration or scheduling policies when selecting which microcode request to service. For example, remote unit 410 may be configured to select requests from request queue 620 for processing in a round-robin fashion, by selecting the oldest request first, or by using any other suitable selection scheme. In some embodiments in which remote unit 410 supports sequencing of requests, a single request from a particular core 100 could cause a large number of entries 310 to be retrieved, to the exclusion of requests from other cores 100. Accordingly, in some embodiments, such a request may be interrupted after a certain number of entries 310 have been retrieved so that a different request can be serviced. In other embodiments, a fairness algorithm for selecting the next request to service may take into account not only how recently requests were serviced on behalf of each core 100, but also the duration of those requests. Note that in some embodiments, remote control store 640 may be configured in a multiported fashion so that two or more microcode requests may be retrieved concurrently. For example, remote control store 640 may be implemented using multiported memory cells. In other embodiments, smaller single-ported memory cells may be employed in a banked arrangement so that multiple requests targeting different banks 650 may be retrieved concurrently. In some banked embodiments, multiple requests targeting the same bank 650 may be serialized.
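The combination of round-robin selection with interruption of long sequenced requests can be sketched as follows. This is only one possible policy under stated assumptions: the function name `serve_round_robin`, the per-turn limit `max_entries_per_turn`, and the decision to requeue an interrupted request at the tail of the queue are hypothetical and are not prescribed by the description.

```python
from collections import deque


def serve_round_robin(requests, max_entries_per_turn):
    """Return the service order as (core_id, entries_served) slots.

    requests maps a core id to the number of entries that core asked for.
    A request needing more than max_entries_per_turn entries is interrupted
    and its remainder is placed at the back of the queue (an assumption).
    """
    queue = deque(requests.items())
    schedule = []
    while queue:
        core, remaining = queue.popleft()
        served = min(remaining, max_entries_per_turn)
        schedule.append((core, served))
        if remaining > served:
            # Interrupt the long request so other cores can be serviced.
            queue.append((core, remaining - served))
    return schedule


# Core 0 requests a long routine; cores 1 and 2 request short ones.
schedule = serve_round_robin({0: 5, 1: 1, 2: 2}, max_entries_per_turn=2)
```

With these inputs, cores 1 and 2 are serviced after core 0 has received only its first two entries, rather than waiting for all five.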

Partitioning the microcode implementation hierarchically into multiple local microcode units 400 and a shared remote microcode unit 410 may improve the timing and power consumption of processor 200. For example, because local microcode units 400 may typically be smaller in area than remote microcode unit 410, local microcode units 400 may be able to operate at a higher frequency than if each local microcode unit 400 were as large as remote microcode unit 410. Similarly, smaller microcode units typically consume less power than larger ones. In some embodiments, implementing remote microcode unit 410 separately from local microcode units 400 may allow remote microcode unit 410 to be powered down when it is not being accessed by any core 100 (e.g., by gating clocks or by disabling the power grid to remote unit 410). This may reduce the overall power consumption of processor 200.

FIG. 7 illustrates one embodiment of a method of retrieving microcode entries 310 in an embodiment of a processor that includes local microcode units 400 within individual cores 100 and a shared remote microcode unit 410. In the illustrated embodiment, operation begins in block 700, where a microcode entry point is generated within a given core 100. In one embodiment, DEC 140 may be configured to generate a microcode entry point corresponding to an instruction received from IFU 120, for example by decoding various portions of the instruction, determining that the instruction corresponds to microcode, and generating the corresponding entry point from the decoded instruction. As noted above, in some embodiments, entry points may also be generated by DEC 140 or by another unit within core 100 for processor events other than instruction execution, such as reset events, interrupts, traps, exceptions, faults, requests to invoke non-hierarchical microcode routines (e.g., test or debug routines), or other types of events involving microcode execution.

It is then determined whether the generated entry point corresponds to an entry 310 located within the local microcode unit 400 of the given core 100 or within remote microcode unit 410 (block 702). In one embodiment, the microcode address space encompassing entries 310 may be partitioned such that one or more portions of the address space correspond to entries 310 within local microcode unit 400, while one or more other, distinct portions correspond to entries 310 within remote microcode unit 410. For example, one embodiment may support a 12-bit microcode address space encompassing 4,096 entries 310, of which 1,024 are located within local microcode unit 400 and the remainder are stored remotely. In this example, hexadecimal entry point values 0x000 through 0x3FF may correspond to entries 310 within local unit 400, while entry point values 0x400 through 0xFFF may correspond to entries within remote unit 410. Other mappings are possible, such as mappings in which noncontiguous portions of the microcode address space are assigned to the various units. Also, in some embodiments, the determination of whether an entry point is local or remote may depend on an indication other than whether the entry point falls within a particular address range. For example, a local/remote indication may be decoded from an instruction by DEC 140 separately from the entry point for that instruction. In various embodiments, the determination of whether an entry point is local or remote may be performed by decode logic within core 100 such as DEC 140, by local microcode unit 400 (e.g., by sequencer 510), or by other logic within core 100.
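The address-range form of the local/remote determination in the 12-bit example above can be written out as a short sketch. The constants follow the example values given in the text (0x000 through 0x3FF local, 0x400 through 0xFFF remote); the function name `is_local` and the range checks are illustrative assumptions, and other mappings or a separately decoded local/remote indication would look different.

```python
# 12-bit microcode address space from the example: 4,096 entry points.
ADDRESS_SPACE = range(0x000, 0x1000)
# The first 1,024 entry points are held in the local microcode unit.
LOCAL_RANGE = range(0x000, 0x400)


def is_local(entrypoint):
    """Return True if the entry point maps to the local unit, False if remote."""
    if entrypoint not in ADDRESS_SPACE:
        raise ValueError(f"entry point {entrypoint:#x} is outside the 12-bit space")
    return entrypoint in LOCAL_RANGE


last_local = is_local(0x3FF)
first_remote = is_local(0x400)
```

Because the local range in this example is a power-of-two prefix of the address space, hardware could make the same decision by testing only the two most significant entry point bits.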

If the generated entry point corresponds to an entry 310 located within local unit 400, the entry is retrieved from the local unit (block 704). For example, upon receiving the entry point along with an indication of a microcode request, sequencer 510 within local unit 400 may be configured to access control store 500 to retrieve the corresponding entry 310.

If the generated entry point corresponds to an entry 310 located within remote unit 410, a request may be conveyed to the remote unit (block 706). For example, either DEC 140 or local unit 400 may initiate a request to remote unit 410 by conveying the request via system interface unit 210, specifying remote unit 410 as the destination. Remote unit 410 may then retrieve the specified entry 310 (block 708) and return the entry to the given core 100 (block 710). For example, remote microcode unit interface 610 may receive the request from SIU 210 and place it in request queue 620. Once selected, the request may be processed to retrieve the specified entry 310 from remote control store 640, and the result may be placed in transmit queue 630. Subsequently, interface 610 may select the entry 310 from transmit queue 630 and convey it to the given core 100 via SIU 210.

Once the entry 310 has been retrieved from either local unit 400 or remote unit 410, its operations may be dispatched for execution (block 712). For example, DEC 140 may dispatch the operations specified within the entry 310 to a scheduler 152 within one of clusters 150 for subsequent execution by execution units 154. In some embodiments, further decoding of the operations specified within the entry 310 may be performed prior to dispatch, for example by DEC 140.

In some embodiments, operations specified by a retrieved microcode entry 310 may be stored or queued before being dispatched for execution. For example, DEC 140 may implement queues that decouple the dispatch of operations from the decode or microcode retrieval process, which may reduce the performance impact of stalls that may occur at or upstream of DEC 140. In some embodiments, operations specified in a microcode entry 310 retrieved from remote microcode unit 410 may be inserted directly into such an operation queue or storage upon receipt by core 100, and the retrieved entry 310 need not otherwise be stored or retained within core 100. In other embodiments, however, a retrieved remotely stored entry 310 may be held within core 100 for some period of time, so that it can be reused without waiting for it to be retrieved again from remote unit 410. For example, local microcode unit 400 may include writable storage into which entries 310 retrieved from remote unit 410 may be written upon receipt. In one embodiment, such storage may include one or more buffers or registers, and an entry 310 may be stored until it is evicted by a subsequently retrieved entry 310.

In another embodiment, local unit 400 may include a microcode cache, implemented, for example, as part of control store 500. Such a cache may be configured to store a number of retrieved entries 310 and may be organized using any suitable configuration, such as direct-mapped, set-associative, or fully associative. Cache evictions may be performed according to any suitable replacement policy, such as a least frequently used or least recently used policy. In one variant of such an embodiment, retrieved entries 310 may be stored within instruction cache 110 rather than within a dedicated microcode cache. In embodiments in which retrieved remotely stored entries 310 may be cached or otherwise stored within core 100, the cache or other local storage may be checked for the presence of a given remotely stored entry 310 either before or concurrently with the generation of a request for the given entry 310. If the desired entry 310 is already available in the local cache or other local storage, the request to remote unit 410 need not be generated, or may be canceled if it is already outstanding. The microcode cache may be configured to provide a dynamic microcode store that is local to core 100 and that is allocated based on the particular instruction stream executing on a given core 100. Because the latency of remote microcode unit 410 may typically be longer than that of a given local microcode unit 400, the microcode cache may mitigate performance problems with infrequent instruction sequences that, when they do occur, may require frequent access to remote microcode unit 410.
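A minimal sketch of such a microcode cache is given below, assuming a fully associative organization with least recently used replacement, which is just one of the configurations the text allows. The class name `MicrocodeCache`, the `fetch_remote` callback standing in for a request to remote unit 410, and the request counter are hypothetical and exist only to make the behavior observable.

```python
from collections import OrderedDict


class MicrocodeCache:
    """Fully associative cache of remotely stored entries with LRU eviction."""

    def __init__(self, capacity, fetch_remote):
        self.capacity = capacity
        self.fetch_remote = fetch_remote   # stand-in for a remote unit request
        self.entries = OrderedDict()       # entry point -> cached entry
        self.remote_requests = 0

    def get(self, entrypoint):
        if entrypoint in self.entries:
            # Hit: no remote request is generated; mark as most recently used.
            self.entries.move_to_end(entrypoint)
            return self.entries[entrypoint]
        # Miss: request the entry from the remote unit and cache it.
        self.remote_requests += 1
        entry = self.fetch_remote(entrypoint)
        self.entries[entrypoint] = entry
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # evict least recently used
        return entry


cache = MicrocodeCache(capacity=2, fetch_remote=lambda ep: f"ops@{ep:#x}")
for ep in (0x400, 0x401, 0x400, 0x402, 0x401):
    cache.get(ep)
```

In this access pattern the second reference to 0x400 is satisfied locally, while 0x401 is evicted by 0x402 and must be fetched again, so four of the five accesses reach the remote unit.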

Note that although a single remote microcode unit 410 may be shared by a number of cores 100 in embodiments such as that of FIG. 4, in other embodiments remote microcode unit 410 may be replicated. For example, several instances of remote microcode unit 410 may each be shared by fewer than all of the cores 100. In other embodiments, a complete replica of remote microcode unit 410 may be included within each core 100. In such embodiments, the increased area required by replicating remote microcode unit 410 may be offset by reduced routing and/or timing complexity. For example, replicating remote microcode unit 410 may shorten the average distance from a core 100 to remote microcode, and correspondingly reduce remote microcode access latency.

In many processor implementations, microcode is often static once the processor is implemented, and may be implemented within a read-only control store, for example to minimize the area required by the control store. In some embodiments, however, it may be useful to provide a technique by which microcode may be modified, for example to correct defects or to add functionality. As shown in FIG. 6, in some embodiments remote microcode unit 410 may include additional features configured to provide a writable control store.

Optional microcode patch RAM 660 may be configured to provide a facility by which particular entry points may be mapped from microcode ROM banks 650 to corresponding writable entries 310 within patch RAM 660. In one embodiment, patch RAM 660 may include writable storage resources configured to implement a number of entries 310. In some embodiments, patch RAM 660 may include the same number of entries 310 as one of ROM banks 650, while in other embodiments it may include more or fewer entries. Patch RAM 660 may also provide resources that allow each entry 310 to correspond to an assignable entry point. In embodiments in which remote control store 640 includes a number of ROM banks 650 that each have the same number of entries as patch RAM 660, a given entry 310 of patch RAM 660 may map to the corresponding entry 310 in any one of ROM banks 650. An additional set of bits within patch RAM 660 may specify which bank a given patch RAM entry corresponds to at any given time. For example, an embodiment may include four ROM banks 650 and one patch RAM bank 660, each having 1000 entries. Additional bits within patch RAM 660 may specify, for each given entry 310, which of the four ROM banks the given entry maps to. In other embodiments, each entry of patch RAM 660 may have a corresponding programmable entry point.

To patch a given entry 310 within one of banks 650, in one embodiment the desired patched value for the given entry may be stored within the corresponding entry of patch RAM 660, along with information indicating either which bank 650 is to be patched or a specific entry point value to be associated with the patch RAM entry. Subsequently, when a request to access a particular entry point is received by remote control store 640, patch RAM 660 may be examined to determine whether the requested entry point has been patched. For example, remote control store 640 may be configured to determine the particular bank 650 to which the particular entry point maps, and then to examine the control bits of the corresponding entry 310 within patch RAM 660 to determine whether the entry 310 stored within patch RAM 660 should be selected instead of the entry 310 stored within the particular bank 650. Alternatively, the particular entry point may be compared against the programmed entry points within patch RAM 660 in an associative fashion to determine whether it hits or matches an entry of patch RAM 660. Note that although patch RAM 660 is shown as an optional feature of remote microcode unit 410, in some embodiments local microcode unit 400 may also support a patch RAM feature within control store 500 in a manner similar to that described above.
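The bank-select form of the patch lookup described above can be illustrated with a short sketch that uses the example sizes from the text (four ROM banks and one patch RAM, each with 1000 entries). The class name `RemoteControlStore`, the mapping of an entry point to a bank and index by integer division, and the representation of the control bits as an optional `(bank, value)` pair are assumptions chosen for illustration only.

```python
BANK_SIZE = 1000   # entries per ROM bank and in the patch RAM (example values)
NUM_BANKS = 4


class RemoteControlStore:
    """Toy model of ROM banks overlaid by a bank-selecting patch RAM."""

    def __init__(self):
        self.rom = [[f"rom{b}[{i}]" for i in range(BANK_SIZE)]
                    for b in range(NUM_BANKS)]
        # Each patch RAM slot is None (unused) or (bank, patched_entry);
        # the bank number plays the role of the additional control bits.
        self.patch_ram = [None] * BANK_SIZE

    def patch(self, entrypoint, value):
        bank, index = divmod(entrypoint, BANK_SIZE)   # assumed mapping
        self.patch_ram[index] = (bank, value)

    def read(self, entrypoint):
        bank, index = divmod(entrypoint, BANK_SIZE)
        slot = self.patch_ram[index]
        if slot is not None and slot[0] == bank:
            return slot[1]            # patched entry overrides the ROM entry
        return self.rom[bank][index]  # otherwise fall through to ROM


store = RemoteControlStore()
store.patch(2005, "fixed entry")
patched = store.read(2005)
same_index_other_bank = store.read(1005)
```

A consequence of this organization is that only one of the four ROM entries sharing a given index can be patched at a time; the associative alternative mentioned in the text would avoid that restriction at the cost of comparison logic.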

Patch RAM 660 may readily provide for patching of individual microcode entry points. In some embodiments, however, it may be desirable to rewrite entire routines encompassing many entry points, or to augment existing microcode with new routines. Accordingly, in one embodiment remote microcode unit 410 may include an optional dynamic microcode RAM array 670. Generally speaking, microcode RAM array 670 may be implemented according to any suitable writable or nonvolatile storage array technology, and may be configured to store a number of entries 310 in addition to those stored within remote control store 640 and local control store 500. In one embodiment, the portion of the microcode address space associated with entries 310 of microcode RAM array 670 may be distinct from the portions of the microcode address space associated with entries 310 within remote control store 640 and local control store 500, such that a microcode access request received by remote microcode unit 410 may be directed either to remote control store 640 or to microcode RAM array 670 according to the requested entry point.

In other embodiments, certain entries 310 within microcode RAM array 670 may be configured to shadow or override entry points that also map to remote control store 640. For example, microcode RAM array 670 may include programmable entry point control bits or registers similar to those described above with respect to patch RAM 660. In such embodiments, a particular entry point that maps to remote microcode unit 410 may be checked to determine whether a corresponding entry 310 has been defined within microcode RAM array 670. If so, any corresponding entry 310 within remote control store 640 may be disregarded. In some embodiments of remote microcode unit 410 that support some degree of sequencing, once an entry point has initially been mapped to microcode RAM array 670, for example at the beginning of a multiple-entry microcode routine, subsequent sequential references may remain within the portion of the microcode address space allocated to microcode RAM array 670. This may allow the remainder of the routine to execute from RAM array 670 without further reference to remote control store 640.

Note that although the above embodiments have been described with respect to a two-level hierarchy including local microcode units 400 and remote microcode unit 410, in other embodiments microcode may be distributed within processor 200, or across multiple instances of processor 200 within a system, using additional levels of hierarchy.

In some embodiments, processor 200 may be implemented within a computer system along with other components. FIG. 8 illustrates one embodiment of such a system. In the illustrated embodiment, computer system 800 includes several processing nodes 812A, 812B, 812C, and 812D. Each processing node is coupled to a respective memory 814A-814D via a memory controller 816A-816D included within each respective processing node 812A-812D. Additionally, processing nodes 812A-812D include interface logic used to communicate between the processing nodes 812A-812D. For example, processing node 812A includes interface logic 818A for communicating with processing node 812B, interface logic 818B for communicating with processing node 812C, and a third interface logic 818C for communicating with yet another processing node (not shown). Similarly, processing node 812B includes interface logic 818D, 818E, and 818F; processing node 812C includes interface logic 818G, 818H, and 818I; and processing node 812D includes interface logic 818J, 818K, and 818L. Processing node 812D is coupled to communicate with a plurality of input/output devices (e.g., devices 820A-820B in a daisy-chain configuration) via interface logic 818L. Other processing nodes may communicate with other I/O devices in a similar fashion.

Processing nodes 812A-812D may be configured to implement a packet-based link for communication between processing nodes. In the illustrated embodiment, the link is implemented as sets of unidirectional lines (e.g., lines 824A are used to transmit packets from processing node 812A to processing node 812B, and lines 824B are used to transmit packets from processing node 812B to processing node 812A). Other sets of lines 824C-824H are used to transmit packets between other processing nodes as illustrated in FIG. 8. Generally, each set of lines 824 may include one or more data lines, one or more clock lines corresponding to the data lines, and one or more control lines indicating the type of packet being conveyed. The link may be operated in a cache-coherent fashion for communication between processing nodes, or in a non-coherent fashion for communication between a processing node and an I/O device (or a bus bridge to an I/O bus of conventional construction, such as a Peripheral Component Interconnect (PCI) bus or an Industry Standard Architecture (ISA) bus). Furthermore, the link may be operated in a non-coherent fashion using a daisy-chain structure between I/O devices as shown. Note that a packet transmitted from one processing node to another may pass through one or more intermediate nodes. For example, a packet transmitted by processing node 812A to processing node 812D may pass through either processing node 812B or processing node 812C, as shown in FIG. 8. Any suitable routing algorithm may be used. Other embodiments of computer system 800 may include more or fewer processing nodes than the embodiment shown in FIG. 8. Also, other embodiments of computer system 800 may be implemented using bidirectional buses employing a suitable interface protocol, rather than the unidirectional buses employing a packet-based protocol described above.

Generally, a packet may be transmitted as one or more bit times on the lines 824 between nodes. A bit time may be the rising or falling edge of the clock signal on the corresponding clock lines. The packets may include command packets for initiating transactions, probe packets for maintaining cache coherency, and response packets for responding to probes and commands.

Processing nodes 812A-812D may include one or more processors in addition to a memory controller and interface logic. Broadly speaking, a processing node comprises at least one processor and may optionally include a memory controller for communicating with memory and other logic as desired. More particularly, each processing node 812A-812D may comprise one or more copies of processor 200 as shown in FIG. 2 (e.g., including the various structural and operational details shown in FIG. 1 and FIGS. 3-7). The one or more processors may comprise a chip multiprocessing (CMP) or chip multithreaded (CMT) integrated circuit within the processing node or forming the processing node, or the processing node may have any other desired internal structure. In some embodiments, the memory controller and/or peripheral interface logic of a processing node 812 may be integrated directly within processor 200, as shown in FIG. 2. For example, an instance of memory controller 816 may correspond to memory controller/peripheral interface 230 within processor 200.

Memories 814A-814D may comprise any suitable memory devices. For example, memories 814A-814D may comprise one or more RAMBUS DRAMs (RDRAMs), synchronous DRAMs (SDRAMs), DDR SDRAM, static RAM, and so on. The address space of computer system 800 may be divided among memories 814A-814D. Each processing node 812A-812D may include a memory map used to determine which addresses are mapped to which memories 814A-814D, and hence to which processing node 812 a memory request for a particular address should be routed. In one embodiment, the coherency point for an address within computer system 800 is the memory controller 816A-816D coupled to the memory that stores the bytes corresponding to the address. In other words, each memory controller 816A-816D may be responsible for ensuring that every memory access to the corresponding memory 814A-814D occurs in a cache-coherent fashion. Memory controllers 816A-816D may comprise control circuitry for interfacing to memories 814A-814D. Additionally, memory controllers 816A-816D may include request queues for queuing memory requests.
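
The memory map described above can be pictured as a lookup from an address to the memory that holds it and the processing node that owns that memory. The sketch below is illustrative only: the equal four-way split of a 4 GiB address space, the base addresses, and the names `MEMORY_MAP` and `lookup` are assumptions, since the text does not specify how the address space is divided.

```python
from bisect import bisect_right

# Hypothetical memory map: (base address, owning node, memory).
# The equal four-way split is an assumption made for illustration.
MEMORY_MAP = [
    (0x0000_0000, "812A", "814A"),
    (0x4000_0000, "812B", "814B"),
    (0x8000_0000, "812C", "814C"),
    (0xC000_0000, "812D", "814D"),
]
ADDRESS_SPACE_SIZE = 0x1_0000_0000  # assumed 4 GiB address space
BASES = [base for base, _, _ in MEMORY_MAP]


def lookup(address):
    """Return (node, memory) to which a request for `address` should be routed."""
    if not 0 <= address < ADDRESS_SPACE_SIZE:
        raise ValueError(f"address {address:#x} is outside the address space")
    # Find the last region whose base address is <= address.
    index = bisect_right(BASES, address) - 1
    _, node, memory = MEMORY_MAP[index]
    return node, memory


target = lookup(0x8000_0010)
```

In this sketch, the node returned by `lookup` is also the one whose memory controller would act as the coherency point for the address.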

In general, interface logic 818A-818L may comprise a variety of buffers for receiving packets from a link and for buffering packets to be transmitted on the link. As described above, in some embodiments interface logic 818 may be integrated within processor 200, for example within memory controller/peripheral interface 230 or within a separate interface distinct from an integrated memory controller. Computer system 800 may employ any suitable flow control mechanism for transmitting packets. For example, in one embodiment, each interface logic 818 stores a count of the number of each type of buffer within the receiver at the other end of the link to which that interface logic is connected. The interface logic does not transmit a packet unless the receiving interface logic has a free buffer in which to store the packet. As a receiving buffer is freed by routing a packet onward, the receiving interface logic transmits a message to the sending interface logic indicating that the buffer has been freed. Such a mechanism may be referred to as a "coupon-based" system.
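
The coupon-based mechanism described above can be sketched as a sender that keeps a per-type count of free buffers at the receiver and refuses to transmit when that count is zero. This is a minimal model under stated assumptions: the class names `Sender` and `Receiver`, the buffer counts, and the use of a return value to stand in for the buffer-freed message are all hypothetical; only the three packet types come from the text.

```python
from collections import deque


class Receiver:
    """Receiving interface logic with a fixed number of buffers per packet type."""

    def __init__(self, buffers_per_type):
        self.capacity = dict(buffers_per_type)
        self.buffers = {kind: deque() for kind in buffers_per_type}

    def accept(self, kind, packet):
        if len(self.buffers[kind]) >= self.capacity[kind]:
            raise RuntimeError("sender violated flow control")
        self.buffers[kind].append(packet)

    def forward_one(self, kind):
        """Route the oldest packet onward; the return value models the
        buffer-freed message sent back to the sender."""
        self.buffers[kind].popleft()
        return kind


class Sender:
    """Sending interface logic that tracks free receiver buffers by type."""

    def __init__(self, receiver):
        self.receiver = receiver
        self.credits = dict(receiver.capacity)  # all buffers start free

    def send(self, kind, packet):
        if self.credits[kind] == 0:
            return False                        # no free buffer: hold the packet
        self.credits[kind] -= 1
        self.receiver.accept(kind, packet)
        return True

    def buffer_freed(self, kind):
        self.credits[kind] += 1


receiver = Receiver({"command": 2, "probe": 1, "response": 1})
sender = Sender(receiver)
results = [sender.send("command", n) for n in range(3)]   # third send is held
sender.buffer_freed(receiver.forward_one("command"))      # receiver frees a buffer
retry = sender.send("command", 2)                         # now succeeds
```

Because the sender decrements its count before each transmission, the receiver never has to drop or reject a packet; the `RuntimeError` branch exists only to flag a broken sender in this model.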

I/O devices 820A-820B may be any suitable I/O devices. For example, I/O devices 820A-820B may include devices for communicating with another computer system to which the devices may be coupled (e.g., network interface cards or modems). Furthermore, I/O devices 820A-820B may include video accelerators, audio cards, hard or floppy disk drives or drive controllers, SCSI (Small Computer Systems Interface) adapters, telephony cards, sound cards, and a variety of data acquisition cards such as GPIB or fieldbus interface cards. Furthermore, any I/O device implemented as a card may also be implemented as circuitry on the main circuit board of system 800, within processor 200, and/or in software executed on a processing node. It is noted that the terms "I/O device" and "peripheral device" are used synonymously herein.

Furthermore, one or more processors 200 may be implemented in a more traditional personal computer (PC) structure that includes one or more interfaces of the processors to a bridge to one or more I/O interconnects and/or memory. For example, processor 200 may be configured for implementation within a north bridge/south bridge hierarchy. In such an embodiment, the north bridge (which may be coupled to or integrated within processor 200) may be configured to provide high-bandwidth connectivity to system memory, graphics device interfaces, and/or other system devices, while the south bridge may provide lower-bandwidth connectivity to other peripherals via various types of peripheral buses (e.g., Universal Serial Bus (USB), PCI, ISA, etc.).

Although the embodiments have been described above in considerable detail, numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. The appended claims are intended to be interpreted to embrace all such variations and modifications.

The present invention may be generally applicable to the field of microprocessors.

Claims

1. A processor (200) comprising:
a plurality of processor cores (100), each configured to independently execute instructions defined according to a programmer-visible instruction set architecture (ISA); and
a remote microcode unit (410) that is accessible by each of the processor cores (100) and that includes a remote microcode store (640) configured to store microcode entries and a sequencer configured to evaluate a sequence control field (330) of a retrieved entry (310) to determine whether to retrieve another microcode entry for a given processor core without the given processor core separately requesting the other microcode entry (310);
wherein each of the plurality of processor cores (100) includes a respective local microcode unit (400) configured to store microcode entries;
wherein each of the microcode routines corresponding to the microcode entries stored in the respective local microcode units (400) meets a performance sensitivity threshold requirement, and each of at least some of the microcode routines corresponding to the microcode entries stored in the remote microcode unit (410) does not meet the performance sensitivity threshold requirement; and
wherein any given core of the processor cores (100) is configured to:
generate a given microcode entry point corresponding to a particular microcode entry that includes one or more operations to be performed by the given processor core (100);
determine whether the particular microcode entry is stored in the respective local microcode unit (400) of the given processor core (100); and
in response to a determination that the particular microcode entry is not stored in the respective local microcode unit (400), convey a request for the particular microcode entry to the remote microcode unit (410).

2. The processor of claim 1, wherein the given processor core is further configured to use the given microcode entry point to determine whether the particular microcode entry is stored in the respective local microcode unit of the given processor core.

3. The processor of claim 1 or 2, wherein each of the processor cores further includes a respective instruction cache configured to store ones of the instructions, and wherein, in response to receiving the particular microcode entry after the request, the given processor core is further configured to store the particular microcode entry in the respective instruction cache.

4. A processor according to any one of the preceding claims, wherein at least a part of the remote microcode store comprises a writable memory.

5. A processor according to any one of the preceding claims, wherein the performance sensitivity threshold requirement is dependent on the frequency of execution of a microcode routine.

6. A processor according to any one of the preceding claims, wherein the performance sensitivity threshold requirement depends on the latency of microcode routine execution.

7. A processor according to any one of the preceding claims, wherein one or more of the local microcode units comprises a microcode cache configured to cache entries retrieved from the remote microcode unit, and wherein, to determine whether the particular microcode entry is stored in the respective local microcode unit of the given processor core, the given processor core is configured to determine whether the particular microcode entry is in the microcode cache before requesting the particular microcode entry from the remote microcode unit.

8. A system comprising:
a system memory; and
a processor according to claim 1, wherein the processor is coupled to the system memory.

9. A method comprising:
in response to a determination that a given microcode routine meets a performance sensitivity threshold requirement, storing one or more microcode entries corresponding to the given microcode routine in each local microcode unit (400) included in a plurality of processor cores (100);
in response to a determination that a given microcode routine does not meet the performance sensitivity threshold requirement, storing one or more microcode entries corresponding to the given microcode routine in a remote microcode unit (410) external to the plurality of processor cores (100);
a given core of the plurality of processor cores (100) generating a given microcode entry point corresponding to a particular microcode entry that includes one or more operations to be performed by the given processor core (100), wherein each of the plurality of processor cores (100) is configured to independently execute instructions defined according to a programmer-visible instruction set architecture (ISA) and each includes a respective local microcode unit (400) configured to store microcode entries;
the given processor core (100) determining whether the particular microcode entry is stored in the respective local microcode unit (400) of the given processor core (100); and
in response to a determination that the particular microcode entry is not stored in the respective local microcode unit (400), the given processor core (100) conveying a request for the particular microcode entry to the remote microcode unit (410), wherein the remote microcode unit (410) is accessible by each of the processor cores (100) and includes a remote microcode store (640) configured to store microcode entries and a sequencer configured to evaluate a sequence control field (330) of a retrieved entry (310) to determine whether to retrieve another microcode entry for the given processor core without the given processor core separately requesting the other microcode entry (310).
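
For illustration only, and not as part of the claims, the storing and lookup steps of the method can be sketched in software as follows. Every name here (`RemoteMicrocodeUnit`, `Core`, `place_routine`, the entry-point values, and the operation labels) is hypothetical, and the performance-sensitivity decision is simply passed in as a flag because the claims do not fix how it is computed. The sequencer and the sequence control field are omitted from this sketch.

```python
class RemoteMicrocodeUnit:
    """Shared store for routines that are not performance sensitive."""

    def __init__(self):
        self.store = {}      # entry point -> list of operations
        self.requests = 0    # number of requests conveyed by cores

    def fetch(self, entry_point):
        self.requests += 1
        return self.store[entry_point]


class Core:
    """Processor core with its own local microcode unit."""

    def __init__(self, remote):
        self.local = {}      # entry point -> list of operations
        self.remote = remote

    def get_entry(self, entry_point):
        # Check the local microcode unit first.
        if entry_point in self.local:
            return self.local[entry_point]
        # Not stored locally: convey a request to the remote unit.
        return self.remote.fetch(entry_point)


def place_routine(entry_point, operations, performance_sensitive, cores, remote):
    """Store a routine in every local unit or in the remote unit."""
    if performance_sensitive:
        for core in cores:
            core.local[entry_point] = operations
    else:
        remote.store[entry_point] = operations


remote = RemoteMicrocodeUnit()
cores = [Core(remote) for _ in range(2)]
place_routine(0x010, ["op_a", "op_b"], True, cores, remote)
place_routine(0x200, ["op_c"], False, cores, remote)

fast = cores[0].get_entry(0x010)            # served from the local unit
requests_after_fast = remote.requests
slow = cores[1].get_entry(0x200)            # served by the remote unit
requests_after_slow = remote.requests
```

In this model a performance-sensitive routine is duplicated in every core, so looking it up never generates a remote request, while a routine that is not performance sensitive is stored once and costs one request per lookup.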