JP2005182791A

JP2005182791A - General purpose embedded processor

Info

Publication number: JP2005182791A
Application number: JP2004359188A
Authority: JP
Inventors: Steven Frank; フランクスティーブ; Shigenori Imai; 繁規今井
Original assignee: Sharp Corp
Current assignee: Sharp Corp
Priority date: 2003-12-12
Filing date: 2004-12-10
Publication date: 2005-07-07

Abstract

PROBLEM TO BE SOLVED: To provide an embedded processor architecture comprising a plurality of virtual processing units that each execute processes or threads (collectively, "threads").

SOLUTION: One or more execution units, which are shared by the processing units, execute instructions from the threads. An event delivery mechanism delivers events such as, by way of non-limiting example, hardware interrupts, software-initiated signaling events ("software events") and memory events to respective threads without execution of the instructions. Each event can, per aspects of the invention, be processed by the respective threads without execution of the instructions outside that thread. The threads need not be constrained to execute on the same respective processing units during the lives of those threads, though in some embodiments, they can be so constrained.

COPYRIGHT: (C)2005, JPO&NCIPI

Description

This application is a continuation of commonly assigned, co-pending U.S. patent application Serial No. 10/449,732 (filed May 30, 2003), entitled "Virtual Processor Methods and Apparatus with Unified Event Notification and Consumer-Producer Memory Operations," and claims the benefit of priority of that application. The teachings of that application are incorporated herein by reference.

The present invention relates to digital data processing and, more particularly, to embedded processor architectures and operation. It applies to high-definition televisions, game systems, digital video recorders, video and/or audio players, personal digital assistants, personal knowledge navigators, mobile phones, and other multimedia and non-multimedia devices.

Prior art embedded processor-based or application systems typically combine (1) one or more general-purpose processors, e.g., of the ARM, MIPS or x86 variety, that handle user interface processing, high-level application processing, and operating system tasks, with (2) one or more digital signal processors (DSPs), including media processors, that handle specific interfaces or specific types of arithmetic operations within specific applications on a real-time/low-latency basis. Instead of, or in addition to, the DSPs, application-specific hardware is often provided to address dedicated needs that cannot be handled on a programmable basis, e.g., because a DSP cannot handle multiple activities at once or cannot meet the needs of a highly specialized computational element.

Examples of the prior art include personal computers, which typically combine a main processor with a separate graphics processor and a separate sound processor; game systems, which typically combine a main processor with a separately programmed graphics processor; digital video recorders, which typically combine a general-purpose processor, MPEG-2 decoder and encoder chips, and special-purpose digital signal processors; digital televisions, which typically combine a general-purpose processor, MPEG-2 decoder and encoder chips, and special-purpose DSPs or media processors; and mobile phones, which typically combine a processor for user interface and application processing with special-purpose DSPs for GSM, CDMA or other mobile phone protocol processing.

U.S. Pat. No. 6,408,381 (Patent Document 1) discloses a pipeline processor that uses snapshot files having entries indicating the state of instructions in the various pipeline stages. As an instruction moves through the pipeline, the corresponding snapshot file entry is changed. By monitoring these files, the device can forward intermediate results from one pipeline stage directly to another stage, e.g., via an internal operand bus, without first storing them in a register.

U.S. Pat. No. 6,219,780 (Patent Document 2) discloses a technique for improving the throughput of a computer having multiple execution units grouped into clusters. The patent proposes identifying "consumer" instructions that depend on the results of "producer" instructions. Multiple copies of each producer instruction are then executed, one copy for each cluster that will subsequently be used to execute a dependent consumer instruction.

Patent Document 1: U.S. Pat. No. 6,408,381
Patent Document 2: U.S. Pat. No. 6,219,780

The general prior art approach combines general-purpose processors with DSPs and/or application-specific hardware because multiple activities (e.g., events or threads) must be handled simultaneously on a real-time basis, and each activity requires a different type of computational element. Moreover, no single prior art processor has the capacity to handle all of the activities. Furthermore, some of the activities are of such a nature that prior art processors cannot properly handle more than one of them at a time. This is particularly true of real-time activities, whose timely performance is degraded, if not entirely prevented, by operating system intervention.

One problem with the prior art approach is the complexity of the hardware design, compounded by the complexity of the software needed to program and interface heterogeneous computational elements. Another problem is that both hardware and software must be re-engineered for every application. Moreover, prior art systems do not load balance: capacity cannot be transferred from one hardware element to another.

An object of the present invention is to provide improved apparatus and methods for digital data processing. A further object is to provide such apparatus and methods as support the execution of multiple activities, real-time or otherwise, on a single processor, as well as on multiple processors that can work together. A related object is to provide such apparatus and methods as are suited to embedded environments or applications. Another related object is to provide such apparatus and methods as ease design, manufacture, time to market, cost and/or maintenance.

A further object of the invention is to provide improved apparatus and methods for embedded (or other) processing that meet the computational, size, power and cost requirements of current and future appliances, including, by way of non-limiting example, digital televisions, digital video recorders, video and/or audio players, personal digital assistants, personal knowledge navigators, and mobile phones, to name but a few.

Yet another object is to provide improved apparatus and methods that support a range of applications.

Yet another object is to provide such apparatus and methods as are low-cost, low-power and/or support robust implementations that are quick to bring to market.

These and other objects are attained by the invention which provides, in one aspect, an embedded processor comprising a plurality of processing units that each execute processes or threads (collectively, "threads"). One or more execution units are shared by the processing units and execute instructions from the threads. An event delivery mechanism delivers events (including, by way of non-limiting example, hardware interrupts, software-initiated signaling events ("software events") and memory events) to the respective threads without execution of instructions. Each event can be processed by its respective thread without execution of instructions outside that thread.
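
The event-to-thread routing described above can be pictured with a small software model. This is only an illustrative sketch: in the processor the delivery happens in hardware, without any instructions being executed, whereas here a lookup table stands in for that mechanism. The names `Thread`, `EventDelivery`, `bind` and `deliver`, and the two thread states, are hypothetical and do not come from the patent text.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Thread:
    """Toy thread: holds its own pending events and handles them itself."""
    tid: int
    state: str = "waiting"                      # "waiting" or "executing" (assumed states)
    pending: deque = field(default_factory=deque)
    handled: list = field(default_factory=list)

    def step(self):
        # The thread consumes one of its own pending events; nothing outside
        # the thread (no OS dispatcher, no other thread) takes part.
        if self.pending:
            self.handled.append(self.pending.popleft())
        if not self.pending:
            self.state = "waiting"


class EventDelivery:
    """Hypothetical table mapping each event to the thread associated with it."""

    def __init__(self):
        self.event_to_thread = {}

    def bind(self, event_id, thread):
        self.event_to_thread[event_id] = thread

    def deliver(self, event_id):
        # Route the event straight to its thread and make that thread runnable.
        thread = self.event_to_thread[event_id]
        thread.pending.append(event_id)
        thread.state = "executing"
        return thread.tid
```

In this model a hardware interrupt, a software event and a memory event would all be plain keys in the same table, which mirrors the unified treatment of event types in the text; only the bound thread changes state when an event arrives.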

According to related aspects of the invention, the threads need not be constrained to execute on the same respective processing units during their lives. In still another related aspect, the execution units execute instructions from the threads without needing to know which thread they come from.

In another aspect, the invention provides an embedded processor as described above that additionally comprises a pipeline control unit which launches instructions from multiple threads for concurrent execution on multiple execution units. The pipeline control unit can comprise a plurality of instruction queues, each associated with a respective one of the virtual processing units, from which instructions are dispatched. In addition to decoding instruction classes from the instruction queues, the pipeline control unit can control access by the virtual processing units to the resources that provide source and destination registers for the dispatched instructions.
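
A minimal sketch of the per-unit instruction queues and class-based dispatch described above, assuming a very simple policy: in each cycle every execution unit accepts at most one instruction, and queues are scanned in a fixed order. That policy, the class name `PipelineControl` and its methods are assumptions made for illustration; the unit names are taken from the document's own list of execution units.

```python
from collections import deque

# Execution unit classes named in the document.
EXEC_UNITS = ("integer", "floating", "branch", "compare", "memory")


class PipelineControl:
    """Toy pipeline control with one instruction queue per virtual processing unit."""

    def __init__(self, num_vpus):
        self.queues = [deque() for _ in range(num_vpus)]

    def enqueue(self, vpu, unit_class, op):
        if unit_class not in EXEC_UNITS:
            raise ValueError(f"unknown instruction class: {unit_class}")
        self.queues[vpu].append((unit_class, op))

    def dispatch_cycle(self):
        """Launch the head instruction of each queue if its execution unit is free.

        Returns a dict mapping unit class -> (vpu index, op) for this cycle.
        """
        busy = {}
        for vpu, queue in enumerate(self.queues):
            if queue and queue[0][0] not in busy:
                unit_class, op = queue.popleft()   # "decode" the class, then launch
                busy[unit_class] = (vpu, op)
        return busy
```

With three queues holding an integer add, a memory load and an integer subtract, the first cycle launches the add and the load together from different threads, and the subtract waits for the integer unit until the next cycle.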

According to other related aspects, the plurality of execution units includes a branch execution unit, which is responsible for any of instruction address generation, address translation and instruction fetching. The branch execution unit can also maintain the state of the virtual processing units. It can be controlled by the pipeline control unit, which signals the branch execution unit as the instruction queue of each virtual processing unit is emptied.

According to other aspects of the invention, the plurality of virtual processing units described above can execute on a single embedded processor. In related aspects, those virtual processing units can instead execute on multiple embedded processors.

Further aspects of the invention provide an embedded processor comprising a plurality of processing units, each executing one or more threads, as described above. An event delivery mechanism delivers events to the respective threads with which they are associated, without execution of instructions, as described above. The processing units can be virtual processing units and can execute on one or more embedded processors, again as described above.

Embedded processors and systems as described above can execute multiple threads and process multiple events simultaneously, with little or no latency and without operating system switching overhead. The threads can range from real-time video processing to Linux operating system functions to end-use applications. Such embedded processors and systems therefore have application in supporting the direct execution of the diverse functions required by multimedia and other devices such as, by way of non-limiting example, high-definition digital televisions, game systems, digital video recorders, video and/or audio players, personal digital assistants, personal knowledge navigators, mobile phones, and other multimedia and non-multimedia devices.

Moreover, embedded processors and systems as described above allow all functions to be developed and executed in a single programming and execution environment, without the need for application-specific hardware accelerators, application-specific DSPs, application-specific processors or other application-specific hardware. When multiple such embedded processors are added together, they work seamlessly and provide greater overall capacity, including both more execution capacity and more concurrent thread-handling capacity. The addition of processors is transparent from the perspective of both the threads and the events directed to particular threads. Load balancing is also transparent in part, depending on how the processors assign events and threads for processing.

Other aspects of the invention provide digital data processing systems that have structures and operate as described above, albeit not on an embedded platform.

Further aspects of the invention provide methods paralleling the operation of the systems described above.

These and other aspects of the invention are evident in the drawings and in the detailed description that follows.

A more complete understanding of the invention may be attained by reference to the drawings.

An embedded processor according to the present invention comprises:
A. a plurality of processing units, each executing one or more processes or threads (collectively, "threads");
B. one or more execution units that are shared by, and in communication coupling with, the plurality of processing units, the execution units executing instructions from the threads; and
C. an event delivery mechanism that delivers events to the respective threads with which those events are associated, wherein the event delivery mechanism
i. is in communication coupling with the plurality of processing units, and
ii. delivers each such event to the respective thread without execution of instructions,
thereby achieving the above objects.

The thread to which an event is delivered may process that event without execution of instructions outside that thread.

The events may include any of hardware interrupts, software-initiated signaling events ("software events") and memory events.

The execution units may execute instructions from the threads without needing to know which thread they come from.

Each thread may or may not be constrained to execute on the same processing unit during its life.

At least one of the processing units may be a virtual processing unit.

An embedded processor according to the present invention comprises:
A. a plurality of virtual processing units, each executing one or more processes or threads (collectively, "threads"), wherein each thread is or is not constrained to execute on the same processing unit during its life;
B. a plurality of execution units;
C. a pipeline control that is in communication coupling with the plurality of processing units and the plurality of execution units, and that launches instructions from plural ones of the threads for concurrent execution on plural ones of the execution units; and
D. an event delivery mechanism that is in communication coupling with the plurality of processing units and that delivers events to the respective threads with which they are associated, without execution of instructions, the events including any of hardware interrupts, software-initiated signaling events ("software events") and memory events;
E. wherein the thread to which an event is delivered processes that event without execution of instructions outside that thread,
thereby achieving the above objects.

The pipeline control may comprise a plurality of instruction queues, each associated with a respective virtual processing unit.

The pipeline control may decode instruction classes from the instruction queues.

The pipeline control may control access by the processing units to the resources that provide source and destination registers for the instructions dispatched from the instruction queues.

The execution units may include a branch execution unit responsible for any of instruction address generation, address translation and instruction fetching.

The branch execution unit may maintain the state of the virtual processing units.

The pipeline control may control access by the virtual processing units to the execution units.

The pipeline control may signal a branch execution unit shared by the virtual processing units as the instruction queue of each virtual processing unit is emptied.

The pipeline control may idle the execution units.

The plurality of execution units may include any of an integer unit, a floating-point unit, a branch unit, a compare unit and a memory unit.

An embedded processor system according to the present invention comprises:
A. a plurality of embedded processors;
B. a plurality of virtual processing units executing on the plurality of embedded processors, each virtual processing unit executing one or more processes or threads (collectively, "threads"), wherein each thread is or is not constrained to execute on the same processing unit during its life;
C. one or more execution units that are shared by, and in communication coupling with, the plurality of processing units, the execution units executing instructions from the threads and including any of an integer unit, a floating-point unit, a branch unit, a compare unit and a memory execution unit; and
D. an event delivery mechanism that delivers events to the respective threads with which those events are associated, wherein the event delivery mechanism
i. is in communication coupling with the plurality of processing units, and
ii. delivers each such event to the respective thread without execution of instructions,
thereby achieving the above objects.

The thread to which an event is delivered may process that event without execution of instructions outside that thread.

The branch unit may be responsible for fetching the instructions that are to be executed for the threads.

The branch unit may additionally be responsible for any of instruction address generation and address translation.

The branch unit may comprise a thread state store that stores the thread state for each respective virtual processing unit.

The embedded processor system according to the present invention may be one in which:
A. a pipeline control is in communication coupling with the plurality of processing units and the plurality of execution units, and dispatches instructions from plural ones of the threads for concurrent execution on plural ones of the execution units;
B. the pipeline control comprises a plurality of instruction queues, each associated with a respective virtual processing unit; and
C. instructions fetched by the branch execution unit are placed in the instruction queue associated with the respective virtual processing unit in which the corresponding thread executes.
One or more instructions may be fetched at a time for a thread, with the goal of keeping the instruction queues at equal levels.
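
One way to read the goal of keeping the instruction queues at equal levels is a fetch policy that always tops up the shallowest queue. The sketch below assumes exactly that policy; the text does not say how the levelling is done, so the functions `pick_fetch_target` and `fetch_round`, the `active` mask and the `per_fetch` parameter are all hypothetical.

```python
def pick_fetch_target(queue_depths, active):
    """Return the index of the active virtual processing unit whose
    instruction queue is shallowest, or None if no unit is active."""
    candidates = [i for i in range(len(queue_depths)) if active[i]]
    if not candidates:
        return None
    # min() returns the first shallowest queue, so ties go to the lowest index.
    return min(candidates, key=lambda i: queue_depths[i])


def fetch_round(queue_depths, active, fetches, per_fetch=1):
    """Simulate `fetches` fetch operations, each adding `per_fetch`
    instructions to the queue chosen by pick_fetch_target."""
    depths = list(queue_depths)
    for _ in range(fetches):
        target = pick_fetch_target(depths, active)
        if target is None:
            break
        depths[target] += per_fetch
    return depths
```

Starting from queue depths of 3, 0 and 1, five single-instruction fetches under this policy bring all three queues to depth 3; a unit with no runnable thread is simply skipped.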

The pipeline control may dispatch one or more instructions at a time from a given instruction queue for execution.

The number of instructions that the pipeline control dispatches from a given instruction queue at a given time may be controlled by a stop flag in the instruction sequence of that queue.
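
The stop-flag mechanism can be illustrated with a short sketch. It assumes that each queued instruction carries a boolean stop flag, that the flagged instruction is the last member of its dispatch group, and that a hardware width limit caps the group size; the function name `dispatch_group` and the `max_width` parameter are hypothetical and not taken from the text.

```python
from collections import deque


def dispatch_group(queue, max_width=4):
    """Pop instructions from one instruction queue until an instruction with
    its stop flag set has been taken, the width limit is hit, or the queue
    runs dry. Each queue entry is an (op, stop_flag) pair."""
    group = []
    while queue and len(group) < max_width:
        op, stop = queue.popleft()
        group.append(op)
        if stop:            # the stop flag closes the current dispatch group
            break
    return group
```

For a queue holding `add`, `load` (stop), `mul`, `br` (stop), the first call returns the pair `add`, `load` and the second returns `mul`, `br`, so the instruction stream itself decides how many instructions go out together.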

The pipeline control may launch, and the execution units may execute, multiple instructions from one or more threads simultaneously.

An embedded processor according to the present invention comprises:
A. a plurality of processing units, each executing one or more processes or threads (collectively, "threads"), wherein each thread is or is not constrained to execute on the same processing unit during its life; and
B. an event delivery mechanism that delivers events to the respective threads with which those events are associated, wherein the event delivery mechanism
i. is in communication coupling with the plurality of processing units, and
ii. delivers each such event to the respective thread without execution of instructions,
thereby achieving the above objects.

An embedded processor system according to the present invention comprises:
A. a plurality of embedded processors;
B. a plurality of virtual processing units executing on the plurality of embedded processors, each virtual processing unit executing one or more processes or threads (collectively, "threads"), wherein each thread is or is not constrained to execute on the same processing unit during its life; and
C. an event delivery mechanism that delivers events to the respective threads with which those events are associated, wherein the event delivery mechanism
i. is in communication coupling with the plurality of processing units, and
ii. delivers each such event to the respective thread without execution of instructions,
thereby achieving the above objects.

A method of embedded processing according to the present invention comprises the steps of:
A. executing one or more processes or threads (collectively, "threads") on each of a plurality of processing units;
B. executing instructions from the threads in one or more execution units that are shared by the plurality of processing units; and
C. delivering events to the respective threads with which they are associated, without execution of instructions,
thereby achieving the above objects.

The method may include the step of processing an event in the thread to which it is delivered, without execution of instructions outside that thread.

The step of executing instructions from the threads in the one or more execution units need not require knowing which thread the instructions come from.

A method of embedded processing according to the present invention comprises the steps of:
A. executing one or more processes or threads (collectively, "threads") on each of a plurality of virtual processing units, wherein each thread is or is not constrained to execute on the same processing unit during its life;
B. launching instructions from plural ones of the threads for concurrent execution on plural ones of the execution units;
C. delivering events to the respective threads with which they are associated, without execution of instructions, the events including any of hardware interrupts, software-initiated signaling events ("software events") and memory events; and
D. processing each event in the thread to which it is delivered, without execution of instructions outside that thread,
thereby achieving the above objects.

The launching step may include decoding instruction classes from the instruction queues.

The launching step may include controlling access by the virtual processing units to the resources that provide source and destination registers for the instructions dispatched from the instruction queues.

The method may include performing any of instruction address generation, address translation and instruction fetching with a branch execution unit that is shared by all of the virtual processing units.

The method may include maintaining the state of the virtual processing units with the branch execution unit.

The execution units may include any of an integer unit, a floating-point unit, a branch unit, a compare unit and a memory unit.

A method of embedded processing according to the present invention comprises the steps of:
A. executing a plurality of virtual processing units on a plurality of embedded processors;
B. executing one or more processes or threads (collectively, "threads") on each of the plurality of virtual processing units, wherein each thread is or is not constrained to execute on the same virtual processing unit and/or the same embedded processor during its life;
C. executing instructions from the threads in one or more execution units that are shared by the plurality of processing units, the execution units including any of an integer unit, a floating-point unit, a branch unit, a compare unit and a memory execution unit; and
D. delivering events to the respective threads with which they are associated, without execution of instructions,
thereby achieving the above objects.

前記分岐実行ユニットを用いて、前記スレッドのために実行されるべき命令をフェッチするステップを包含してもよい。 The branch execution unit may be used to fetch an instruction to be executed for the thread.

前記分岐実行ユニットで、命令アドレスの生成およびアドレス変換のいずれかを実行するステップを包含してもよい。 The method may include performing either of instruction address generation and address translation with the branch execution unit.

本発明による方法は、Ａ．前記実行ユニットのうちの複数のユニット上で同時に実行するために、前記スレッドのうちの複数のスレッドからの命令をディスパッチするステップを包含し、Ｂ．前記ディスパッチするステップは、前記それぞれのスレッドごとにフェッチされた命令を、前記スレッドが実行される前記仮想処理ユニットと関連付けられた命令キュー内に配置するステップを包含してもよい。 The method according to the invention comprises: A. dispatching instructions from a plurality of the threads for simultaneous execution on a plurality of the execution units, wherein B. the dispatching step may include placing the instructions fetched for each respective thread in an instruction queue associated with the virtual processing unit on which that thread executes.

前記ディスパッチするステップは、前記命令キューを等しいレベルに保つという目的で、与えられたスレッドのために、１つ以上の命令をある時点にフェッチするステップを包含してもよい。 The dispatching step may include fetching one or more instructions at a point in time for a given thread for the purpose of keeping the instruction queue at an equal level.

前記ディスパッチするステップは、前記実行ユニットによって実行するために、１つ以上の命令を、ある時点に所与の命令キューからディスパッチするステップを包含してもよい。 The dispatching step may include dispatching one or more instructions from a given instruction queue at a point in time for execution by the execution unit.

本発明による方法は、１つ以上のスレッドから複数の命令を同時に起動および実行するステップを包含してもよい。 The method according to the present invention may include the step of simultaneously invoking and executing a plurality of instructions from one or more threads.

本発明による組込みプロセシングの方法は、Ａ．１つ以上のプロセスまたはスレッド（集合的に「スレッド」）を複数の処理ユニットの各々上で実行するステップであって、各スレッドは、前記スレッドが生きている間、同じ処理ユニット上での実行を強いられるか、または強いられない、ステップと、Ｂ．命令を実行することなく、イベントが関連付けられたそれぞれのスレッドに前記イベントを送達するステップとを包含し、これにより、上記目的を達成する。 The method of embedded processing according to the present invention comprises: A. executing one or more processes or threads (collectively, “threads”) on each of a plurality of processing units, wherein each thread may or may not be constrained to execute on the same processing unit during the life of that thread; and B. delivering events to the respective threads with which those events are associated without executing instructions, thereby achieving the above objective.

本発明による組込みプロセシングの方法は、Ａ．複数の仮想処理ユニットを複数の組込みプロセッサ上で実行するステップと、Ｂ．１つ以上のプロセスまたはスレッド（集合的に「スレッド」）を複数の仮想処理ユニットの各々上で実行するステップであって、各仮想処理ユニットは、前記スレッドが生きている間、同じ仮想処理ユニットおよび／または同じ組込みプロセッサ上での実行を強いられるか、または強いられない、ステップと、
Ｃ．命令を実行することなく、イベントが関連付けられたそれぞれのスレッドに前記イベントを送達するステップとを包含し、これにより、上記目的を達成する。 The method of embedded processing according to the present invention comprises: A. executing a plurality of virtual processing units on a plurality of embedded processors; B. executing one or more processes or threads (collectively, “threads”) on each of the plurality of virtual processing units, wherein each virtual processing unit may or may not be constrained to execute on the same virtual processing unit and/or the same embedded processor during the life of the thread; and
C. delivering events to the respective threads with which those events are associated without executing instructions, thereby achieving the above objective.

本発明による方法は、前記スレッドの外で命令を実行することなく、前記スレッドが送達される前記イベントを処理してもよい。 In the method according to the invention, each thread may process the events delivered to it without executing instructions outside that thread.

図１は、本発明のある実施形態により構成され動作し、かつ、本明細書および添付の図面中で「ＳＥＰ」と呼ばれることがあるプロセッサモジュール５を示す。このモジュールは、ＰＣ、ワークステーション、またはメインフレームコンピュータ等（しかしながら、示された実施形態は組込みプロセッサとして利用される）の汎用プロセッサの基礎を提供し得る。 FIG. 1 illustrates a processor module 5 that is constructed and operative in accordance with an embodiment of the present invention, and may be referred to as “SEP” herein and in the accompanying drawings. This module may provide the basis for a general purpose processor such as a PC, workstation, or mainframe computer (but the illustrated embodiment is utilized as an embedded processor).

モジュール５は、単一で、または、１つ以上の他のそのようなモジュールと組み合わせて用いられ得、計算要件が、実質的に並列であり、同時に実行する複数のアプリケーションおよび／または命令レベル並列性から利益を得るデバイスまたはシステムに特に適している。これは、リアルタイム要件のデバイスまたはシステム、マルチメディアアプリケーションを実行するデバイスまたはシステム、ならびに／あるいは、イメージ、信号、グラフィックおよび／またはネットワーク処理等の高度な計算要件のデバイスまたはシステムを含み得る。このモジュールは、さらに、例えば、アプリケーションを同時に使用する場合に単一プラットフォーム上に複数のアプリケーションを統合するために適切である。このモジュールは、これが組み込まれるか、そうでなければ内蔵されているデバイスおよび／またはシステムを横断して、ならびにネットワーク（有線、無線、その他）を横断して、あるいは、これらのデバイスおよび／またはシステムが結合される他の媒体を横断してシームレスアプリケーションを提供する。さらに、このモジュールは、ピアツーピア（Ｐ２Ｐ）アプリケーション、およびユーザの対話性を有するアプリケーションに適している。これまでの記述は、モジュール５が適しているアプリケーションおよび環境の広範囲な列挙であることを意図しているのではなく、１例にすぎない。 Module 5 may be used singly or in combination with one or more other such modules, and is particularly suited to devices or systems whose computational requirements are substantially parallel and that benefit from multiple concurrently executing applications and/or instruction-level parallelism. These may include devices or systems with real-time requirements, those that execute multimedia applications, and/or those with high computational requirements, such as image, signal, graphics and/or network processing. The module is also suited to integrating multiple applications on a single platform, for example, where the applications are used concurrently. It provides for seamless application execution across the devices and/or systems in which it is embedded or otherwise incorporated, as well as across the networks (wired, wireless, or otherwise) or other media via which those devices and/or systems are coupled. Moreover, the module is suited to peer-to-peer (P2P) applications and to applications with user interactivity. The foregoing is not intended to be an exhaustive list of the applications and environments to which module 5 is suited, but merely a set of examples.

モジュール５が組込まれ得るデバイスおよびシステムの例は、特に、例えば、図２４に示されるタイプのデジタルＬＣＤ−ＴＶを含み、モジュール５は、ＳＯＣ（ｓｙｓｔｅｍ−ｏｎ−ａ−ｃｈｉｐ）コンフィギュレーションで具現化される。（当然、モジュールは、単一チップ上に組み入れられる必要がなく、むしろ、おそらく、複数チップ、１つ以上の基板、別々に収容された１つ以上のデバイス、および／またはこれらの組み合わせを含む多数のフォームファクタのいずれかで具現化され得ることがわかる。）さらなる例は、デジタルビデオレコーダ（ＤＶＲ）およびサーバ、ＭＰ３サーバ、携帯電話、スチルおよびビデオカメラ（ＤＶＲ）、ゲームプラットフォーム、ユニバーサルネットワークディスプレイ（例えば、デジタルＬＣＤ−ＴＶ、ネットワーク情報／インターネットアプライアンス、および汎用アプリケーションプラットフォームの組み合わせ）、Ｇ３携帯電話、パーソナルデジタルアシスタント等を含む。 Examples of devices and systems in which module 5 can be embedded include, among others, a digital LCD-TV of the type shown in FIG. 24, in which module 5 is embodied in a system-on-a-chip (SoC) configuration. (Of course, the module need not be embodied on a single chip; it may instead be embodied in any of a multitude of form factors, including multiple chips, one or more boards, one or more separately housed devices, and/or a combination of the foregoing.) Further examples include digital video recorders (DVR) and servers, MP3 servers, mobile phones, still and video cameras (DVR), game platforms, universal networked displays (e.g., combinations of digital LCD-TVs, networked information/Internet appliances, and general-purpose application platforms), G3 mobile phones, personal digital assistants, and so forth.

モジュール５は、スレッド処理ユニット（ＴＰＵ）１０〜２０、レベルワン（Ｌ１）命令およびデータキャッシュ２２、２４、レベルツー（Ｌ２）キャッシュ２６、パイプラインコントロール２８および実行（または機能ユニット）３０〜３８、すなわち、整数処理ユニット、浮動小数点処理ユニット、比較ユニット、メモリユニット、および分岐ユニットを備える。ユニット１０〜３８は、図面に示されているように結合されており、以下において、特に詳述されない。 Module 5 includes thread processing units (TPUs) 10-20, level one (L1) instruction and data caches 22, 24, a level two (L2) cache 26, pipeline control 28, and execution (or functional) units 30-38, namely, an integer processing unit, a floating-point processing unit, a compare unit, a memory unit, and a branch unit. Units 10-38 are coupled as shown in the drawings and are not specifically described below.

概観として、ＴＰＵ１０〜２０は、仮想処理ユニットであり、プロセッサモジュール５内に物理的に実装されており、各々が結合され、そして、１つ（以上）のプロセス（単数または複数）および／またはスレッド（単数または複数）（集合的に、スレッド（単数または複数））を所与の瞬間に処理する。ＴＰＵは、汎用レジスタ、述語レジスタ、コントロールレジスタで表されるそれぞれのスレッド単位の状態を有する。ＴＰＵは、スレッドの各サイクルの任意の組み合わせから５つまでの命令を起動する、起動およびパイプラインコントロール等のハードウェアを共有する。図面に示されるように、ＴＰＵは、さらに、起動された命令を、どのスレッドからのものかを知ることを必要とせずに、起動された命令を独立して実行する実行ユニット３０〜３８を共有する。 By way of overview, TPUs 10-20 are virtual processing units, physically implemented within processor module 5, each of which is bound to and processes one (or more) process(es) and/or thread(s) (collectively, thread(s)) at any given instant. Each TPU has its own per-thread state, represented in general-purpose registers, predicate registers, and control registers. The TPUs share hardware, such as launch and pipeline control, which launches up to five instructions from any combination of threads each cycle. As shown in the drawings, the TPUs also share execution units 30-38, which execute the launched instructions independently, without needing to know which thread each instruction came from.

さらなる概観として、示されたＬ２キャッシュ２６は、スレッド処理ユニット１０〜２０のすべてによって共有され、命令およびデータを、モジュール５が組み入れられたチップの内部（ローカル）および外部の両方の記憶装置に格納する。示されたＬ１命令キャッシュ２２およびデータキャッシュ２４もまた、ＴＰＵ１０〜２０によって共有され、上述のチップにローカルである記憶装置に搭載される（当然、他の実施形態では、レベル１およびレベル２キャッシュは、例えば、モジュール５に完全にローカルである、完全に外部である、またはその他の異なった構成であり得ることが分かる）。 By way of further overview, the illustrated L2 cache 26 is shared by all of the thread processing units 10-20 and stores instructions and data in storage both internal (local) and external to the chip on which module 5 is embodied. The illustrated L1 instruction cache 22 and data cache 24 are likewise shared by TPUs 10-20 and are based in storage local to that chip. (Of course, in other embodiments the level 1 and level 2 caches may be configured differently, for example, entirely local to module 5, entirely external, or otherwise.)

モジュール５の設計はスケーラブルである。２つ以上のモジュール５が、ＳｏＣまたは他のコンフィギュレーションにまとめられ（ｇａｎｇｅｄ）、これにより、アクティブスレッドの数および処理パワー全体が増加する。モジュール５によって用いられ、本明細書中に記載されるスレディングモデルに基づいて、結果としてのＴＰＵの増加はソフトウェアトランスペアレントである。示されたモジュール５は、６つのＴＰＵ１０〜２０を有するが、他の実施形態は、より多くの数の（さらに、当然、より少ない数の）ＴＰＵを有してもよい。さらに、追加的機能ユニットは、例えば、サイクルごとに起動される命令の数を５から１０〜１５以上に増加して提供されてもよい。Ｌ１およびＬ２キャッシュ構造の以下の記載において明らかなように、これらの２つはスケーリングされ得る。 The design of module 5 is scalable. Two or more modules 5 may be ganged in an SoC or other configuration, thereby increasing the number of active threads and the overall processing power. Because of the threading model used by module 5 and described herein, the resulting increase in TPUs is transparent to software. The illustrated module 5 has six TPUs 10-20, but other embodiments may have a greater number (or, of course, a smaller number) of TPUs. Additional functional units may also be provided, for example, to raise the number of instructions launched per cycle from 5 to 10-15 or more. As will be evident from the following description of the L1 and L2 cache structures, these two can also be scaled.

示されたモジュール５は、アプリケーションソフトウェア環境としてＬｉｎｕｘを利用する。マルチスレッドと共に、これは、リアルタイムおよび非リアルタイムアプリケーションが１つのプラットフォーム上で動作することを可能にする。これにより、さらに、オープンソースソフトウェアおよびアプリケーションのレバレッジが製品機能を向上させることが可能になる。さらに、これにより、種々のプロバイダによるアプリケーションの実行が可能になる。 The illustrated module 5 uses Linux as its application software environment. In conjunction with multithreading, this enables real-time and non-real-time applications to run on one platform. It also permits open source software and applications to be leveraged for increased product functionality, and it enables execution of applications from a variety of providers.

（マルチスレッド） (Multithreading)
上述のように、ＴＰＵ１０〜２０は、仮想処理ユニットであり、単一のプロセッサモジュール５内に物理的に実装されており、各々が結合され、所与の瞬間に１つ（以上）のスレッド（単数または複数）を処理する。スレッドは、広範囲にわたるアプリケーションを具現化し得る。デジタルＬＣＤ−ＴＶにおいて有用な例は、例えば、ＭＰＥＧ２信号多重分離、ＭＰＥＧ２ビデオデコード、ＭＰＥＧオーディオデコード、デジタル−ＴＶユーザインターフェースオペレーション、オペレーティングシステム実行（例えば、Ｌｉｎｕｘ）を含む。当然、これらおよび／または他のアプリケーションは、デジタルＬＣＤＴＶ、およびモジュール５が組み入れられ得る種々の他のデバイスおよびシステムにおいて有用であり得る。 As noted above, TPUs 10-20 are virtual processing units, physically implemented within a single processor module 5, each of which is bound to and processes one (or more) thread(s) at any given instant. The threads can embody a wide range of applications. Examples useful in a digital LCD-TV include MPEG2 signal demultiplexing, MPEG2 video decoding, MPEG audio decoding, digital-TV user interface operation, and operating system execution (e.g., Linux). Of course, these and/or other applications may be useful in digital LCD-TVs and in the various other devices and systems in which module 5 may be embodied.

ＴＰＵによって実行されるスレッドは独立しているが、メモリおよびイベントを通じて通信し得る。プロセッサモジュール５の各サイクルの間、実行ユニットまたは機能ユニット３０〜３８を利用するために必要な数のアクティブ実行スレッドから命令が起動される。示された実施形態では、この点に関して「公正」を保証するためにそれぞれのスレッドにラウンドロビンプロトコルが課せられる（しかしながら、他の実施形態では、特権プロトコルまたは他のプロトコルが代替的または追加的に用いられてもよい）。（例えば、アプリケーションを起動、スレッドの活性化を容易にする等のために）１つ以上のシステムスレッドがＴＰＵ上で実行し得るが、アクティブスレッドを実行するためにオペレーションシステムの介入は必要とされない。 The threads executed by the TPU are independent, but can communicate through memory and events. During each cycle of the processor module 5, instructions are launched from as many active execution threads as are necessary to utilize the execution units or functional units 30-38. In the illustrated embodiment, a round-robin protocol is imposed on each thread to ensure “fairness” in this regard (however, in other embodiments, privileged protocols or other protocols may be alternatively or additionally May be used). One or more system threads can run on the TPU (eg, to launch an application, facilitate thread activation, etc.), but no operating system intervention is required to run the active thread .
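
The round-robin launch policy described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the hardware logic of the illustrated embodiment: the function name `launch_cycle`, the modeling of per-thread instruction queues as Python deques, and the choice to take one instruction per visit are all hypothetical. Only the launch width of five instructions per cycle and the round-robin fairness come from the text. The sketch also ignores which execution unit each instruction needs.

```python
from collections import deque

LAUNCH_WIDTH = 5  # the illustrated module launches up to five instructions per cycle


def launch_cycle(queues, start):
    """Launch up to LAUNCH_WIDTH instructions, visiting threads round-robin.

    queues: list of deques, one per TPU (hypothetical model of the
            per-thread instruction queues).
    start:  index of the thread that is served first in this cycle.
    Returns (launched, next_start), where launched is a list of
    (thread_index, instruction) pairs.
    """
    launched = []
    n = len(queues)
    i = start
    idle_visits = 0  # consecutive threads visited that had nothing to launch
    while len(launched) < LAUNCH_WIDTH and idle_visits < n:
        q = queues[i]
        if q:
            launched.append((i, q.popleft()))
            idle_visits = 0
        else:
            idle_visits += 1
        i = (i + 1) % n
    return launched, i
```

Passing the returned `next_start` into the following cycle rotates the starting thread, so no single thread is always served first.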

プロセッサごとの複数のアクティブスレッド（仮想プロセッサ）をサポートする基本的理由は以下のとおりである。 The basic reason for supporting multiple active threads (virtual processors) per processor is as follows.

機能能力 Functional capability
プロセッサごとの複数のアクティブスレッドは、単一のマルチスレッドプロセッサが複数のアプリケーション、メディア、信号処理およびネットワークプロセッサに取って代わることを可能にする。これは、さらに、アプリケーション、イメージ、信号処理およびネットワーキングに対応する複数のスレッドが低レイテンシーおよび高性能で同時に動作および相互運用することを可能にする。コンテキストスイッチおよびインターフェースオーバーヘッドは最小化される。ＭＰ４デコードのような単一のイメージ処理アプリケーション内であっても、スレッドは、例えば、フレームｎが作成されている間にフレームｎ＋１のデータを準備するために、パイプラインの態様で、容易に同時に動作し得る。 Multiple active threads per processor allow a single multithreaded processor to replace multiple application, media, signal processing, and network processors. They also allow multiple threads corresponding to application, image, signal processing, and networking tasks to operate and interoperate concurrently with low latency and high performance. Context switch and interface overhead is minimized. Even within a single image processing application, such as MP4 decode, threads can readily operate concurrently in a pipelined manner, for example, preparing the data for frame n+1 while frame n is being composed.

性能 Performance
プロセッサごとの複数のアクティブスレッドは、機能ユニットをより良好に利用し、かつ、メモリおよび他のイベントレイテンシーを容認することによって個々のプロセッサの性能を向上させる。スレッドを４つまで同時に実行するために２ｘ性能の向上を達成することは珍しくはない。パワー消費およびダイサイズの増大は重要ではなく、従って、ユニットごとの性能、パワーおよび価格パフォーマンスが改善される。プロセッサごとの複数のアクティブスレッドは、さらに、分岐およびキャッシュミスによる性能の低下を、これらのイベントの間に別のスレッドを実行することによって低減する。さらに、これは、ほとんどのコンテキストスイッチオーバーヘッドを除去し、リアルタイムアクティビティのレイテンシーを低減する。さらに、これにより、一般的な高性能イベントモデルがサポートされる。 Multiple active threads per processor improve the performance of the individual processor by making better use of the functional units and by tolerating memory and other event latencies. It is not unusual to gain a 2x performance increase by supporting up to four simultaneously executing threads. The accompanying increases in power consumption and die size are not significant, so performance per unit power and price-performance both improve. Multiple active threads per processor also reduce the performance lost to branches and cache misses, because another thread executes during those events. In addition, they eliminate most context switch overhead, lower the latency of real-time activities, and support a general, high-performance event model.

インプリメンテーション Implementation
プロセッサごとの複数のアクティブスレッドは、パイプラインおよび設計全体の簡略化ももたらす。複雑な分岐プレディケーションの必要がない。なぜなら、別のスレッドが動作し得るからである。これにより、複数のプロセッサチップに対して単一プロセッサチップのコストが低減され、かつ、他の複雑性が除去された場合にコストが低減される。さらに、これにより、単位電力当りの性能が向上する。 Multiple active threads per processor also simplify the pipeline and the overall design. There is no need for complex branch prediction, since another thread can run in the meantime. This lowers the cost of a single processor chip relative to multiple processor chips, and lowers cost further as other complexities are eliminated. It also improves performance per unit power.

図２は、従来のスーパースカラプロセッサによるスレッド処理を、示されたプロセッサモジュール５のスレッド処理と比較する。図２Ａを参照して、スーパースカラプロセッサにおいて、単一の実行スレッドから（斜めのスティップリングで示される）の命令が、実行されているコード内の実際の並列および依存性に基づいて利用可能な実行ユニット上で実行するように動的にスケジューリングされる。これは、各サイクルの間に平均的にほとんどの実行ユニットを利用することができないことを意味する。実行ユニットの数が増加するにつれて、利用率が、通常、低下する。さらに、実行ユニットは、メモリシステムおよび分岐予測ミス／待ちの間にアイドル状態になる。 FIG. 2 contrasts thread processing by a conventional superscalar processor with that of the illustrated processor module 5. Referring to FIG. 2A, in a superscalar processor, instructions from a single executing thread (indicated by diagonal stippling) are dynamically scheduled to execute on the available execution units based on the actual parallelism and dependencies within the code being executed. This means that, on average, most execution units cannot be utilized during each cycle. As the number of execution units increases, the utilization rate typically goes down. Execution units also sit idle during memory system and branch prediction misses and waits.

これとは異なり、図２Ｂを参照して、モジュール５において、複数のスレッドからの命令（異なったそれぞれのスティップリングパターンで示される）が同時に実行する。各サイクル、モジュール５が複数のスレッドからの命令が利用可能な実行ユニットリソースを最適に利用するようにスケジューリングする。従って、実行ユニットの利用および性能全体がより高くなる一方で、同時に、ソフトウェアにトランスペアレントである。 In contrast, referring to FIG. 2B, in module 5, instructions from multiple threads (indicated by different respective stippling patterns) execute simultaneously. Each cycle, module 5 schedules instructions from multiple threads so as to make optimal use of the available execution unit resources. Overall execution unit utilization and performance are therefore higher, while remaining transparent to software.

（イベントおよびスレッド） (Events and Threads)
示された実施形態では、イベントは、割り込み等のハードウェア（またはデバイス）イベントと、デバイスイベントと等しいがソフトウェア命令によって開始されるソフトウェアイベントと、キャッシュミスの完了またはメモリ生産者−消費者（フル−エンプティ）変換の結果等のメモリイベントとを含む。ハードウェアの割り込みは、通常、アイドル状態のスレッド（例えば、ターゲットスレッドまたはターゲットグループのスレッド）によって扱われるデバイスイベントに変換される。ソフトウェアイベントは、例えば、あるスレッドが別のスレッドを直接起こすことを可能にするために用いられ得る。 In the illustrated embodiment, events include hardware (or device) events, such as interrupts; software events, which are equivalent to device events but are initiated by software instructions; and memory events, such as the completion of a cache miss or the result of a memory producer-consumer (full-empty) transition. Hardware interrupts are translated into device events, which are typically handled by an idle thread (e.g., a targeted thread or a thread in a targeted group). Software events can be used, for example, to allow one thread to wake another thread directly.

各イベントは、アクティブスレッドにバインドする。特定のスレッドバインディングが存在しない場合、これは、示された実施形態では常時アクティブであるデフォルトシステムスレッドにバインドする。そのスレッドは、その後、新しいスレッドを仮想プロセッサ上でスケジューリングすることを含む、イベントの処理を適宜行う。特定のスレッドバインディングが存在する場合、ハードウェアまたはソフトウェアイベントが送達されると（後述されるように、イベント送達メカニズムで）、ターゲットスレッドはアイドル状態から実行状態に遷移する。ターゲットスレッドがすでにアクティブであり、実行している場合、イベントはハンドリングのためにデフォルトシステムスレッドに向けられる。 Each event binds to an active thread. If no specific thread binding exists, the event binds to the default system thread, which, in the illustrated embodiment, is always active. That thread then handles the event as appropriate, which can include scheduling a new thread on a virtual processor. If a specific thread binding does exist, then upon delivery of a hardware or software event (by the event delivery mechanism discussed below), the targeted thread transitions from idle to executing. If the targeted thread is already active and executing, the event is directed to the default system thread for handling.

示された実施形態では、スレッドは、キャッシュミスおよび同期化待ちを含むメモリシステムストール（略して遮断（ｂｌｏｃｋａｇｅ））；分岐ミス予測（非常に簡単には遮断）；明示的な（ソフトウェアまたはハードウェアによって生成された）イベント待ち；ならびにシステムスレッドによるアプリケーションスレッドの明示的ブロックにより、非実行状態（ブロック）になり得る。 In the illustrated embodiment, a thread can become non-executing (blocked) because of: a memory system stall (a brief blockage), including a cache miss or a wait on synchronization; a branch misprediction (a very brief blockage); an explicit wait for an event (generated by software or hardware); or explicit blocking of an application thread by a system thread.

本発明の好ましい実施形態では、イベントは、効率的な動的分散実行環境の基礎を提供する物理プロセッサモジュール５とネットワークとを介してイベントを操作する。従って、例えば、デジタルＬＣＤ−ＴＶにおいて実行するモジュール５、あるいは他のデバイスまたはシステムは、スレッドを実行し、ネットワーク（無線、特権、その他）を介して動的に拡散するメモリ、またはサーバまたは他の（リモート）デバイスからの他の媒体を利用する。スレッドおよびメモリベースのイベントは、例えば、スレッドがその原理により動作する任意のモジュール５上でトランスペアレントに実行し得ることを保証する。これにより、例えば、モバイルデバイスが他のネットワークデバイスのパワーをレバレッジすることが可能になる。これにより、さらに、リモートネットワークデバイス上でのピアツーピアおよびマルチスレッドアプリケーションのトランスペアレントな実行が可能になる。優利な点は、性能の向上、機能の向上、およびパワー消費の低減を含む。 In a preferred embodiment of the invention, events operate across physical processor modules 5 and networks, providing the basis for an efficient dynamic distributed execution environment. Thus, for example, a module 5 executing in a digital LCD-TV or other device or system can execute threads and use memory that is dynamically migrated over a network (wireless, privileged, etc.) or other medium from a server or other (remote) device. Thread- and memory-based events ensure, for example, that a thread can execute transparently on any module 5 that operates on the same principles. This enables, for example, mobile devices to leverage the power of other networked devices. It also permits transparent execution of peer-to-peer and multithreaded applications on remote networked devices. Benefits include increased performance, increased functionality, and lower power consumption.

スレッドは、システムおよびアプリケーションの２つの特権レベルで動作する。システムスレッドは、そのスレッドおよびプロセッサ内のすべての他のスレッドのすべての状態にアクセスし得る。アプリケーションスレッドは、それ自体に対応する非特権状態にのみアクセスし得る。デフォルトにより、スレッド０はシステム特権で動作する。他のスレッドは、システム特権スレッドによって生成されている場合にシステム特権用に構成され得る。 Threads operate at two privilege levels: system and application. A system thread can access all the states of that thread and all other threads in the processor. An application thread can only access the unprivileged state corresponding to itself. By default, thread 0 operates with system privileges. Other threads can be configured for system privileges if they are created by system privilege threads.

図３を参照して、示された実施形態では、スレッド状態は以下のとおりである。 Referring to FIG. 3, in the illustrated embodiment, the thread states are as follows:

アイドル状態（または非アクティブ） Idle (or non-active)
スレッドコンテキストはＴＰＵ内にロードされ、スレッドは、命令を実行していない。アイドルスレッドは、例えば、ハードウェアまたはソフトウェアイベントが生じた場合、実行状態に遷移する。 A thread context is loaded into a TPU, and the thread is not executing instructions. An idle thread transitions to the executing state when, for example, a hardware or software event occurs.

待ち状態（またはアクティブ、待ち状態） Waiting (or active, waiting)
スレッドコンテキストは、ＴＰＵ内にロードされるが、現在、命令を実行していない。待ちスレッドは、例えば、メモリ命令が続行することを可能にする、キャッシュオペレーション等の、起こるが待たれるイベントが完了した場合に実行状態に遷移する。 A thread context is loaded into a TPU but is not currently executing instructions. A waiting thread transitions to the executing state when the event it is waiting for occurs, for example, completion of a cache operation that allows a memory instruction to proceed.

実行状態（またはアクティブ、実行状態） Executing (or active, executing)
スレッドコンテキストがＴＰＵ内にロードされ、現在、命令を実行している。スレッドは、例えば、キャッシュミスまたはエンプティ／フィル（生産者−消費者メモリ（ｃｏｎｓｕｍｅｒ−ｐｒｏｄｕｃｅｒｍｅｍｏｒｙ））命令が完了し得ない動作をキャッシュが完了することをメモリ命令が待たなければならない場合、メモリ命令が待ち状態に遷移する。イベント命令が実行された場合、スレッドはアイドル状態に遷移する。 A thread context is loaded into a TPU and is currently executing instructions. The thread transitions to the waiting state when a memory instruction must wait for the cache to complete an operation, for example, on a cache miss or when an Empty/Fill (producer-consumer memory) instruction cannot be completed. The thread transitions to the idle state when an event instruction is executed.
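
The three thread states and the transitions named above can be summarized as a small state table. This is a minimal sketch, assuming hypothetical trigger labels (`hw_or_sw_event`, `memory_wait`, `awaited_event_complete`, `event_instruction`) that stand in for the conditions in the text; the patent does not define these names, and FIG. 3 may show transitions that are not listed here.

```python
from enum import Enum


class ThreadState(Enum):
    IDLE = "idle"
    WAITING = "waiting"
    EXECUTING = "executing"


# (current state, trigger) -> next state.
# Trigger names are hypothetical labels for the conditions described in the text.
TRANSITIONS = {
    (ThreadState.IDLE, "hw_or_sw_event"): ThreadState.EXECUTING,
    (ThreadState.EXECUTING, "memory_wait"): ThreadState.WAITING,
    (ThreadState.WAITING, "awaited_event_complete"): ThreadState.EXECUTING,
    (ThreadState.EXECUTING, "event_instruction"): ThreadState.IDLE,
}


def next_state(state, trigger):
    """Return the next state, or the unchanged state if the trigger does not apply."""
    return TRANSITIONS.get((state, trigger), state)
```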

スレッドは、各ＴＰＵと関連したスレッドイネーブルビット（またはフラグまたは他のインジケータ）が、ソフトウェアがＴＰＵにスレッドをロードおよびアンロードするために、いずれのスレッド状態も妨げることなくスレッドをディセーブルする。 A thread enable bit (or flag or other indicator) associated with each TPU disables thread execution without disturbing any thread state, so that software can load a thread onto, and unload it from, the TPU.

プロセッサモジュール５は、実行する命令のアベイラビリティに基づいてアクティブスレッドにわたって負荷分散する。このモジュールは、さらに、スレッドごとの命令キューを一様にフルに保つことを試みる。従って、アクティブの状態に留まるスレッドがほとんどの命令を実行する。 Processor module 5 load balances across the active threads based on the availability of instructions to execute. The module also attempts to keep the per-thread instruction queues uniformly full. Thus, the threads that remain active the most execute the most instructions.
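
One simple way to keep per-thread instruction queues uniformly full is to fetch next for the active thread whose queue is currently the emptiest. The sketch below illustrates that idea only; the function name `pick_thread_to_fetch`, its arguments, and the lowest-index tie-break are assumptions for illustration, and the text does not specify the selection rule the module actually uses.

```python
def pick_thread_to_fetch(queue_depths, active):
    """Return the index of the active thread with the emptiest instruction queue.

    queue_depths: number of instructions currently queued for each thread.
    active:       True for each thread that is in the active state.
    Ties are broken in favor of the lowest thread index (an arbitrary choice).
    Returns None when no thread is active.
    """
    candidates = [i for i, is_active in enumerate(active) if is_active]
    if not candidates:
        return None
    return min(candidates, key=lambda i: (queue_depths[i], i))
```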

（イベント） (Events)
図４は、本発明のある実施形態によるシステムにおけるイベント送達メカニズムを示す。イベントがスレッドに知らされた場合、このスレッドは、（現在、実行状態にある場合）実行をペンディングし、かつ、例えば、仮想アドレス０ｘ０でデフォルトイベントハンドラを実行することによって、そのイベントを認識する。 FIG. 4 illustrates an event delivery mechanism in a system according to an embodiment of the invention. When an event is signaled to a thread, the thread acknowledges the event by suspending execution (if it is currently in the executing state) and executing the default event handler, for example, at virtual address 0x0.

示された実施形態では、特定のスレッドに知らされ得る５つのイベントタイプがある。 In the illustrated embodiment, there are five event types that can be signaled to a particular thread.

示されたイベントキュー４０は、ハードウェアデバイスおよびソフトウェアベースのイベント命令（例えば、ソフトウェア「割り込み」）によって提供されたイベントを仮想スレッド数（ＶＴＮ）およびイベント数を含むタプルの形態で段階的に行う。 The illustrated event queue 40 stages events presented by hardware devices and software-based event instructions (e.g., software “interrupts”) in the form of tuples comprising a virtual thread number (VTN) and an event number.

当然、ハードウェアデバイスおよびソフトウェア命令によって提供されたイベントは、他の形態で、および／または他の情報を含んで提供されてもよいことが分かる。 Of course, it will be appreciated that the events presented by hardware devices and software instructions may be provided in other forms and/or may contain other information.

イベントタプルは、次に、受け取られた順序で、イベント−スレッドルックアップテーブル（イベントテーブルまたはスレッドルックアップテーブルとも呼ばれる）４２に渡される。このルックアップテーブルは、表示された各スレッドを現在ハンドリングしているのはどのＴＰＵかを決定する。イベントは、次に、イベント−スレッド送達メカニズム４４を介して、イベント数で構成されている「ＴＰＵイベント」の形態でＴＰＵ（および、これにより、そのそれぞれのスレッド）に提供される。特定のイベントをハンドリングするスレッドがまだ示されていない場合、対応するイベントが、ＴＰＵの１つでアクティブなデフォルトシステムスレッドに渡される。 The event tuples are then passed, in the order received, to the event-to-thread lookup table (also called the event table or thread lookup table) 42. This lookup table determines which TPU is currently handling each indicated thread. The events are then presented to the TPUs (and thereby to their respective threads), via the event-to-thread delivery mechanism 44, in the form of “TPU events” comprising the event number. If no thread is yet available to handle a particular event, the corresponding event is passed to the default system thread, which is active on one of the TPUs.

イベントキュー４０は、ハードウェア、ソフトウェアおよび／またはこれらの組み合わせで実現され得る。モジュール５によって表された組み込みＳｏＣ（ｓｙｓｔｅｍ−ｏｎ−ａ−ｃｈｉｐ）のインプリメンテーションは、必要なキューイング関数を提供する一連のゲートおよび専用バッファとして実現される。これに代わる実施形態では、これは、ソフトウェア（またはハードウェア）リンクリスト、アレイ等で実現される。 The event queue 40 can be implemented in hardware, software, and / or a combination thereof. The embedded SoC (system-on-a-chip) implementation represented by module 5 is implemented as a series of gates and dedicated buffers that provide the necessary queuing functions. In alternative embodiments, this is accomplished with a software (or hardware) linked list, array, or the like.

テーブル４２は、ハードウェアデバイスによって提供されたイベントナンバー（例えば、ハードウェア割り込み）またはイベント命令と、そのイベントをシグナリングするために好ましいスレッドとの間にマッピングを確立する。可能なケースは以下のとおりである。 Table 42 establishes a mapping between the event number (eg, hardware interrupt) or event instruction provided by the hardware device and the preferred thread to signal that event. Possible cases are:

イベント数のエントリがない：デフォルトシステムスレッドにシグナリングする No entry for the event number: signal the default system thread.
スレッドに提供：スレッドが実行、アクティブまたはアイドル状態にある場合、特定のスレッド数シグナリングし、そうでない場合、特定のシステムスレッドにシグナリングする。 Present to thread: signal the specified thread number if that thread is in the executing, active, or idle state; otherwise, signal the specified system thread.

テーブル４２は、専用またはその他の単一の記憶領域であり得、イベント−スレッドの更新されたマッピングを維持する。このテーブルは、さらに、分散またはその他の複数の記憶領域を構成し得る。これに関わらず、テーブル４２は、ハードウェア、ソフトウェアおよび／またはこれらの組み合わせで実現されてもよい。モジュール５によって表された組込みＳｏＣインプリメンテーションにおいて、テーブルは、イベント−スレッドの更新されたマッピングを維持する専用格納領域（単数または複数）で「ハードウェア」ルックアップを実行するゲートによってインプリメントされる。このテーブルは、例えば、ＴＰＵ１０〜２０内にスレッドが新規にロードされ、および／または、不活性化およびこれらからアンロードされるとマッピングを更新するシステムレベル特権スレッドによってソフトウェアにアクセス可能でもある。次の実施形態では、テーブル４２は、マッピングを維持する格納領域のソフトウェアベースのルックアップによって実現される。 Table 42 may be a single storage area, dedicated or otherwise, that maintains an updated event-to-thread mapping. The table may also comprise multiple storage areas, distributed or otherwise. In either case, table 42 may be implemented in hardware, software, and/or a combination thereof. In the embedded SoC implementation represented by module 5, the table is implemented by gates that perform a “hardware” lookup on dedicated storage area(s) maintaining the updated event-to-thread mapping. The table is also accessible to software, for example, to system-level privileged threads that update the mapping as threads are newly loaded into, and/or deactivated and unloaded from, TPUs 10-20. In further embodiments, table 42 is implemented by a software-based lookup of the storage area that maintains the mapping.

イベント−スレッド送達メカニズム４４もまた、ハードウェア、ソフトウェアおよび／またはこれらの組み合わせで実現され得る。モジュール５によって表される組込みＳｏＣインプリメンテーションにおいて、メカニズム４４は、それ自体、キューイングが送達されるイベントの一連のゲートおよび専用バッファ４６〜４８として実現されるＴＰＵキューにシグナリングされたイベントをルーティングするゲート（およびラッチ）によって実現される。上述のように、代替的実施形態では、メカニズム４４は、必要な機能を提供するソフトウェア（または他のハードウェア構造）で実現され、同様に、キュー４６〜４８は、ソフトウェア（またはハードウェア）リンクリスト、アレイ等で実現される。 The event-to-thread delivery mechanism 44 may likewise be implemented in hardware, software, and/or a combination thereof. In the embedded SoC implementation represented by module 5, mechanism 44 is implemented by gates (and latches) that route the signaled events to the TPU queues, which are themselves implemented as a series of gates and dedicated buffers 46-48 for queuing the events to be delivered. As above, in alternative embodiments, mechanism 44 is implemented in software (or other hardware structures) providing the requisite functionality and, likewise, queues 46-48 are implemented as software (or hardware) linked lists, arrays, or the like.
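
The path from event queue 40 through lookup table 42 and delivery mechanism 44 to the per-TPU queues 46-48 can be sketched in software, in the spirit of the alternative (software-based) embodiments mentioned above. This is an illustrative sketch only: the function name `route_events`, the use of a dictionary for the table, and the choice of thread 0 as the default system thread are assumptions, not details confirmed by the text.

```python
from collections import deque


def route_events(event_queue, thread_to_tpu, tpu_queues, default_thread=0):
    """Drain (vtn, event_number) tuples in arrival order and queue them per TPU.

    event_queue:    deque of (virtual thread number, event number) tuples
                    (models event queue 40).
    thread_to_tpu:  dict mapping a thread number to the TPU currently holding
                    that thread (models lookup table 42).
    tpu_queues:     dict mapping a TPU index to a deque of pending event
                    numbers (models the per-TPU queues 46-48).
    default_thread: thread that receives events whose target thread is not
                    loaded (assumed here to be thread 0).
    """
    while event_queue:
        vtn, event_number = event_queue.popleft()
        # Fall back to the default system thread when the target is not loaded.
        tpu = thread_to_tpu.get(vtn, thread_to_tpu[default_thread])
        # The delivered "TPU event" carries only the event number.
        tpu_queues[tpu].append(event_number)
```

A system-privileged thread would update `thread_to_tpu` whenever a thread is loaded onto or unloaded from a TPU, which mirrors how the text describes software access to the table.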

示された実施形態におけるハードウェアおよびソフトウェアイベント（すなわち、ソフトウェアによって開始されるシグナリングイベントまたは「ソフトウェア割り込み」）を処理する手順の大要は以下のとおりである。 A summary of the procedure for handling hardware and software events (ie, software initiated signaling events or “software interrupts”) in the illustrated embodiment is as follows.

１．アクティブスレッドを現在実行中のＴＰＵにイベントがシグナリングされる。 1. An event is signaled to the TPU that is currently executing the active thread.

２．そのＴＰＵがアクティブスレッドの実行をペンディングする。例外ステータス、例外ＩＰ、および例外Ｍｅｍアドレスコントロールレジスタが、イベントのタイプに基づいてイベントに対応する情報を示すようにセットされる。すべてのスレッド状態が有効である。 2. That TPU suspends execution of the active thread. The exception status, exception IP, and exception memory address control registers are set to indicate information corresponding to the event, based on the type of event. All thread state is valid.

３．ＴＰＵは、仮想アドレス０ｘ０でデフォルトイベントハンドラのシステム特権の実行を開始し、イベントシグナリングは、対応するスレッドユニットに対してディセーブルにされる。ＧＰレジスタ０〜３および述語レジスタ０〜１は、イベントハンドラによってスクラッチレジスタとして利用され、かつ、システム特権である。慣例により、ＧＰ［０］はイベント処理スタックポインタである。 3. The TPU begins executing the default event handler at virtual address 0x0 with system privilege, and event signaling is disabled for the corresponding thread unit. GP registers 0-3 and predicate registers 0-1 are used as scratch registers by the event handler and are system privileged. By convention, GP[0] is the event processing stack pointer.

４．イベントハンドラは、リエントラントになり、かつ、対応するスレッド実行ユニットへのイベントシグナリングを再イネーブルし得るために十分な状態をセーブする。 4. The event handler saves enough state that it can become reentrant and can re-enable event signaling to the corresponding thread execution unit.

５．イベントハンドラは、その後、イベントを処理する。これは、ＳＷベースのキューにイベントをポスティングするか、または任意の他のアクションをとることができるに過ぎない。 5. The event handler then processes the event. This may be as simple as posting the event to a SW-based queue, or it may involve any other action.

６．イベントハンドラは、その後、状態を復元し、元のスレッドの実行に戻る。 6. The event handler then restores state and returns to execution of the original thread.
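
The six steps above can be traced with a toy model of a single TPU. This is a sketch under stated assumptions rather than the actual handler: the `Tpu` class, its field names, the string-valued privilege levels, the starting IP of 0x400, and the use of the event number as the exception status value are all hypothetical. The handler address 0x0, the disabling and re-enabling of event signaling, and the option of posting to a software queue come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Tpu:
    ip: int = 0x400               # instruction pointer of the interrupted thread
    privilege: str = "application"
    events_enabled: bool = True
    exception_status: int = 0
    exception_ip: int = 0
    sw_queue: list = field(default_factory=list)


def deliver_event(tpu, event_number):
    """Walk one event through steps 2-6 (step 1 is the call itself)."""
    # Step 2: suspend the thread and record where it was interrupted.
    tpu.exception_status = event_number
    tpu.exception_ip = tpu.ip
    saved_privilege = tpu.privilege
    # Step 3: enter the default handler at virtual address 0x0 with system
    # privilege; event signaling is disabled for this thread unit.
    tpu.ip, tpu.privilege, tpu.events_enabled = 0x0, "system", False
    # Step 4: save enough state to be reentrant, then re-enable signaling.
    saved = (tpu.exception_ip, saved_privilege)
    tpu.events_enabled = True
    # Step 5: process the event; here it is simply posted to a software queue.
    tpu.sw_queue.append(event_number)
    # Step 6: restore state and resume the original thread.
    tpu.ip, tpu.privilege = saved
```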

メモリ関連イベントは、いくらか異なった態様でハンドリングされる。ペンディング中（メモリ）イベントテーブル（ＰＥＴ）５０は、スレッドを実行状態から待ち状態に遷移させる（メモリ参照命令からの）メモリ動作のエントリを保持する。テーブル５０は、イベント−スレッドルックアップテーブル４２のように実現され得、リファレンスを開始したペンディング中メモリ動作、状態情報、およびスレッドＩＤのアドレスを保持する。ＰＥＴにおけるエントリに対応するメモリ動作が完了し、そのスレッドのＰＥＴ内に他のペンディング中動作がない場合、対応するスレッドにＰＥＴイベントがシグナリングされる。 Memory-related events are handled somewhat differently. The pending (memory) event table (PET) 50 holds entries for memory operations (from memory reference instructions) that transition a thread from the executing state to the waiting state. Table 50, which may be implemented like the event-to-thread lookup table 42, holds the address of the pending memory operation, state information, and the ID of the thread that initiated the reference. When the memory operation corresponding to an entry in the PET completes and no other operations are pending in the PET for that thread, a PET event is signaled to the corresponding thread.

An outline of memory event processing according to the illustrated embodiment follows.

1. The event is signaled to the unit that is currently executing the active thread.

2. If the thread is in the active-waiting state and the event is a memory event, the thread transitions to the active-executing state and continues execution at the current IP. Otherwise, the event is ignored.
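The two steps above amount to a single state transition, which can be sketched as follows. The state and event labels ("waiting", "executing", "memory") and the function name `on_event` are hypothetical and are chosen only for illustration.

```python
def on_event(thread_state, event_kind):
    """Return the new state of an active thread after an event is signaled."""
    if thread_state == "waiting" and event_kind == "memory":
        return "executing"      # the thread resumes at its current IP
    return thread_state         # otherwise the event is ignored
```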

As further shown in the drawings, in the illustrated embodiment thread wait timeouts and thread exceptions are signaled directly to the threads and are not passed through the event-thread delivery mechanism 44.

(Traps)
The goal of multithreading and events is that the normal program execution of a thread is not disturbed. Events and interrupts that occur are handled by the appropriate thread that is waiting for the event. In some cases this is not possible, and normal processing must be interrupted. SEP supports a trap mechanism for this purpose. The list of actions based on event type is as follows, and the entire list of traps is enumerated in the system exception status register.

The illustrated processor module 5 takes the following actions when a trap occurs.

1. The IP (instruction pointer) identifying the next instruction to be executed is loaded into the exception IP register.
2. The privilege level is stored in bit 0 of the exception IP register.
3. The exception type is loaded into the exception status register.
4. If the exception is associated with a memory unit instruction, the memory address corresponding to the exception is loaded into the exception memory address register.
5. The current privilege level is set to system.
6. The IP (instruction pointer) is cleared (set to zero).
7. Execution begins at IP 0.
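The seven trap actions can be summarized in a short software sketch. It is offered only as an illustration under stated assumptions: the `ThreadState` fields, the encoding of privilege as 0 (application) or 1 (system), and the function `take_trap` are hypothetical and are not taken from the illustrated embodiment.

```python
from dataclasses import dataclass

SYSTEM = 1  # assumption: privilege encoded as 0 = application, 1 = system


@dataclass
class ThreadState:
    ip: int = 0
    privilege: int = 0
    exception_ip: int = 0
    exception_status: int = 0
    exception_mem_address: int = 0


def take_trap(t, next_ip, exception_type, mem_address=None):
    # Steps 1-2: save the next IP, with the privilege level in bit 0.
    t.exception_ip = (next_ip & ~1) | (t.privilege & 1)
    # Step 3: record the exception type.
    t.exception_status = exception_type
    # Step 4: only memory unit instructions supply a faulting address.
    if mem_address is not None:
        t.exception_mem_address = mem_address
    # Steps 5-7: switch to system privilege and restart at IP 0.
    t.privilege = SYSTEM
    t.ip = 0
```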

(Virtual memory and memory system)
The illustrated processor module 5 utilizes a virtual memory and memory system architecture having a 64-bit virtual address (VA) space, a 64-bit system address (SA) (which has different characteristics than a standard physical address), and a segment model of virtual address to system address translation with a sparsely filled VA or SA.

All memory accessed by the TPUs 10-20 is effectively managed as cache, even though off-chip memory may utilize DDR DRAM or other forms of dynamic memory. Referring back to FIG. 1, in the illustrated embodiment the memory system is organized as two logical levels. The level 1 cache is divided into a separate data cache 24 and instruction cache 22 for optimal latency and bandwidth. The illustrated level 2 cache 26 consists of an on-chip portion and an off-chip portion referred to as extended level 2. As a whole, the level 2 cache is the memory system for the individual SEP processor(s) 5 and contributes to a distributed "all cache" memory system in implementations where multiple SEP processors 5 are used. Of course, those multiple processors need not physically share the same memory system, chip or bus and could be connected, for example, over a network or otherwise.

FIG. 5 depicts the VA to SA translation used in the illustrated system. Translation is handled on a segment basis, where (in the illustrated embodiment) segments can vary in size, for example from 2^24 to 2^48 bytes. The SAs are cached in the memory system, so an SA that is present in the memory system has an entry in one of the cache levels 22/24, 26. An SA that is not present in any cache (and the memory system) is effectively not present in the memory system. The memory system is thus sparsely filled at page (and subpage) granularity, in a manner that is natural to software and the OS, without the overhead of page tables on the processor.

In addition to the foregoing, the virtual memory and memory system architecture of the illustrated embodiment has the following further features: direct support for distributed shared memory (DSM), files (DSF), objects (DSO) and peer-to-peer (DSP2P); a scalable cache and memory system architecture; segments that can be shared between threads; and a fast level 1 cache, since lookup proceeds in parallel with tag access, with no full virtual-to-physical address translation and none of the complexity of a virtual cache.

(Overview of virtual memory)
A virtual address in the illustrated system is the 64-bit address constructed by memory reference and branch instructions. Virtual addresses are translated, on a per-segment basis, to system addresses, which are used to access all system memory and IO devices. Each segment can vary in size from 2^24 to 2^48 bytes. More specifically, referring to FIG. 5, the virtual address 50 is used to match an entry in the segment table 52 in the manner shown in the drawing. The matched entry 54 specifies the corresponding system address when taken in combination with the components of the virtual address identified in the drawing. In addition, the matched entry 54 specifies the corresponding segment size and privilege. That system address, in turn, maps into the system memory (which, in the illustrated embodiment, comprises 2^64 bytes, sparsely filled). The illustrated embodiment permits address translation to be disabled by threads with system privilege, in which case the segment table is bypassed and all addresses are truncated to the low 32 bits.
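A per-segment translation of this kind can be sketched in software as below. The sketch is illustrative only: the exact matching scheme is defined by the drawing, which is not reproduced here, so the assumption that a segment is selected by comparing the address bits above the segment offset is hypothetical, as are the names `SegmentEntry` and `translate`. Privilege checking is omitted.

```python
from dataclasses import dataclass


@dataclass
class SegmentEntry:
    va_base: int      # segment-aligned virtual base address
    size_log2: int    # segment size is 2**size_log2 bytes (24 to 48)
    sa_base: int      # segment-aligned system base address
    privilege: str    # carried in the entry; not checked in this sketch


def translate(va, table, translation_enabled=True):
    """Translate a 64-bit virtual address to a system address."""
    if not translation_enabled:
        # Segment table bypassed: only the low 32 bits are kept.
        return va & 0xFFFF_FFFF
    for e in table:
        offset_mask = (1 << e.size_log2) - 1
        if (va & ~offset_mask) == e.va_base:
            return e.sa_base | (va & offset_mask)
    raise LookupError("no matching segment table entry")
```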

The illustrated segment table 52 comprises 16 to 32 entries per thread (TPU). The table may be implemented in hardware, software and/or a combination thereof. In the embedded SoC implementation represented by module 5, the table is implemented in hardware, with separate entries in memory being provided for each thread (e.g., a separate table per thread). A segment can be shared among two or more threads by setting up a separate entry for each thread that points to the same system address. Other hardware or software structures may be used instead, or in addition, for this purpose.

(Overview of cache memory system)
As noted above, the level 1 cache is organized as a separate level 1 instruction cache 22 and level 1 data cache 24 to maximize instruction and data bandwidth.

Referring to FIG. 6, the on-chip L2 cache 26a comprises tag and data portions. In the illustrated embodiment, it is 5.1 Mbyte in size, with 128 blocks, 16-way associative. Each block stores 128 bytes of data or 16 extended L2 tags, with 64 Kbytes provided for storing the extended L2 tags. A tag-mode bit within the tag indicates that the data portion consists of 16 tags for the extended L2 cache.

The extended L2 cache 26b is, as noted above, DDR DRAM-based, though other memory types can be employed. In the illustrated embodiment, it is up to 1 Gbyte in size, 256-way associative, with 16 Kbyte pages and 128-byte subpages. For a configuration of a 5 Mbyte L2 cache 26a and a 1 Gbyte L2 extended cache 26b, only 12% of the on-chip L2 cache is required to fully describe the L2 extended cache. For larger on-chip L2 or smaller L2 extended sizes, the percentage is lower. The aggregation of L2 caches (on-chip and extended) makes up the distributed SEP memory system.

In the illustrated embodiment, both the L1 instruction cache 22 and the L1 data cache 24 are 8-way associative and 32 Kbytes in size, with 128-byte blocks. As shown in the drawing, both level 1 caches are proper subsets of the level 2 cache. The level 2 cache consists of the on-chip and off-chip extended L2 caches.

FIG. 7 depicts the L2 cache 26a and the logic used, in the illustrated example, to perform a tag lookup in the L2 cache 26a to identify a data block 70 matching an L2 cache address 78. In the illustrated embodiment, that logic comprises sixteen cache tag array groups 72a-72p, corresponding tag compare elements 74a-74p, and corresponding data array groups 76a-76p. These are coupled as shown to match the L2 cache address 78 against the tag array groups 72a-72p and to select the data block 70 identified by the indicated data array group 76a-76p.

The cache tag array groups 72a-72p, tag compare elements 74a-74p, and corresponding data array groups 76a-76p may be implemented in hardware, software and/or a combination thereof. In the embedded SoC implementation represented by module 5, they are implemented as shown in FIG. 19, which shows the cache tag array groups 72a-72p embodied in 32x256 single-port memory cells and the data array groups 76a-76p embodied in 128x256 single-port memory cells, all coupled as shown with current state control logic 190. That element is, in turn, coupled to a state machine 192, which facilitates operation of the L2 cache unit 26a in a manner consistent herewith, and to a request queue 192, which buffers requests from the L1 instruction and data caches 22, 24.

The logic element 190 is further coupled with a DDR DRAM control interface 26c, which provides an interface to the off-chip portion 26b of the L2 cache. It is likewise coupled to an AMBA interface 26d, which provides an interface to AMBA-compatible components such as liquid crystal displays (LCDs), audio output interfaces, video input interfaces, video output interfaces, network interfaces (wireless, wired or otherwise), storage device interfaces, peripheral interfaces (e.g., USB, USB2) and bus interfaces (PCI, ATA), to name but a few. The DDR DRAM interface 26c and the AMBA interface 26d are likewise coupled, as shown, to an interface 196 to the L1 instruction and data caches by way of the L2 data cache bus 198.

FIG. 8 likewise depicts the logic used in the illustrated embodiment to perform a tag lookup in the L2 extended cache 26b and to identify a data block 80 matching a designated address 78. In the illustrated embodiment, that logic comprises data array groups 82a-82p, corresponding tag compare elements 84a-84p, and a tag latch 86. These are coupled as shown to match the L2 cache address 78 against the data array groups 72a-72p and to select, from one of those groups, the tag that matches the corresponding portion of the address 78. The physical page number from the matching tag is combined with the index portion of the address 78, as shown, to identify the data block 80 in the off-chip memory 26b.
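The final step, in which a physical page number is combined with part of the address to locate a block in off-chip memory, can be sketched as follows. The sketch assumes that the "index portion" corresponds to the subpage position within a 16 Kbyte page; that reading, the tag format and the function name `l2e_block_address` are hypothetical.

```python
PAGE_BYTES = 16 * 1024    # page size given in the text
SUBPAGE_BYTES = 128       # subpage size given in the text


def l2e_block_address(address, ways):
    """ways: list of (tag, physical_page_number) pairs for the selected set.

    Returns the off-chip address of the matching 128-byte subpage, or None.
    """
    page_tag = address // PAGE_BYTES
    page_offset = address % PAGE_BYTES
    for tag, ppn in ways:
        if tag == page_tag:
            subpage = page_offset // SUBPAGE_BYTES
            return ppn * PAGE_BYTES + subpage * SUBPAGE_BYTES
    return None   # no extended L2 tag matches
```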

The data array groups 82a-82p and tag compare elements 84a-84p may be implemented in hardware, software and/or a combination thereof. In the embedded SoC implementation represented by module 5, they are implemented in gates and dedicated memory providing the requisite lookup and tag comparison functions. Other hardware or software structures may be used instead, or in addition, for this purpose.

Set out below is pseudo-code illustrating the L2 and L2E cache operation of the illustrated embodiment.

(Thread processing unit state)
Referring to FIG. 9, the illustrated embodiment has six TPUs supporting up to six active threads. Each TPU 10-20 includes general-purpose registers, predicate registers and control registers, as shown in FIG. 9. Threads at both the system and application privilege levels contain identical state, although some thread state information is visible only at the system privilege level (as indicated by the key and the respective stippling patterns). In addition to registers, each TPU further includes a pending memory event table, an event queue and an event-thread lookup table, none of which is shown in FIG. 9.

Depending on the embodiment, there can be 48 (or fewer) to 128 (or more) general-purpose registers, with the illustrated embodiment having 128; 24 (or fewer) to 64 (or more) predicate registers, with the illustrated embodiment having 32; 6 (or fewer) to 256 (or more) active threads, with the illustrated embodiment having 8; a pending memory event table of 16 (or fewer) to 512 (or more) entries, with the illustrated embodiment having 16; a number of pending memory events per thread, preferably at least two (though potentially fewer); an event queue of 256 (or more, or fewer) entries; and an event-thread lookup table of 16 (or fewer) to 256 (or more) entries, with the illustrated embodiment having 32.

(General-purpose registers)
In the illustrated embodiment, each thread has up to 128 general-purpose registers, depending on the implementation. General-purpose registers 3-0 (GP[3:0]) are visible at the system privilege level and can be used for the event stack pointer and as working registers during the early stages of event processing.

(Predication registers)
The predicate registers are part of the general-purpose SEP predication mechanism. Execution of each instruction is conditional on the value of the predicate register it references.

SEP provides up to 64 one-bit predicate registers as part of the thread state.

Each predicate register holds what is called a predicate, which is set to 1 (true) or reset to 0 (false) based on the result of executing a compare instruction. Predicate registers 3-1 (PR[3:1]) are visible at the system privilege level and can be used as working predicates during the early stages of event processing. Predicate register 0 is read-only and always reads as 1 (true). It is used by instructions to make their execution unconditional.
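The predication behavior described above can be illustrated with a small model. The names `PredicateFile`, `compare` and `execute` are hypothetical, and the sketch ignores privilege; it shows only that a compare result gates a later instruction and that predicate register 0 always reads as true.

```python
class PredicateFile:
    """Hypothetical model of up to 64 one-bit predicate registers."""

    def __init__(self, count=64):
        self.bits = [0] * count
        self.bits[0] = 1                 # predicate register 0 always reads 1

    def read(self, n):
        return self.bits[n]

    def compare(self, n, result):
        """Write the outcome of a compare instruction to predicate n."""
        if n != 0:                       # register 0 is read-only
            self.bits[n] = 1 if result else 0


def execute(pred_file, pred, operation):
    """Run `operation` only when the referenced predicate is true."""
    if pred_file.read(pred):
        return operation()
    return None
```

An instruction that references predicate register 0 is therefore always executed in this model.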

(Control registers)
Thread state register

ID register

Instruction pointer register

Specifies the 64-bit virtual address of the next instruction to be executed.

System exception status register

Application exception status register

System exception IP

Address of the instruction corresponding to the signaled exception to system privilege. Bit [0] is the privilege level at the time of the exception.

Address of the instruction corresponding to the signaled exception. Bit [0] is the privilege level at the time of the exception.

Application exception IP

Address of the instruction corresponding to the signaled exception to application privilege.

Exception Mem address

Address of the memory reference that signaled the exception. Valid only for memory faults. Holds the address of the pending memory operation when the execution status register indicates a memory reference fault, waiting for fill or waiting for empty.

Instruction Seg table pointer (ISTP), Data Seg table pointer (DSTP)

Used by the ISTE and DSTE registers to specify the STE and field that is read or written.

Instruction segment table entry (ISTE), data segment table entry (DSTE)

When read, the STE specified by the ISTE register is placed in the destination general-purpose register. When written, the STE specified by the ISTE or DSTE is written from the general-purpose source register. The format of segment table entries is specified in Chapter 6, in the section titled "Translation Table Organization and Entry Description".

Instruction or data level 1 cache tag pointer (ICTP, DCTP)

Specifies the instruction cache tag entry that is read or written by the ICTE or DCTE.

Instruction or data level 1 cache tag entry (ICTE, DCTE)

When read, the cache tag specified by the ICTP or DCTP register is placed in the destination general-purpose register. When written, the cache tag specified by the ICTP or DCTP is written from the general-purpose source register. The format of cache tag entries is specified in Chapter 6, in the section titled "Translation Table Organization and Entry Description".

Event queue control register

The event queue control register (EQCR) enables normal and diagnostic access to the event queue. The sequence for using the register is a register write followed by a register read. The contents of the reg_op field specify the operation for the write and for the next read.

Event-thread lookup table control

The event-thread lookup table establishes a mapping between an event number, presented by a hardware device or an event instruction, and the preferred thread to which the event is signaled. Each entry in the table specifies an event number together with a bit mask, and the corresponding thread to which the event is mapped.
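A masked lookup of this kind can be sketched as below. The text states only that each entry holds an event number, a bit mask and a thread; the assumption that the mask selects which bits of the event number must match, and the function name `lookup_thread`, are hypothetical.

```python
def lookup_thread(table, event):
    """table: list of (event_number, bit_mask, thread_id) entries.

    Returns the thread mapped to the first matching entry, or None.
    """
    for number, mask, thread_id in table:
        # Assumption: only the bits selected by the mask are compared.
        if (event & mask) == (number & mask):
            return thread_id
    return None   # no preferred thread is registered for this event
```

Under this reading, a single entry with a partial mask maps a whole range of event numbers to one thread.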

(Timers and performance monitors)
In the illustrated embodiment, all timer and performance monitor registers are accessible at application privilege.

Clock

Instruction executed

Thread execution clock

Timeout waiting counter

(Virtual processors and thread ID)
In the illustrated embodiment, each active thread corresponds to a virtual processor and is specified by an 8-bit active thread number (active thread[7:0]). The module 5 supports a 16-bit thread ID (thread[15:0]) to enable rapid loading (activation) and unloading (deactivation) of threads. Other embodiments may support thread IDs of different sizes.

(Thread-instruction fetch abstraction)
As noted above, the TPUs 10-20 of module 5 share the L1 instruction cache 22, as well as pipeline control hardware that launches up to five instructions each cycle from any combination of the threads active in those TPUs. FIG. 10 is an abstraction of the mechanism employed by module 5 to fetch and dispatch those instructions for execution on the functional units 30-38.

As shown in that drawing, during each cycle instructions are fetched from the L1 cache 22 and placed in the instruction queues 10a-20a associated with the respective TPUs 10-20. In the illustrated embodiment, three to six instructions are fetched for a single thread, with the overall goal of keeping the thread queues 10a-20a at equal levels. In other embodiments, different numbers of instructions may be fetched and/or different goals may be set for the relative filling of the queues. Also during the fetch stage, the module 5 (and, specifically, for example, the event handling mechanisms discussed above) recognizes events and the corresponding transitions of threads from the waiting state to the executing state.

During the dispatch stage, which executes in parallel with the fetch and execute/retire stages, instructions from each of one or more executing threads are dispatched to the functional units 30-38 based on a round-robin protocol that takes into account the best utilization of those resources for that cycle. These instructions can be from any combination of threads. The compiler specifies, for example by means of a "stop" flag provided in the instruction set, the boundaries between groups of instructions within a thread that can be launched in a single cycle. In other embodiments, other protocols may be employed, e.g., ones that prioritize certain threads or ones that ignore resource utilization.
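One possible round-robin dispatch policy is sketched below purely for illustration. It assumes one functional unit of each of five kinds, represents each queued instruction only by the kind of unit it needs, and rotates the starting thread each cycle. These choices, the unit names in `UNITS` and the function `dispatch_cycle` are hypothetical, and the compiler-supplied "stop" flags are not modeled.

```python
from collections import deque

UNITS = ("integer", "float", "branch", "compare", "memory")


def dispatch_cycle(queues, start):
    """queues: per-thread deques of unit names; returns (issued, next_start)."""
    free = set(UNITS)
    issued = []
    n = len(queues)
    for step in range(n):
        tid = (start + step) % n
        q = queues[tid]
        # Issue in program order from this thread while its next unit is free.
        while q and q[0] in free:
            unit = q.popleft()
            free.remove(unit)
            issued.append((tid, unit))
    return issued, (start + 1) % n
```

Because each unit is granted at most once per cycle, at most five instructions are issued per call, drawn from any combination of threads.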

During the execute and retire stage, which runs in parallel with the fetch and dispatch stages, multiple instructions from one or more threads are executed simultaneously. As noted above, in the illustrated embodiment up to five instructions are launched and executed each cycle, e.g., by the integer, floating-point, branch, compare and memory functional units 30-38. In other embodiments, greater or fewer instructions can be launched, for example depending on the number and type of functional units and on the number of TPUs.

When its execution completes, an instruction is retired: its result is written and the instruction is cleared from the instruction queue.

If, on the other hand, an instruction blocks, the corresponding thread transitions from the executing state to the waiting state. When the condition that caused the block is resolved, the blocked instruction and all instructions following it in the corresponding thread are restarted. FIG. 11 illustrates the three-pointer queue management mechanism used in the illustrated embodiment to facilitate this.

Referring to that drawing, an instruction queue and a set of three pointers is maintained for each TPU 10-20. Here, only a single such queue 110 and set of pointers 112-116 is shown. The queue 110 holds fetched, executing and retired (or invalid) instructions for the associated TPU and, more particularly, for the thread currently active in that TPU. As instructions are fetched, they are inserted at the head of the queue, which is designated by the insert (or fetch) pointer 112. The next instruction for execution is identified by the extract (or issue) pointer 114. The commit pointer 116 identifies the last instruction whose execution has been committed. In the event an instruction is blocked or otherwise aborted, the commit pointer 116 is rolled back to quash the instructions between the commit and extract pointers in the execution pipeline. Conversely, when a branch is taken, the entire queue is flushed and the pointers are reset.
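A three-pointer queue of this general kind can be modeled as below. The sketch is one plausible reading and not the logic of FIG. 11: it tracks the first uncommitted slot rather than the last committed one, and it models a rollback by moving the issue position back to the commit point so that uncommitted instructions are issued again. The class `InstructionQueue` and its method names are hypothetical.

```python
class InstructionQueue:
    """Hypothetical circular instruction queue with three pointers."""

    def __init__(self, size=18):
        self.slots = [None] * size
        self.insert = 0     # next slot to be filled by fetch
        self.extract = 0    # next instruction to issue
        self.commit = 0     # first slot not yet committed

    def fetch(self, instr):
        if self.insert - self.commit >= len(self.slots):
            raise RuntimeError("queue full")
        self.slots[self.insert % len(self.slots)] = instr
        self.insert += 1

    def issue(self):
        if self.extract == self.insert:
            return None     # nothing fetched and not yet issued
        instr = self.slots[self.extract % len(self.slots)]
        self.extract += 1
        return instr

    def commit_one(self):
        if self.commit < self.extract:
            self.commit += 1

    def rollback(self):
        # Instructions issued but not committed will be issued again.
        self.extract = self.commit

    def flush(self):
        # Taken branch: discard everything and reset the pointers.
        self.insert = self.extract = self.commit = 0
```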

キュー１１０は円として示されるが、他のコンフィギュレーションが利用されてもよいことが分かる。図１１に示されるキューイングメカニズムは、例えば、図１２に示されるように実現され得る。命令は、デュアルポートメモリ１２０または、あるいは、一連のレジスタ（図示せず）に格納される。各々の新規にフェッチされた命令が格納される書き込みアドレスは、フェッチコマンド（例えば、パイプラインコントロールによって発行される）に応答して、メモリ１２０の連続的アドレスを生成するフェッチポインタロジック１２２によって供給される。発行された命令は、ここでは下部に示された他のポートから取得される。各命令が取得される読み出しアドレスが発行／コミットポインタロジック１２４によって供給される。そのロジックは、コミットおよび発行コマンド（例えば、パイプラインコントロールによって発行される）に応答し、連続的アドレスを、適宜、生成および／またはリセットする。
（プロセッサモジュールインプリメンテーション）
図１３は、特に、ＴＰＵ１０〜２０を実現するためのロジックを含む図１のプロセッサモジュール５のＳｏＣインプリメンテーションを示す。図１におけるように、図１３のインプリメンテーションは、上述のように構成され動作するＬ１およびＬ２キャッシュ２２〜２６を備える。同様に、このインプリメンテーションは、整数ユニット、浮動小数点ユニットおよび比較ユニットをそれぞれ備える機能ユニット３０〜３４を備える。さらなる機能ユニットが代替的または追加的に提供され得る。ＩＰＵ１０〜２０を実現するためのロジックは、パイプラインコントロール１３０、分岐ユニット３８、メモリユニット３６、レジスタファイル１３６、およびロード−格納バッファ１３８を備える。図１３に示されるコンポーネントは、示されるように、コントロールおよび情報転送するために相互接続される。破線は主要コントロールを示し、実線は述語値コントロールを示し、太い実線は６４ビットデータバスを同定し、さらに太い線は１２８ビットデータバスを同定する。図１３は本発明によるプロセッサモジュール５のあるインプリメンテーションを表し、他のインプリメンテーションが実現されてもよいことが分かる。
（パイプラインコントロールユニット）
示された実施形態では、パイプラインコントロール１３０は、図１１〜図１２との関連ですでに述べられたスレッドごとのキューを含む。スレッドごとに１２、１５または１８個の命令がパラメータ化され得る。コントロール１３０は、これらのキューからラウンドロビンベースで命令をピックアップする（しかしながら、上述のように、これは、他のベースでも実行され得る。これは、レジスタファイル１３６（これは、命令のソースおよび宛先レジスタを提供するリソースである）、および機能ユニット３０〜３８へのアクセスのシーケンスをコントロールする。パイプラインコントロール１３０は、スレッドごとのキューからの基本命令クラスをデコードし、命令を機能ユニット３０〜３８にディスパッチする。上述のように、１つ以上のスレッドからの複数の命令は、同じサイクルのそれらの機能ユニットによって実行するためにスケジューリングされ得る。コントロール１３０は、さらに、スレッドごとの構成キューを空にすること、分岐ユニット３８にシグナリングすること、および、可能である場合、例えば、サイクルごとのベースで機能ユニットをアイドリングしてユーザの消費を低減することを担う。 Although queue 110 is shown as a circle, it will be appreciated that other configurations may be utilized. The queuing mechanism shown in FIG. 11 may be realized as shown in FIG. 12, for example. The instructions are stored in the dual port memory 120 or a series of registers (not shown). The write address at which each newly fetched instruction is stored is provided by fetch pointer logic 122 that generates successive addresses in memory 120 in response to a fetch command (eg, issued by pipeline control). The The issued instruction is obtained here from the other port shown at the bottom. The read address from which each instruction is obtained is supplied by issue / commit pointer logic 124. The logic responds to commit and issue commands (eg, issued by pipeline control) and generates and / or resets sequential addresses as appropriate.
(Processor module implementation)
FIG. 13 shows in particular the SoC implementation of the processor module 5 of FIG. 1 including logic for implementing the TPUs 10-20. As in FIG. 1, the implementation of FIG. 13 includes L1 and L2 caches 22-26 configured and operating as described above. Similarly, this implementation comprises functional units 30-34 each comprising an integer unit, a floating point unit and a comparison unit. Additional functional units may be provided alternatively or additionally. The logic for realizing the IPUs 10 to 20 includes a pipeline control 130, a branch unit 38, a memory unit 36, a register file 136, and a load-store buffer 138. The components shown in FIG. 13 are interconnected for control and information transfer as shown. The dashed line indicates the primary control, the solid line indicates the predicate value control, the thick solid line identifies the 64-bit data bus, and the thicker line identifies the 128-bit data bus. FIG. 13 represents one implementation of the processor module 5 according to the invention and it will be appreciated that other implementations may be implemented.
(Pipeline control unit)
In the illustrated embodiment, the pipeline control 130 contains the per-thread queues previously described in connection with FIGS. 11-12. These can be parameterized at 12, 15 or 18 instructions per thread. The control 130 picks instructions from those queues on a round-robin basis (though, as noted above, this can be done on other bases). It controls the sequence of accesses to the register file 136 (the resource that provides the source and destination registers for the instructions) and to the functional units 30-38. The pipeline control 130 decodes the basic instruction classes from the per-thread queues and dispatches instructions to the functional units 30-38. As noted above, multiple instructions from one or more threads can be scheduled for execution by those functional units in the same cycle. The control 130 is additionally responsible for signaling the branch unit 38 as it empties the per-thread instruction queues and, where possible, for idling the functional units, e.g., on a cycle-by-cycle basis, to reduce power consumption.

FIG. 14 is a block diagram of the pipeline control unit 130. The unit includes control logic 140 for the thread class queues, the thread class (or per-thread) queues 142 themselves, an instruction dispatch 144, a longword decode unit 146, and functional unit queues 148a-148e, connected to one another (and to the other components of module 5) as shown in the drawing. The thread class (per-thread) queues are constructed and operated as discussed above in connection with FIGS. 11-12. The thread class queue control logic 140 controls the input side of those queues 142 and, hence, provides the insert pointer functionality shown in FIGS. 11-12 and discussed above. The control logic 140 is also responsible for controlling the input side of the unit queues 148a-148e and for interfacing with the branch unit 38 to control instruction fetch. In this latter regard, logic 140 is responsible for balancing instruction fetch in the manner discussed above (e.g., to compensate for those TPUs that are retiring the most instructions).

The instruction dispatch 144 evaluates and determines, each cycle, the schedule of available instructions in each of the thread class queues. As noted above, in the illustrated embodiment the queues are handled on a round-robin basis, with account taken of queues that are retiring instructions more rapidly. The instruction dispatch 144 also controls the output side of the thread class queues 142. In this regard, it manages the extract and commit pointers discussed above in connection with FIGS. 11-12, which includes updating the commit pointer when instructions are retired and rolling that pointer back when an instruction is aborted (e.g., for a thread switch or an exception).
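The pointer handling described above can be illustrated with a small software model. This is a minimal sketch, not the hardware design: the class and function names are hypothetical, the default capacity of 12 is one of the parameterizations mentioned in the text, and the abort behavior (re-issuing everything that was issued but not yet retired) is an assumption about what rolling back implies.

```python
class ThreadInstructionQueue:
    """Circular per-thread instruction queue with three pointers (illustrative).

    fetch  -> next slot in which a newly fetched instruction is stored
    issue  -> next instruction to hand to a functional unit (extract pointer)
    commit -> oldest instruction that has not yet been retired
    """

    def __init__(self, capacity=12):
        self.capacity = capacity
        self.slots = [None] * capacity
        # Pointers are ever-increasing counters; the slot is counter % capacity.
        self.fetch = self.issue = self.commit = 0

    def insert(self, instruction):
        """Store a fetched instruction; return False if the queue is full."""
        if self.fetch - self.commit == self.capacity:
            return False
        self.slots[self.fetch % self.capacity] = instruction
        self.fetch += 1
        return True

    def extract(self):
        """Issue the next instruction, or return None if none is waiting."""
        if self.issue == self.fetch:
            return None
        instruction = self.slots[self.issue % self.capacity]
        self.issue += 1
        return instruction

    def retire(self, count=1):
        """Advance the commit pointer past retired instructions."""
        if self.commit + count > self.issue:
            raise ValueError("cannot retire instructions that were not issued")
        self.commit += count

    def abort(self):
        """Assumed rollback: unretired instructions will be issued again."""
        self.issue = self.commit


def round_robin_issue(queues, start=0):
    """Visit the queues in round-robin order and issue from the first ready one."""
    for offset in range(len(queues)):
        index = (start + offset) % len(queues)
        instruction = queues[index].extract()
        if instruction is not None:
            return index, instruction
    return None
```

A dispatcher built this way would advance `start` each cycle; weighting the order toward queues that retire instructions faster, as the text mentions, is left out of the sketch.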

The longword decode unit 146 decodes incoming instruction longwords from the L1 instruction cache 22. In the illustrated embodiment, each such longword is decoded into instructions. The unit can be parameterized to decode one or two longwords per cycle, which decode into three and six instructions, respectively. The decode unit 146 is also responsible for decoding the instruction class of each instruction.

The unit queues 148a-148e queue the actual instructions that are to be executed by the functional units 30-38. Each queue is organized on a per-thread basis and is kept consistent with the class queues. The unit queues are coupled, for control purposes, to the thread class queue control 140 and to the instruction dispatch 144, as discussed above. Instructions from the queues 148a-148e are transferred to the corresponding pipelines 150a-150e en route to the functional units 30-38 themselves. The instructions are also passed to the register file pipeline 152.

FIG. 15 is a block diagram of an individual unit queue, e.g., 148a. It includes one instruction queue 154a-154e for each TPU. These are coupled, for control purposes, to the thread class queue control 140 (labelled tcqueue_ctl) and to the instruction dispatch 144 (labelled idispatch). They are also coupled, as shown, to the longword decode unit 146 (labelled lwdecode) for instruction input and to a thread selection unit 156. That unit controls thread selection based on control signals provided by the instruction dispatch 144, as shown. Output from unit 156 is routed to the corresponding pipeline 150a-150e and to the register file pipeline 152.

Referring again to FIG. 14, the integer unit pipeline 150a and the floating-point unit pipeline 150b decode the appropriate instruction fields for their respective functional units. Each pipeline also times the commands sent to its functional unit. In addition, each pipeline 150a, 150b applies squashing to its pipeline on a branch or an abort. Each also applies a power-down signal to its functional unit when that unit is not used during a cycle. The illustrated compare unit pipeline 150c, branch unit pipeline 150d and memory unit pipeline 150e provide like functionality for their respective functional units: the compare unit 34, the branch unit 38 and the memory unit 36. The register file pipeline 152 likewise provides like functionality with respect to the register file 136.

Returning now to FIG. 13, the illustrated branch unit 38 is responsible for instruction address generation and address translation, as well as instruction fetching. In addition, it maintains state for the thread processing units 10-20. FIG. 16 is a block diagram of the branch unit 38. It includes control logic 160, thread state stores 162a-162e, thread selector 164, address adder 166 and segment translation content-addressable memory (CAM) 168, connected to one another (and to the other components of module 5) as shown in the drawing.

The control logic 160 drives the unit 38 based on a command signal from the pipeline control 130. It also takes as inputs the instruction cache 22 state and the L2 cache 26 acknowledgement, as shown. The logic 160 outputs a thread switch to the pipeline control 130, as well as commands to the instruction cache 22 and the L2 cache, as shown. The thread state stores 162a-162e store the thread state for each of the respective TPUs 10-20. For each of those TPUs, they maintain the general-purpose registers, predicate registers and control registers shown in FIG. 3 and discussed above.

Address information obtained from the thread state stores is routed, as shown, to the thread selector 164. The selector chooses, based on a control signal (as shown) from the control 160, the thread address on which an address computation is to be performed. The address adder 166 either increments the selected address or performs a branch address calculation, based on the output of the thread selector 164 and on addressing information supplied by the register file (labelled register source), as shown. In addition, the address adder 166 outputs a branch result. The newly computed address is routed to the segment translation memory 168, which operates as discussed above in connection with FIG. 5 to generate a translated instruction cache address for use with the next instruction fetch.
(Functional units)
Returning to FIG. 13, the memory unit 36 is responsible for executing memory reference instructions, including data cache 24 address generation and address translation. In addition, unit 36 maintains the pending (memory) event table (PET) 50, discussed above. FIG. 17 is a block diagram of the memory unit 36. It includes control logic 170, address adder 172 and segment translation CAM 174, connected to one another (and to the other components of module 5) as shown in the drawing.

The control logic 170 drives the unit 36 based on a command signal from the pipeline control 130. It also takes as inputs the data cache 24 state and the L2 cache 26 acknowledgement, as shown. The logic 170 outputs a thread switch to the pipeline control 130 and the branch unit 38, as well as commands to the data cache 24 and the L2 cache, as shown. The address adder 172 increments the addressing information supplied by the register file 136 or performs the required address calculation. The newly computed address is routed to the segment translation memory 174, which operates as discussed above in connection with FIG. 5 to generate a translated data cache address for use with a data access. Although not shown in the drawing, the unit 36 also includes the PET, as discussed above.

FIG. 18 is a block diagram of a cache unit implementing either the L1 instruction cache 22 or the L1 data cache 24. The unit includes sixteen 128x256 byte single-port memory cells 180a-180p serving as data arrays, together with sixteen corresponding 32x56 byte dual-port memory cells 182a-182p serving as tag arrays. These are coupled to the L1 and L2 address and data buses, as shown. Control logic 184 and 186 are coupled to the memory cells and to the L1 cache control and L2 cache control.

Returning again to FIG. 13, the register file 136 serves as the resource for all source and destination registers accessed by the instructions executed by the functional units 30-38. The register file is implemented as shown in FIG. 20. As shown there, to reduce delay and wiring overhead, the unit 136 is decomposed into a separate register file instance for each functional unit 30-38. In the illustrated embodiment, each instance provides forty-eight 64-bit registers for each of the TPUs. Other embodiments may vary in the number of registers allotted to each TPU, the number of TPUs and the size of the registers.

Each instance 200a-200e has five write ports, indicated by the arrows entering the top of each instance, through which each of the functional units 30-38 can simultaneously write output data (thereby ensuring that the instances hold consistent data). Each provides a different number of read ports, indicated by the arrows emanating from the bottom of each instance, through which its respective functional unit obtains data. Thus, as shown, the instances associated with the integer unit 30, the floating-point unit 32 and the memory unit each have three read ports, the instance associated with the compare unit 34 has two read ports, and the instance associated with the branch unit 38 has one read port.
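The replication scheme can be sketched in software as follows. This is only an illustrative model under stated assumptions: the class name, the unit labels and the default of six TPUs are hypothetical, while the port counts and the forty-eight 64-bit registers per TPU follow the text. Every write is broadcast to all instances, which is what keeps them consistent, and each instance enforces its own read-port limit.

```python
# Read ports per instance, as described in the text; unit labels are illustrative.
READ_PORTS = {"integer": 3, "floating_point": 3, "memory": 3, "compare": 2, "branch": 1}
WRITE_PORTS = 5            # one write port per functional unit
MASK64 = (1 << 64) - 1     # registers are 64 bits wide


class ReplicatedRegisterFile:
    """One register file instance per functional unit, kept identical by broadcast writes."""

    def __init__(self, num_tpus=6, regs_per_tpu=48):
        # num_tpus is an assumed parameter; the text fixes only registers per TPU.
        self.instances = {
            unit: [[0] * regs_per_tpu for _ in range(num_tpus)]
            for unit in READ_PORTS
        }

    def write_cycle(self, writes):
        """Apply up to five (tpu, reg, value) writes to every instance in one cycle."""
        if len(writes) > WRITE_PORTS:
            raise ValueError("more writes than write ports in one cycle")
        for tpu, reg, value in writes:
            for instance in self.instances.values():
                instance[tpu][reg] = value & MASK64

    def read(self, unit, tpu, regs):
        """Read registers for one thread through the instance that serves `unit`."""
        if len(regs) > READ_PORTS[unit]:
            raise ValueError(f"{unit} instance has only {READ_PORTS[unit]} read ports")
        return [self.instances[unit][tpu][reg] for reg in regs]
```

The trade-off the model makes visible is storage for wiring: five copies of the state, but each functional unit reads only from its local copy.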

The register file instances 200a-200e can be optimized by having all ports read for a single thread in each cycle. In addition, the storage bits can be folded under the wires of the port accesses.

FIGS. 21 and 22 are block diagrams of the integer unit 30 and the compare unit 34, respectively. FIGS. 23A and 23B are block diagrams, respectively, of the floating-point unit 32 and of a fused multiply-add unit used herein. The construction and operation of these units are evident from the components, interconnections and labelling provided in the drawings.
(Consumer-producer memory)
In prior art multiprocessor systems, the synchronization overhead and programming difficulty of implementing data-based processing flow between threads or processors are very high. The processor module 5 provides memory instructions that make this easy, by allowing a thread to wait for data to become available and to be woken transparently when another thread indicates that the data is available. Such software-transparent consumer-producer memory operations enable higher-performance, fine-grained thread-level parallelism with an efficient, data-oriented consumer-producer programming style.

The illustrated embodiment provides a "fill" memory instruction, which is used by a thread that is a data producer to load data into a selected memory location and to associate a state, namely the "full" state, with that location. If the location is already in that state when the instruction is executed, an exception is signaled.

The embodiment also provides an "empty" instruction, which is used by a data consumer to obtain data from a selected location. If the location is associated with the full state, the data is read from it (e.g., into a designated register) and the instruction causes the location to be associated with the "empty" state. Conversely, if the location is not associated with the full state when the empty instruction is executed, the instruction causes the thread that executed it to transition temporarily to the idle state (or, in alternative embodiments, to an active, non-executing state) and, once the location is so associated, to transition back to the active, executing state (re-executing the empty instruction to completion). Use of the empty instruction thus lets a thread run when its data is available, with low overhead and software transparency.

In the illustrated embodiment, it is the pending (memory) event table (PET) 50 that stores status information regarding memory locations that are the subject of fill and empty operations. This includes the addresses of those locations, their respective full or empty states, and the identities of the "consumers" of data for those locations, i.e., the threads that have executed empty instructions and are waiting for the locations to be filled. The table can also include the identities of the producers of the data, which can be useful, for example, in signaling and tracking the causes of exceptions (e.g., when successive fill instructions are executed for a particular address with no intervening empty instruction).

The data for the respective locations is not stored in the PET 50 but, rather, remains in the caches and/or memory system itself, like data that is not the subject of fill and/or empty instructions. In other embodiments, the status information is stored in the memory system, e.g., alongside the locations to which it pertains and/or in separate tables, linked lists and so forth.

Thus, for example, when an empty instruction is executed on a given memory location, the PET is checked to determine whether it has an entry indicating that the same location is currently in the full state. If so, that entry is changed to empty and a read is effected, moving the data from the memory location to the register designated by the empty instruction.

If, on the other hand, there is no entry in the PET for the given memory location when the empty instruction is executed (or if any such entry indicates that the location is currently empty), an entry is created (or updated) in the PET to indicate that the given location is empty and that the thread which executed the empty instruction is a consumer for any data subsequently stored to that location by a fill instruction.

When a fill instruction is subsequently executed (presumably, by another thread), the PET is checked to determine whether it has an entry indicating that the same location is currently in the empty state. On finding such an entry, its state is changed to full and the event delivery mechanism 44 (FIG. 4) is used to route a notification to the consumer thread identified in that entry. If that thread is active and waiting in a TPU, the notification goes to that TPU, where the thread enters the active, executing state and re-executes its empty instruction, this time to completion (since the memory location is now in the full state). If that thread is in the idle state, the notification goes to the system thread (in whichever TPU it is currently executing), which causes the thread to be loaded into a TPU in the executing, active state so that the empty instruction can be re-executed.
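The PET bookkeeping walked through in the preceding paragraphs can be modelled in a few lines. The following is a minimal sketch under stated assumptions, not the hardware table: the class, method and sentinel names are hypothetical, a plain list stands in for the event delivery mechanism 44, and only one waiting consumer per location is tracked. Note that the data itself stays in the `memory` mapping, not in the table, as the text describes.

```python
WAIT = object()  # sentinel: the caller must wait and re-execute the empty later


class FillException(Exception):
    """Raised when a fill targets a location that is already full."""


class PendingEventTable:
    def __init__(self, memory):
        self.memory = memory      # address -> value; data is kept outside the table
        self.entries = {}         # address -> {"state": ..., "consumer": ...}
        self.notifications = []   # stand-in for the event delivery mechanism

    def fill(self, address, value):
        entry = self.entries.get(address)
        if entry is not None and entry["state"] == "full":
            raise FillException(f"location {address:#x} is already full")
        self.memory[address] = value
        consumer = entry["consumer"] if entry is not None else None
        self.entries[address] = {"state": "full", "consumer": None}
        if consumer is not None:
            # Tell the waiting thread to re-execute its empty instruction.
            self.notifications.append((consumer, address))

    def empty(self, address, thread):
        entry = self.entries.get(address)
        if entry is not None and entry["state"] == "full":
            entry["state"] = "empty"
            return self.memory[address]
        # No entry, or the location is empty: record this thread as the consumer.
        self.entries[address] = {"state": "empty", "consumer": thread}
        return WAIT
```

For instance, a consumer that calls `empty` before the producer has filled the location receives `WAIT`; the later `fill` queues a notification naming that consumer, and its second `empty` returns the value.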

In the illustrated embodiment, this use of the PET for consumer/producer-like memory operations occurs only with respect to selected memory instructions, e.g., fill and empty, and not with the more conventional load and store memory instructions. Thus, for example, even if a load instruction is executed with respect to a memory location that is currently the subject of an empty instruction, no notification is made to the thread that executed that empty instruction so that the instruction can be re-executed. Other embodiments may vary in this regard.

FIG. 24A depicts three interdependent threads 230, 232 and 234, the synchronization of which, and the data transfer between which, can be facilitated by fill and empty instructions according to the invention. By way of example, thread 230 is an MPEG2 demultiplexing thread responsible for demultiplexing an MPEG2 signal obtained, for example, from an MPEG2 source 236 such as a tuner or a streaming source. To continue the example, it is assumed to be in an active, executing state on TPU 10. Thread 232 is a video decoding step 1 thread responsible for a first stage of decoding a video signal from the demultiplexed MPEG2 signal. It, too, is assumed to be in an active, executing state on a TPU. Thread 234 is a video decoding step 2 thread responsible for a second stage of decoding the video signal from the demultiplexed MPEG2 signal, for output via an LCD interface 238 or other device. It is assumed to be in an active, executing state on TPU 14.

To accommodate data streaming from the source 236 in real time, each of the threads 230-234 continually processes the data provided by its upstream source and does so in parallel with the other threads. FIG. 24B illustrates use of the fill and empty instructions to facilitate this in a manner that ensures synchronization and facilitates data transfer between the threads. Referring to the drawing, arrows 240a-240g indicate fill dependencies between the threads and, particularly, between data locations written (filled) by one thread and read (emptied) by another. Thus, while thread 230 processes data destined for address A0, thread 232 executes an empty instruction directed to that location and thread 234 executes an empty instruction directed to address B0 (which thread 232 will ultimately fill). As a result of the empty instructions, thread 232 enters a wait state (e.g., active, non-executing or idle) while awaiting completion of the fill of location A0, and thread 234 enters a wait state while awaiting completion of the fill of location B0.

When thread 230's fill of A0 completes, thread 232's empty completes, allowing that thread to process the data from A0, with the result destined for B0 via a fill instruction. Thread 234 remains in a wait state, still awaiting completion of that fill. Meanwhile, thread 230 begins processing data destined for address A1, and thread 232 executes an empty instruction directed to A1, which places it in a wait state while it awaits completion of the fill of A1.

When thread 232 executes the fill for B0, thread 234's empty completes, allowing that thread to process the data from B0, with the result destined for C0, from which it is read by the LCD interface (not shown) for display to the TV viewer. The three threads 230, 232, 234 continue processing, executing fill and empty instructions in this manner (as shown in the drawing) until processing of the entire MPEG2 stream is complete.
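The three-stage flow of FIGS. 24A-24B can be imitated in software with ordinary threads. The sketch below is an analogy only: it uses Python's `threading.Condition` in place of the hardware full/empty state and event delivery, the stage functions just wrap strings, and the names `FullEmptyCell` and `run_pipeline` are hypothetical. As in the figure, each packet gets its own locations (A0, A1, ..., B0, B1, ..., C0, C1, ...).

```python
import threading


class FullEmptyCell:
    """A memory location with a full/empty state (software stand-in)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def fill(self, value):
        with self._cond:
            if self._full:
                raise RuntimeError("fill on a location that is already full")
            self._value, self._full = value, True
            self._cond.notify_all()      # wake any waiting consumer

    def empty(self):
        with self._cond:
            while not self._full:        # consumer waits until the fill arrives
                self._cond.wait()
            self._full = False
            return self._value


def run_pipeline(packets):
    n = len(packets)
    a = [FullEmptyCell() for _ in range(n)]   # demux -> decode step 1
    b = [FullEmptyCell() for _ in range(n)]   # decode step 1 -> decode step 2
    c = [FullEmptyCell() for _ in range(n)]   # decode step 2 -> display

    def demux():
        for i, packet in enumerate(packets):
            a[i].fill(f"demux({packet})")

    def decode_step1():
        for i in range(n):
            b[i].fill(f"dec1({a[i].empty()})")

    def decode_step2():
        for i in range(n):
            c[i].fill(f"dec2({b[i].empty()})")

    # Start the consumers first so that they are already waiting on empty().
    threads = [threading.Thread(target=f) for f in (decode_step2, decode_step1, demux)]
    for t in threads:
        t.start()
    frames = [c[i].empty() for i in range(n)]   # plays the role of the LCD interface
    for t in threads:
        t.join()
    return frames
```

No stage polls or takes an explicit lock in its own code; the ordering falls out of the fill/empty dependencies, which is the programming style the text is describing.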

The fill and empty instructions can be further understood by reviewing their instruction formats.
(Empty)
Format: ps EMPTY cache threads dreg, breg, ireg {, stop}
Description: Empty instructs the memory system to check the state of the effective address. If the state is full, the empty instruction changes the state to empty and loads the value into dreg. If the state is already empty, the instruction waits until the location is full, with the waiting behavior specified by the thread field.
(Operands and fields)
ps: The predicate source register that specifies whether the instruction is executed. If true, the instruction is executed; if false, it is not executed (no side effects).
stop:
  0 Specifies that an instruction group is not delineated by this instruction.
  1 Specifies that an instruction group is delineated by this instruction.
thread:
  0 Unconditionally, no thread switch
  1 Unconditional thread switch
  2 Conditional thread switch on stall (blocks execution of the thread)
  3 Reserved
scache:
  0 tbd with reuse cache hint
  1 Read/write with reuse cache hint
  2 tbd with no-reuse cache hint
  3 Read/write with no-reuse cache hint
im:
  0 Specifies the index register (ireg) for address calculation
  1 Specifies disp for address calculation
ireg: Specifies the index register of the instruction.
breg: Specifies the base register of the instruction.
disp: Specifies the two's-complement displacement constant (8 bits) for memory reference instructions.
dreg: Specifies the destination register of the instruction.
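The im, breg, ireg and disp fields listed above determine how the effective address is formed: per the Fill description, the base register is added to either the index register or the displacement, as selected by im. A minimal sketch of that calculation follows. The function names, the dictionary used as a register file, the absence of any index or displacement scaling, and the wrap to 64 bits are all assumptions made for illustration; the text specifies only the addition and the 8-bit two's-complement displacement.

```python
def sign_extend_8(disp):
    """Interpret the low 8 bits of disp as a two's-complement value."""
    disp &= 0xFF
    return disp - 0x100 if disp & 0x80 else disp


def effective_address(regs, breg, im, ireg=0, disp=0, width=64):
    """breg + ireg when im == 0, breg + disp when im == 1 (assumed unscaled)."""
    base = regs[breg]
    offset = sign_extend_8(disp) if im == 1 else regs[ireg]
    return (base + offset) & ((1 << width) - 1)   # assumed wrap-around at `width` bits
```

With `regs = {1: 0x1000, 2: 0x20}`, the indexed form `effective_address(regs, breg=1, im=0, ireg=2)` gives `0x1020`, while the displacement form with `disp=0xF8` (that is, -8) gives `0xFF8`.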
(Fill)
Format: ps FILL cache threads s1reg, breg, ireg {, stop}
Description: Register s1reg is written to the word in memory at the effective address. The effective address is calculated by adding breg (the base register) and either ireg (the index register) or disp (the displacement), based on the im (immediate memory) field. The state of the effective address is changed to full. If the state is already full, an exception is signaled.
(Operands and fields)
ps: The predicate source register that specifies whether the instruction is executed. If true, the instruction is executed; if false, it is not executed (no side effects).
stop:
  0 Specifies that an instruction group is not delineated by this instruction.
  1 Specifies that an instruction group is delineated by this instruction.
thread:
  0 Unconditionally, no thread switch
  1 Unconditional thread switch
  2 Conditional thread switch on stall (blocks execution of the thread)
  3 Reserved
scache:
  0 tbd with reuse cache hint
  1 Read/write with reuse cache hint
  2 tbd with no-reuse cache hint
  3 Read/write with no-reuse cache hint
im:
  0 Specifies the index register (ireg) for address calculation
  1 Specifies disp for address calculation
ireg: Specifies the index register of the instruction.
breg: Specifies the base register of the instruction.
disp: Specifies the two's-complement displacement constant (8 bits) for memory reference instructions.
dreg: Specifies the destination register of the instruction.
(Software events)
The processing of hardware and software events can be more fully understood by reviewing their instruction formats.
(Event)
Format: ps EVENT s1reg {, stop}
Description: The event instruction polls the event queue of the executing thread. If an event is present, the instruction completes with the event status loaded into the exception status register. If no event is present in the event queue, the thread transitions to the idle state.
(Operands and fields)
ps: The predicate source register that specifies whether the instruction is executed. If true, the instruction is executed; if false, it is not executed (no side effects).
stop:
  0 Specifies that an instruction group is not delineated by this instruction.
  1 Specifies that an instruction group is delineated by this instruction.
s1reg: Specifies the register that contains the first source operand of the instruction.
(SW event)
Format: ps SWEVENT s1reg {, stop}
Description: The SW event instruction enqueues an event into the event queue to be handled by the thread. See event format xxx.
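The EVENT and SWEVENT descriptions above can be illustrated with a minimal software sketch. It is not the patented hardware: the `Thread` class, its attribute names, and the choice to remove an event from the queue when it is polled are assumptions made only for illustration. Only the behaviors stated in the descriptions are modeled, namely the predicate check, enqueueing, loading the event state, and the transition to idle.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional


@dataclass
class Thread:
    """Hypothetical model of the event-related state of one thread."""
    name: str
    event_queue: Deque[int] = field(default_factory=deque)
    exception_state: Optional[int] = None  # stands in for the exception state register
    state: str = "executing"


def swevent(target: Thread, s1reg: int, ps: bool = True) -> None:
    """SWEVENT: enqueue the event held in s1reg for the target thread."""
    if not ps:  # false predicate: instruction not executed, no side effects
        return
    target.event_queue.append(s1reg)


def event(thread: Thread, ps: bool = True) -> None:
    """EVENT: poll the executing thread's own event queue."""
    if not ps:  # false predicate: instruction not executed, no side effects
        return
    if thread.event_queue:
        # Assumption: polling consumes the oldest queued event.
        thread.exception_state = thread.event_queue.popleft()
    else:
        thread.state = "idle"
```

How an idle thread returns to execution when a later event arrives is outside the two descriptions above, so the sketch deliberately leaves it out.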

(Operands and fields)
ps: The predicate source register that specifies whether the instruction is executed. If true, the instruction is executed; otherwise, it is not executed (no side effects).
stop:
0 Specifies that an instruction group is not delineated by this instruction.
1 Specifies that an instruction group is delineated by this instruction.
s1reg: Specifies the register that contains the first source operand of the instruction.
(CTLFLD)
Format: ps CtlFld ti cfield {, stop}
Description: The control field instruction modifies the control field specified by cfield. Other fields in the control register are not changed.
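The property stated in the description, that only the selected field changes while the rest of the control register is preserved, corresponds to an ordinary masked read-modify-write. The sketch below assumes a made-up field layout: the `FIELD_LAYOUT` table, its cfield codes, bit positions, and widths are all hypothetical and are not taken from this specification.

```python
# Hypothetical layout: cfield code -> (bit offset, width in bits).
FIELD_LAYOUT = {
    0: (0, 1),
    1: (1, 2),
}


def ctlfld(control_reg: int, cfield: int, value: int) -> int:
    """Return control_reg with only the field selected by cfield replaced."""
    offset, width = FIELD_LAYOUT[cfield]
    mask = ((1 << width) - 1) << offset
    # Clear the selected field, then insert the new value, truncated to the field width.
    return (control_reg & ~mask) | ((value << offset) & mask)
```

For example, under this assumed layout `ctlfld(0b1000, 1, 0b11)` yields `0b1110`: bits 1 and 2 are set and bit 3 is left untouched.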

(Operands and fields)
ps: The predicate source register that specifies whether the instruction is executed. If true, the instruction is executed; otherwise, it is not executed (no side effects).
stop:
0 Specifies that an instruction group is not delineated by this instruction.
1 Specifies that an instruction group is delineated by this instruction.
ti:
0 Specifies access to this thread's control register.
1 Specifies access to the control register of the thread specified by the IDt_indirect field (thread indirection) (privileged).
cfield:
cfield[4:0]   Control field   Privilege

(Devices incorporating processor module 5)
FIG. 25 is a block diagram of a digital LCD-TV subsystem 242 according to the present invention, embodied in an SoC format. The subsystem 242 includes a processor module 5 that is constructed as described above and that operates to simultaneously execute threads providing, for example, MPEG2 signal demultiplexing, MPEG2 video decoding, MPEG audio decoding, digital TV user interface operation, and operating system execution (e.g., Linux), as described above. The module 5 is coupled to DDR DRAM flash memory comprising the off-chip portion of the L2 cache 26, also as described above. The module includes an interface (not shown) to an AMBA AHB bus 244, via which it communicates with "intellectual property" or "IP" 246 that provides interfaces to the components of the digital LCD-TV, namely a video input interface, a video output interface, an audio output interface, and an LCD interface. Of course, other IP coupled to the module 5 via the AHB bus or otherwise may be provided in addition or instead. For example, in the drawing, the illustrated module 5 communicates with optional IP through which the digital LCD-TV obtains source signals and/or is controlled, such as a DMA engine 248, a high-speed I/O device controller 250, and low-speed device controllers 252 (via an APB bridge 254).

FIG. 26 is a block diagram of a digital LCD-TV or other application subsystem 256 according to the present invention, again embodied in an SoC format. The illustrated subsystem is configured as described above, except that it is depicted with APB and AHB/APB in place of the specific IP shown in FIG. 24. Depending on application needs, the elements 258 may comprise a video input interface, a video output interface, an audio output interface, an LCD interface, and so on, as in the implementation described above.

The illustrated subsystem further includes a plurality of modules 5, for example, from one to twenty (or more) such modules, coupled via an interconnect that interfaces with, and preferably forms part of, the off-chip L2 cache 26b utilized by the modules 5. That interconnect may take the form of a ring interconnect (R1) comprising a shift register bus shared by the modules 5 and, more particularly, by the L2 caches 26. Alternatively, it may be an interconnect of another form, proprietary or otherwise, that facilitates rapid movement of data within the combined memory system of the modules 5. Regardless, the L2 caches are preferably coupled so that the L2 cache of any one module 5 serves not only as the memory system for that individual processor but also contributes to a distributed all-cache memory system for all of the processor modules 5. Of course, as noted above, the modules 5 need not physically share the same memory system, chips, or buses and may instead be connected via a network or the like.
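As a purely illustrative sketch of how per-module L2 caches on a ring might together act as one distributed cache, the following model searches the requesting module's own L2 first and then visits the other modules in ring order. The specification does not describe a lookup protocol, so the `Module` class, the `ring_lookup` function, the address-to-data dictionary, and the one-directional search order are all assumptions.

```python
from typing import Dict, List, Optional, Tuple


class Module:
    """Hypothetical stand-in for a processor module and its L2 cache contents."""

    def __init__(self, ident: int) -> None:
        self.ident = ident
        self.l2: Dict[int, bytes] = {}  # address -> cached data


def ring_lookup(ring: List[Module], start: int, address: int) -> Optional[Tuple[int, bytes]]:
    """Search the L2 caches in ring order, beginning with the requesting module.

    Returns (owning module id, data), or None if no module caches the address.
    """
    count = len(ring)
    for hop in range(count):
        module = ring[(start + hop) % count]
        if address in module.l2:
            return module.ident, module.l2[address]
    return None
```

In this model a request from module 3 for an address cached only by module 2 is satisfied after the ring has been traversed through modules 3, 0, 1 and 2.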

Described above are apparatus, systems, and methods that meet the desired objectives. It will be appreciated that the embodiments described herein are examples of the invention and that other embodiments, incorporating changes therein, fall within the scope of the invention.

FIG. 1 illustrates a processor module constructed and operated in accordance with an embodiment of the present invention.
FIG. 2 compares thread processing by a conventional superscalar processor with thread processing by a processor module constructed and operated in accordance with an embodiment of the present invention.
FIG. 3 illustrates the possible states of a thread executing in a virtual processing unit (or thread processing unit (TPU)) in a processor constructed and operated in accordance with an embodiment of the present invention.
FIG. 4 illustrates an event delivery mechanism in a processor module constructed and operated in accordance with an embodiment of the present invention.
FIG. 5 illustrates a mechanism for virtual address to system address translation in a system constructed and operated in accordance with an embodiment of the present invention.
FIG. 6 illustrates the level 1 and level 2 cache organization in a system constructed and operated in accordance with an embodiment of the present invention.
FIG. 7 illustrates the L2 cache and the logic used to perform tag lookups in a system constructed and operated in accordance with an embodiment of the present invention.
FIG. 8 illustrates the logic used to perform tag lookups in the L2 extended cache in a system constructed and operated in accordance with an embodiment of the present invention.
FIG. 9 illustrates the general purpose registers, predicate registers, and thread state or control registers maintained for each thread processing unit (TPU) in a system constructed and operated in accordance with an embodiment of the present invention.
FIG. 10 illustrates a mechanism for fetching and dispatching instructions executed by threads in a system constructed and operated in accordance with an embodiment of the present invention.
FIG. 11 illustrates a queue management mechanism used in a system constructed and operated in accordance with an embodiment of the present invention.
FIG. 12 illustrates a queue management mechanism used in a system constructed and operated in accordance with an embodiment of the present invention.
FIG. 13 illustrates a system-on-a-chip (SoC) implementation of the processor module of FIG. 1, including logic for implementing thread processing units in accordance with an embodiment of the present invention.
FIG. 14 is a block diagram of a pipeline control unit in a system constructed and operated in accordance with an embodiment of the present invention.
FIG. 15 is a block diagram of an individual unit queue in a system constructed and operated in accordance with an embodiment of the present invention.
FIG. 16 is a block diagram of a branch unit in a system constructed and operated in accordance with an embodiment of the present invention.
FIG. 17 is a block diagram of a memory unit in a system constructed and operated in accordance with an embodiment of the present invention.
FIG. 18 is a block diagram of a cache unit that implements either the L1 instruction cache or the L1 data cache in a system constructed and operated in accordance with an embodiment of the present invention.
FIG. 19 illustrates an implementation of the L2 cache and logic of FIG. 7 in a system constructed and operated in accordance with an embodiment of the present invention.
FIG. 20 illustrates an implementation of the register file in a system constructed and operated in accordance with an embodiment of the present invention.
FIG. 21 is a block diagram of the integer unit and comparison unit in a system constructed and operated in accordance with an embodiment of the present invention.
FIG. 22 is a block diagram of the integer unit and comparison unit in a system constructed and operated in accordance with an embodiment of the present invention.
FIG. 23A is a block diagram of the floating point unit in a system constructed and operated in accordance with an embodiment of the present invention.
FIG. 23B is a block diagram of the floating point unit in a system constructed and operated in accordance with an embodiment of the present invention.
FIG. 24A illustrates the use of consumer and producer memory instructions in a system constructed and operated in accordance with an embodiment of the present invention.
FIG. 24B illustrates the use of consumer and producer memory instructions in a system constructed and operated in accordance with an embodiment of the present invention.
FIG. 25 is a block diagram of a digital LCD-TV in a system constructed and operated in accordance with an embodiment of the present invention.
FIG. 26 is a block diagram of a digital LCD-TV or other application subsystem in a system constructed and operated in accordance with an embodiment of the present invention.

Explanation of symbols

10 Address translation
14 Address translation
20 Address translation
22 L1 instruction cache
24 L1 data cache
26 L2 cache
28 Launch and pipeline control
30 Integer unit (iunit)
32 Floating point unit (fpunit)
34 Comparison unit (cunit)
36 Memory unit (memunit)
38 Branch unit (brunit)
44 Event-to-thread delivery
46 TPU queue
48 TPU queue
70 Data block
72a Cache tag array group
72p Cache tag array group
74a Tag compare
74p Tag compare
80 Data block
82a Data array group
82p Data array group
84a Select matched tag
84p Select matched tag
86 Physical page number
112 Insert (or fetch) pointer
116 Commit pointer
114 Extract (or issue) pointer
120 Dual-port memory or registers
130 Pipe control
138 Load store buffer
140 Thread class queue control
142 Thread class
144 Instruction dispatch
148a Unit queue
148e Unit queue
160 Control
168 Segment translation
166 Address adder
164 Thread selector
162a Thread_0 state
162b Thread_1 state
162e Thread_n state
170 Control
172 Address adder
174 Segment translation CAM
192 State machine
194 Request queue
190 Current state control logic
196 L1 cache interface
26c DDR DRAM Ctl
200a Integer register file
200b FP register file
200c Comparison register file
200d Branch register file
200e Memory register file
236 Tuner or stream generator
230 MPEG2 DMUX thread
232 Video decode step 1 thread
234 Video decode step 2 thread

Claims

An embedded processor, comprising:
A. a plurality of processing units, each executing one or more processes or threads (collectively, "threads");
B. one or more execution units that are shared by the plurality of processing units and are communicatively connected to the plurality of processing units, the execution units executing instructions from the threads; and
C. an event delivery mechanism that delivers events to the respective threads with which those events are associated, the event delivery mechanism
i. being communicatively connected to the plurality of processing units, and
ii. delivering each such event to the respective thread without execution of instructions.

The embedded processor of claim 1, wherein the thread to which an event is delivered processes that event without execution of instructions outside that thread.

The embedded processor of claim 1, wherein the events include any of hardware interrupts, software-initiated signaling events ("software events"), and memory events.

The embedded processor of claim 1, wherein the execution units execute instructions from the threads without needing to know which thread a given instruction is from.

The embedded processor of claim 1, wherein each thread is either constrained or not constrained to execute on the same processing unit during the life of that thread.

The embedded processor of claim 1, wherein at least one of the processing units is a virtual processing unit.

An embedded processor, comprising:
A. a plurality of virtual processing units, each executing one or more processes or threads (collectively, "threads"), wherein each thread is either constrained or not constrained to execute on the same processing unit during the life of that thread;
B. a plurality of execution units;
C. a pipeline control that is communicatively connected to the plurality of processing units and to the plurality of execution units, the pipeline control launching instructions from a plurality of the threads;
D. an event delivery mechanism that is communicatively connected to the plurality of processing units and that delivers events to the respective threads with which those events are associated without execution of instructions, the events including any of hardware interrupts, software-initiated signaling events ("software events"), and memory events; and
E. wherein a thread to which an event is delivered processes that event without execution of instructions outside that thread.

The embedded processor of claim 7, wherein the pipeline control comprises a plurality of instruction queues, each associated with a respective virtual processing unit.

The embedded processor of claim 8, wherein the pipeline control decodes instruction classes from the instruction queues.

The embedded processor of claim 8, wherein the pipeline control controls access by the processing units to a resource providing source and destination registers for the instructions dispatched from the instruction queues.

The embedded processor of claim 8, wherein the execution units include a branch execution unit responsible for any of instruction address generation, address translation, and instruction fetching.

The embedded processor of claim 11, wherein the branch execution unit maintains state for the virtual processing units.

The embedded processor of claim 7, wherein the pipeline control controls access by the virtual processing units to the execution units.

The embedded processor of claim 7, wherein the pipeline control signals a branch execution unit shared by the virtual processing units when the instruction queue for each virtual processing unit becomes empty.

The embedded processor of claim 7, wherein the pipeline control idles the execution units.

The embedded processor of claim 7, wherein the plurality of execution units include any of an integer unit, a floating point unit, a branch unit, a comparison unit, and a memory unit.

An embedded processor system, comprising:
A. a plurality of embedded processors;
B. a plurality of virtual processing units executing on the plurality of embedded processors, each virtual processing unit executing one or more processes or threads (collectively, "threads"), wherein each thread is either constrained or not constrained to execute on the same processing unit during the life of that thread;
C. one or more execution units that are shared by the plurality of processing units and are communicatively connected to the plurality of processing units, the execution units executing instructions from the threads and comprising any of an integer unit, a floating point unit, a branch unit, a comparison unit, and a memory execution unit; and
D. an event delivery mechanism that delivers events to the respective threads with which those events are associated, the event delivery mechanism
i. being communicatively connected to the plurality of processing units, and
ii. delivering each such event to the respective thread without execution of instructions.

The embedded processor system of claim 17, wherein the thread to which an event is delivered processes that event without execution of instructions outside that thread.

The embedded processor system of claim 18, wherein the events include any of hardware interrupts, software-initiated signaling events ("software events"), and memory events.

The embedded processor system of claim 18, wherein the branch unit is responsible for fetching instructions that are to be executed for the threads.

The embedded processor system of claim 20, wherein the branch unit is further responsible for any of instruction address generation and address translation.

The embedded processor system of claim 21, wherein the branch unit includes thread state stores that store thread state for each of the virtual processing units.

The embedded processor system of claim 18, further comprising:
A. a pipeline control that is communicatively connected to the plurality of processing units and to the plurality of execution units, the pipeline control dispatching instructions from a plurality of the threads;
B. wherein the pipeline control comprises a plurality of instruction queues, each associated with a respective virtual processing unit; and
C. wherein instructions fetched by the branch execution unit are placed in the instruction queue associated with the respective virtual processing unit on which the corresponding thread is executed.

The embedded processor system of claim 23, wherein one or more instructions are fetched at a time for a thread with the goal of keeping the instruction queues at equal levels.

The embedded processor system of claim 24, wherein the pipeline control dispatches one or more instructions at a time from a given instruction queue for execution.

The embedded processor system of claim 25, wherein the number of instructions that the pipeline control dispatches from a given instruction queue at a given time is controlled by stop flags in that queue's instruction sequence.

The embedded processor system of claim 23, wherein the pipeline control launches, and the execution units execute, a plurality of instructions from one or more threads simultaneously.

An embedded processor, comprising:
A. a plurality of processing units, each executing one or more processes or threads (collectively, "threads"), wherein each thread is either constrained or not constrained to execute on the same processing unit during the life of that thread; and
B. an event delivery mechanism that delivers events to the respective threads with which those events are associated, the event delivery mechanism
i. being communicatively connected to the plurality of processing units, and
ii. delivering each such event to the respective thread without execution of instructions.

The embedded processor of claim 28, wherein the thread to which an event is delivered processes that event without execution of instructions outside that thread.

The embedded processor of claim 28, wherein the events include any of hardware interrupts, software-initiated signaling events ("software events"), and memory events.

The embedded processor of claim 28, wherein at least one of the processing units is a virtual processing unit.

An embedded processor system, comprising:
A. a plurality of embedded processors;
B. a plurality of virtual processing units executing on the plurality of embedded processors, each virtual processing unit executing one or more processes or threads (collectively, "threads"), wherein each thread is either constrained or not constrained to execute on the same processing unit during the life of that thread; and
C. an event delivery mechanism that delivers events to the respective threads with which those events are associated, the event delivery mechanism
i. being communicatively connected to the plurality of processing units, and
ii. delivering each such event to the respective thread without execution of instructions.

The embedded processor system of claim 32, wherein the thread to which an event is delivered processes that event without execution of instructions outside that thread.

The embedded processor system of claim 33, wherein the events include any of hardware interrupts, software-initiated signaling events ("software events"), and memory events.

A method of embedded processing, comprising:
A. executing one or more processes or threads (collectively, "threads") on each of a plurality of processing units;
B. executing instructions from the threads in one or more execution units that are shared by the plurality of processing units; and
C. delivering events to the respective threads with which those events are associated without execution of instructions.

The method of claim 35, comprising processing, by the thread to which an event is delivered, that event without execution of instructions outside that thread.

The method of claim 35, wherein the events include any of hardware interrupts, software-initiated signaling events ("software events"), and memory events.

The method of claim 35, wherein executing instructions from the threads in the one or more execution units does not require knowing which thread a given instruction is from.

The method of claim 35, wherein each thread is either constrained or not constrained to execute on the same processing unit during the life of that thread.

The method of claim 35, wherein at least one of the processing units is a virtual processing unit.

A method of embedded processing, comprising:
A. executing one or more processes or threads (collectively, "threads") on each of a plurality of virtual processing units, wherein each thread is either constrained or not constrained to execute on the same processing unit during the life of that thread;
B. launching instructions from a plurality of the threads for simultaneous execution on a plurality of execution units;
C. delivering events to the respective threads with which those events are associated without execution of instructions, the events including any of hardware interrupts, software-initiated signaling events ("software events"), and memory events; and
D. processing, by the thread to which an event is delivered, that event without execution of instructions outside that thread.

The method of claim 41, wherein the launching step includes decoding instruction classes from the instruction queues.

The method of claim 41, wherein the launching step includes controlling access by the virtual processing units to a resource providing source and destination registers for the instructions dispatched from the instruction queues.

The method of claim 41, comprising performing any of instruction address generation, address translation, and instruction fetching using a branch execution unit shared by all of the virtual processing units.

The method of claim 44, comprising maintaining the state of the virtual processing units using the branch execution unit.

The method of claim 41, wherein the execution units comprise any of an integer unit, a floating point unit, a branch unit, a comparison unit, and a memory unit.

A method of embedded processing, comprising:
A. executing a plurality of virtual processing units on a plurality of embedded processors;
B. executing one or more processes or threads (collectively, "threads") on each of the plurality of virtual processing units, wherein each thread is either constrained or not constrained to execute on the same virtual processing unit and/or the same embedded processor during the life of that thread;
C. executing instructions from the threads in one or more execution units that are shared by the plurality of processing units, the execution units comprising any of an integer unit, a floating point unit, a branch unit, a comparison unit, and a memory execution unit; and
D. delivering events to the respective threads with which those events are associated without execution of instructions.

The method of claim 47, comprising processing, by the thread to which an event is delivered, that event without execution of instructions outside that thread.

The method of claim 48, wherein the events include any of hardware interrupts, software-initiated signaling events ("software events"), and memory events.

The method of claim 48, comprising fetching instructions to be executed for the threads using the branch execution unit.

The method of claim 50, comprising performing any of instruction address generation and address translation in the branch execution unit.

The method of claim 48, comprising:
A. dispatching instructions from a plurality of the threads for simultaneous execution on a plurality of the execution units,
B. wherein the dispatching step includes placing the instructions fetched for each respective thread in an instruction queue associated with the virtual processing unit on which that thread is executed.

The method of claim 52, wherein the dispatching step includes fetching one or more instructions at a time for a given thread with the goal of keeping the instruction queues at equal levels.

The method of claim 53, wherein the dispatching step includes dispatching one or more instructions at a time from a given instruction queue for execution by the execution units.

The method of claim 54, wherein the number of instructions that the pipeline control dispatches from a given instruction queue at a given time is controlled by stop flags in that queue's instruction sequence.

The method of claim 52, comprising launching and executing a plurality of instructions from one or more threads simultaneously.

A method of embedded processing, comprising:
A. executing one or more processes or threads (collectively, "threads") on each of a plurality of processing units, wherein each thread is either constrained or not constrained to execute on the same processing unit during the life of that thread; and
B. delivering events to the respective threads with which those events are associated without execution of instructions.

The method of claim 57, comprising processing, by the thread to which an event is delivered, that event without execution of instructions outside that thread.

The method of claim 57, wherein the events include any of hardware interrupts, software-initiated signaling events ("software events"), and memory events.

The method of claim 57, wherein at least one of the processing units is a virtual processing unit.

A method of embedded processing, comprising:
A. executing a plurality of virtual processing units on a plurality of embedded processors;
B. executing one or more processes or threads (collectively, "threads") on each of the plurality of virtual processing units, wherein each thread is either constrained or not constrained to execute on the same virtual processing unit and/or the same embedded processor during the life of that thread; and
C. delivering events to the respective threads with which those events are associated without execution of instructions.

The method of claim 61, wherein the thread to which an event is delivered processes that event without execution of instructions outside that thread.

The method of claim 62, wherein the events include any of hardware interrupts, software-initiated signaling events ("software events"), and memory events.