JP5661863B2 - System and method for data transfer in an execution device

Info

Publication number: JP5661863B2
Application number: JP2013125463A
Authority: JP
Inventors: スレシュ・ケー．・ベンクマハンティ; ルシアン・コドレスキュ; リン・ワン
Original assignee: Qualcomm Inc
Current assignee: Qualcomm Inc
Priority date: 2008-02-26
Filing date: 2013-06-14
Publication date: 2015-01-28
Anticipated expiration: 2029-02-03
Also published as: CN102089742A; US8145874B2; KR101221512B1; JP2011513843A; KR20100126442A; CN103365627B; EP2263149A1; WO2009108462A1; JP2013239183A; TW200943176A; US20090216993A1; CN103365627A; KR101183651B1

Description

The present disclosure relates generally to systems and methods for data transfer within an execution unit.

In a typical processor, execution of an instruction may require multiple stages. Within a program sequence, data-dependent instructions are usually separated to allow time for the first instruction to be processed through each of the stages and for its result to be written to a register before a second instruction that uses that result is executed. In this case, several data-independent instructions are used to separate the data-dependent instructions in the instruction sequence, allowing time for the result to be produced and stored before a subsequent instruction needs it. By using data-independent instructions to separate data-dependent instructions, the processor pipeline can be kept at or near full capacity and pipeline stalls can be reduced.

Modern compilers attempt to reduce execution pipeline stalls by executing instructions out of sequence. In particular, data-independent instructions and/or instructions that are ready to execute are placed ahead of instructions that are not yet ready (i.e., data-dependent instructions whose data has not yet been determined by another executing instruction). Typically, a compiler application can be used to recognize such data-dependent instructions, and the compiler application can organize the instructions in the program sequence to reduce pipeline stalls by spacing each data-dependent instruction apart from its corresponding data-generating instruction.

In a particular embodiment, a method is disclosed that includes, in an execution pipeline of an interleaved multithreaded (IMT) processor having multiple execution units, comparing, during a write-back stage at an execution unit, a write identifier associated with a result to be written to a register file by execution of a first instruction with a read identifier associated with a second instruction. When the write identifier matches the read identifier, the method further includes storing the result in a local memory of the execution unit for use by the execution unit in a subsequent read stage.

In another particular embodiment, a method is disclosed that includes determining a second address associated with a second instruction packet from a first address associated with a first instruction packet. A carry bit of an adder of a data unit is examined to determine whether the second address crosses a boundary of a cache line associated with a multi-way cache. When the boundary is not crossed, the multi-way cache is accessed and data is retrieved from the second address using the tag array data and translation lookaside buffer (TLB) lookup data that a previous tag array lookup operation determined for the first address.

In yet another particular embodiment, a multithreaded processor is disclosed that includes an execution unit having a local memory to store one or more data values. The execution unit further includes logic circuitry adapted to determine whether a read address associated with a read operation matches a write-back address associated with a previous write-back operation. The logic circuitry is adapted to store the one or more data values in the local memory when the read address matches the write-back address.

In yet another particular embodiment, a processor is disclosed that includes means for comparing, in an execution pipeline of an interleaved multithreaded (IMT) processor having multiple execution units, a write identifier associated with a result to be written to a register file by execution of a first instruction packet with a read identifier associated with a second instruction packet. The processor further includes means for selectively storing the result locally at an execution unit, for use in executing the second instruction packet, when the write identifier matches the read identifier.

One particular advantage provided by embodiments of a processor having data transfer logic and local memory is that the result of executing a first instruction can be stored locally and used in executing a second instruction without performing a register file read operation. By selectively skipping register file read operations, the power consumption of the register file can be reduced.

Another particular advantage is that a tag array lookup operation can be selectively skipped when the second address of a second instruction is associated with the same cache line as the first address of a first instruction. In that case, the tag array lookup operation for the second address can be skipped, and the tag array information determined by the previous lookup operation for the first address can be reused. By selectively skipping tag array lookup operations, overall power consumption can be reduced.

Yet another particular advantage is that the same logic circuitry can be used both to selectively transfer data and to selectively skip tag array lookup operations and TLB lookup operations. Additionally, an assembler or compiler can be used to arrange instruction packets so as to create opportunities for data transfer (i.e., intra-slot forwarding), for reuse of tag array information (i.e., skipping tag array lookup operations), and for selectively skipping TLB lookup operations. Such data transfer and selective skipping of tag lookup and/or TLB lookup operations can reduce the overall number of read operations and thereby reduce power consumption.

Other aspects, advantages, and features of the present disclosure will become apparent after review of the entire application, including the following sections: Brief Description of the Drawings, Detailed Description, and the Claims.

FIG. 1 is a block diagram of a particular illustrative embodiment of a system that includes an execution unit adapted to transfer data.
FIG. 2 is a block diagram of a particular illustrative embodiment of an execution unit adapted to transfer data.
FIG. 3 is a block diagram of a particular embodiment of a system that includes a shared control unit having data transfer logic and tag array lookup / translation lookaside buffer (TLB) lookup skip logic.
FIG. 4 is a block diagram of a particular embodiment of a processor that includes a programmable logic circuit (PLC) adapted to selectively transfer data and to selectively skip tag array lookup operations and translation lookaside buffer (TLB) lookup operations.
FIG. 5 is a timing diagram of an illustrative embodiment of processes within an execution pipeline adapted to transfer data.
FIG. 6 is a diagram of a particular illustrative example of transfer logic within an execution pipeline.
FIG. 7 is a timing diagram of an illustrative embodiment of processes within an execution pipeline adapted to skip a tag array lookup operation.
FIG. 8 is a block diagram of a particular illustrative embodiment of a system adapted to selectively transfer data and to selectively skip a tag array lookup operation or a translation lookaside buffer (TLB) lookup operation.
FIG. 9 is a flow diagram of a particular illustrative embodiment of a method of transferring data within an execution unit.
FIG. 10 is a flow diagram of a particular illustrative embodiment of a method of selectively skipping a tag array lookup operation.
FIG. 11 is a flow diagram of a particular illustrative embodiment of a method of selectively skipping a tag array lookup operation and/or a translation lookaside buffer (TLB) lookup operation.
FIG. 12 is a block diagram of a particular illustrative embodiment of a communication device that includes an execution unit having transfer logic and lookup skip logic.

Detailed description

FIG. 1 is a block diagram of a particular illustrative embodiment of a processing system 100 that includes at least one execution unit having transfer logic and local memory. The processing system 100 includes a memory 102 adapted to communicate with an instruction cache 106 and a data cache 112 via a bus interface 104. The instruction cache 106 is coupled to a sequencer 114 by a bus 110. In addition, the sequencer 114 is adapted to receive interrupts, such as general interrupts 116, which may be received from an interrupt register. The sequencer 114 is also coupled to supervisor control registers 132 and global control registers 134.

In a particular embodiment, the instruction cache 106 is coupled to the sequencer 114 via a plurality of current instruction registers, which may be coupled to the bus 110 and associated with particular threads of the processing system 100. In a particular embodiment, the processing system 100 is an interleaved multithreaded processor including six threads.

The sequencer 114 is coupled to a first instruction execution unit 118, a second instruction execution unit 120, a third instruction execution unit 122, and a fourth instruction execution unit 124. Each instruction execution unit 118, 120, 122, and 124 can be coupled to a general register file 126 via a second bus 128. The general register file 126 can also be coupled to the sequencer 114, the data cache 112, and the memory 102 via a third bus 130. The supervisor control registers 132 and the global control registers 134 can store bits that may be accessed by control logic within the sequencer 114 to determine whether to accept interrupts and to control execution of instructions.

The first execution unit 118 includes transfer logic 136 and a local memory 138. The second execution unit 120 includes transfer logic 140 and a local memory 142. The third execution unit 122 includes transfer logic 144 and a local memory 146. The fourth execution unit 124 includes transfer logic 148 and a local memory 150. Although each of the execution units 118, 120, 122, and 124 is shown as including transfer logic (i.e., transfer logic 136, 140, 144, and 148, respectively), it should be understood that, in particular embodiments, transfer logic such as the transfer logic 136 can be shared with other execution units, such as the execution units 120, 122, and 124. For example, in a particular embodiment, the execution unit 118 may include the transfer logic 136 and the local memory 138, while the other execution units 120, 122, and 124 include the local memories 142, 146, and 150 and share the transfer logic 136. In a particular embodiment, one or more of the execution units 118, 120, 122, and 124 may share a local memory. For example, the execution units 118 and 120 may share the local memory 138, and the execution units 122 and 124 may share the local memory 146. In another particular embodiment, the transfer logic 136 may reside outside the execution unit 118 and communicate with the execution units 118, 120, 122, and 124, as illustrated with respect to the control unit 406 and the execution units 408, 410, 412, and 414 of FIG. 4.

In a particular embodiment, the processing system 100 is adapted to receive a first instruction packet that is executable by the execution units 118, 120, 122, and 124, and to receive a second instruction packet that depends on a result from the first instruction packet. The first instruction packet can include four instructions, which can be provided to the execution units 118, 120, 122, and 124. The execution units 118, 120, 122, and 124 can process the instructions from the first instruction packet through multiple stages, including a decode stage, a register file access stage, multiple execution stages, and a write-back stage. In the write-back stage, the transfer logic 136 of the execution unit 118 may determine that a read address of the second instruction packet matches a write-back address of the first instruction packet, and may write the data back to the general register file 126 and also store it locally in the memory 138. In an alternative embodiment, the execution unit 118 may decode at least a portion of the instructions of each received instruction packet to determine a read address and a write-back address for each instruction. The transfer logic 136 can be adapted to compare the read address of the second packet with the write-back address of the first instruction packet and to send a data transfer control signal to the other execution units (such as the instruction execution units 120, 122, and 124) so that they store the data locally (i.e., in the respective local memories 142, 146, and 150). The data can then be retrieved from the memories 138, 142, 146, and 150 and used in executing the instructions of the second (subsequent) instruction packet.
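The write-back comparison described above can be illustrated with a minimal Python sketch. It is not the patented circuit: the `ExecutionUnit` class, its `write_back` method, and the use of dictionaries for the register file and local memory are hypothetical names and simplifications introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ExecutionUnit:
    """Toy model of one execution unit with a small local memory (hypothetical)."""
    local_memory: Dict[int, int] = field(default_factory=dict)

    def write_back(self, register_file: Dict[int, int], write_addr: int,
                   result: int, next_read_addr: Optional[int]) -> bool:
        """Write a result back and keep a local copy if the next packet reads it.

        Returns True when the result was also stored locally.
        """
        # The result is always written to the general register file.
        register_file[write_addr] = result
        # Compare the write-back address with the read address of the next packet.
        if next_read_addr is not None and next_read_addr == write_addr:
            self.local_memory[write_addr] = result
            return True
        return False
```

In this sketch the register file write always happens; only the extra local copy is conditional, which mirrors the description of writing the data back and also storing it locally when the addresses match.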

In a particular example, the transfer logic 136 can detect that an instruction of the second instruction packet uses a result from the first instruction packet. In particular, the first instruction writes data to the same location from which the second instruction reads data. In this example, the transfer logic 136 is adapted to determine that the result of an instruction in the first instruction packet is used by an instruction in the second instruction packet. As illustrative, non-limiting examples, the transfer logic 136 may receive a signal from control logic (not shown) that has access to upcoming instructions via the instruction cache 106 or the sequencer 114; or the transfer logic 136 may detect a transfer indicator, such as a designated bit in the first packet that can be set by an assembler, a compiler, the sequencer 114, or other circuitry; or the transfer logic 136 may predict the use of an instruction's result based at least in part on the instruction type. In another embodiment, the transfer logic 136 can be configured to operate in a first mode, in which every instruction result is stored locally for use by subsequent instructions, or in a second mode, in which no instruction results are stored. In addition to writing the execution result to the general register file 126 via the bus 128, the transfer logic 136 causes the execution unit 118 to store the result in the local memory 138. When a data-dependent instruction from the second instruction packet is provided to the execution unit 118, the transfer logic 136 causes the execution unit 118 to skip the register read operation, to access the result stored in the local memory 138, and to use that result in executing the instruction from the second instruction packet. By using the transfer logic 136, the execution unit 118 thus reduces the number of read operations on the general register file 126.
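The read-side saving can be sketched as follows: an operand found in the local memory is used directly, and the register file is read only on a miss. The `RegisterFile` class with its read counter and the `fetch_operand` helper are hypothetical and exist only to make the saved reads visible.

```python
from typing import Dict


class RegisterFile:
    """Toy register file that counts how many read operations it serves."""

    def __init__(self, size: int = 32) -> None:
        self.values = [0] * size
        self.reads = 0

    def read(self, index: int) -> int:
        self.reads += 1
        return self.values[index]

    def write(self, index: int, value: int) -> None:
        self.values[index] = value


def fetch_operand(reg_file: RegisterFile, local_memory: Dict[int, int],
                  reg: int) -> int:
    """Return the operand, skipping the register file read on a local hit."""
    if reg in local_memory:
        # Locally stored result: no register file read is performed.
        return local_memory[reg]
    return reg_file.read(reg)
```

Counting reads in this way is only a proxy for the power saving the description attributes to skipped register file reads; it does not model power itself.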

By compiling instruction packets so that data-dependent instructions are placed in adjacent packets of the program sequence, a compiled application can take advantage of the transfer logic 136, 140, 144, and 148 of the execution units 118, 120, 122, and 124 to increase power savings. Data generated by an earlier instruction can be stored in the local memory 138, 142, 146, or 150, such as a buffer, latch, flip-flop, local register, or other memory element, so that it can be used by the adjacent packet without a register read being performed for that packet. In an illustrative embodiment in which data can be transferred between non-adjacent packets, the local memory 138, 142, 146, or 150 may include one or more registers to hold the data locally and temporarily while one or more intervening packets are processed. In particular, by placing data-dependent instructions in adjacent instruction packets, the compiler increases the potential for data transfer, thereby increasing the number of skipped read operations and reducing overall power consumption.

In a particular example, the execution unit 118 includes transfer logic 136 for transferring operands (and/or data) from one instruction packet to the next instruction packet. Such data transfer reduces the overall number of register file read operations and the overall power consumption of the register file. An example of a data-dependent instruction packet pair is provided in Table 1 below.

In this example, the particular instructions executed are not relevant to the disclosure, except that the value stored in register pair R7:6, which an execution unit produces during execution of the VALIGNB instruction associated with the first instruction packet, is used by the execution unit during execution of the VDMPY instruction associated with the second packet. In a particular example, the assembler or compiler can arrange the instructions so that both the VALIGNB and the subsequent VDMPY execute in the same execution slot, such as the execution unit 118. Additionally, the assembler or compiler can place the second instruction packet immediately after the first instruction packet in the program sequence.
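A compiler or assembler pass could check for such an opportunity along the following lines. This is a hedged sketch: the `Instr` tuple and `forwardable_slots` function are hypothetical, and the operands other than the R7:6 register pair are placeholders chosen for illustration, not the contents of Table 1.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    opcode: str
    dest: str
    srcs: Tuple[str, ...]


Packet = List[Optional[Instr]]  # one entry per execution slot, None if empty


def forwardable_slots(first: Packet, second: Packet) -> List[int]:
    """Slots where the second packet reads what the first packet wrote."""
    hits = []
    for slot, (producer, consumer) in enumerate(zip(first, second)):
        if producer is None or consumer is None:
            continue
        # Intra-slot forwarding needs producer and consumer in the same slot.
        if producer.dest in consumer.srcs:
            hits.append(slot)
    return hits


# Placeholder operands; only the R7:6 dependency comes from the text.
first_packet = [Instr("VALIGNB", "R7:6", ("R9:8", "R7:6")), None]
second_packet = [Instr("VDMPY", "R17:16", ("R13:12", "R7:6")), None]
print(forwardable_slots(first_packet, second_packet))
```

If the VDMPY were scheduled in a different slot from the VALIGNB, this check would report no opportunity, which is why the text has the assembler or compiler place both in the same slot.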

FIG. 2 is a block diagram of a portion of a system 200 that includes an execution unit 202 having transfer logic 220 and a memory 222. The system 200 includes a memory device 204 that is external to the execution unit 202 and has a plurality of memory locations 208, 210, 212, 214, 216, and 218. Each of the memory locations 208, 210, 212, 214, 216, and 218 may be associated with a memory address that is accessible to the execution unit 202 via a bus 206. In general, the memory locations 208, 210, 212, 214, 216, and 218 are separated from the execution unit 202 by bus traces of differing lengths. Additionally, power is consumed each time the execution unit 202 accesses a particular memory location within the memory 204. In general, the execution unit 202 is adapted to receive an instruction, decode the instruction, access a register file of the memory 204 to retrieve data, execute the instruction using the retrieved data, and write data back to the memory 204.

The execution unit 202 includes the transfer logic 220 and the local memory 222. The transfer logic 220 is adapted to detect when data produced by execution of a particular instruction is to be used in executing the next instruction in the program sequence. In that case, the execution unit 202 is adapted to use the transfer logic 220 to store the result of executing the first instruction in the local memory 222. During execution of the next instruction, the execution unit 202 can skip the register file read operation or memory read operation and use the data stored in the local memory 222, thereby avoiding the memory read operation and saving power. In general, by reducing overall memory accesses, power-consuming memory read operations can be selectively avoided and power consumption reduced.

FIG. 3 is a block diagram of a system 300 that includes a shared control unit 304 having data transfer logic 306 and lookup skip logic 308. The system 300 includes an instruction cache 302 coupled to the shared control unit 304. The shared control unit 304 is coupled to a service unit 314, a memory unit 316, and a data unit 318. The shared control unit 304 is also coupled to a source register file 312, which communicates with an instruction unit 310. The instruction unit 310 and the data unit 318 also communicate via a bus unit 322, which is coupled to a memory 324, such as a multi-way cache memory. The service unit 314, the memory unit 316, and the data unit 318 are coupled to a destination register file 320.

In a particular illustrative embodiment, the system 300 receives an instruction packet, and the data unit 318 can execute an instruction of the packet to produce a result. The shared control unit 304 is adapted to use the data transfer logic 306 to determine whether the result is used by a subsequent instruction packet. The shared control unit 304 is adapted to communicate with the service unit 314, the memory unit 316, and the data unit 318 to skip the subsequent register file read operation. Additionally, the shared control unit 304 is adapted to communicate with the data unit 318 to instruct the data unit 318 to store the result in a local memory, such as the memory 222 shown in FIG. 2 or the local memories 138, 142, 146, and 150 illustrated in FIG. 1. The shared control unit 304 is also adapted to control the service unit 314, the memory unit 316, and the data unit 318 so that the locally stored data is used during execution of the subsequent instruction packet. In a particular embodiment, the service unit 314, the memory unit 316, and the data unit 318 jointly perform processing operations similar to those performed by the execution units 118, 120, 122, and 124 shown in FIG. 1.

In another particular embodiment, the shared control unit 304 is adapted to use the lookup skip logic 308 to determine whether to skip a tag array lookup operation, such as when a first memory address associated with a first instruction lies within the same cache line of the memory as a second memory address associated with a second instruction. In a particular example, the system 300 can operate in an "auto-increment address" mode, in which the data unit 318 determines a first memory address and calculates a second memory address from it. For example, the data unit 318 may determine a first memory address (A) and calculate a second memory address (A+8). In this particular example, the data unit 318 receives at least one instruction associated with a first instruction packet. The data unit 318 is adapted to determine a memory address associated with the instruction and to calculate the second memory address.
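One way to test whether the auto-incremented address stays in the same cache line is to look at the carry out of the line-offset bits of the adder, as the carry-bit check mentioned earlier in the disclosure suggests. The sketch below assumes a 32-byte cache line (the example size given elsewhere in the description) and an increment smaller than one line; the function name and constants are hypothetical.

```python
LINE_BYTES = 32   # assumed cache line size
OFFSET_BITS = 5   # log2(LINE_BYTES)


def crosses_cache_line(first_addr: int, increment: int) -> bool:
    """Return True if first_addr + increment falls in a different cache line.

    Only the low OFFSET_BITS bits take part in the addition; a carry out of
    those bits means the line boundary was crossed.
    """
    if not 0 <= increment < LINE_BYTES:
        raise ValueError("sketch only handles increments smaller than one line")
    offset = first_addr & (LINE_BYTES - 1)
    carry = (offset + increment) >> OFFSET_BITS
    return carry == 1
```

For A = 0x100 and an increment of 8, the offset sum is 8 and there is no carry, so the tag array lookup for A+8 could be skipped; for A = 0x118 the sum is 32, the carry is set, and a new lookup would be needed.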

In a particular example, the memory address may be a virtual memory address that relates to a physical memory address in an n-way cache memory. In this example, the data unit 318 can translate the virtual address to a physical address by performing a translation lookaside buffer (TLB) lookup operation to determine the physical memory address. The data unit 318 can perform a tag array lookup operation to determine tag data that identifies the way in the data array associated with the physical memory address. The data unit 318 can use the tag data and the physical memory address information to retrieve data from the n-way cache memory. The tag data (including the way associated with the multi-way cache) can be stored in local memory together with the second memory address. When the second memory address is retrieved for use by the data unit 318, the data unit 318 can determine whether the second memory address and the first memory address are in the same cache line. If the first and second memory addresses are associated with the same cache line in the n-way cache memory, the lookup skip logic circuit 308 is adapted to instruct the data unit 318 to skip the subsequent tag array lookup operation and to use the way from the first memory address to access the data in the n-way cache memory associated with the second memory address. If the first and second memory addresses are associated with different cache lines in the n-way cache memory, the lookup skip logic circuit 308 is adapted to instruct the data unit 318 to perform a tag array lookup operation without performing a TLB lookup operation. If the data unit 318 determines that the second memory address crosses a page boundary (that is, exceeds the page size), the lookup skip logic circuit 308 instructs the data unit 318 to perform both a TLB lookup operation and a tag array lookup operation to determine the physical address and tag data associated with the second memory address.
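The three cases described above can be sketched as a small decision function. This is a minimal illustration and not the circuit of the disclosure: the names `Plan` and `plan_lookups` are hypothetical, and the line and page sizes are assumed values chosen only to make the example concrete.

```python
from enum import Enum

LINE_BYTES = 32     # assumed cache line size (illustrative)
PAGE_BYTES = 4096   # assumed page size in bytes (illustrative)


class Plan(Enum):
    REUSE_WAY = "skip TLB and tag array lookups, reuse the stored way"
    TAG_ONLY = "skip TLB lookup, perform tag array lookup"
    TLB_AND_TAG = "perform TLB lookup and tag array lookup"


def plan_lookups(first_addr: int, second_addr: int) -> Plan:
    """Decide which lookups the second access needs, given the first access."""
    if first_addr // PAGE_BYTES != second_addr // PAGE_BYTES:
        return Plan.TLB_AND_TAG   # page boundary crossed
    if first_addr // LINE_BYTES != second_addr // LINE_BYTES:
        return Plan.TAG_ONLY      # same page, different cache line
    return Plan.REUSE_WAY         # same cache line
```

Under these assumed sizes, addresses 0x1000 and 0x1008 share a cache line, 0x1018 and 0x1020 share only a page, and 0x1FF8 and 0x2000 share neither.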

In a particular example, a page of the n-way cache memory is larger than a cache line. For example, a cache line may contain 32 bytes and a page may be approximately 4096 bits (approximately 4 kilobits). In this case, if the auto-increment address advances by 8 bytes, the tag array data can be reused three times before the auto-increment address calculation moves on to the next cache line (assuming the cache lines are accessed sequentially), and the page translation from a first TLB lookup operation can be reused many times (approximately 511 times) before another TLB lookup operation has to be performed.
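The reuse counts in this example follow from simple division. The sketch below is an illustration under stated assumptions: the helper `reuses` is hypothetical, accesses are assumed to start at the beginning of the region and proceed sequentially, and the page is taken to span 4096 addressable bytes, which is the reading that reproduces the figure of approximately 511 with an 8-byte step.

```python
def reuses(region_bytes: int, step_bytes: int) -> int:
    """Number of times a lookup result can be reused within one region.

    The first access in the region performs the lookup; every further
    access that lands in the same region reuses its result.
    """
    accesses = region_bytes // step_bytes
    return accesses - 1


tag_reuses_per_line = reuses(32, 8)     # 4 accesses per 32-byte line
tlb_reuses_per_page = reuses(4096, 8)   # 512 accesses per assumed 4096-byte page
```

With an 8-byte step this gives three tag array reuses per cache line and 511 page translation reuses per page.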

In particular, when the memory address accessed by a subsequent instruction is associated with the same cache line as the preceding memory access, the tag array data obtained in the preceding tag array lookup operation can be reused for the subsequent memory address, which avoids a subsequent tag array lookup operation. In another particular example, performing the TLB lookup operation selectively, only when a page boundary is crossed, reduces the number of TLB accesses and therefore the overall power consumption.

FIG. 4 is a block diagram of a processor system 400. The processor system 400 includes an instruction unit 402 and an interrupt register 404, both coupled to a control unit 406. The control unit 406 is coupled to a plurality of execution units 408, 410, 412 and 414. The execution units 408, 410, 412 and 414 can include local memories 426, 428, 430 and 432, respectively.

The control unit 406 includes a decoder 416, a control register file 418, a general register file 420, a programmable logic controller (PLC) circuit 422 and an in-silicon debugger (ISDB) circuit 424. The ISDB circuit 424 provides a Joint Test Action Group (JTAG) based hardware debugger that can be used to debug software while the processor system 400 is running. In a particular embodiment, the ISDB circuit 424 supports debugging of individual threads: it allows thread execution to be halted and allows instruction and data memory, including the control register file 418 and the general register file 420, to be monitored and modified.

In a particular illustrative embodiment, the decoder 416 receives and decodes instructions. The decoder 416 communicates data related to the decoded instructions to the PLC circuit 422, which can include logic that detects when a first instruction packet in a sequence of instruction packets generates a result that is used by a second instruction packet. On detecting such a data dependency between consecutive instruction packets, the PLC circuit 422 is adapted to generate a control signal for at least one of the execution units 408, 410, 412 and 414, namely the one executing the data-generating instruction, so that the result is stored in the respective local memory 426, 428, 430 or 432. The PLC 422 is adapted to control the general register file 420 and the decoder 416 to route the data-dependent instruction from the subsequent instruction packet to the selected execution unit (for example, the execution unit 408), so that the execution unit can use the locally stored data (that is, the data stored in the local memory 426) while executing the subsequent instruction. In this example, when the result is stored locally, the PLC 422 can also control the execution unit 408 and the bus 434 to prevent the execution unit 408 from accessing memory (such as the general register file 420) to retrieve the result.

In a particular example, the execution unit 408 can receive a data-generating instruction from the control unit 406, execute it, and write the result back to the general register file 420. The execution unit 408 can also store the result in the local memory 426 in response to a control signal received from the PLC 422 of the control unit 406. The execution unit 408 can then receive from the control unit 406 a next instruction that uses the result stored in the local memory 426. The execution unit 408 can access the local memory 426 to retrieve the stored result and execute the next instruction using it. In this particular example, the execution unit 408 can execute the next instruction without reading the result back from the general register file 420, which skips a register file read operation and saves power.

In another particular embodiment, the control unit 406 is adapted to selectively reuse tag array data determined by a tag array lookup operation. For example, when a second address is calculated from a first address using the auto-increment function, the PLC 422 can examine a carry bit to determine when the second address is associated with a different cache line than the first address. For example, if the cache line is 32 bytes wide, the fifth bit of the second address represents the carry bit. When the carry bit changes, the second address is associated with the next cache line in the cache memory. In general, the PLC 422 instructs the execution units 408, 410, 412 and 414 to reuse the tag array data from the preceding tag array lookup operation until the carry bit indicates that the second address is associated with a different cache line than that of the first instruction. In that case, the PLC 422 causes the execution units 408, 410, 412 and 414 to perform a new tag array lookup operation without performing a translation lookaside buffer (TLB) lookup operation.

In yet another particular embodiment, the control unit 406 is adapted to selectively perform a translation lookaside buffer (TLB) lookup operation. In particular, the PLC 422 examines a carry bit from the calculation of the second memory address to determine when the calculated memory address indicates that a page boundary has been crossed. For example, if the page size of the memory array is approximately 4096 bits (that is, 4 kilobits), the eleventh bit of the second memory address may represent the carry bit. Thus, when the eleventh bit of the second memory address changes, a page boundary has been crossed, and the PLC 422 can cause one of the execution units 408, 410, 412 or 414 to initiate a TLB lookup operation followed by a tag array lookup operation. In this example, tag array lookup operations occur more frequently than TLB lookup operations. The PLC 422 is adapted to selectively skip one or both of the tag array lookup operation and the TLB lookup operation to reduce overall power consumption.
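The carry-bit checks for the cache line and for the page can be modeled as a carry out of the low-order offset bits during the auto-increment addition. The following is a sketch under assumptions, not the PLC logic itself: `carry_out` and `auto_increment` are hypothetical names, the step is assumed to be non-negative, and the offset widths are parameters. Five offset bits correspond to the 32-byte cache line of the example; twelve offset bits are an assumption for a 4096-byte page and are not a figure taken from the text.

```python
LINE_OFFSET_BITS = 5    # 32-byte cache line: offsets occupy bits 0..4
PAGE_OFFSET_BITS = 12   # assumed 4096-byte page: offsets occupy bits 0..11


def carry_out(addr: int, step: int, offset_bits: int) -> bool:
    """True if adding step to the low offset_bits of addr carries out of them."""
    mask = (1 << offset_bits) - 1
    return (addr & mask) + step > mask


def auto_increment(addr: int, step: int = 8):
    """Return the next address plus the line-carry and page-carry flags."""
    line_carry = carry_out(addr, step, LINE_OFFSET_BITS)
    page_carry = carry_out(addr, step, PAGE_OFFSET_BITS)
    return addr + step, line_carry, page_carry
```

No carry means the stored way can be reused, a line carry alone calls for a new tag array lookup, and a page carry calls for a TLB lookup followed by a tag array lookup.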

FIG. 5 is a diagram of a particular example of an instruction cycle 500 of an execution unit that includes data transfer logic. In general, the instruction cycle 500 represents multiple stages of the execution unit from the perspective of a particular thread. The execution unit generally processes data and instructions in one or more stages, including a write-back stage 502, a decode stage 504, a register read stage 506, one or more execution stages 508, 510 and 512, and a second write-back stage 514. It should be understood that the instruction cycle 500 contains only one write-back stage (the write-back stage 514), after which the execution cycle repeats starting at the decode stage 504. The write-back stage 502 is shown for explanatory purposes.

In general, in the write-back stage 502, at 516, the result of a previously executed instruction is written back to a register, such as a general register file. At 518, the next instruction packet (which can contain one to four instructions) is received, and the read identifiers of the received packet are compared with the write identifier associated with the result being written to the register. When a read identifier matches the write identifier, the result is stored locally in the execution unit (at 520) as well as being written back to the register at 516. In this case, at 522, the register read (at 506) can be skipped and the data stored locally in the execution unit can be used. At 524, the instruction is executed using at least one of the data read in the register read stage (506) or the data stored locally in the execution unit. Thus, when a read identifier matches the write identifier (at 518), the register read stage (506) can be skipped and the locally stored data can be used, which makes data transfer possible.
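The flow at 516 through 524 can be illustrated with a toy software model. This is a sketch under assumptions rather than the hardware of the disclosure: the class `ExecutionUnit` and its methods are hypothetical, the register file is modeled as a dictionary, and only the most recent result is kept locally.

```python
class ExecutionUnit:
    """Toy model of one execution unit with a small local store (hypothetical)."""

    def __init__(self, register_file):
        self.register_file = register_file  # dict: register id -> value
        self.local = {}                      # result kept locally for the next packet
        self.register_reads = 0              # counts register file read operations

    def write_back(self, write_id, result, next_read_ids):
        self.register_file[write_id] = result  # 516: write back to the register file
        self.local = {}                        # drop any older local result
        if write_id in next_read_ids:          # 518: read id matches write id
            self.local[write_id] = result      # 520: also keep the result locally

    def read_operand(self, reg_id):
        if reg_id in self.local:               # 522: skip the register read
            return self.local[reg_id]
        self.register_reads += 1               # 506: ordinary register read
        return self.register_file[reg_id]
```

For example, after `write_back(3, 30, next_read_ids=(3, 1))`, reading register 3 is served from the local store and only the read of register 1 reaches the register file.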

In a particular illustrative embodiment, the execution unit stages 504, 506, 508, 510, 512 and 514 shown in FIG. 5 represent the cycle of an execution unit in an interleaved multithreaded processor. In addition, the write-back stage 502 represents the final stage of the execution cycle of the preceding instruction. Without performing a register file read operation in the register read stage 506, the data from the preceding instruction can be retrieved from the local memory of the execution unit (at 522) and processed together with the next instruction at 524 (that is, in one or more of the execution stages 508, 510 and 512). In a particular illustrative embodiment, each of the stages 504, 506, 508, 510, 512 and 514 may represent a clock cycle in which a particular operation is performed.

FIG. 6 is a block diagram of a particular illustrative embodiment of data transfer logic 600 in an execution unit of a processor. Here, the data transfer logic 600 is shown with respect to a write-back stage 602, a decode stage 604 and a register file read stage 606. In the illustrative embodiment, the data transfer logic 600 represents a single processing slot among a plurality of slots, such as the representative slot 2, and can perform read and write operations using representative registers "S" and "T" of the register file.

With respect to the write-back stage 602, the transfer logic 600 includes comparators 608 and 610, an OR gate 611, inverters 614 and 616, AND gates 618 and 620, and a register file 612. The transfer logic 600 also includes a transfer enable flip-flop circuit 636 and a transfer data flip-flop circuit 638. The comparator 608 receives next packet register "S" (Rs) read identifier information 622 and current packet write identifier information 624 as inputs and provides an output coupled to the input of the inverter 614. The output of the inverter 614 is coupled to a first input of the AND gate 618, and a second input of the AND gate 618 is coupled to a slot 2 register "S" read enable (s2RsRdEn) input 632. The AND gate 618 also has an output coupled to the slot 2 register of the register file 612. The comparator 610 receives next packet register "T" (Rt) read identifier information 626 (which may be the same as the next packet read identifier information 622) and current packet write identifier information 628, and provides an output that is coupled through the inverter 616 to an input of the AND gate 620. The AND gate 620 also receives a slot 2 register "T" read enable (s2RtRdEn) input 634 at a second input and provides an output coupled to the slot 2 register of the register file 612. The outputs of the comparators 608 and 610 are also provided as inputs to the transfer enable flip-flop 636 and as inputs to the OR gate 611, which provides an enable input to the transfer data flip-flop 638. The transfer data flip-flop 638 also receives data from the execution unit data write-back 630.

In the decode stage 604 of the transfer logic 600, the output of the transfer enable flip-flop 636 is provided as an input to a second transfer enable flip-flop 640 and as an enable input to a second transfer data flip-flop 642. The transfer data flip-flop 638 provides the data input to the second transfer data flip-flop 642.

In the register file read stage 606, the second transfer enable flip-flop 640 provides a transfer enable signal to the select input of a first multiplexer 644 and to the select input of a second multiplexer 646. The first multiplexer 644 receives the transferred data at a first input and register (s) data at a second input, and provides an output 648 carrying either the transferred data or the register (s) data for use in executing the next instruction packet. The second multiplexer 646 receives the transferred data at a first input and register (t) data at a second input, and provides an output 650 carrying either the transferred data or the register (t) data for use in executing the next instruction packet.

In general, the comparator 608 is adapted to receive the next packet read identifier information 622 and the current packet write identifier information 624. The comparator 610 is adapted to receive the next packet read identifier information 626 and the current packet write identifier information 628. When one of the next packet read identifiers 622 and 626 matches one of the current packet write identifiers 624 and 628, the comparator that detected the match (one of the comparators 608 and 610) provides a logic 1 value at its output, which enables the transfer data flip-flop 638 and, through the respective inverter 614 or 616 and the respective AND gate 618 or 620, disables the corresponding register read enable.

In a particular illustrative embodiment, when the next packet read identifier 622 matches the current packet write identifier information 624, the comparator 608 provides a logic high output as an input to the transfer data flip-flop 638. The inverter 614 inverts the logic high output and provides a logic low value as an input to the AND gate 618, which disables the slot 2 register (s) read enable to the register file 612. The transfer data flip-flop 638 receives data from the execution unit through the write-back input 630 and stores it. In the decode stage 604, the data is passed to the second transfer data flip-flop 642. The transferred data is provided to the first multiplexer 644 and the second multiplexer 646, and is selectively provided at one of the first output 648 and the second output 650 according to the output of the second transfer enable flip-flop 640. The second transfer enable flip-flop 640 provides the output of the comparator 608 to the first multiplexer 644 and the output of the comparator 610 to the second multiplexer 646, to select between the transferred data from the second transfer data flip-flop 642 and the register data.
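The gate-level behavior of FIG. 6 for one slot can be approximated in software. The sketch below makes several simplifying assumptions that are not in the text: the function `forward_slot` is hypothetical, the three pipeline stages and their flip-flops are collapsed into a single call, both comparators are given the same current-packet write identifier, and the register file is a dictionary.

```python
def forward_slot(next_rs, next_rt, write_id, wb_data, register_file,
                 rs_rd_en=True, rt_rd_en=True):
    """Return (operand S, operand T, (S read enabled, T read enabled))."""
    match_s = next_rs == write_id          # comparator 608
    match_t = next_rt == write_id          # comparator 610
    read_s = (not match_s) and rs_rd_en    # inverter 614 feeding AND gate 618
    read_t = (not match_t) and rt_rd_en    # inverter 616 feeding AND gate 620
    # OR gate 611 enables the data flip-flop, which captures the write-back data.
    latched = wb_data if (match_s or match_t) else None
    reg_s = register_file[next_rs] if read_s else None
    reg_t = register_file[next_rt] if read_t else None
    out_s = latched if match_s else reg_s  # multiplexer 644
    out_t = latched if match_t else reg_t  # multiplexer 646
    return out_s, out_t, (read_s, read_t)
```

When the next packet reads the register that the current packet writes, the corresponding register file read is suppressed and the operand comes from the latched write-back data; otherwise both operands come from the register file.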

The transfer logic 600 is adapted to selectively enable register read operations based on whether read and write identifiers match. The transfer logic 600 can also be used to selectively cache tag array data (such as way information) associated with a memory address for reuse by a subsequent instruction. In a particular example, the transfer logic 600 or similar logic can be adapted to examine a carry bit associated with a calculated address to determine when the calculated address is associated with the next cache line (which leads to a tag array lookup operation while the translation lookaside buffer (TLB) lookup operation is skipped). In another particular example, the transfer logic or similar logic can be adapted to examine a carry bit associated with the calculated address to identify when the calculated address crosses a page boundary (which leads to both a TLB lookup operation and a tag array lookup operation). When the transfer logic 600 or similar logic determines that the tag array data is still valid (that is, the first and second memory addresses share the same cache line), the tag array data can be latched in data latches, such as the transfer data flip-flops 638 and 642, for use in accessing the second memory location without performing a TLB lookup operation and/or a tag array lookup operation.

FIG. 7 is a diagram of a particular example of an instruction cycle 700 of an execution unit that includes data transfer logic and is adapted to selectively skip lookup operations. The instruction cycle 700 generally includes multiple stages: a write-back stage 702, a decode stage 704, a register read stage 706, one or more execution stages 708, 710 and 712, and a second write-back stage 714. It should be understood that the instruction cycle 700 contains only one write-back stage (the write-back stage 714), after which the execution cycle repeats starting at the decode stage 704. The write-back stage 702 is shown for explanatory purposes.

In general, in an execution stage of the preceding instruction (such as the execution stage 708), a first memory address and a second memory address can be calculated, and the second memory address can be stored in a local memory (such as the local memory 138 shown in FIG. 1). During the write-back stage 702, at 716, the result of the previously executed instruction is written back to a cache address or to a register, such as a general register file. At 718, the second memory address can be retrieved from the local memory. At 720, the values of one or more carry bits associated with the second memory address are examined to determine whether the one or more carry bits indicate a carry produced by the auto-increment operation. If the value of a first carry bit of the one or more carry bits does not indicate that the second memory address is associated with a different cache line than the first memory address, then at 722 the translation lookaside buffer (TLB) lookup operation and the tag array lookup operation are skipped, and the preceding tag array value is used to retrieve the data from memory. If the value of the first carry bit indicates a carry and a second carry bit of the one or more carry bits does not, for example when the second memory address is associated with a different cache line within the same page as the preceding memory address, then at 724 the TLB lookup operation is skipped but the tag array lookup operation is performed to obtain the tag array value used to retrieve the data from memory. If each of the one or more carry bits indicates a carry, the TLB lookup operation and the tag array lookup operation are both performed, as shown at 726.

In a particular example, the tag array lookup operation can be skipped, and the tag array data determined by a preceding tag array lookup operation can be used to access the address in memory. In particular, using the tag array data makes it possible to access the memory address without looking up the tag array data and without performing a TLB lookup operation.

In a particular illustrative embodiment, the stages 704, 706, 708, 710, 712 and 714 shown in FIG. 7 may represent stages of an execution unit in an interleaved multithreaded processor. In addition, in a particular embodiment, the stages 704, 706, 708, 710, 712 and 714 may represent clock cycles.

FIG. 8 is a block diagram of a particular illustrative embodiment of a system 800 that includes a circuit device 802 having a control unit 806 for selectively transferring data using local memories 809 and 811 within execution units 808 and 810, respectively. The control unit 806 is also adapted to selectively skip lookup operations involving a tag array 826 or a translation lookaside buffer (TLB) unit 862. In a particular example, the control unit 806 can skip the lookup operation in the tag array 826, the lookup operation in the TLB unit 862, or both, by transferring the tag array information and/or TLB information from a preceding lookup operation when a calculated address is in the same cache line or the same page as a previously calculated address.

In general, the circuit device 802 includes a data unit 804, which communicates with the control unit 806, a bus unit 812 and a joint translation lookaside buffer (TLB) unit 813. The bus unit 812 communicates with a level 2 tightly coupled memory (TCM)/cache memory 858. The control unit 806 also communicates with a first execution unit 808, a second execution unit 810, an instruction unit 814 and an in-silicon debugger (ISDB) unit 818. The instruction unit 814 communicates with the joint TLB unit 813 and the ISDB unit 818. The circuit device 802 also includes an embedded trace unit (EU) 820 and a memory built-in self-test (BIST) or design-for-testability (DFT) unit 822. The ISDB unit 818, the EU 820 and the memory BIST unit 822 provide means for testing and debugging software that runs on the circuit device 802.

The control unit 806 includes register files 836 and 838, a control logic circuit 840, an interrupt control circuit 842, control registers 844 and an instruction decoder 848. In general, the control unit 806 schedules threads, requests instructions from the instruction unit (IU) 814, and decodes the instructions and issues them to three execution units: the data unit 804 (execution slots 1 and 0, 830 and 832 respectively), the execution unit 808 and the execution unit 810. The instruction unit 814 includes an instruction translation lookaside buffer (ITLB) 864, an instruction address generation unit 866, instruction control registers 868, an instruction packet alignment circuit 870 and an instruction cache 872. The instruction unit (IU) 814 can be the front end of the processor pipeline, responsible for fetching instructions from main memory or from the instruction cache 872 and providing the fetched instructions to the control unit 806.

The data unit 804 includes a data array 824 that holds cacheable data. In a particular embodiment, the data array 824 may be a multi-way data array arranged in 16 sub-array memory banks, each bank including 16 sets of 16 ways. Each memory location in a sub-array can be adapted to store a double word, or 8 bytes, of data. In a particular example, a sub-array can contain 256 double words (that is, 16 × 16 double words). The data unit 804 also includes a tag array 826 that stores the physical tags associated with the data array 824. In a particular embodiment, the tag array 826 is a static random access memory (SRAM). The data unit 804 also includes a state array 828 adapted to store the state associated with each cache line. In a particular example, the state array 828 supplies the cache way to be replaced in response to a cache miss event. The data unit 804 also includes an execution unit (slot 1) 830 and an execution unit (slot 0) 832, which generally perform load and store operations. The data unit 804 includes a control circuit 834 that controls the operation of the data unit 804.
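The data array geometry in this embodiment implies a total capacity that can be worked out directly. The sketch below is only an arithmetic illustration: the function name `data_array_bytes` is hypothetical, and it assumes that every bank is fully populated with one 8-byte location per set and way.

```python
def data_array_bytes(banks=16, sets_per_bank=16, ways=16, bytes_per_location=8):
    """Total data array capacity implied by the example geometry."""
    locations_per_bank = sets_per_bank * ways   # 256 double words per sub-array
    return banks * locations_per_bank * bytes_per_location


capacity = data_array_bytes()   # 16 banks of 256 double words of 8 bytes
```

Under these assumptions the data array holds 32,768 bytes (32 KiB).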

In general, the data unit 804 communicates with the control unit 806 to receive instructions to be executed in the execution units 830 and 832. The data unit 804 also communicates with the bus unit 812 for bus service requests and with the combined TLB unit 813 for combined TLB main memory unit translations.

The bus unit 812 includes a bus queue unit 850, a level 2 tag array 854, an asynchronous first-in first-out (FIFO) unit 852, and a level 2 interface 856. The level 2 interface 856 communicates with the level 2 TCM/cache 858. The combined TLB unit 813 includes control registers 860 and a combined TLB table 862 that contains 64 entries.

In a particular illustrative embodiment, the control unit 806 receives a first instruction packet and a second instruction packet. The control unit 806 can provide instructions from the first instruction packet to the execution unit 808 for execution. The execution unit 808 can execute a first instruction from the first instruction packet and determine a first address associated with the first instruction. In a particular example, the execution unit 808 can calculate a first virtual address based on the first instruction and can calculate a second virtual address from the first virtual address (i.e., by an auto-increment function). The execution unit 808 can communicate with the data unit 804 via the control unit 806 to perform a translation lookaside buffer (TLB) lookup operation through the TLB unit 813. The data unit 804 can control the TLB lookup operation by communicating with the TLB unit 813, and can also perform a tag array lookup operation through the tag array 826 to determine a way within a multi-way memory such as the data array 824. The TLB page translation information and the tag array data can be provided to the execution unit 808 via the control unit 806. The control unit 806 can instruct the execution unit 808 to store the tag array information and/or the TLB page translation information in the memory 809. The execution unit 808 can then retrieve data from a memory location based on the tag array information.

In a particular example, if the second virtual address is associated with the same cache line as the first virtual address, the execution unit 808 can use the stored tag array information from the memory 809 to access physical memory such as the data array 824 directly, without performing a tag array lookup operation or a TLB page translation. In a particular embodiment, the control logic 840 of the control unit 806 can instruct the execution unit 808 to use the stored tag array information. If the second virtual address is associated with a different cache line than the first virtual address, the execution unit 808 can communicate with the data unit 804 via the control unit 806 to perform a tag array lookup operation that determines the tag information for the second virtual address, without performing a TLB lookup operation (that is, without translating the virtual page to a physical page).

In a particular illustrative embodiment, the execution units 808 and 810 include memory array lookup skip logic, such as the lookup skip logic 308 illustrated in FIG. 3, that determines when to skip a tag array lookup operation and/or a TLB lookup operation. In another particular illustrative embodiment, the control logic 840 can control the execution units 808 and 810 to selectively skip the tag array lookup, the TLB lookup, or a combination of the two. The execution units 808 and 810 can also include data transfer logic, such as the transfer logic 136 shown in FIG. 1. In a particular illustrative embodiment, the control logic 840 is adapted to selectively transfer data from a first instruction to a second instruction by controlling the execution units 808 and 810 to store the data in their respective memories 809 and 811 for use in executing a subsequent instruction.

FIG. 9 is a block diagram illustrating a particular illustrative embodiment of a method of data transfer. At 902, in an interleaved multithreaded processor having multiple execution units, a write identifier associated with data to be written to a register file during a write-back stage of an execution unit is compared with a read identifier of a subsequent read stage of the execution pipeline. Proceeding to 904, if the write identifier does not match the read identifier, the method proceeds to 906, where the data resulting from execution of a first instruction packet is written to a location in the register file without being stored locally at the execution unit. Alternatively, if the write identifier matches the read identifier at 904, the method proceeds to 908, where the data is written to the register file and is also stored locally at the execution unit for use by the execution unit in the subsequent read stage. Proceeding from 908 to 910, the method includes retrieving the data from the local storage location. Alternatively, proceeding from 906 to 912, the method includes retrieving the data from the register file location. Proceeding to 914, the method includes executing the subsequent read stage using the retrieved data. In a particular example, the method includes executing an instruction packet at the execution unit using the data stored locally at the execution unit. The method ends at 916.
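The flow of FIG. 9 can be sketched in software as follows. This is a minimal model under stated assumptions, not the hardware described: the class name `ExecutionUnit`, the dictionary-based register file, and the choice to discard the local copy once it has been read are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class ExecutionUnit:
    """Toy model of an execution unit with a small local store (hypothetical)."""
    register_file: Dict[int, int]                        # shared register file
    local: Dict[int, int] = field(default_factory=dict)  # stands in for the local memory

    def write_back(self, write_id: int, value: int, next_read_id: int) -> bool:
        """Steps 902-908: always write the register file; keep a local copy on a match."""
        self.register_file[write_id] = value             # done at both 906 and 908
        if write_id == next_read_id:                     # comparison at 902/904
            self.local[write_id] = value                 # 908: store locally
            return True
        return False                                     # 906: no local copy

    def read_operand(self, read_id: int) -> Tuple[int, str]:
        """Steps 910/912: prefer the local copy, else read the register file."""
        if read_id in self.local:
            # Assumption: the local copy is consumed by the read.
            return self.local.pop(read_id), "local"
        return self.register_file[read_id], "register_file"


if __name__ == "__main__":
    eu = ExecutionUnit(register_file={})
    eu.write_back(write_id=3, value=42, next_read_id=3)
    print(eu.read_operand(3))   # (42, 'local')
```

In this sketch the register file is written in both branches, as in steps 906 and 908; only the source of the operand for the subsequent read stage differs.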

In a particular example, the method includes identifying one or more zero-value bits contained in the data and determining whether the write identifier matches the read identifier. Based on the zero-value bits, the method can include generating an indicator to reduce power to the data paths in the execution unit that are associated with the one or more zero-value bits. In another particular example, the method includes comparing a cache line address of a multi-way cache associated with the data unit with a cache line address associated with the write identifier and, when the two cache line addresses match, retrieving data from the multi-way cache without reading a translation lookaside buffer (TLB) tag.
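One way to picture the zero-value indicator is as a mask with one bit per data path lane. The sketch below is an assumption-laden illustration: the text does not say how the data path is divided, so the byte-wide lanes, the 32-bit width, and the function name `idle_lane_mask` are all hypothetical.

```python
def idle_lane_mask(value: int, width_bits: int = 32, lane_bits: int = 8) -> int:
    """Return a mask whose bit i is set when lane i of `value` is all zeros.

    Assumption: power can be reduced per lane, and lanes are `lane_bits` wide.
    """
    lane_ones = (1 << lane_bits) - 1
    mask = 0
    for lane in range(width_bits // lane_bits):
        lane_value = (value >> (lane * lane_bits)) & lane_ones
        if lane_value == 0:
            mask |= 1 << lane   # this lane carries only zero-value bits
    return mask


if __name__ == "__main__":
    print(bin(idle_lane_mask(0x000000FF)))   # 0b1110: the upper three lanes are zero
```

A set bit in the mask would serve as the indicator that the corresponding lane can run at reduced power while the value is transferred.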

FIG. 10 is a block diagram illustrating a particular illustrative embodiment of a method of selectively skipping a tag array lookup operation. At 1002, a second memory address is calculated from a first memory address using an auto-increment function. Continuing to 1004, a first carry bit associated with the second memory address is examined. In an illustrative embodiment, the carry bit is the address bit that corresponds to the cache line size, so that it indicates whether the second address lies in the same cache line as the first address. For example, if consecutive addresses with low-order bits 0000, 0001, ..., 0111 lie in a single cache line while the next consecutive address, with low-order bits 1000, lies in a different cache line, then the bit that changes from 0 to 1 (that is, the fourth least significant address bit) is the carry bit examined at 1004. Continuing the example, when the second address is generated by auto-incrementing the first address, a carry value is indicated if the fourth least significant address bit changes value. At 1006, if the first carry bit indicates a carry value, the method proceeds to 1008, where a tag array lookup operation is performed to retrieve the tag array information associated with the second memory address. Proceeding to 1010, the tag array information is stored in local memory. Proceeding to 1012, data is retrieved from the cache memory using the tag array information.

Returning to 1006, if the first carry bit does not indicate a carry value, the method proceeds to 1014, where the tag array information is retrieved from the local memory, in which it was stored by a preceding tag array lookup operation such as the lookup associated with the first memory address. The method ends at 1016.
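The carry-bit test of FIG. 10 can be sketched as follows, using the 8-address cache line from the example (low-order bits 0000 through 0111). The function names, the dictionary used as local memory, and the `tag_lookup` callback are hypothetical. The sketch also assumes the increment is smaller than one cache line, since a larger step could cross a line without changing this single bit.

```python
from typing import Callable, Dict, Tuple

LINE_BITS = 3  # 8 addresses per line, so the carry bit is the fourth least significant bit


def carry_indicated(first_addr: int, second_addr: int, line_bits: int = LINE_BITS) -> bool:
    """True when the carry bit differs between the two addresses (checks 1004/1006)."""
    return (((first_addr ^ second_addr) >> line_bits) & 1) == 1


def fetch_tag_info(first_addr: int, increment: int, local_memory: Dict[str, object],
                   tag_lookup: Callable[[int], object]) -> Tuple[int, object, bool]:
    """Return (second address, tag info, whether a tag array lookup was performed)."""
    second_addr = first_addr + increment               # 1002: auto-increment
    if carry_indicated(first_addr, second_addr):       # 1006: new cache line
        tag_info = tag_lookup(second_addr)             # 1008: tag array lookup
        local_memory["tag_info"] = tag_info            # 1010: store locally
        return second_addr, tag_info, True
    return second_addr, local_memory["tag_info"], False  # 1014: reuse the stored tag info


if __name__ == "__main__":
    memory = {"tag_info": "tag-for-line-0"}
    lookup = lambda addr: f"tag-for-line-{addr >> LINE_BITS}"
    print(fetch_tag_info(0b0110, 1, memory, lookup))   # (7, 'tag-for-line-0', False)
    print(fetch_tag_info(0b0111, 1, memory, lookup))   # (8, 'tag-for-line-1', True)
```

Only the step from 0111 to 1000 triggers a lookup; every other increment within the line reuses the information saved by the earlier lookup.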

FIG. 11 is a flow diagram illustrating a particular illustrative embodiment of a method of selectively skipping (bypassing) a tag array lookup operation, a translation lookaside buffer (TLB) lookup operation, or a combination of the two. At 1102, a TLB lookup operation is performed to translate a virtual memory address to a physical memory address. Continuing to 1104, a tag array lookup operation is performed to determine the tag information associated with the physical address. Proceeding to 1106, the tag array information is stored in local memory. Proceeding to 1108, a second memory address, calculated from a first memory address using an auto-increment function, is received. In a particular example, the second memory address can be calculated by incrementing the first memory address. Continuing to 1110, a cache line carry bit associated with the second memory address is examined to identify a carry value (that is, the value of the carry bit). In a particular example, the carry bit may be the fifth address bit in connection with a 32-bit cache. At 1112, if the cache line carry bit does not indicate a carry value, the method proceeds to 1114, where the tag information stored in the local memory is retrieved. Continuing to 1116, data is retrieved from the second memory address of the memory based on the retrieved tag information. Returning to 1112, if the cache line carry bit indicates a carry value, the method proceeds to 1118, where a page boundary carry bit is examined to identify a carry value. At 1120, if the page boundary carry bit indicates a carry value, the method returns to 1102, where a TLB lookup operation is performed to translate the memory address to a physical address. Returning to 1120, if the page boundary carry bit does not indicate a carry value, the method proceeds to 1104, where a tag array lookup operation is performed to determine the tag information associated with the physical address, and no TLB lookup operation is performed.
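The two-level decision of FIG. 11 can be sketched as a small function that picks the next step from the two carry bits. The bit positions are assumptions: bit index 4 follows the "fifth address bit" example in the text, while bit index 12 (a 4 KiB page) is hypothetical because the text gives no page size. The names `Step`, `next_step`, and `count_lookups` are also hypothetical, and the increment is assumed to be smaller than one cache line.

```python
from enum import Enum
from typing import Tuple

LINE_CARRY_BIT = 4    # fifth address bit, as in the example in the text
PAGE_CARRY_BIT = 12   # assumption: 4 KiB pages; the text does not give a page size


class Step(Enum):
    REUSE_TAG = 1114    # skip both lookups and reuse the stored tag information
    TAG_LOOKUP = 1104   # tag array lookup only; TLB lookup skipped
    TLB_LOOKUP = 1102   # TLB lookup followed by a tag array lookup


def bit_changed(a: int, b: int, bit: int) -> bool:
    """True when address bit `bit` differs between a and b."""
    return (((a ^ b) >> bit) & 1) == 1


def next_step(first_addr: int, second_addr: int) -> Step:
    """Decisions 1112 and 1120 for an auto-incremented address."""
    if not bit_changed(first_addr, second_addr, LINE_CARRY_BIT):
        return Step.REUSE_TAG                      # same cache line
    if bit_changed(first_addr, second_addr, PAGE_CARRY_BIT):
        return Step.TLB_LOOKUP                     # crossed a page boundary
    return Step.TAG_LOOKUP                         # new line, same page


def count_lookups(start: int, stride: int, steps: int) -> Tuple[int, int]:
    """Count (TLB lookups, tag array lookups) over a run of auto-increments."""
    tlb = tag = 1                                  # 1102 and 1104 for the first address
    addr = start
    for _ in range(steps):
        nxt = addr + stride
        step = next_step(addr, nxt)
        if step is Step.TLB_LOOKUP:
            tlb += 1
            tag += 1                               # 1102 is followed by 1104
        elif step is Step.TAG_LOOKUP:
            tag += 1
        addr = nxt
    return tlb, tag


if __name__ == "__main__":
    # 513 accesses with stride 8, starting at address 0.
    print(count_lookups(0, 8, 512))   # (2, 257)
```

With these assumed parameters, 513 sequential accesses need only 2 TLB lookups and 257 tag array lookups, which illustrates the saving the method is aiming for.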

FIG. 12 is a block diagram illustrating a representative wireless communication device 1200 that includes a processor with logic that selectively skips register read operations and/or translation lookaside buffer (TLB) lookup operations. The wireless communication device 1200 can include a digital signal processor (DSP) 1210 that includes data transfer/lookup skip logic 1264, which communicates with one or more execution units 1268. Each of the one or more execution units 1268 includes a local memory 1270. The data transfer/lookup skip logic 1264 is operable to control the execution units 1268 so that data is transferred by storing it locally in the local memory 1270 for use by a subsequent instruction packet. The wireless communication device 1200 also includes a memory 1232 that is accessible to the DSP 1210. The data transfer/lookup skip logic 1264 is also adapted to control the execution units 1268 to use previously determined tag array information (from a preceding tag array lookup operation associated with a different memory address) and so skip both the TLB lookup operation and the tag array lookup operation. The previously determined tag array information can be used to access a memory, such as the memory 1232, without performing another tag array lookup operation. In another particular embodiment, as described with respect to FIGS. 1-11, the page translation information for the first address, determined by a preceding TLB lookup operation, can be used to perform a tag array lookup operation without performing another TLB lookup operation. In a particular embodiment, the data transfer and/or TLB lookup skip logic 1264 can provide a data transfer function, a tag array lookup skip function, a TLB lookup skip function, or any combination of these.

In a particular embodiment, the wireless communication device 1200 can include both data transfer circuitry and lookup skip logic. In another particular embodiment, the wireless communication device 1200 can include only the data transfer circuitry. In yet another particular embodiment, the wireless communication device 1200 can include the lookup skip logic. In still another particular embodiment, logic adapted to determine whether to transfer data can also be used to determine whether to skip a tag array lookup operation, a TLB lookup operation, or a combination of the two.

図１２はまた、デジタル信号プロセッサ１２１０およびディスプレイ１２２８に結合したディスプレイコントローラ１２２６を示している。コーダ／デコーダ（ＣＯＤＥＣ）１２３４もデジタル信号プロセッサ１２１０に結合することができる。スピーカー１２３６およびマイクロフォン１２３８はＣＯＤＥＣ１２３４に結合することができる。 FIG. 12 also shows a display controller 1226 coupled to the digital signal processor 1210 and the display 1228. A coder / decoder (CODEC) 1234 may also be coupled to the digital signal processor 1210. A speaker 1236 and a microphone 1238 can be coupled to the CODEC 1234.

図１２はまた、ワイヤレスコントローラ１２４０がデジタル信号プロセッサ１２１０およびワイヤレスアンテナ１２４２に結合できることを示している。特定の実施形態では、オンチップシステム１２２２に入力デバイス１２３０および電源１２４４が結合されている。さらに、特定の実施形態では、図１２に例解されたとおり、ディスプレイ１２２８、入力デバイス１２３０、スピーカー１２３６、マイクロフォン１２３８、ワイヤレスアンテナ１２４２および電源１２４４はオンチップシステム１２２２の外にある。ただし、各々はオンチップシステム１２２２のコンポーネントに結合されている。 FIG. 12 also illustrates that the wireless controller 1240 can be coupled to the digital signal processor 1210 and the wireless antenna 1242. In certain embodiments, an input device 1230 and a power source 1244 are coupled to the on-chip system 1222. Further, in certain embodiments, the display 1228, input device 1230, speaker 1236, microphone 1238, wireless antenna 1242, and power source 1244 are external to the on-chip system 1222 as illustrated in FIG. However, each is coupled to a component of on-chip system 1222.

Although the data transfer and/or TLB lookup skip logic 1264, the one or more execution units 1268, and the local memory 1270 are shown as separate components of the digital signal processor 1210, it should be understood that they can instead be integrated into other processing components, such as the wireless controller 1240, the CODEC 1234, the display controller 1226, another processing component (such as a general-purpose processor, not shown), or any combination of these.

Those skilled in the art will further appreciate that the various illustrative logical blocks, configurations, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein can be implemented as electronic hardware, computer software, or a combination of both. To illustrate this interchangeability of hardware and software clearly, various illustrative components, blocks, configurations, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and on the design constraints imposed on the overall system. Those skilled in the art can implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as a departure from the scope of the present disclosure.

The steps of a method or algorithm described in connection with the embodiments disclosed herein can be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module can reside in RAM memory, flash memory, ROM memory, PROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium can be integral to the processor. The processor and the storage medium can reside in an ASIC. The ASIC can reside in a computing device or a user terminal. In the alternative, the processor and the storage medium can reside as discrete components in a computing device or user terminal.

開示された実施形態に関するこれまでの説明は、開示された実施形態を当業者が実施または利用できるようにするため提供されている。これらの実施形態の様々な修正形態が当業者には容易に理解されると思われ、本明細書に定める一般的原理を、本開示の趣旨または範囲から逸脱することなく他の実施形態に適用することができる。よって、本開示は、本明細書に示す実施形態に限定されることを意図しておらず、この原理および以下の請求項が定める新規な特徴と合致する最大の範囲を認めるものである。
以下に、本願出願の当初の特許請求の範囲に記載された発明を付記する。
［Ｃ１］複数の実行装置を有するインターリーブ型マルチスレッド（ＩＭＴ）プロセッサ内の実行パイプラインにおいて、実行装置でのライトバック段階の間、第１命令の実行によりレジスタファイルに書き込まれる結果に関連づけられた書き込み識別子を、第２命令に関連づけられた読み取り識別子と比較することと、前記書き込み識別子が前記読み取り識別子と合致したとき、後続の読み取り段階での前記実行装置による使用に備えて、前記結果を前記実行装置のローカルメモリに保存することとを具備する方法。
［Ｃ２］前記方法は、前記書き込み識別子が前記読み取り識別子と合致しなかったとき、前記結果を前記ローカルメモリに保存することなく、前記結果を前記レジスタファイルに書き込むことをさらに具備するＣ１記載の方法。
［Ｃ３］前記ローカルメモリに保存された前記結果を使用して、前記実行装置で命令パケットを実行することをさらに具備するＣ１記載の方法。
［Ｃ４］前記書き込み識別子に含まれる１つあるいはそれより多くのゼロ値ビットを特定することと、前記１つあるいはそれより多くのゼロ値ビットに関連づけられた前記実行装置内のデータパスへのパワーを減らすための標識を生成することとをさらに具備するＣ１記載の方法。
［Ｃ５］前記書き込み識別子が前記読み取り識別子と合致したとき、データ転送可能アウトプット標識を生成することと、前記データ転送可能アウトプット標識に応じてレジスタファイルのスロットを選択的に無効にすることとをさらに具備するＣ１記載の方法。
［Ｃ６］前記データ転送可能アウトプット標識に関係する選択信号を生成することと、前記レジスタファイルからのアウトプットまたは前記ローカルメモリからの前記結果のうちの１つを、前記第２命令を実行する際の使用に備えて前記実行装置に選択的に提供することとをさらに具備するＣ５記載の方法。
［Ｃ７］前記第１命令の第１記憶アドレスから計算された前記第２命令の第２記憶アドレスのキャリービットを、前記第２記憶アドレスおよび前記第１記憶アドレスが１つのキャッシュラインに関連しているか否かを決定するために調べることと、前記第２記憶アドレスに関連づけられた第２キャッシュラインアドレスが、前記第１記憶アドレスに関連づけられた第１キャッシュラインアドレスと合致したとき、タグ配列探索オペレーションを実行することなく多方向キャッシュからデータを取り出すこととをさらに具備するＣ１記載の方法。
［Ｃ８］第１命令パケットに関連づけられた第１アドレスから、第２命令パケットに関連づけられた第２アドレスを決定することと、データ装置の加算器のキャリービットを、多方向キャッシュに関連づけられたキャッシュラインの境界を前記第２アドレスが越えたか否かを決定するために調べることと、前記境界を越えていないとき、先行のタグ配列探索オペレーションにより決定された前記第１アドレスに関連づけられたタグ配列データおよび変換索引バッファ（ＴＬＢ）探索データを使用して前記第２アドレスからデータを取り出すために前記多方向キャッシュにアクセスすることとを具備する方法。
［Ｃ９］キャッシュライン境界を越えた場合、前記方法は、変換索引バッファ（ＴＬＢ）探索オペレーションを実行することなく前記第２命令に関連づけられたタグ配列情報を決定するためにタグ配列探索オペレーションを実行することをさらに具備するＣ８記載の方法。
［Ｃ１０］前記タグ配列情報を使用して前記多方向キャッシュからデータを読み取ることをさらに具備するＣ９記載の方法。
［Ｃ１１］前記第１アドレスが第１メモリ読み取りアドレスを具備し、前記第２アドレスが第２メモリ読み取りアドレスを具備するＣ８記載の方法。
［Ｃ１２］前記第２アドレスを、前記第１命令パケットの実行により決定された結果に関連づけられた第１書き込みアドレスと比較することと、前記第１書き込みアドレスが前記第２アドレスと合致したとき、前記第２命令パケットを実行する際の使用に備えて前記結果を実行装置内のローカルメモリに保存することとをさらに具備するＣ８記載の方法。
［Ｃ１３］前記ローカルメモリから前記結果を取り出すことと、前記取り出された結果を使用して前記第２命令パケットを実行することとをさらに具備するＣ１２記載の方法。
［Ｃ１４］ページ境界を越えたとき、前記方法は、前記多方向キャッシュに関連づけられた物理アドレスに前記第２アドレスを変換するために、変換索引バッファ（ＴＬＢ）探索オペレーションを実行することと、タグ情報を決定するために、タグ配列探索オペレーションを実行することと、前記タグ情報および前記物理アドレスに基づきメモリにアクセスすることとをさらに具備するＣ８記載の方法。
［Ｃ１５］相対アドレス指定を使用して前記第１アドレスから前記第２アドレスを決定するＣ８記載の方法。
［Ｃ１６］１つあるいはそれより多くのデータ値を保存するローカルメモリと、読み取りオペレーションに関連づけられた読み取りアドレスが先行のライトバックオペレーションに関連づけられたライトバックアドレスと合致するか否かを決定するように適合された論理回路であって、前記読み取りアドレスが前記ライトバックアドレスと合致したとき、前記ローカルメモリに前記１つあるいはそれより多くのデータ値を保存するように適合された論理回路とを備える実行装置を具備するマルチスレッドプロセッサ。
［Ｃ１７］前記論理回路が、前記実行装置の外にあるメモリの記憶場所からデータを読み取るように適合されており、前記読み取りアドレスが前記ライトバックアドレスと合致しなかったとき、前記記憶場所が前記読み取りアドレスに対応するＣ１６記載のマルチスレッドプロセッサ。
［Ｃ１８］前記実行装置がライトバック段階、デコード段階およびレジスタファイル読み取り段階を含む複数の実行段階を具備するＣ１６記載のマルチスレッドプロセッサ。
［Ｃ１９］前記論理回路が１つあるいはそれより多くの比較器を備え、前記１つあるいはそれより多くの比較器が、読み取りアドレス情報を書き込みアドレス情報と比較し、結果を作成してデータ転送を選択的に可能にするように適合されているＣ１８記載のマルチスレッドプロセッサ。
［Ｃ２０］前記ローカルメモリが前記実行装置内に１つあるいはそれより多くのデータラッチを具備するＣ１６記載のマルチスレッドプロセッサ。
［Ｃ２１］データ転送を選択的に可能にするために、データ転送論理回路によって前記１つあるいはそれより多くのデータラッチが選択的にアクティブ化されるＣ２０記載のマルチスレッドプロセッサ。
［Ｃ２２］命令に関連づけられた読み取りアドレスの少なくとも一部が先行命令に関連づけられた読み取りアドレスの一部と合致するとき、タグ配列探索オペレーションを実行することなく、多方向キャッシュメモリ内の記憶アドレスを決定するように適合された第２論理回路をさらに具備するＣ１６記載のマルチスレッドプロセッサ。
［Ｃ２３］複数の実行装置を有するインターリーブ型マルチスレッド（ＩＭＴ）プロセッサ内の実行パイプラインにおいて、第１命令パケットの実行によりレジスタファイルに書き込まれる結果に関連づけられた書き込み識別子を、第２命令パケットに関連づけられた読み取り識別子と比較するための手段と、前記書き込み識別子が前記読み取り識別子と合致したとき、前記第２命令パケットを実行する際の使用に備えて、前記結果を実行装置に選択的にローカル保存するための手段とを具備するプロセッサ。
［Ｃ２４］前記第１命令パケットに関連づけられた第１アドレスから前記第２命令パケットに関連づけられた第２アドレスを決定するための手段と、多方向キャッシュに関連づけられたキャッシュラインのキャッシュライン境界を前記第２アドレスが越えたか否かの決定をすることを決定するために、データ装置の加算器のキャリービットを調べるための手段と、変換索引バッファ（ＴＬＢ）またはタグ配列にアクセスすることなく前記第１アドレスに関連づけられたローカル保存済み物理アドレスデータおよび方向データを使用して、仮想アドレスを前記多方向キャッシュに関連づけられた物理アドレスに変換するための手段とをさらに具備するＣ２３記載のプロセッサ。
［Ｃ２５］第１命令パケットの実行によりレジスタファイルに書き込まれる結果に関連づけられた書き込み識別子を、第２命令パケットに関連づけられた読み取り識別子と比較するための手段が、前記書き込み識別子および前記読み取り識別子を受信し、前記書き込み識別子と前記読み取り識別子とが合致するか否かを示す第１アウトプットを提供するように適合された第１比較器と、前記書き込み識別子および第２読み取り識別子を受信し、前記書き込み識別子と前記第２読み取り識別子とが合致するか否かを示す第２アウトプットを提供するように適合された第２比較器と、前記第１アウトプットおよび前記第２アウトプットに基づき前記第２命令パケットを実行する際の使用に備えて、ローカル保存済みデータまたはレジスタデータのうちの１つを前記実行装置に選択的に提供するように適合された論理回路とを具備するＣ２３記載のプロセッサ。
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the disclosed embodiments. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein can be applied to other embodiments without departing from the spirit or scope of the disclosure. Accordingly, this disclosure is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features defined by the following claims.
The inventions recited in the claims as originally filed in the present application are appended below.
[C1] A method comprising: in an execution pipeline of an interleaved multithreaded (IMT) processor having multiple execution units, comparing, during a write-back stage at an execution unit, a write identifier associated with a result to be written to a register file by execution of a first instruction with a read identifier associated with a second instruction; and, when the write identifier matches the read identifier, storing the result in a local memory of the execution unit for use by the execution unit in a subsequent read stage.
[C2] The method of C1, further comprising, when the write identifier does not match the read identifier, writing the result to the register file without storing the result in the local memory.
[C3] The method of C1, further comprising executing an instruction packet at the execution unit using the result stored in the local memory.
[C4] The method of C1, further comprising: identifying one or more zero-value bits included in the write identifier; and generating an indicator to reduce power to a data path in the execution unit associated with the one or more zero-value bits.
[C5] The method of C1, further comprising: generating a data transfer enable output indicator when the write identifier matches the read identifier; and selectively disabling a slot of the register file in response to the data transfer enable output indicator.
[C6] The method of C5, further comprising: generating a selection signal related to the data transfer enable output indicator; and selectively providing one of an output from the register file or the result from the local memory to the execution unit for use in executing the second instruction.
[C7] The method of C1, further comprising: examining a carry bit of a second memory address of the second instruction, calculated from a first memory address of the first instruction, to determine whether the second memory address and the first memory address are associated with a single cache line; and, when a second cache line address associated with the second memory address matches a first cache line address associated with the first memory address, retrieving data from a multi-way cache without performing a tag array lookup operation.
[C8] A method comprising: determining a second address associated with a second instruction packet from a first address associated with a first instruction packet; examining a carry bit of an adder of a data unit to determine whether the second address crosses a boundary of a cache line associated with a multi-way cache; and, when the boundary is not crossed, accessing the multi-way cache to retrieve data from the second address using tag array data and translation lookaside buffer (TLB) lookup data associated with the first address that were determined by a preceding tag array lookup operation.
[C9] The method of C8, further comprising, when the cache line boundary is crossed, performing a tag array lookup operation to determine tag array information associated with the second instruction without performing a translation lookaside buffer (TLB) lookup operation.
[C10] The method of C9, further comprising reading data from the multi-way cache using the tag array information.
[C11] The method of C8, wherein the first address comprises a first memory read address and the second address comprises a second memory read address.
[C12] comparing the second address with a first write address associated with a result determined by execution of the first instruction packet; and when the first write address matches the second address; The method of C8, further comprising storing the result in a local memory within the execution unit for use in executing the second instruction packet.
[C13] The method according to C12, further comprising: retrieving the result from the local memory; and executing the second instruction packet using the retrieved result.
[C14] The method of C8, further comprising, when a page boundary is crossed: performing a translation lookaside buffer (TLB) lookup operation to translate the second address to a physical address associated with the multi-way cache; performing a tag array lookup operation to determine tag information; and accessing memory based on the tag information and the physical address.
[C15] The method of C8, wherein the second address is determined from the first address using relative addressing.
[C16] A multithreaded processor comprising an execution unit that includes: a local memory to store one or more data values; and logic adapted to determine whether a read address associated with a read operation matches a write-back address associated with a preceding write-back operation, the logic being adapted to store the one or more data values in the local memory when the read address matches the write-back address.
[C17] The multithreaded processor of C16, wherein the logic is adapted to read data from a storage location of a memory outside the execution unit when the read address does not match the write-back address, the storage location corresponding to the read address.
[C18] The multithread processor according to C16, wherein the execution device includes a plurality of execution stages including a write-back stage, a decoding stage, and a register file reading stage.
[C19] The multithreaded processor according to C18, wherein the logic circuit includes one or more comparators adapted to compare read address information with write address information and to produce a result that selectively enables data transfer.
[C20] The multithreaded processor according to C16, wherein the local memory includes one or more data latches in the execution unit.
[C21] The multithreaded processor according to C20, wherein the one or more data latches are selectively activated by a data transfer logic circuit to selectively enable data transfer.
[C22] The multithreaded processor according to C16, further comprising a second logic circuit adapted to determine a storage address in the multi-way cache memory without performing a tag array lookup operation when at least a portion of the read address associated with an instruction matches a portion of the read address associated with a preceding instruction.
[C23] A processor comprising: means for comparing, in an execution pipeline of an interleaved multithreaded (IMT) processor having a plurality of execution devices, a write identifier associated with a result to be written to a register file by execution of a first instruction packet with a read identifier associated with a second instruction packet; and means for selectively storing the result locally at an execution device for use in executing the second instruction packet when the write identifier matches the read identifier.
[C24] The processor of C23, further comprising: means for determining a second address associated with the second instruction packet from a first address associated with the first instruction packet; means for examining a carry bit of an adder of a data device to determine whether the second address has crossed a cache line boundary of a cache line associated with a multi-way cache; and means for translating a virtual address to a physical address associated with the multi-way cache, without accessing a translation lookaside buffer (TLB) or a tag array, using locally stored physical address data and way data associated with the first address.
[C25] The processor of C23, wherein the means for comparing the write identifier associated with the result written to the register file by execution of the first instruction packet with the read identifier associated with the second instruction packet comprises: a first comparator adapted to receive the write identifier and the read identifier and to provide a first output indicating whether the write identifier and the read identifier match; a second comparator adapted to receive the write identifier and a second read identifier and to provide a second output indicating whether the write identifier and the second read identifier match; and a logic circuit adapted to selectively provide, based on the first output and the second output, one of locally stored data and register data to the execution device for use in executing the second instruction packet.
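
The comparator arrangement of C23 and C25 can be illustrated with a minimal software sketch. It is not the disclosed circuit: the function name `select_operands`, the dictionary used as a register file, and the single `local_latch` value are hypothetical stand-ins chosen for illustration, and the sketch assumes the second instruction packet reads exactly two operands.

```python
def select_operands(write_id, result, read_id_a, read_id_b, register_file):
    """Pick each operand from the local latch or the register file.

    write_id:  register identifier the first packet's result will be written to
    result:    the result value (assumed not yet visible in register_file)
    read_id_a, read_id_b: register identifiers read by the second packet
    """
    match_a = (write_id == read_id_a)  # first comparator output
    match_b = (write_id == read_id_b)  # second comparator output

    # Store the result locally only if some operand of the second packet needs it.
    local_latch = result if (match_a or match_b) else None

    # Selection logic: locally stored data on a match, register data otherwise.
    operand_a = local_latch if match_a else register_file[read_id_a]
    operand_b = local_latch if match_b else register_file[read_id_b]
    return operand_a, operand_b
```

In this sketch a match means the second packet never reads the stale register value, which is the effect the local storage is meant to achieve.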

Claims

A method comprising:
determining a second address associated with a second packet of instructions from a first address associated with a first packet of instructions;
examining a carry bit of an adder of a data device to determine whether the second address has crossed a cache line boundary associated with a multi-way cache;
in response to determining that the second address has crossed the cache line boundary, examining a second carry bit of the adder to determine whether the second address has crossed a page boundary associated with the multi-way cache and, in response to determining that the page boundary has not been crossed, performing a tag array lookup operation to determine tag array information associated with the second address without performing a translation lookaside buffer (TLB) lookup operation; and
in response to determining that the second address has not crossed the cache line boundary, accessing the multi-way cache to retrieve data from the second address using tag array data and translation lookaside buffer (TLB) lookup data associated with the first address, the tag array data being a result of a previous tag array lookup operation.
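
The carry-bit examination recited above can be sketched in software as follows. This is only an illustration under stated assumptions: a 32-byte cache line (`LINE_BITS = 5`), a 4 KiB page (`PAGE_BITS = 12`), and a second address formed by adding a small non-negative increment (less than one line) to the first address. The names `carry_out`, `classify`, and `Lookup` are hypothetical and do not appear in the disclosure.

```python
from enum import Enum

LINE_BITS = 5    # assumed 32-byte cache line
PAGE_BITS = 12   # assumed 4 KiB page


class Lookup(Enum):
    REUSE_TAG_AND_TLB = "reuse tag array data and TLB data of the first address"
    TAG_ONLY = "tag array lookup without TLB lookup"
    TLB_AND_TAG = "TLB lookup and tag array lookup"


def carry_out(addr, increment, bits):
    """Carry out of the low `bits` bits of addr + increment (0 or 1)."""
    mask = (1 << bits) - 1
    return ((addr & mask) + (increment & mask)) >> bits


def classify(first_addr, increment):
    """Return the second address and which lookups it requires."""
    if not 0 <= increment < (1 << LINE_BITS):
        raise ValueError("sketch assumes an increment smaller than one cache line")
    second_addr = first_addr + increment
    if not carry_out(first_addr, increment, LINE_BITS):
        # Same cache line: neither lookup is needed.
        return second_addr, Lookup.REUSE_TAG_AND_TLB
    if not carry_out(first_addr, increment, PAGE_BITS):
        # New line, same page: the translation is still valid.
        return second_addr, Lookup.TAG_ONLY
    # New page: both lookups are needed.
    return second_addr, Lookup.TLB_AND_TAG
```

For example, under these assumptions `classify(0x101C, 8)` crosses a line but not a page, so only the tag array would be consulted.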

The method of claim 1, further comprising reading data from the multi-way cache using the tag array information.

The method of claim 1, wherein the first address comprises a first memory read address and the second address comprises a second memory read address.

The method of claim 1, further comprising:
comparing the second address to a first write address associated with a result determined by execution of the first packet of instructions; and
storing the result in a local memory within the execution unit for use in executing the second packet of instructions when the first write address matches the second address.

The method of claim 4, further comprising:
retrieving the result from the local memory; and
executing the second packet of instructions using the retrieved result.

The method of claim 1, further comprising, when the page boundary is crossed by the second address:
performing a translation lookaside buffer (TLB) lookup operation to translate the second address to a physical address associated with the multi-way cache;
performing a tag array lookup operation to determine tag information; and
accessing a memory based on the tag information and the physical address.
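
The effect of the three cases (same line, new line in the same page, new page) on the number of lookups can be seen in a toy model. Everything here is an assumption made for illustration: the class `AccessPath`, the dictionaries standing in for the TLB and the tag array, and the 32-byte line and 4 KiB page sizes. The model also compares upper address bits in software instead of examining adder carry bits, which gives the same boundary decision for sequential accesses.

```python
LINE_BITS = 5    # assumed 32-byte cache line
PAGE_BITS = 12   # assumed 4 KiB page


class AccessPath:
    """Toy model that counts TLB and tag array lookups."""

    def __init__(self, tlb, tag_array):
        self.tlb = tlb              # virtual page number -> physical page number
        self.tag_array = tag_array  # physical line number -> cache way
        self.tlb_lookups = 0
        self.tag_lookups = 0
        self.prev = None            # (virtual address, physical page, way)

    def access(self, vaddr):
        vpage = vaddr >> PAGE_BITS
        vline = vaddr >> LINE_BITS
        same_page = self.prev is not None and (self.prev[0] >> PAGE_BITS) == vpage
        same_line = self.prev is not None and (self.prev[0] >> LINE_BITS) == vline

        if same_page:
            ppage = self.prev[1]    # reuse the stored translation
        else:
            ppage = self.tlb[vpage]
            self.tlb_lookups += 1
        paddr = (ppage << PAGE_BITS) | (vaddr & ((1 << PAGE_BITS) - 1))

        if same_line:
            way = self.prev[2]      # reuse the stored way
        else:
            way = self.tag_array[paddr >> LINE_BITS]
            self.tag_lookups += 1

        self.prev = (vaddr, ppage, way)
        return paddr, way
```

With a TLB mapping virtual pages 1 and 2 and accesses at 0x1000, 0x1008, 0x1020, and 0x2000, the model performs a TLB lookup only on the first and last access and a tag array lookup on all but the second.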

The method of claim 1, wherein the second address is determined from the first address using relative addressing.

A computer-readable storage medium having program code that, when executed by a processor, causes the processor to perform:
determining a second address associated with a second packet of instructions from a first address associated with a first packet of instructions;
examining a carry bit of an adder of a data device to determine whether the second address has crossed a cache line boundary associated with a multi-way cache;
in response to determining that the second address has crossed the cache line boundary, examining a second carry bit of the adder to determine whether the second address has crossed a page boundary associated with the multi-way cache and, in response to determining that the page boundary has not been crossed, performing a tag array lookup operation to determine tag array information associated with the second address without performing a translation lookaside buffer (TLB) lookup operation; and
in response to determining that the second address has not crossed the cache line boundary, accessing the multi-way cache to retrieve data from the second address using tag array data, determined from a previous tag array lookup operation, and translation lookaside buffer (TLB) lookup data associated with the first address.