JP2023152646A - Arithmetic processing unit and arithmetic processing method

Info

Publication number: JP2023152646A
Application number: JP2022196251A
Authority: JP
Inventors: 元押山; Gen Oshiyama
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2022-03-30
Filing date: 2022-12-08
Publication date: 2023-10-17

Abstract

To reduce the circuit scale of an arithmetic processing unit that executes a plurality of instructions in parallel, starting from those that are executable.

SOLUTION: An arithmetic processing unit that executes a plurality of instructions in parallel, starting from those that are executable, comprises: a queue 105a that stores the plurality of instructions; and a control section 105 that holds an index indicating the pipeline in which a dependency-source instruction among the plurality of instructions is executed and the execution stage of the dependency-source instruction in that pipeline, performs data dependency resolution between the dependency-source instruction and a dependency-destination instruction that uses the execution result of the dependency-source instruction, and controls the issue timing of the plurality of instructions.

SELECTED DRAWING: Figure 9

Description

The present invention relates to an arithmetic processing unit and an arithmetic processing method.

Improving the microarchitecture performance of a core (processor core) of an arithmetic processing unit means, within the same operating frequency band, improving IPC (Instructions Per Cycle). Superscalar execution, out-of-order (OoO) execution, and the like are used as architectures for improving IPC.

A superscalar processor includes a plurality of pipelines, each containing an arithmetic unit such as an ALU (Arithmetic Logic Unit), and thereby executes as many instructions as there are pipelines in one cycle.

Out-of-order execution is a technique in which a processor executes dispatched instructions in parallel, starting from those that are executable (that is, out of order). Executable instructions include, for example, instructions that have no dependency on other instructions, instructions whose dependencies have been resolved, and instructions that can be assigned to the hardware they use.

A processor that supports out-of-order execution includes a scheduler that detects executable instructions and issues the detected instructions to a pipeline. For example, the scheduler includes, for each pipeline, a comparator that determines whether the dependency resolution timing of an instruction has been reached (whether the instruction is executable).

Japanese Laid-open Patent Publication No. H10-078872
Japanese Laid-open Patent Publication No. 2015-232902

In a processor that supports out-of-order execution, as the number of instructions that can be issued out of order (the in-flight instruction width) increases, the number of scheduler entries increases and the circuit scale of the comparators also increases. Furthermore, when the number of pipelines doubles, the number of comparators also doubles.

As a result, it may become difficult to both improve IPC by enhancing the functionality of the processor core and reduce the circuit scale of the processor core (in other words, increase integration).

In one aspect, an object of the present invention is to reduce the circuit scale of an arithmetic processing unit that executes a plurality of instructions in parallel, starting from those that are executable.

In one aspect, an arithmetic processing unit executes a plurality of instructions in parallel, starting from those that are executable, and may include a queue and a control unit. The queue may store the plurality of instructions. The control unit may hold an index indicating the pipeline in which a dependency-source instruction among the plurality of instructions is executed and the execution stage of the dependency-source instruction in that pipeline, may perform data dependency resolution between the dependency-source instruction and a dependency-destination instruction that uses the execution result of the dependency-source instruction, and may control the issue timing of the plurality of instructions.

In one aspect, it is possible to reduce the circuit scale of an arithmetic processing unit that executes a plurality of instructions in parallel, starting from those that are executable.

FIG. 1 is a block diagram illustrating a configuration example of a processor according to an embodiment.
FIG. 2 is a block diagram illustrating a configuration example of a scheduler according to a comparative example.
FIG. 3 is a diagram illustrating an example of pipeline stages and an example of data forwarding according to latency.
FIG. 4 is a diagram illustrating a configuration example of a core of a processor according to a comparative example.
FIG. 5 is a block diagram illustrating a configuration example of a dependency resolution determination unit according to a comparative example.
FIG. 6 is a block diagram illustrating a configuration example of a scheduler according to an embodiment.
FIG. 7 is a diagram illustrating an example of pipeline stage information.
FIG. 8 is a block diagram illustrating a configuration example of a dependency resolution determination unit according to an embodiment.
FIG. 9 is a diagram illustrating a configuration example of part of a core of a processor according to an embodiment.
FIG. 10 is a diagram illustrating a circuit configuration example of the dependency resolution determination unit illustrated in FIG. 8.
FIG. 11 is a diagram for explaining an allocation example of pipeline stage information.
FIG. 12 is a diagram illustrating a determination example of forwarding timing according to a comparative example.
FIG. 13 is a diagram illustrating a determination example of forwarding timing according to an embodiment.
FIG. 14 is a diagram illustrating a determination example according to a comparative example when there are two shortest forwarding timings.
FIG. 15 is a diagram illustrating a determination example according to an embodiment when there are two shortest forwarding timings.
FIG. 16 is a diagram illustrating a configuration example of a core of a processor according to a comparative example, focusing on processing that cancels subsequent instructions in a dependency chain that depends on a load that missed the cache.
FIG. 17 is a diagram illustrating a determination example according to a comparative example in the processing that cancels subsequent instructions in a dependency chain that depends on a load that missed the cache.
FIG. 18 is a diagram illustrating an example of an arithmetic-pipe preceding-load timing signal when there is one load pipe.
FIG. 19 is a diagram for explaining an allocation example of pipeline stage information according to an embodiment.
FIG. 20 is a diagram illustrating a configuration example of a core of a processor according to an embodiment, for canceling subsequent instructions in a dependency chain that depends on a load that missed the cache.
FIG. 21 is a diagram illustrating a determination example according to an embodiment in the processing that cancels subsequent instructions in a dependency chain that depends on a load that missed the cache.
FIG. 22 is a diagram illustrating a configuration example of a core of a processor according to a comparative example, focusing on CF (Condition Flag) update control.
FIG. 23 is a diagram illustrating a determination example of CF update control according to a comparative example.
FIG. 24 is a diagram illustrating a configuration example of a core of a processor according to an embodiment, focusing on CF update control.
FIG. 25 is a diagram illustrating a determination example of CF update control according to an embodiment.

Embodiments of the present invention are described below with reference to the drawings. The embodiments described below are merely examples, and there is no intention to exclude various modifications or applications of techniques that are not explicitly described. For example, the embodiments can be modified in various ways without departing from their spirit. In the drawings used in the following description, parts given the same reference numeral represent the same or similar parts unless otherwise specified.

[1] One Embodiment
[1-1] Configuration Example of Processor
FIG. 1 is a block diagram illustrating a configuration example of a processor 1 according to one embodiment. For simplicity, FIG. 1 illustrates only part of the circuit configuration in the core of the processor 1.

The processor 1 is an example of an arithmetic processing unit, such as a CPU (Central Processing Unit), that executes a plurality of instructions in parallel, starting from those that are executable. For example, the processor 1 may be a superscalar processor that supports out-of-order execution, which is one example of such processing.

As shown in FIG. 1, the processor 1 may include an instruction fetch control PC (Program Counter) 101, an instruction fetch unit 102, a decoder 103, an allocator 104, and a scheduler 105. The processor 1 may also include a plurality of arithmetic units 106, a load/store queue 107, and a register 108.

The instruction fetch control PC 101 is a program counter that stores the address (an address in the main storage device, for example, the memory 100) at which the next instruction to be executed is stored. The instruction fetch unit 102 fetches an instruction from the address indicated by the instruction fetch control PC 101. The decoder 103 decodes the instruction fetched by the instruction fetch unit 102. The allocator 104 allocates to the decoded instruction a resource for executing it, such as one of the plurality of arithmetic units 106, and dispatches the instruction to the scheduler 105.

The scheduler 105 issues the instructions dispatched from the allocator 104 to the arithmetic units 106 allocated by the allocator 104. Passing an instruction from the allocator 104 to the scheduler 105 may be referred to as dispatch, and passing an instruction from the scheduler 105 to an arithmetic unit 106 may be referred to as issue.

The scheduler 105 includes, for example, a queue that receives and accumulates the instructions dispatched from the allocator 104 in program order (in-order). The scheduler 105 issues instructions whose inter-instruction dependencies have been resolved from the queue to the arithmetic units 106 out of order.
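The in-order receipt and out-of-order issue described above can be sketched in a few lines of Python. This is only an illustrative software model, not the circuit of the scheduler 105: the names `Entry`, `SchedulerQueue`, `dispatch`, `complete`, and `issue` are hypothetical, and the oldest-ready-first policy is an assumption made for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    seq: int                      # program-order number given at dispatch (hypothetical field)
    name: str                     # mnemonic, for readability only
    waiting_on: set = field(default_factory=set)  # seq numbers of unfinished producers


class SchedulerQueue:
    """Toy model: entries arrive in program order and leave when ready."""

    def __init__(self):
        self.entries = []

    def dispatch(self, entry):
        # Instructions are received and accumulated in program order.
        self.entries.append(entry)

    def complete(self, seq):
        # A producer's result became usable: resolve that dependency everywhere.
        for entry in self.entries:
            entry.waiting_on.discard(seq)

    def issue(self):
        # Assumed policy: issue the oldest entry with no unresolved dependency.
        for entry in self.entries:
            if not entry.waiting_on:
                self.entries.remove(entry)
                return entry
        return None
```

With entries 0 (no dependency), 1 (waiting on 0), and 2 (no dependency) dispatched in that order, this model issues 0, then 2, and issues 1 only after `complete(0)` is called, which is the out-of-order behavior the paragraph describes.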

Each of the plurality of arithmetic units 106 executes the instructions issued to it from the scheduler 105 and writes the execution results to the register 108. The plurality of arithmetic units 106 may include arithmetic units 106a to 106h. The arithmetic unit (BRCH) 106a executes branch instructions. The arithmetic units (AGU) 106b and 106c execute address generation instructions. The arithmetic unit (STD) 106d executes transmission of store data. The arithmetic units (FXU) 106e and 106f execute fixed-point arithmetic instructions. The arithmetic units (FLU) 106g and 106h execute floating-point arithmetic instructions.

The load/store queue 107 is a queue that stores data to be loaded from or stored to the register 108 or the memory 100 during execution of instructions by the arithmetic units 106b to 106d.

The register 108 is used for execution of instructions by the plurality of arithmetic units 106. The memory 100 is a main storage device such as a DIMM (Dual Inline Memory Module).

In the processor 1, a plurality of arithmetic pipelines 109a and a plurality of load pipelines 109b are formed. Hereinafter, the arithmetic pipelines 109a and the load pipelines 109b may be referred to simply as the arithmetic pipes 109a and the load pipes 109b. One arithmetic pipe 109a includes the arrow from the scheduler 105 to an arithmetic unit 106, the arrow indicating a register read from the register 108 to the arithmetic unit 106 (corresponding to a register read port), the arithmetic unit 106, and the arrow indicating a register write from the arithmetic unit 106 to the register 108 (corresponding to a register write port). In the example of FIG. 1, the plurality of load pipes 109b are LDR0 and LDR1.

In FIG. 1, one register read port is connected to each arithmetic unit 106, but a single arrow does not represent a single operand; multiple operands, such as two or three, are drawn as one wire. In practice, a register read port may include at least (bit width) × (number of operands) wires, where the bit width is 64 bits for fixed point and the SIMD width for floating point.

The scheduler 105 has as many issue ports as there are arithmetic pipes 109a. Each arithmetic unit 106, or each path from an issue port of the scheduler 105 to an arithmetic unit 106, may be referred to as a pipeline. In the processor 1, to allow simultaneous writing and execution, as many register write ports are provided as there are pipelines, among the arithmetic pipes 109a and the load pipes 109b, that write to registers. In other words, the number of register write ports matches the number of those arithmetic pipes 109a and load pipes 109b.

[1-2] Configuration Example of Processor Core According to Comparative Example
FIG. 2 is a block diagram illustrating a configuration example of the scheduler 105 according to a comparative example. As illustrated in FIG. 2, the scheduler 105 according to the comparative example may include a payload data storage unit 105a, a dependency resolution determination unit 105b, a selection unit 105c, and a selector 105d.

The payload data storage unit 105a stores the payload data of the instructions dispatched from the allocator 104 (the accepted instructions). The storage that holds instructions, including the payload data storage unit 105a, is an example of the queue of the scheduler 105.

The dependency resolution determination unit 105b resolves the dependencies of the instructions dispatched from the allocator 104. An instruction dependency is, for example, a relationship in which the operation result (data) of a first instruction is used in the operation of a second instruction. The first instruction may be referred to as a "dependency-source instruction" or a "preceding instruction," and the second instruction may be referred to as a "dependency-destination instruction" or a "subsequent instruction."

For example, the dependency resolution determination unit 105b determines whether a data dependency has been resolved and, upon determining that it has, issues to the selection unit 105c a notification (dependency resolution notification) indicating which of the instructions whose dependencies have been resolved is to be issued. The notification may include, for example, information indicating the entry of the instruction in the queue of the scheduler 105. Resolution of a dependency may mean that the operation result of the dependency-source instruction is available when the dependency-destination instruction performs its operation.

In response to the notification from the dependency resolution determination unit 105b, the selection unit 105c arbitrates, through the selector 105d, which of the instructions whose dependencies have been resolved is issued. For example, the selection unit 105c may determine the entry to be issued based on the entry indicated by the dependency resolution notification and on information about the program order of the instructions dispatched from the allocator 104. The selection unit 105c may also indicate to the selector 105d which queue entry of the payload data storage unit 105a to select.

For example, when a plurality of subsequent instructions depend on one preceding instruction, dependency resolution notifications may be made for a plurality of entries. The selection unit 105c determines which of the instructions whose dependencies have been resolved is issued from the scheduler 105.

The selector 105d issues an instruction to the pipeline according to the selection (arbitration) made by the selection unit 105c. The "instruction" here also includes data used by the arithmetic unit 106, for example, the payload data stored in the payload data storage unit 105a. For example, the selector 105d issues the payload data of the entry selected by the selection unit 105c from the payload data storage unit 105a to the pipeline.

The scheduler 105 may include a payload data storage unit 105a and a dependency resolution determination unit 105b independently for each entry. The scheduler 105 may also include as many dependency resolution determination units 105b as there are operands (for example, 2, 3, or 4). In other words, one dependency resolution determination unit 105b may be provided per entry and per operand.
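To make the scaling concrete, the following Python sketch counts dependency resolution determination units and comparators under a deliberately simplified model. It assumes one determination unit per entry and operand, as stated above, and additionally assumes that each such unit holds one comparator per pipeline; the function names and the example figures (32 entries, 3 operands, 4 or 8 pipelines) are hypothetical and are not taken from the specification.

```python
def dependency_units(entries: int, operands: int) -> int:
    """One dependency resolution determination unit per entry and per operand."""
    return entries * operands


def comparator_count(entries: int, operands: int, pipelines: int) -> int:
    """Simplifying assumption: each unit holds one comparator per pipeline."""
    return dependency_units(entries, operands) * pipelines


# Hypothetical sizes, chosen only to show the trend.
small = comparator_count(entries=32, operands=3, pipelines=4)
large = comparator_count(entries=32, operands=3, pipelines=8)
print(small, large)
```

Under this model, doubling the number of pipelines doubles the comparator count, and adding scheduler entries increases it in proportion, which is the growth the comparative example suffers from.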

The method for determining the dependency resolution timing of instructions in the queue of the scheduler 105 (the wakeup method) is assumed here to be an associative method, but it is not limited to this; an indirect method, a direct method, or the like may be used instead. All of these methods use special circuits such as a multi-port small-capacity RAM (Random Access Memory) or a CAM (Content Addressable Memory), which makes them difficult to synthesize with an automatic synthesis tool. For example, standard cells (stdcell), which can be synthesized with an automatic synthesis tool and consume little power, may be used to implement these methods.

The dependency resolution determination unit 105b resolves data hazards between dependent instructions (states in which the pipeline of a subsequent instruction cannot advance until the operation result of the preceding instruction is available) by forwarding control. Forwarding may also be referred to as bypassing. Although not shown in FIG. 2, the scheduler 105 may include, for example, a forwarding unit 105e (see FIG. 4) that forwards data within the pipeline.

FIG. 3 is a diagram illustrating an example of pipeline stages and an example of data forwarding according to latency. Symbol A indicates an example of the forwarding timing of an add instruction with a latency of 1 cycle, and symbol B indicates an example of the forwarding timing of a mul instruction with a latency of 2 cycles. Symbol C indicates an example of the function of each stage. The add instruction (for example, add: r0 <= r1 + r2) is an arithmetic instruction that writes the sum of the data in register r1 (the data of operand 1) and the data in register r2 (the data of operand 2) to register r0. The mul instruction (for example, mul: r0 <= r1 * r2) is an arithmetic instruction that writes the product of the data of operand 1 in register r1 and the data of operand 2 in register r2 to register r0.

At symbols A and B, the stage preceding the P stage (hereinafter sometimes referred to as the "scheduler stage") indicates that the instruction is in the scheduler 105. In this stage, the dependency resolution determination unit 105b determines whether the data of the preceding instruction is available yet (that is, it waits for wakeup). For example, the dependency resolution determination unit 105b holds a wakeup bit (ready bit) for each operand (hereinafter sometimes referred to as "OP"). When all wakeup bits are on, the data for all operands is available. In that case, the dependency resolution determination unit 105b wakes up (readies) the subsequent instruction, for example by outputting a dependency resolution notification for the subsequent instruction to the selection unit 105c.
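The per-operand wakeup bits can be modeled as follows. This is a minimal Python sketch, not the circuit of the dependency resolution determination unit 105b; the class name `WakeupState` and its methods are hypothetical.

```python
class WakeupState:
    """Per-instruction wakeup (ready) bits, one per operand."""

    def __init__(self, num_operands: int):
        # Every operand starts out waiting for its producer.
        self.wakeup_bits = [False] * num_operands

    def mark_available(self, operand_index: int) -> None:
        # Turn on the wakeup bit once that operand's data can be used.
        self.wakeup_bits[operand_index] = True

    def is_ready(self) -> bool:
        # The instruction is woken up only when all wakeup bits are on.
        return all(self.wakeup_bits)
```

For a two-operand instruction, `is_ready()` stays false after only OP1 becomes available and turns true once OP2 is also marked, at which point a dependency resolution notification would be sent to the selection unit 105c.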

The P stage (see symbol C) is the stage in which, once the data for all operands has become available, the scheduler 105 selects and issues the instruction.

The B stage is the stage in which the pipeline reads data from the register 108.

The F stage is the stage in which the pipeline reads (obtains) forwarded data (operands).

The X (Xn; n is an integer greater than or equal to 1) stages are the stages in which the arithmetic unit 106 executes the operation using the data obtained in the B stage or the F stage. There are as many X stages as the operation latency of the instruction, in other words, the number of cycles required to execute the instruction.

The W stage is the stage in which the operation result (data) of the Xn stages is written to the register 108.

Stages are also named by cycle relative to the end of execution. The LX (Last X) stage is the cycle of the last Xn stage. The Mm (or LXMm) stage is the stage m cycles before LX (LX minus m cycles), where m is an integer greater than or equal to 1. The Pp (or LXPp) stage is the stage p cycles after LX (LX plus p cycles), where p is an integer greater than or equal to 1.
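The relative naming can be checked with a short Python sketch that labels each stage of an instruction by its distance from LX. It assumes the stage order P, B, F, X1 to Xn, W given above; the function name `relative_labels` is hypothetical.

```python
def relative_labels(latency: int) -> dict:
    """Map each stage name to its LX-relative label for a given operation latency."""
    stages = ["P", "B", "F"] + [f"X{i}" for i in range(1, latency + 1)] + ["W"]
    lx_index = stages.index(f"X{latency}")  # LX is the last X stage
    labels = {}
    for index, stage in enumerate(stages):
        distance = index - lx_index
        if distance == 0:
            labels[stage] = "LX"
        elif distance < 0:
            labels[stage] = f"LXM{-distance}"  # cycles before LX
        else:
            labels[stage] = f"LXP{distance}"   # cycles after LX
    return labels


print(relative_labels(1))
print(relative_labels(2))
```

For a latency of 1 the P stage is labeled LXM3, and for a latency of 2 the B stage is labeled LXM3, so the same relative label lands on different absolute stages depending on the latency.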

The dependency resolution determination unit 105b forwards (bypasses) the operation result of the preceding instruction to a stage of the subsequent instruction that precedes its Xn stages (for example, the F stage).

For example, at symbol A, the operation latency of instruction 1 (No. 1) is 1. When the dependency resolution determination unit 105b has detected the dependency between instruction 1 and OP1 of instruction 2 and detects that the add of instruction 1 has reached the P stage, it wakes up instruction 2 (No. 2). The scheduler 105 can thereby forward the operation result of the X1 stage (fourth cycle) of instruction 1 to OP1 in the F stage (immediately before the X1 stage) of instruction 2, which depends on instruction 1.

At symbol B, the operation latency of instruction 1 is 2, so the dependency resolution determination unit 105b wakes up instruction 2 when it detects that the mul of instruction 1 has reached the B stage. The scheduler 105 can thereby forward the operation result of the X2 stage (fifth cycle) of instruction 1 to OP1 in the F stage (immediately before the X1 stage) of instruction 2, which depends on instruction 1.

In this way, when performing forwarding control for multi-cycle operation latencies, the dependency resolution determination unit 105b determines how many cycles before the LX stage the subsequent instruction should be woken up. For example, in both symbols A and B, the dependency resolution determination unit 105b only needs to wake up the subsequent instruction at LXM3 (three cycles before LX). At symbol A, LXM3 of preceding instruction 1 with an operation latency of 1 is the P stage, and at symbol B, LXM3 of preceding instruction 1 with an operation latency of 2 is the B stage; in both cases, waking up at LXM3 results in forwarding at the earliest possible timing.
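The timing relationship can be reproduced with a small Python sketch. It assumes the stage order P, B, F, X1 to Xn, W, that the preceding instruction enters the P stage in cycle 1, and, as a modeling assumption not stated explicitly in the text, that a woken instruction enters its own P stage in the cycle after the wakeup. The function names are hypothetical.

```python
def timeline(p_cycle: int, latency: int) -> dict:
    """Cycle number of each stage for an instruction entering P at p_cycle."""
    stages = ["P", "B", "F"] + [f"X{i}" for i in range(1, latency + 1)] + ["W"]
    return {stage: p_cycle + offset for offset, stage in enumerate(stages)}


def wakeup_cycle(producer: dict, latency: int) -> int:
    """Wake the consumer at LXM3, i.e. three cycles before the producer's LX."""
    return producer[f"X{latency}"] - 3


def consumer_timeline(producer: dict, latency: int, consumer_latency: int = 1) -> dict:
    # Assumption: the consumer enters P one cycle after it is woken up.
    return timeline(wakeup_cycle(producer, latency) + 1, consumer_latency)


for latency in (1, 2):
    producer = timeline(1, latency)
    consumer = consumer_timeline(producer, latency)
    print(latency, wakeup_cycle(producer, latency), consumer["F"], producer[f"X{latency}"])
```

Under these assumptions the consumer's F stage falls in the same cycle as the producer's LX (cycle 4 for the add, cycle 5 for the mul), so the consumer's X1 stage follows the producer's last execution cycle with no gap.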

So that the dependency resolution determination unit 105b can determine this forwarding timing, the scheduler 105 includes, for example, comparators that compare the PRA (Physical Register Address) values of the OPs of the subsequent instruction at the LXM3 timing of the preceding instruction. The dependency resolution determination unit 105b wakes up the subsequent instruction when a comparator determines that the PRAs match. The comparator here detects a match between IDs that uniquely distinguish speculatively executed instructions. The PRA is one example of such an ID, so the comparison target is not limited to the PRA. The comparator determines a match on IDs in encoded form, but the IDs may instead be held in decoded form, and a match may be detected by taking the AND-OR of the corresponding decoded signal lines.
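The two ways of detecting an ID match mentioned above can be compared in a short Python sketch. It models the decoded form as a one-hot list of signal lines, which is an assumption about what "decoded" means here; the function names and the 8-entry ID space are hypothetical.

```python
def match_encoded(id_a: int, id_b: int) -> bool:
    """Compare two IDs (for example, PRAs) in their encoded, binary form."""
    return id_a == id_b


def decode_one_hot(id_value: int, num_ids: int) -> list:
    """Decode an ID into one signal line per possible ID value (one-hot)."""
    return [line == id_value for line in range(num_ids)]


def match_decoded(lines_a: list, lines_b: list) -> bool:
    """AND each pair of corresponding lines, then OR the results."""
    return any(a and b for a, b in zip(lines_a, lines_b))


NUM_IDS = 8  # hypothetical ID space size
same = all(
    match_encoded(a, b) == match_decoded(decode_one_hot(a, NUM_IDS), decode_one_hot(b, NUM_IDS))
    for a in range(NUM_IDS)
    for b in range(NUM_IDS)
)
print(same)
```

Both forms give the same answer for every pair of IDs; the decoded form trades wider storage for a comparison built only from AND and OR gates.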

なお、スケジューラ１０５から演算器１０６までの間には、命令を格納するキューが存在しないため、フォワーディングが実行されるのは演算器１０６の手前となる。これは、スケジューラ１０５が、命令を発行（issue）するタイミングの制御と、いずれのステージからフォワーディングするかの決定とを行なうことを意味する。 Note that since there is no queue for storing instructions between the scheduler 105 and the arithmetic unit 106, forwarding is executed before the arithmetic unit 106. This means that the scheduler 105 controls the timing of issuing an instruction and determines from which stage the instruction should be forwarded.

以下の説明では、スケジューラ１０５のキューからデータが出力される部分（図２の例ではセレクタ１０５ｄ）から、比較器を含め、タイミングの制御とフォワーディングステージの決定とを実行する部分までを、スケジューラ１０５と称するものとする。 In the following description, the range from the part where data is output from the queue of the scheduler 105 (the selector 105d in the example of FIG. 2) to the part that performs timing control and forwarding stage determination, including the comparators, is referred to as the scheduler 105.

図４は、比較例に係るプロセッサ１のコアの構成例を示す図である。図４では、コアの構成のうちの、オペランドへの演算結果のフォワーディングに着目した構成例を示す。図４では、演算レイテンシが1サイクルであり、フォワーディング対象のオペランドが1つである場合を例に挙げる。なお、以下、write addressをw addressと表記し、read addressをr addressと表記する場合がある。 FIG. 4 is a diagram illustrating an example of the configuration of the core of the processor 1 according to the comparative example. FIG. 4 shows an example of a core configuration focusing on forwarding of operation results to operands. In FIG. 4, an example is given in which the calculation latency is one cycle and the number of operands to be forwarded is one. Note that hereinafter, write address may be expressed as w address, and read address may be expressed as r address.

図４では、セレクタ１０５ｄの１つの発行口（issue port）から１つのパイプラインに命令が発行される例を示すが、演算器１０６ごとに設けられた発行口から、それぞれの演算器１０６のパイプラインに命令が発行されてよい。 FIG. 4 shows an example in which an instruction is issued from one issue port of the selector 105d to one pipeline, but instructions may be issued from issue ports provided for each arithmetic unit 106 to the pipeline of the respective arithmetic unit 106.

図４に示すように、セレクタ１０５ｄから発行された先行命令のwrite address２０１は、B、F、X1の各ステージから分岐し、それぞれのステージに対応する比較器１０５ｆに入力される。また、先行命令と依存関係を有する後続命令のread address２０２（Fステージでforwarding待ちのオペランドのアドレス）が各比較器１０５ｆに入力される。なお、図４に示すように、比較器１０５ｆを含む太実線で囲った領域（回路）は、フォワーディング部１０５ｅと称されてよい。 As shown in FIG. 4, the write address 201 of the preceding instruction issued from the selector 105d branches from each stage of B, F, and X1, and is input to the comparator 105f corresponding to each stage. Further, the read address 202 (address of the operand waiting for forwarding in the F stage) of the subsequent instruction that has a dependency relationship with the preceding instruction is input to each comparator 105f. Note that, as shown in FIG. 4, an area (circuit) surrounded by a thick solid line including the comparator 105f may be referred to as a forwarding unit 105e.

各ステージの比較器１０５ｆは、write address２０１とread address２０２とが一致、例えばPRAが一致する場合に、対応するステージのセレクタ２０５にフォワーディング情報、例えばポート1をセレクトする信号を送出する。フォワーディング情報が入力されたセレクタ２０５は、演算器１０６（一例としてＡＬＵ）からの先行命令の演算結果を、X1/LXステージ、W/LXP1ステージ、又はLXP2ステージ等から、後続命令のFステージのオペランドに入力するようにデータをセレクトする。このように、セレクタ２０５の切り替え制御により、フォワーディング２０６が行なわれる。 When the write address 201 and the read address 202 match, for example when the PRAs match, the comparator 105f of each stage sends forwarding information, for example a signal that selects port 1, to the selector 205 of the corresponding stage. The selector 205 that receives the forwarding information selects data so that the operation result of the preceding instruction from the arithmetic unit 106 (an ALU as an example) is input from the X1/LX stage, the W/LXP1 stage, the LXP2 stage, or the like to the F-stage operand of the subsequent instruction. In this way, forwarding 206 is performed by switching control of the selectors 205.

なお、先行命令の演算結果は、先行命令のW/LXP1ステージで、レジスタ１０８のwrite address２０１にreg write２０３で書き込まれる。フォワーディング２０６が行なわれない場合、当該データは、後続命令のBステージでレジスタ１０８のread address２０２からreg read２０４で読み出される。セレクタは、ポート0をセレクトし、当該データを演算器１０６に入力する。 Note that the operation result of the preceding instruction is written to the write address 201 of the register 108 by reg write 203 in the W/LXP1 stage of the preceding instruction. If forwarding 206 is not performed, the data is read by reg read 204 from read address 202 of register 108 in the B stage of the subsequent instruction. The selector selects port 0 and inputs the data to the arithmetic unit 106.

図４に示す例において、フォワーディング部１０５ｅの比較器１０５ｆの数は、フォワーディング対象のオペランド数に応じて増加、例えばオペランド数の整数倍（１～４倍程度；一例として３倍）となる。また、パイプライン数がx倍になると、比較器１０５ｆがread address２０２と比較するwrite address２０１がx倍になり、read address２０２自体もx倍になるため、比較器１０５ｆの数は、x^2倍となる。 In the example shown in FIG. 4, the number of comparators 105f in the forwarding unit 105e increases with the number of operands to be forwarded, for example to an integer multiple of the number of operands (roughly 1 to 4 times; 3 times as an example). Furthermore, when the number of pipelines increases x times, the number of write addresses 201 that the comparators 105f compare against the read address 202 increases x times, and the number of read addresses 202 also increases x times, so the number of comparators 105f increases x^2 times.

さらに、フォワーディングのタイミング数（セレクタ２０５の数）は、reg read２０４からreg write２０３までのレイテンシ－演算レイテンシ（図４では4-1==3）の数となる。このため、パイプラインが深くなる、例えば、reg read２０４からreg write２０３までのレイテンシが4から7になると、フォワーディングのタイミングが２倍となり、比較器１０５ｆの数も２倍となる。 Furthermore, the number of forwarding timings (the number of selectors 205) is the number of latency from reg read 204 to reg write 203 minus calculation latency (4-1==3 in FIG. 4). Therefore, when the pipeline becomes deeper, for example, when the latency from reg read 204 to reg write 203 increases from 4 to 7, the forwarding timing doubles and the number of comparators 105f also doubles.

また、Inflightな命令幅が増加すると、比較幅がlog2で増加するため、１つあたりの比較器１０５ｆの回路規模が増大する。 Furthermore, when the inflight instruction width increases, the comparison width increases by log2, and therefore the circuit scale of each comparator 105f increases.
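
The scaling described above for the comparative forwarding unit 105e can be sketched as a small cost model. This is only an illustrative sketch: the function names, and the assumption that the comparator count is exactly the product of the stated factors, are not taken from the specification.

```python
def forwarding_timings(reg_read_to_write_latency: int, op_latency: int) -> int:
    """Number of forwarding timings: (reg read -> reg write latency) - arithmetic latency."""
    return reg_read_to_write_latency - op_latency


def comparator_count(num_pipelines: int, operands: int,
                     reg_read_to_write_latency: int, op_latency: int) -> int:
    """Hypothetical model: write addresses scale with the pipeline count, read
    addresses scale with pipeline count x operands, and every pair is compared
    once per forwarding timing."""
    timings = forwarding_timings(reg_read_to_write_latency, op_latency)
    return num_pipelines * (num_pipelines * operands) * timings


def compare_width_bits(inflight_instructions: int) -> int:
    """Bits needed to tell in-flight instructions apart (grows as log2)."""
    return (inflight_instructions - 1).bit_length()


# FIG. 4 case: one pipeline, one operand, latency 4, arithmetic latency 1.
print(comparator_count(1, 1, 4, 1))   # 3 comparators (B, F, X1)
print(forwarding_timings(7, 1))       # 6, twice the FIG. 4 value
```

Under this model, doubling the pipeline count quadruples the comparator count, which matches the x^2 growth noted for the comparative example.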

このように、プロセッサ１のコアの高機能化によるＩＰＣの向上を図る場合、フォワーディング部１０５ｅの回路規模が増加する。 In this way, when improving the IPC by increasing the functionality of the core of the processor 1, the circuit scale of the forwarding unit 105e increases.

図５は、比較例に係る依存解決判定部１０５ｂの構成例を示すブロック図である。図５では、スケジューラ１０５における1エントリ且つ1オペランド分のwakeupを行なう依存解決判定部１０５ｂの構成例を示す。 FIG. 5 is a block diagram illustrating a configuration example of the dependency resolution determination unit 105b according to the comparative example. FIG. 5 shows a configuration example of the dependency resolution determination unit 105b that performs wakeup for one entry and one operand in the scheduler 105.

図５に示すように、依存解決判定部１０５ｂは、最短フォワードタイミング判定比較回路３０１、キャッシュミスタイミング判定比較回路３０３及び３０４、キャッシュミス判定回路３０５並びにＡＮＤ回路３０６を備える。 As shown in FIG. 5, the dependency resolution determination unit 105b includes a shortest forward timing determination comparison circuit 301, cache miss timing determination comparison circuits 303 and 304, a cache miss determination circuit 305, and an AND circuit 306.

最短フォワードタイミング判定比較回路３０１は、最短フォワードタイミングを判定する比較回路である。最短フォワードタイミング判定比較回路３０１は、パイプラインごとの比較器、例えば演算パイプ１０９ａの数＋ロードパイプ１０９ｂの数に相当する数の比較器を備える。 The shortest forward timing determination comparison circuit 301 is a comparison circuit that determines the shortest forward timing. The shortest forward timing determination comparison circuit 301 includes comparators for each pipeline, for example, the number of comparators corresponding to the number of calculation pipes 109a+the number of load pipes 109b.

キャッシュミス判定回路３０５には、各ロードパイプ１０９ｂからのキャッシュミス判定信号３０２が入力される。キャッシュミス判定信号３０２は、例えば図１に示すロード／ストアキュー１０７からレジスタ１０８へのLDR0及びLDR1の信号であってよい。 A cache miss determination signal 302 from each load pipe 109b is input to the cache miss determination circuit 305. The cache miss determination signal 302 may be, for example, the LDR0 and LDR1 signals from the load/store queue 107 to the register 108 shown in FIG.

スケジューラ１０５は、キャッシュミスが発生する前に、キャッシュミス依存のロード（キャッシュミスしたロード命令（１）から依存するロード命令（２））を投機的に発行する。ロード命令（１）の一例としては、r0レジスタの中身に書かれているアドレスにあるデータをr1レジスタに書き込むロード命令“load r1 <= [r0]”が挙げられる。また、ロード命令（２）の一例としては、ロード命令“load r2 <= [r1]”が挙げられる。 The scheduler 105 speculatively issues a cache miss-dependent load (load instruction (2) dependent from the load instruction (1) that caused the cache miss) before a cache miss occurs. An example of the load instruction (1) is a load instruction "load r1 <= [r0]" that writes data at the address written in the contents of the r0 register to the r1 register. Furthermore, an example of the load instruction (2) is the load instruction "load r2 <= [r1]".

キャッシュミスタイミング判定比較回路３０３は、投機的に発行したロードをキャンセルするための比較回路である。キャッシュミスタイミング判定比較回路３０４は、投機的に発行したキャッシュミスしたロード命令（１）の結果を用いた演算命令（２）の演算結果に依存した命令（３）をキャンセルするための比較回路である。なお、addは演算の一例である。ロード命令（１）の一例としては、“load r1 <= [r0]”が挙げられる。演算命令（２）の一例としては、r1レジスタのデータと、r1レジスタのデータとを加算してr2レジスタに書き込むadd命令“add r2 <= r1 + r1”が挙げられる。命令（３）の一例としては、add命令“add r3 <= r2 + r2”が挙げられる。また、命令（３）の他の例としては、ロード命令“load r3 <= [r2]”が挙げられる。 The cache miss timing determination comparison circuit 303 is a comparison circuit for canceling a speculatively issued load. The cache miss timing determination comparison circuit 304 is a comparison circuit for canceling a speculatively issued instruction (3) that depends on the operation result of an operation instruction (2), which in turn uses the result of a load instruction (1) that missed the cache. Note that add is an example of an operation. An example of the load instruction (1) is "load r1 <= [r0]". An example of the operation instruction (2) is the add instruction "add r2 <= r1 + r1", which adds the data in the r1 register to the data in the r1 register and writes the result to the r2 register. An example of the instruction (3) is the add instruction "add r3 <= r2 + r2". Another example of the instruction (3) is the load instruction "load r3 <= [r2]".

キャッシュミスタイミング判定比較回路３０３は、キャッシュミスしたロードから、直接依存のあるロード又は演算命令について、ロードパイプ１０９ｂにおけるキャッシュミスタイミングを判定する回路であり、ロードパイプ１０９ｂの数に相当する数の比較器を備える。 The cache miss timing determination comparison circuit 303 is a circuit that determines the cache miss timing in the load pipes 109b for a load or operation instruction that directly depends on a load that missed the cache, and includes a number of comparators corresponding to the number of load pipes 109b.

キャッシュミスタイミング判定比較回路３０４は、上述した命令（３）について、演算パイプ１０９ａにおけるキャッシュミスタイミングを判定する回路であり、演算パイプ１０９ａの数に相当する数の比較器を備える。 The cache miss timing determination comparison circuit 304 is a circuit that determines the cache miss timing in the arithmetic pipe 109a for the above-mentioned instruction (3), and includes a number of comparators corresponding to the number of arithmetic pipes 109a.

キャッシュミス判定回路３０５は、キャッシュミス判定信号３０２と、キャッシュミスタイミング判定比較回路３０３及び３０４の出力とに基づき、キャッシュミスの有無を判定する回路である。 The cache miss determination circuit 305 is a circuit that determines the presence or absence of a cache miss based on the cache miss determination signal 302 and the outputs of the cache miss timing determination comparison circuits 303 and 304.

ＡＮＤ回路３０６は、最短フォワードタイミング判定比較回路３０１の出力と、キャッシュミス判定回路３０５の出力の反転とのＡＮＤを取った結果を、依存解決通知の信号３０７として出力する。信号３０７は、スケジューラ１０５（キュー）のエントリxxにおけるOPxのデータ依存解決ビット（図５の例ではQ_xx_OPx_ready bit）である。 The AND circuit 306 outputs the result of ANDing the output of the shortest forward timing determination comparison circuit 301 and the inversion of the output of the cache miss determination circuit 305 as a dependency resolution notification signal 307. The signal 307 is a data dependency resolution bit (Q_xx_OPx_ready bit in the example of FIG. 5) of OPx in entry xx of the scheduler 105 (queue).
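
The logic of the AND circuit 306 can be sketched as follows. This is a behavioral sketch only; the function names and the use of `None` to mark a pipeline carrying no instruction are illustrative assumptions, not part of the specification.

```python
from typing import Optional, Sequence


def shortest_forward_match(operand_id: int,
                           pipeline_ids: Sequence[Optional[int]]) -> bool:
    """Model of circuit 301: one comparator per pipeline, OR-ed together.
    A pipeline carrying no instruction is represented by None (assumption)."""
    return any(pid == operand_id for pid in pipeline_ids)


def operand_ready(match: bool, cache_miss: bool) -> bool:
    """Model of AND circuit 306: ready = match AND (NOT cache_miss)."""
    return match and not cache_miss


# The producer with ID 7 is seen in the second pipeline and no cache miss is flagged.
match = shortest_forward_match(7, [3, 7, None])
print(operand_ready(match, cache_miss=False))  # True
print(operand_ready(match, cache_miss=True))   # False
```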

図５に示す例において、依存解決判定部１０５ｂは、キューのエントリ数がM倍になると回路規模もM倍となる。また、依存解決判定部１０５ｂにおけるブロック３０１、３０３及び３０４の比較器の数は、フォワーディング対象のオペランド数に応じて増加、例えばオペランド数の整数倍（１～４倍程度；一例として３倍）となる。さらに、パイプライン数がx倍になると、比較器の数もx倍となる。 In the example shown in FIG. 5, when the number of queue entries increases M times, the circuit scale of the dependency resolution determination unit 105b also increases M times. In addition, the number of comparators in blocks 301, 303, and 304 of the dependency resolution determination unit 105b increases with the number of operands to be forwarded, for example to an integer multiple of the number of operands (roughly 1 to 4 times; 3 times as an example). Furthermore, when the number of pipelines increases x times, the number of comparators also increases x times.

また、最短フォワードタイミングが１つから２つに増加する場合、最短フォワードタイミング判定比較回路３０１の比較器の数は２倍になる。 Furthermore, when the shortest forward timing increases from one to two, the number of comparators in the shortest forward timing determination comparison circuit 301 doubles.

さらに、Inflightな命令幅が増加すると、比較幅（id幅）がlog2で増加するため、１つあたりの比較器の回路規模が増大する。 Furthermore, as the inflight instruction width increases, the comparison width (id width) increases by log2, which increases the circuit scale of each comparator.

このように、プロセッサ１のコアの高機能化によるＩＰＣの向上を図る場合、依存解決判定部１０５ｂの回路規模が増加する。 In this way, when improving the IPC by increasing the functionality of the core of the processor 1, the circuit scale of the dependency resolution determination unit 105b increases.

〔１－３〕一実施形態に係るプロセッサコアの構成例 [1-3] Configuration example of processor core according to one embodiment
以下、一実施形態に係るプロセッサ１のコアについて説明する。なお、以下の一実施形態の説明において、特に言及しない構成、処理及び機能は、比較例と同様である。 The core of the processor 1 according to one embodiment will be described below. Note that in the following description of one embodiment, configurations, processes, and functions that are not particularly mentioned are the same as those in the comparative example.

図６は、一実施形態に係るスケジューラ１０５の構成例を示すブロック図である。図６に例示するように、一実施形態に係るスケジューラ１０５は、図２に示す比較例と比較して、依存解決判定部１０５ｂに代えて依存解決判定部１１を備え、パイプラインステージ情報を出力するセレクタ１０５ｄを備える点が異なる。スケジューラ１０５は、複数の命令のうちの依存元命令が実行されるパイプラインと、当該パイプラインにおける依存元命令の実行ステージとを示す指標を保持し、複数の命令のうちの前記依存元命令と、依存元命令の実行結果を利用する依存先命令とのデータ依存解決を行ない、複数の命令の発行タイミングを制御する制御部の一例である。 FIG. 6 is a block diagram illustrating a configuration example of the scheduler 105 according to one embodiment. As illustrated in FIG. 6, the scheduler 105 according to one embodiment differs from the comparative example shown in FIG. 2 in that it includes a dependency resolution determination unit 11 in place of the dependency resolution determination unit 105b and includes a selector 105d that outputs pipeline stage information. The scheduler 105 is an example of a control unit that holds an index indicating the pipeline in which a dependency-source instruction among a plurality of instructions is executed and the execution stage of the dependency-source instruction in that pipeline, performs data dependency resolution between the dependency-source instruction and a dependency-destination instruction that uses the execution result of the dependency-source instruction, and controls the issue timing of the plurality of instructions.

依存解決判定部１１は、選択部１０５ｃに依存解決通知を出力するとともに、エントリ選択信号でセレクトされるセレクタ１０５ｄ（図６の下側のセレクタ１０５ｄ）にパイプラインステージ（以下、「ＰＳ」と表記する場合がある）情報２０７を出力する。 The dependency resolution determination unit 11 outputs a dependency resolution notification to the selection unit 105c and also outputs pipeline stage (hereinafter sometimes written as "PS") information 207 to the selector 105d that is selected by the entry selection signal (the lower selector 105d in FIG. 6).

選択部１０５ｃからはエントリ選択信号が出力される。例えば選択部１０５ｃからセレクタ１０５ｄに入る矢印は、スケジューラ１０５からＰＳ情報２０７を含む発行命令の情報をパイプラインに出力するために使用され、選択部１０５ｃから依存解決判定部１１に入る矢印は、後続依存命令の解決のために使用される。図６の上側のセレクタ１０５ｄは、エントリ選択信号に基づき命令をパイプラインに発行する。図６の下側のセレクタ１０５ｄは、スケジューラ１０５内のエントリ毎に保持される（依存解決判定部１１内の）ＰＳ情報２０７をエントリ選択信号で選択し、上側のセレクタ１０５ｄから命令発行されるエントリのＰＳ情報２０７を出力する。 The selection unit 105c outputs an entry selection signal. For example, the arrow from the selection unit 105c into the selector 105d is used to output information on the issued instruction, including the PS information 207, from the scheduler 105 to the pipeline, and the arrow from the selection unit 105c into the dependency resolution determination unit 11 is used to resolve subsequent dependent instructions. The upper selector 105d in FIG. 6 issues an instruction to the pipeline based on the entry selection signal. The lower selector 105d in FIG. 6 uses the entry selection signal to select among the PS information 207 held for each entry of the scheduler 105 (in the dependency resolution determination unit 11), and outputs the PS information 207 of the entry whose instruction is issued from the upper selector 105d.

依存解決判定部１１は、ＰＳ情報２０７並びに図５（及び後述する図８）に基づき、複数の命令のうちの依存元命令と依存関係にある依存先命令であって、依存元命令の実行結果を利用する依存先命令の依存解決タイミングを制御する。１つの実施例として、IEEE754-2008で規定されるFMA（Fused multiply-add）、例えば積に対して和を行なうx*y+zの積和演算に着目する。例えば、依存解決判定部１１は、FMAの加算部への依存のケースにおいて、ＰＳ情報２０７に基づいて“Q_xx_OPx_ready”を立て、“Q_xx_OPx_ready”によってwakeupを行なう。 Based on the PS information 207 and on FIG. 5 (and FIG. 8, described later), the dependency resolution determination unit 11 controls the dependency resolution timing of a dependency-destination instruction that, among the plurality of instructions, has a dependency on the dependency-source instruction and uses the execution result of the dependency-source instruction. As one example, consider FMA (fused multiply-add) defined in IEEE 754-2008, for example the multiply-add operation x*y+z, which adds z to a product. For example, in the case of a dependency on the addition part of the FMA, the dependency resolution determination unit 11 sets "Q_xx_OPx_ready" based on the PS information 207 and performs wakeup using "Q_xx_OPx_ready".

図７は、ＰＳ情報２０７の一例を示す図である。ＰＳ情報２０７は、複数の命令のうちの依存元命令が実行されるパイプラインと、パイプラインにおける依存元命令の実行ステージとを示す指標の一例である。 FIG. 7 is a diagram showing an example of the PS information 207. The PS information 207 is an example of an index indicating a pipeline in which a dependent instruction among a plurality of instructions is executed and an execution stage of the dependent instruction in the pipeline.

例えば、ＰＳ情報２０７は、スケジューラ１０５のエントリ内のオペランドごとに、当該オペランドの先行命令のパイプライン情報及びステージ情報を含んでよい。図７の例では、ＰＳ情報２０７は、オペランドごとのステージＩＤ２０８と、依存元命令が発行されたパイプラインの識別情報の一例であるパイプラインＩＤ２０９とを含む。 For example, the PS information 207 may include, for each operand in the entry of the scheduler 105, pipeline information and stage information of the preceding instruction of the operand. In the example of FIG. 7, the PS information 207 includes a stage ID 208 for each operand, and a pipeline ID 209 that is an example of identification information of the pipeline in which the dependent instruction was issued.

図７の例では、ＰＳ情報２０７は、[5:3]の3ビットのパイプラインＩＤ２０９と、[2:0]の3ビットのステージＩＤ２０８とを含む6ビットの情報であってよい。なお、ステージＩＤ２０８及びパイプラインＩＤ２０９のビット数は、これらに限定されるものではない。 In the example of FIG. 7, the PS information 207 may be 6-bit information including a 3-bit pipeline ID 209 of [5:3] and a 3-bit stage ID 208 of [2:0]. Note that the number of bits of the stage ID 208 and pipeline ID 209 is not limited to these.
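
The 6-bit layout in the FIG. 7 example (pipeline ID in bits [5:3], stage ID in bits [2:0]) can be illustrated with a small pack/unpack sketch. The helper names `pack_ps` and `unpack_ps` are hypothetical and chosen only for illustration; the bit widths follow the example and, as the text notes, are not limiting.

```python
STAGE_BITS = 3      # stage ID 208 occupies bits [2:0] in the FIG. 7 example
PIPELINE_BITS = 3   # pipeline ID 209 occupies bits [5:3] in the FIG. 7 example


def pack_ps(pipeline_id: int, stage_id: int) -> int:
    """Pack a pipeline ID and a stage ID into one 6-bit PS information value."""
    if not 0 <= pipeline_id < (1 << PIPELINE_BITS):
        raise ValueError("pipeline_id out of range")
    if not 0 <= stage_id < (1 << STAGE_BITS):
        raise ValueError("stage_id out of range")
    return (pipeline_id << STAGE_BITS) | stage_id


def unpack_ps(ps: int) -> tuple[int, int]:
    """Split a PS information value back into (pipeline_id, stage_id)."""
    pipeline_id = (ps >> STAGE_BITS) & ((1 << PIPELINE_BITS) - 1)
    stage_id = ps & ((1 << STAGE_BITS) - 1)
    return pipeline_id, stage_id


ps = pack_ps(pipeline_id=5, stage_id=2)
print(format(ps, "06b"))   # 101010
print(unpack_ps(ps))       # (5, 2)
```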

ステージＩＤ２０８は、例えば、最短のフォワーディングタイミングからフォワーディングが不要になるレジスタ１０８経由のデータ受け渡しの前までのステージごとにユニークなＩＤ（stage_id）である。換言すれば、ステージＩＤ２０８は、依存元命令から依存先命令に実行結果（例えば演算結果）を最短で転送できる発行タイミング（例えば後述する図１３の命令2のＰ）から、実行結果をレジスタ１０８に格納し、依存先命令が実行結果を読み出せる発行タイミング（例えば図１３の命令５から後続のＰ）までの各ステージに一意に割り当てられたステージの識別情報の一例である。 The stage ID 208 is, for example, an ID (stage_id) unique to each stage from the shortest forwarding timing up to just before data is passed via the register 108, at which point forwarding becomes unnecessary. In other words, the stage ID 208 is an example of stage identification information uniquely assigned to each stage, from the issue timing at which the execution result (for example, an operation result) can be transferred from the dependency-source instruction to the dependency-destination instruction in the shortest time (for example, P of instruction 2 in FIG. 13, described later) to the issue timing at which the execution result has been stored in the register 108 and the dependency-destination instruction can read it (for example, P of instruction 5 and later in FIG. 13).

ステージＩＤ２０８は、後続命令がPステージになった（発行された）ときの先行命令のステージを示してよい。 The stage ID 208 may indicate the stage of the preceding instruction when the subsequent instruction reaches the P stage (is issued).

例えば、ステージＩＤ２０８は、最短フォワーディングタイミングのLXMnで初期値（例えば0）以外の値をセットされ、LXPnで初期値（例えば0）をセット（0-cleared）されるカウンタとして表現されてよい。換言すれば、ステージＩＤ２０８は、依存元命令が実行される最後のサイクルから第１所定数だけ前のサイクルでセットされ、最後のサイクルから第２所定数だけ後のサイクルでリセットされるカウンタであり、カウンタの値が各ステージに一意に割り当てられたものである。 For example, the stage ID 208 may be expressed as a counter that is set to a value other than the initial value (for example, 0) at LXMn, the shortest forwarding timing, and is set back to the initial value (for example, 0; 0-cleared) at LXPn. In other words, the stage ID 208 is a counter that is set in the cycle a first predetermined number of cycles before the last cycle in which the dependency-source instruction is executed and is reset in the cycle a second predetermined number of cycles after that last cycle, with each counter value uniquely assigned to a stage.

第１所定数及び第２所定数は、異なるパイプライン及びforwarding種の組み合わせごとに違った値であってもよい。また、異なるステージＩＤ２０８が割り当てられてもよい。これら２つの理由は、フォワードタイミングがパイプラインの長さとレジスタ１０８の読み書きタイミングとに依存するためである。forwarding種とは、例えば固定小数点レジスタ、浮動小数点レジスタ、CF（Condition Flag）等である。 The first predetermined number and the second predetermined number may have different values for different combinations of pipelines and forwarding types. Also, different stage IDs 208 may be assigned. These two reasons are because forward timing depends on the length of the pipeline and the read/write timing of register 108. The forwarding type is, for example, a fixed point register, a floating point register, a CF (Condition Flag), or the like.

パイプラインＩＤ２０９は、依存元命令が流れているパイプラインの識別情報の一例であり、パイプラインごとにユニークな値が設定されてよい。パイプラインＩＤ２０９は、依存元命令がパイプラインに入力されるタイミング、及び、依存元命令が入力されているパイプラインを検出するために利用される。 The pipeline ID 209 is an example of identification information of the pipeline through which the dependent instruction is flowing, and a unique value may be set for each pipeline. The pipeline ID 209 is used to detect the timing at which the dependent instruction is input to the pipeline and the pipeline to which the dependent instruction is input.

図８は、一実施形態に係る依存解決判定部１１の構成例を示すブロック図である。図８では、スケジューラ１０５における1エントリ且つ1オペランド分のwakeupを行なう依存解決判定部１１の構成例を示す。 FIG. 8 is a block diagram illustrating a configuration example of the dependency resolution determination unit 11 according to an embodiment. FIG. 8 shows a configuration example of the dependency resolution determination unit 11 that performs wakeup for one entry and one operand in the scheduler 105.

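図８に示すように、依存解決判定部１１は、図５に示す比較例と比較して、キャッシュミスタイミング判定比較回路３０３及び３０４に代えてキャッシュミス判定タイミング生成回路１１ａを備えるとともに、新たにＰＳ情報更新回路１１ｂを備える点が異なる。 As shown in FIG. 8, the dependency resolution determination unit 11 differs from the comparative example shown in FIG. 5 in that it includes a cache miss determination timing generation circuit 11a in place of the cache miss timing determination comparison circuits 303 and 304, and additionally includes a PS information update circuit 11b.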

キャッシュミス判定タイミング生成回路１１ａは、演算レイテンシ信号及び演算パイプ先行ロードタイミング信号を生成する。なお、演算レイテンシ信号及び演算パイプ先行ロードタイミング信号は、制御部１０５ｇ（図１６参照）から入力されてもよい。なお、実際には、フォワーディング及びキャッシュミス判定では同じＰＳ情報２０７が利用されるため、図９に示すフォワーディング部１２と図１６に示す制御部１０５ｇとは一体となってもよい。 The cache miss determination timing generation circuit 11a generates an arithmetic latency signal and an arithmetic pipe advance load timing signal. Note that the calculation latency signal and the calculation pipe advance load timing signal may be input from the control unit 105g (see FIG. 16). Incidentally, since the same PS information 207 is actually used in forwarding and cache miss determination, the forwarding section 12 shown in FIG. 9 and the control section 105g shown in FIG. 16 may be integrated.

演算レイテンシ信号は、パイプラインが複数レイテンシである場合に生成される信号であり、パイプラインごとに生成される。演算パイプ先行ロードタイミング信号は、演算パイプ１０９ａを流れる演算命令へ直接或いは間接的（間に別の依存命令が存在する場合）に依存があるロード命令のキャッシュミス判定タイミングを示す信号である。すなわち、演算パイプ先行ロードタイミング信号とは、依存チェーンにある先行ロード命令が、キャッシュミス判定タイミングを迎えておらずいつキャッシュミス判定タイミングを迎えるか、今がキャッシュミス判定タイミングであるのか、或いは、キャッシュミス判定タイミングを過ぎてキャッシュミス確定であるのか、を示す信号である。演算パイプ１０９ａを流れる演算命令へ間接的に依存があるケースは、１つのオペランドに対して複数の演算パイプ先行ロードタイミング信号が立ち得る。 The calculation latency signal is a signal generated when a pipeline has multiple latencies, and is generated for each pipeline. The arithmetic pipe advance load timing signal is a signal indicating the cache miss determination timing of a load instruction that is directly or indirectly (if another dependent instruction exists) dependent on the arithmetic instruction flowing through the arithmetic pipe 109a. In other words, the arithmetic pipe advance load timing signal indicates when the advance load instruction in the dependency chain has not yet reached the cache miss determination timing and will reach the cache miss determination timing, or whether it is now the cache miss determination timing. This signal indicates whether a cache miss has been confirmed after the cache miss determination timing has passed. In the case where there is indirect dependence on the arithmetic instruction flowing through the arithmetic pipe 109a, a plurality of arithmetic pipe advance load timing signals may rise for one operand.

ＰＳ情報更新回路１１ｂは、最短フォワードタイミング判定比較回路３０１、キャッシュミス判定回路３０５、及び、ＰＳ情報２０７に基づき、信号１１ｃを出力する。 The PS information update circuit 11b outputs a signal 11c based on the shortest forward timing determination comparison circuit 301, the cache miss determination circuit 305, and the PS information 207.

信号１１ｃは、スケジューラ１０５（キュー）のエントリxxにおけるオペランドxの依存元命令（先行命令）のパイプラインとステージ情報とを有するビット（図８の例ではQ_xx_OPx_igt [x:0] bit）であり、更新後のＰＳ情報２０７の一例である。 The signal 11c is a bit (Q_xx_OPx_igt [x:0] bit in the example of FIG. 8) that has the pipeline and stage information of the dependent instruction (preceding instruction) of the operand x in the entry xx of the scheduler 105 (queue), This is an example of updated PS information 207.

このように、ＰＳ情報更新回路１１ｂは、キューのエントリごと、且つ、各々のオペランドごとに、依存元命令に関するＰＳ情報２０７を生成する。また、ＰＳ情報更新回路１１ｂは、オペランドごとに保持したＰＳ情報２０７に基づき、データ依存解決を行なう。ＰＳ情報更新回路１１ｂは、ＰＳ情報２０７を保持するため、ＰＳ情報保持部と称されてもよい。なお、オペランドとは、演算器の直前に設けられたdata sourceのレジスタのことであり、例えば、Fサイクルで行なったforwarding結果を保持し、X1サイクルで出力するレジスタのことである。 In this way, the PS information update circuit 11b generates PS information 207 regarding the dependent instruction for each queue entry and for each operand. Further, the PS information update circuit 11b performs data dependency resolution based on the PS information 207 held for each operand. Since the PS information update circuit 11b holds the PS information 207, it may be called a PS information holding unit. Note that the operand is a data source register provided immediately before the arithmetic unit, and is, for example, a register that holds the forwarding result performed in the F cycle and outputs it in the X1 cycle.

図９は、一実施形態に係るプロセッサ１のコアの一部の構成例を示す図である。図９に例示するように、一実施形態では、プロセッサ１（スケジューラ１０５）は、図４に示す比較例に係るフォワーディング部１０５ｅに代えて、フォワーディング部１２を備える。 FIG. 9 is a diagram illustrating a partial configuration example of the core of the processor 1 according to an embodiment. As illustrated in FIG. 9, in one embodiment, the processor 1 (scheduler 105) includes a forwarding unit 12 instead of the forwarding unit 105e according to the comparative example shown in FIG.

フォワーディング部１２は、図４に示す比較例に係る複数の比較器１０５ｆに代えて、情報生成部１３を備える。 The forwarding unit 12 includes an information generation unit 13 instead of the plurality of comparators 105f according to the comparative example shown in FIG.

フォワーディング部１２は、ＰＳ情報２０７に基づき、依存元命令の演算結果を、依存解決判定部１１による制御に応じたタイミングでパイプラインに発行された依存先命令のオペランドにフォワーディングする制御を実行するフォワーディング制御部の一例である。 The forwarding unit 12 is an example of a forwarding control unit that, based on the PS information 207, performs control to forward the operation result of the dependency-source instruction to the operand of the dependency-destination instruction issued to the pipeline at a timing according to the control by the dependency resolution determination unit 11.

情報生成部１３は、スケジューラ１０５の依存解決判定部１１で生成され保持された、オペランドごとのステージＩＤ２０８に基づき、セレクタ２０５（フォワーディング２０６）に出力するフォワーディング情報を生成する。 The information generation unit 13 generates forwarding information to be output to the selector 205 (forwarding 206) based on the stage ID 208 for each operand, which is generated and held by the dependency resolution determination unit 11 of the scheduler 105.

ステージＩＤ２０８は、reg read２０４からreg write２０３までのレイテンシ－演算レイテンシのフォワーディングタイミング（forwarding mux；図９の例では3つのタイミング）のそれぞれにユニークに割り当てられる。複数のフォワーディングタイミングのそれぞれは、セレクタ２０５に対応する。 The stage ID 208 is uniquely assigned to each of the latency-calculation latency forwarding timings (forwarding mux; three timings in the example of FIG. 9) from reg read 204 to reg write 203. Each of the plurality of forwarding timings corresponds to the selector 205.

なお、フォワーディングタイミング数は、演算レイテンシが2サイクルである場合でも、reg read２０４からreg write２０３までのレイテンシは5サイクルとなり、演算レイテンシが2となるため、演算レイテンシが1の場合と同様に3になる。 Note that even when the arithmetic latency is 2 cycles, the number of forwarding timings is 3, the same as when the arithmetic latency is 1, because the latency from reg read 204 to reg write 203 becomes 5 cycles and the arithmetic latency is 2 (5-2==3).

ステージＩＤ２０８のビット数（stage_id幅）は、フォワーディングタイミング（図９の例では3つ）＋レジスタ経由でデータを読み出すタイミング（依存未検出；dependency undetected、例えば1つ）==4より、log_2{4}==2となる。 The number of bits of the stage ID 208 (stage_id width) is log_2{4}==2, because the forwarding timings (three in the example of FIG. 9) plus the timing of reading data via the register (dependency undetected; for example, one) give 4 values to encode.
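
The width calculation above can be written as a short sketch. The function name is hypothetical, and rounding up to a whole number of bits when the code count is not a power of two is an assumption that the specification does not state explicitly.

```python
def stage_id_width(reg_read_to_write_latency: int, op_latency: int) -> int:
    """Bits needed for stage_id: one code per forwarding timing plus one
    code (0) for reading via the register (dependency undetected)."""
    timings = reg_read_to_write_latency - op_latency
    codes = timings + 1
    # Smallest bit count that can represent `codes` distinct values (rounded up).
    return (codes - 1).bit_length()


print(stage_id_width(4, 1))  # 2 bits: 3 forwarding timings + 1 register read
print(stage_id_width(5, 2))  # 2 bits: arithmetic latency 2 still gives 3 timings
```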

図９に示すパイプラインでは、ステージＩＤ２０８は、2ビットで表現され、例えば3つのフォワーディングタイミングに対して{1,2,3}が割り当てられる。フォワーディングタイミング{0}はレジスタリード（レジスタ１０８経由でデータを読み出す）に割り当てられる。 In the pipeline shown in FIG. 9, the stage ID 208 is expressed by 2 bits, and for example, {1, 2, 3} is assigned to three forwarding timings. Forwarding timing {0} is assigned to register read (reading data via register 108).

情報生成部１３は、依存解決判定部１１から入力されるステージＩＤ２０８をデコードする。情報生成部１３は、ステージＩＤ２０８がフォワーディングタイミングに割り当てられた1,2,3のいずれかである場合、1hot0のforward指示信号を生成する。例えば、情報生成部１３は、ステージＩＤ２０８に対応するセレクタ２０５に対しては値が1のforward指示信号を生成し、他の全てのセレクタ２０５に対しては値が0となるforward指示信号を生成する。値が1のforward指示信号が入力されたセレクタ２０５は、比較例と同様に、フォワーディング２０６を実行する。forward指示信号は、フォワーディング情報の一例である。 The information generation unit 13 decodes the stage ID 208 input from the dependency resolution determination unit 11. When the stage ID 208 is 1, 2, or 3, the values assigned to the forwarding timings, the information generation unit 13 generates 1hot0 (one-hot or all-zero) forward instruction signals. For example, the information generation unit 13 generates a forward instruction signal with a value of 1 for the selector 205 corresponding to the stage ID 208 and forward instruction signals with a value of 0 for all other selectors 205. The selector 205 that receives the forward instruction signal with a value of 1 executes forwarding 206 as in the comparative example. The forward instruction signal is an example of forwarding information.

なお、情報生成部１３は、ステージＩＤ２０８がレジスタ１０８経由でデータを読み出すタイミングに割り当てられた値（0）である場合、全てのセレクタ２０５に対して、値が0のforward指示信号を生成する。この場合、セレクタ２０５は、比較例と同様にreg read２０４を選択する。 Note that when the stage ID 208 is a value (0) assigned to the timing of reading data via the register 108, the information generation unit 13 generates a forward instruction signal having a value of 0 for all selectors 205. In this case, the selector 205 selects reg read 204 as in the comparative example.
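
The decoding performed by the information generation unit 13 can be sketched as follows. The function name and the assumption that stage ID n drives the n-th selector in list order are illustrative only; the specification states only that each stage ID corresponds to one selector 205.

```python
def forward_signals(stage_id: int, num_timings: int = 3) -> list[int]:
    """Decode a stage ID into one forward instruction signal per selector.

    stage_id 0 means register read, so every signal is 0; a stage_id in
    1..num_timings asserts exactly one signal (one-hot-or-zero)."""
    if not 0 <= stage_id <= num_timings:
        raise ValueError("stage_id out of range")
    return [1 if stage_id == timing else 0
            for timing in range(1, num_timings + 1)]


print(forward_signals(2))  # [0, 1, 0]: only the second selector forwards
print(forward_signals(0))  # [0, 0, 0]: all selectors choose reg read
```

Compared with the comparative example, the per-stage address comparators are replaced by this decode of a value that the scheduler already holds.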

図１０は、図８に示す依存解決判定部１１の回路構成例を示す図である。依存解決判定部１１は、依存解決判定部１０５ｂ内エントリごと且つオペランドごとに設けられている。図１０に示すように、依存解決判定部１１は、エントリごと且つオペランドごとに保持された先行依存命令を指すID(Q_xx_OPx_INST_ID)と、演算パイプ１０９ａ及びロードパイプ１０９ｂを流れる上記先行依存命令のIDとの一致を検知する最短フォワードタイミング判定比較回路３０１（図１０参照）を有する。 FIG. 10 is a diagram illustrating an example circuit configuration of the dependency resolution determination unit 11 shown in FIG. 8. The dependency resolution determination unit 11 is provided for each entry in the dependency resolution determination unit 105b and for each operand. As shown in FIG. 10, the dependency resolution determination unit 11 includes the shortest forward timing determination comparison circuit 301 (see FIG. 10), which detects a match between the ID (Q_xx_OPx_INST_ID) indicating the preceding dependency instruction, held for each entry and each operand, and the IDs of the preceding dependency instructions flowing through the arithmetic pipes 109a and the load pipes 109b.

そして、最短フォワードタイミング判定比較回路３０１は、先行依存命令が流れ得るパイプライン１０９ａ及び１０９ｂの数に対応する比較器を持つ。例えば、最短フォワードタイミング判定比較回路３０１は、フォワーディング部１２から通知された、演算パイプ１０９ａに対する最短フォワードタイミングLXM3ステージのINST_ID(LXM3_FXa_INST_ID)や各ロードパイプ１０９ｂの最短フォワードタイミングLステージのINST_ID(L_LDX_INST_ID)と、エントリごと且つオペランドごとに保持された先行依存命令を指すINST_ID(Q_xx_OPx_INST_ID)との一致を検出する。先行依存命令のINST_ID(instruction_id)は、各命令にユニークなIDであり、PRA等が用いられてもよい。そして、最短フォワードタイミング判定比較回路３０１は、直接の依存命令がどのパイプラインを流れたかという検知情報を、ＰＳ情報更新回路１１ｂに通知する。 The shortest forward timing determination comparison circuit 301 has comparators corresponding to the number of pipelines 109a and 109b through which preceding dependent instructions can flow. For example, the shortest forward timing determination comparison circuit 301 uses the INST_ID (LXM3_FXa_INST_ID) of the shortest forward timing LXM3 stage for the calculation pipe 109a and the INST_ID (L_LDX_INST_ID) of the shortest forward timing L stage of each load pipe 109b notified from the forwarding unit 12. , a match is detected with INST_ID (Q_xx_OPx_INST_ID) that indicates a preceding dependent instruction held for each entry and each operand. INST_ID (instruction_id) of the preceding dependent instruction is an ID unique to each instruction, and PRA or the like may be used. Then, the shortest forward timing determination comparison circuit 301 notifies the PS information update circuit 11b of detection information about which pipeline the directly dependent instruction flows through.

ＰＳ情報更新回路１１ｂは、直接の依存命令がどのパイプラインを流れたかという検知情報をエンコードすることで、演算パイプ１０９ａ又はロードパイプ１０９ｂにユニークなパイプラインＩＤ２０９を生成する。また、ＰＳ情報更新回路１１ｂは、ステージＩＤ２０８に固定的に1をセットする。さらに、ＰＳ情報更新回路１１ｂは、1サイクルごとにステージＩＤ２０８を参照し、ステージＩＤ２０８が0以外であればステージＩＤ２０８をインクリメントし、閾値であればステージＩＤ２０８に0をセットする。なお、ステージＩＤ２０８が0である場合、ＰＳ情報更新回路１１ｂは、ステージＩＤ２０８をインクリメント（更新）しない。 The PS information update circuit 11b generates a pipeline ID 209 unique to the calculation pipe 109a or the load pipe 109b by encoding the detection information indicating which pipeline the directly depended-on instruction flowed through. The PS information update circuit 11b also sets the stage ID 208 to the fixed value 1. Furthermore, the PS information update circuit 11b refers to the stage ID 208 every cycle, increments the stage ID 208 if it is other than 0, and sets the stage ID 208 to 0 if it equals the threshold. When the stage ID 208 is 0, the PS information update circuit 11b does not increment (update) the stage ID 208.

なお、キャッシュミスしたロードから依存する依存チェーンの後続命令をキャンセルする際に、ＰＳ情報更新回路１１ｂは、ステージＩＤ２０８に0をセット（0-cleared）してもよい。ステージＩＤ２０８に基づく依存先命令の発行制御において、ロード命令を依存元とする依存先命令が誤って発行されないようにするためである。 Note that when canceling a subsequent instruction in a dependency chain that depends on a load that has caused a cache miss, the PS information update circuit 11b may set the stage ID 208 to 0 (0-cleared). This is to prevent a dependent instruction whose dependence source is a load instruction from being erroneously issued in the issuance control of dependent instructions based on the stage ID 208.

キャッシュミス判定タイミング生成回路１１ａは、直接依存の先行演算命令のレイテンシ情報と、当該演算命令からの依存情報を持つＰＳ情報２０７とをもとに、間接的に依存する先行ロードのキャッシュミス判定タイミングを同定する。また、キャッシュミス判定タイミング生成回路１１ａは、ＰＳ情報２０７をもとに、直接依存の先行ロードのキャッシュミス判定タイミングを同定する。 The cache miss determination timing generation circuit 11a identifies the cache miss determination timing of an indirectly dependent preceding load based on the latency information of the directly dependent preceding operation instruction and the PS information 207 holding the dependency information from that operation instruction. The cache miss determination timing generation circuit 11a also identifies the cache miss determination timing of a directly dependent preceding load based on the PS information 207.

なお、図１０において、“F_”は図３に例示する演算パイプ１０９ａのステージを持つ場合のFステージを示す。“T_”は後述する図１７に例示するロードパイプ１０９ｂのステージを持つ場合のTステージを示し、“L_”は後述する図１７に例示するロードパイプ１０９ｂのステージを持つ場合のLステージを示す。 Note that in FIG. 10, "F_" indicates the F stage when the stage of the calculation pipe 109a illustrated in FIG. 3 is included. “T_” indicates a T stage when the stage includes a load pipe 109b illustrated in FIG. 17, which will be described later, and “L_” indicates an L stage when the stage includes a load pipe 109b illustrated in FIG. 17, which will be discussed later.

キャッシュミス判定タイミング生成回路１１ａからキャッシュミス判定回路３０５に入力される信号は、図１０の上から順に以下を意味する信号である。 The signals input from the cache miss determination timing generation circuit 11a to the cache miss determination circuit 305 have the following meanings, in order from the top of FIG. 10.
・先行ロードがLDaでTタイミングであることを示す信号。 A signal indicating that the preceding load is LDa and is at the T timing.
・先行ロードがLDbでTタイミングであることを示す信号。 A signal indicating that the preceding load is LDb and is at the T timing.
・先行ロードがTタイミングを過ぎてなおかつキャッシュミス確定であることを示す信号。 A signal indicating that the preceding load has passed the T timing and that the cache miss has been confirmed.

このように、キャッシュミス判定タイミング生成回路１１ａは、直接依存又は間接依存する先行ロードがそれぞれのキャッシュミス判定タイミングを迎えているか、キャッシュミス判定タイミングを過ぎて既にキャッシュミス確定済みか、を示す信号を出力する。 In this way, the cache miss determination timing generation circuit 11a outputs signals indicating whether each directly or indirectly dependent preceding load has reached its cache miss determination timing, or has passed that timing with the cache miss already confirmed.

図１１は、ＰＳ情報２０７の割当例を説明するための図である。図１１では、符号Ａで示すＰＳ情報２０７のうち、パイプラインＩＤ２０９（id[5:3]）の割り当て例を符号Ｂに示し、ステージＩＤ２０８（id[2:0]）（演算パイプ１０９ａの場合）の割り当て例を符号Ｃに示す。 FIG. 11 is a diagram for explaining an example of allocation of PS information 207. In FIG. 11, among the PS information 207 indicated by symbol A, an example of assignment of pipeline ID 209 (id[5:3]) is indicated by symbol B, and stage ID 208 (id[2:0]) (in case of calculation pipe 109a ) is shown in symbol C.

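符号Ｂに示すように、パイプラインＩＤ２０９は、図１に示すLDR0、LDR1、FXU0（ＦＸＵ１０６ｅ）、FXU1（ＦＸＵ１０６ｆ）、FLU0（ＦＬＵ１０６ｇ）、FLU1（ＦＬＵ１０６ｈ）の各パイプラインに{0,1,2,3,4,5}が割り当てられる。 As shown by symbol B, the pipeline ID 209 values {0,1,2,3,4,5} are assigned to the pipelines LDR0, LDR1, FXU0 (FXU 106e), FXU1 (FXU 106f), FLU0 (FLU 106g), and FLU1 (FLU 106h) shown in FIG. 1, respectively.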

符号Ｃに示すように、ステージＩＤ２０８（演算パイプ１０９ａの場合）は、LXM3以前、LXM2、LXM1、LX、LXP1以降の各ステージに{0,1,2,3,0}が割り当てられる。なお、LXM3以前のstage_id[2:0]==0は、ステージＩＤ２０８が未発行の状態（dependency undetected）を示し、LXP1以降のstage_id[2:0]==0は、レジスタ１０８経由でのデータの読み出しを行なうことを示す。 As shown by symbol C, the stage ID 208 (in the case of the calculation pipe 109a) is assigned {0,1,2,3,0} to each stage before LXM3, LXM2, LXM1, LX, and after LXP1. Note that stage_id[2:0]==0 before LXM3 indicates that the stage ID 208 has not been issued (dependency undetected), and stage_id[2:0]==0 after LXP1 indicates the data via the register 108. Indicates that the data will be read.
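
The bit layout described for FIG. 11 can be illustrated with a minimal Python sketch. The function names `pack_ps_info` and `unpack_ps_info` are hypothetical and only model the field assignment stated above (pipeline ID in id[5:3], stage ID in id[2:0]); they are not the circuit of the embodiment.

```python
# Pipeline ID assignment taken from the text: LDR0..FLU1 -> 0..5.
PIPELINE_IDS = {"LDR0": 0, "LDR1": 1, "FXU0": 2, "FXU1": 3, "FLU0": 4, "FLU1": 5}


def pack_ps_info(pipeline: str, stage_id: int) -> int:
    """Pack the pipeline ID into id[5:3] and the stage ID into id[2:0]."""
    if not 0 <= stage_id <= 0b111:
        raise ValueError("stage_id must fit in 3 bits")
    return (PIPELINE_IDS[pipeline] << 3) | stage_id


def unpack_ps_info(ps_info: int):
    """Return (pipeline_id, stage_id) from a 6-bit PS information value."""
    return (ps_info >> 3) & 0b111, ps_info & 0b111


# Example: a dependency detected on FXU0 with the producer at LXM2 (stage_id 1).
ps = pack_ps_info("FXU0", 1)
print(f"{ps:06b}", unpack_ps_info(ps))
```

A stage ID of 0 is ambiguous by itself (undetected dependency or register read), which is consistent with the text: in both cases no forwarding is performed.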

例えば、フォワーディングタイミング未検出時、stage_id==0（依存未検出）である。ＰＳ情報更新回路１１ｂは、最短フォワードタイミング判定比較回路３０１がフォワーディングタイミングを検出すると、ＯＲ回路でstage_id=1をセットし、エンコード部（図１０の「PS情報encode」参照）でパイプラインＩＤ２０９をセットする。 For example, while no forwarding timing has been detected, stage_id==0 (dependency undetected). When the shortest forward timing determination comparison circuit 301 detects the forwarding timing, the PS information update circuit 11b sets stage_id=1 with the OR circuit and sets the pipeline ID 209 with the encoding unit (see "PS information encode" in FIG. 10).

また、ＰＳ情報更新回路１１ｂは、更新部（図１０の「updater」参照）において、1サイクル後に、ＰＳ情報更新回路１１ｂが保持するstage_idについて、現在のstage_id!=0を検出し、ＰＳ情報更新回路１１ｂが保持するstage_idの値をインクリメントして、stage_id=2をセットする。ＰＳ情報更新回路１１ｂは、更新部において、2サイクル後に、現在のstage_id!=0を検出し値をインクリメントして、stage_id=3をセットする。ＰＳ情報更新回路１１ｂは、更新部において、3サイクル後に、現在のstage_id==3(==threshold）を検出し値をインクリメントして、stage_id=0（レジスタリード）をセットする。その後、ＰＳ情報更新回路１１ｂは、更新部において、現在のstage_id==0を検出し、インクリメントを抑制する。 In addition, the PS information update circuit 11b detects the current stage_id!=0 for the stage_id held by the PS information update circuit 11b after one cycle in the update unit (see "updater" in FIG. 10), and updates the PS information. The value of stage_id held by the circuit 11b is incremented and stage_id=2 is set. In the PS information update circuit 11b, the update unit detects the current stage_id!=0 after two cycles, increments the value, and sets stage_id=3. The PS information update circuit 11b detects the current stage_id==3 (==threshold) in the update section after three cycles, increments the value, and sets stage_id=0 (register read). After that, the PS information update circuit 11b detects the current stage_id==0 in the update unit and suppresses incrementing.
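
The update sequence walked through above can be sketched as a small state-transition function. This is only an illustrative model under the assumptions stated in the text (threshold of 3, one update per cycle); the names `next_stage_id` and `trace_stage_ids` are hypothetical, and the `cancel` input models the optional 0-clear on a cache miss mentioned earlier.

```python
THRESHOLD = 3  # stage ID corresponding to LX in the example of FIG. 11


def next_stage_id(stage_id: int, forward_detected: bool = False,
                  cancel: bool = False) -> int:
    """Return the stage ID held in the next cycle."""
    if cancel:
        return 0                      # 0-clear when the dependency chain is cancelled
    if stage_id == 0:
        # No increment while 0; set to the fixed value 1 on detection.
        return 1 if forward_detected else 0
    if stage_id == THRESHOLD:
        return 0                      # wrap to 0: operand is read from the register
    return stage_id + 1


def trace_stage_ids(cycles: int):
    """Stage IDs starting from the cycle in which forwarding timing is detected."""
    stage_id = next_stage_id(0, forward_detected=True)
    trace = [stage_id]
    for _ in range(cycles):
        stage_id = next_stage_id(stage_id)
        trace.append(stage_id)
    return trace


print(trace_stage_ids(4))  # [1, 2, 3, 0, 0]
```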

上述した例では、ステージＩＤ２０８が、…,0,1,2,3,0,…と変化する（インクリメントされる）場合を説明したが、これに限定されるものではなく、ステージＩＤ２０８は、…,3,2,0,1,3,…或いは…,0,1,3,2,0,…等のような種々の態様で更新されてよい。 In the above example, the case where the stage ID 208 changes (is incremented) as ..., 0, 1, 2, 3, 0, ... is explained, but the stage ID 208 is not limited to this. ,3,2,0,1,3,... or...,0,1,3,2,0,... and so on.

図５に示す比較例では、キャッシュミスタイミング判定比較回路３０３及び３０４がロードパイプ１０９ｂ及び演算パイプ１０９ａの数に相当する比較器を備える。また、図４に示す比較例では、フォワーディング部１０５ｅが３つの比較器１０５ｆを備える。 In the comparative example shown in FIG. 5, the cache miss timing determination comparison circuits 303 and 304 include comparators corresponding to the number of load pipes 109b and calculation pipes 109a. Furthermore, in the comparative example shown in FIG. 4, the forwarding section 105e includes three comparators 105f.

一方、一実施形態に係る依存解決判定部１１によれば、図１０に示すように、比較器のフォワードタイミング、及び、レジスタ経由でデータを読み出すタイミングの計4つをユニークに判定できるステージＩＤ２０８を含むＰＳ情報２０７を生成する。また、依存解決判定部１１は、LDa、LDb等及びＰＳ情報２０７等の情報に基づき、キャッシュミスタイミングの判定を行なうことができる。従って、依存解決判定部１１は、キャッシュミスタイミング判定用の比較器（キャッシュミスタイミング判定比較回路３０３及び３０４）を省略することができる。 On the other hand, as shown in FIG. 10, the dependency resolution determination unit 11 according to the embodiment generates PS information 207 including a stage ID 208 that can uniquely identify a total of four timings: the forward timings handled by the comparators and the timing of reading data via the register. The dependency resolution determination unit 11 can also determine the cache miss timing based on information such as LDa, LDb, and the PS information 207. Therefore, the dependency resolution determination unit 11 can omit the comparators for cache miss timing determination (cache miss timing determination comparison circuits 303 and 304).

また、図９に示すように、フォワーディング部１２は、ステージＩＤ２０８に基づきフォワーディングタイミングを示す情報を生成する情報生成部１３を備えることで、比較器１０５ｆを省略することができる。 Further, as shown in FIG. 9, the forwarding unit 12 includes an information generating unit 13 that generates information indicating forwarding timing based on the stage ID 208, thereby making it possible to omit the comparator 105f.

このように、一実施形態によれば、スケジューラ１０５の依存解決判定部１１におけるwakeup論理の比較コスト（例えば実装面積）の削減により、複数の命令のうちの実行可能な命令から順に並列に実行するプロセッサ１の回路規模を削減することができる。これにより、スケジューラ１０５の利用可能なエントリ数を増加させることができ、プロセッサ１の性能を向上させることができる。 As described above, according to one embodiment, by reducing the comparative cost (for example, mounting area) of the wakeup logic in the dependency resolution determining unit 11 of the scheduler 105, executable instructions among a plurality of instructions are executed in parallel in order. The circuit scale of the processor 1 can be reduced. Thereby, the number of usable entries of the scheduler 105 can be increased, and the performance of the processor 1 can be improved.

〔１－４〕フォワーディングタイミングの判定例 [1-4] Forwarding Timing Determination Example
次に、一実施形態に係るフォワーディングタイミングの判定例を、比較例と対比して説明する。 Next, a forwarding timing determination example according to an embodiment will be described in comparison with a comparative example.

図１２は、比較例に係るフォワーディングタイミングの判定例を示す図であり、図１３は、一実施形態に係るフォワーディングタイミングの判定例を示す図である。符号Ａは、レイテンシが１サイクルであるadd命令のフォワーディングタイミングの一例を示し、符号Ｂは、各ステージの機能の一例を示す。 FIG. 12 is a diagram illustrating an example of forwarding timing determination according to a comparative example, and FIG. 13 is a diagram illustrating an example of forwarding timing determination according to an embodiment. Symbol A indicates an example of the forwarding timing of an add instruction with a latency of one cycle, and symbol B indicates an example of the function of each stage.

図１２及び図１３において、命令1（No.1）は依存元命令を示し、命令2-5（No.2-5）は依存先命令を示す。命令2-4は、フォワーディング（命令1から各命令2-4への矢印参照）が実行される命令であり、命令5は、レジスタ１０８の参照が行なわれる命令である。 In FIGS. 12 and 13, instruction 1 (No. 1) is the dependency source instruction, and instructions 2-5 (No. 2-5) are dependency destination instructions. Instructions 2-4 are instructions for which forwarding is performed (see the arrows from instruction 1 to each of instructions 2-4), and instruction 5 is an instruction that reads the register 108.

図１２に示すように、比較例に係る比較器１０５ｆ（図４参照）のうち、Bステージに対応する比較器１０５ｆは、命令1のBステージと命令2のPステージとを比較する（“<->”参照。以下同様）。一致する場合、当該比較器１０５ｆは、命令1のLXステージから命令2のFステージのOP1にデータをフォワード（破線参照）するためのフォワーディング情報を出力する。 As shown in FIG. 12, among the comparators 105f (see FIG. 4) according to the comparative example, the comparator 105f corresponding to the B stage compares the B stage of instruction 1 and the P stage of instruction 2 (“< ->” (see below). If they match, the comparator 105f outputs forwarding information for forwarding data from the LX stage of instruction 1 to OP1 of the F stage of instruction 2 (see broken line).

Fステージに対応する比較器１０５ｆは、命令1のFステージと命令3のPステージとを比較し、一致していたら、命令1のP1ステージから命令3のFステージのOP1にデータをフォワード（一点鎖線参照）するためのフォワーディング情報を出力する。 The comparator 105f corresponding to the F stage compares the F stage of instruction 1 and the P stage of instruction 3, and if they match, forwards the data from the P1 stage of instruction 1 to OP1 of the F stage of instruction 3 (one point outputs forwarding information for (see dashed line).

X1ステージに対応する比較器１０５ｆは、命令1のX1ステージと命令4のPステージとを比較し、一致していたら、命令1のP2ステージから命令4のFステージのOP1にデータをフォワード（点線参照）するためのフォワーディング情報を出力する。 The comparator 105f corresponding to the X1 stage compares the X1 stage of instruction 1 and the P stage of instruction 4, and if they match, forwards the data from the P2 stage of instruction 1 to OP1 of the F stage of instruction 4 (dotted line output forwarding information for reference).

なお、命令5については、命令1の演算結果がBステージ（5サイクル目）でレジスタ１０８に書き込まれる。このため、プロセッサ１は、命令5のBステージ（6サイクル目）でレジスタ１０８からデータを読み出し可能である。 Note that for instruction 5, the operation result of instruction 1 is written to the register 108 at the B stage (fifth cycle). Therefore, the processor 1 can read data from the register 108 at the B stage (sixth cycle) of instruction 5.

このように、フォワーディング部１０５ｅの比較器１０５ｆは、{B,F,X1}_[先行命令pipeline]と、P_[依存命令pipeline]との比較を行なう。 In this way, the comparator 105f of the forwarding unit 105e compares {B,F,X1}_[preceding instruction pipeline] and P_[dependent instruction pipeline].

上述した比較例に係るフォワーディングタイミングの判定では、フォワーディング部１０５ｅが３つの比較器１０５ｆを備える。このため、先行命令のパイプライン数、依存命令のパイプライン数等が多くなると、パイプラインの深さが大きくなり、比較器１０５ｆの数が増加する。例えば、比較器１０５ｆの数は、パイプラインの深さと、フォワーディング可能なパイプライン数と、パイプライン数と、オペランド数との乗算により算出できる。比較器１０５ｆの数が増加すると、スケジューラ１０５の物量が増加するため、プロセッサ１の性能が出ずにディレイが発生し、スケジューラ１０５のエントリが制限される。 In the forwarding timing determination according to the comparative example described above, the forwarding unit 105e includes three comparators 105f. Therefore, as the number of pipelines for preceding instructions and the number of pipelines for dependent instructions increases, the depth of the pipeline increases and the number of comparators 105f increases. For example, the number of comparators 105f can be calculated by multiplying the depth of the pipeline, the number of forwardable pipelines, the number of pipelines, and the number of operands. When the number of comparators 105f increases, the amount of material in the scheduler 105 increases, so the performance of the processor 1 is not achieved, a delay occurs, and the entries of the scheduler 105 are limited.
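
As a rough illustration of the scaling described above, the comparator count in the comparative example can be written as a simple product. The function name and the example numbers below are hypothetical and are not taken from the patent; only the multiplicative relationship comes from the text.

```python
def comparator_count(pipeline_depth: int, forwardable_pipelines: int,
                     pipelines: int, operands: int) -> int:
    """Number of comparators 105f as the product of the four factors named in the text."""
    return pipeline_depth * forwardable_pipelines * pipelines * operands


# Hypothetical configuration: 3 forwarding stages, 6 forwardable pipelines,
# 6 pipelines, 2 operands per instruction.
print(comparator_count(3, 6, 6, 2))  # 216
```

Because every factor multiplies the others, adding pipelines or operands grows the comparator count quickly, which is the cost the stage ID approach is intended to avoid.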

図１３の例において、依存解決判定部１１は、スケジューラ１０５のエントリ内の命令1のLXM3と、命令2-5とを比較する。 In the example of FIG. 13, the dependency resolution determination unit 11 compares LXM3 of instruction 1 in the entry of the scheduler 105 and instruction 2-5.

ステージＩＤ２０８は、依存元命令である命令1のステージを示している。命令2のstage_id==1である場合、依存元命令1がLXM2ステージであることを示す。ＰＳ情報２０７はPステージでスケジューラ１０５から選択されるため、命令2がPステージであっても、スケジューラ１０５内でも同様の意味となる。命令1の演算レイテンシが1のときはLXM2ステージがBステージとなる。図１３の説明において同様である。情報生成部１３は、命令2のstage_id==1を検出すると、命令1のLXステージから命令2のFステージのOP1にデータをフォワード（破線参照）するためのフォワーディング情報を出力する。 The stage ID 208 indicates the stage of instruction 1, which is the dependency source instruction. A stage_id==1 for instruction 2 indicates that dependency source instruction 1 is in the LXM2 stage. Since the PS information 207 is selected from the scheduler 105 at the P stage, it has the same meaning whether instruction 2 is at the P stage or still in the scheduler 105. When the operation latency of instruction 1 is 1, the LXM2 stage corresponds to the B stage. The same applies throughout the description of FIG. 13. When the information generation unit 13 detects stage_id==1 for instruction 2, it outputs forwarding information for forwarding data from the LX stage of instruction 1 to OP1 at the F stage of instruction 2 (see the broken line).

命令3のstage_id==2である場合、依存元命令1がLXM1ステージであることを示す。情報生成部１３は、命令3のstage_id==2を検出すると、命令1のLXP1（LastX plus 1）ステージから命令3のFステージのOP1にデータをフォワード（一点鎖線参照）するためのフォワーディング情報を出力する。 A stage_id==2 for instruction 3 indicates that dependency source instruction 1 is in the LXM1 stage. When the information generation unit 13 detects stage_id==2 for instruction 3, it outputs forwarding information for forwarding data from the LXP1 (LastX plus 1) stage of instruction 1 to OP1 at the F stage of instruction 3 (see the dash-dot line).

命令4のstage_id==3である場合、依存元命令1がLXステージであることを示す。情報生成部１３は、命令4のstage_id==3を検出すると、命令1のLXP2（LastX plus 2）ステージから命令4のFステージのOP1にデータをフォワード（点線参照）するためのフォワーディング情報を出力する。 A stage_id==3 for instruction 4 indicates that dependency source instruction 1 is in the LX stage. When the information generation unit 13 detects stage_id==3 for instruction 4, it outputs forwarding information for forwarding data from the LXP2 (LastX plus 2) stage of instruction 1 to OP1 at the F stage of instruction 4 (see the dotted line).

なお、命令5については、stage_id==0であるため、情報生成部１３は、フォワーディング情報を生成しない。この場合、プロセッサ１は、命令5のBステージ（6サイクル目）でレジスタ１０８からデータを読み出し可能である。 Note that for instruction 5, since stage_id==0, the information generation unit 13 does not generate forwarding information. In this case, processor 1 can read data from register 108 at the B stage (sixth cycle) of instruction 5.

このように、一実施形態に係るフォワーディング部１２によれば、比較器１０５ｆを省略した構成において、ステージＩＤ２０８を利用することでフォワーディングタイミングの判定を行なうことができる。 In this way, according to the forwarding unit 12 according to one embodiment, the forwarding timing can be determined by using the stage ID 208 in a configuration in which the comparator 105f is omitted.
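
The comparator-free selection described for FIG. 13 amounts to a table lookup on the stage ID. The following Python sketch models that lookup for the latency-1 add example; the names `FORWARD_SOURCE` and `forwarding_info` are hypothetical, and the mapping simply restates the cases listed above.

```python
# Producer stage to forward from, indexed by the consumer's stage_id at its P stage.
FORWARD_SOURCE = {1: "LX", 2: "LXP1", 3: "LXP2"}


def forwarding_info(stage_id: int):
    """Return the producer stage to forward from, or None when the operand
    is read from the register (stage_id == 0)."""
    return FORWARD_SOURCE.get(stage_id)


# Instructions 2-5 of FIG. 13 observe stage IDs 1, 2, 3 and 0 respectively.
for inst, sid in zip((2, 3, 4, 5), (1, 2, 3, 0)):
    src = forwarding_info(sid)
    print(inst, f"forward from {src}" if src else "register read")
```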

なお、一実施形態では、依存先命令が依存元命令からデータを受け取るタイミングが依存先命令のFステージである場合を例に挙げたが、これに限定されるものではなく、受け取るタイミングは、依存先命令のFステージよりも前のステージであってもよい。 In the embodiment, the case where the dependency destination instruction receives data from the dependency source instruction at the F stage of the dependency destination instruction has been described as an example. However, the timing is not limited to this, and the data may be received at a stage earlier than the F stage of the dependency destination instruction.

上述した説明では、最短フォワーディングタイミングが1つである場合を例に挙げたが、最短フォワードタイミングが2つ以上存在する場合もある。 In the above description, the case where there is one shortest forwarding timing is taken as an example, but there may be two or more shortest forwarding timings.

例えば、IEEE754-2008では、x*y+zの積和演算を行なうFMA（Fused multiply-add）が規定される。FMA（fma）における和部分は、積に対して和を行なうため、演算レイテンシは少なくて済む。すなわち、FMAの場合、先行命令の演算レイテンシだけではなく、FMAのオペランド種によってレイテンシが決まる。 For example, IEEE754-2008 defines FMA (Fused multiply-add) that performs a multiply-add operation of x*y+z. The sum part in FMA (fma) performs summation on products, so the calculation latency is small. That is, in the case of FMA, the latency is determined not only by the operation latency of the preceding instruction but also by the FMA operand type.

図１２及び図１３の例では、後続命令が先行命令のX1ステージからOP1にフォワーディングされたデータを使用することを想定した。以下の説明では、FMAが「積」の演算に2サイクル、「和」の演算に2サイクルの計4サイクルかかる場合を想定する。 In the examples of FIGS. 12 and 13, it is assumed that the subsequent instruction uses data forwarded to OP1 from the X1 stage of the preceding instruction. In the following explanation, it is assumed that FMA takes a total of 4 cycles, 2 cycles for the "product" operation and 2 cycles for the "sum" operation.

後続命令のxyの積部分は、先行命令のX1ステージからのデータを利用し、後続命令のzの和部分は先行命令のX3ステージからのデータを利用することになる。この場合、wakeupのタイミングは、xyの積部分であればLXM3となるが、zの和部分であるとLXM5となる。つまり、スケジューラ１０５は、LXM3及びLXM5の2つの最短フォワードタイミングについて比較を行なうことになる。 The xy product portion of the subsequent instruction uses data from the X1 stage of the preceding instruction, and the z sum portion of the subsequent instruction uses data from the X3 stage of the preceding instruction. In this case, the wakeup timing is LXM3 for the product of xy, but is LXM5 for the sum of z. In other words, the scheduler 105 compares the two shortest forward timings of LXM3 and LXM5.

図１４は、比較例に係る最短フォワーディングタイミングが2つの場合の判定例を示す図であり、図１５は、一実施形態に係る最短フォワーディングタイミングが2つの場合の判定例を示す図である。符号Ａは、レイテンシが2サイクルであるfma命令のフォワーディングタイミングの一例を示し、符号Ｂは、各ステージの機能の一例を示し、符号Ｃは、fmaの機能の一例を示す。 FIG. 14 is a diagram illustrating a determination example when there are two shortest forwarding timings according to a comparative example, and FIG. 15 is a diagram illustrating a determination example when there are two shortest forwarding timings according to an embodiment. Symbol A indicates an example of the forwarding timing of an fma instruction with a latency of 2 cycles, symbol B indicates an example of the function of each stage, and symbol C indicates an example of the fma function.

図１４及び図１５において、命令1（No.1）及び命令2（No.2）のそれぞれは依存元命令を示し、命令3（No.3；図１４）又は命令3-7（No.3-7；図１５）は、フォワーディング（命令1又は2から命令3-7への矢印参照）が実行される依存先命令を示す。図１５に示す命令8は、レジスタ１０８の参照が行なわれる依存先命令である。 In FIGS. 14 and 15, instruction 1 (No. 1) and instruction 2 (No. 2) are each dependency source instructions, and instruction 3 (No. 3; FIG. 14) or instructions 3-7 (No. 3-7; FIG. 15) are dependency destination instructions for which forwarding is performed (see the arrows from instruction 1 or 2 to instructions 3-7). Instruction 8 shown in FIG. 15 is a dependency destination instruction that reads the register 108.

また、図１４及び図１５において、命令2から命令3の和演算部分のオペランドへのフォワーディングタイミングは、和演算の実行に間に合えばよく、積演算の間は実行されなくてもよい。例えば、命令1から命令3へのフォワーディングでは、命令1のLXステージから命令3のFステージのOP1, OP2にデータがフォワーディングされる。なお、OP2へのフォワーディングについては図示及び説明を省略する。命令2から命令3へのフォワーディングでは、命令2のLXステージから命令3のX2ステージのOP3にデータがフォワーディングされる。これにより、multiply（積演算）のレイテンシとなる2サイクルを隠匿することができる。 In FIGS. 14 and 15, forwarding from instruction 2 to the operand of the sum operation portion of instruction 3 only needs to be in time for the execution of the sum operation and does not have to be performed during the product operation. For example, in forwarding from instruction 1 to instruction 3, data is forwarded from the LX stage of instruction 1 to OP1 and OP2 at the F stage of instruction 3. Illustration and description of forwarding to OP2 are omitted. In forwarding from instruction 2 to instruction 3, data is forwarded from the LX stage of instruction 2 to OP3 at the X2 stage of instruction 3. This makes it possible to hide the two cycles of multiply (product operation) latency.

比較例では、図１４に示すように、スケジューラ１０５のエントリ内の命令1のLXM3と命令3（OP1）とを4サイクル目（命令1のLXの3サイクル前）で比較する。また、accumulate（符号ＣのX3, X4参照）での最速の命令発行のために、スケジューラ１０５は、命令2のLXM5と命令3（OP3）とを4サイクル目で比較し、命令間の依存を検知して依存解決を図る。 In the comparative example, as shown in FIG. 14, LXM3 of instruction 1 in the entry of the scheduler 105 is compared with instruction 3 (OP1) in the fourth cycle (three cycles before LX of instruction 1). In addition, to issue the instruction at the earliest timing for accumulate (see X3 and X4 of symbol C), the scheduler 105 compares LXM5 of instruction 2 with instruction 3 (OP3) in the fourth cycle, detects the dependency between the instructions, and resolves it.

このためには、依存解決判定部１０５ｂにおいて依存を検知するためのコスト、例えば、命令固有のＩＤ及びレジスタ番号の比較コスト（例えば実装面積）が増大する。また、パイプライン内の比較器１０５ｆの数も、図１２に示す例と比較して増加する。 For this reason, the cost for detecting dependence in the dependency resolution determination unit 105b, for example, the cost for comparing the instruction-specific ID and register number (for example, the mounting area) increases. Further, the number of comparators 105f in the pipeline also increases compared to the example shown in FIG. 12.

一方、図１５の例では、依存解決判定部１１のＰＳ情報更新回路１１ｂは、既述のLXM3での比較に基づく命令1のパイプラインに対応するＰＳ情報２０７の生成に加えて、LXM5での比較に基づく命令2のパイプラインに対応するＰＳ情報２０７の生成を行なう。例えば、ＰＳ情報更新回路１１ｂは、最短フォワードタイミング判定比較回路３０１からの出力により特定されるパイプラインＩＤ２０９ごとに、LXM3、LXM5のそれぞれのタイミングで、各ＰＳ情報２０７のステージＩＤ２０８を更新する。 On the other hand, in the example of FIG. 15, the PS information update circuit 11b of the dependency resolution determination unit 11 generates the PS information 207 corresponding to the pipeline of instruction 1 based on the comparison in LXM3 described above, as well as generates the PS information 207 in LXM5. PS information 207 corresponding to the pipeline of instruction 2 is generated based on the comparison. For example, the PS information update circuit 11b updates the stage ID 208 of each PS information 207 at each timing of LXM3 and LXM5 for each pipeline ID 209 specified by the output from the shortest forward timing determination comparison circuit 301.

なお、図１５の例では、後続命令3のOP1に対応するステージＩＤ２０８（ＰＳ情報２０７）は、先行命令1のパイプラインにおけるstage_idを示し、後続命令3のOP3に対応するステージＩＤ２０８は、先行命令2のパイプラインにおけるstage_idを示す。 In the example of FIG. 15, the stage ID 208 (PS information 207) corresponding to OP1 of subsequent instruction 3 indicates the stage_id in the pipeline of preceding instruction 1, and the stage ID 208 corresponding to OP3 of subsequent instruction 3 indicates the stage_id in the pipeline of preceding instruction 2.

（命令1からのフォワーディング） (Forwarding from instruction 1)
情報生成部１３は、命令3のPサイクルにおいてOP1に対応するstage_id==1を検出すると、命令1のLXステージから命令3のFステージのOP1にデータをフォワードするためのフォワーディング情報を出力する。 When the information generation unit 13 detects stage_id==1 corresponding to OP1 in the P cycle of instruction 3, it outputs forwarding information for forwarding data from the LX stage of instruction 1 to OP1 of the F stage of instruction 3.

情報生成部１３は、命令4のPサイクルにおいてOP1に対応するstage_id==2を検出すると、命令1のLXP1ステージから命令4のFステージのOP1にデータをフォワードするためのフォワーディング情報を出力する。 When the information generation unit 13 detects stage_id==2 corresponding to OP1 in the P cycle of instruction 4, it outputs forwarding information for forwarding data from the LXP1 stage of instruction 1 to OP1 of the F stage of instruction 4.

情報生成部１３は、命令5のPサイクルにおいてOP1に対応するstage_id==3を検出すると、命令1のLXP2ステージから命令5のFステージのOP1にデータをフォワードするためのフォワーディング情報を出力する。 When the information generation unit 13 detects stage_id==3 corresponding to OP1 in the P cycle of instruction 5, it outputs forwarding information for forwarding data from the LXP2 stage of instruction 1 to OP1 of the F stage of instruction 5.

なお、命令6以降については、OP1に対応するstage_id==0であるため、情報生成部１３は、OP1についてのフォワーディング情報を生成しない。この場合、プロセッサ１は、命令6以降のBステージ（6サイクル目）でレジスタ１０８からデータを読み出す。 Note that for instructions after instruction 6, the stage_id corresponding to OP1 is 0, so the information generation unit 13 does not generate forwarding information for OP1. In this case, the processor 1 reads data from the register 108 in the B stage (sixth cycle) after instruction 6.

（命令2からのフォワーディング） (Forwarding from instruction 2)
情報生成部１３は、命令3のPサイクルにおいてOP3に対応するstage_id==1を検出すると、命令2のLXステージから命令3のX2ステージのOP3にデータをフォワードするためのフォワーディング情報を出力する。 When the information generation unit 13 detects stage_id==1 corresponding to OP3 in the P cycle of instruction 3, it outputs forwarding information for forwarding data from the LX stage of instruction 2 to OP3 of the X2 stage of instruction 3.

情報生成部１３は、命令4のPサイクルにおいてOP3に対応するstage_id==2を検出すると、命令2のLXP1ステージから命令4のX2ステージのOP3にデータをフォワードするためのフォワーディング情報を出力する。 When the information generation unit 13 detects stage_id==2 corresponding to OP3 in the P cycle of instruction 4, it outputs forwarding information for forwarding data from the LXP1 stage of instruction 2 to OP3 of the X2 stage of instruction 4.

情報生成部１３は、命令5のPサイクルにおいてOP3に対応するstage_id==3を検出すると、命令2のLXステージから命令5のFステージのOP3にデータをフォワードするためのフォワーディング情報を出力する。 When the information generation unit 13 detects stage_id==3 corresponding to OP3 in the P cycle of instruction 5, it outputs forwarding information for forwarding data from the LX stage of instruction 2 to OP3 of the F stage of instruction 5.

情報生成部１３は、命令6のPサイクルにおいてOP3に対応するstage_id==4を検出すると、命令2のLXP1ステージから命令6のFステージのOP3にデータをフォワードするためのフォワーディング情報を出力する。 When the information generation unit 13 detects stage_id==4 corresponding to OP3 in the P cycle of instruction 6, it outputs forwarding information for forwarding data from the LXP1 stage of instruction 2 to OP3 of the F stage of instruction 6.

情報生成部１３は、命令7のPサイクルにおいてOP3に対応するstage_id==5を検出すると、命令2のLXP2ステージから命令7のFステージのOP3にデータをフォワードするためのフォワーディング情報を出力する。 When the information generation unit 13 detects stage_id==5 corresponding to OP3 in the P cycle of instruction 7, it outputs forwarding information for forwarding data from the LXP2 stage of instruction 2 to OP3 of the F stage of instruction 7.

なお、命令8以降については、OP3に対応するstage_id==0であるため、情報生成部１３は、OP3についてのフォワーディング情報を生成しない。この場合、プロセッサ１は、命令8以降のBステージ（11サイクル目）でレジスタ１０８からデータを読み出す。 Note that for instructions after instruction 8, since stage_id==0 corresponding to OP3, the information generation unit 13 does not generate forwarding information for OP3. In this case, the processor 1 reads data from the register 108 in the B stage (11th cycle) after instruction 8.

これにより、FMAのような複数の最短フォワードタイミングが存在する場合の処理においても、パイプライン内の比較器１０５ｆは不要とすることができる。 This makes it possible to eliminate the need for the comparator 105f in the pipeline even in processing where a plurality of shortest forward timings exist, such as in FMA.
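
The per-operand cases enumerated above for FIG. 15 can be collected into lookup tables, one per operand. This Python sketch only transcribes the (producer stage, consumer stage) pairs as they are listed in the text; the names `OP1_FORWARD`, `OP3_FORWARD`, and `fma_forwarding` are hypothetical.

```python
# (producer stage, consumer stage) per stage ID, as enumerated in the text.
OP1_FORWARD = {1: ("LX", "F"), 2: ("LXP1", "F"), 3: ("LXP2", "F")}
OP3_FORWARD = {
    1: ("LX", "X2"),
    2: ("LXP1", "X2"),
    3: ("LX", "F"),
    4: ("LXP1", "F"),
    5: ("LXP2", "F"),
}
TABLES = {"OP1": OP1_FORWARD, "OP3": OP3_FORWARD}


def fma_forwarding(operand: str, stage_id: int):
    """Return (producer stage, consumer stage), or None for a register read."""
    return TABLES[operand].get(stage_id)


print(fma_forwarding("OP1", 1))  # ('LX', 'F')
print(fma_forwarding("OP3", 1))  # ('LX', 'X2')
print(fma_forwarding("OP3", 0))  # None
```

Keeping a separate stage ID per operand is what lets the two shortest forward timings (LXM3 for the product operands and LXM5 for the sum operand) be tracked without additional comparators in the pipeline.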

なお、依存解決判定部１１は、例えば、Accumulate forwarding（FではなくX2へのforwarding；命令3,4におけるOP3参照）がない命令である場合、op3_stage_id==2の場合にwakeup ready bitを立て、Pステージのstage_id==3以降に当該命令が発行されるように制御してよい。 Note that, for example, if the instruction does not have Accumulate forwarding (forwarding to X2 instead of F; see OP3 in instructions 3 and 4), the dependency resolution determination unit 11 sets the wakeup ready bit when op3_stage_id==2, The command may be controlled to be issued after stage_id==3 of P stage.

〔１－５〕キャッシュミスしたロードから依存する依存チェーンの後続命令をキャンセルするための構成例 [1-5] Configuration example for canceling subsequent instructions in a dependency chain that depends on a cache-missed load
ここまで、フォワーディングタイミングの判定を行なうための構成例を説明したが、一実施形態に係るＰＳ情報２０７を用いる手法は、キャッシュミスしたロードから依存する依存チェーンの後続命令のキャンセルにも適用可能である。 Up to this point, a configuration example for determining forwarding timing has been described. The method using the PS information 207 according to the embodiment can also be applied to canceling subsequent instructions in a dependency chain that depends on a cache-missed load.

図１６は、比較例に係るプロセッサ１のコアの構成におけるキャッシュミスしたロードから依存する依存チェーンの後続命令をキャンセルする（以下、「キャッシュミスキャンセル」と表記する場合がある）処理に着目した構成例を示す図である。図１６では、演算レイテンシが1サイクルであり、対象のオペランドが1つである場合を例に挙げる。 FIG. 16 shows a configuration focusing on a process of canceling a subsequent instruction in a dependency chain that depends on a cache-missed load (hereinafter sometimes referred to as "cache-miss cancellation") in the core configuration of processor 1 according to a comparative example. It is a figure which shows an example. In FIG. 16, an example is given in which the calculation latency is one cycle and the number of target operands is one.

「キャンセル」とは、スケジューラ１０５から発行（issue）済みの命令をパイプラインから消すとともに、スケジューラ１０５内に残っている当該命令を発行されない状態にすることを意味してよい。 "Cancellation" may mean erasing an instruction that has been issued by the scheduler 105 from the pipeline, and making the instruction remaining in the scheduler 105 unissued.
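
The meaning of "cancel" given above can be illustrated with a small software model. This is a sketch under simplifying assumptions and not the circuit of the comparative example or the embodiment: the `Entry` class, the single `depends_on` field, and the `cancel_dependents` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Entry:
    """Hypothetical scheduler entry with at most one producer instruction."""
    inst_id: int
    depends_on: Optional[int] = None
    issued: bool = False


def cancel_dependents(entries: List[Entry], pipeline: List[int],
                      missed_load_id: int) -> Set[int]:
    """Cancel every instruction in the dependency chain rooted at the missed load."""
    chain = {missed_load_id}
    changed = True
    while changed:                       # follow direct and indirect dependencies
        changed = False
        for e in entries:
            if e.depends_on in chain and e.inst_id not in chain:
                chain.add(e.inst_id)
                changed = True
    chain.discard(missed_load_id)        # the load itself is not a dependent
    for e in entries:
        if e.inst_id in chain:
            e.issued = False             # back to the not-issued state in the scheduler
    pipeline[:] = [i for i in pipeline if i not in chain]  # erase from the pipeline
    return chain


entries = [Entry(1), Entry(2, 1, True), Entry(3, 2, True), Entry(4, None, True)]
pipeline = [2, 3, 4]
print(cancel_dependents(entries, pipeline, 1), pipeline)
```

In this model instruction 2 depends directly on the load and instruction 3 depends on it indirectly, so both are cancelled, while the independent instruction 4 stays in the pipeline.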

図１６において、破線は、キャッシュミスしたロード（Load）から演算依存の命令をキャンセルする処理、一点鎖線は、ロードに依存する演算から演算依存の命令をキャンセルする処理、点線は、対応するスケジューラ１０５のエントリ内のキャンセル処理を示す。 In FIG. 16, a broken line indicates a process for canceling an operation-dependent instruction from a load that misses the cache, a chain line indicates a process for canceling an operation-dependent instruction from an operation that depends on a load, and a dotted line indicates a process for the corresponding scheduler 105. This shows the cancellation process in the entry.

図１６に示すように、プロセッサ１のコア（例えばスケジューラ１０５）は、制御部１０５ｇを備えてよい。制御部１０５ｇは、比較器１０５ｆを含むフォワーディング部１０５ｅと、複数の比較器１０５ｈと、演算パイプ先行ロードタイミング信号（図１６以降においては“cancel_code”と表記する場合がある）に基づきcache miss信号を生成する生成回路１０５ｉとを備えてよい。また、プロセッサ１のコアは、cache miss信号とvalid信号とに基づき依存先命令のフォワーディング２０６をキャンセルするキャンセル回路２１１を備えてよい。 As shown in FIG. 16, the core of the processor 1 (for example, the scheduler 105) may include a control unit 105g. The control unit 105g outputs a cache miss signal based on a forwarding unit 105e including a comparator 105f, a plurality of comparators 105h, and an arithmetic pipe advance load timing signal (sometimes referred to as "cancel_code" from FIG. 16 onwards). A generation circuit 105i that generates the information may be provided. Further, the core of the processor 1 may include a cancellation circuit 211 that cancels the forwarding 206 of the dependent instruction based on the cache miss signal and the valid signal.

図１７は、比較例に係るキャッシュミスしたロードから依存する依存チェーンの後続命令をキャンセルする処理における判定例を示す図である。図１７の符号Ａは、レイテンシが１サイクルであるadd命令のキャッシュミスキャンセルのタイミングの一例を示し、符号Ｂは、各ステージの機能の一例を示す。 FIG. 17 is a diagram illustrating an example of determination in a process of canceling a subsequent instruction in a dependency chain that depends on a load that has caused a cache miss according to a comparative example. Symbol A in FIG. 17 indicates an example of the timing of cache miss cancellation of an add instruction with a latency of one cycle, and symbol B indicates an example of the function of each stage.

図１７の符号Ｂに示すように、Lステージは、キャッシュミスに依存するロード命令がパイプラインに発行されるステージを示す。Rステージは、ＲＡＭからのデータの読み出しが行なわれるステージを示す。Mステージは、タグマッチ及び選択が行なわれるステージを示す。Tステージは、データの到着又はミスが発生するステージを示す。キャッシュミスは、Tステージで判明する。Wステージは、レジスタ１０８へのデータの書き込みが行なわれるステージを示す。 As shown by reference numeral B in FIG. 17, the L stage indicates a stage where a load instruction depending on a cache miss is issued to the pipeline. The R stage indicates a stage in which data is read from the RAM. M stage indicates the stage where tag matching and selection are performed. The T stage indicates the stage where data arrives or misses. Cache misses are discovered at the T stage. The W stage indicates a stage in which data is written to the register 108.

以下、図１６及び図１７を参照して、比較例に係るキャッシュミス時の依存先命令のキャンセル処理の一例を説明する。 Hereinafter, with reference to FIGS. 16 and 17, an example of a process for canceling a dependent instruction at the time of a cache miss according to a comparative example will be described.

図１７に示すように、命令2-5は、命令1と直接の依存関係がある。このため、スケジューラ１０５は、命令1のTステージと各命令2-5との比較を行ない、Tステージでキャッシュミスが判定された場合に、各命令2-5の発行をキャンセルする。なお、命令5,8の発行タイミング以降は、命令はスケジューラ１０５から発行されない。 As shown in FIG. 17, instructions 2-5 have a direct dependency on instruction 1. Therefore, the scheduler 105 compares the T stage of instruction 1 with each instruction 2-5, and cancels the issuance of each instruction 2-5 if a cache miss is determined at the T stage. Note that no instructions are issued from the scheduler 105 after the timing of issuing instructions 5 and 8.

命令2-5は、命令1から自命令までの発行タイミング情報、及び、命令1のmiss確定情報となる演算パイプ先行ロードタイミング信号を持ち、命令1のTステージ（サイクル）を同定する。 Each of instructions 2-5 holds the issue timing information from instruction 1 to itself and the arithmetic pipe advance load timing signal, which serves as the miss confirmation information for instruction 1, and thereby identifies the T stage (cycle) of instruction 1.

演算パイプ先行ロードタイミング信号は、図１６に示す制御部１０５ｇ、例えば比較器１０５ｈで生成されるコードである。演算パイプ先行ロードタイミング信号は、既に投機的にスケジューラからissue（Pステージ）された命令や、スケジューラからissueされていない命令でロードからつながる演算から依存している命令に対して、依存先行ロード（命令1）のキャッシュミス判定タイミング（命令1のTステージ）を通知し、上記命令のキャンセルを行なうための信号である。演算パイプライン１０９ａのFステージの演算パイプ先行ロードタイミング信号は、図５の演算パイプ用キャッシュミスタイミング判定比較器３０４や図８のキャッシュミスタイミング生成回路１１ａに通知する演算パイプ先行ロードタイミング信号と一致する。 The arithmetic pipe advance load timing signal is a code generated by the control unit 105g shown in FIG. 16, for example, by the comparators 105h. It notifies instructions that have already been speculatively issued from the scheduler (P stage), and instructions not yet issued from the scheduler that depend on an operation chained from a load, of the cache miss determination timing of the preceding load they depend on (the T stage of instruction 1), so that those instructions can be canceled. The arithmetic pipe advance load timing signal at the F stage of the arithmetic pipeline 109a matches the arithmetic pipe advance load timing signal notified to the arithmetic pipe cache miss timing determination comparator 304 in FIG. 5 and to the cache miss timing generation circuit 11a in FIG. 8.

例えば、演算パイプ先行ロードタイミング信号は、以下のいずれかを示してよい。 For example, the arithmetic pipe advance load timing signal may indicate any of the following:

・直接或いは間接的に依存先行するロード命令のロードパイプ１０９ｂのキャッシュミス判定タイミングがNサイクル後に訪れること。
・演算パイプ１０９ａの比較タイミングがロードパイプ１０９ｂのキャッシュミスタイミングと一致している（キャッシュミス判定信号を参照する）こと。
・既にキャッシュミスタイミングを過ぎている（以前にキャッシュミス判定信号を参照しキャッシュミスが確定している）こと。

- The cache miss determination timing in the load pipe 109b of the preceding load instruction, on which the instruction depends directly or indirectly, arrives N cycles later.
- The comparison timing of the calculation pipe 109a matches the cache miss timing of the load pipe 109b (the cache miss determination signal is referenced).
- The cache miss timing has already passed (the cache miss determination signal was referenced earlier and the cache miss has been confirmed).

例えば、命令2の発行タイミングでは、制御部１０５ｇは、2サイクル目の命令1のRステージ（write address２１０）と命令2のPステージ（read address２０２）とを比較する。命令3の発行タイミングでは、制御部１０５ｇは、3サイクル目の命令1のMステージと命令3のPステージとを比較する。命令4の発行タイミングでは、制御部１０５ｇは、4サイクル目の命令1のTステージと命令4のPステージとを比較する。命令5から後ろのタイミングでは，スケジューラ内比較器３０３を用いて依存先行ロードのキャッシュミス判定タイミング（Tステージ）を同定し、キャッシュミス判定信号を用いてキャッシュミスキャンセルを行なう。そのため、命令5から後ろのタイミングでは、ロードからの直接依存命令はスケジューラから発行されない。 For example, at the issuance timing of instruction 2, the control unit 105g compares the R stage (write address 210) of instruction 1 and the P stage (read address 202) of instruction 2 in the second cycle. At the timing of issuing instruction 3, the control unit 105g compares the M stage of instruction 1 and the P stage of instruction 3 in the third cycle. At the timing of issuing instruction 4, the control unit 105g compares the T stage of instruction 1 and the P stage of instruction 4 in the fourth cycle. At the timing after instruction 5, the in-scheduler comparator 303 is used to identify the cache miss determination timing (T stage) of the dependent advance load, and the cache miss determination signal is used to perform cache miss cancellation. Therefore, at the timing after instruction 5, the scheduler does not issue any directly dependent instructions from the load.

制御部１０５ｇは、比較器１０５ｈによるRステージとPステージとの比較結果に応じて、cancel_code[1]にビット1をセットする。また、制御部１０５ｇは、比較器１０５ｈによるMステージとPステージとの比較結果に応じて、cancel_code[2]にビット1をセットする。さらに、比較器１０５ｈによるTステージとPステージとの比較結果に応じて、cancel_code[3]にビット1をセットする。 The control unit 105g sets cancel_code[1] to 1 according to the result of the comparison between the R stage and the P stage by the comparator 105h. The control unit 105g also sets cancel_code[2] to 1 according to the result of the comparison between the M stage and the P stage by the comparator 105h. Further, the control unit 105g sets cancel_code[3] to 1 according to the result of the comparison between the T stage and the P stage by the comparator 105h.
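
The mapping from comparator results to cancel_code bits described above can be sketched as follows. This is a minimal Python model for illustration, not the circuit of FIG. 16; the function name `p_stage_cancel_code` and the string stage labels are assumptions that do not appear in the original.

```python
# Illustrative model only: the names and string stage labels are hypothetical.
# Bit positions follow the text: R==P -> cancel_code[1], M==P -> cancel_code[2],
# T==P -> cancel_code[3].
LOAD_STAGE_TO_BIT = {"R": 1, "M": 2, "T": 3}


def p_stage_cancel_code(load_stage_matching_p: str) -> int:
    """Return the P-stage cancel_code with the bit for the matching load stage set."""
    bit = LOAD_STAGE_TO_BIT.get(load_stage_matching_p)
    if bit is None:
        return 0  # no comparator matched, so no bit is set
    return 1 << bit
```

For instruction 2 in FIG. 17, whose P stage coincides with the R stage of instruction 1, this model sets only cancel_code[1].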

なお、命令6-8は、命令1と直接の依存関係がなく、命令1から依存する命令2に依存する命令である。このため、制御部１０５ｇは、命令2と各命令6-8との比較と、命令2の演算パイプ先行ロードタイミング信号とに基づき、各命令6-8のキャンセルを行なう。 Note that instructions 6-8 have no direct dependency on instruction 1; they depend on instruction 2, which in turn depends on instruction 1. Therefore, the control unit 105g cancels each of instructions 6-8 based on the comparison between instruction 2 and each of instructions 6-8 and on the arithmetic pipe advance load timing signal of instruction 2.

例えば、依存先行ロード（命令1）から直接依存がない命令（命令6-7）に対しては、該当命令から直接依存のある命令（命令2）から演算パイプ先行ロードタイミング信号を引き継ぐ（図１６のB stage cancel_code[2:0]或いはF stage cancel_code[1:0]からB stage cancel_code[2:0]へloopbackする複数のpathで示される)ことによって先行依存先行ロード（命令1）のキャッシュミス判定タイミング（命令1のTステージ）が同定され、キャッシュミス判定信号によってキャッシュミスキャンセルされる。 For example, an instruction that has no direct dependency on the preceding load (instructions 6-7) inherits the arithmetic pipe advance load timing signal from the instruction on which it directly depends (instruction 2). This inheritance is shown in FIG. 16 by the multiple paths that loop back from B stage cancel_code[2:0] or F stage cancel_code[1:0] to B stage cancel_code[2:0]. The cache miss determination timing of the preceding load (the T stage of instruction 1) is thereby identified, and the instruction is canceled on a cache miss by the cache miss determination signal.

また、命令8から後ろの命令は、スケジューラ１０５内のキャッシュミスタイミング判定比較回路３０４内で、演算パイプライン１０９ａのFステージのID一致と演算パイプライン１０９ａのFステージの演算パイプ先行ロードタイミング信号と先行演算命令（命令2）のレイテンシ（=1）とから先行ロード（命令1）のキャッシュミス判定タイミング（Tステージ）が同定され、命令1のキャッシュミス判定信号から命令8のキャッシュミスキャンセルが行なわれるため、スケジューラ１０５からはissueされない。 In addition, for instruction 8 and later instructions, the cache miss timing determination comparison circuit 304 in the scheduler 105 identifies the cache miss determination timing (T stage) of the preceding load (instruction 1) from the ID match at the F stage of the arithmetic pipeline 109a, the arithmetic pipe advance load timing signal at the F stage of the arithmetic pipeline 109a, and the latency (=1) of the preceding arithmetic instruction (instruction 2). Instruction 8 is then canceled on a cache miss based on the cache miss determination signal of instruction 1, so it is not issued from the scheduler 105.

演算パイプ先行ロードタイミング信号は、依存ロード（命令1）と或る命令（命令2）との間の相対位置情報を含む。これにより、各命令6-8は、下記の２つの情報により依存元のロード命令1のTステージを同定できる。 The arithmetic pipe advance load timing signal includes relative position information between a dependent load (instruction 1) and a certain instruction (instruction 2). As a result, each instruction 6-8 can identify the T stage of the dependent load instruction 1 using the following two pieces of information.

・各命令6-8と命令2との比較による、各命令6-8と命令2との間の相対位置。
・演算パイプ先行ロードタイミング信号に基づく命令2と命令1との間の相対位置。

- Relative position between each instruction 6-8 and instruction 2, obtained by comparing each instruction 6-8 with instruction 2.
- Relative position between instruction 2 and instruction 1, based on the arithmetic pipe advance load timing signal.
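
The way these two relative positions combine can be sketched with simple cycle arithmetic. The following Python model is an illustration under assumptions: the function name and the choice to express both relative positions as cycle counts measured from the P stage of instruction 2 are hypothetical and are not taken from the figures.

```python
def cycles_until_load_t(consumer_after_producer: int, producer_before_load_t: int) -> int:
    """Cycles from the consumer's P stage to the T stage of the source load.

    consumer_after_producer: cycles from the producer's P stage (instruction 2)
        to the consumer's P stage (instructions 6-8).
    producer_before_load_t: cycles from the producer's P stage to the load's
        T stage (instruction 1), as carried by the timing signal.
    A result of 0 means the consumer's P stage coincides with the T stage;
    a negative result means the T stage has already passed.
    """
    return producer_before_load_t - consumer_after_producer
```

With the timing of FIG. 17 (instruction 2 at its P stage in cycle 2, instruction 1 at its T stage in cycle 4), instruction 6 issued in cycle 3 sees the T stage one cycle ahead, and instruction 7 issued in cycle 4 coincides with it.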

制御部１０５ｇは、命令6-8について、直接の依存元である命令2のFステージの演算パイプ先行ロードタイミング信号を取得し、命令2のFステージと各命令6-8との比較と、命令1のcache miss判定信号とに基づき、各命令6-8をキャンセルする。 For instructions 6-8, the control unit 105g obtains the arithmetic pipe advance load timing signal at the F stage of instruction 2, which is their direct dependency source, and cancels each of instructions 6-8 based on the comparison between the F stage of instruction 2 and each of instructions 6-8 and on the cache miss determination signal of instruction 1.

例えば、命令6の発行タイミングは、3サイクル目の命令2のBステージ==命令6のPステージとなる（図１６のn4参照）。命令2のB_cancel_code[1]（n5参照）は、命令2のBステージ＝命令1のMステージを示す。制御部１０５ｇは、命令1のMステージ＝命令6のPステージであるため、命令6のB_cancel_code[2]をセットする。 For example, the issue timing of instruction 6 is the B stage of instruction 2 in the third cycle==P stage of instruction 6 (see n4 in FIG. 16). B_cancel_code[1] (see n5) of instruction 2 indicates that B stage of instruction 2 = M stage of instruction 1. The control unit 105g sets B_cancel_code[2] of the instruction 6 because the M stage of the instruction 1 is the P stage of the instruction 6.

命令7の発行タイミングは、4サイクル目の命令2のFステージ==命令7のPステージ（n6参照）、又は、命令3のBステージ==命令7のPステージ（依存があったとき）（n4参照）となる。命令2のF_cancel_code[1]（n8参照）は、命令2のFステージ==命令1のTステージを示す。制御部１０５ｇは、命令1のTステージ==命令7のPステージであるため、cache miss信号で命令7をキャンセルする（g1参照）。命令3のB_cancel_code[2]（n7参照）は、命令3のBステージ==命令1のTステージを示す。制御部１０５ｇは、命令1のTステージ==命令7のPステージであるため、cache miss信号で命令7をキャンセルする（g1参照）。 The issue timing of instruction 7 is the fourth cycle, in which either the F stage of instruction 2 == the P stage of instruction 7 (see n6), or, when instruction 7 depends on instruction 3, the B stage of instruction 3 == the P stage of instruction 7 (see n4). F_cancel_code[1] of instruction 2 (see n8) indicates that the F stage of instruction 2 == the T stage of instruction 1. Since the T stage of instruction 1 == the P stage of instruction 7, the control unit 105g cancels instruction 7 with the cache miss signal (see g1). Likewise, B_cancel_code[2] of instruction 3 (see n7) indicates that the B stage of instruction 3 == the T stage of instruction 1. Since the T stage of instruction 1 == the P stage of instruction 7, the control unit 105g cancels instruction 7 with the cache miss signal (see g1).

例えば、制御部１０５ｇは、T stage miss detect信号と演算パイプ先行ロードタイミング信号とのANDとなるcache miss信号をキャンセル回路２１１に発行する。キャンセル回路２１１は、cache miss信号に応じて、セレクタ２０５に入力される後続命令の、P stage valid信号を落とす。これにより、後続命令のキャンセルが行なわれる。 For example, the control unit 105g issues to the cancellation circuit 211 a cache miss signal that is an AND of the T stage miss detect signal and the arithmetic pipe advance load timing signal. The cancel circuit 211 drops the P stage valid signal of the subsequent instruction input to the selector 205 in response to the cache miss signal. This causes the subsequent instruction to be canceled.
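
The gating described in this paragraph can be sketched as two Boolean functions. This is a minimal illustrative model; the function and parameter names are assumptions, and the single Boolean `timing_bit` stands in for whichever cancel_code bit marks the T timing at the stage in question.

```python
def cache_miss_signal(t_stage_miss_detect: bool, timing_bit: bool) -> bool:
    """AND of the T stage miss detect signal and the advance load timing bit."""
    return t_stage_miss_detect and timing_bit


def p_stage_valid_after_cancel(p_stage_valid: bool, cache_miss: bool) -> bool:
    """Model of the cancel circuit: the valid signal is dropped on a cache miss."""
    return p_stage_valid and not cache_miss
```

A subsequent instruction therefore stays valid unless both the miss is detected and its timing bit says that the current cycle is the T timing of the load it depends on.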

それ以降のタイミングでは、命令5,8は発行されない。例えば、4サイクル目の命令2のFステージ==スケジューラ１０５内比較器では、制御部１０５ｇは、命令2のF_cancel_code[1]==1でcache miss信号を参照する。 Instructions 5 and 8 are not issued at subsequent timings. For example, when the comparator in the scheduler 105 matches the F stage of instruction 2 in the fourth cycle, the control unit 105g refers to the cache miss signal because F_cancel_code[1] of instruction 2 is 1.

各命令は、自命令の後続命令をキャンセルするために、PステージからFステージまで演算パイプ先行ロードタイミング信号（キャンセル情報）を持ち続ける。Fステージになると、各命令とスケジューラ１０５内後続命令との比較が行なわれる。また、上述した命令6の発行タイミングでは、3サイクル目で命令2の演算パイプ先行ロードタイミング信号から命令6の演算パイプ先行ロードタイミング信号へと演算パイプ先行ロードタイミング信号が引き継がれる。 Each instruction continues to have an arithmetic pipe advance load timing signal (cancellation information) from the P stage to the F stage in order to cancel the instruction following the own instruction. At the F stage, each instruction is compared with subsequent instructions within the scheduler 105. Furthermore, in the issuance timing of instruction 6 described above, the arithmetic pipe advance load timing signal is inherited from the arithmetic pipe advance load timing signal of instruction 2 to the arithmetic pipe advance load timing signal of instruction 6 in the third cycle.

図１８は、ロードパイプ１０９ｂが1つの場合における演算パイプ先行ロードタイミング信号（cancel_code）の一例を示す図である。演算パイプ先行ロードタイミング信号の各値は、例えば、以下の状態を示してよい。なお、括弧内のn1-n3,g1-g3は、それぞれ図１６に付した同一符号の信号又は構成に対応する。 FIG. 18 is a diagram illustrating an example of the arithmetic pipe advance load timing signal (cancel_code) when there is one load pipe 109b. Each value of the arithmetic pipe advance load timing signal may indicate the following states, for example. Note that n1-n3 and g1-g3 in parentheses correspond to signals or configurations with the same symbols in FIG. 16, respectively.

P stage[3:1]の場合。
[3]: P==依存load(命令1) Tタイミング
→cache miss信号でキャンセル(g1)→1サイクル後はB stage[0]
[2]: P==依存load(命令1) Mタイミング(B==T)
→1サイクル後はB stage[2](n1)
[1]: P==依存load(命令1) Rタイミング(F==T)
→1サイクル後はB stage[1](n2)
[0]: Pが依存load(命令1) Tタイミングより遅く、ミス確定済
（スケジューラ１０５から発行されないため不要）

For P stage[3:1].
[3]: P==dependent load (instruction 1) T timing
→cancel with cache miss signal (g1)→B stage[0] after 1 cycle
[2]: P==dependent load (instruction 1) M timing (B==T)
→B stage[2](n1) after 1 cycle
[1]: P==dependent load (instruction 1) R timing (F==T)
→B stage[1](n2) after 1 cycle
[0]: P is later than the T timing of the dependent load (instruction 1), miss already confirmed
(unnecessary because such an instruction is not issued from the scheduler 105)

B stage[2:0]の場合。
[2]: B==依存load(命令1) Tタイミング
→cache miss信号でキャンセル(g2)→1サイクル後はF stage[0]
[1]: B==依存load(命令1) Mタイミング(F==T)
[0]: Bが依存load(命令1) Tタイミングより遅く、ミス確定済
→1サイクル後はF stage[0](n3)

For B stage[2:0].
[2]: B==dependent load (instruction 1) T timing
→cancel with cache miss signal (g2)→F stage[0] after 1 cycle
[1]: B==dependent load (instruction 1) M timing (F==T)
[0]: B is later than the T timing of the dependent load (instruction 1), miss already confirmed
→F stage[0](n3) after 1 cycle

F stage[1:0]の場合。
[1]: F==依存load(命令1) Tタイミング
→cache miss信号でキャンセル(g3)
[0]: Fが依存load(命令1) Tタイミングより遅く、ミス確定済

For F stage[1:0].
[1]: F==dependent load (instruction 1) T timing
→cancel with cache miss signal (g3)
[0]: F is later than the T timing of the dependent load (instruction 1), miss already confirmed
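
The stage-to-stage hand-off of the cancel_code listed above can be sketched as a small transition table. This Python model is only an illustration of the listing: the table and function names are hypothetical, the transition from B stage[1] to F stage[1] is inferred from the "(F==T)" annotation rather than stated explicitly, and the model does not represent whether a miss actually occurred.

```python
# (stage, bit) -> (next stage, next bit), following the FIG. 18 listing.
NEXT_STATE = {
    ("P", 3): ("B", 0),
    ("P", 2): ("B", 2),  # n1
    ("P", 1): ("B", 1),  # n2
    ("B", 2): ("F", 0),
    ("B", 1): ("F", 1),  # inferred from "(F==T)"; not stated explicitly
    ("B", 0): ("F", 0),  # n3
}

# States in which the stage coincides with the load's T timing, and the
# gate that applies the cache miss signal there.
T_TIMING_GATE = {("P", 3): "g1", ("B", 2): "g2", ("F", 1): "g3"}


def trace(stage: str, bit: int) -> list:
    """Follow the cancel_code from the given state until no transition is listed."""
    path = [(stage, bit)]
    while path[-1] in NEXT_STATE:
        path.append(NEXT_STATE[path[-1]])
    return path
```

Each trace passes through exactly one T-timing state, which is the cycle in which the cache miss signal is consulted for that instruction.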

このように、例えば3ビットのP stage cancel_code[3:1]のうちのコード[3]は、Pステージの依存ロード（命令1）のTタイミングを示す。図１７及び図１８の例において、スケジューラ１０５は、命令1のキャッシュミス確定情報（演算パイプ先行ロードタイミング信号）を用いて、FステージにおいてロードのTステージに被るタイミングで命令2を発行し、4サイクル目で比較を行なう。スケジューラ１０５は、例えば、命令6-8については、依存元命令2のFステージの演算パイプ先行ロードタイミング信号と、命令2のFサイクルのタイミングと命令1のキャッシュミス判定信号とを待って、キャンセルを行なう。 In this way, for example, code [3] of the 3-bit P stage cancel_code[3:1] indicates that the P stage coincides with the T timing of the dependent load (instruction 1). In the examples of FIGS. 17 and 18, the scheduler 105 uses the cache miss confirmation information of instruction 1 (the arithmetic pipe advance load timing signal) to issue instruction 2 at a timing at which its F stage overlaps the T stage of the load, and performs the comparison in the fourth cycle. For instructions 6-8, for example, the scheduler 105 waits for the arithmetic pipe advance load timing signal of the F stage of the dependency-source instruction 2, the timing of the F cycle of instruction 2, and the cache miss determination signal of instruction 1, and then performs the cancellation.

このように、比較例に係る依存解決判定部１０５ｂは、キャッシュミスタイミング判定比較回路３０３及び３０４において、Tステージ及びFステージのタイミング比較器、制御部１０５ｇにおいてT-Rステージ及びPステージのタイミング比較器、B-Fステージ及びPステージのタイミング比較器を備える。これらの比較器は、パイプラインの数が増加するほど、比較コスト（例えば実装面積）が増加する。 In this way, the comparative example includes timing comparators for the T stage and the F stage in the cache miss timing determination comparison circuits 303 and 304 of the dependency resolution determination unit 105b, and, in the control unit 105g, timing comparators between the T-R stages and the P stage and timing comparators between the B-F stages and the P stage. The comparison cost of these comparators (for example, implementation area) increases as the number of pipelines increases.

図１９は、一実施形態に係るＰＳ情報２０７の割当例を説明するための図である。図１９には、図１１に符号Ｃで示す演算パイプ１０９ａ用のステージＩＤ２０８の割当例に加えて、ロードパイプ１０９ｂ用のステージＩＤ２０８の割当例を符号Ｄで示す。 FIG. 19 is a diagram for explaining an example of allocation of PS information 207 according to one embodiment. In FIG. 19, in addition to an example of assignment of the stage ID 208 for the calculation pipe 109a, which is indicated by the symbol C in FIG. 11, an example of assignment of the stage ID 208 for the load pipe 109b is indicated by the symbol D.

一実施形態に係るステージＩＤ２０８は、図１９の符号Ｄに示すように、ロードパイプ１０９ｂ用として、L以前、R、M、T、W以降の各ステージに{0,1,2,3,0}が割り当てられる。なお、L以前のstage_id[2:0]==0は、依存が未解決の状態（dependency undetected）を示し、W以降のstage_id[2:0]==0は、レジスタ１０８経由でのデータの読み出しを行なうことを示す。また、先行命令の依存チェーンでキャッシュミスによる命令のキャンセルが生じた場合、或いは、パイプラインフラッシュが生じた場合等、依存が未解決状態を示すステージＩＤ２０８stage_id[2:0]=0をセットしてもよい。 As shown by reference numeral D in FIG. 19, for the load pipe 109b, the stage ID 208 according to one embodiment is assigned the values {0,1,2,3,0} for the stages L and earlier, R, M, T, and W and later, respectively. Note that stage_id[2:0]==0 at L and earlier indicates that the dependency is unresolved (dependency undetected), and stage_id[2:0]==0 at W and later indicates that the data is read via the register 108. In addition, when an instruction is canceled due to a cache miss in the dependency chain of a preceding instruction, or when a pipeline flush occurs, the stage ID 208 may be set to stage_id[2:0]=0, which indicates that the dependency is unresolved.
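
The load-pipe stage ID assignment can be written out as a lookup table. The following Python sketch is illustrative only; the dictionary and helper names are assumptions, and the stage labels are plain strings rather than any signal encoding from the figures.

```python
# Stage ID assignment for the load pipe, per the text: {0,1,2,3,0} for
# L and earlier, R, M, T, and W and later.
LOAD_PIPE_STAGE_ID = {"L": 0, "R": 1, "M": 2, "T": 3, "W": 0}


def is_t_stage(stage_id: int) -> bool:
    """True when the stage ID identifies the T stage, where a miss is determined."""
    return stage_id == LOAD_PIPE_STAGE_ID["T"]
```

Because both ends of the pipe map to 0, a value of 0 alone does not say whether the dependency is still undetected or the data is already readable from the register; the surrounding control state has to distinguish the two.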

図２０は、一実施形態に係るプロセッサ１のコアの構成におけるキャッシュミスしたロードから依存する依存チェーンの後続命令をキャンセルするための構成例を示す図である。図２１は、一実施形態に係るキャッシュミスしたロードから依存する依存チェーンの後続命令をキャンセルする処理の判定例を示す図である。なお、演算パイプ先行ロードタイミング信号の値、命令の発行又はキャンセルタイミング等は、比較例と同様である。 FIG. 20 is a diagram illustrating a configuration example for canceling a subsequent instruction in a dependency chain that depends on a cache-missed load in the core configuration of the processor 1 according to an embodiment. FIG. 21 is a diagram illustrating a determination example of processing for canceling a subsequent instruction in a dependency chain that depends on a load that has caused a cache miss, according to an embodiment. Note that the value of the arithmetic pipe advance load timing signal, instruction issue or cancellation timing, etc. are the same as in the comparative example.

図２０に示すように、プロセッサ１のコア（例えばスケジューラ１０５）は、制御部１４を備えてよい。制御部１４は、情報生成部１３を含むフォワーディング部１２と、情報生成部１５と、生成回路１０５ｉとを備えてよい。また、プロセッサ１のコアは、キャンセル回路２１１を備えてよい。 As shown in FIG. 20, the core of the processor 1 (for example, the scheduler 105) may include the control unit 14. The control unit 14 may include a forwarding unit 12 including the information generation unit 13, an information generation unit 15, and a generation circuit 105i. Further, the core of the processor 1 may include a cancellation circuit 211.

図２０に示すパイプラインステージは、ロード用であるため、パイプラインステージに入力されるＰＳ情報２０７は、キャッシュミスキャンセルに利用される。例えば、ＰＳ情報２０７は、Tサイクルの特定に利用される。 Since the pipeline stage shown in FIG. 20 is for loading, the PS information 207 input to the pipeline stage is used for cache miss cancellation. For example, the PS information 207 is used to specify the T cycle.

図２１に示すように、依存解決判定部１１は、Tサイクルの依存キャンセルをstage_id==3 & pipeline_id == load & T_cache_missで判定する。また、依存解決判定部１１は、Fサイクルの依存キャンセルを{(stage_id==2 & f_latency==1) | (stage_id==1 & f_latency==2)} & {(F_cancel_code[1]&T_cache_miss)|F_cancel_code[0]} & pipeline_id==execで判定する。 As shown in FIG. 21, the dependency resolution determination unit 11 determines the dependency cancellation of the T cycle by stage_id==3 & pipeline_id == load & T_cache_miss. In addition, the dependency resolution determination unit 11 determines the dependency cancellation of the F cycle by {(stage_id==2 & f_latency==1) | (stage_id==1 & f_latency==2)} & {(F_cancel_code[1]&T_cache_miss)|F_cancel_code[0]} & pipeline_id==exec.
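
The two cancellation conditions can be transcribed directly into Boolean functions. This is a sketch under assumptions: the function names are hypothetical, pipeline_id is modeled as the strings "load" and "exec", and F_cancel_code is modeled as a small integer whose bits 1 and 0 correspond to F_cancel_code[1] and F_cancel_code[0].

```python
def t_cycle_cancel(stage_id: int, pipeline_id: str, t_cache_miss: bool) -> bool:
    """stage_id==3 & pipeline_id==load & T_cache_miss"""
    return stage_id == 3 and pipeline_id == "load" and t_cache_miss


def f_cycle_cancel(stage_id: int, f_latency: int, f_cancel_code: int,
                   t_cache_miss: bool, pipeline_id: str) -> bool:
    """F-cycle condition from the text, term by term."""
    timing = (stage_id == 2 and f_latency == 1) or (stage_id == 1 and f_latency == 2)
    code1 = bool(f_cancel_code & 0b10)  # F_cancel_code[1]: F coincides with T
    code0 = bool(f_cancel_code & 0b01)  # F_cancel_code[0]: T already passed
    miss = (code1 and t_cache_miss) or code0
    return timing and miss and pipeline_id == "exec"
```

Note that F_cancel_code[0] cancels without consulting T_cache_miss, which matches its meaning in FIG. 18 that the miss was already confirmed in an earlier cycle.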

制御部１４は、情報生成部１５において、依存解決判定部１１における判定結果としてＰＳ情報２０７を取得し、パイプラインＩＤ２０９からロードパイプ１０９ｂを特定し、ステージＩＤ２０８から演算パイプ先行ロードタイミング信号を生成する。制御部１４は、演算パイプ先行ロードタイミング信号と、後続のＰＳ情報２０７とに基づき、Tサイクルのタイミングを同定する。そして、制御部１４は、Tサイクルにおけるキャッシュミスの有無を示す判定信号に基づき、キャンセル回路２１１により後続命令のキャンセルを行なう。 In the information generation unit 15, the control unit 14 acquires the PS information 207 as the determination result of the dependency resolution determination unit 11, identifies the load pipe 109b from the pipeline ID 209, and generates the arithmetic pipe advance load timing signal from the stage ID 208. The control unit 14 identifies the timing of the T cycle based on the arithmetic pipe advance load timing signal and the subsequent PS information 207. The control unit 14 then causes the cancel circuit 211 to cancel the subsequent instruction based on the determination signal indicating whether a cache miss occurred in the T cycle.

このように、制御部１４は、依存元命令の依存チェーンに先行ロード命令が存在し、先行ロード命令がキャッシュミスした場合に、当該先行ロード命令の後続の依存チェーンの全ての命令（後続の一連の命令：後続チェイン命令）をキャンセルするキャンセル制御部の一例である。キャンセルには、先行ロード命令の後続の依存チェーンの全ての命令について、データ依存解決の成否を示す情報（例えばopX_ready bit）に否を示す値をセットする（例えばopX_ready bitを落とす）ことが含まれてよい。 In this way, the control unit 14 is an example of a cancel control unit that, when a preceding load instruction exists in the dependency chain of the dependency-source instruction and that preceding load instruction causes a cache miss, cancels all instructions in the dependency chain following the preceding load instruction (the subsequent series of instructions: subsequent chain instructions). The cancellation may include, for all instructions in the dependency chain following the preceding load instruction, setting a value indicating failure in the information that indicates the success or failure of data dependency resolution (for example, dropping the opX_ready bit).
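
The effect of this cancellation on a dependency chain can be modeled in software as a graph walk. This is only a behavioral sketch under assumptions: the hardware described here does not traverse a graph, and the function name, the `producers` map, and the `op_ready` dictionary standing in for the opX_ready bits are all hypothetical.

```python
from collections import deque


def cancel_dependency_chain(producers, missed_load, op_ready):
    """Clear op_ready for every instruction that depends, directly or
    indirectly, on missed_load, and return the set of canceled instructions.

    producers: {instruction: iterable of instructions whose results it uses}
    op_ready:  {instruction: bool}, a stand-in for the opX_ready bits
    """
    # Invert the map so the walk can go from a producer to its consumers.
    consumers = {}
    for inst, sources in producers.items():
        for src in sources:
            consumers.setdefault(src, []).append(inst)

    canceled = set()
    queue = deque([missed_load])
    while queue:
        current = queue.popleft()
        for inst in consumers.get(current, []):
            if inst not in canceled:
                canceled.add(inst)
                op_ready[inst] = False  # "drop" the ready bit
                queue.append(inst)
    return canceled
```

With instruction 1 as the missed load, instructions 2 and 3 reading from it, and instructions 6 and 7 reading from instruction 2, the model clears the ready bit of all four while leaving unrelated instructions untouched.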

また、依存解決判定部１１は、発行制御において、依存元命令に係るＰＳ情報２０７に依存先命令の発行を抑止する情報（例えばステージＩＤ２０８に0）をセットすることで、依存先命令が誤発行制御されることを抑制できる。 In addition, in the issue control, the dependency resolution determination unit 11 sets, in the PS information 207 of the dependency-source instruction, information that suppresses the issue of the dependent instruction (for example, 0 in the stage ID 208), and can thereby prevent the dependent instruction from being issued erroneously.

以上のように、一実施形態によれば、依存解決判定部１１は、キャッシュミスタイミング判定比較回路３０３及び３０４を、デコーダを含むキャッシュミス判定タイミング生成回路１１ａを置き換えるとともに、エンコーダ及びインクリメンタ（updater）を含むＰＳ情報更新回路１１ｂを備える（図８、図１０参照）。このように、依存解決判定部１０５ｂ内の比較器をステージＩＤ２０８のデコーダに置き換え、ステージＩＤ２０８に基づきキャンセルを制御することで、スケジューラ１０５内の比較器の数を減少させることができる。また、パイプラインでは、比較器１０５ｆに加えて、比較器１０５ｈを省略することができる。 As described above, according to one embodiment, the dependency resolution determination unit 11 replaces the cache miss timing determination comparison circuits 303 and 304 with a cache miss determination timing generation circuit 11a including a decoder, and includes a PS information update circuit 11b including an encoder and an incrementer (updater) (see FIGS. 8 and 10). By replacing the comparators in the dependency resolution determination unit 105b with a decoder for the stage ID 208 and controlling cancellation based on the stage ID 208 in this way, the number of comparators in the scheduler 105 can be reduced. Furthermore, in the pipeline, the comparators 105h can be omitted in addition to the comparator 105f.

〔１－６〕Inflight Condition Flag更新制御のための構成例 [1-6] Configuration example for Inflight Condition Flag update control
CF（Condition Flag）は、レジスタ１０８と同様にrenameされる。Architectural registerとしてのCFは、commit（reorder）後に更新される。 The CF (Condition Flag) is renamed in the same manner as the register 108. The CF as an architectural register is updated after commit (reorder).

プロセッサ１のコアは、Rename stageでCFを保持し、都度更新する。Rename stage以降のCFは、Inflight CFと呼ばれる。コアは、CF更新命令を投機的に実行している間にRename stageの値が正しくない場合に、スケジューラ１０５以降でCFを更新すべきか否かを示すbitを持ち、CFの更新を制御する。 The core of processor 1 holds the CF in the Rename stage and updates it each time. CFs after the Rename stage are called Inflight CFs. The core has a bit indicating whether or not to update the CF in the scheduler 105 or later when the value of Rename stage is incorrect while speculatively executing a CF update instruction, and controls the update of the CF.

命令は、Rename stageのCFを参照し、CFの更新命令が実行済みであれば値を載せる。CFの更新命令が実行済みではない場合、コアは、スケジューラ１０５又はパイプラインにおいて、CF更新命令のパイプラインとタイミングとを判断し、CFの値をキャプチャ及びフォワーディングする。 An instruction refers to the CF in the Rename stage and, if the CF update instruction has already been executed, carries that value with it. If the CF update instruction has not yet been executed, the core determines the pipeline and timing of the CF update instruction in the scheduler 105 or in the pipeline, and captures and forwards the CF value.

図２２は、比較例に係るプロセッサ１のコアの構成におけるCF更新制御に着目した構成例を示す図である。図２２では、演算レイテンシが1サイクルであり、対象のオペランドが1つである場合を例に挙げる。 FIG. 22 is a diagram illustrating a configuration example focusing on CF update control in the core configuration of the processor 1 according to the comparative example. In FIG. 22, an example is given in which the calculation latency is one cycle and the number of target operands is one.

図２２に示すように、プロセッサ１のコア（例えばスケジューラ１０５）は、CF更新タイミング判定部１０５ｊ及びキャプチャ部１０５ｍを備えてよい。CF更新タイミング判定部１０５ｊは、命令ＩＤ２１２とリード命令ＩＤとを比較する比較器１０５ｋを含む。キャプチャ部１０５ｍは、比較器１０５ｌを備える。また、コアは、比較器１０５ｋからのフォワーディングタイミング情報に基づき、キャプチャ部１０５ｍからのCF、又は、演算器１０６の演算結果を選択するセレクタ２１３を備える。 As shown in FIG. 22, the core of the processor 1 (for example, the scheduler 105) may include a CF update timing determination section 105j and a capture section 105m. The CF update timing determination unit 105j includes a comparator 105k that compares the instruction ID 212 and the read instruction ID. The capture unit 105m includes a comparator 105l. The core also includes a selector 213 that selects the CF from the capture unit 105m or the calculation result of the calculation unit 106 based on the forwarding timing information from the comparator 105k.

CF更新タイミング判定部１０５ｊにおいて、点線は、LXからXステージへのフォワーディング、破線は、LXP1からFステージへのフォワーディング、一点鎖線は、LXP2からFステージへのフォワーディングを示す。実線は、LXP3からFステージへのフォワーディングを示す。 In the CF update timing determination unit 105j, the dotted line indicates forwarding from LX to the X stage, the broken line indicates forwarding from LXP1 to the F stage, and the dash-dot line indicates forwarding from LXP2 to the F stage. The solid line indicates forwarding from LXP3 to the F stage.

図２３は、比較例に係るCFの更新制御の判定例を示す図である。図２３の符号Ａは、レイテンシが１サイクルであるaddeq命令のタイミングの一例を示し、符号Ｂは、各ステージの機能の一例を示す。 FIG. 23 is a diagram illustrating a determination example of CF update control according to a comparative example. Symbol A in FIG. 23 indicates an example of the timing of an addeq instruction with a latency of one cycle, and symbol B indicates an example of the function of each stage.

図２３の符号Ｂに示すように、Cステージは、inflight CFの読み出しが行なわれるステージを示す。他のステージは図１２と同様である。 As shown by reference numeral B in FIG. 23, the C stage indicates the stage at which the inflight CF is read. Other stages are similar to those in FIG. 12.

符号Ａに示すcmp: r0, r1は、r0、r1の値を比較し、一致したらequal、一致しなければnot equalを示すCFを出力する。コアは、cmp命令では結果のレジスタ１０８への書き込みを行なわず、スケジューラ１０５が保持するCFのみを更新する。inflight CF及びスケジューラ１０５内の後続CF読み出し依存命令のinflight CF fieldのCFも更新される。Architectural registerとしてのCFはcommit後に更新される。また、スケジューラ１０５が保持するInflight資源はLXP1で更新される。例えば、スケジューラ１０５は、Inflight資源でCFをregister renamingのように制御する。 The instruction cmp: r0, r1 indicated by symbol A compares the values of r0 and r1 and outputs a CF indicating equal if they match and not equal if they do not. For the cmp instruction, the core does not write a result to the register 108 and updates only the CF held by the scheduler 105. The inflight CF and the CF in the inflight CF field of subsequent dependent instructions in the scheduler 105 that read the CF are also updated. The CF as an architectural register is updated after commit. The inflight resource held by the scheduler 105 is updated at LXP1. For example, the scheduler 105 manages the CF as an inflight resource in a manner similar to register renaming.

cmp命令はCFを更新する命令であり、addeq命令はCFを読み出す命令であるため、cmp命令とaddeq命令とは依存関係がある。 Since the cmp instruction is an instruction to update the CF, and the addeq instruction is an instruction to read the CF, there is a dependency between the cmp and addeq instructions.
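
The dependency between the two instructions can be illustrated with a toy model. This sketch rests on an assumption: the text states only that addeq reads the CF, so the conditional-add behavior shown here (perform the addition only when the CF indicates equal, as in ARM-style condition suffixes) is assumed for illustration, and the function names are hypothetical.

```python
def cmp_update_cf(r0: int, r1: int) -> str:
    """Model of cmp: produces only a CF value; no register result is written."""
    return "equal" if r0 == r1 else "not equal"


def addeq(cf: str, dest: int, a: int, b: int) -> int:
    """Assumed semantics: write a + b when the CF is 'equal', else keep dest."""
    return a + b if cf == "equal" else dest
```

Since addeq cannot produce its result without the CF written by cmp, the scheduler has to track the CF in the same way as a register operand.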

図２３の例において、命令1（cmp命令）から命令2-5（addeq命令）へのタイミングは、比較器１０５ｋ、セレクタ２１３経路でフォワーディング２１４が行なわれる。命令1から命令7（addeq命令）へのタイミング以降は、キャプチャ部１０５ｍ経由でInflight CFの読み出しが行なわれる。例えば、比較器１０５ｌは、命令1のLXP1タイミングを捕捉し、Inflight CFを更新するために備えられる。 In the example of FIG. 23, forwarding 214 is performed at the timing from instruction 1 (cmp instruction) to instruction 2-5 (addeq instruction) via the comparator 105k and selector 213 path. After the timing from instruction 1 to instruction 7 (addeq instruction), Inflight CF is read out via the capture unit 105m. For example, comparator 105l is provided to capture the LXP1 timing of instruction 1 and update the Inflight CF.

このように、比較例では、書き込みタイミングをキャプチャするために、そのタイミングを同定するための比較器１０５ｋが用いられる。 In this way, in the comparative example, in order to capture the write timing, the comparator 105k for identifying the timing is used.

また、CFの依存はレジスタ１０８のlifetimeとは異なるため、キャプチャ部１０５ｍは、専用の制御bitを持ち、CFがパイプラインからフォワーディングされるか否かを管理する。この情報を持つために、キャプチャ部１０５ｍは、タイミングを見る専用の比較器１０５ｌを用いる。 Furthermore, since the dependence of the CF is different from the lifetime of the register 108, the capture unit 105m has a dedicated control bit and manages whether or not the CF is forwarded from the pipeline. In order to have this information, the capture unit 105m uses a comparator 105l dedicated to checking timing.

上述したCFの更新制御において、パイプライン数、パイプラインのステージ数等が増加すると、スケジューラ１０５のキュー内及びパイプライン（フォワーディング部１０５ｅ）で、CF更新命令のパイプライン及びタイミングを判定するための比較器の数が増加する。 In the above-mentioned CF update control, when the number of pipelines, the number of pipeline stages, etc. increases, the process for determining the pipeline and timing of the CF update instruction is performed in the queue of the scheduler 105 and in the pipeline (forwarding unit 105e). The number of comparators increases.

図２４は、一実施形態に係るプロセッサ１のコアの構成におけるCF更新制御に着目した構成例を示す図である。図２４では、演算レイテンシが1サイクルであり、対象のオペランドが1つである場合を例に挙げる。 FIG. 24 is a diagram illustrating a configuration example focusing on CF update control in the core configuration of the processor 1 according to an embodiment. In FIG. 24, an example is given in which the calculation latency is one cycle and the number of target operands is one.

図２４に示すように、プロセッサ１のコア（例えばスケジューラ１０５）は、CF更新タイミング判定部１６及びキャプチャ部１８を備えてよい。CF更新タイミング判定部１６は、ステージＩＤ２０８に基づきフォワーディング情報を生成する情報生成部１７を含む。キャプチャ部１８は、比較器１０５ｌに代えて、ステージＩＤ２０８の入力を含む。また、コアは、情報生成部１７からのフォワーディングタイミング情報に基づき、キャプチャ部１８からのCF、又は、演算器１０６の演算結果を選択するセレクタ２１３を備える。 As shown in FIG. 24, the core of the processor 1 (for example, the scheduler 105) may include a CF update timing determination section 16 and a capture section 18. The CF update timing determination unit 16 includes an information generation unit 17 that generates forwarding information based on the stage ID 208. The capture unit 18 includes an input of the stage ID 208 instead of the comparator 105l. The core also includes a selector 213 that selects the CF from the capture unit 18 or the calculation result of the calculation unit 106 based on the forwarding timing information from the information generation unit 17.

CF更新タイミング判定部１６において、点線は、LXからXステージへのフォワーディング、破線は、LXP1からFステージへのフォワーディング、一点鎖線は、LXP2からFステージへのフォワーディングを示す。実線は、LXP3からFステージへのフォワーディングを示す。 In the CF update timing determination unit 16, the dotted line indicates forwarding from LX to the X stage, the broken line indicates forwarding from LXP1 to the F stage, and the dash-dot line indicates forwarding from LXP2 to the F stage. The solid line indicates forwarding from LXP3 to the F stage.

図２５は、一実施形態に係るCFの更新制御の判定例を示す図である。図２５の符号Ａは、レイテンシが１サイクルであるaddeq命令のタイミングの一例を示し、符号Ｂは、各ステージの機能の一例を示す。 FIG. 25 is a diagram illustrating an example of determination of CF update control according to an embodiment. Symbol A in FIG. 25 indicates an example of the timing of an addeq instruction with a latency of one cycle, and symbol B indicates an example of the function of each stage.

図２５の例において、命令1（cmp命令）から命令2-5（addeq命令）へのタイミングは、情報生成部１７、セレクタ２１３経路でフォワーディング２１４が行なわれる。命令1（cmp命令）から命令6（addeq命令）へのタイミングでは、スケジューラ１０５内のCFの更新後のPステージであり、キャプチャ部１８でCFのキャプチャ（read）後に、命令6が発行されるため、フォワーディング２１４は不要となる。命令1から命令7（addeq命令）へのタイミング以降は、キャプチャ部１８経由でInflight CFの読み出しが行なわれる。 In the example of FIG. 25, forwarding 214 is performed at the timing from instruction 1 (cmp instruction) to instruction 2-5 (addeq instruction) through the information generation unit 17 and selector 213 path. The timing from instruction 1 (cmp instruction) to instruction 6 (addeq instruction) is the P stage after the update of the CF in the scheduler 105, and after the capture unit 18 captures (reads) the CF, instruction 6 is issued. Therefore, forwarding 214 becomes unnecessary. After the timing from instruction 1 to instruction 7 (addeq instruction), Inflight CF is read out via the capture unit 18.

このように、CF更新タイミング判定部１６は、ＰＳ情報２０７に基づき、Inflight Condition Flagの更新を制御する更新制御部の一例である。 In this way, the CF update timing determination unit 16 is an example of an update control unit that controls updating of the Inflight Condition Flag based on the PS information 207.

以上のように、ＰＳ情報２０７は、フォワーディングに加えて、キャッシュミスしたロードから依存する依存チェーンの後続命令のキャンセル、及び、CFの更新制御の一方又は双方に対しても適用可能であり、それぞれにおいて比較器を省略可能となる。 As described above, in addition to forwarding, the PS information 207 can also be applied to one or both of the cancellation of subsequent instructions in a dependency chain that depends on a cache-missed load and the CF update control, and comparators can be omitted in each case.

〔２〕その他 [2] Others
上述した一実施形態に係る技術は、以下のように変形、変更して実施することができる。 The technique according to the embodiment described above can be modified and changed as follows.

一実施形態では、スケジューラ１０５がPステージで比較するものとしたが、これに限定されるものではなく、比較は、フォワーディングタイミングまでに実施されればよい。 In the embodiment, the scheduler 105 performs the comparison at the P stage, but the present invention is not limited to this; the comparison need only be completed by the forwarding timing.

また、情報生成部１３、１７は、PステージでＰＳ情報２０７に基づきフォワーディング情報を生成するものとしたが、これに限定されるものではなく、生成タイミングは、フォワーディング又はキャプチャタイミングに間に合ういずれのステージでもよい。 Furthermore, although the information generation units 13 and 17 generate forwarding information based on the PS information 207 at the P stage, the present invention is not limited to this, and the forwarding information may be generated at any stage that is in time for the forwarding or capture timing.

さらに、比較例として例示した各種比較器は、ＣＡＭであっても同様である。 Furthermore, the same applies when the various comparators illustrated in the comparative examples are CAMs.

また、オペランド又はCFごとに割り当てられる先行命令のステージＩＤ２０８は、カウントアップされるカウンタであるものとしたが、これに限定されるものではなく、カウントダウンされるカウンタであってもよい。或いは、ステージＩＤ２０８は、パイプライン及びステージを同定可能であれば種々の形態の情報であってもよい。 Furthermore, although the stage ID 208 of the preceding instruction assigned to each operand or CF is a counter that counts up, the stage ID 208 is not limited to this, and may be a counter that counts down. Alternatively, the stage ID 208 may be information in various forms as long as the pipeline and stage can be identified.
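The count-up and count-down variants of the stage ID 208 can be compared with a minimal sketch. It assumes that the stage ID is set once, advances by one per cycle, and is reset after a fixed number of tracked cycles; the class `PsInfo`, the field `tracked_cycles`, and the helper `trace` are hypothetical names, and `None` is used here to represent the reset state.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PsInfo:
    pipeline_id: int                 # corresponds to the pipeline ID 209
    tracked_cycles: int              # assumed number of cycles between set and reset (>= 1)
    count_up: bool = True            # False selects the count-down variant
    stage_id: Optional[int] = None   # corresponds to the stage ID 208; None = reset

    def start(self) -> None:
        """Set the stage ID at the first tracked cycle."""
        self.stage_id = 1 if self.count_up else self.tracked_cycles

    def tick(self) -> None:
        """Advance one cycle; reset once the tracked window has passed."""
        if self.stage_id is None:
            return
        nxt = self.stage_id + (1 if self.count_up else -1)
        self.stage_id = nxt if 1 <= nxt <= self.tracked_cycles else None


def trace(ps: PsInfo) -> List[int]:
    """Return the stage IDs observed from set to reset."""
    ps.start()
    seen = []
    while ps.stage_id is not None:
        seen.append(ps.stage_id)
        ps.tick()
    return seen


if __name__ == "__main__":
    print(trace(PsInfo(pipeline_id=0, tracked_cycles=4)))
    print(trace(PsInfo(pipeline_id=0, tracked_cycles=4, count_up=False)))
```

Either direction gives every tracked cycle a distinct value, which is all that is needed for the pair of pipeline ID and stage ID to identify a pipeline and a stage.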

〔３〕付記 [3] Appendices
以上の実施形態に関し、さらに以下の付記を開示する。 Regarding the above embodiments, the following appendices are further disclosed.

（付記１）
複数の命令のうちの実行可能な命令から順に並列に実行する演算処理装置であって、
前記複数の命令を格納するキューと、
前記複数の命令のうちの依存元命令が実行されるパイプラインと、前記パイプラインにおける前記依存元命令の実行ステージとを示す指標を保持し、前記複数の命令のうちの前記依存元命令と、前記依存元命令の実行結果を利用する依存先命令とのデータ依存解決を行ない、前記複数の命令の発行タイミングを制御する制御部と、
を備える、演算処理装置。
(Appendix 1)
An arithmetic processing device that executes, in parallel, instructions among a plurality of instructions in order starting from those that are executable, the arithmetic processing device comprising:
a queue that stores the plurality of instructions; and
a control unit that holds an index indicating a pipeline in which a dependent source instruction among the plurality of instructions is executed and an execution stage of the dependent source instruction in the pipeline, performs data dependency resolution between the dependent source instruction and a dependent destination instruction that uses an execution result of the dependent source instruction, and controls issue timing of the plurality of instructions.

（付記２）
前記指標は、
前記依存元命令が発行されたパイプラインの識別情報と、
前記依存元命令から前記依存先命令に実行結果を最短で転送できる発行タイミングから、前記実行結果をレジスタに格納し、前記依存先命令が前記実行結果を読み出せる発行タイミングまでの各ステージに一意に割り当てられたステージの識別情報と、を含む、
付記１に記載の演算処理装置。
(Appendix 2)
The arithmetic processing device according to Appendix 1, wherein the index includes:
identification information of the pipeline to which the dependent source instruction was issued; and
identification information of a stage, uniquely assigned to each stage from the issue timing at which the execution result can be transferred from the dependent source instruction to the dependent destination instruction in the shortest time, up to the issue timing at which the execution result has been stored in a register and the dependent destination instruction can read the execution result.

（付記３）
前記制御部は、前記キューのエントリごと、且つ、前記複数の命令の各々のオペランドごとに、各々の依存元命令に関する前記指標を生成する、
付記２に記載の演算処理装置。
(Appendix 3)
The arithmetic processing device according to Appendix 2, wherein the control unit generates the index regarding each dependent source instruction for each entry in the queue and for each operand of each of the plurality of instructions.

（付記４）
前記ステージの識別情報は、前記依存元命令が実行される最後のサイクルから第１所定数だけ前のサイクルでセットされ、前記最後のサイクルから第２所定数だけ後のサイクルでリセットされ、前記セットされるサイクルから前記リセットされるサイクルまでの間の１以上のサイクルの各々には一意の識別子が割り当てられる、
付記３に記載の演算処理装置。
(Appendix 4)
The arithmetic processing device according to Appendix 3, wherein the identification information of the stage is set in a cycle that precedes, by a first predetermined number, the last cycle in which the dependent source instruction is executed, and is reset in a cycle that follows the last cycle by a second predetermined number, and a unique identifier is assigned to each of the one or more cycles between the cycle in which the identification information is set and the cycle in which it is reset.

（付記５）
前記制御部は、前記オペランドに保持した前記指標に基づき、前記データ依存解決を行なう、
付記４に記載の演算処理装置。
(Appendix 5)
The arithmetic processing device according to Appendix 4, wherein the control unit performs the data dependency resolution based on the index held in the operand.

（付記６）
前記指標に基づき、前記依存元命令の実行結果を、前記制御に応じたタイミングで前記依存先命令が実行されるパイプラインに発行された前記依存先命令のオペランドにフォワーディングする制御を実行するフォワーディング制御部を備える、
付記１～付記５のいずれか１項に記載の演算処理装置。
(Appendix 6)
The arithmetic processing device according to any one of Appendices 1 to 5, further comprising a forwarding control unit that, based on the index, executes control to forward the execution result of the dependent source instruction to an operand of the dependent destination instruction that has been issued, at a timing according to the control, to the pipeline in which the dependent destination instruction is executed.

（付記７）
前記指標に基づき、前記依存元命令の依存チェーンに先行ロード命令が存在し、前記先行ロード命令がキャッシュミスした場合に、当該先行ロード命令の後続の依存チェーン内の全ての命令をキャンセルするキャンセル制御部を備える、
付記１～付記６のいずれか１項に記載の演算処理装置。
(Appendix 7)
The arithmetic processing device according to any one of Appendices 1 to 6, further comprising a cancellation control unit that, based on the index, cancels all instructions in the dependency chain subsequent to a preceding load instruction when the preceding load instruction exists in the dependency chain of the dependent source instruction and causes a cache miss.

（付記８）
前記キャンセル制御部は、前記キャンセルにおいて、前記先行ロード命令の後続の依存チェーンの全ての命令について、データ依存解決の成否を示す情報に否を示す値をセットする、
付記７に記載の演算処理装置。
(Appendix 8)
The arithmetic processing device according to Appendix 7, wherein, in the cancellation, the cancellation control unit sets a value indicating failure in information indicating the success or failure of data dependency resolution for all instructions in the dependency chain subsequent to the preceding load instruction.

（付記９）
前記指標に基づき、Inflight Condition Flagの更新を制御する更新制御部を備える、付記１～付記８のいずれか１項に記載の演算処理装置。
(Appendix 9)
The arithmetic processing device according to any one of Appendices 1 to 8, further comprising an update control unit that controls updating of the Inflight Condition Flag based on the index.

（付記１０）
複数の命令のうちの実行可能な命令から順に並列に実行する演算処理装置における演算処理方法であって、
前記演算処理装置が備えるスケジューラが、
受け付けた前記複数の命令をキューに格納し、
前記キューに格納された前記複数の命令のうちの依存元命令が実行されるパイプラインと、前記パイプラインにおける前記依存元命令の実行ステージとを示す指標を保持し、前記複数の命令のうちの前記依存元命令と、前記依存元命令の実行結果を利用する依存先命令とのデータ依存解決を行ない、前記複数の命令の発行タイミングを制御する、
処理を実行する、演算処理方法。
(Appendix 10)
An arithmetic processing method in an arithmetic processing device that executes, in parallel, instructions among a plurality of instructions in order starting from those that are executable, wherein a scheduler included in the arithmetic processing device executes processing comprising:
storing the received plurality of instructions in a queue;
holding an index indicating a pipeline in which a dependent source instruction among the plurality of instructions stored in the queue is executed and an execution stage of the dependent source instruction in the pipeline;
performing data dependency resolution between the dependent source instruction and a dependent destination instruction that uses an execution result of the dependent source instruction; and
controlling issue timing of the plurality of instructions.

（付記１１）
前記指標は、
前記依存元命令が発行されたパイプラインの識別情報と、
前記依存元命令から前記依存先命令に実行結果を最短で転送できる発行タイミングから、前記実行結果をレジスタに格納し、前記依存先命令が前記実行結果を読み出せる発行タイミングまでの各ステージに一意に割り当てられたステージの識別情報と、を含む、
付記１０に記載の演算処理方法。
(Appendix 11)
The arithmetic processing method according to Appendix 10, wherein the index includes:
identification information of the pipeline to which the dependent source instruction was issued; and
identification information of a stage, uniquely assigned to each stage from the issue timing at which the execution result can be transferred from the dependent source instruction to the dependent destination instruction in the shortest time, up to the issue timing at which the execution result has been stored in a register and the dependent destination instruction can read the execution result.

（付記１２）
前記スケジューラが、前記キューのエントリごと、且つ、前記複数の命令の各々のオペランドごとに、各々の依存元命令に関する前記指標を生成する、
付記１１に記載の演算処理方法。
(Appendix 12)
The arithmetic processing method according to Appendix 11, wherein the scheduler generates the index regarding each dependent source instruction for each entry in the queue and for each operand of each of the plurality of instructions.

（付記１３）
前記ステージの識別情報は、前記依存元命令が実行される最後のサイクルから第１所定数だけ前のサイクルでセットされ、前記最後のサイクルから第２所定数だけ後のサイクルでリセットされ、前記セットされるサイクルから前記リセットされるサイクルまでの間の１以上のサイクルの各々には一意の識別子が割り当てられる、
付記１２に記載の演算処理方法。
(Appendix 13)
The arithmetic processing method according to Appendix 12, wherein the identification information of the stage is set in a cycle that precedes, by a first predetermined number, the last cycle in which the dependent source instruction is executed, and is reset in a cycle that follows the last cycle by a second predetermined number, and a unique identifier is assigned to each of the one or more cycles between the cycle in which the identification information is set and the cycle in which it is reset.

（付記１４）
前記スケジューラが、前記オペランドに保持した前記指標に基づき、前記データ依存解決を行なう、
処理を実行する、付記１３に記載の演算処理方法。
(Appendix 14)
The arithmetic processing method according to Appendix 13, wherein the scheduler executes processing of performing the data dependency resolution based on the index held in the operand.

（付記１５）
前記スケジューラが、前記指標に基づき、前記依存元命令の実行結果を、前記制御に応じたタイミングで前記依存先命令が実行されるパイプラインに発行された前記依存先命令のオペランドにフォワーディングする制御を実行する、
処理を実行する、付記１０～付記１４のいずれか１項に記載の演算処理方法。
(Appendix 15)
The arithmetic processing method according to any one of Appendices 10 to 14, wherein the scheduler executes processing of performing, based on the index, control to forward the execution result of the dependent source instruction to an operand of the dependent destination instruction that has been issued, at a timing according to the control, to the pipeline in which the dependent destination instruction is executed.

（付記１６）
前記スケジューラが、前記指標に基づき、前記依存元命令の依存チェーンに先行ロード命令が存在し、前記先行ロード命令がキャッシュミスした場合に、当該先行ロード命令の後続の依存チェーン内の全ての命令をキャンセルする、
処理を実行する、付記１０～付記１５のいずれか１項に記載の演算処理方法。
(Appendix 16)
The arithmetic processing method according to any one of Appendices 10 to 15, wherein the scheduler executes processing of cancelling, based on the index, all instructions in the dependency chain subsequent to a preceding load instruction when the preceding load instruction exists in the dependency chain of the dependent source instruction and causes a cache miss.

（付記１７）
前記スケジューラが、前記キャンセルにおいて、前記先行ロード命令の後続の依存チェーンの全ての命令について、データ依存解決の成否を示す情報に否を示す値をセットする、
処理を実行する、付記１６に記載の演算処理方法。
(Appendix 17)
The arithmetic processing method according to Appendix 16, wherein, in the cancellation, the scheduler executes processing of setting a value indicating failure in information indicating the success or failure of data dependency resolution for all instructions in the dependency chain subsequent to the preceding load instruction.

（付記１８）
前記スケジューラが、前記指標に基づき、Inflight Condition Flagの更新を制御する、
処理を実行する、付記１０～付記１７のいずれか１項に記載の演算処理方法。
(Appendix 18)
The arithmetic processing method according to any one of Appendices 10 to 17, wherein the scheduler executes processing of controlling updating of the Inflight Condition Flag based on the index.

１プロセッサ
１１依存解決判定部
１１ａキャッシュミス判定タイミング生成回路
１１ｂＰＳ情報更新回路
１１ｃ、３０７信号
１２フォワーディング部
１３、１５、１７情報生成部
１４制御部
１６ CF更新タイミング判定部
１８キャプチャ部
１００メモリ
１０１命令フェッチ制御ＰＣ
１０２命令フェッチ部
１０３デコーダ
１０４アロケータ
１０５スケジューラ
１０５ａペイロードデータ記憶部
１０５ｃ選択部
１０５ｄ、２０５、２１３セレクタ
１０４ｉ生成回路
１０６演算器
１０７ロード／ストアキュー
１０８レジスタ
１０９ａ演算パイプ
１０９ｂロードパイプ
２０７ＰＳ情報
２０８ステージＩＤ
２０９パイプラインＩＤ
２１１キャンセル回路
３０１最短フォワードタイミング判定比較回路
３０２キャッシュミス判定信号
３０５キャッシュミス判定回路
３０６ＡＮＤ回路
1 Processor
11 Dependency resolution determination unit
11a Cache miss determination timing generation circuit
11b PS information update circuit
11c, 307 Signal
12 Forwarding unit
13, 15, 17 Information generation unit
14 Control unit
16 CF update timing determination unit
18 Capture unit
100 Memory
101 Instruction fetch control PC
102 Instruction fetch unit
103 Decoder
104 Allocator
105 Scheduler
105a Payload data storage unit
105c Selection unit
105d, 205, 213 Selector
104i Generation circuit
106 Arithmetic unit
107 Load/store queue
108 Register
109a Arithmetic pipe
109b Load pipe
207 PS information
208 Stage ID
209 Pipeline ID
211 Cancellation circuit
301 Shortest forward timing determination comparison circuit
302 Cache miss determination signal
305 Cache miss determination circuit
306 AND circuit

Claims

An arithmetic processing device that executes, in parallel, instructions among a plurality of instructions in order starting from those that are executable, the arithmetic processing device comprising:
a queue that stores the plurality of instructions; and
a control unit that holds an index indicating a pipeline in which a dependent source instruction among the plurality of instructions is executed and an execution stage of the dependent source instruction in the pipeline, performs data dependency resolution between the dependent source instruction and a dependent destination instruction that uses an execution result of the dependent source instruction, and controls issue timing of the plurality of instructions.

The arithmetic processing device according to claim 1, wherein the index includes:
identification information of the pipeline to which the dependent source instruction was issued; and
identification information of a stage, uniquely assigned to each stage from the issue timing at which the execution result can be transferred from the dependent source instruction to the dependent destination instruction in the shortest time, up to the issue timing at which the execution result has been stored in a register and the dependent destination instruction can read the execution result.

The arithmetic processing device according to claim 2, wherein the control unit generates the index regarding each dependent source instruction for each entry in the queue and for each operand of each of the plurality of instructions.

The arithmetic processing device according to claim 3, wherein the identification information of the stage is set in a cycle that precedes, by a first predetermined number, the last cycle in which the dependent source instruction is executed, and is reset in a cycle that follows the last cycle by a second predetermined number, and a unique identifier is assigned to each of the one or more cycles between the cycle in which the identification information is set and the cycle in which it is reset.

The arithmetic processing device according to claim 4, wherein the control unit performs the data dependency resolution based on the index held in the operand.

The arithmetic processing device according to any one of claims 1 to 5, further comprising a forwarding control unit that, based on the index, executes control to forward the execution result of the dependent source instruction to an operand of the dependent destination instruction that has been issued, at a timing according to the control, to the pipeline in which the dependent destination instruction is executed.

The arithmetic processing device according to any one of claims 1 to 5, further comprising a cancellation control unit that, based on the index, cancels all instructions in the dependency chain subsequent to a preceding load instruction when the preceding load instruction exists in the dependency chain of the dependent source instruction and causes a cache miss.

The arithmetic processing device according to claim 7, wherein, in the cancellation, the cancellation control unit sets a value indicating failure in information indicating the success or failure of data dependency resolution for all instructions in the dependency chain subsequent to the preceding load instruction.

The arithmetic processing device according to any one of claims 1 to 5, further comprising an update control unit that controls updating of the Inflight Condition Flag based on the index.

An arithmetic processing method in an arithmetic processing device that executes, in parallel, instructions among a plurality of instructions in order starting from those that are executable, wherein a scheduler included in the arithmetic processing device executes processing comprising:
storing the received plurality of instructions in a queue;
holding an index indicating a pipeline in which a dependent source instruction among the plurality of instructions stored in the queue is executed and an execution stage of the dependent source instruction in the pipeline;
performing data dependency resolution between the dependent source instruction and a dependent destination instruction that uses an execution result of the dependent source instruction; and
controlling issue timing of the plurality of instructions.