JP2015203950A

JP2015203950A - Arithmetic processing unit and control method for arithmetic processing unit

Info

Publication number: JP2015203950A
Application number: JP2014082660A
Authority: JP
Inventors: Takekazu Tabata; Toshio Yoshida; Yasunobu Akizuki
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2014-04-14
Filing date: 2014-04-14
Publication date: 2015-11-16
Anticipated expiration: 2034-04-14
Also published as: JP6340887B2

Abstract

PROBLEM TO BE SOLVED: To efficiently execute a SIMD (Single Instruction Multiple Data) indirect access instruction. SOLUTION: An arithmetic processing unit includes: an instruction decoder; a memory access entry part RSA; a memory access pipeline EAGA; a multi-data instruction entry part RSF; a plurality of arithmetic units 330; and a plurality of registers for multi-data instructions. The plurality of arithmetic units 330 process, in parallel, the entries of multi-data instructions output from the RSF. In response to the output of an entry of a multi-data indirect memory access instruction, an arithmetic pipeline FLA generates, in the EAGA, a plurality of memory access requests corresponding to that instruction, and the plurality of arithmetic units 330 supply a plurality of memory addresses acquired from the plurality of registers for multi-data instructions to the EAGA.

Description

The present invention relates to an arithmetic processing unit and a control method for the arithmetic processing unit.

Techniques such as superscalar execution, out-of-order execution, and SIMD (Single Instruction Multiple Data) are known as methods for speeding up a CPU, which is an arithmetic processing unit, or a CPU core, which is an arithmetic processing section. For example, a superscalar processor processes a plurality of instructions simultaneously, and an out-of-order processor executes instructions in whatever order the resources in the CPU core become available to process them, and then completes them in program order.

SIMD, on the other hand, processes a plurality of data in parallel with one instruction. SIMD processing uses registers called SIMD registers. A SIMD register stores, as a single unit, a number of data items equal to or greater than the number that a SIMD instruction can process in parallel. An operand specified by the SIMD instruction selects the SIMD register that holds these data items, and the instruction is executed on each of the data items in that register. Each of these data items is called an element, and the number of elements processed in parallel is called the SIMD width. A SIMD instruction, for example, reads the first, second, and third source operands specified by the instruction from SIMD registers, performs the SIMD operation, and writes the result to the destination operand. In this SIMD processing, the same instruction is executed in parallel on a plurality of data items.

Instructions other than SIMD instructions use registers called general-purpose registers. A general-purpose register stores, for example, data used for memory access or data that is not processed in parallel by SIMD.

Conventionally, data is transferred between memory and a SIMD register using an address stored in a general-purpose register. For a memory access instruction, the operand address generator reads the first and second source operands specified by the instruction from general-purpose registers and generates the address for the memory access. Using this address as the head address, the processor either reads SIMD-width data from a contiguous address area of the memory and loads it into the SIMD register specified by the destination operand, or stores SIMD-width data read from a SIMD register into a contiguous address area of the memory.
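As a minimal sketch of the conventional contiguous transfer described above, the following Python models memory as a dictionary. The function name, the 8-byte element size, and the choice of adding the two source operands to form the head address are assumptions made for illustration; the source does not specify them.

```python
ELEM_SIZE = 8  # assumed element size in bytes (not stated in the source)


def simd_load_contiguous(memory, src1, src2, simd_width):
    """Load simd_width elements from consecutive addresses.

    Assumption: the head address is the sum of the two source operands
    read from general-purpose registers.
    """
    head = src1 + src2
    return [memory[head + i * ELEM_SIZE] for i in range(simd_width)]


# Four data items laid out contiguously starting at address 0x100.
memory = {0x100 + i * ELEM_SIZE: 10 * i for i in range(4)}
simd_reg = simd_load_contiguous(memory, 0x100, 0x0, simd_width=4)
```

A single head address is enough here only because the four items sit at consecutive addresses.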

JP 2009-163442 A; Japanese Patent Publication No. 4-79026; JP 2004-38750 A; JP 2011-34450 A

However, when the data to be processed in parallel does not reside in a contiguous address area of the memory, the SIMD load and SIMD store instructions described above cannot be applied.

Furthermore, a conventional CPU core has no configuration for executing, as a SIMD instruction, an indirect memory access instruction that accesses memory by designating as a source operand a SIMD register holding a plurality of independent addresses. A conventional CPU core therefore transfers the data with as many memory access instructions as the SIMD width, using, for example, a configuration that allows each element of the SIMD register to be accessed individually.

Alternatively, the indirect access instruction could be decomposed into a plurality of instructions, as is done for a block load instruction, and as many instructions as the SIMD width could be executed sequentially. However, executing a plurality of instructions in this way uses the instruction decoder as many times as the SIMD width and consumes as many reservation station entries and commit stack entries as the SIMD width, so it uses many internal resources of the CPU core. Moreover, because the instruction decoder is occupied for multiple cycles, subsequent instructions that have no dependency cannot be executed out of order, and the out-of-order capability of the CPU core goes unused.

Accordingly, one object of the present embodiment is to provide an arithmetic processing unit, and a control method for the arithmetic processing unit, that execute with a single instruction an access to a plurality of locations in a memory area, using a plurality of independent data stored in a register as addresses.

A first aspect of the present embodiment is an arithmetic processing unit comprising:
an instruction decoder that decodes instructions;
a memory access entry unit (RSA) in which the instruction decoder creates entries for memory access instructions;
a memory access pipeline (EAGA) that executes, on a memory, the entries of the memory access instructions output from the memory access entry unit;
a multi-data instruction entry unit (RSF) in which the instruction decoder creates entries for multi-data instructions, each of which processes a plurality of data with one instruction; and
an arithmetic pipeline (FLA) that has a plurality of arithmetic units and a plurality of multi-data instruction registers, processes the entries of the multi-data instructions output from the multi-data instruction entry unit in parallel with the plurality of arithmetic units, and stores the operation results in the plurality of multi-data instruction registers,
wherein, in response to the output of an entry of a multi-data indirect memory access instruction that accesses the memory at a plurality of memory addresses stored in the plurality of multi-data instruction registers, the arithmetic pipeline generates, in the memory access pipeline, a plurality of memory access requests corresponding to the multi-data indirect memory access instruction, and the plurality of arithmetic units supply the plurality of memory addresses acquired from the plurality of multi-data instruction registers to the memory access pipeline.

According to the first aspect, a multi-data indirect memory access instruction is executed efficiently with few resources.

FIG. 1 is a diagram explaining the indirect memory access method that the arithmetic processing unit of the present embodiment can implement.
FIG. 2 is a diagram showing an example of pipeline processing by a block load instruction.
FIG. 3 is a diagram showing pipeline processing by the SIMD indirect memory access instruction in the present embodiment.
FIG. 4 is a diagram showing an information processing apparatus equipped with the arithmetic processing unit of the present embodiment.
FIG. 5 is a diagram showing the overall configuration of the CPU core 30.
FIG. 6 is a diagram showing the configuration of a CPU core that executes the SIMD indirect memory access (load or store) instruction of the present embodiment.
FIG. 7 is a diagram showing the flag structure stored as an entry in the floating-point arithmetic reservation station RSF.
FIG. 8 is a flowchart showing the processing of a SIMD indirect load instruction and a normal load instruction.
FIG. 9 is a flowchart showing the processing of a SIMD indirect load instruction and a normal load instruction.
FIG. 10 is a diagram showing the pipeline and time chart of a SIMD indirect load instruction, which is one kind of SIMD indirect memory access.
FIG. 11 is a diagram showing the configuration of a CPU core that executes the SIMD indirect memory access (load or store) instruction of the present embodiment.
FIG. 12 is a diagram showing the configuration of the arithmetic interface 331 and the address interface 310.
FIG. 13 is a diagram showing the pipeline and the changes in the input and output signals of the address interface 310 for a SIMD indirect memory access instruction with a SIMD width of 2.
FIG. 14 is a diagram showing the pipeline and the changes in the input and output signals of the address interface circuit 310 for a SIMD indirect memory access instruction with a SIMD width of 4.
FIG. 15 is a diagram showing the configuration of the arithmetic interface 331 and the address interface 310 for the case of FIG. 6, which has one memory access pipeline EAGA.
FIG. 16 is a diagram showing a collision with a memory access subsequently issued from the RSA, for a SIMD width of 2.
FIG. 17 is a diagram showing a collision with a memory access subsequently issued from the RSA, for a SIMD width of 4.
FIG. 18 is a diagram showing a collision with memory access requests generated by issuing the entry of a subsequent SIMD indirect memory access instruction, for a SIMD width of 4.
FIG. 19 is a diagram showing the configuration of the arithmetic interface 333 that generates a suppression signal for avoiding collisions between indirect memory access requests.
FIG. 20 is a diagram showing the RSF and its output suppression circuit for entries of SIMD indirect memory access instructions.
FIG. 21 is a diagram showing the RSA and its output suppression circuit for entries of normal memory access instructions.
FIG. 22 is a diagram showing the completion waiting circuit in the CSE.

In the present embodiment, an instruction that processes a plurality of data with one instruction is called a SIMD instruction (or a multi-data instruction). For a SIMD instruction, for example, as many arithmetic units as the SIMD width process as many data items as the SIMD width in parallel, and the results are stored in a SIMD register, which groups as many registers as the SIMD width into one register unit.

FIG. 1 is a diagram explaining the indirect memory access method that the arithmetic processing unit of the present embodiment can implement. FIG. 1 shows an indirect memory access method in which a plurality of independent data stored in the SIMD register 332_1 are used as addresses and a plurality of locations in the memory area 14 are accessed with one instruction. The example of FIG. 1 has a SIMD width of 4, and an instruction that performs such an indirect memory access is called a SIMD indirect memory access instruction.

FIG. 1(A) shows an example of a SIMD indirect load instruction. Four independent data stored in the SIMD register 332_1 are used as addresses, and the data DATA_0 to DATA_3 at the four addresses ADD_0 to ADD_3 in the memory 14 are read and written to another SIMD register 332_2. This SIMD indirect load instruction is written, for example, as follows.
load %f100 %f200
Here, %f100 is the register number of the SIMD register 332_1 in which the addresses are stored, and %f200 is the register number of the SIMD register 332_2 to which the data is written.
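The behavior of the SIMD indirect load can be sketched in software as a gather. The following Python is only an illustrative model, not the hardware mechanism: memory is a dictionary, a SIMD register is a list of four elements, and the function name and the concrete addresses and values are hypothetical.

```python
def simd_indirect_load(memory, addr_reg):
    """Return a register whose element i holds memory[addr_reg[i]]."""
    return [memory[addr] for addr in addr_reg]


# Hypothetical memory contents at four non-contiguous addresses.
memory = {0x1000: 11, 0x2040: 22, 0x0010: 33, 0x3008: 44}

f100 = [0x1000, 0x2040, 0x0010, 0x3008]  # ADD_0 .. ADD_3
f200 = simd_indirect_load(memory, f100)  # DATA_0 .. DATA_3
```

Each element of the address register is independent, so the four locations need not be adjacent or ordered.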

FIG. 1(B) shows an example of a SIMD indirect store instruction. Four independent data stored in the SIMD register 332_1 are used as addresses, and the data DATA_0 to DATA_3 in another SIMD register 332_3 are written to the areas at the four addresses ADD_0 to ADD_3 in the memory 14. This SIMD indirect store instruction is written, for example, as follows.
store %f100 %f300
Here, %f100 is the register number of the SIMD register 332_1 in which the addresses are stored, and %f300 is the register number of the SIMD register 332_3 in which the write data is stored.
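The SIMD indirect store corresponds to a scatter. As with any software model of this instruction, the sketch below is an assumption-laden illustration: memory is a dictionary, registers are lists, and the function name, addresses, and data values are hypothetical.

```python
def simd_indirect_store(memory, addr_reg, data_reg):
    """Write data_reg[i] to memory[addr_reg[i]] for every element i."""
    for addr, data in zip(addr_reg, data_reg):
        memory[addr] = data


memory = {}
f100 = [0x1000, 0x2040, 0x0010, 0x3008]  # ADD_0 .. ADD_3
f300 = [5, 6, 7, 8]                      # DATA_0 .. DATA_3
simd_indirect_store(memory, f100, f300)
```

After the call, each of the four independent locations holds the element with the same index from the data register.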

In the above case, the four independent addresses are written to the SIMD register 332_1 by, for example, executing four load instructions. Alternatively, the four independent addresses are written to consecutive addresses in memory beforehand, and a SIMD load instruction whose source address is the head address of that memory area is executed.

FIG. 2 is a diagram showing an example of pipeline processing by a block load instruction. The block load instruction here is, for example, an instruction that writes data from a contiguous area of memory into a plurality of general-purpose registers. When a block load instruction is decoded, the instruction decoder generates a plurality of memory access instructions; these are decoded by the instruction decoder one after another, entered in the memory access reservation station, and executed as memory accesses. In other words, this is a multi-flow scheme.

Therefore, to load four data items from memory, the block load instruction is divided into four memory access instructions, and instruction decoding, entry into the reservation station, and memory access are each repeated four times. As a result, the subsequent arithmetic instruction waits four cycles to be decoded.

If the SIMD indirect load instruction described above were implemented using this block load technique, instruction decoding, entry into the reservation station, and indirect load processing would likewise have to be repeated four times. Resources in the CPU core would be occupied for four cycles, and a subsequent arithmetic instruction could not be decoded until decoding of the last instruction of the multi-flow had completed. This forfeits the benefit of out-of-order execution that is available when the subsequent instruction has no dependency.

[Present Embodiment]
FIG. 3 is a diagram showing pipeline processing by the SIMD indirect memory access instruction in the present embodiment. In the present embodiment, the instruction decoder decodes a single SIMD indirect memory access instruction and enters a single instruction in the SIMD reservation station, and four memory accesses are then executed in succession. Because the instruction decoder is released after one cycle, a subsequent arithmetic instruction that has no dependency on the SIMD indirect memory access instruction can be decoded in the next cycle, so the benefit of out-of-order execution is retained. In addition, although not shown in FIG. 3, only one SIMD indirect memory access instruction is entered in the SIMD reservation station, so multiple reservation station entries are not needed, and only one commit stack entry is used. The resources in the CPU core are therefore used efficiently.
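The difference in resource use between the multi-flow approach and the single-entry approach can be tallied with a small model. This is a rough sketch that only counts decoder cycles and entries as described in the text; the function name and dictionary keys are hypothetical, and real pipeline timing involves more stages than are modeled here.

```python
def resource_usage(simd_width, multiflow):
    """Count core resources used by one indirect access of simd_width elements.

    multiflow=True models splitting into one instruction per element;
    multiflow=False models a single SIMD indirect memory access instruction.
    """
    instructions = simd_width if multiflow else 1
    return {
        "decode_cycles": instructions,
        "reservation_station_entries": instructions,
        "commit_stack_entries": instructions,
        "memory_accesses": simd_width,  # the same in both schemes
    }


split = resource_usage(4, multiflow=True)
single_entry = resource_usage(4, multiflow=False)
```

In both cases four memory accesses are performed; only the decoder occupancy and the number of entries differ.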

FIG. 4 is a diagram showing an information processing apparatus equipped with the arithmetic processing unit of the present embodiment. An information processing apparatus 10 such as a computer has a CPU/memory board 12 and a hard disk 11, which is a large-capacity storage device. The CPU/memory board 12 has an arithmetic processing unit 20, which is a CPU chip; an interconnect 13 that connects the arithmetic processing unit 20 to the external hard disk 11 and the like; and a memory 14 such as a DRAM.

The arithmetic processing unit 20 has, for example, four CPU cores (arithmetic processing sections) 30A-30D, a secondary cache 24 shared by the four CPU cores, an input/output interface 26, and a memory access controller 28 that controls access to the main memory 14.

FIG. 5 is a diagram showing the overall configuration of the CPU core 30. The CPU core 30 has a branch prediction unit 302 that predicts branch instructions; an instruction fetch address generator 301 that generates an instruction fetch address based on the program counter PC and the prediction of the branch prediction unit 302; a primary instruction cache 303; an instruction decoder 305 that decodes fetched instructions; a register renaming unit 306; a memory access reservation station RSA (Reservation Station for Address generate); an integer arithmetic reservation station RSE (Reservation Station for Execute); a floating-point SIMD reservation station RSF (Reservation Station for Floating); a branch reservation station RSBR (Reservation Station for Branch); and a commit stack entry CSE (Commit Stack Entry).

The memory access pipeline EAGA of the memory access reservation station RSA has an address interface 310, an operand address generator 311, an address selection circuit 313, and a primary data cache 312. The integer arithmetic pipeline EXA of the integer arithmetic reservation station RSE has an arithmetic interface 333, a fixed-point arithmetic unit 320, a fixed-point renaming register 321, and a fixed-point register 322.

The floating-point SIMD arithmetic pipeline FLA of the floating-point SIMD reservation station RSF has an arithmetic interface 333, as many SIMD arithmetic units 330 as the maximum SIMD width, a floating-point SIMD renaming register 331, and a floating-point SIMD register 332. The CPU core 30 further has two program counters, PC and NEXTPC, and a store buffer STB that temporarily stores data generated by the arithmetic units 320 and 330.

The SIMD width of the SIMD arithmetic units 330 can be specified by an instruction as, for example, 2 or 4. The floating-point SIMD register consists of four elements, matching the maximum SIMD width. These elements are called element 0, element 1, element 2, and element 3. A floating-point operation with a SIMD width of 2 uses elements 0 and 1 of the SIMD register, and one with a SIMD width of 4 uses all of its elements.
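The mapping from SIMD width to the register elements that take part in an operation can be written down directly. The following is a small sketch under the stated widths of 2 and 4; the function names are hypothetical, and leaving the inactive elements unchanged is an assumption the text does not confirm.

```python
MAX_SIMD_WIDTH = 4  # the floating-point SIMD register has four elements


def active_elements(simd_width):
    """Return the element indices used for the given SIMD width."""
    if simd_width not in (2, 4):
        raise ValueError("this example supports SIMD widths 2 and 4 only")
    return list(range(simd_width))


def simd_add(reg_a, reg_b, simd_width):
    """Add the active elements; inactive ones keep reg_a's value (assumption)."""
    result = list(reg_a)
    for i in active_elements(simd_width):
        result[i] = reg_a[i] + reg_b[i]
    return result


width2 = simd_add([1, 2, 3, 4], [10, 20, 30, 40], simd_width=2)
width4 = simd_add([1, 2, 3, 4], [10, 20, 30, 40], simd_width=4)
```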

The memory access pipeline EAGA, the integer arithmetic pipeline EXA, and the floating-point SIMD arithmetic pipeline FLA may each consist of one pipeline or of two or more pipelines, and each can execute instructions independently. When the memory access pipeline EAGA has two pipelines, the primary data cache 312 may correspondingly be given two ports so that it can be accessed with up to two addresses at the same time. The memory access pipeline EAGA may also have four pipelines, the same number as the maximum SIMD width. In that case, the primary data cache 312 desirably also has four ports so that it can be accessed with up to four addresses at the same time.

The instruction fetch address generator 301 selects an instruction address from the branch prediction unit 302 or the program counter PC and issues an instruction fetch request to the primary instruction cache 303. The primary instruction cache 303 stores the instruction corresponding to the instruction fetch request in the instruction buffer 304. Instructions are supplied from the instruction buffer 304 to the instruction decoder 305 in the order specified by the program, that is, in order, and the instruction decoder 305 decodes the instructions supplied from the instruction buffer in order.

According to the type of the decoded instruction, the instruction decoder 305 creates the necessary entry for the instruction in one of the reservation stations RSA, RSE, RSF, and RSBR. The instruction decoder 305 also creates an entry in the CSE for every decoded instruction.

When an entry is created in any of the reservation stations RSA, RSE, and RSF, the register renaming unit 306 assigns an address of the renaming registers 321 and 331 to the address of each register used in the processing of the instruction.

Each of the reservation stations RSA, RSE, and RSF outputs its held entries to the pipeline in the order in which the resources needed for processing (data, arithmetic units, registers, and so on) become ready, and causes the downstream pipeline EAGA, EXA, or FLA to execute the processing corresponding to the output entry. Instructions are thereby executed out of order.
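The out-of-order issue from a reservation station can be illustrated with a minimal model. The sketch below assumes that, among the ready entries, the oldest one is issued first; the text only says that entries are issued as their resources become ready, so this selection policy, the `Entry` class, and the instruction names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    ready: bool  # True once the data, arithmetic unit, and registers are available


def issue_next(entries):
    """Remove and return the oldest ready entry, or None if none is ready."""
    for i, entry in enumerate(entries):
        if entry.ready:
            return entries.pop(i)
    return None


# Entries are held in program order; the first one is still waiting for data.
station = [Entry("load A", False), Entry("add B", True), Entry("mul C", True)]
first_issued = issue_next(station)
```

Here the second instruction in program order is issued ahead of the first, which is the out-of-order behavior the paragraph describes.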

The floating-point arithmetic reservation station RSF stores, for example, entries corresponding to SIMD arithmetic instructions. One pipeline FLA has as many SIMD arithmetic units 330 as the SIMD width. Based on the entry from the RSF, the SIMD arithmetic units 330 select the data to operate on and execute the operation in parallel across as many SIMD arithmetic units as the SIMD width. The operation results are temporarily stored in the floating-point SIMD renaming register 331.

In the memory access reservation station RSA, the instruction decoder 305 creates and stores entries for memory access instructions other than SIMD indirect memory access instructions. The RSA selects one of its stored entries and outputs it to the pipeline. When an entry of a memory access instruction is output to the pipeline, the memory access request corresponding to that entry passes through the pipeline stages in order. The operand address generator 311 selects the data to operate on based on the memory access request of the RSA entry, generates an address, and inputs the memory access request to the primary data cache 312 with the generated address. The primary data cache 312 executes the memory access for the request.

The commit stack entry CSE holds entries for all instructions decoded by the instruction decoder 305, manages the execution status of the processing corresponding to each entry, and completes the instructions in order. For example, when the CSE determines that the result of the processing for the entry to be completed next has been stored in the fixed-point renaming register 321 or the floating-point SIMD renaming register 331, it causes the stored data to be output to the fixed-point register 322 or the floating-point SIMD register 332. Instructions executed out of order from the reservation stations are thereby completed in order.
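In-order completion by the CSE can be sketched as a queue that only releases its head. This is an illustrative model, not the circuit itself: the function name is hypothetical, instructions are represented by integer identifiers in program order, and the copy from renaming register to architectural register is reduced to removing the entry.

```python
def commit_in_order(cse, finished):
    """Complete entries from the head of cse while their results are available.

    cse: list of instruction ids in program order (modified in place).
    finished: set of ids whose results are in a renaming register.
    Returns the ids completed by this call, in program order.
    """
    committed = []
    while cse and cse[0] in finished:
        committed.append(cse.pop(0))
    return committed


cse = [0, 1, 2, 3]
# Instructions 1 and 2 finished first, but instruction 0 has not.
blocked = commit_in_order(cse, finished={1, 2})
# Once instruction 0 finishes, 0, 1, and 2 complete together in order.
released = commit_in_order(cse, finished={0, 1, 2})
```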

In the CPU core 30 of FIG. 5, the fixed-point arithmetic pipeline EXA does not have a SIMD configuration, whereas the floating-point arithmetic pipeline FLA does and has as many SIMD arithmetic units 330 as the maximum SIMD width. The fixed-point arithmetic pipeline EXA may, however, also have a SIMD configuration.

The CPU core 30 of this embodiment has a bus 334 for supplying the output signal of the arithmetic interface 333 of the floating-point SIMD arithmetic pipeline FLA to the address interface 310 of the memory access pipeline EAGA so that memory access instructions are generated there, and a bus 335 for supplying the addresses acquired by the SIMD arithmetic units 330 to the address selection circuit 313. The address interface 310 outputs to the memory access pipeline EAGA as many memory access instructions as the SIMD width, generated based on the output signal of the arithmetic interface 333. The address selection circuit 313 selects the bus 335 from the floating-point SIMD arithmetic units 330 in place of the bus from the operand address generator 311, and supplies the addresses that the SIMD arithmetic units 330 acquired from the floating-point SIMD register 332 or the floating-point SIMD renaming register 331 to the primary data cache 312 together with the SIMD-width memory access instructions described above.

[Outline of the Configuration and Processing for SIMD Indirect Memory Access Instructions in the Embodiment]
FIG. 6 is a diagram illustrating the configuration of a CPU core that executes a SIMD indirect memory access (load or store) instruction according to the present embodiment. In FIG. 6, the pipeline cycles described later are shown in parentheses.

The CPU core 30 of FIG. 6 has one memory access pipeline EAGA, to which the memory access reservation station RSA (or memory access entry unit) outputs entries of memory access instructions. It also has one SIMD arithmetic pipeline FLA, to which the floating-point SIMD reservation station RSF (or multi-data instruction entry unit) outputs entries of SIMD instructions. The SIMD arithmetic pipeline FLA has as many floating-point SIMD arithmetic units 330 as the maximum SIMD width of 4.

The processing of the SIMD indirect memory access instruction in this embodiment is outlined as follows. The instruction decoder 305 decodes the SIMD indirect memory access instruction and generates its entry in the floating-point SIMD reservation station RSF. When the RSF outputs the entry of the SIMD indirect memory access instruction to the SIMD arithmetic pipeline FLA, the SIMD arithmetic pipeline FLA responds by generating, via the bus 334, as many memory access requests in the memory access pipeline EAGA as the SIMD width. Specifically, the arithmetic interface 333 supplies the flag signal group of the issued entry to the address interface 310 via the bus 334, and the address interface 310 sequentially generates the access requests of a plurality of memory access instructions in the memory access pipeline EAGA based on that flag signal group. Alternatively, the arithmetic interface 333 may sequentially generate the access requests of the plurality of memory access instructions based on the flag signals and supply them to the address interface 310 via the bus 334 so that they are generated in the pipeline EAGA.

In addition, in response to the issue or output of the entry of the SIMD indirect memory access instruction, as many SIMD arithmetic units 330 as the SIMD width acquire the same number of addresses in parallel from the floating-point SIMD register 332 and supply them to the memory access pipeline EAGA via the bus 335. Specifically, the SIMD arithmetic units 330 supply the acquired addresses to the address selection circuit 313 one after another via the bus 335. The address selection circuit 313 selects the addresses supplied from the bus 335 in step with the timing of the access requests of the previously generated memory access instructions, and outputs them to the primary data cache 312.

Specifically, as shown at the upper right of FIG. 6, the address selection circuit 313 has a selector L5 that selects either the address flag A_EAGA_ADD of the operand address generator 311 or the bus 335. As described later, when the flag signal B1_EAGA_INDIRECT, which the address interface 310 generates in response to the entry of a SIMD indirect memory instruction being output to the SIMD arithmetic pipeline FLA, is “1”, the selector L5 selects the bus 335 side, takes the address supplied via the bus 335, and transfers it to the primary data cache 312 as the address flag A_EAGA_ADD. In this way, aligned with the transfer timing of the memory access request that the address interface 310 generated in the B1 cycle stage, the address is supplied via the bus 335 in the A cycle stage, which follows the B1 cycle stage, and a memory access request carrying the address of the SIMD indirect memory access instruction is transferred to the primary data cache 312.
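
As a minimal illustrative sketch, not part of the embodiment, the behavior of the selector L5 can be modeled in Python as a two-input multiplexer. The function name and parameter names are hypothetical; only the signal names B1_EAGA_INDIRECT and A_EAGA_ADD and the bus 335 come from the description.

```python
def select_address(b1_eaga_indirect: int,
                   operand_generator_address: int,
                   bus_335_address: int) -> int:
    """Model of selector L5; the return value stands for A_EAGA_ADD.

    Hypothetical helper: when B1_EAGA_INDIRECT is 1, the address arriving
    on bus 335 from the SIMD arithmetic units is forwarded; otherwise the
    address from the operand address generator 311 is forwarded.
    """
    if b1_eaga_indirect == 1:
        return bus_335_address
    return operand_generator_address


# A SIMD indirect request takes its address from bus 335.
indirect_address = select_address(1, 0x1000, 0x2000)
# A normal memory access keeps the operand address generator's address.
normal_address = select_address(0, 0x1000, 0x2000)
```

The sketch ignores timing; in the description the selection is aligned with the A cycle of each generated request.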

For a load instruction, the primary data cache 312 stores the plurality of data read from the primary cache 312 or from the memory 14 in the floating-point SIMD renaming register 331. Then, in response to a command from the commit stack entry, the read data is transferred from the floating-point SIMD renaming register 331 to the floating-point SIMD register 332. In each of these registers 331 and 332, as many registers as the SIMD width are specified collectively by one register number. For a store instruction, the plurality of data stored in the floating-point SIMD register 332 is written sequentially to the primary cache 312 or the memory 14.

The processing of the SIMD indirect memory access instruction is described more concretely below.

First, the instruction decoder 305 decodes the SIMD indirect memory access instruction and creates entries in the RSF and the CSE. The CSE entry number (identification information of the entered instruction) is called the IID. When an operation or memory access completes, this IID and a completion signal are reported to the CSE, and the CSE uses them to determine instruction completion. At the same time as it creates the entries, the instruction decoder 305 reserves as many consecutive fetch ports FP, a resource managed by the primary data cache 312, as the SIMD width. A fetch port FP is a resource that stores the memory address needed when the primary data cache performs a memory access, and a normal memory access instruction reserves one FP. A SIMD indirect memory access instruction accesses memory with as many addresses as the SIMD width, and therefore uses a plurality of fetch ports FP.
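
The reservation of consecutive fetch ports can be sketched as follows. This is a simplified software model under stated assumptions: the pool size, the class name FetchPortPool, and the first-fit search without wraparound are all hypothetical, since the description only states that as many consecutive FPs as the SIMD width are reserved at decode time and released at completion.

```python
from typing import List, Optional


class FetchPortPool:
    """Hypothetical model of the fetch ports FP managed by the primary data cache."""

    def __init__(self, size: int = 16) -> None:
        # size is an assumed value; the description does not give the FP count.
        self.busy: List[bool] = [False] * size

    def reserve(self, count: int) -> Optional[int]:
        """Reserve `count` consecutive free ports.

        Returns the first port number (the value the FP flag would hold),
        or None if no run of `count` free ports exists.
        """
        run = 0
        for port, in_use in enumerate(self.busy):
            run = 0 if in_use else run + 1
            if run == count:
                first = port - count + 1
                for p in range(first, port + 1):
                    self.busy[p] = True
                return first
        return None

    def release(self, first: int, count: int) -> None:
        """Release `count` ports starting at `first` (done at completion)."""
        for p in range(first, first + count):
            self.busy[p] = False


pool = FetchPortPool(size=8)
normal_load_fp = pool.reserve(1)      # a normal load reserves one FP
indirect_load_fp = pool.reserve(4)    # a width-4 indirect load reserves four
```

Keeping the ports consecutive is what lets a single FP flag in the RSF entry identify all of them, since each generated request can use the first FP number plus its element index.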

FIG. 7 is a diagram showing the flag configuration stored as an entry in the floating-point arithmetic reservation station RSF. To execute the SIMD indirect memory access instruction, an indirect flag INDIRECT and a fetch port flag FP are added. The INDIRECT flag is “1” when the decoded instruction is a SIMD indirect memory access instruction. The FP flag indicates the first FP number reserved at decode time. In addition to these, the RSF stores OPCODE, which tells the SIMD arithmetic units 330 the type of operation; the 4SIMD flag, which identifies the SIMD width of the instruction (“0” for a width of 2, “1” for a width of 4); R1_ADRS, which indicates the operand used in the operation; IID, which indicates the CSE entry number; and so on.
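
For illustration only, the entry fields named above can be written as a small Python record. The class name RsfEntry, the use of plain integers, and the simd_width helper are assumptions made for this sketch; field widths and any further fields of FIG. 7 are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RsfEntry:
    """Hypothetical record of the RSF entry flags named in the description."""
    opcode: int      # OPCODE: type of operation for the SIMD arithmetic units
    indirect: int    # INDIRECT: 1 for a SIMD indirect memory access instruction
    four_simd: int   # 4SIMD: 0 for a SIMD width of 2, 1 for a width of 4
    fp: int          # FP: first fetch port number reserved at decode time
    r1_adrs: int     # R1_ADRS: operand used in the operation
    iid: int         # IID: CSE entry number

    @property
    def simd_width(self) -> int:
        """Decode the 4SIMD flag into the SIMD width (2 or 4)."""
        return 4 if self.four_simd == 1 else 2


# Example values are arbitrary.
entry = RsfEntry(opcode=0, indirect=1, four_simd=1, fp=3, r1_adrs=5, iid=9)
```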

When the resources needed by the entry of the SIMD indirect memory access instruction are ready and the entry becomes executable, the RSF outputs (issues) that entry to the arithmetic interface 333 of the floating-point SIMD pipeline FLA.

When the instruction output from the RSF is the entry of a SIMD indirect memory access instruction, the arithmetic interface 333 transfers a flag signal group for that entry to the address interface 310 via the bus 334. The group consists of the indirect-instruction flag, the SIMD width, the IID, the FP, and a valid flag indicating whether the FLA instruction is valid. Based on this flag signal group, the address interface 310 serially generates the memory access requests of a plurality of memory access instructions in the memory access pipeline EAGA.
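
The serial generation of requests from the flag signal group can be sketched as below. The type names FlagGroup and MemoryAccessRequest and the function generate_requests are hypothetical. The rule that every request shares the IID while the FP is the first FP number plus the element index follows the timing description of FIG. 10 in this document.

```python
from typing import List, NamedTuple


class FlagGroup(NamedTuple):
    """Hypothetical model of the flag signals sent over bus 334."""
    valid: int      # FLA instruction valid
    indirect: int   # SIMD indirect memory access instruction
    four_simd: int  # 0: SIMD width 2, 1: SIMD width 4
    iid: int        # CSE entry number
    fp: int         # first reserved fetch port number


class MemoryAccessRequest(NamedTuple):
    element: int    # SIMD element index (request 0, 1, ...)
    iid: int
    fp: int


def generate_requests(flags: FlagGroup) -> List[MemoryAccessRequest]:
    """Return the requests in the order they would be generated, one per cycle."""
    if not (flags.valid and flags.indirect):
        return []
    width = 4 if flags.four_simd else 2
    return [MemoryAccessRequest(element=i, iid=flags.iid, fp=flags.fp + i)
            for i in range(width)]


requests = generate_requests(FlagGroup(valid=1, indirect=1, four_simd=1, iid=7, fp=4))
```

The list order stands in for time: request 0 is generated first and each later request follows one cycle after the previous one.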

At the same time as the plurality of memory access instructions are generated, the SIMD arithmetic units 330 read as many addresses as the SIMD width in parallel from the SIMD register 332, based on the flag signals from the arithmetic interface 333. The register number of the SIMD register 332 is given by the source operand of the SIMD indirect memory access instruction. When the addresses have been read from the SIMD register 332, the SIMD arithmetic units 330 transfer them to the address selection circuit 313 via the bus 335. The address interface 310 then aligns the timing of the memory access requests that it generated sequentially in the pipeline EAGA with the addresses transferred via the bus 335, and transfers them to the address selection circuit 313. That is, the address selection circuit 313 selects the addresses supplied from the SIMD arithmetic units 330 in place of the address from the operand address generator 311, and transfers the addresses serially to the primary data cache 312. For each of the addresses, the primary data cache 312 either reads data or writes the data held in the SIMD register.

When a data read completes, the primary data cache 312 stores the read data in the floating-point SIMD renaming register and sends the CSE the entry identification information IID of the completed read together with a completion notification. For a data write, the primary data cache 312 simply sends the CSE the entry identification information IID of the completed write together with a completion notification. Using the entry identification information IID and the completion notifications, the CSE waits until the reads or writes of all elements across the SIMD width have completed, and then completes the SIMD indirect memory access instruction.
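
The completion check performed by the CSE can be illustrated with a simple counter keyed by IID. This is a sketch under assumptions: the class CompletionTracker and its methods are hypothetical, and in-order commit across different instructions is deliberately left out.

```python
from typing import Dict


class CompletionTracker:
    """Hypothetical model of per-IID completion counting in the CSE."""

    def __init__(self) -> None:
        self._remaining: Dict[int, int] = {}

    def register(self, iid: int, simd_width: int) -> None:
        """Record that `simd_width` element accesses are outstanding for `iid`."""
        self._remaining[iid] = simd_width

    def notify(self, iid: int) -> bool:
        """Handle one completion notification from the primary data cache.

        Returns True when all elements of the instruction have completed.
        """
        self._remaining[iid] -= 1
        if self._remaining[iid] == 0:
            del self._remaining[iid]
            return True
        return False


cse = CompletionTracker()
cse.register(iid=9, simd_width=4)
results = [cse.notify(9) for _ in range(4)]
```

Only the fourth notification reports completion, which corresponds to the CSE waiting for every element before completing the instruction.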

Next, to prevent the memory access requests generated from the entry of a SIMD indirect instruction from colliding with the access requests of subsequent memory access instructions, the arithmetic interface 333 generates suppression signals that make the RSA and the RSF hold back the output of instruction entries. First, in response to the entry of the SIMD indirect memory access instruction, the arithmetic interface 333 outputs a suppression signal 336 to the RSA, causing the RSA to hold back the output of entries of subsequent memory access instructions that would collide with the memory access requests generated in the address interface 310 for that SIMD indirect memory access instruction. Second, in response to the entry of the SIMD indirect memory access instruction, the arithmetic interface 333 outputs a suppression signal 337 to the RSF, causing the RSF to hold back the output of entries of subsequent SIMD indirect memory access instructions. This prevents the multi-cycle memory access requests generated in the address interface 310 by the preceding SIMD indirect memory access instruction from colliding with the memory access requests generated for a subsequent SIMD indirect memory access instruction.

FIGS. 8 and 9 are flowcharts showing the processing of a SIMD indirect load instruction and a normal load instruction. First, the instruction is fetched (S1), stored in the instruction buffer (S2), and decoded (S3). If the decode result is a normal load instruction (NO in S4), the processing from step S5 onward is performed; if it is a SIMD indirect memory access instruction (YES in S4), the processing from step S21 onward is performed.

For a normal load instruction (NO in S4), the instruction decoder 305 reserves one fetch port FP and creates an entry for the load instruction in the RSA (S5). The RSA confirms that the entry of the load instruction is ready to be issued (YES in S6) and, if it does not collide with a memory access instruction generated from a preceding SIMD indirect memory access instruction (NO in S7), outputs (issues) the entry of the load instruction to the memory access pipeline EAGA.

In the memory access pipeline EAGA, the operand address generator 311 reads data from the fixed-point register 322 or the like (S8) and generates an address (S9), and the address selection circuit 313 selects the address from the operand address generator (S10). The primary data cache 312 then executes the processing of reading data using that address (S11). When the primary data cache 312 has stored the read data in the floating-point SIMD renaming register and completed the processing (S12), then for a normal load instruction the CSE releases one fetch port FP and transfers the read data from the SIMD renaming register 331 to the SIMD register 332 (S19).

As described above, for a normal load instruction, an entry of the load instruction is generated in the RSA, the operand address generator 311 of the memory access pipeline EAGA acquires and generates the address, and a load request is made to the primary data cache 312.

When an entry of a SIMD load instruction for consecutive memory addresses is generated in the RSA, the operand address generator 311 reads the starting address from the fixed-point register 322 or the like, and the primary data cache 312 stores the data at, for example, two consecutive addresses in two SIMD renaming registers. This SIMD load instruction for consecutive memory addresses is, however, a different instruction from the SIMD indirect memory access instruction of the present embodiment, which accesses a plurality of independent addresses held in a plurality of SIMD registers.

Next, for a SIMD indirect memory access instruction (YES in S4), the instruction decoder reserves as many fetch ports FP as the SIMD width and generates an entry for the SIMD instruction in the RSF (S21). The RSF confirms that the entry is ready to be issued (YES in S22) and, if it does not collide with a memory access instruction generated from a preceding SIMD indirect memory access instruction (NO in S23), outputs (issues) the entry of the SIMD indirect memory access instruction to the SIMD arithmetic pipeline FLA.

Then, the SIMD arithmetic units 330 of the SIMD arithmetic pipeline FLA read as many addresses as the SIMD width in parallel from the SIMD register 332 (S23). As described later, this read takes two cycles. Along with this read, the arithmetic interface 333 of the SIMD arithmetic pipeline FLA transfers the flag signal group to the address interface 310 via the bus 334 and causes memory access request 0 to be generated in the memory access pipeline EAGA (S24). The generated request flows through the memory access pipeline EAGA. Further, the SIMD arithmetic units 330 supply the SIMD-width number of addresses acquired from the SIMD register 332 to the address selection circuit 313 via the bus 335 (S25), and the address selection circuit 313, based on request 0, selects the address from the SIMD arithmetic units (S26) and assigns a fetch port FP to request 0 (S27).

The above steps S23 to S27 are carried out twice; the second pass generates memory access request 1. Further, when the SIMD width is 4 (YES in S28), steps S29 to S32, which are the same processing as steps S23 to S27, are carried out twice, generating memory access requests 2 and 3.

The primary data cache 312 then starts the processing of reading data using the addresses of the fetch ports FP (S11). The CSE detects the completion of the SIMD processing in response to two processing-completion notifications from the primary data cache (S12, S14) when the SIMD width is 2, or four such notifications (S12, S14, S16, S17) when the SIMD width is 4. It then releases as many fetch ports FP as the SIMD width and transfers the read data from the SIMD renaming register 331 to the SIMD register 332 (S20).

As described above, for a SIMD indirect load instruction, the instruction decoder decodes the instruction once and generates one entry for it in the RSF. The SIMD arithmetic pipeline FLA generates as many memory access requests as the SIMD width in the memory access pipeline EAGA, and has the plurality of SIMD arithmetic units acquire a plurality of addresses in parallel from the SIMD register. The addresses acquired by the SIMD arithmetic units are transferred to the memory access pipeline EAGA and combined with the memory access requests, and the memory access pipeline EAGA makes the load requests to the primary data cache. A SIMD indirect store instruction operates in the same way as the load instruction above, except that the primary data cache stores to memory.
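
The architectural effect of the two instructions, independent of the pipeline that implements them, can be sketched as a gather and a scatter. This is an illustrative model only: memory is represented as a Python dictionary from address to value, and the function names are hypothetical.

```python
from typing import Dict, List


def simd_indirect_load(memory: Dict[int, int], addresses: List[int]) -> List[int]:
    """Gather: element i of the result is the data at addresses[i].

    `addresses` stands for the per-element addresses held in a SIMD register.
    """
    return [memory[address] for address in addresses]


def simd_indirect_store(memory: Dict[int, int],
                        addresses: List[int],
                        values: List[int]) -> None:
    """Scatter: values[i] is written to addresses[i], element by element."""
    for address, value in zip(addresses, values):
        memory[address] = value


memory = {0x100: 11, 0x208: 22, 0x310: 33, 0x418: 44}
loaded = simd_indirect_load(memory, [0x310, 0x100, 0x418, 0x208])
simd_indirect_store(memory, [0x100, 0x418], [7, 8])
```

The addresses need not be consecutive or ordered, which is the difference from the SIMD load for consecutive memory addresses mentioned earlier.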

Next, the pipeline processing of the SIMD indirect memory access instruction of this embodiment is described. The pipeline stages of the SIMD indirect load instruction are listed below; they correspond to the stages shown in parentheses in FIG. 6.
D (Decode): The instruction decoder decodes the instruction.
DT (Decode Transfer): The instruction from the D cycle is transferred and stored in the RSF.
P (Priority): The RSF selects the instruction entry to issue to the SIMD arithmetic units and outputs (issues) it.
PT (Priority Transfer): The flag signal group of the entry from the P cycle is transferred through the arithmetic interface and issued to the SIMD arithmetic units 330.
B1, B2 (Buffer): The SIMD arithmetic units input the data needed for the operation from the registers. For example, the data is acquired from the floating-point SIMD register 332 or the renaming register 331. In this example, the acquisition takes two cycles.
X (eXecution): The SIMD arithmetic units read the data needed for the memory access.
A (Address): The SIMD arithmetic units transfer the address to be accessed in memory to the address selection circuit 313.
T (Tag): The primary data cache accesses the tag based on the address.
M (Match): The primary data cache compares the cache tag that was read.
B (Buffer): The data read from the primary data cache is buffered.
R (Result): The primary data cache access completes.
RT (Result): The data from the R cycle is transferred and written to the renaming register, and completion is reported to the CSE.
C (Commit): Instruction completion is determined, that is, whether all elements have completed.
W (Write): The completed instruction updates the various registers and releases resources. At this time, the read data is transferred from the floating-point SIMD renaming register 331 to the SIMD register 332.

The pipeline stages of a normal load instruction, that is, a load instruction other than a SIMD indirect load instruction, are listed below.
D (Decode): The instruction is decoded.
DT (Decode Transfer): The instruction from the D cycle is transferred and its entry is stored in the RSA.
P (Priority): The instruction entry to issue from the reservation station RSA to the execution unit is selected and output (issued).
B1, B2 (Buffer): The operand address generator determines the data needed to generate the load address and inputs it from the registers.
A (Address): The operand address generator calculates the address to be accessed in memory.
T (Tag): The primary data cache accesses the tag based on the calculated address.
M (Match): The primary data cache compares the cache tag that was read.
B (Buffer): The data read from the primary data cache is buffered.
R (Result): The primary data cache access completes.
RT (Result): The data from the R cycle is transferred and written to the renaming register, and completion is reported to the CSE.
C (Commit): Instruction completion is determined.
W (Write): The completed instruction updates the various registers and releases resources. At this time, the data is transferred from the renaming register to the register.

FIG. 10 is a diagram showing the pipeline and a time chart of a SIMD indirect load instruction, which is one kind of SIMD indirect memory access.

The RSF issues the entry of the SIMD indirect load instruction to the SIMD arithmetic pipeline FLA in the P cycle at timing 3. Then, in the B1 cycle at timing 5, the arithmetic interface 333 outputs the flag signals of the SIMD indirect load instruction.

For a SIMD indirect load instruction with a SIMD width of 2, the SIMD arithmetic pipeline FLA generates memory access request 0 in the pipeline EAGA within the address interface 310 at timing 6, and generates request 1 at the next timing 7. In the memory access pipeline EAGA, each generated memory access request corresponds to the B1 cycle of a load instruction other than a SIMD indirect load instruction. The generated memory accesses use the entry identification information IID sent from the SIMD arithmetic pipeline FLA, and for the fetch port FP they use the FP value from the SIMD arithmetic pipeline FLA and that value plus 1. The SIMD arithmetic units 330 of the SIMD arithmetic pipeline FLA serially transfer to the address selection circuit 313 of the memory access pipeline EAGA element 0, read from the SIMD register (or SIMD renaming register) in the X1 cycle at timing 7, and element 1, read in the X2 cycle at timing 8. At timings 8 and 9 (A cycles), the address selection circuit 313 selects the addresses of element 0 and element 1 transferred from the SIMD arithmetic pipeline FLA and transfers them to the primary data cache 312. If all of the accessed data is present in the primary data cache, the primary data cache 312 transfers the read data to the SIMD renaming register 331 at timings 13 and 14 and reports completion of all memory accesses at timing 14. As a result, the CSE determines instruction completion and transfers the data in the SIMD renaming register 331 to the SIMD register 332.

For a SIMD indirect load instruction with a SIMD width of 4, the SIMD arithmetic pipeline FLA serially generates memory access requests 0, 1, 2, and 3 in the pipeline EAGA within the address interface at timings 6, 7, 8, and 9. In the memory access pipeline EAGA, each of the four generated memory access requests corresponds to the B1 cycle of a load instruction other than a SIMD indirect load instruction. The generated memory access requests use the entry identification information IID sent from the SIMD arithmetic pipeline FLA, and for the fetch port FP they use the FP value from the SIMD arithmetic pipeline FLA and that value plus 1, 2, and 3. The SIMD arithmetic units 330 of the SIMD arithmetic pipeline FLA serially transfer to the address selection circuit 313 of the memory access pipeline EAGA elements 0, 1, 2, and 3 of the SIMD data read from the SIMD register (or SIMD renaming register) in the X1, X2, X3, and X4 cycles at timings 7, 8, 9, and 10. At timings 8, 9, 10, and 11 (A cycles), the address selection circuit 313 selects each of the addresses transferred from the SIMD arithmetic pipeline FLA and transfers it to the primary data cache 312. If all of the accessed data is present in the primary data cache, the primary data cache 312 transfers the read data to the SIMD renaming register 331 at timings 13, 14, 15, and 16 and reports completion of all memory accesses at timing 16. As a result, the CSE determines instruction completion and transfers the data in the SIMD renaming register 331 to the SIMD register 332.

The SIMD indirect load instruction has been described above. The SIMD indirect store instruction is the same in three respects: the SIMD operation pipeline FLA generates as many memory access requests as the SIMD width in the memory access pipeline EAGA; the SIMD arithmetic units obtain as many addresses as the SIMD width from the SIMD register in parallel and transfer them serially to the memory access pipeline EAGA; and as many memory store requests as the SIMD width are issued to the primary data cache. In the case of the SIMD indirect store instruction, the primary data cache writes as many data items as the SIMD width, stored in the SIMD register, to the primary cache memory or to memory.

[Detailed description of SIMD indirect memory access in this embodiment]
FIG. 11 is a diagram illustrating the configuration of a CPU core that executes a SIMD indirect memory access (load or store) instruction according to the present embodiment. SIMD indirect memory access in the CPU core configuration of FIG. 11 is described in detail below.

The CPU core 30 of FIG. 11 differs from FIG. 6 in that the memory access reservation station RSA (or memory access entry unit) has two pipelines, EAGA and EAGB, as memory access pipelines to which it outputs entries of memory access instructions. Correspondingly, the primary data cache 312 is configured to process two memory access requests in parallel. The floating-point SIMD reservation station RSF (or multi-data instruction entry unit) has two pipelines, FLA and FLB, as SIMD operation pipelines to which it outputs entries of SIMD instructions. The SIMD operation pipelines FLA and FLB each have as many floating-point SIMD arithmetic units 330 as the maximum SIMD width of 4. In the floating-point SIMD register 332 and the floating-point SIMD renaming register 331, as many registers as the maximum SIMD width of 4 can be designated collectively by a single register number. The rest of the configuration is the same as in FIG. 6.

Therefore, in response to an entry of a SIMD indirect memory access instruction, the SIMD operation pipeline FLA can generate two memory access requests simultaneously in the two memory access pipelines EAGA and EAGB, and the SIMD arithmetic units 330 can transfer two addresses obtained from the SIMD register 332 in parallel to the address selection circuits 313 of the two memory access pipelines EAGA and EAGB. This is shown in FIG. 13, described later.

When the SIMD width is 2, the SIMD operation pipeline FLA generates two memory access requests in the two pipelines EAGA and EAGB in one cycle via the bus 334. That is, the arithmetic interface 333 transfers a group of flag signals to the address interface 310 via the bus 334, and the address interface 310 generates two memory access requests in the two pipelines EAGA and EAGB in one cycle based on the transferred flag signals. The SIMD arithmetic units 330 then transfer two addresses to the address selection circuits 313 of the two pipelines EAGA and EAGB in one cycle via the bus 335. As described above, the selector L5 (see FIG. 6) in each address selection circuit 313 selects the bus 335 side when the indirect flag signals B1_EAGA_INDIRECT and B1_EAGB_INDIRECT are "1", and outputs the two addresses supplied from the SIMD arithmetic units 330 to the two pipelines EAGA and EAGB. As a result, the two addresses supplied from the bus 335 are attached to the two memory access requests generated by the address interface 310 at the cycle A stage of the address selection circuit 313.

When the SIMD width is 4, the SIMD operation pipeline FLA generates four memory access requests in the two pipelines EAGA and EAGB over two cycles, and transfers four addresses over two cycles. This is shown in FIG. 14.
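The relationship between SIMD width, the number of memory access pipelines, and the number of request-generation cycles can be illustrated with a short behavioral sketch. It is not part of the disclosed circuit: the function name `request_schedule`, the default first timing of 6 (taken from the timing charts), and the assignment of even elements to EAGA (index 0) and odd elements to EAGB (index 1) are assumptions made for illustration.

```python
from math import ceil


def request_schedule(simd_width, num_pipelines, first_timing=6):
    """Return (timing, pipeline_index, element_index) for each generated request.

    Assumption for illustration: elements are assigned round-robin,
    so with two pipelines element 0 goes to index 0 (EAGA) and
    element 1 goes to index 1 (EAGB) in the same cycle.
    """
    schedule = []
    for element in range(simd_width):
        timing = first_timing + element // num_pipelines
        pipeline = element % num_pipelines
        schedule.append((timing, pipeline, element))
    return schedule


def generation_cycles(simd_width, num_pipelines):
    """Number of cycles needed to generate all requests."""
    return ceil(simd_width / num_pipelines)
```

With one pipeline (FIG. 6) a width-4 instruction occupies four consecutive timings, whereas with two pipelines (FIG. 11) it occupies two, which is the difference the text describes.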

The flowcharts of FIGS. 8 and 9 also apply to the CPU core configuration of FIG. 11. However, because the CPU core of FIG. 11 has two memory access pipelines EAGA and EAGB, steps S24 to S27 and steps S29 to S32 of FIG. 8 each need to be performed only once.

[Generation of indirect memory access requests by the arithmetic interface 333 and the address interface 310]
FIG. 12 is a diagram showing the configuration of the arithmetic interface 333 and the address interface 310. From the entry of an operation instruction issued by the RSF, the arithmetic interface 333 outputs control signals at the appropriate timing to the SIMD arithmetic units 330 and other circuits in the subsequent stage. Similarly, from the entry of a memory access instruction issued by the RSA, the address interface 310 outputs control signals at the appropriate timing to the operand address generator 311 and other circuits in the subsequent stage.

The address interface 310 latches, in the latch circuit groups F1_A and F1_B, the flag signals of entries of normal memory access instructions (those other than SIMD indirect memory access instructions) issued from the RSA to the two pipelines EAGA and EAGB, and transfers them to the operand address generator 311 in the subsequent stage. The arithmetic interface 333, on the other hand, transfers the flag signals of the entry of a SIMD indirect memory access instruction issued from the RSF to the SIMD operation pipeline FLA to the address interface 310 via the bus 334. Then the circuits in the address interface 310, namely the AND gates A1 and A2, the latch circuit groups F2 and F3, the selectors L1, L2, L3, and L4, the OR gates R1 and R2, and the adders ADD1 and ADD2, generate a memory access request in each of the two memory access pipelines EAGA and EAGB based on the transferred flag signals.

In FIG. 12, the address interface 310 is shown as containing the circuits enclosed by the broken line. However, part of the circuits enclosed by the broken line may instead be included in the arithmetic interface 333. In either case, through the combination of the arithmetic interface 333, the address interface 310, and the bus 334, the SIMD operation pipeline FLA generates a memory access request in each of the two memory access pipelines EAGA and EAGB.

The signals in FIG. 12 are described below.

The flag signals of the entry on the pipeline FLA side are as follows. The input signal (valid signal) B1_FLA_VALID_EAITF becomes 1 when an operation request is issued to the SIMD arithmetic units 330 of the pipeline FLA in the B1 cycle of the floating-point/SIMD pipeline.

The input signal (indirect signal) B1_FLA_INDIRECT_EAITF becomes 1 when the operation request in the B1 cycle of the floating-point/SIMD pipeline is a SIMD indirect memory access instruction.

The input signal (4SIMD signal) B1_FLA_4SIMD_EAITF becomes 1 when the SIMD width is 4 in the B1 cycle of the floating-point/SIMD pipeline.

The input signal (IID signal) B1_FLA_IID_EAITF carries the identification information IID of the entry of the instruction executed in the pipeline FLA.

The input signal (FP signal) B1_FLA_FP_EAITF carries the first number of the fetch ports FP reserved by the instruction decoder 305 for the SIMD indirect memory access instruction.

The flag signals of the entries on the pipeline EAGA and EAGB side are as follows. The input signals (valid signals) P_EAGA_VALID and P_EAGB_VALID become 1 when a memory access request is output from the RSA to the operand address generator 311 and the primary data cache 312.

The input signals (FP signals) P_EAGA_FP and P_EAGB_FP carry the fetch port number FP used in the primary data cache 312 when a memory access request is issued from the RSA to the operand address generator 311.

The input signals (IID signals) P_EAGA_IID and P_EAGB_IID carry the entry identification information IID corresponding to each request when a memory access request is issued from the RSA to the operand address generator 311.

When an entry of a SIMD indirect memory access instruction is issued to the SIMD operation pipeline FLA, the address interface circuit 310 uses the flag signals output by the arithmetic interface 333 to generate two or four memory access requests in the memory access pipelines EAGA and EAGB. These memory access requests correspond to the four output signals B1_EAGA_*** and the four output signals B1_EAGB_*** described below. When an entry of a normal memory access instruction is issued to the memory access pipelines EAGA and EAGB, the address interface circuit 310 transfers the flag signals of that entry unchanged to the operand address generator 311.

The output signal (valid signal) B1_EAGA_VALID_OR is output by the OR gate R1 and is the logical OR of the memory access request generated by a SIMD indirect memory access instruction issued by the RSF and the memory access request for a normal memory access instruction from the RSA. When this valid signal is 1, the memory access request to the operand address generator 311 and the primary data cache 312 of the memory access pipeline EAGA is valid.

The output signal (valid signal) B1_EAGB_VALID_OR is the corresponding valid signal on the memory access pipeline EAGB side and behaves in the same way.

The output signals (indirect signals) B1_EAGA_INDIRECT and B1_EAGB_INDIRECT are valid when the corresponding valid signals B1_EAGA_VALID_OR and B1_EAGB_VALID_OR are 1, and indicate that the memory access request was generated by a SIMD indirect memory access instruction. They are output by the OR gate R2. These signals are transferred to the address selection circuit 313 via the subsequent operand address generator 311, where they are used to select the address transferred from the SIMD arithmetic units 330 via the bus 335.

The output signals (IID signals) B1_EAGA_IID and B1_EAGB_IID are valid when the corresponding valid signals B1_EAGA_VALID_OR and B1_EAGB_VALID_OR are 1. For a SIMD indirect memory access instruction, the selector L4 selects the entry identification information IID of the input signal B1_FLA_IID_EAITF transferred from the arithmetic interface 333. Otherwise, the selector L4 selects the IID signals P_EAGA_IID and P_EAGB_IID from the RSA.

The output signals (FP signals) B1_EAGA_FP and B1_EAGB_FP are valid when the corresponding valid signals B1_EAGA_VALID_OR and B1_EAGB_VALID_OR are 1. For a SIMD indirect memory access instruction with a SIMD width of 2, the selector L3 selects and outputs the FP value transferred on the input FP signal B1_FLA_FP_EAITF and the FP value obtained by adding 1 to it with the adder ADD2. With a SIMD width of 4, in the next clock cycle the selector L3 also selects and outputs the FP value obtained by adding 2 to the transferred FP value with the adder ADD1 and the FP value obtained by adding a further 1 with the adder ADD2. For example, for a SIMD indirect memory access instruction with a SIMD width of 4 where the value transferred on the FP signal B1_FLA_FP_EAITF is 5, the FP signal B1_EAGA_FP of the request generated in the pipeline EAGA at timing 6 of FIG. 14 is 5 and the FP signal B1_EAGB_FP of the request generated in the pipeline EAGB is 6; at timing 7, the FP signal B1_EAGA_FP of the request generated in the pipeline EAGA is 7 and the FP signal B1_EAGB_FP of the request generated in the pipeline EAGB is 8. If the instruction is not a SIMD indirect memory access instruction, the selector L3 selects the FP signals P_EAGA_FP and P_EAGB_FP from the RSA, respectively.
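The selection between FLA-derived and RSA-derived signals described for the output signals above can be sketched as a small behavioral model of one pipeline's B1 outputs. This is only an illustration under stated assumptions: the `Entry` class, the function `eag_b1_outputs`, and the `fp_offset` parameter (standing in for the adders) are hypothetical names, and the model simply gives the FLA side priority, whereas the embodiment avoids simultaneous requests through the inhibition signals described later.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    valid: int  # 1 when a request is present
    iid: int    # entry identification information IID
    fp: int     # fetch port number FP


def eag_b1_outputs(fla, rsa, fp_offset=0):
    """Behavioral sketch of one memory access pipeline's B1 output signals.

    fla: latched request derived from the FLA flags (valid already means
         valid AND indirect, i.e. the output of AND gate A1).
    rsa: normal memory access request from the RSA.
    fp_offset: hypothetical stand-in for the adders (0 for EAGA, 1 for EAGB).
    """
    valid_or = fla.valid | rsa.valid                      # role of OR gate R1
    indirect = fla.valid                                  # request came from FLA
    iid = fla.iid if indirect else rsa.iid                # role of selector L4
    fp = (fla.fp + fp_offset) if indirect else rsa.fp     # role of selector L3
    return {"valid_or": valid_or, "indirect": indirect, "iid": iid, "fp": fp}
```

For example, with an FLA-derived request carrying FP 5, the EAGA output uses FP 5 and the EAGB output (offset 1) uses FP 6, matching the timing 6 values in the example above.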

FIG. 13 is a diagram showing the pipelines and the changes in the input and output signals of the address interface 310 for a SIMD indirect memory access instruction with a SIMD width of 2. The arithmetic interface 333 of the SIMD operation pipeline FLA outputs the input signals of FIG. 12 (B1_FLA_***) in cycle B1 at timing 5, and based on those input signals the address interface 310 generates memory access requests at timing 6 by means of the output signals of FIG. 12 (B1_EAGA_***, B1_EAGB_***).

The input IID signal B1_FLA_IID_EAITF (= 2) at timing 5 is latched by the latch F2 via the selector L1, and the output IID signals B1_EAGA_IID and B1_EAGB_IID at timing 6 both become 2.

The logical AND of the input valid signal B1_FLA_VALID_EAITF (= 1) and the input indirect signal B1_FLA_INDIRECT_EAITF (= 1) at timing 5 is latched by the latch F2 via the AND gate A1. Via the OR gates R1 and R2, the output valid signals B1_EAGA_VALID_OR and B1_EAGB_VALID_OR at timing 6 both become 1, and the output indirect signals B1_EAGA_INDIRECT and B1_EAGB_INDIRECT also both become 1.

The input FP signal B1_FLA_FP_EAITF (= 4) at timing 5 is latched by the latch F2_FP via the selector L2 and, via the selector L3, becomes the output FP signals B1_EAGA_FP (= 4) and B1_EAGB_FP (= 5) at timing 6.

Through the above operation, the SIMD operation pipeline FLA uses the flag signals output by the arithmetic interface 333 to generate, at timing 6, two memory access requests in the two memory access pipelines EAGA and EAGB within the address interface 310.

FIG. 14 is a diagram showing the pipelines and the changes in the input and output signals of the address interface circuit 310 for a SIMD indirect memory access instruction with a SIMD width of 4. The arithmetic interface 333 of the SIMD operation pipeline FLA outputs the input signals of FIG. 12 (B1_FLA_***) in cycle B1 at timing 5, and based on those input signals the address interface 310 generates memory access requests at timings 6 and 7 by means of the output signals of FIG. 12 (B1_EAGA_***, B1_EAGB_***).

The input signals output by the arithmetic interface 333 at timing 5 and the output signals generated at timing 6 in the memory access pipelines EAGA and EAGB within the address interface 310 are the same as for SIMD width 2 in FIG. 13.

However, for SIMD width 4, the latch F2 latches again, via the selector L1, the IID signal it holds at timing 6, and the latch F3 latches, via the AND gate A2, the logical AND of the output of the AND gate A1 at timing 6 and the latched input 4SIMD signal B1_FLA_4SIMD_EAITF. In addition, the latch F2_FP latches, via the selector L2, the value obtained by adding 2 with the adder ADD1 to the value of the input FP signal B1_FLA_FP_EAITF at timing 6. Accordingly, at timing 7, in the same way as at timing 6, the output valid signals and output indirect signals of the memory access pipelines EAGA and EAGB remain 1, the output IID signals remain 2, and the output FP signals become 6 and 7.

Through the above operation, the SIMD operation pipeline FLA uses the signals output by the arithmetic interface 333 to generate two memory access requests in the two memory access pipelines EAGA and EAGB within the address interface 310 at timing 6, and a further two memory access requests in the memory access pipelines EAGA and EAGB at timing 7.
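The signal changes described for FIGS. 13 and 14 can be reproduced with a cycle-level behavioral sketch. It is an approximation under stated assumptions rather than a gate-level model of FIG. 12: the function name `trace_two_pipelines` and the tuple layout are hypothetical, and the latches F2, F3, and F2_FP are represented only by the one-cycle delay between timings.

```python
def trace_two_pipelines(valid, indirect, four_simd, iid, fp, b1_timing=5):
    """Return {timing: {"EAGA": sig, "EAGB": sig}} for the generated requests.

    Each sig is a tuple (valid_or, indirect, iid, fp).
    Inputs are the B1_FLA_*_EAITF flag values seen at b1_timing.
    """
    trace = {}
    a1 = valid & indirect              # AND gate A1
    if not a1:
        return trace                   # no indirect request is generated
    t = b1_timing + 1                  # outputs appear one cycle later (latch F2)
    trace[t] = {
        "EAGA": (1, 1, iid, fp),       # transferred FP value
        "EAGB": (1, 1, iid, fp + 1),   # +1 by adder ADD2
    }
    a2 = a1 & four_simd                # AND gate A2 feeding latch F3
    if a2:
        trace[t + 1] = {
            "EAGA": (1, 1, iid, fp + 2),      # +2 by adder ADD1
            "EAGB": (1, 1, iid, fp + 3),      # +2 then +1 by adder ADD2
        }
    return trace
```

With the values used in the figures (IID 2, FP 4), the sketch yields FP values 4 and 5 at timing 6, and for SIMD width 4 additionally 6 and 7 at timing 7.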

FIG. 15 is a diagram showing the configuration of the arithmetic interface 333 and the address interface 310 for the case of FIG. 6, which has a single memory access pipeline EAGA. For a SIMD indirect memory access instruction, the operation differs from FIG. 12 as follows. The description also refers to FIG. 10.

First, the logical AND of the input valid signal B1_FLA_VALID_EAITF and the input indirect signal B1_FLA_INDIRECT_EAITF at timing 5 is latched by the latch F2_1 via the AND gate A1, and the output of the latch F2_1 is latched by the latch F2_2 at the next timing. As a result, the output valid signal B1_EAGA_VALID_OR and the output indirect signal B1_EAGA_INDIRECT output 1 for two cycles, at timings 6 and 7.

When the SIMD width is 4, the logical AND of the output of the latch F2_2 at timing 7 and the output of the latch F2_2 holding the input 4SIMD signal is additionally latched by the latch F3_1 via the AND gate A2, and the output of the latch F3_1 is latched again at the next timing. As a result, the output valid signal B1_EAGA_VALID_OR and the output indirect signal B1_EAGA_INDIRECT output 1 for a further two cycles, at timings 8 and 9.

The input IID signal B1_FLA_IID_EAITF at timing 5 is latched four times by the latch F2 via the selector L1, and is output as the output IID signal B1_EAGA_IID via the selector L4 at timings 6, 7, 8, and 9.

The input FP signal B1_FLA_FP_EAITF at timing 5 is latched by the latch F2_FP via the selector L2; over the following three cycles, the latch F2_FP latches the fetch port FP value incremented by 1 each cycle by the adder ADD1. Thus, at timings 6, 7, 8, and 9, the output FP signal B1_EAGA_FP takes the input FP value and then that value plus 1, plus 2, and plus 3.
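For the single-pipeline configuration of FIG. 15, the serial generation of requests can be sketched as a loop in which a register standing in for the latch F2_FP is incremented once per cycle. The function name `trace_single_pipeline` and the dictionary keys are hypothetical, and the sketch abstracts the latch chain (F2_1, F2_2, F3_1) into a simple per-cycle loop.

```python
def trace_single_pipeline(iid, fp, simd_width, b1_timing=5):
    """Return one record per cycle for the requests generated in EAGA.

    iid, fp: values of B1_FLA_IID_EAITF and B1_FLA_FP_EAITF at b1_timing.
    simd_width: 2 or 4 in the described embodiment.
    """
    trace = []
    fp_latch = fp                      # F2_FP first latches the input FP value
    for k in range(simd_width):
        trace.append({
            "timing": b1_timing + 1 + k,
            "valid_or": 1,             # B1_EAGA_VALID_OR
            "indirect": 1,             # B1_EAGA_INDIRECT
            "iid": iid,                # same IID re-latched every cycle
            "fp": fp_latch,            # B1_EAGA_FP
        })
        fp_latch += 1                  # adder ADD1 adds 1 for the next cycle
    return trace
```

Calling `trace_single_pipeline(iid=2, fp=4, simd_width=4)` gives four records at timings 6 to 9 with FP values 4, 5, 6, and 7.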

[Inhibiting the issue of new entries by the RSA and the RSF to avoid collisions]
FIGS. 16 and 17 are diagrams showing collisions with a memory access issued subsequently from the RSA, for SIMD width 2 and SIMD width 4 respectively. Both use the example of FIG. 11.

In the present embodiment, when the RSF issues an entry of a SIMD indirect memory access instruction to the SIMD operation pipeline FLA, the SIMD operation pipeline FLA uses the signals output by the arithmetic interface 333 to generate a plurality of memory access requests in the memory access pipelines EAGA and EAGB within the address interface 310. The generated memory access requests may therefore collide with a memory access request issued subsequently from the RSA. In the example of FIG. 11, memory access requests are generated once when the SIMD width is 2, so one collision is possible, and twice when the SIMD width is 4, so two collisions are possible. In the example of FIG. 6, which has a single memory access pipeline EAGA, two collisions are possible with SIMD width 2 and four with SIMD width 4.

Using the example of FIG. 11, the collisions are as follows. In FIGS. 16 and 17, a collision is indicated by a strikethrough on B1.

(1) For SIMD width 2 in FIG. 16, if the RSF outputs an entry of a SIMD indirect memory access instruction with SIMD width 2 to the pipeline FLA at timing 3 and the RSA outputs an entry of a memory access instruction to the pipeline EAGA or EAGB at timing 5, then at timing 6 the cycle B1 signals of the memory access request generated by the SIMD indirect memory access instruction collide with the cycle B1 signals of the memory access request transferred from the RSA.

(2) For SIMD width 4 in FIG. 17, if the RSF outputs an entry of a SIMD indirect memory access instruction with SIMD width 4 to the pipeline FLA at timing 3 and the RSA outputs an entry of a memory access instruction to the pipeline EAGA or EAGB at timing 5 or 6, the following collisions occur.

That is, if the RSA outputs an entry of a memory access instruction to the pipeline EAGA or EAGB at timing 5, then at timing 6 the cycle B1 signals of the memory access request generated by the SIMD indirect memory access instruction collide with the cycle B1 signals of the memory access request transferred from the RSA.

Likewise, if the RSA outputs an entry of a memory access instruction to the pipeline EAGA or EAGB at timing 6, then at timing 7 the cycle B1 signals of the memory access request generated by the SIMD indirect memory access instruction collide with the cycle B1 signals of the memory access request transferred from the RSA.

FIG. 18 is a diagram showing, for SIMD width 4, a collision with a memory access request generated by the issue of an entry of a subsequent SIMD indirect memory access instruction. It uses the example of FIG. 11, which has the two memory access pipelines EAGA and EAGB.

In the present embodiment, in response to the issue of an entry of a SIMD indirect memory access instruction, the SIMD operation pipeline FLA uses the signals output by the arithmetic interface 333 to generate memory access requests in the memory access pipelines EAGA and EAGB. The generated memory access requests may therefore collide with the memory access requests generated in the memory access pipelines EAGA and EAGB in response to the issue of an entry of a subsequent SIMD indirect memory access instruction. In the example of FIG. 11, memory access requests are generated twice when the SIMD width is 4, so one collision with the memory access requests of a subsequent SIMD indirect memory access instruction is possible. In the example of FIG. 6, one collision is possible with SIMD width 2 and three with SIMD width 4.

Using the example of FIG. 11, the collision is as shown in FIG. 18 and described below. In FIG. 18, the collision is indicated by a strikethrough on B1.

(3) If the RSF outputs an entry of a SIMD indirect memory access instruction with SIMD width 4 at timing 3 and then outputs an entry of a SIMD indirect memory access instruction with SIMD width 2 or 4 at timing 4, a collision occurs as follows. The cycle B1 signals of the memory access request generated by the 4-SIMD indirect memory access instruction output from the RSF at timing 3 collide, at timing 7, with the cycle B1 signals of the memory access request generated by the 2-SIMD or 4-SIMD indirect memory access instruction output from the RSF at timing 4.
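Cases (1) to (3) can be checked with a small timing model. The sketch below is illustrative only and assumes the latencies read from the timing description above: an entry issued by the RSF at timing t produces its first generated B1 request at t + 3, and an entry issued by the RSA at timing t reaches B1 at t + 1. The constant and function names are hypothetical.

```python
from math import ceil

RSF_P_TO_B1 = 3  # assumption: RSF issue at timing 3 -> generated B1 at timing 6
RSA_P_TO_B1 = 1  # assumption: RSA issue at timing 5 -> B1 at timing 6


def indirect_b1_timings(p_timing, simd_width, num_pipelines=2):
    """Timings at which requests generated by an indirect instruction occupy B1."""
    cycles = ceil(simd_width / num_pipelines)
    return {p_timing + RSF_P_TO_B1 + k for k in range(cycles)}


def rsa_collides(indirect_p, simd_width, rsa_p, num_pipelines=2):
    """True if a normal RSA request collides with the generated requests."""
    occupied = indirect_b1_timings(indirect_p, simd_width, num_pipelines)
    return (rsa_p + RSA_P_TO_B1) in occupied


def indirect_collision_timings(first_p, first_width, second_p, second_width,
                               num_pipelines=2):
    """Timings at which two indirect instructions would both occupy B1."""
    first = indirect_b1_timings(first_p, first_width, num_pipelines)
    second = indirect_b1_timings(second_p, second_width, num_pipelines)
    return sorted(first & second)
```

Under these assumptions, `rsa_collides(3, 2, 5)` reproduces case (1), `rsa_collides(3, 4, 6)` reproduces the second collision of case (2), and `indirect_collision_timings(3, 4, 4, 2)` returns timing 7 as in case (3).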

FIG. 19 is a diagram showing the configuration of the arithmetic interface 333 that generates the inhibition signals for avoiding collisions of indirect memory access requests. The arithmetic interface 333 receives the P cycle flag signals of the entry of a SIMD indirect memory access instruction issued by the RSF, latches them in the latch group F10, and latches them again in the latch group F11. The arithmetic interface 333 thereby transfers the output signals of the B1 cycle, two cycles after the P cycle, to the SIMD arithmetic units 330 of the SIMD operation pipeline FLA and to the address interface 310 of the memory access pipelines EAGA and EAGB. The arithmetic interface 333 has the two latch groups F10 and F11 in order, for example, to adjust timing.

The arithmetic interface 333 generates, from the three flag signals of the P cycle, the suppression signal INH_FLA_INDIRECT_OP, which makes the RSF suppress issue of subsequent SIMD indirect memory access instruction entries. It also generates, from the two flag signals of the PT cycle and from the three flag signals of the B1 cycle, the suppression signal INH_RSA_PRIORITY, which makes the RSA suppress issue of subsequent memory access instructions.

The arithmetic interface 333 operates as follows.

The input signal (valid signal) P_FLA_VALID becomes 1 when an operation request is output to the SIMD arithmetic unit of the pipeline FLA in the P cycle of the floating-point/SIMD pipeline.

The input signal (indirect signal) P_FLA_INDIRECT is valid when the input valid signal P_FLA_VALID is 1. It becomes 1 in the P cycle of the floating-point/SIMD pipeline when the operation request is a SIMD indirect memory access instruction.

The input signal (4SIMD signal) P_FLA_4SIMD is valid when the input valid signal P_FLA_VALID is 1. It becomes 1 in the P cycle of the floating-point/SIMD pipeline when the operation width of the SIMD arithmetic unit is 4.

The input signal (IID signal) P_FLA_IID is valid when the input valid signal P_FLA_VALID is 1. It indicates the CSE entry number of the operation executed in the pipeline FLA.

The input signal (FP signal) P_FLA_FP is valid when the input valid signal P_FLA_VALID is 1 and the input indirect signal P_FLA_INDIRECT is 1. It indicates the first number of the fetch ports FP in the primary data cache that the instruction decoder reserved for the SIMD indirect memory access instruction.

The arithmetic interface 333 latches and relays the five input signals through the latches F10 and F11, and transfers the five output signals B1_FLA_VALID_EAITF, B1_FLA_INDIRECT_EAITF, B1_FLA_4SIMD_EAITF, B1_FLA_IID_EAITF, and B1_FLA_FP_EAITF to the address interface 310, causing it to generate memory access requests.

Similarly, the arithmetic interface 333 latches and relays four of the input signals through the latches F10 and F11, and transfers the four output signals B1_FLA_VALID, B1_FLA_INDIRECT, B1_FLA_4SIMD, and B1_FLA_IID to the SIMD arithmetic unit.

In the arithmetic interface 333, the AND gate A4 generates the logical product of the two P-cycle input signals P_FLA_VALID and P_FLA_INDIRECT as the suppression signal INH_RSA_PRIORITY for subsequent normal memory access instructions, and transfers it to the RSA. In response, the RSA suppresses issue of subsequent memory access instruction entries to the memory access pipelines EAGA and EAGB.

As shown in FIG. 16, when both P-cycle signals at timing 3 are 1, the suppression signal INH_RSA_PRIORITY becomes 1 at timing 4, and at timing 5 the RSA suppresses issue of memory access instruction entries to the pipelines EAGA and EAGB. As a result, no B1-cycle signal is generated at timing 6, and the collision is avoided.

Further, in the arithmetic interface 333, the AND gate A5 generates the logical product of the three P-cycle input signals P_FLA_VALID, P_FLA_INDIRECT, and P_FLA_4SIMD as the suppression signal INH_RSA_PRIORITY for subsequent normal memory access instructions, and transfers it to the RSA. In response, the RSA suppresses issue of subsequent memory access instructions to the memory access pipelines EAGA and EAGB.

As shown in FIG. 17, when all three P-cycle signals at timing 3 are 1, the suppression signal INH_RSA_PRIORITY becomes 1 at timing 5, and at timing 6 the RSA suppresses issue of memory access instructions to the pipelines EAGA and EAGB. As a result, no B1-cycle signal is generated at timing 7, and the collision is avoided. In FIG. 17, as in FIG. 16, the suppression signal INH_RSA_PRIORITY also becomes 1 at timing 4, so issue of memory access instructions from the RSA at timing 5 is suppressed as well.

In addition, in the arithmetic interface 333, the AND gate A3 generates the logical product of the three P-cycle input signals P_FLA_VALID, P_FLA_INDIRECT, and P_FLA_4SIMD as the suppression signal INH_FLA_INDIRECT_OP for subsequent SIMD indirect memory access instructions, and transfers it to the RSF. In response, the RSF suppresses issue of subsequent SIMD indirect memory access instruction entries to the SIMD arithmetic pipeline FLA.

As shown in FIG. 18, when all three P-cycle signals at timing 3 are 1, the suppression signal INH_FLA_INDIRECT_OP becomes 1, and at the next timing 4 the RSF suppresses issue of SIMD indirect memory access instruction entries to the pipeline FLA. As a result, the B1-cycle signal that would otherwise be generated at timing 7 does not occur, and the collision is avoided.
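The combinational part of the three AND gates can be summarized in a few lines. This is a minimal Python sketch under stated assumptions: the names `SuppressionSignals` and `suppression_signals` are hypothetical, and the cycle in which each gate output is asserted (timing 4 or 5 in FIGS. 16 to 18) is deliberately not modeled.

```python
from typing import NamedTuple


class SuppressionSignals(NamedTuple):
    a4_inh_rsa_priority: bool     # AND gate A4 output (INH_RSA_PRIORITY)
    a5_inh_rsa_priority: bool     # AND gate A5 output (INH_RSA_PRIORITY)
    a3_inh_fla_indirect_op: bool  # AND gate A3 output (INH_FLA_INDIRECT_OP)


def suppression_signals(p_fla_valid: bool,
                        p_fla_indirect: bool,
                        p_fla_4simd: bool) -> SuppressionSignals:
    """Logical products described for gates A4, A5, and A3 (timing not modeled)."""
    a4 = p_fla_valid and p_fla_indirect
    a5 = p_fla_valid and p_fla_indirect and p_fla_4simd
    a3 = p_fla_valid and p_fla_indirect and p_fla_4simd
    return SuppressionSignals(a4, a5, a3)
```

In this sketch a 2-SIMD indirect instruction raises only the A4 output, while a 4-SIMD indirect instruction raises all three.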

FIG. 20 is a diagram showing the RSF and its output suppression circuit for SIMD indirect memory access instruction entries. The RSF has, for example, 20 entry holding units 337, each of which stores the flags corresponding to an instruction entry created in the reservation station RSF. Examples of the flags are shown in FIG. 7.

The RSF entry output condition detection circuit 338 corresponding to each entry holding unit 337 uses these flags to detect, for each entry in the RSF, that the condition for output to the pipeline is satisfied. The circuit outputs 1 when the instruction entry stored in its RSF entry has become processable, and 0 when the entry cannot be output.

The suppression circuit 339 forces the output of the output condition detection circuit 338 to 0 when the suppression signal INH_FLA_INDIRECT_OP generated by the arithmetic interface 333 and the INDIRECT flag stored in the RSF entry holding unit 337 are both 1. The resulting READY signal, which indicates whether the corresponding RSF entry can be output, is latched in the latch RSFxx_READY, where xx is 00 to 19.

The FLA output selection circuit 340 selects the RSF entry to output next from among the RSF entries whose READY signal is 1, and outputs it to the arithmetic interface 333. For a SIMD indirect memory access instruction, the INDIRECT flag is 1, so when the suppression signal INH_FLA_INDIRECT_OP becomes 1 the READY signal of that entry becomes 0, and the FLA output selection circuit 340 does not select that entry. For any other instruction, the INDIRECT flag is 0, so the output of the entry output condition detection circuit 338 is used as the READY signal as it is. Instructions other than SIMD indirect memory access instructions are therefore output whenever an entry has its required resources ready. In this way the RSF suppresses output of SIMD indirect memory access instruction entries in response to the suppression signal INH_FLA_INDIRECT_OP.
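The READY gating described above can be sketched as follows. This is an illustrative Python model, not the circuit of FIG. 20: the names `RsfEntry`, `ready`, and `select_entry` are hypothetical, and choosing the lowest-numbered ready entry is an assumption, since the text does not state the selection priority of the FLA output selection circuit 340.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RsfEntry:
    output_condition: bool  # output of the detection circuit 338 for this entry
    indirect: bool          # INDIRECT flag held in the entry holding unit 337


def ready(entry: RsfEntry, inh_fla_indirect_op: bool) -> bool:
    """READY is forced to 0 only when the suppression signal and INDIRECT are both 1."""
    suppressed = inh_fla_indirect_op and entry.indirect
    return entry.output_condition and not suppressed


def select_entry(entries: List[RsfEntry], inh_fla_indirect_op: bool) -> Optional[int]:
    """Return the index of the entry to output, or None if no entry is ready.

    Assumption: the lowest-numbered ready entry wins.
    """
    for index, entry in enumerate(entries):
        if ready(entry, inh_fla_indirect_op):
            return index
    return None
```

With the suppression signal asserted, an indirect entry is skipped while a non-indirect entry behind it can still be selected.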

FIG. 21 is a diagram showing the RSA and its output suppression circuit for normal memory access instruction entries. The RSA has, for example, 20 entry holding units 314. The RSA entry output condition detection circuit 315 corresponding to each entry holding unit 314 detects, for each RSA entry, that the condition for output to the pipeline is satisfied. The circuit outputs 1 when the instruction stored in its RSA entry has become processable, and 0 when the entry cannot be output.

The suppression circuit 316 forces the output of the RSA entry output condition detection circuit 315 to 0 when the suppression signal INH_RSA_PRIORITY generated by the arithmetic interface 333 is 1. The resulting READY signal, which indicates whether the corresponding RSA entry can be output, is latched in the latch RSAxx_READY.

The EAGA/EAGB output selection circuit 317 selects the RSA entry to output from among the RSA entries whose READY signal is 1, outputs it to the memory access pipeline EAGA or EAGB, and transfers it to the address interface. When the suppression signal INH_RSA_PRIORITY is 1, the READY signals of all RSA entries become 0 regardless of the values output from the RSA entry output condition detection circuit 315. The EAGA/EAGB output selection circuit 317 then has no entry that can be output, so it outputs no entry to the memory access pipelines EAGA and EAGB. In this way the RSA suppresses output of memory access instruction entries in response to the suppression signal INH_RSA_PRIORITY.

FIG. 22 is a diagram showing the completion waiting circuit in the CSE. FIG. 22 shows the completion waiting circuit for one CSE entry.

Each CSE entry includes an indirect flag CSE_INDIRECT. When the CSE entry is a SIMD indirect memory access instruction, the indirect flag CSE_INDIRECT of that entry is 1. When the instruction is 4-SIMD, the 4SIMD signal CSE_4SIMD is 1. When the instruction registered in the CSE is a SIMD indirect memory access instruction, the primary data cache 312 reports completion to the CSE for the same CSE entry number IID twice for 2-SIMD and four times for 4-SIMD.

The input signal (indirect signal) CSE_INDIRECT and the input signal (4SIMD signal) CSE_4SIMD are flags of the entry registered in the CSE by the instruction decoder 305. An input signal CSE_INDIRECT of 1 indicates that the CSE entry is a SIMD indirect memory access instruction. An input signal CSE_4SIMD of 1 indicates that the SIMD width of the CSE entry is 4, and a value of 0 indicates that the SIMD width is 2.

The primary data cache 312 of the present embodiment processes two independent memory accesses simultaneously. The primary data cache 312 therefore reports two memory access completion signals independently.

The input signals RT_STV_0 and RT_STV_1 are the memory access completion signals transferred from the primary data cache.

The input signals RT_STV_0_CSE_SEL and RT_STV_1_CSE_SEL become 1 when the entry number IID being processed in the primary data cache matches the entry number of the CSE entry.

When RT_STV_0 and RT_STV_0_CSE_SEL are both 1, or when RT_STV_1 and RT_STV_1_CSE_SEL are both 1, the output of the AND gate A8 or A9 makes the memory access completion report to the CSE valid. When the memory access completion report is valid, the adder 351 adds 1 to its 3-bit input signal and outputs the result to the memory access completion count storage element 353.

When the instruction decoder creates an entry in the CSE, the memory access completion count storage element 353 is reset to 0. After that, each time a completion report from the primary data cache 312 sets both RT_STV_0 and RT_STV_0_CSE_SEL to 1, or both RT_STV_1 and RT_STV_1_CSE_SEL to 1, the adder 351 increments the memory access completion count by 1.

The output signal (completion signal) CSE_MEM_COMP becomes 1 when the memory access completion count reaches the value prescribed for the type of memory access instruction (1, 2, or 4). For a SIMD indirect memory access instruction with a SIMD width of 4, when four memory access completions have been reported, the bit 2 output of the adder 351 becomes 1 and, through the AND gate A6, the completion signal CSE_MEM_COMP becomes 1. For a SIMD indirect memory access instruction with a SIMD width of 2, when two memory access completions have been reported, the bit 1 output of the adder 351 becomes 1 and the completion signal becomes 1. For a memory access instruction that is not a SIMD indirect instruction, when one memory access completion has been reported, the bit 0 output of the adder 351 becomes 1 and the completion signal CSE_MEM_COMP becomes 1.
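The completion counting for one CSE entry can be sketched as a small state machine. This is a minimal Python model under stated assumptions, not the circuit of FIG. 22: the class name `CompletionCounter` and the method `clock` are hypothetical, and a cycle in which both report ports match the same entry is counted once, because the text describes only a single +1 per valid report.

```python
class CompletionCounter:
    """Hypothetical model of the completion waiting logic for one CSE entry."""

    def __init__(self, cse_indirect: bool, cse_4simd: bool):
        self.cse_indirect = cse_indirect
        self.cse_4simd = cse_4simd
        self.count = 0  # storage element 353, reset when the entry is created

    def clock(self, rt_stv_0: bool, rt_stv_0_cse_sel: bool,
              rt_stv_1: bool, rt_stv_1_cse_sel: bool) -> None:
        # A report is valid when either port signals completion for this entry.
        report_valid = (rt_stv_0 and rt_stv_0_cse_sel) or (rt_stv_1 and rt_stv_1_cse_sel)
        if report_valid:
            self.count = (self.count + 1) & 0b111  # 3-bit adder 351

    @property
    def cse_mem_comp(self) -> bool:
        # Pick the counter bit that corresponds to the required report count.
        if self.cse_indirect and self.cse_4simd:
            bit = 2  # four reports
        elif self.cse_indirect:
            bit = 1  # two reports
        else:
            bit = 0  # one report
        return bool((self.count >> bit) & 1)
```

For example, a 4-SIMD indirect entry stays incomplete after three valid reports and becomes complete on the fourth.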

The completion determination circuit 354 receives the completion signal CSE_MEM_COMP and generates a signal indicating that the instruction can be completed. The completion determination circuit 354 determines completion of processed instructions in program order, transfers the processing result from, for example, the renaming register to the register, and releases the entry.

As described above, according to the present embodiment, an entry of a SIMD indirect memory access instruction is created in the RSF. When that entry is output to the SIMD arithmetic pipeline FLA, a number of memory accesses corresponding to the SIMD width is generated in the memory access pipelines EAGA and EAGB, and the SIMD arithmetic unit 330 acquires the plurality of independent addresses stored in the plurality of SIMD registers 332 and transfers them to the memory access pipelines EAGA and EAGB. The primary data cache 312 then uses those addresses to perform memory access for the plurality of data stored in the plurality of SIMD registers 332. The SIMD indirect memory access instruction is therefore executed with efficient use of resources such as the instruction decoder and the entries of the CSE, RSA, and RSF.

The above embodiment is summarized in the following appendices.

(Appendix 1)
An arithmetic processing unit comprising:
an instruction decoder that decodes instructions;
a memory access entry unit in which the instruction decoder creates entries of memory access instructions;
a memory access pipeline that executes, on a memory, the entry of the memory access instruction output from the memory access entry unit;
a multi-data instruction entry unit in which the instruction decoder creates entries of multi-data instructions, each of which processes a plurality of data with one instruction; and
an arithmetic pipeline that has a plurality of arithmetic units and a plurality of multi-data instruction registers, processes the entry of the multi-data instruction output from the multi-data instruction entry unit in parallel with the plurality of arithmetic units, and stores the operation results in the plurality of multi-data instruction registers,
wherein, in response to output of an entry of a multi-data indirect memory access instruction that accesses the memory at a plurality of memory addresses stored in the plurality of multi-data instruction registers, the arithmetic pipeline generates, in the memory access pipeline, a plurality of memory access requests corresponding to the multi-data indirect memory access instruction, and supplies to the memory access pipeline the plurality of memory addresses that the plurality of arithmetic units acquired from the plurality of multi-data instruction registers.

(Appendix 2)
The arithmetic processing unit according to Appendix 1, wherein the arithmetic pipeline generates the plurality of memory access requests at a first-cycle stage of the memory access pipeline, and supplies the plurality of memory addresses at a second-cycle stage of the memory access pipeline that is later than the first cycle.

(Appendix 3)
The arithmetic processing unit according to Appendix 2, wherein the arithmetic pipeline supplies the plurality of memory addresses in step with the pipeline transfer timing of the plurality of memory access requests generated in the memory access pipeline.

(Appendix 4)
The arithmetic processing unit according to Appendix 1, further comprising a cache unit connected to the memory access pipeline,
wherein the arithmetic pipeline includes, in the plurality of memory access requests generated in the memory access pipeline, identification information of a plurality of fetch ports in the cache unit that store the access destination memory addresses.

(Appendix 5)
The arithmetic processing unit according to any one of Appendices 1, 2, and 3, wherein the arithmetic pipeline serially generates the plurality of memory access requests in the memory access pipeline and serially supplies the plurality of memory addresses to the memory access pipeline.

(Appendix 6)
The arithmetic processing unit according to any one of Appendices 1, 2, and 3, wherein a plurality of the memory access pipelines are provided, and
the arithmetic pipeline generates at least some of the plurality of memory access requests in parallel in the plurality of memory access pipelines, and supplies at least some of the plurality of memory addresses in parallel.

(Appendix 7)
The arithmetic processing unit according to Appendix 1, further comprising a cache unit connected to the memory access pipeline,
wherein the cache unit transfers data to and from the plurality of multi-data instruction registers in response to the plurality of memory access requests.

(Appendix 8)
The arithmetic processing unit according to Appendix 1, wherein the arithmetic pipeline outputs a suppression signal to the memory access entry unit to cause the memory access entry unit to suppress output of an entry of a memory access instruction that collides with the plurality of memory access requests generated in the memory access pipeline.

(Appendix 9)
The arithmetic processing unit according to Appendix 5, wherein the arithmetic pipeline outputs a suppression signal to the multi-data instruction entry unit to cause the multi-data instruction entry unit to suppress output of an entry of a multi-data indirect memory access instruction that collides with the memory access requests serially generated in the memory access pipeline.

(Appendix 10)
The arithmetic processing unit according to Appendix 1, wherein the entry of the multi-data indirect memory access instruction output to the arithmetic pipeline has an indirect memory access signal indicating multi-data indirect memory access and a multi-data width information signal indicating the number of the plurality of data, and
the arithmetic pipeline generates, in the memory access pipeline, the number of memory access requests indicated by the multi-data width information signal, and supplies the number of memory addresses indicated by the multi-data width information signal.

(Appendix 11)
A control method for an arithmetic processing unit that includes:
an instruction decoder that decodes instructions;
a memory access entry unit in which the instruction decoder creates entries of memory access instructions;
a memory access pipeline that executes, on a memory, the entry of the memory access instruction output from the memory access entry unit;
a multi-data instruction entry unit in which the instruction decoder creates entries of multi-data instructions, each of which processes a plurality of data with one instruction; and
an arithmetic pipeline that has a plurality of arithmetic units and a plurality of multi-data instruction registers, processes the entry of the multi-data instruction output from the multi-data instruction entry unit in parallel with the plurality of arithmetic units, and stores the operation results in the plurality of multi-data instruction registers,
the control method comprising:
generating, by the arithmetic pipeline, in the memory access pipeline, a plurality of memory access requests corresponding to a multi-data indirect memory access instruction in response to issue of an entry of the multi-data indirect memory access instruction, which accesses the memory at a plurality of memory addresses stored in the plurality of multi-data instruction registers; and
supplying, by the arithmetic pipeline, to the memory access pipeline the plurality of memory addresses that the plurality of arithmetic units acquired from the plurality of multi-data instruction registers.

(Appendix 12)
The control method for an arithmetic processing unit according to Appendix 11, wherein the arithmetic pipeline generates the plurality of memory access requests at a first-cycle stage of the memory access pipeline, and supplies the plurality of memory addresses at a second-cycle stage of the memory access pipeline that is later than the first cycle.

301: Instruction fetch address generator
302: Branch prediction mechanism
303: Primary instruction cache
304: Instruction buffer
305: Instruction decoder
306: Register renaming unit
RSA: Reservation station for memory access (address generation reservation station), memory access entry unit
310: Address interface
311: Operand address generator
312: Primary data cache
313: Address selection circuit
EAGA, EAGB: Operand address generators, memory access pipelines
STB: Store buffer
RSE: Reservation station for fixed-point arithmetic
320: Fixed-point arithmetic unit
322: Fixed-point register
321: Fixed-point renaming register
RSF: Reservation station for floating-point arithmetic, multi-data instruction entry unit
330: Floating-point SIMD arithmetic unit, arithmetic unit for multi-data instructions
332: Floating-point SIMD register, multi-data instruction register
331: Floating-point SIMD renaming register
333: Arithmetic interface
FLA, FLB: Floating-point SIMD arithmetic pipelines, SIMD arithmetic pipelines
CSE: Commit stack entry
RSBR: Reservation station for branches
PC: Program counter

Claims

An arithmetic processing unit comprising:
an instruction decoder that decodes instructions;
a memory access entry unit in which the instruction decoder creates entries of memory access instructions;
a memory access pipeline that executes, on a memory, the entry of the memory access instruction output from the memory access entry unit;
a multi-data instruction entry unit in which the instruction decoder creates entries of multi-data instructions, each of which processes a plurality of data with one instruction; and
an arithmetic pipeline that has a plurality of arithmetic units and a plurality of multi-data instruction registers, processes the entry of the multi-data instruction output from the multi-data instruction entry unit in parallel with the plurality of arithmetic units, and stores the operation results in the plurality of multi-data instruction registers,
wherein, in response to output of an entry of a multi-data indirect memory access instruction that accesses the memory at a plurality of memory addresses stored in the plurality of multi-data instruction registers, the arithmetic pipeline generates, in the memory access pipeline, a plurality of memory access requests corresponding to the multi-data indirect memory access instruction, and supplies to the memory access pipeline the plurality of memory addresses that the plurality of arithmetic units acquired from the plurality of multi-data instruction registers.

The arithmetic processing unit according to claim 1, wherein the arithmetic pipeline generates the plurality of memory access requests at a first-cycle stage of the memory access pipeline, and supplies the plurality of memory addresses at a second-cycle stage of the memory access pipeline that is later than the first cycle.

The arithmetic processing device according to claim 2, wherein the arithmetic pipeline supplies the plurality of memory addresses in accordance with pipeline transfer timings of a plurality of memory access requests generated in the memory access pipeline.

The arithmetic processing unit according to claim 1, further comprising a cache unit connected to the memory access pipeline,
wherein the arithmetic pipeline includes, in the plurality of memory access requests generated in the memory access pipeline, identification information of a plurality of fetch ports in the cache unit that store the access destination memory addresses.

The arithmetic processing unit according to any one of claims 1 to 4, wherein the arithmetic pipeline serially generates the plurality of memory access requests and serially supplies the plurality of memory addresses to the memory access pipeline.

The arithmetic processing unit according to claim 1, 2, or 3, comprising a plurality of the memory access pipelines,
wherein the arithmetic pipeline generates at least some of the plurality of memory access requests in parallel in the plurality of memory access pipelines and supplies at least some of the plurality of memory addresses in parallel.

The arithmetic processing unit according to claim 1, further comprising a cache unit connected to the memory access pipeline,
wherein the cache unit performs data transfer with the plurality of multi-data instruction registers in response to the plurality of memory access requests.

The arithmetic processing unit according to claim 1, wherein the arithmetic pipeline outputs a suppression signal to the memory access entry unit so that the memory access entry unit suppresses output of an entry of a memory access instruction that would collide with the plurality of memory access requests generated in the memory access pipeline.

The arithmetic processing unit according to claim 5, wherein the arithmetic pipeline outputs a suppression signal to the multi-data instruction entry unit so that the multi-data instruction entry unit suppresses output of an entry of a multi-data indirect memory access instruction that would collide with the memory access requests generated serially in the memory access pipeline.

A method for controlling an arithmetic processing unit that includes:
An instruction decoder for decoding instructions;
A memory access entry unit (RSA) in which the instruction decoder generates an entry of a memory access instruction;
A memory access pipeline for executing, with respect to a memory, the entry of the memory access instruction output from the memory access entry unit;
A multi-data instruction entry unit in which the instruction decoder generates an entry of a multi-data instruction that processes a plurality of data with one instruction; and
An arithmetic pipeline that has a plurality of arithmetic units and a plurality of multi-data instruction registers, processes the entry of the multi-data instruction output from the multi-data instruction entry unit in parallel in the plurality of arithmetic units, and stores the operation results in the multi-data instruction registers,
the method comprising:
Generating, by the arithmetic pipeline, in the memory access pipeline, a plurality of memory access requests corresponding to a multi-data indirect memory access instruction, in response to input of an entry of the multi-data indirect memory access instruction, which performs memory access to the memory for a plurality of memory addresses stored in the plurality of multi-data instruction registers; and
Supplying, by the arithmetic pipeline, the plurality of memory addresses acquired by the plurality of arithmetic units from the plurality of multi-data instruction registers to the memory access pipeline.
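
The flow recited in the claims above can be sketched informally as a small software model of a serial multi-data indirect load: one memory access request is generated per cycle, and each request receives its address one cycle after it is generated. This is only an illustration under stated assumptions, not the claimed hardware. The names `MemoryAccessRequest` and `simulate_indirect_load`, the one-to-one fetch port numbering, and the dictionary standing in for the memory and cache unit are all hypothetical and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MemoryAccessRequest:
    """One request generated in the memory access pipeline (hypothetical model)."""
    element: int                   # index of the SIMD element this request serves
    fetch_port: int                # assumed ID of the fetch port that holds the address
    issue_cycle: int               # cycle in which the request enters the pipeline
    address: Optional[int] = None  # supplied one cycle later by an arithmetic unit


def simulate_indirect_load(address_registers: List[int],
                           memory: Dict[int, int]) -> List[int]:
    """Model a serial multi-data indirect load.

    address_registers holds one memory address per SIMD element and stands in
    for the multi-data instruction registers. Returns the loaded values in
    element order.
    """
    requests: List[MemoryAccessRequest] = []
    n = len(address_registers)

    # Run one extra cycle so the last request can receive its address.
    for cycle in range(n + 1):
        # Second-cycle stage: the request generated in the previous cycle
        # receives its address from the corresponding register.
        if cycle >= 1:
            prev = requests[cycle - 1]
            prev.address = address_registers[prev.element]

        # First-cycle stage: generate the next request, one per cycle.
        if cycle < n:
            requests.append(MemoryAccessRequest(element=cycle,
                                                fetch_port=cycle,
                                                issue_cycle=cycle))

    # The cache unit (here a plain dict) returns the data for each request.
    return [memory[req.address] for req in requests]
```

For example, `simulate_indirect_load([0x208, 0x010], {0x208: 22, 0x010: 33})` returns `[22, 33]`. Issuing the request before its address is available mirrors the two-stage timing of claim 2; a model of claim 6 would instead distribute the requests over several pipelines in the same cycle.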