JP2017027156A - Arithmetic processing device and method for controlling arithmetic processing device

Info

Publication number: JP2017027156A
Application number: JP2015142344A
Authority: JP
Inventors: Takeshi Mishina; Toru Hikichi
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2015-07-16
Filing date: 2015-07-16
Publication date: 2017-02-02
Anticipated expiration: 2035-07-16
Also published as: JP6569347B2

Abstract

PROBLEM TO BE SOLVED: To improve the access latency of a cache memory in an arithmetic processing device having a cache memory that has a plurality of banks and is shared by a plurality of arithmetic units. SOLUTION: The arithmetic processing device has: a plurality of cores 11; a last level (LL) cache memory having a plurality of banks and shared by the plurality of cores; a pipeline selection unit <1> 15 that selects a request to be output from among the requests to the cache memory other than requests for new data from the cores; and pipeline selection units <2> 17-1 and 17-2 that select, for each bank of the cache memory, a request from among the requests for new data and the request selected by the pipeline selection unit <1>, and output the selected request to the pipeline of the LL cache memory. The pipeline selection units <2> are placed closer to the cache memory than the pipeline selection unit <1> is, which shortens the path along which signals for requests for new data flow and improves the access latency of the LL cache memory. SELECTED DRAWING: Figure 1

Description

The present invention relates to an arithmetic processing device and a method for controlling an arithmetic processing device.

Cache memory stores data requested by the arithmetic processing units of a processor in order to improve performance. A processor generally has a plurality of cores (arithmetic units) and an LL (Last Level) cache memory shared by those cores. Each core has a fast, small-capacity primary (level 1) cache memory that is directly accessible from the arithmetic processing pipeline (and may additionally have a secondary (level 2) cache memory and so on). When a cache miss occurs in the primary cache memory, the LL cache memory is accessed.

FIG. 9 shows an example of a conventional processor layout. The example in FIG. 9 has four cores (core <0> to core <3>) and a one-bank LL cache memory shared by those four cores. Each core 901 has an arithmetic unit and a primary (level 1) cache memory. The LL cache memory has an LL (last level) cache tag unit 906 and an LL (last level) cache data unit 907. The external port 908 is an interface for requests exchanged with the outside of the processor.

For example, when data requested by core <3> 901 hits in the LL cache memory, the signals (request, data, and so on) flow along the path shown by the broken line in FIG. 9. That is: core <3> 901 → the ports (various ports 903) that serve as interfaces for the various requests from the cores → the pipeline selection unit 905 of the pipeline control unit 904, which decides according to priority which request from the cores is entered into the pipeline → LL cache tag unit 906 → LL cache data unit 907 → core <3> 901.

In recent processors, the number of cores (arithmetic units) has tended to increase in order to improve performance, and the capacity of the LL cache memory has also increased in order to maintain the cache hit rate. Accordingly, to use the LL cache memory efficiently, the LL cache memory is divided into a plurality of banks, as shown in FIG. 10. FIG. 10 shows an example of the layout of a conventional processor in which the LL cache memory is divided into a plurality of banks. In FIG. 10, components having the same functions as those shown in FIG. 9 are given the same reference numerals.

FIG. 10 shows an example in which the LL cache memory has two banks: one bank having an LL cache tag unit 906-1 and an LL cache data unit 907-1, and another bank having an LL cache tag unit 906-2 and an LL cache data unit 907-2. The broken line indicates the path along which the signals (request, data, and so on) flow when data requested by core <7> 901 hits in the LL cache memory.

For SMT (Simultaneous Multi Thread) processors, a cache control technique has been proposed that controls access requests to a cache memory shared by simultaneously executing threads (see Patent Document 1). In Patent Document 1, the port means that holds access requests is controlled so that it is divided and used in accordance with the thread configuration, and an access request is selected from among the access requests issued by the threads and held in the port means. This allows the resources needed for access processing to be used efficiently when accessing the cache memory.

Patent Document 1: International Publication No. 2008/155822

As in the example of FIG. 10, when an LL cache memory shared by a plurality of cores is divided into a plurality of banks, the pipeline selection unit 905, which selects the request to be entered into the pipeline, is normally placed at the center of the pipeline control unit 904 in view of factors such as the distance to each bank of the LL cache memory. Consequently, the more banks the LL cache memory has, the longer the physical distance between the pipeline selection unit of the pipeline control unit and the LL cache tag units of the LL cache memory becomes, which worsens the access latency of the LL cache memory.

In one aspect, an object of the present invention is to improve the access latency of a cache memory in an arithmetic processing device that has a cache memory with a plurality of banks shared by a plurality of arithmetic units.

One aspect of the arithmetic processing device has: a plurality of arithmetic units; a cache memory that has a plurality of banks and is shared by the plurality of arithmetic units; a first selection unit that selects a request to be output from among the requests to the cache memory other than a first request from the arithmetic units; and a second selection unit that, for each bank of the cache memory, selects a request from among the first request and the request selected by the first selection unit and outputs it to the pipeline for accessing the cache memory. The second selection unit is placed closer to the cache memory than the first selection unit is.

In one aspect of the invention, the second selection unit, which selects and outputs a request from among the first request and the request selected by the first selection unit, is placed near the cache memory. This shortens the path along which signals for the first request flow and improves the access latency of the cache memory.

FIG. 1 is a diagram showing an example of the layout of the arithmetic processing device in an embodiment of the present invention.
FIG. 2 is a diagram showing a configuration example of the pipeline selection units in this embodiment.
FIG. 3A is a diagram for explaining access to the cache memory in this embodiment.
FIG. 3B is a diagram for explaining access to the cache memory in the prior art.
FIG. 4 is a diagram comparing the cache memory access latency of this embodiment with that of the prior art.
FIG. 5 is a diagram showing an example of the throughput improvement in the arithmetic processing device of this embodiment.
FIG. 6 is a diagram showing an operation example of the pipeline selection units in this embodiment.
FIG. 7 is a diagram showing a comparison of pipeline slots between this embodiment and the prior art.
FIG. 8 is a diagram showing another example of the layout of the arithmetic processing device in this embodiment.
FIG. 9 is a diagram showing the layout of a conventional processor.
FIG. 10 is a diagram showing the layout of a conventional processor.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 shows an example of the layout of a processor serving as an arithmetic processing device in one embodiment of the present invention. The processor in this embodiment has a plurality of cores (arithmetic units) and a cache memory that has a plurality of banks and is shared by those cores. The example in FIG. 1 has eight cores (core <0> to core <7>) and a two-bank LL (last level) cache memory shared by those eight cores.

Each core 11 has an arithmetic unit that performs arithmetic processing and the like, and a primary (level 1) cache memory. Each core 11 may additionally have a secondary (level 2) cache memory. In the LL cache memory, the LL cache tag unit 18-1 and the LL cache data unit 19-1 form one bank, and the LL cache tag unit 18-2 and the LL cache data unit 19-2 form another bank. The LL cache tag units 18-1 and 18-2 store tags (information such as the address and state of the data) for the data stored in the LL cache data units 19-1 and 19-2.

The cores 11 (core <0> to core <7>) and the LL cache memory (LL cache tag units 18-1 and 18-2 and LL cache data units 19-1 and 19-2) are communicably connected via a transfer bus 12. Accesses to the LL cache memory based on requests from the cores 11 (core <0> to core <7>) or on requests from outside the processor are carried out by the pipeline control unit 14, which performs pipeline control such as accepting and selecting requests and searching tags.

The processor in this embodiment has new request ports 16-1 and 16-2, which are interfaces for requests for new data (new requests), the most frequent kind of request from the cores 11, and various ports 13, which are interfaces for the various requests from the cores 11 other than new requests. Requests from the cores 11 other than new requests include, for example, snoop responses used for rewriting data and the like. The processor in this embodiment also has an external port 20, which is an interface for requests exchanged with the outside of the processor.

In this embodiment, the pipeline selection unit in the pipeline control unit 14 consists of a pipeline selection unit <1> 15, which selects the request to be output from among the requests other than new requests, and pipeline selection units <2> 17-1 and 17-2, which select either a new request or the request selected by the pipeline selection unit <1> 15 as the request to be entered into the pipeline and pass it to the LL cache tag units 18-1 and 18-2. The pipeline selection unit <1> 15 and the pipeline selection units <2> 17-1 and 17-2 select requests according to preset priorities.

FIG. 2 shows a configuration example of the pipeline selection units 15 and 17 in this embodiment. In this embodiment, there are four ports whose requests are entered into the pipeline: the new request port, the snoop response port, the load buffer port, and the external snoop port. The snoop response port has the highest priority, followed in descending order by the load buffer port and the external snoop port, and the new request port has the lowest priority. The pipeline selection unit <1> 15 has AND (logical product) circuits 201, 202, and 203 and a selection circuit 204. The pipeline selection unit <2> 17 has an AND circuit 206 and a selection circuit 207.

The AND circuit 201 receives the snoop response port signal SRPT and the snoop response port suppression signal SRPT_INH. The AND circuit 202 receives the load buffer port signal LDPT, the load buffer port suppression signal LDPT_INH, and the snoop response port signal SRPT. The AND circuit 203 receives the external snoop port signal OSPT, the external snoop port suppression signal OSPT_INH, the snoop response port signal SRPT, and the load buffer port signal LDPT. The selection circuit 204 selectively outputs the outputs of the AND circuits 201, 202, and 203.

The AND circuit 206 receives the output of the AND circuit 205, to which the new request port signal RQPT and the new request port suppression signal RQPT_INH are input, and the output of the selection circuit 204 of the pipeline selection unit <1> 15. The selection circuit 207 selectively outputs the output of the selection circuit 204 of the pipeline selection unit <1> 15 and the output of the AND circuit 206.

Here, the signals SRPT, LDPT, OSPT, and RQPT are "1" when the corresponding port has a request and "0" when it has none. The signals SRPT_INH, LDPT_INH, OSPT_INH, and RQPT_INH are set to "1" when the timing of a request at the corresponding port is not appropriate. For example, if a memory read takes four cycles in total and the next memory read signal arrives within three cycles, the suppression signal holds back the next output until the four cycles have elapsed (that is, until the previous memory read completes).
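The suppression example can be illustrated with a minimal Python sketch. The function names `inhibit` and `issue_cycles`, the parameter `busy_cycles`, and the convention of counting cycles from the moment the previous read was issued are assumptions made for illustration; the specification gives only the four-cycle example.

```python
from typing import List, Optional


def inhibit(last_issue_cycle: Optional[int], now: int, busy_cycles: int = 4) -> int:
    """Return 1 while the previous memory read is still in progress, else 0.

    Assumption: the read occupies `busy_cycles` cycles starting at the cycle
    in which it was issued.
    """
    if last_issue_cycle is None:
        return 0
    return 1 if now - last_issue_cycle < busy_cycles else 0


def issue_cycles(arrivals: List[int], busy_cycles: int = 4) -> List[int]:
    """For each request arrival cycle, return the cycle at which it is output."""
    issued: List[int] = []
    last: Optional[int] = None
    for t in sorted(arrivals):
        # Hold the request back until the previous read has completed.
        start = t if last is None else max(t, last + busy_cycles)
        issued.append(start)
        last = start
    return issued


print(issue_cycles([0, 2, 9]))  # [0, 4, 9]
```

In this sketch the read arriving at cycle 2 is held until cycle 4, while the read arriving at cycle 9 is output immediately because the earlier read has already completed.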

The pipeline selection unit <1> 15 configured as shown in FIG. 2 outputs the snoop response request when the snoop response port signal SRPT is "1" and the snoop response port suppression signal SRPT_INH is "0" (input state 1). It outputs the load buffer request when SRPT is "0", the load buffer port signal LDPT is "1", and the load buffer port suppression signal LDPT_INH is "0" (input state 2). It outputs the external snoop request when SRPT is "0", LDPT is "0", the external snoop port signal OSPT is "1", and the external snoop port suppression signal OSPT_INH is "0" (input state 3). In any case other than input states 1, 2, and 3, the pipeline selection unit <1> 15 outputs no request.

The pipeline selection unit <2> 17 configured as shown in FIG. 2 outputs the request output by the pipeline selection unit <1> 15 whenever the pipeline selection unit <1> 15 is outputting a snoop response, load buffer, or external snoop request. When the pipeline selection unit <1> 15 is outputting none of these requests, the new request port signal RQPT is "1", and the new request port suppression signal RQPT_INH is "0", it outputs the new request. Otherwise, the pipeline selection unit <2> 17 outputs no request.
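The selection behavior described for input states 1 to 3 and for the pipeline selection unit <2> can be written out as a short Python sketch. This is a behavioral model only, not the gate-level circuit of FIG. 2. The function names `select_stage1` and `select_stage2` and the string labels for the requests are hypothetical.

```python
from typing import Optional


def select_stage1(srpt: int, srpt_inh: int,
                  ldpt: int, ldpt_inh: int,
                  ospt: int, ospt_inh: int) -> Optional[str]:
    """Behavioral model of pipeline selection unit <1> (non-new requests)."""
    if srpt and not srpt_inh:                              # input state 1
        return "snoop_response"
    if not srpt and ldpt and not ldpt_inh:                 # input state 2
        return "load_buffer"
    if not srpt and not ldpt and ospt and not ospt_inh:    # input state 3
        return "external_snoop"
    return None                                            # no request output


def select_stage2(stage1_out: Optional[str], rqpt: int, rqpt_inh: int) -> Optional[str]:
    """Behavioral model of pipeline selection unit <2>."""
    if stage1_out is not None:      # a request from unit <1> takes priority
        return stage1_out
    if rqpt and not rqpt_inh:       # otherwise pass an uninhibited new request
        return "new_request"
    return None


# A load buffer request and a new request are pending; the load buffer wins.
s1 = select_stage1(srpt=0, srpt_inh=0, ldpt=1, ldpt_inh=0, ospt=0, ospt_inh=0)
print(select_stage2(s1, rqpt=1, rqpt_inh=0))  # load_buffer
```

Because a pending higher-priority signal masks the lower-priority ports even when it is itself inhibited, `select_stage1(1, 1, 1, 0, 0, 0)` returns `None` in this model, which matches the three input states as stated.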

Thus, in this embodiment, the pipeline selection unit <1> 15 and the pipeline selection units <2> 17-1 and 17-2 are provided. The pipeline selection unit <1> 15 selects among the requests other than new requests, and the pipeline selection units <2> 17-1 and 17-2 select either a new request or the request selected by the pipeline selection unit <1> 15 and pass it to the LL cache tag units 18-1 and 18-2. In addition, the interfaces for requests from the cores 11 are provided separately as the new request ports 16-1 and 16-2 for new requests and the various ports 13 for requests other than new requests.

This makes it possible, as shown in FIG. 1, to place the new request ports 16-1 and 16-2 and the pipeline selection units <2> 17-1 and 17-2 closer to the LL cache tag units 18-1 and 18-2 than the pipeline selection unit <1> 15 is. Therefore, for example, when data requested by core <7> 11 hits in the bank formed by the LL cache tag unit 18-2 and the LL cache data unit 19-2, as shown by the broken line in FIG. 1, the signals no longer need to pass through the center of the pipeline control unit 14 as they did conventionally, so the signal path becomes shorter and the access latency of the LL cache memory can be reduced. Even when data requested by core <7> 11 hits in the bank formed by the LL cache tag unit 18-1 and the LL cache data unit 19-1, which is far from core <7> 11, the signal path is about the same length as the conventional path through the center of the pipeline control unit 14, so the access latency of the LL cache memory does not get worse.

Access to the LL cache memory in this embodiment is described with reference to FIGS. 3A and 3B. FIG. 3A is a diagram for explaining access to the LL cache memory in this embodiment, and FIG. 3B is a diagram for explaining access to the LL cache memory in the conventional processor shown in FIG. 10. In FIGS. 3A and 3B, each of the periods ST1, ST2, ST3, ST4, ST5, and ST6 corresponds to one cycle at the operating frequency of the processor.

As shown in FIG. 3B, in the conventional processor of FIG. 10, a new request from a core 301 is input in period ST2 to the new request port 361 corresponding to that core. In period ST3, the next cycle, the LRU (Least Recently Used) control unit 371 selects and outputs the oldest new request. Then, if the request is selected in period ST4 by the pipeline selection unit 381, which selects the request to be entered into the pipeline according to a predetermined priority, it reaches the LL cache tag unit 351 in period ST6. Period ST5 is provided as the time needed for the transfer between the pipeline selection unit 381 and the LL cache tag unit 351.

Similarly, a snoop response request from a core 301 is input in period ST2 to the snoop response port 362 corresponding to that core, and in period ST3, the next cycle, the LRU control unit 372 selects and outputs the oldest snoop response request. Then, if the request is selected by the pipeline selection unit 381 in period ST4, it reaches the LL cache tag unit 351 in period ST6.

External snoop requests from the memory control unit 302, load buffer requests from the memory control unit 303, and requests such as error handling from the other functional units 304 are input in period ST2 to the corresponding ports 363, 364, and 365, respectively. If a request is selected by the pipeline selection unit 381 in period ST4, it reaches the LL cache tag unit 351 in period ST6.

Thus, in the conventional processor of FIG. 10, a request takes six cycles to reach the LL cache tag unit 351 regardless of its type.

By contrast, as shown in FIG. 3A, in the processor of this embodiment, a new request from a core 301 is input in period ST2 to the new request port 331 corresponding to that core. In period ST3, the next cycle, the LRU control unit 341 selects and outputs the oldest new request, and if that request is also selected by the pipeline selection unit <2> 342, it reaches the LL cache tag unit 351 in period ST4.

A snoop response request from a core 301 is input in period ST2 to the snoop response port 311 corresponding to that core. If the request is selected by the pipeline selection unit <1> 321 in period ST3, the next cycle, and by the pipeline selection unit <2> 342 in period ST5, it reaches the LL cache tag unit 351 in period ST6. Period ST5 is provided as the time needed for the transfer between the pipeline selection unit <1> 321 and the pipeline selection unit <2> 342.

External snoop requests from the memory control unit 302, load buffer requests from the memory control unit 303, and requests such as error handling from the other functional units 304 are input in period ST2 to the corresponding ports 312, 313, and 314, respectively. If a request is selected by the pipeline selection unit <1> 321 in period ST3, the next cycle, and by the pipeline selection unit <2> 342 in period ST5, it reaches the LL cache tag unit 351 in period ST6.

Thus, in this embodiment, the pipeline selection unit is split into the pipeline selection unit <1> 321 and the pipeline selection unit <2> 342, and the new request port 331 and the pipeline selection unit <2> 342 are placed close to the LL cache tag unit 351. As a result, requests other than new requests reach the LL cache tag unit 351 in six cycles, as before, while a new request from a core 301 can reach the LL cache tag unit 351 in four cycles.
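As a rough worked example of these cycle counts, the following Python sketch computes the average number of cycles a request needs to reach the LL cache tag unit for a given request mix. The cycle counts (four and six) come from the description above. The mix in `EXAMPLE_MIX` is purely hypothetical and chosen only to reflect the statement that new requests are the most frequent; the names `CONVENTIONAL`, `EMBODIMENT`, and `mean_cycles` are likewise illustrative.

```python
from typing import Dict

# Cycles needed to reach the LL cache tag unit, per request type.
CONVENTIONAL: Dict[str, int] = {
    "new_request": 6, "snoop_response": 6, "external_snoop": 6, "load_buffer": 6,
}
EMBODIMENT: Dict[str, int] = {
    "new_request": 4, "snoop_response": 6, "external_snoop": 6, "load_buffer": 6,
}

# Hypothetical request mix (relative weights), not taken from the specification.
EXAMPLE_MIX: Dict[str, int] = {
    "new_request": 70, "snoop_response": 10, "external_snoop": 10, "load_buffer": 10,
}


def mean_cycles(cycles: Dict[str, int], mix: Dict[str, int]) -> float:
    """Weighted average of cycles-to-tag over the request mix."""
    total_weight = sum(mix.values())
    return sum(cycles[kind] * weight for kind, weight in mix.items()) / total_weight


print(mean_cycles(CONVENTIONAL, EXAMPLE_MIX))  # 6.0
print(mean_cycles(EMBODIMENT, EXAMPLE_MIX))    # 4.6
```

Under this assumed mix the average falls from 6.0 to 4.6 cycles. The larger the share of new requests, the closer the average gets to four cycles.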

That is, in the processor of this embodiment, a request for new data (new request) from a core can reach the LL cache tag unit 351 in four cycles (ST1 to ST4) via the new request port and the pipeline selection unit <2>, as shown in FIG. 4(A). In the conventional processor, by contrast, as shown in FIG. 4(B), a new request takes six cycles (ST1 to ST6) to reach the LL cache tag unit 351 via the new request port among the various ports and the pipeline selection unit.

As described with reference to FIGS. 3A, 3B, and 4, splitting the pipeline selection unit into the pipeline selection unit <1> 321 and the pipeline selection unit <2> 342 reduces the number of logic stages on the path taken by new requests, which makes it possible to shorten the access latency of the LL cache memory.

In addition, in this embodiment a new request can reach the LL cache tag unit in less time than requests other than new requests, so throughput can be improved. For example, as shown in FIG. 5, suppose that requests are issued one per cycle in the order R (a request other than a new request), E (no request), R, R, N (a new request). At STAGE0 the sequence is R, E, R, R, N, but because N (the new request) reaches the LL cache tag unit at STAGE4, from that point on N can move into the slot occupied by E (no request), which shows that throughput improves.

Here, when the priority for entering the pipeline is, as described above, highest for the snoop response port, lower in order for the load buffer port and the external snoop port, and lowest for the new request port, the pipeline selection unit in the conventional processor is configured, for example, as shown in FIG. 6(B). FIG. 6(B) shows a configuration example of the pipeline selection unit in the conventional processor, which has AND circuits 601, 602, 603, and 604 and a selection circuit 605.

The AND circuit 601 receives the snoop response port signal SRPT and the snoop response port suppression signal SRPT_INH. The AND circuit 602 receives the load buffer port signal LDPT, the load buffer port suppression signal LDPT_INH, and the snoop response port signal SRPT. The AND circuit 603 receives the external snoop port signal OSPT, the external snoop port suppression signal OSPT_INH, the snoop response port signal SRPT, and the load buffer port signal LDPT. The AND circuit 604 receives the new request port signal RQPT, the new request port suppression signal RQPT_INH, the snoop response port signal SRPT, the load buffer port signal LDPT, and the external snoop port signal OSPT. The selection circuit 605 selectively outputs the outputs of the AND circuits 601, 602, 603, and 604.

In the pipeline selection unit configured as shown in FIG. 6B, depending on the order in which requests arrive, no request may be selected, which lowers throughput. For example, if requests from the cores arrive in the order snoop response port → load buffer port → snoop response port and new request port (simultaneously), then according to the priority order they are put into the pipeline in the order snoop response port → load buffer port → snoop response port → new request port.

However, consider the timing at which the snoop response port and the new request port arrive simultaneously. Suppose that the snoop response port signal SRPT, the snoop response port suppression signal SRPT_INH, and the new request port signal RQPT are "1", and that the other signals LDPT and OSPT and the suppression signals LDPT_INH, OSPT_INH, and RQPT_INH are "0". The high-priority snoop response port request is present, but it does not pass because of the snoop response port suppression signal SRPT_INH. The new request port request does not pass either, because the higher-priority snoop response port request is present. Therefore, the outputs of the AND circuits 601, 602, 603, and 604 are all "0", and the selection circuit 605 of the pipeline selection unit does not select any request.
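The stall described above can be reproduced with a minimal sketch of the conventional selector. It assumes that each AND circuit takes its own suppression signal and the higher-priority port signals as inverted inputs, which the description implies but does not state. The function name `conventional_select` and the returned port labels are hypothetical and chosen only for illustration.

```python
def conventional_select(srpt, ldpt, ospt, rqpt,
                        srpt_inh=0, ldpt_inh=0, ospt_inh=0, rqpt_inh=0):
    """Model of the conventional single-stage selector (AND 601-604, selector 605).

    Assumption: suppression signals and higher-priority port signals are
    inverted at the AND inputs. Returns a port label, or None if no AND
    circuit outputs 1.
    """
    and601 = bool(srpt and not srpt_inh)
    and602 = bool(ldpt and not ldpt_inh and not srpt)
    and603 = bool(ospt and not ospt_inh and not srpt and not ldpt)
    and604 = bool(rqpt and not rqpt_inh and not srpt and not ldpt and not ospt)

    # Selection circuit 605: at most one AND output can be 1 at a time.
    for label, out in (("snoop_response", and601),
                       ("load_buffer", and602),
                       ("external_snoop", and603),
                       ("new_request", and604)):
        if out:
            return label
    return None


# Scenario from the text: SRPT = SRPT_INH = RQPT = 1, everything else 0.
stalled = conventional_select(srpt=1, ldpt=0, ospt=0, rqpt=1, srpt_inh=1)
```

In this model the new request is blocked by the mere presence of SRPT, even though the snoop response itself is suppressed, so `stalled` is `None` and the pipeline slot goes unused.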

In contrast, according to the present embodiment, as shown in FIG. 6A, when the snoop response port signal SRPT, the snoop response port suppression signal SRPT_INH, and the new request port signal RQPT are "1" and the other signals LDPT and OSPT and the suppression signals LDPT_INH, OSPT_INH, and RQPT_INH are "0", the selection circuit 204 of the pipeline selection unit <1> does not select any request. In the present embodiment, however, the pipeline selection unit <2> performs the selection between new requests and requests other than new requests, so the selection circuit 207 of the pipeline selection unit <2> can select and output the request of the new request port.
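The two-stage selection can be sketched as follows. This is only an illustration under stated assumptions: the first stage keeps the same inverted-input priority logic as the conventional circuit for the three non-new ports, and the second stage gives way to whatever the first stage actually selected and otherwise admits an unsuppressed new request. Pipeline stage timing between the two units is ignored, and the function names `select_stage1` and `select_stage2` are hypothetical.

```python
def select_stage1(srpt, ldpt, ospt, srpt_inh=0, ldpt_inh=0, ospt_inh=0):
    """Pipeline selection unit <1> (selection circuit 204), simplified.

    Chooses among requests other than new requests. Assumes suppression
    and higher-priority signals are inverted at the AND inputs.
    """
    if srpt and not srpt_inh:
        return "snoop_response"
    if ldpt and not ldpt_inh and not srpt:
        return "load_buffer"
    if ospt and not ospt_inh and not srpt and not ldpt:
        return "external_snoop"
    return None


def select_stage2(stage1_choice, rqpt, rqpt_inh=0):
    """Pipeline selection unit <2> (selection circuit 207), simplified.

    Assumption: a request selected by unit <1> wins; otherwise an
    unsuppressed new request is output to the pipeline.
    """
    if stage1_choice is not None:
        return stage1_choice
    if rqpt and not rqpt_inh:
        return "new_request"
    return None


# Scenario from the text: SRPT = SRPT_INH = RQPT = 1, everything else 0.
stage1 = select_stage1(srpt=1, ldpt=0, ospt=0, srpt_inh=1)
issued = select_stage2(stage1, rqpt=1)
```

Because the new request is compared against the result of unit <1> rather than against the raw SRPT signal, `stage1` is `None` while `issued` is `"new_request"`.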

That is, when requests from the cores arrive in the order snoop response port → load buffer port → snoop response port and new request port (simultaneously), a conventional processor puts them into the pipeline in the order snoop response port → load buffer port → (no instruction) → snoop response port → new request port, as shown in FIG. 7B. In contrast, according to the present embodiment, as shown in FIG. 7A, they can be put into the pipeline in the order snoop response port → load buffer port → new request port → snoop response port, so pipeline slots are used efficiently and throughput can be improved.
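The two issue orders can be compared with a small cycle-by-cycle sketch. It assumes one pipeline slot per cycle, that the snoop response port is suppressed only in the cycle where it arrives together with the new request, and that a lower-priority port is blocked whenever a higher-priority port has a pending request (suppressed or not), except for the new request port in the two-stage case. The port labels `SR`, `LD`, `OS`, `NR` and the function `simulate` are hypothetical.

```python
PRIORITY = ["SR", "LD", "OS", "NR"]  # snoop response, load buffer, external snoop, new request


def simulate(arrivals, inhibits, two_stage, max_cycles=16):
    """Return the per-cycle issue order; "-" marks an unused pipeline slot.

    arrivals: {cycle: [ports whose request arrives in that cycle]}
    inhibits: {cycle: {ports suppressed in that cycle}}
    two_stage: True models selection units <1>/<2>, False the conventional unit.
    """
    pending = {port: 0 for port in PRIORITY}
    remaining = sum(len(ports) for ports in arrivals.values())
    issued = []
    cycle = 0
    while remaining and cycle < max_cycles:
        for port in arrivals.get(cycle, ()):
            pending[port] += 1
        inhibited = inhibits.get(cycle, set())
        choice = None
        for i, port in enumerate(PRIORITY):
            if not pending[port] or port in inhibited:
                continue
            if two_stage and port == "NR":
                # Unit <2>: reached only when unit <1> selected nothing.
                choice = port
                break
            if any(pending[higher] for higher in PRIORITY[:i]):
                continue  # blocked by a pending higher-priority port
            choice = port
            break
        if choice is None:
            issued.append("-")
        else:
            pending[choice] -= 1
            remaining -= 1
            issued.append(choice)
        cycle += 1
    return issued


arrivals = {0: ["SR"], 1: ["LD"], 2: ["SR", "NR"]}
inhibits = {2: {"SR"}}  # assumed: SRPT_INH is 1 only in cycle 2
conventional = simulate(arrivals, inhibits, two_stage=False)
embodiment = simulate(arrivals, inhibits, two_stage=True)
```

Under these assumptions the conventional order contains an empty slot and takes five cycles, while the two-stage order fills that slot with the new request and finishes in four.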

The above description showed an example having eight cores (core <0> to core <7>) and a two-bank LL cache memory shared by those cores, but the present invention is not limited to this. The number of cores in the processor and the number of banks of the LL cache memory in the present embodiment may each be any number of two or more. For example, as illustrated in FIG. 8, the invention is also applicable to a processor having eight cores (core <0> to core <7>) and a four-bank LL cache memory shared by the eight cores.

FIG. 8 is a diagram showing another example of the layout of a processor as the arithmetic processing device according to the present embodiment. In FIG. 8, components having the same functions as those shown in FIG. 1 are denoted by the same reference numerals, and redundant description is omitted.

In the example shown in FIG. 8, the LL cache tag unit 18-1 and the LL cache data unit 19-1 of the LL cache memory constitute one bank, and the LL cache tag unit 18-2 and the LL cache data unit 19-2 constitute another bank. Likewise, the LL cache tag unit 18-3 and the LL cache data unit 19-3 constitute one bank, and the LL cache tag unit 18-4 and the LL cache data unit 19-4 constitute another bank.

In addition, new request ports 16-1, 16-2, 16-3, and 16-4 and pipeline selection units <2> 17-1, 17-2, 17-3, and 17-4 are provided so as to correspond to the respective banks of the LL cache memory, and are placed closer to the LL cache tag units 18-1, 18-2, 18-3, and 18-4 than the pipeline selection unit <1> 15 is. In this way, the new request ports and the pipeline selection units <2> need only be divided and placed adjacent to the respective banks of the LL cache memory. As a result, for example as indicated by the broken line in FIG. 8, when data requested by a core 11 hits in the LL cache memory bank close to that core, the path along which the signal travels becomes shorter, and the access latency of the LL cache memory can be reduced.

According to the present embodiment, the pipeline selection is divided into the pipeline selection unit <1> and the pipeline selection unit <2>, and the new request port and the pipeline selection unit <2> are placed close to the LL cache memory. This shortens the physical distance of the path for new data requests, which are the most frequent of the requests from the cores to the LL cache memory. The access latency of the LL cache memory is thereby improved, and the processing performance of the processor as a whole can be improved.

Each of the above embodiments is merely one concrete example of carrying out the present invention, and the technical scope of the present invention must not be construed as limited by them. That is, the present invention can be implemented in various forms without departing from its technical idea or its main features.

11 Core
12 Transfer bus
13 Various ports
14 Pipeline control unit
15 Pipeline selection unit <1>
16 New request port
17 Pipeline selection unit <2>
18 LL (last level) cache tag unit
19 LL (last level) cache data unit
20 External port

Claims

1. An arithmetic processing device comprising:
a plurality of arithmetic units;
a cache memory having a plurality of banks and shared by the plurality of arithmetic units;
a first selection unit that selects a request to be output from among requests to the cache memory other than a first request from the arithmetic units; and
a second selection unit that is arranged, for each bank of the cache memory, at a position closer to the cache memory than the first selection unit, selects a request from among the first request and the request selected by the first selection unit, and outputs the selected request to a pipeline related to access to the cache memory.

2. The arithmetic processing device according to claim 1, further comprising a first port unit that is arranged, together with the second selection unit, for each bank of the cache memory at a position closer to the cache memory than the first selection unit, and that receives the first request, the first port unit being different from a port unit that receives requests to the cache memory other than the first request.

3. The arithmetic processing device according to claim 1, wherein the first request is a request having a lower priority than the requests selected by the first selection unit.

4. An arithmetic processing device comprising:
a plurality of arithmetic units;
a cache memory having a plurality of banks and shared by the plurality of arithmetic units;
a first selection unit that selects a request to be output from among requests to the cache memory other than a first request from the arithmetic units; and
a second selection unit that, for each bank of the cache memory, selects a request from among the first request and the request selected by the first selection unit, and outputs the selected request to a pipeline related to access to the cache memory.

5. The arithmetic processing device, wherein a first port unit that receives the first request, different from a port unit that receives requests to the cache memory other than the first request, is provided for each bank of the cache memory.

6. The arithmetic processing device according to claim 1, wherein the first request is a request for new data from the arithmetic unit.

7. A control method for an arithmetic processing device having a plurality of arithmetic units and a cache memory having a plurality of banks and shared by the plurality of arithmetic units, the method comprising:
selecting, by a first selection unit of the arithmetic processing device, a request to be output from among requests to the cache memory other than a first request from the arithmetic units; and
selecting, by a second selection unit of the arithmetic processing device provided for each bank of the cache memory, a request from among the first request and the request selected by the first selection unit, and outputting the selected request to a pipeline related to access to the cache memory.