JP4532931B2 - Processor and prefetch control method

Info

Publication number: JP4532931B2
Application number: JP2004049348A
Authority: JP
Inventors: 昭宏松居; 秀貴青木; 直伸助川; 雅尋處
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2004-02-25
Filing date: 2004-02-25
Publication date: 2010-08-25
Anticipated expiration: 2024-02-25
Also published as: JP2005242527A

Description

The present invention relates to a processor and a prefetch control method that prefetch data from memory into a cache, and in particular to a processor and a prefetch control method suited to memory accesses made up of multiple data streams that include stride accesses.

In computer systems, memory performance has improved more slowly than processor performance, so the gap between the two widens every year. It has therefore become standard practice to build a cache memory into the processor to hide slow memory. A cache, however, relies on the temporal and spatial locality of data; for memory access patterns without locality it may be ineffective, and performance degrades severely. This happens when array data is accessed sequentially and data used in an earlier stage of the computation is rarely reused, as is common in large-scale scientific and engineering computation.

The conventional countermeasure has been to have a compiler or assembler generate code containing prefetch instructions that transfer data from memory into the cache memory ahead of time. However, when a data array is accessed through a list, or when the program is written in an object-oriented language, it is often impossible to generate code with prefetch instructions even if the memory access pattern is sequential.

Hardware prefetching of the kind shown in Patent Document 1 below, on the other hand, issues prefetch requests uniformly to all streams when access streams over contiguous addresses are mixed with access streams over non-contiguous addresses. For a non-contiguous access stream, the load request may then be issued before the prefetched data reaches the cache, so the latency of the transfer from memory is not fully hidden and prefetching does not work effectively. One conceivable remedy, as in Patent Document 1, is to prefetch an address sufficiently far ahead or to issue several prefetches at once. But this puts pressure on memory bandwidth, which is not unlimited, and leaves a large number of excess prefetches outstanding when the loop ends (overrun).

Patent Document 2 below describes a technique that makes prefetching more effective for non-contiguous addresses by predicting the next address from the difference between the previously accessed address and the currently accessed address and issuing a prefetch request for it. However, reserving storage for the address history of every data access stream increases chip area.

Patent Document 1: JP 09-146835 A

Patent Document 2: JP 06-028180 A

A processor with a cache memory schedules instructions on the assumption of the cache's short access latency, so a cache miss reduces performance significantly. Most processors therefore implement prefetching in hardware or software, transferring data into the cache early enough to hide the memory access latency. With software prefetching, a compiler, an assembler, or a programmer can place prefetch instructions efficiently in the code, but the extra instructions add overhead.

Hardware prefetching usually provides a mechanism inside the processor that can issue prefetch requests for multiple data streams. Consider, however, a stream that accesses non-contiguous memory addresses, for example a stride access in which the array index advances by a constant step, as in A(0), A(4), A(8), A(12), .... As the loop iterates, the load address of such a stream increases (or decreases) faster than that of a sequential access stream. If prefetch requests are issued only as far ahead as for a sequential stream, the latency on the non-contiguous stream cannot be hidden.

Issuing prefetch requests sufficiently far ahead for every stream is conceivable but not practical. Especially when there are many streams, the prefetch traffic consumes the memory bandwidth, and a correspondingly large cache is required. In addition, surplus prefetch requests remain after the loop ends, so data belonging to the previous loop is still transferred from memory even though the next loop has started and new streams have been created (overrun prefetch).

The present invention was made to solve the problems above. Its object is to provide a processor with excellent memory performance that prefetches in hardware, without the overhead of additional instructions, and varies the prefetch target address of each data stream according to how fast that stream's requested addresses increase or decrease.

When the hardware-prefetching processor of the present invention detects a data stream in which sequential accesses are expected to continue, it stores in a data stream management table the address the stream is currently accessing, together with either the frequency of cache misses that occurred in the stream or the frequency with which the cache line was updated. Based on the frequency information of all data streams currently held in the table, the hardware automatically sets the optimal prefetch target address for each stream and issues a prefetch request. That is, the prefetch offset is increased or decreased according to the cache-miss frequency or the cache-line update frequency. The address obtained by adding the prefetch offset to the memory address being accessed becomes the prefetch target address, and a prefetch from memory into the cache memory is requested for it.

According to the present invention, it is possible to provide a processor with excellent memory performance that prefetches in hardware, without the overhead of additional instructions, and varies the prefetch target address according to how fast the requested addresses of each data stream increase or decrease.

Embodiments of the present invention are described below with reference to FIGS. 1 to 13.

[Embodiment 1]
A first embodiment of the present invention is described below with reference to FIGS. 1 to 6.
First, the configuration and operation of a system using the processor of the first embodiment are described with reference to FIG. 1.
FIG. 1 is a configuration diagram of a system using the processor according to the first embodiment of the present invention.

The computer system 300 of this embodiment has a processor 310 and a memory 240.

The processor 310 has a load/store unit 10, a cache memory 150 (also called simply the "cache" in the figures and the description below), and a data stream management table 320. The data stream management table 320 holds information about data streams and is used to control prefetching. In this embodiment the cache memory 150 is drawn inside the processor, but the invention is equally applicable to a cache provided outside the processor.

In this embodiment, the cache memory 150 is managed in units of lines. A line is a group of several words. Data is transferred from memory to the cache one line at a time, and addresses exchanged between the load/store unit and the cache are cache line addresses, which consist of some of the upper bits of the memory address.
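As a minimal sketch of how a cache line address can be derived from the upper bits of an address, the Python below drops the low-order bits that select a word within a line. The function name `cache_line_address`, the use of word addresses, and the default of 4 words per line (the value assumed in the worked example later in the description) are illustrative assumptions, not part of the patent text.

```python
WORDS_PER_LINE = 4  # assumption: the 4-word line used in the worked example


def cache_line_address(word_address: int, words_per_line: int = WORDS_PER_LINE) -> int:
    """Return the cache line address for a word address.

    The low-order bits select a word inside the line, so the line
    address is simply the remaining upper bits.
    """
    if words_per_line <= 0 or words_per_line & (words_per_line - 1):
        raise ValueError("words_per_line must be a power of two")
    shift = words_per_line.bit_length() - 1  # log2 of the line size in words
    return word_address >> shift
```

With 4-word lines, word addresses 0 to 3 share line address 0 and word address 4 starts line address 1.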

The data stream management table 320 has multiple entries, each holding information about a data stream for which hardware prefetches are issued. FIG. 1 and the description below use two entries, but more entries are preferable from a performance standpoint.

Entry 530 of the data stream management table 320 has registers 50, 400, 410, and 90; entry 535 has registers 55, 405, 415, and 95.

Registers 50 and 55 hold the most recent load/store target cache line address of the data stream they track, and are updated each time a load/store request accesses a memory area with a new cache line address.

Registers 400 and 405 hold, as a sign, whether the addresses of the tracked data stream are advancing in the positive or the negative direction. This is also the direction in which load/store requests are issued for the stream.

Registers 410 and 415 hold whether the entry's data belongs to a currently valid data stream. While an entry is judged valid, its information is not discarded. An entry judged invalid is discarded and reused to store information about a new data stream. An entry is judged invalid when it has not been updated for a certain period, and the initialization mechanism 550 records this judgment in registers 410 and 415. The details are described later.

Registers 90 and 95 hold the number of cache misses that have occurred in each data stream.
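The four registers of each entry can be pictured as one record per data stream. The following Python is only an illustrative software model of that layout; the class name `StreamEntry`, the field names, and the helper `make_table` are assumptions introduced here, with the corresponding register numbers noted in comments.

```python
from dataclasses import dataclass


@dataclass
class StreamEntry:
    """Software model of one entry of the data stream management table."""

    line_addr: int = 0    # registers 50 / 55: latest load/store cache line address
    direction: int = 1    # registers 400 / 405: +1 (increasing) or -1 (decreasing)
    valid: bool = False   # registers 410 / 415: entry tracks a live stream
    miss_count: int = 0   # registers 90 / 95: cache misses seen in this stream


def make_table(num_entries: int = 2) -> list[StreamEntry]:
    """Build a table with independent, initially invalid entries."""
    return [StreamEntry() for _ in range(num_entries)]
```

The default of two entries mirrors entries 530 and 535; a larger `num_entries` corresponds to the remark that more entries are preferable for performance.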

The operation of the processor 310 of this embodiment is described below.

First, the process of storing a cache line address in registers 50 and 55 is described.

In the processor 310, when the load/store unit 10 issues a load request to the cache 150 over data line 20, the target cache line address of that request is simultaneously passed over data line 30 to the control mechanism 60, which detects data streams.

The description below uses load requests as the example, but the same applies to store requests.

The control mechanism 60 compares the target cache line address with the previous target cache line addresses of the data streams held in the data stream management table 320, which arrive over data lines 350 and 360. When it recognizes that the load request belongs to one of the held data streams, it acts to update that stream's entry with the load target cache line address received over data line 30. Specifically, it tells the selector 430, over data line 75, the entry number of whichever of the two held data streams the target cache line address belongs to.

The control mechanism 60 also has a data stream detection function. When it recognizes that the load target cache line address received over data line 30 belongs to a new data stream, it signals the new data stream on data line 75. A data stream is detected, for example, when load/store requests are issued in succession to given cache lines.

Furthermore, when a data stream is detected, the control mechanism 60 determines whether the stream is advancing in the positive or the negative direction, that is, whether the addresses accessed in the stream are increasing or decreasing, by taking the difference between the previous address and the newly requested address. It records the direction in register 400 or 405 over data line 450 or 455.

The selector 430 replaces the cache line address of the data stream held in the entry whose number arrives on data line 75 with the new address: it passes the new address from data line 30 to register 50 or 55 over data line 630 or 635. When data line 75 signals detection of a new data stream, the selector checks the valid bits of entries 530 and 535 over data lines 600 and 605 and passes the new address to data line 630 or 635 so that it replaces the cache line address of an entry whose valid bit is 0 (invalid). If the valid bits of both entries are 0, the data stream is registered in entry 530. If both valid bits are 1 (valid), registration of the data stream is abandoned; that is, when both entries are in use, information about the new data stream is not registered.
The data stream management table of this embodiment is thus updated on the basis of the valid bits, but another algorithm, such as LRU (Least Recently Used), may be used instead.
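The registration rule for a new data stream can be sketched as a small function over the valid bits. This is a hedged software illustration, not the circuit itself: the name `choose_entry_for_new_stream` is hypothetical, and index 0 stands for entry 530 while index 1 stands for entry 535.

```python
from typing import Optional, Sequence


def choose_entry_for_new_stream(valid_bits: Sequence[bool]) -> Optional[int]:
    """Pick the entry that will hold a newly detected data stream.

    Returns the index of the first entry whose valid bit is 0, so that
    entry 530 (index 0) wins when both entries are free. Returns None
    when every entry is in use, meaning the new stream is not registered.
    """
    for index, valid in enumerate(valid_bits):
        if not valid:
            return index
    return None
```

An LRU policy, which the description mentions as an alternative, would instead evict the least recently updated entry rather than returning None.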

Next, the operation when a load request causes a cache miss is described.

When the load request sent over data line 20 causes a cache miss, the cache line address that caused the miss is passed to the selector 460 over data line 330. The selector 460 takes the cache-miss count of the data stream corresponding to the entry number on data line 75, which arrives on data line 470 or 475, and passes it to the adder 490 over data line 480.

The adder 490 adds 1 to the cache-miss count received on data line 480 and passes the result, as the new cache-miss count, to the selector 510 over data line 500. Based on the entry number on data line 75, the selector 510 uses data line 520 or 525 to rewrite cache-miss count 90 or 95.

Next, the operation of determining the target address of a hardware prefetch is described.

The cache-miss counts 90 and 95 that the data stream management table 320 holds for each stream are passed to AND circuits 580 and 585 over data lines 470 and 475. Each AND circuit ANDs the count with the valid bit from data line 600 or 605, extended to the bit width of the count, and drives the result onto data line 220 or 225. As a result, whenever the valid bit is 0 (invalid), the cache-miss count is forced to 0.
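The effect of ANDing the count with a width-extended valid bit can be shown in a few lines. This is a sketch under assumptions: the function name `gated_miss_count` and the 8-bit counter width are illustrative, since the description does not state the width of the count.

```python
def gated_miss_count(miss_count: int, valid: bool, width: int = 8) -> int:
    """AND the miss count with the valid bit replicated across `width` bits.

    A valid bit of 1 becomes an all-ones mask and passes the count through;
    a valid bit of 0 becomes an all-zeros mask and forces the result to 0.
    """
    full_mask = (1 << width) - 1        # e.g. 0xFF for an 8-bit counter
    mask = full_mask if valid else 0    # valid bit extended to the counter width
    return miss_count & mask
```

Gating the count this way means an invalid entry contributes nothing to the sum of miss counts used when the prefetch offsets are computed.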

The adder 100 passes the sum of the cache-miss counts on data lines 220 and 225 to data line 210. Dividers 110 and 115 compute a frequency ratio for each stream, taking the cache-miss count on data line 220 or 225 as the dividend and the sum of the cache-miss counts of all streams on data line 210 as the divisor. Multipliers 120 and 125 multiply together the frequency ratio of each stream (data lines 610 and 615), the sign representing each stream's direction (data lines 620 and 625), and the maximum prefetch offset read from the single-stream maximum prefetch offset register 800. They truncate the fractional part and pass the results to data lines 160 and 165.
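The adder, dividers, and multipliers together compute, for each stream, (its miss count / total miss count) x direction sign x maximum offset, truncated to an integer. The sketch below models that in Python with exact integer arithmetic. The function name `prefetch_offsets` is hypothetical, and two points are assumptions because the description leaves them open: a total of zero misses yields an offset of 0, and truncation is applied to the magnitude before the sign.

```python
from typing import List, Sequence


def prefetch_offsets(
    miss_counts: Sequence[int],
    directions: Sequence[int],
    max_offset: int,
) -> List[int]:
    """Compute a signed prefetch offset (in cache lines) for each stream.

    miss_counts: gated cache-miss counts (0 for invalid entries)
    directions:  +1 or -1 per stream
    max_offset:  maximum prefetch offset when only one stream exists
    """
    total = sum(miss_counts)
    offsets = []
    for misses, direction in zip(miss_counts, directions):
        if total == 0:
            magnitude = 0  # assumption: no misses recorded yet, so no offset
        else:
            # Integer division truncates the fractional part of
            # (misses / total) * max_offset without floating-point error.
            magnitude = (misses * max_offset) // total
        offsets.append(direction * magnitude)
    return offsets
```

For example, with a maximum offset of 6 and hypothetical miss counts of 1 and 4, the two streams receive offsets of 1 and 4 cache lines, so the stream that misses more often is prefetched further ahead. A single valid stream receives the full maximum offset.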

The maximum prefetch offset register 800 may be non-rewritable memory such as ROM, or rewritable memory. An optimal value for the maximum prefetch offset can be chosen according to the performance of the processor and memory and the program to be run.

Adders 130 and 135 add the values on data lines 160 and 165, which are the prefetch offsets of the respective data streams, to the cache line addresses on data lines 350 and 360, and pass the results to data lines 230 and 235. The cache 150 treats the cache line addresses on data lines 230 and 235 as the target cache line addresses of the prefetch requests for the respective data streams and, whenever a prefetch target cache line address is updated, issues a hardware prefetch over data line 260. The cache 150 receives the requested data from the memory 240 over data line 140 and passes it to the load/store unit 10 over data line 200.
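The final step, adding the offset to the stream's current cache line address and issuing a prefetch only when the target changes, can be modeled as follows. The class name `PrefetchIssuer` and its `update` method are illustrative assumptions; the model tracks one stream and returns the line to prefetch, or None when the target is unchanged.

```python
from typing import Optional


class PrefetchIssuer:
    """Per-stream model of the prefetch target computation and issue trigger."""

    def __init__(self) -> None:
        self.last_target: Optional[int] = None

    def update(self, line_addr: int, offset: int) -> Optional[int]:
        """Return the cache line to prefetch, or None if nothing new to issue.

        The target is the current cache line address plus the signed
        prefetch offset; a prefetch is issued only when that target changes.
        """
        target = line_addr + offset
        if target == self.last_target:
            return None
        self.last_target = target
        return target
```

Issuing only on a change of target avoids requesting the same line repeatedly while the stream is still reading words within one cache line.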

Next, the detailed configuration and operation of the control mechanism 60 are described with reference to FIG. 2.
FIG. 2 is a configuration diagram of the control mechanism 60 according to the first embodiment of the present invention.

In the control mechanism 60, subtractors 820 and 830 compute the difference between the new load target cache line address on data line 30 and the previous target cache line addresses of the held data streams on data lines 350 and 360, and pass the results to data lines 840 and 850. When the comparator 860 finds that the value on data line 840 or 850 is plus 1 or minus 1, it signals on data line 800 whether the value is plus or minus, and signals on data line 810 which of data lines 840 and 850 carried the plus or minus 1.

This operation of the control mechanism 60 treats a change of plus or minus 1 in the cache line address as a continuation of the same data stream.
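The subtract-and-compare step can be sketched as a function that returns which held stream a new cache line address continues, and in which direction. The name `match_stream` is hypothetical, and preferring the lower-numbered entry when both differences are plus or minus 1 is an assumption, since the description does not say how such a tie is resolved.

```python
from typing import Optional, Sequence, Tuple


def match_stream(new_line: int, held_lines: Sequence[int]) -> Optional[Tuple[int, int]]:
    """Find the held stream that the new cache line address continues.

    Returns (entry_index, sign) where sign is +1 or -1, or None when the
    new address is not adjacent to any held stream's previous address.
    """
    for entry, held in enumerate(held_lines):
        diff = new_line - held          # role of subtractors 820 / 830
        if diff in (1, -1):             # role of comparator 860
            return entry, diff
    return None
```

An access to the same cache line as before (difference 0) matches nothing here, which is consistent with the entry being updated only when a new cache line address is accessed.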

The control mechanism 60 also has a data stream detection mechanism 870. When it detects that the cache line address on data line 30 belongs to a new data stream, it signals this on data line 880.

The present invention does not depend on the method of data stream detection; in this embodiment, as typified by Patent Document 1, an access to consecutive cache lines is detected as a stream. The control mechanism 60 combines data lines 810 and 880 and outputs them on data line 75. The selector 890 passes the sign data from data line 800, which indicates plus or minus, to data line 450 or 455 so as to rewrite the sign register of the entry indicated by data line 810.

Next, the detailed configuration and operation of the initialization mechanism 550 are described with reference to FIG. 3.

FIG. 3 is a configuration diagram of the initialization mechanism 550 according to the first embodiment of the present invention.

The initialization mechanism initializes the data stream management table 320 when a new data stream is detected and when the load/store target cache line address of a held data stream has not been updated for a certain period.

In the initialization mechanism 550, when detection of a new data stream is signalled from data line 75 to the comparator 1260, the comparator reads the valid bits on data lines 600 and 605 and passes to data line 1280 the entry number of whichever of entries 530 and 535 has a valid bit of 0. If the valid bits of both entries are 0, it signals on data line 1280 that there is no target entry; if both are 1, it passes the entry number of entry 530 to data line 1280.

In the initialization mechanism 550, the counter 1240 counts clock cycles. When it has counted an implementation-defined number of cycles, it notifies the comparator 1230, returns to 0, and starts counting again. When the output of data line 75 indicates that the load/store target cache line address of the data stream in entry 530 or 535 has been updated, the comparator 1230 sets register 1220 to 1 over data line 1300 (for entry 530) or register 1225 to 1 over data line 1305 (for entry 535). When the counter signals that the predefined number of cycles has elapsed, the comparator 1230 examines the inputs on data lines 1290 and 1295, passes the entry numbers whose input value is 0 to data line 1270, and resets registers 1220 and 1225 to 0 over data lines 1300 and 1305.

Based on the entry numbers received from data lines 1270 and 1280, the selector 1250 resets the valid bit and the cache-miss count of each such entry to 0, using data line 1200 for entry 530 and data line 1205 for entry 535.

As a result, an entry whose cache line address in register 50 or 55 of the data stream management table 320 has not changed for a certain period has 0 stored in its valid bit (register 410 or 415) and becomes available to hold information about another data stream.
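The periodic aging of entries can be modeled with one "updated since the last check" flag per entry. The Python below is an illustrative sketch only: the class name `AgingMonitor` and its methods are assumptions, the flags stand in for registers 1220 and 1225, and a call to `period_elapsed` stands in for the counter 1240 reaching its predefined cycle count.

```python
from typing import List


class AgingMonitor:
    """Model of the stale-entry check performed by the initialization mechanism."""

    def __init__(self, num_entries: int = 2) -> None:
        # One flag per entry: True if the entry's address was updated
        # since the last period boundary (registers 1220 / 1225).
        self.touched: List[bool] = [False] * num_entries

    def note_update(self, entry: int) -> None:
        """Record that the entry's cache line address was updated."""
        self.touched[entry] = True

    def period_elapsed(self) -> List[int]:
        """Return the entries to invalidate, then clear all flags."""
        stale = [i for i, flag in enumerate(self.touched) if not flag]
        self.touched = [False] * len(self.touched)
        return stale
```

The caller would reset the valid bit and the cache-miss count of every entry number returned by `period_elapsed`, which is the role the description gives to the selector 1250.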

Next, the effect of this embodiment is described with reference to FIGS. 4 to 6.
FIG. 4 shows memory accesses under the prefetch control of the first embodiment of the present invention.
FIG. 5 shows memory accesses under conventional prefetch control.
FIG. 6 is a table of the cache-miss count and prefetch offset at each stage for the memory accesses of FIG. 4.

FIG. 4 shows how the prefetch offset changes with the cache-miss count so that prefetches are issued at the optimal timing for each stream. For comparison, FIG. 5 shows prefetch requests issued to multiple data streams with a uniform prefetch offset. In FIG. 5 the prefetch offset is fixed at three cache lines.

As a model, consider a loop over a scalar variable i that contains two streams, A(i) and B(i*4). The two horizontal axes represent the memory space of arrays A and B; a short tick is one word, and a long tick is a memory address accessed in some iteration. The number under a tick is the value of i. An × marks an iteration in which a cache miss occurred. Arrows represent prefetch requests: the tail of an arrow marks when the prefetch is issued, and the head points to the start of the target cache line. That is, when a load/store request is issued to the memory address at the tail of the arrow, a prefetch request is made for the location at its head. The length of the arrow is the prefetch offset. A dotted arrow means that no prefetch request is actually issued, because one has already been made for that cache line.

ここで、本実施形態のシステムのメモリにおいては、メモリ２４０からキャッシュ１５０への転送におけるレイテンシは、本ループが全てのロード要求でキャッシュヒットし４度回る時間、すなわち、ｉが４増える時間であるとする。 Here, in the memory of the system of the present embodiment, the latency in the transfer from the memory 240 to the cache 150 is the time for which this loop hits the cache with all load requests and rotates four times, that is, the time i increases by 4. And

また、本実施形態のシステムの既定の１ストリーム時の最大プリフェッチオフセット量は６であるとし、１キャッシュラインは４ワード分であるとする。 In addition, it is assumed that the maximum prefetch offset amount for a predetermined one stream of the system of the present embodiment is 6, and that one cache line is for 4 words.

As shown in FIG. 5, a cache miss occurs when A(0) is accessed. The following accesses A(1) to A(3) hit, because the line has been brought into the cache. A cache miss then occurs at A(4). Because cache misses have now occurred at consecutive cache line addresses, a data stream is detected, and a prefetch request is issued with a prefetch offset of three cache lines. Cache misses likewise occur at A(8) and A(12), but from then on prefetch requests are issued one after another and the data has already been prefetched into the cache, so no further cache misses occur.

However, in the case of B(i*4), the memory addresses accessed by load/store requests advance so quickly that the prefetched area does not cover them.

In this case, a cache miss occurs when B(0) is accessed, and another occurs at B(4) (i = 1). Because cache misses have occurred at consecutive cache line addresses, a data stream is detected, and a prefetch request is issued with a prefetch offset of three cache lines.

Up to this point the behavior is the same as for A. Cache misses also occur at B(8) (i = 2) and B(12) (i = 3). At B(16) (i = 4), the prefetch does not arrive in time because of the memory latency (the time in which i increases by 4), so a cache miss occurs again. In the same way, every access to B results in a cache miss. The cause is that the prefetch offset is too small for B.
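The fixed-offset behavior described above can be reproduced with a small simulation. The following Python sketch is illustrative only: the function name and the timing model are assumptions, not part of the patent. It assumes that misses do not stall the loop, that a demand-fetched line is usable immediately after its miss, and that a stream is detected once two consecutive line addresses have missed.

```python
LINE_WORDS = 4     # one cache line holds four words (from the text)
LATENCY = 4        # memory-to-cache latency in loop iterations (from the text)
FIXED_OFFSET = 3   # uniform prefetch offset in cache lines (from the text)


def simulate_fixed_offset(stride, iterations):
    """Return the iterations i at which the stream X(i*stride) misses.

    Hypothetical model: time is counted in loop iterations only.
    """
    ready_at = {}          # line address -> iteration at which the line is cached
    last_miss_line = None  # line address of the most recent miss
    detected = False       # becomes True once the stream is detected
    misses = []

    for i in range(iterations):
        line = (i * stride) // LINE_WORDS
        arrival = ready_at.get(line)
        if arrival is None or arrival > i:
            # Line is absent or its prefetch is still in flight: cache miss.
            misses.append(i)
            ready_at[line] = i  # demand fetch; later accesses to this line hit
            if last_miss_line is not None and line == last_miss_line + 1:
                detected = True  # misses on consecutive line addresses
            last_miss_line = line
        if detected:
            target = line + FIXED_OFFSET
            if target not in ready_at:  # skip lines that were already requested
                ready_at[target] = i + LATENCY
    return misses


print(simulate_fixed_offset(1, 32))  # stream A(i)
print(simulate_fixed_offset(4, 32))  # stream B(i*4)
```

Under these assumptions, stream A misses only at i = 0, 4, 8, and 12, while stream B misses in every iteration, which matches the description of the fixed three-line offset.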

Therefore, this embodiment makes the prefetch offset variable according to the number of cache misses.

The prefetch offset follows (Equation 1) below.

$$ of_k = \frac{cm_k}{\sum_k cm_k} \qquad \text{(Equation 1)} $$

of_k: prefetch offset of data stream k
cm_k: number of cache misses of data stream k
Σ: sum over all data streams k

As shown in FIGS. 5 and 6, when i = 1 the prefetch offset for array B is calculated as 6 × (1/1), where 6 is the maximum single-stream prefetch offset by which the ratio of (Equation 1) is multiplied. When i = 4, the number of cache misses is 1 for A and 4 for B, so the prefetch offset for array B is calculated as 6 × (4/(1 + 4)) = 4.8, and the fractional part is discarded. That is, the prefetch request is made with a prefetch offset of four cache lines.
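The arithmetic of the worked example can be checked with a few lines of Python. This is a minimal sketch: the function name `prefetch_offsets` is hypothetical, and returning 0 when no misses have been recorded is an assumption that the text does not address.

```python
def prefetch_offsets(miss_counts, max_offset=6):
    """Per-stream prefetch offsets in cache lines.

    Each offset is max_offset * cm_k / sum(cm), with the fractional part
    discarded, following the worked example in the text.
    """
    total = sum(miss_counts)
    if total == 0:
        # Assumption: with no recorded misses, no offset is produced.
        return [0] * len(miss_counts)
    return [max_offset * cm // total for cm in miss_counts]


print(prefetch_offsets([1]))     # a single stream with one miss
print(prefetch_offsets([1, 4]))  # A has 1 miss, B has 4 misses
```

With a single stream the full maximum offset of 6 is used; with miss counts of 1 and 4 the offsets are 1 and 4 cache lines, so B receives the four-line offset stated above.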

In this way, the prefetch offset of B becomes sufficiently large, and no new cache misses occur from i = 8 onward; as FIG. 5 shows, prefetch requests can be issued early enough to hide the latency for all streams. Since the cache miss counts of A and B do not change after i = 8, this prefetch offset is maintained until the end of the loop.

As described above, the processing system of this embodiment can issue prefetches at the optimal timing for each of multiple streams, so memory latency can be hidden for all data streams. Although this embodiment has been described using load requests, the same effect can be expected for store requests with the same technique.

[Embodiment 2]
A second embodiment of the present invention is described below with reference to FIGS. 7 to 10.

The first embodiment described a technique for adjusting the prefetch offset according to the number of cache misses in a data stream. This embodiment adjusts the prefetch offset according to the number of times the cache line address of load/store requests in a data stream is updated. Since the basic idea is the same, the description focuses on the differences from the first embodiment. This embodiment is also described for the case of load requests.

First, the configuration and operation of a system using the processor according to the second embodiment of the present invention are described with reference to FIG. 7.
FIG. 7 is a system configuration diagram using the processor according to the second embodiment of the present invention.

The processor 310 of this embodiment also has a data stream management table 320. As in the first embodiment, the data stream management table 320 holds information about data streams and is used for prefetch control.

The difference from the first embodiment is that registers 90 and 95 hold the number of times the cache line address targeted by the load requests of each data stream has been updated.

As in the first embodiment, the control mechanism 60 writes the cache line address of a data stream to registers 50 and 55. The operation of writing the direction in which the accessed addresses of the data stream advance to registers 400 and 405 is also the same as in the first embodiment.

As in the first embodiment, the initialization mechanism 550 writes the validity of the data stream to registers 90 and 95. Likewise, for a new entry, registers 90 and 95 are reset to 0.

The difference is that when the cache line address targeted by a load request, conveyed on data line 30, is updated, the control mechanism 60 signals this to the selector 460 via data line 330. This operation of the control mechanism 60 is described in detail later.

The selector 460 of this embodiment passes the cache line update count conveyed on data line 470 or data line 475, whichever corresponds to the entry number conveyed on data line 75, to the adder 490 via data line 480. The adder 490 adds 1 to the count received on data line 480 and passes the result, as the new cache line update count, to the selector 510 via data line 500. Based on the entry number conveyed on data line 75, the selector 510 uses either data line 520 or data line 525 to rewrite the cache line update count 90 or 95.
Next, the operation of determining the target address of a hardware prefetch in this embodiment is described.

The cache line update counts 90 and 95 of the streams held in the data stream management table 320 are conveyed to the AND circuits 580 and 585 via data lines 470 and 475, respectively. Each AND circuit takes the AND of the count it receives and the valid bit conveyed on data line 600 or 605, extended to the bit width of the count, and outputs the result on data line 220 or 225. As a result, when the valid bit is 0 (invalid), the cache line update count is always cleared to 0.

The adder 100 outputs the sum of the cache line update counts received on data lines 220 and 225 to data line 210. The dividers 110 and 115 each compute a frequency ratio whose dividend is the cache line update count received on data line 220 or 225 and whose divisor is the sum of the cache line update counts of all streams received on data line 210. The multipliers 120 and 125 each multiply together the frequency ratio of the stream conveyed on data line 610 or 615, the sign representing the direction of the stream conveyed on data line 620 or 625, and the maximum prefetch offset read from the single-stream maximum prefetch offset register 800, discard the fractional part, and output the result on data line 160 or 165.
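The datapath just described (valid-bit masking, summation, division, and signed multiplication with truncation) can be modeled in software as follows. This is a rough sketch, not the circuit itself: the `StreamEntry` type, the encoding of the direction as +1 or -1, and the handling of a zero total are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class StreamEntry:
    update_count: int  # cache line update count (registers 90 / 95)
    valid: bool        # valid bit of the entry
    direction: int     # assumed encoding: +1 ascending, -1 descending


def datapath_offsets(entries, max_offset=6):
    """Signed prefetch offsets in cache lines, one per table entry."""
    # AND circuits 580/585: an invalid entry contributes a count of 0.
    masked = [e.update_count if e.valid else 0 for e in entries]
    # Adder 100: sum of the masked counts of all streams.
    total = sum(masked)
    offsets = []
    for entry, count in zip(entries, masked):
        # Dividers and multipliers: ratio times maximum offset, truncated.
        # Assumption: a zero total yields a zero offset.
        magnitude = max_offset * count // total if total else 0
        offsets.append(entry.direction * magnitude)
    return offsets


table = [StreamEntry(1, True, +1), StreamEntry(4, True, -1)]
print(datapath_offsets(table))
```

The magnitude is truncated before the sign is applied, so a descending stream receives the same number of cache lines as an ascending one with the same count.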

Thereafter, the operation in which the cache 150 prefetches from the memory 240 using this prefetch offset is the same as in the first embodiment.

Next, the detailed configuration and operation of the control mechanism 60 are described with reference to FIG. 8.
FIG. 8 is a configuration diagram of the control mechanism 60 according to the second embodiment of the present invention.

The control mechanism 60 of this embodiment differs from that of the first embodiment in that it signals on data line 330 that the cache line has been updated. In response, the selector 460 performs the processing that increments the cache line update count in the data stream management table.

The other functions and operations, including the data stream detection mechanism, are the same as in the first embodiment.

Next, the effect of this embodiment is described with reference to FIGS. 9 and 10.
FIG. 9 is a diagram showing memory accesses under prefetch control according to the second embodiment of the present invention.
FIG. 10 is a table showing the number of cache misses and the prefetch offset at each stage of the memory access in FIG. 9.

The model using arrays A and B, the memory latency, the maximum prefetch offset, and the notation in FIG. 9 are the same as in the first embodiment.

Therefore, this embodiment makes the prefetch offset variable according to the number of cache line address updates.

The prefetch offset follows (Equation 2) below.

$$ of_k = \frac{lau_k}{\sum_k lau_k} \qquad \text{(Equation 2)} $$

of_k: prefetch offset of data stream k
lau_k: number of cache line address updates of data stream k
Σ: sum over all data streams k

As shown in FIGS. 9 and 10, when i = 1 the prefetch offset for array B is calculated as 6 × (1/1), where 6 is the maximum single-stream prefetch offset by which the ratio of (Equation 2) is multiplied. When i = 4, the number of cache misses is 1 for A and 4 for B, so the prefetch offset for array B is calculated as 6 × (4/(1 + 4)) = 4.8, and the fractional part is discarded. That is, the prefetch request is made with a prefetch offset of four cache lines.
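As an illustration of (Equation 2), the Python sketch below counts cache line address updates for the model loop and derives the offsets from them. The function names are hypothetical, and treating the first access of a stream as "no update" is an assumption. Counting line address changes over i = 0 to 4 gives 1 for A and 4 for B, the same values used in the worked example.

```python
LINE_WORDS = 4  # one cache line holds four words (from the text)


def count_line_updates(word_addresses):
    """Count how often the cache line address changes between accesses."""
    updates = 0
    previous_line = None
    for addr in word_addresses:
        line = addr // LINE_WORDS
        # Assumption: the very first access does not count as an update.
        if previous_line is not None and line != previous_line:
            updates += 1
        previous_line = line
    return updates


def offsets_from_updates(update_counts, max_offset=6):
    """max_offset * lau_k / sum(lau), fractional part discarded."""
    total = sum(update_counts)
    if total == 0:
        return [0] * len(update_counts)  # assumption: no updates, no offset
    return [max_offset * lau // total for lau in update_counts]


a_updates = count_line_updates([i for i in range(5)])      # A(0) .. A(4)
b_updates = count_line_updates([i * 4 for i in range(5)])  # B(0) .. B(16)
print(a_updates, b_updates)
print(offsets_from_updates([a_updates, b_updates]))
```

Because the counter advances on every line address change rather than only on misses, it keeps reflecting the relative speed of each stream even after prefetching has removed the misses.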

In this way, the prefetch offset of B becomes sufficiently large, and no new cache misses occur from i = 16 onward; as FIG. 9 shows, prefetch requests can be issued early enough to hide the latency for all streams. Thus, adjusting the prefetch address according to the number of cache line address updates also provides a processor with improved memory access performance, free of cache misses.

[Embodiment 3]
Next, a third embodiment of the present invention is described with reference to FIG. 11.
FIG. 11 is a diagram showing part of the configuration of the data stream management table according to the third embodiment of the present invention.

This embodiment describes a mechanism that, when the most significant bit of register 90 or 95 becomes set, shifts the register values one bit toward the low-order end, making it possible to reduce the amount of information while preserving the frequency ratio as far as possible.

If this third embodiment is added to the first embodiment, the prefetch offsets obtained from Equations 1 and 2 are the same, and registers 90 and 95 do not overflow. It is particularly effective when the cache line update count is used as the frequency information, as in the second embodiment, because registers 90 and 95 in FIG. 7, which hold the line update counts, may not be given sufficient capacity.

When the most significant bit 1100 shown in FIG. 11 becomes 1, this is conveyed to the shift circuit 1130 via data line 1120 and to the shift circuit 1135 via data line 1180. Similarly, when the most significant bit 1105 becomes 1, this is conveyed to the shift circuit 1130 via data line 1125 and to the shift circuit 1135 via data line 1185. When the shift circuit 1130 is informed via data line 1120 or data line 1125 that the most significant bit 1100 or 1105 has become 1, it shifts the frequency information 530 received on data line 1140 one bit toward the low-order end, outputs the result on data line 1150, and updates the frequency information held in register 530. Similarly, when the shift circuit 1135 is informed via data line 1180 or data line 1185 that the most significant bit 1100 or 1105 has become 1, it shifts the frequency information 535 received on data line 1145 one bit toward the low-order end, outputs the result on data line 1155, and updates the frequency information held in register 535. Shifting one bit toward the low-order end halves the value.
With the above mechanism, the amount of information can be reduced while preserving the frequency ratio as far as possible.
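The halving scheme can be sketched in Python as follows. The function name and the 4-bit register width are assumptions chosen to keep the example small; the text does not state a width.

```python
def increment_with_halving(counts, index, width=4):
    """Increment one frequency counter; if any counter's most significant
    bit becomes set, shift every counter right by one bit.

    `width` is the assumed register width in bits.
    """
    counts = list(counts)
    counts[index] += 1
    msb = 1 << (width - 1)
    if any(c & msb for c in counts):
        # All counters are halved together, so their ratio is roughly kept.
        counts = [c >> 1 for c in counts]
    return counts


print(increment_with_halving([7, 3], 0))  # 7 + 1 sets the MSB of a 4-bit value
print(increment_with_halving([2, 3], 1))  # no MSB set, counters only incremented
```

In the first call the counters go from 8 and 3 to 4 and 1: the ratio changes slightly because the low-order bit is lost, which is why the text speaks of preserving the ratio only as far as possible.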

[Embodiment 4]
Next, a fourth embodiment of the present invention is described with reference to FIGS. 12 and 13.
FIG. 12 is a diagram showing part of the configuration of the data stream management table according to the fourth embodiment of the present invention.
FIG. 13 is a timing chart explaining the operation of registers 90 and 95 according to the fourth embodiment of the present invention.

This embodiment adds a function that collects the frequency information held in registers 90 and 95 of the data stream management table 320 of the first or second embodiment as a frequency within a fixed period. By collecting frequency information such as the number of cache misses or the number of cache line updates within a fixed period, appropriate prefetches can be issued flexibly even for data streams whose access address interval changes partway through a loop.

In FIG. 12, the counters 940 and 950 begin counting cycles when the system starts, and initially output 0 on data lines 960 and 970, respectively. With n being a number of cycles predetermined by the processing system, the counter 940 inverts its output at odd multiples of n cycles and the counter 950 inverts its output at even multiples of n cycles. The selectors 920 and 925 pass the inputs from data lines 520 and 525 to data lines 980 and 985, respectively, when the input from data line 960 is 0, and to data lines 1000 and 1005, respectively, when it is 1. The selectors 930 and 935 output the inputs from data lines 990 and 995 on data lines 470 and 475, respectively, when the input from data line 970 is 0, and output the inputs from data lines 1010 and 1015 on data lines 470 and 475, respectively, when it is 1.

The selector 920 resets register 900 to 0 when it changes the write destination from register 900 to register 910, and likewise resets register 910 to 0 when it changes the write destination from register 910 to register 900. Similarly, the selector 925 resets register 905 to 0 when it changes the write destination from register 905 to register 915, and resets register 915 to 0 when it changes the write destination from register 915 to register 905.

Here, in this embodiment, the phases of the write and read cycles are intentionally offset, as shown in FIG. 13. If the write and read cycles were in phase, then each time writing and reading switched, it could take time for the ratio of the frequency information to converge to a steady state in which optimal prefetches can be issued. In this embodiment, as shown in FIG. 13, writing to and reading from registers 900 and 905 are offset by n cycles, so the cache line update counts can be collected within a period of 2n cycles. For example, even for a data stream whose stride changes partway through a loop, or one whose addresses switch from increasing to decreasing partway through a loop, frequency information reflecting the change becomes available after at most 2n cycles.
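The phase relationship between the write and read selections can be illustrated with a short Python sketch. It models only the select signals on data lines 960 and 970, not the registers or their reset behavior. The function names are hypothetical, and two points are assumptions: bank 0 stands for registers 900/905 and bank 1 for registers 910/915, and "even multiples of n" is taken to start at 2n, since both counters begin by outputting 0.

```python
def write_bank(cycle, n):
    """Select signal of counter 940: starts at 0, inverts at n, 3n, 5n, ..."""
    return ((cycle // n + 1) // 2) % 2


def read_bank(cycle, n):
    """Select signal of counter 950: starts at 0, inverts at 2n, 4n, 6n, ..."""
    return ((cycle // n) // 2) % 2


n = 10
for cycle in range(0, 5 * n, n):
    print(cycle, write_bank(cycle, n), read_bank(cycle, n))
```

Each signal has a period of 2n cycles, and the read selection lags the write selection by n cycles, which is the offset between writing and reading described above.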

FIG. 1 is a system configuration diagram using the processor according to the first embodiment of the present invention.
FIG. 2 is a configuration diagram of the control mechanism 60 according to the first embodiment of the present invention.
FIG. 3 is a configuration diagram of the initialization mechanism 550 according to the first embodiment of the present invention.
FIG. 4 is a diagram showing memory accesses under prefetch control according to the first embodiment of the present invention.
FIG. 5 is a diagram showing memory accesses under prefetch control according to the prior art.
FIG. 6 is a table showing the number of cache misses and the prefetch offset at each stage of the memory access in FIG. 4.
FIG. 7 is a system configuration diagram using the processor according to the second embodiment of the present invention.
FIG. 8 is a configuration diagram of the control mechanism 60 according to the second embodiment of the present invention.
FIG. 9 is a diagram showing memory accesses under prefetch control according to the second embodiment of the present invention.
FIG. 10 is a table showing the number of cache misses and the prefetch offset at each stage of the memory access in FIG. 9.
FIG. 11 is a diagram showing part of the configuration of the data stream management table according to the third embodiment of the present invention.
FIG. 12 is a diagram showing part of the configuration of the data stream management table according to the fourth embodiment of the present invention.
FIG. 13 is a timing chart explaining the operation of registers 90 and 95 according to the fourth embodiment of the present invention.

Claims

Means for detecting, from the request destination addresses of load/store instructions issued to a memory, a data stream predicted to be accessed continuously;
Means for issuing to the memory a prefetch request into a cache in response to a load/store instruction request in the data stream; and
A data stream management table that holds the request destination address of the load/store instruction of each data stream and the frequency of cache misses occurring in load/store instructions in each data stream;
the processor being characterized in that it determines the prefetch request address for each data stream according to the frequency of cache misses occurring in load/store instructions in all data streams.

The processor of claim 1, wherein the prefetch request address for each data stream is determined based on the ratio between the total frequency of cache misses occurring in load/store instructions in all data streams and the frequency of cache misses occurring in load/store instructions in each data stream.

The processor according to claim 2, wherein the prefetch request address for each data stream is calculated by:
multiplying a predetermined maximum prefetch offset by a ratio whose denominator is the total frequency of cache misses occurring in load/store instructions in all data streams and whose numerator is the frequency of cache misses occurring in load/store instructions in each data stream, to obtain a prefetch offset; and
taking the sum of the request destination address of the load/store instruction held in the data stream management table and the prefetch offset.

The processor of claim 1, wherein, when the request destination address of the load/store instruction of a data stream held in the data stream management table is not updated for a certain period, that data stream is invalidated, and the entry of the invalidated data stream is used when information on another data stream is to be held in the data stream management table.

The processor of claim 1, wherein the frequency of cache misses occurring in load/store instructions in each data stream is obtained by measurement over the period from when the data stream is detected until the prefetch request address for each data stream is determined, or by measurement over a certain period.

The processor according to claim 1, wherein the frequency of cache misses occurring in load/store instructions in each data stream is obtained by measurement over a certain period, the processor includes a register for writing the frequency and a register for reading the frequency, and the writing of the frequency to the write register and the reading of the frequency from the read register are offset from each other by a predetermined period.

The processor according to claim 1, further comprising means for reducing the frequencies of cache misses occurring in load/store instructions of each data stream, held in the data stream management table, while maintaining their relative ratio between the data streams.

Means for detecting, from the request destination addresses of load/store instructions issued to a memory, a data stream predicted to be accessed continuously;
Means for issuing to the memory a prefetch request into a cache, in units of cache lines, in response to a load/store instruction request in the data stream; and
A data stream management table that holds the line address related to the request destination address of the load/store instruction of each data stream and the frequency at which the line address related to the request destination address of the load/store instruction of each data stream is updated;
the processor being characterized in that it determines the prefetch request address for each data stream according to the total of the frequencies at which the line addresses related to the request destination addresses of the load/store instructions of all data streams are updated and the frequency at which the line address related to the request destination address of the load/store instruction of each data stream is updated.

The processor according to claim 8, wherein the prefetch request address for each data stream is determined based on the ratio between the total of the frequencies at which the line addresses related to the request destination addresses of the load/store instructions of all data streams are updated and the frequency at which the line address related to the request destination address of the load/store instruction of each data stream is updated.

The processor according to claim 9, wherein the prefetch request address for each data stream is calculated by:
multiplying a predetermined maximum prefetch offset by a ratio whose denominator is the total of the frequencies at which the line addresses related to the request destination addresses of the load/store instructions of all data streams are updated and whose numerator is the frequency at which the line address related to the request destination address of the load/store instruction of each data stream is updated, to obtain a prefetch offset; and
taking the sum of the request destination address of the load/store instruction held in the data stream management table and the prefetch offset.

The processor according to claim 8, wherein, when the line address related to the request destination address of the load/store instruction of a data stream held in the data stream management table is not updated for a certain period, that data stream is invalidated, and the entry of the invalidated data stream is used when information on another data stream is to be held in the data stream management table.

The processor according to claim 8, wherein the frequency at which the line address related to the request destination address of the load/store instruction of each data stream is updated is obtained by measurement over the period from when the data stream is detected until the prefetch request address for each data stream is determined, or by measurement over a certain period.

The processor according to claim 8, wherein the frequency at which the line address related to the request destination address of the load/store instruction of each data stream is updated is obtained by measurement over a certain period, the processor includes a register for writing the frequency and a register for reading the frequency, and the writing of the frequency to the write register and the reading of the frequency from the read register are offset from each other by a predetermined period.

The processor according to claim 8, further comprising means for reducing the frequencies, held in the data stream management table, at which the line address related to the request destination address of the load/store instruction of each data stream is updated, while maintaining their relative ratio between the data streams.

A prefetch control method in a processor comprising:
means for detecting a data stream predicted to be continuously accessed from a request destination address of a load / store instruction issued to a memory;
means for issuing, to the memory, a prefetch request to the cache in response to a load / store instruction request in the data stream; and
a data stream management table for holding, for a plurality of data streams, a request destination address of a load / store instruction of the data stream and a frequency of cache misses occurring in load / store instructions in the data stream,
the method comprising:
Issuing a load / store instruction to the memory;
Detecting a data stream predicted to be continuously accessed from a request destination address of a load / store instruction issued to the memory;
Recording a request destination address of the data stream load / store instruction in the data stream management table;
Recording the frequency of cache misses occurring in load / store instructions in the data stream in the data stream management table;
Summing up the frequencies of cache misses that occurred in load / store instructions over all data streams held in the data stream management table;
Multiplying a predetermined maximum prefetch offset by the ratio whose denominator is the total frequency of cache misses that occurred in load / store instructions in all data streams and whose numerator is the frequency of cache misses that occurred in load / store instructions in each data stream, to obtain a prefetch offset;
Determining a prefetch address, used when a prefetch request to the cache is made in response to a load / store instruction request in the data stream, by summing the request destination address of the load / store instruction in the data stream held in the data stream management table and the prefetch offset in the data stream;
Issuing a prefetch request to the cache based on the obtained prefetch address of the data stream in response to a load / store instruction request in the data stream.
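The offset and address calculation in the steps above can be sketched as follows. This is an illustrative model under stated assumptions: the value of `MAX_PREFETCH_OFFSET`, the use of integer division, the zero offset when no misses have been recorded, and the function names are not taken from the claim.

```python
from typing import Dict

MAX_PREFETCH_OFFSET = 1024  # assumed maximum prefetch offset in bytes (hypothetical value)


def prefetch_offsets(miss_freq: Dict[str, int],
                     max_offset: int = MAX_PREFETCH_OFFSET) -> Dict[str, int]:
    """Per-stream offset = max_offset * (misses in stream / misses in all streams)."""
    total = sum(miss_freq.values())
    if total == 0:
        # Assumption: with no recorded misses, no stream gets a look-ahead offset.
        return {stream: 0 for stream in miss_freq}
    return {stream: max_offset * f // total for stream, f in miss_freq.items()}


def prefetch_address(request_addr: int, offset: int) -> int:
    """Prefetch address = request destination address + prefetch offset."""
    return request_addr + offset
```

With miss frequencies `{"a": 3, "b": 1}`, stream `a` receives an offset of 768 and stream `b` an offset of 256, so the stream that misses more often is prefetched further ahead.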

A prefetch control method in a processor comprising:
means for detecting a data stream predicted to be continuously accessed from a request destination address of a load / store instruction issued to a memory;
means for issuing, to the memory, a prefetch request to the cache in units of cache lines in response to a load / store instruction request in the data stream; and
a data stream management table that holds, for a plurality of data streams, a line address related to a request destination address of the load / store instruction of the data stream and a frequency at which the line address is updated,
the method comprising:
Issuing a load / store instruction to the memory;
Detecting a data stream predicted to be continuously accessed from a request destination address of a load / store instruction issued to the memory;
Recording a line address related to a request destination address of the load / store instruction of the data stream in the data stream management table;
Recording in the data stream management table the frequency at which the line address related to the request destination address of the load / store instruction of the data stream is updated in the data stream;
Summing up, over all data streams held in the data stream management table, the frequencies at which the line address related to the request destination address of the load / store instruction of each data stream is updated;
Obtaining a prefetch offset by multiplication using the ratio whose denominator is the sum, over all data streams, of the frequencies at which the line address related to the request destination address of the load / store instruction of each data stream is updated, and whose numerator is the frequency at which the line address related to the request destination address of the load / store instruction of each data stream was updated;
Determining a prefetch address, used when a prefetch request to the cache is made in response to a load / store instruction request in the data stream, by summing the line address related to the request destination address of the load / store instruction in the data stream held in the data stream management table and the prefetch offset in the data stream;
Issuing a prefetch request to the cache based on the obtained prefetch address of the data stream in response to a load / store instruction request in the data stream.
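The line-address variant of the method can be sketched in the same way. The model below makes several assumptions that the claim text does not state: a cache line size of `LINE_SIZE` bytes, a scale factor `MAX_OFFSET_LINES` playing the role of a maximum prefetch offset in lines, integer division, and the hypothetical names `Stream` and `prefetch_line`.

```python
from dataclasses import dataclass
from typing import List

LINE_SIZE = 128        # assumed cache line size in bytes (hypothetical)
MAX_OFFSET_LINES = 8   # assumed maximum prefetch offset in cache lines (hypothetical)


@dataclass
class Stream:
    line_addr: int     # line address of the most recent load/store request
    updates: int = 0   # how many times line_addr has changed

    def access(self, addr: int) -> None:
        """Record a load/store; count an update only when the line address changes."""
        line = addr // LINE_SIZE
        if line != self.line_addr:
            self.line_addr = line
            self.updates += 1


def prefetch_line(stream: Stream, streams: List[Stream]) -> int:
    """Prefetch line = current line address + offset scaled by the stream's update ratio."""
    total = sum(s.updates for s in streams)
    offset = MAX_OFFSET_LINES * stream.updates // total if total else 0
    return stream.line_addr + offset
```

A stream that crosses cache lines more often accumulates a larger share of the updates and is therefore prefetched more lines ahead than a slowly advancing stream.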