JP2011008485A - Data processing apparatus

Info

Publication number: JP2011008485A
Application number: JP2009150788A
Authority: JP
Inventors: Takashi Nakada (尚中田); Yasuhiko Nakajima (康彦中島)
Original assignee: Nara Institute of Science and Technology NUC
Current assignee: Nara Institute of Science and Technology NUC
Priority date: 2009-06-25
Filing date: 2009-06-25
Publication date: 2011-01-13

Abstract

PROBLEM TO BE SOLVED: To provide a data processing apparatus that efficiently transfers data between a plurality of primary caches. SOLUTION: The data processing apparatus 1 includes: a plurality of computing units 11, 21, 31, and 41; a plurality of primary caches 12, 22, 32, and 42; a secondary cache 50; and a plurality of buffer caches 13, 23, 33, and 43 connected in series, one after another. Data from the secondary cache 50 are transferred to the buffer cache 13 in the first stage. The buffer caches 13, 23, 33, and 43 successively transfer part of the data they hold to the subsequent stage, and each buffer cache 13, 23, 33, and 43 also transfers part of its data to its corresponding primary cache 12, 22, 32, or 42.

Description

The present invention relates to a cache system suitable for a multiprocessor system in which a plurality of processors are connected and the processing performed by each processor can be synchronized. In particular, the present invention relates to a data processing apparatus that processes data supplied by means of this cache system.

In recent years, the amount of information handled by embedded devices has grown, and demand for processors that combine low power consumption with a guaranteed minimum performance has risen rapidly. Until now, a guaranteed minimum performance has generally been achieved by implementing the function in dedicated hardware.

Recently, however, it has become difficult to adopt dedicated hardware, which is costly in both time and money, when a product must keep up with new image and wireless standards that are issued one after another, or must support fine tuning of filters for product differentiation and function updates after shipment.

Meanwhile, demand for high-resolution image processing, as used in digital cinema and similar applications, has been growing in recent years. The amount of computation required by filter processing that corrects high-resolution images such as digital cinema is enormous, and existing processors cannot perform it in real time. Moreover, there is at present no high-performance, flexible architecture that matches the performance of dedicated hardware capable of steadily generating one pixel per cycle and that can also execute general-purpose machine-language instructions.

In view of this situation, multi-core and many-core processors (see, for example, Non-Patent Documents 1 to 3) and reconfigurable data paths (see, for example, Non-Patent Document 4) are currently regarded as promising.

Multi-core and many-core processors are architectures in which an application program divided at various granularities is executed in parallel by a plurality of cores. They have the advantage that common parallel programming techniques can be used.

In addition, providing a local memory that software can control suppresses, to some extent, the performance degradation caused by unpredictable cache misses. These architectures therefore not only offer excellent flexibility and scalability, but also bring a minimum performance guarantee through an increased number of cores within reach.

A reconfigurable data path, on the other hand, is an architecture that arranges a large number of arithmetic units of finer granularity than a processor core in order to achieve both functional flexibility and high speed. By arranging a very large number of arithmetic units, a reconfigurable data path has the advantage that, like dedicated hardware, it can achieve performance comparable to that of a high-performance processor even at a relatively low operating frequency.

Whichever of the above architectures is used (multi-core, many-core, or reconfigurable data path), the processors are generally of the superscalar type or the VLIW (Very Long Instruction Word) type.

In such processors, adopting a cache system contributes greatly to performance. One such cache system builds a primary cache into each of the plurality of processors and also provides a secondary cache between the processors and the external main memory. This system raises the hit rate of the secondary cache and reduces accesses to the main memory, thereby improving processor performance.

Non-Patent Document 1: Vangal, S. et al.: An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS, ISSCC, pp. 98-99 (2007).
Non-Patent Document 2: Bell, S. et al.: Tile64 Processor: A 64-Core SoC with Mesh Interconnect, ISSCC, pp. 88-89 (2008).
Non-Patent Document 3: Kyo, S., Okazaki, S. and Arai, T.: An Integrated Memory Array Processor for Embedded Image Recognition Systems, IEEE Transactions on Computers, Vol. 56, No. 5, pp. 622-634 (2007).
Non-Patent Document 4: Becker, J. and Hubner, M.: Run-time reconfigurabilility and other future trends, the 19th annual symposium on Integrated circuits and systems design, pp. 9-11 (2006).

When the cache system described above is adopted, supplying data to many processors at the same time requires the contents of the primary caches provided in the respective processors to be transferred between those caches.

However, in every one of the architectures described above, it is difficult to transfer contents efficiently between the plurality of primary caches.

In view of the above problem, an object of the present invention is to provide a data processing apparatus that can efficiently transfer data between a plurality of primary caches.

To achieve the above object, a data processing apparatus according to the present invention includes: a plurality of arithmetic units; a plurality of first caches, one provided for each of the arithmetic units, each of which transfers data to its corresponding arithmetic unit; a second cache that is shared by the plurality of arithmetic units and stores data used in the processing of each arithmetic unit; and a plurality of buffer caches, one provided for each of the first caches, each of which transfers data to its corresponding first cache. The plurality of buffer caches include a first-stage buffer cache that is connected to the second cache and receives data transferred from the second cache, and the buffer caches are connected in series, one after another, starting from the first-stage buffer cache. Each buffer cache successively transfers part of the data transferred from the second cache to the first-stage buffer cache to its subsequent stage, and transfers part of the data it stores to its corresponding first cache.

In the above data processing apparatus, each buffer cache successively transfers part of the data transferred from the second cache to the first-stage buffer cache to its subsequent stage, and transfers part of the data it stores to its corresponding first cache.

Consequently, each first cache can transfer the data stored in the second cache to its corresponding arithmetic unit without the first caches having to transfer their stored data to one another.

A data processing apparatus that can efficiently transfer data between a plurality of first caches can therefore be realized.
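
The chained arrangement summarized above can be pictured as a shift register. The following is a minimal sketch, not the claimed implementation: it assumes that every buffer cache holds exactly one block and that all stages advance in lockstep, and the function name `stream_through_chain` and the block labels are hypothetical.

```python
from typing import List, Optional


def stream_through_chain(blocks: List[str], stages: int) -> List[List[Optional[str]]]:
    """Return, for every step, the block held by each buffer cache in the chain.

    Assumption for illustration: one block per buffer cache, lockstep shifting.
    """
    buffers: List[Optional[str]] = [None] * stages  # index 0 = first-stage buffer cache
    history = []
    # Keep stepping until the last block has reached the last stage.
    for step in range(len(blocks) + stages - 1):
        incoming = blocks[step] if step < len(blocks) else None
        # Every buffer cache hands its block to the subsequent stage, and the
        # first stage takes the next block from the second cache.
        buffers = [incoming] + buffers[:-1]
        history.append(list(buffers))
    return history


history = stream_through_chain(["B0", "B1", "B2"], stages=4)
for step, state in enumerate(history):
    print(step, state)
```

In this model the second cache is read once per block, and no first cache ever sends data to another first cache; each stage simply sees every block one step later than the stage before it.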

Each of the plurality of buffer caches preferably transfers the data required for the processing of its corresponding arithmetic unit to its corresponding first cache.

In this case, each buffer cache can efficiently transfer the data required for the processing of each arithmetic unit to that arithmetic unit.

Unnecessary data transfers between each buffer cache and each arithmetic unit are thereby reduced, so the power consumption of the data processing apparatus can be reduced.

Each of the plurality of first caches preferably stops the storage operation of any unneeded storage area, that is, any part of its storage area that is not needed to store the data transferred from its corresponding buffer cache.

In this case, the storage operation of those storage areas of each first cache that hold no data can be stopped.

The power consumed by the storage operation of unneeded storage areas in each first cache can thereby be reduced, and as a result the power consumption of the data processing apparatus is reduced.
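
One way to picture stopping the storage operation of unneeded areas is a per-bank enable mask. The sketch below is only an illustration under stated assumptions: the text does not say that the caches are divided into banks, so the bank organization, the function name `bank_enable_mask`, and its parameters are all hypothetical.

```python
def bank_enable_mask(needed_lines: set, lines_per_bank: int, num_banks: int) -> list:
    """Return one boolean per bank: True if the bank must stay active.

    Assumption for illustration: the cache is split into equally sized banks
    and a bank can be stopped when none of its lines is needed.
    """
    mask = [False] * num_banks
    for line in needed_lines:
        bank = line // lines_per_bank
        if not 0 <= bank < num_banks:
            raise ValueError(f"line {line} is outside the cache")
        mask[bank] = True
    return mask


# Lines 0, 1 and 5 are needed, so only banks 0 and 1 stay active.
mask = bank_enable_mask({0, 1, 5}, lines_per_bank=4, num_banks=4)
print(mask)
```

Banks whose entry is False hold no required data in this model and are the candidates for having their storage operation stopped.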

Each of the plurality of buffer caches preferably transfers, to the buffer cache at its subsequent stage, the data required for the processing of the arithmetic unit corresponding to that subsequent-stage buffer cache.

In this case, each buffer cache can efficiently transfer to the subsequent-stage buffer cache the data required for the processing of the arithmetic unit corresponding to that subsequent-stage buffer cache.

Unnecessary data transfers between buffer caches are thereby reduced, so the power consumption of the data processing apparatus can be reduced.

Each of the plurality of buffer caches preferably stops the storage operation of any unneeded storage area, that is, any part of its storage area that is not needed to store the data transferred from the preceding-stage buffer cache.

In this case, the storage operation of those storage areas of each buffer cache that hold no data can be stopped.

The power consumed by the storage operation of unneeded storage areas in each buffer cache can thereby be reduced, and as a result the power consumption of the data processing apparatus is reduced.

The data processing apparatus preferably analyzes the data access pattern that results from executing the program it is to process and processes the program using the analysis result, and the data required for the processing of each of the plurality of arithmetic units is preferably identified on the basis of that analysis result.

In this case, the above effects can be obtained even when the data access pattern of the program processed by the data processing apparatus has not been analyzed in advance.
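
As a rough illustration of identifying required data from an access pattern, the sketch below maps a recorded address trace to the set of blocks each arithmetic unit touches. It is an assumption-laden example rather than the analysis method of the apparatus: the trace format, the block size of 64, and the function name `needed_blocks` are hypothetical.

```python
def needed_blocks(trace: dict, block_size: int) -> dict:
    """Map each arithmetic unit to the sorted block numbers it touches.

    Assumption for illustration: `trace` maps a unit identifier to the list
    of addresses that unit accessed while the program ran.
    """
    return {
        unit: sorted({addr // block_size for addr in addrs})
        for unit, addrs in trace.items()
    }


# Hypothetical traces for arithmetic units 11 and 21.
trace = {11: [0, 4, 8, 64], 21: [64, 68, 128]}
needed = needed_blocks(trace, block_size=64)
print(needed)
```

A result of this kind could then drive both which data each buffer cache forwards and which storage areas are left unused.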

The data processing apparatus preferably further includes: a specifying unit that identifies, on the basis of the analysis result of the data access pattern of the program, the data required for the processing of each of the plurality of arithmetic units; a first execution unit, provided for each of the first caches, that stops the storage operation of the unneeded storage area of its corresponding first cache on the basis of the identification result from the specifying unit; and a second execution unit, provided for each of the buffer caches, that stops the storage operation of the unneeded storage area of its corresponding buffer cache on the basis of the identification result from the specifying unit.

In this case, stopping the storage operation of the storage areas that hold no data, both in each first cache and in each buffer cache, can be controlled with a simple apparatus configuration.

The manufacturing cost of the data processing apparatus can therefore be reduced.

Preferably, the data access pattern of the program processed by the data processing apparatus is analyzed in advance; the data required for the processing of each arithmetic unit is identified in advance on the basis of that analysis; the storage area of each first cache is set in advance so as to store the data required for the processing of the arithmetic units; and the storage area of each buffer cache is set in advance so as to transfer the data required for the processing of the arithmetic units to its corresponding first cache.

In this case, because the data required for the processing of each arithmetic unit can be identified in advance, the storage area required by each first cache and the storage area required by each buffer cache can be set in advance.

The configuration of the data processing apparatus can therefore be simplified further and its power consumption reduced further.

As described above, the data processing apparatus of the present invention includes: a plurality of arithmetic units; a plurality of first caches, one provided for each of the arithmetic units, each of which transfers data to its corresponding arithmetic unit; a second cache that is shared by the plurality of arithmetic units and stores data used in the processing of each arithmetic unit; and a plurality of buffer caches, one provided for each of the first caches, each of which transfers data to its corresponding first cache. The plurality of buffer caches include a first-stage buffer cache that is connected to the second cache and receives data transferred from the second cache, and the buffer caches are connected in series, one after another, starting from the first-stage buffer cache. Each buffer cache successively transfers part of the data transferred from the second cache to the first-stage buffer cache to its subsequent stage, and transfers part of the data it stores to its corresponding first cache.

The invention therefore has the effect that data can be transferred efficiently between a plurality of primary caches.

FIG. 1 is a block diagram showing a schematic configuration of a data processing apparatus according to Embodiment 1 of the present invention.
FIG. 2 is an explanatory diagram of a program for performing a general blurring process.
FIG. 3 is an explanatory diagram of a conventional parallel processing procedure using the program of FIG. 2 (part 1).
FIG. 4 is an explanatory diagram of a conventional parallel processing procedure using the program of FIG. 2 (part 2).
FIG. 5 is an explanatory diagram of a conventional parallel processing procedure using the program of FIG. 2 (part 3).
FIG. 6 is an explanatory diagram of a parallel processing procedure according to Embodiment 1 of the present invention using the program of FIG. 2 (part 1).
FIG. 7 is an explanatory diagram of a parallel processing procedure according to Embodiment 1 of the present invention using the program of FIG. 2 (part 2).
FIG. 8 is an explanatory diagram of a parallel processing procedure according to Embodiment 1 of the present invention using the program of FIG. 2 (part 3).
FIG. 9 is an explanatory diagram of a parallel processing procedure according to Embodiment 1 of the present invention using the program of FIG. 2 (part 4).
FIG. 10 is a block diagram showing a schematic configuration of a data processing apparatus according to Embodiment 2 of the present invention.
FIG. 11 is an explanatory diagram of the cache system of FIG. 10.

Embodiments of the present invention are described below with reference to the drawings. In the drawings used in the following description, identical parts are given identical reference numerals. Their names and functions are also identical, so detailed descriptions of them are not repeated.

(Embodiment 1)
FIG. 1 is a block diagram showing a schematic configuration of a data processing apparatus according to Embodiment 1 of the present invention.

As shown in FIG. 1, the data processing apparatus 1 according to Embodiment 1 of the present invention includes a first-stage computing section 10, a second-stage computing section 20, a third-stage computing section 30, a fourth-stage computing section 40, a secondary cache (second cache) 50, and a main memory 60.

The first-stage computing section 10 has an arithmetic unit 11, a primary cache (first cache) 12, a buffer cache 13, a transfer control unit (specifying unit) 14, a first transfer execution unit (first execution unit) 15, and a second transfer execution unit (second execution unit) 16. Similarly, the second-stage computing section 20 has an arithmetic unit 21, a primary cache 22, a buffer cache 23, a transfer control unit 24, a first transfer execution unit 25, and a second transfer execution unit 26. The third-stage computing section 30 has an arithmetic unit 31, a primary cache 32, a buffer cache 33, a transfer control unit 34, a first transfer execution unit 35, and a second transfer execution unit 36. The fourth-stage computing section 40 has an arithmetic unit 41, a primary cache 42, a buffer cache 43, a transfer control unit 44, a first transfer execution unit 45, and a second transfer execution unit 46.

The data processing apparatus 1 according to the present embodiment has a memory system made up of: a main memory 60 that stores, over the long term, the programs (instructions) and data that the apparatus processes (hereinafter these programs (instructions) and data may be referred to simply as "data"); a secondary cache 50 that stores, over the short term, part of the programs (instructions) and data stored in the main memory 60; and a plurality of primary caches 12, 22, 32, and 42 and a plurality of buffer caches 13, 23, 33, and 43 that temporarily store part of the programs (instructions) and data stored in the secondary cache 50.

The main memory 60 stores the programs (instructions) and data used by the first- to fourth-stage computing sections 10, 20, 30, and 40. It stores them over the long term, whether or not the computing sections are actually using them. The main memory 60 therefore emphasizes the storage function over the access function and is required to have a large capacity, even if it is slow.

The main memory 60 can be implemented with a known magnetic disk, magneto-optical disk, or magnetic tape.

Furthermore, from the viewpoint of the first- to fourth-stage computing sections 10, 20, 30, and 40, the main memory 60 is a kind of input/output device. The data processing apparatus 1 according to the present embodiment may therefore use, in place of the main memory 60, an external I/O made up of an input device that collects the programs (instructions) and data the data processing apparatus 1 needs for its data processing, and an output device that makes the results of that processing available to the user.

The secondary cache 50 stores the programs (instructions) and data that the first- to fourth-stage computing sections 10, 20, 30, and 40 are actually using. The arithmetic units 11, 21, 31, and 41 of those computing sections can access the secondary cache 50 directly. In other words, the secondary cache 50 is shared by the arithmetic units 11, 21, 31, and 41.

For this reason, in contrast to the main memory 60, which emphasizes the storage function over the access function, the secondary cache 50 emphasizes the access function over the storage function. The secondary cache 50 is required to be fast, even if its capacity is small.

The secondary cache 50 can be implemented with a known semiconductor memory such as DRAM or SRAM.

The primary caches 12, 22, 32, and 42 and the buffer caches 13, 23, 33, and 43 transfer data directly to and from the arithmetic units 11, 21, 31, and 41. They successively acquire from the secondary cache 50 the programs (instructions) and data stored there that the arithmetic units 11, 21, 31, and 41 are actually using, and transfer them to the arithmetic units 11, 21, 31, and 41.

The buffer cache 13 of the first-stage computing section 10 (the first-stage buffer cache) is connected to the secondary cache 50 and transfers data directly to and from it. The buffer cache 13 has enough capacity to store the minimum unit of data exchanged between itself and the secondary cache 50 (hereinafter also called a "block"). Data transfer between the buffer cache 13 and the secondary cache 50 is performed in units of this block.

In contrast, the buffer cache 23 of the second-stage computing section 20, the buffer cache 33 of the third-stage computing section 30, and the buffer cache 43 of the fourth-stage computing section 40 do not transfer data directly to or from the secondary cache 50.

That is, the buffer caches 13, 23, 33, and 43 are arranged so that the buffer cache 13 forms the first stage and the remaining buffer caches are connected in series after it. The buffer caches 13, 23, 33, and 43 can thus pass the data transferred from the secondary cache 50 to the buffer cache 13 successively to their respective subsequent stages.

Specifically, the buffer cache 23 of the second-stage computing section 20 is connected to the buffer cache 13 of the first-stage computing section 10 and transfers data directly to and from it. The buffer cache 33 of the third-stage computing section 30 is connected to the buffer cache 23 of the second-stage computing section 20 and transfers data directly to and from it. The buffer cache 43 of the fourth-stage computing section 40 is connected to the buffer cache 33 of the third-stage computing section 30 and transfers data directly to and from it. All of these data transfers are performed in units of the block described above.
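
The connection order described above can be written down as a small lookup. This is a sketch for orientation only; the string labels reuse the reference numerals from the text, while the helper names `upstream_of` and `hops_from_l2` are hypothetical.

```python
# Only the first-stage buffer cache is attached to the secondary cache;
# every later buffer cache is attached to the one before it.
L2 = "secondary cache 50"
CHAIN = ["buffer cache 13", "buffer cache 23", "buffer cache 33", "buffer cache 43"]


def upstream_of(buffer_name: str) -> str:
    """Return the component a buffer cache receives its blocks from."""
    index = CHAIN.index(buffer_name)
    return L2 if index == 0 else CHAIN[index - 1]


def hops_from_l2(buffer_name: str) -> int:
    """Number of block transfers needed for data to reach this buffer cache."""
    return CHAIN.index(buffer_name) + 1


for name in CHAIN:
    print(name, "<-", upstream_of(name), "| hops:", hops_from_l2(name))
```

The lookup makes the key property easy to check: the secondary cache has a single consumer, so adding stages lengthens the chain without adding ports to the secondary cache.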

When one block of data is transferred from the secondary cache 50 to the buffer cache 13 of the first-stage computing section 10, the buffer cache 13 transfers the data it was holding immediately before that transfer to the primary cache 12 of the first-stage computing section 10. Each time one block of data is transferred from the buffer cache 13, the primary cache 12 updates the several blocks of data that it stores.
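
The behavior of a primary cache that holds several blocks and is refreshed one block at a time can be sketched as a sliding window. The text does not state which stored block is replaced, so the oldest-first replacement used here is an assumption, as are the class name `PrimaryCacheWindow` and the capacity of three blocks.

```python
from collections import deque


class PrimaryCacheWindow:
    """Holds the most recent blocks delivered by the buffer cache.

    Assumption for illustration: the oldest block is the one replaced.
    """

    def __init__(self, capacity_blocks: int):
        # A deque with maxlen discards the oldest entry when it is full.
        self.blocks = deque(maxlen=capacity_blocks)

    def receive(self, block: str) -> None:
        """Accept one block from the buffer cache."""
        self.blocks.append(block)

    def contents(self) -> list:
        return list(self.blocks)


window = PrimaryCacheWindow(capacity_blocks=3)
for block in ["B0", "B1", "B2", "B3"]:
    window.receive(block)
print(window.contents())
```

A window of this kind suits processing such as image filtering, where each arithmetic unit needs a few neighboring blocks at once.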

In step with the data transfer to the primary cache 12, the buffer cache 13 also transfers the same data as it transferred to the primary cache 12 to the buffer cache 23 of the second-stage computing section 20.

Note that, in the data transfer from the buffer cache 13 to the primary cache 12 and in the data transfer from the buffer cache 13 to the buffer cache 23 of the second-stage computing section 20, not all of the data in the one block stored in the buffer cache 13 need be transferred. Only the data requested by the receiving primary cache 12 and buffer cache 23 may be transferred.
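
Transferring only the requested part of a block can be sketched as a simple filter. The example below is illustrative only: the representation of a block as an address-to-value mapping, the request sets, and the function name `select_requested` are all assumptions.

```python
def select_requested(block: dict, requested: set) -> dict:
    """Return only the entries of `block` whose address was requested."""
    return {addr: value for addr, value in block.items() if addr in requested}


# One block held in a buffer cache, modeled as address -> value.
block = {0: "a", 4: "b", 8: "c", 12: "d"}

# Hypothetical request sets from the two receivers.
to_primary = select_requested(block, {0, 4})
to_next_buffer = select_requested(block, {8, 12, 16})
print(to_primary, to_next_buffer)
```

Addresses that are requested but not present in the block (16 in this example) are simply not transferred, and unrequested entries never leave the buffer cache.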

第２段演算部２０のバッファキャッシュ２３は、第１段演算部１０のバッファキャッシュ１３から１ブロックのデータ転送が行なわれると、そのデータ転送が行なわれる直前までに格納していたデータを第２段演算部２０の１次キャッシュ２２に転送する。そして、この１次キャッシュ２２は、バッファキャッシュ２３から１ブロック毎のデータが転送される度に、自身が格納する数ブロックのデータを更新する。 When one block of data is transferred from the buffer cache 13 of the first stage arithmetic unit 10, the buffer cache 23 of the second stage arithmetic unit 20 transfers the data it had been storing until immediately before that transfer to the primary cache 22 of the second stage arithmetic unit 20. Each time one block of data is transferred from the buffer cache 23, the primary cache 22 updates the several blocks of data that it stores.

なお、バッファキャッシュ２３から１次キャッシュ２２へのデータ転送及び、バッファキャッシュ２３から第３段演算部３０のバッファキャッシュ３３へのデータ転送においては、バッファキャッシュ２３に格納されている１ブロックのすべてのデータがデータ転送の対象となるものではない。データ転送を受ける１次キャッシュ２２及び、バッファキャッシュ３３が要求するデータのみが転送されてもよい。 Note that, in the data transfer from the buffer cache 23 to the primary cache 22 and in the data transfer from the buffer cache 23 to the buffer cache 33 of the third stage arithmetic unit 30, not all of the data of the one block stored in the buffer cache 23 need be transferred. Only the data requested by the primary cache 22 and the buffer cache 33, which receive the data transfer, may be transferred.

第３段演算部３０のバッファキャッシュ３３は、第２段演算部２０のバッファキャッシュ２３から１ブロックのデータ転送が行なわれると、そのデータ転送が行なわれる直前までに格納していたデータを第３段演算部３０の１次キャッシュ３２に転送する。そして、この１次キャッシュ３２は、バッファキャッシュ３３から１ブロック毎のデータが転送される度に、自身が格納する数ブロックのデータを更新する。 When one block of data is transferred from the buffer cache 23 of the second stage arithmetic unit 20, the buffer cache 33 of the third stage arithmetic unit 30 transfers the data it had been storing until immediately before that transfer to the primary cache 32 of the third stage arithmetic unit 30. Each time one block of data is transferred from the buffer cache 33, the primary cache 32 updates the several blocks of data that it stores.

なお、バッファキャッシュ３３から１次キャッシュ３２へのデータ転送及び、バッファキャッシュ３３から第４段演算部４０のバッファキャッシュ４３へのデータ転送においては、バッファキャッシュ３３に格納されている１ブロックのすべてのデータがデータ転送の対象となるものではない。データ転送を受ける１次キャッシュ３２及び、バッファキャッシュ４３が要求するデータのみが転送されてもよい。 Note that, in the data transfer from the buffer cache 33 to the primary cache 32 and in the data transfer from the buffer cache 33 to the buffer cache 43 of the fourth stage arithmetic unit 40, not all of the data of the one block stored in the buffer cache 33 need be transferred. Only the data requested by the primary cache 32 and the buffer cache 43, which receive the data transfer, may be transferred.

第４段演算部４０のバッファキャッシュ４３は、第３段演算部３０のバッファキャッシュ３３から１ブロックのデータ転送が行なわれると、そのデータ転送が行なわれる直前までに格納していたデータを第４段演算部４０の１次キャッシュ４２に転送する。そして、この１次キャッシュ４２は、バッファキャッシュ４３から１ブロック毎のデータが転送される度に、自身が格納する数ブロックのデータを更新する。 When one block of data is transferred from the buffer cache 33 of the third stage arithmetic unit 30, the buffer cache 43 of the fourth stage arithmetic unit 40 transfers the data it had been storing until immediately before that transfer to the primary cache 42 of the fourth stage arithmetic unit 40. Each time one block of data is transferred from the buffer cache 43, the primary cache 42 updates the several blocks of data that it stores.

なお、バッファキャッシュ４３から１次キャッシュ４２へのデータ転送においては、バッファキャッシュ４３に格納されている１ブロックのすべてのデータがデータ転送の対象となるものではない。データ転送を受ける１次キャッシュ４２が要求するデータのみが転送されてもよい。 In the data transfer from the buffer cache 43 to the primary cache 42, not all the data of one block stored in the buffer cache 43 is a target for data transfer. Only the data requested by the primary cache 42 that receives the data transfer may be transferred.

１次キャッシュ１２、２２、３２、４２及びバッファキャッシュ１３、２３、３３、４３はいずれも、低容量でも高速な半導体メモリである、公知のＥＣＬ／ＢｉＣＭＯＳのＳＲＡＭ、ＥＣＬ等の高速ＳＲＡＭを用いて実現することができる。 The primary caches 12, 22, 32, and 42 and the buffer caches 13, 23, 33, and 43 can each be implemented using a high-speed SRAM, such as a known ECL/BiCMOS SRAM or ECL SRAM, which is a semiconductor memory that is fast even though its capacity is small.

第１段演算部１０の転送制御部１４は、第１の転送実行部１５及び第２の転送実行部１６の各データ転送実行処理を制御する。第１の転送実行部１５は、転送制御部１４からの制御内容に従って、バッファキャッシュ１３と１次キャッシュ１２との間におけるデータ転送を実行する。第２の転送実行部１６は、転送制御部１４からの制御内容に従って、バッファキャッシュ１３とバッファキャッシュ２３との間におけるデータ転送を実行する。 The transfer control unit 14 of the first stage calculation unit 10 controls each data transfer execution process of the first transfer execution unit 15 and the second transfer execution unit 16. The first transfer execution unit 15 executes data transfer between the buffer cache 13 and the primary cache 12 in accordance with the control content from the transfer control unit 14. The second transfer execution unit 16 executes data transfer between the buffer cache 13 and the buffer cache 23 according to the control content from the transfer control unit 14.

第２段演算部２０の転送制御部２４は、第１の転送実行部２５及び第２の転送実行部２６の各データ転送実行処理を制御する。第１の転送実行部２５は、転送制御部２４からの制御内容に従って、バッファキャッシュ２３と１次キャッシュ２２との間におけるデータ転送を実行する。第２の転送実行部２６は、転送制御部２４からの制御内容に従って、バッファキャッシュ２３とバッファキャッシュ３３との間におけるデータ転送を実行する。 The transfer control unit 24 of the second stage calculation unit 20 controls each data transfer execution process of the first transfer execution unit 25 and the second transfer execution unit 26. The first transfer execution unit 25 executes data transfer between the buffer cache 23 and the primary cache 22 in accordance with the control content from the transfer control unit 24. The second transfer execution unit 26 executes data transfer between the buffer cache 23 and the buffer cache 33 in accordance with the control content from the transfer control unit 24.

第３段演算部３０の転送制御部３４は、第１の転送実行部３５及び第２の転送実行部３６の各データ転送実行処理を制御する。第１の転送実行部３５は、転送制御部３４からの制御内容に従って、バッファキャッシュ３３と１次キャッシュ３２との間におけるデータ転送を実行する。第２の転送実行部３６は、転送制御部３４からの制御内容に従って、バッファキャッシュ３３とバッファキャッシュ４３との間におけるデータ転送を実行する。 The transfer control unit 34 of the third stage arithmetic unit 30 controls each data transfer execution process of the first transfer execution unit 35 and the second transfer execution unit 36. The first transfer execution unit 35 executes data transfer between the buffer cache 33 and the primary cache 32 in accordance with the control content from the transfer control unit 34. The second transfer execution unit 36 executes data transfer between the buffer cache 33 and the buffer cache 43 in accordance with the control content from the transfer control unit 34.

第４段演算部４０の転送制御部４４は、第１の転送実行部４５及び第２の転送実行部４６の各データ転送実行処理を制御する。第１の転送実行部４５は、転送制御部４４からの制御内容に従って、バッファキャッシュ４３と１次キャッシュ４２との間におけるデータ転送を実行する。第２の転送実行部４６は、転送制御部４４からの制御内容に従って、バッファキャッシュ４３と後段のバッファキャッシュ（図示省略）との間におけるデータ転送を実行する。 The transfer control unit 44 of the fourth stage calculation unit 40 controls each data transfer execution process of the first transfer execution unit 45 and the second transfer execution unit 46. The first transfer execution unit 45 executes data transfer between the buffer cache 43 and the primary cache 42 according to the control content from the transfer control unit 44. The second transfer execution unit 46 executes data transfer between the buffer cache 43 and a subsequent buffer cache (not shown) according to the control content from the transfer control unit 44.

次に、本実施の形態に係るデータ処理装置１のキャッシュ方式について説明する。以下では、３×３の画素からぼかし処理を行なう例を用いて、データ処理装置１のキャッシュ方式について説明する。図２に、このぼかし処理を行なうためのプログラムを示す。 Next, the cache method of the data processing apparatus 1 according to the present embodiment will be described. Hereinafter, the cache method of the data processing device 1 will be described using an example of performing blurring processing from 3 × 3 pixels. FIG. 2 shows a program for performing this blurring process.

図２のプログラム（命令）２は、ぼかし処理の対象となる３×３画素に対し、その中心画素の上下左右の画素を用いて、ぼかし処理を実行するためのプログラム（命令）である。なお、上記のような３×３画素からぼかし処理を行なう場合、３×３画素を構成する９画素すべての画素値を用いるのが一般的である。ここでは、データ処理装置１のキャッシュ方式の説明の容易化を図るために、上記のように上下左右の４つの画素を用いるぼかし処理を例としている。もちろん、本発明は、３×３画素を構成する９つの画素すべての画素値を用いるぼかし処理にも適用可能であることは言うまでもない。 The program (instruction) 2 in FIG. 2 is a program (instruction) for executing the blurring process on the 3 × 3 pixels to be subjected to the blurring process using the upper, lower, left, and right pixels of the center pixel. When blurring processing is performed from 3 × 3 pixels as described above, it is common to use the pixel values of all nine pixels constituting the 3 × 3 pixels. Here, in order to facilitate the description of the cache method of the data processing apparatus 1, a blurring process using four pixels, top, bottom, left, and right as described above is taken as an example. Of course, it goes without saying that the present invention can also be applied to blurring processing using the pixel values of all nine pixels constituting a 3 × 3 pixel.
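
As a rough illustration of the simplified blur described above, the following Python sketch computes each center pixel from the four pixels above, below, left, and right of it. The function name `blur4`, the plain integer averaging, and the choice to leave border pixels unchanged are assumptions made for this example; the actual program of FIG. 2 is not reproduced here and may differ.

```python
def blur4(image):
    """Return a blurred copy of `image` (a list of equal-length rows of ints).

    Each interior pixel becomes the integer average of its upper, lower,
    left and right neighbours. Border pixels are copied unchanged
    (an assumption for this sketch).
    """
    height = len(image)
    width = len(image[0]) if height else 0
    out = [row[:] for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            up = image[y - 1][x]
            down = image[y + 1][x]
            left = image[y][x - 1]
            right = image[y][x + 1]
            out[y][x] = (up + down + left + right) // 4
    return out


if __name__ == "__main__":
    img = [
        [0, 4, 0],
        [8, 100, 12],
        [0, 16, 0],
    ]
    print(blur4(img)[1][1])  # (4 + 16 + 8 + 12) // 4 = 10
```

A nine-pixel variant would only change the set of neighbours that is summed; the data-supply scheme described in the text does not depend on that choice.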

一般に、画像処理においては、対象となる画素群を少しずつ一定方向にずらしつつ、１行毎に演算を行なう処理が多い。このため、次の処理に必要となるデータは予測可能である。上記のプログラム（命令）２を用いるぼかし処理においては、３×３の画素に対してぼかし処理が繰り返される。例えば、対象となる画素群が例えば水平方向に１画素分ずつ移動するとすれば、垂直方向の２行分は再利用することができ、新たに必要となるのは垂直方向の１行分のみである。 In general, many image processing operations perform their computation line by line while shifting the target pixel group little by little in a fixed direction. The data required for the next operation is therefore predictable. In the blurring process using the program (instruction) 2 described above, the blurring process is repeated on 3 × 3 pixel groups. For example, if the target pixel group moves one pixel at a time in the horizontal direction, two of its vertical lines can be reused, and only one vertical line is newly required.
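
The reuse described above can be checked with a small Python sketch. It treats each vertical line of the 3 × 3 pixel group as a column index and lists which columns a shifted window needs that the previous window did not already hold. The helper names `window_columns` and `columns_to_fetch` are hypothetical and are introduced only for this illustration.

```python
def window_columns(x_left, size=3):
    """Column indices covered by a `size`-wide window whose left edge is x_left."""
    return list(range(x_left, x_left + size))


def columns_to_fetch(prev_left, next_left, size=3):
    """Columns needed by the next window that the previous window did not hold."""
    prev = set(window_columns(prev_left, size))
    return [c for c in window_columns(next_left, size) if c not in prev]


if __name__ == "__main__":
    print(window_columns(4))       # [4, 5, 6]
    print(columns_to_fetch(4, 5))  # [7]: two columns are reused, one is new
```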

そこで、データ処理装置１のキャッシュ方式においては、第１段演算部１０のバッファキャッシュ１３に、第１段演算部１０の演算器１１が新たに必要とする垂直方向の１行分のデータが２次キャッシュ５０から供給される。 Therefore, in the cache system of the data processing apparatus 1, the one vertical line of data newly required by the computing unit 11 of the first stage arithmetic unit 10 is supplied from the secondary cache 50 to the buffer cache 13 of the first stage arithmetic unit 10.

第２段演算部２０のバッファキャッシュ２３に、第２段演算部２０の演算器２１が新たに必要とする垂直方向の１行分のデータが第１段演算部１０のバッファキャッシュ１３から供給される。 The one vertical line of data newly required by the computing unit 21 of the second stage arithmetic unit 20 is supplied from the buffer cache 13 of the first stage arithmetic unit 10 to the buffer cache 23 of the second stage arithmetic unit 20.

第３段演算部３０のバッファキャッシュ３３に、第３段演算部３０の演算器３１が新たに必要とする垂直方向の１行分のデータが第２段演算部２０のバッファキャッシュ２３から供給される。 The one vertical line of data newly required by the computing unit 31 of the third stage arithmetic unit 30 is supplied from the buffer cache 23 of the second stage arithmetic unit 20 to the buffer cache 33 of the third stage arithmetic unit 30.

第４段演算部４０のバッファキャッシュ４３に、第４段演算部４０の演算器４１が新たに必要とする垂直方向の１行分のデータが第３段演算部３０のバッファキャッシュ３３から供給される。 The one vertical line of data newly required by the computing unit 41 of the fourth stage arithmetic unit 40 is supplied from the buffer cache 33 of the third stage arithmetic unit 30 to the buffer cache 43 of the fourth stage arithmetic unit 40.

このようなデータ転送の結果、例えば、第２段演算部２０のバッファキャッシュ２３には、第１段演算部１０のバッファキャッシュ１３の１世代前のデータが格納され、第３段演算部３０のバッファキャッシュ３３には、第２段演算部２０のバッファキャッシュ２３の１世代前のデータが格納されることになる。 As a result of such data transfer, for example, the buffer cache 23 of the second stage arithmetic unit 20 stores data one generation before the buffer cache 13 of the first stage arithmetic unit 10, and the third stage arithmetic unit 30 The buffer cache 33 stores data one generation before the buffer cache 23 of the second stage arithmetic unit 20.

ここで、データ処理装置１の第１〜４段演算部１０、２０、３０、４０の各々が２次キャッシュ５０の内容を直接参照する構成を採用することは現実的ではない。２次キャッシュ５０に必要なポート数が大幅に増大してしまうからである。 Here, it is not realistic that each of the first to fourth stage arithmetic units 10, 20, 30, and 40 of the data processing device 1 directly refers to the contents of the secondary cache 50. This is because the number of ports required for the secondary cache 50 is significantly increased.

そこで、データ処理装置１では、第１〜４段演算部１０、２０、３０、４０の各々が２次キャッシュ５０の内容を直接参照せず、第１〜４段演算部１０、２０、３０、４０の各々に対応する１次キャッシュ１２、２２、３２、４２及びバッファキャッシュ１３、２３、３３、４３を直接参照する構成を採用する。 Therefore, in the data processing apparatus 1, none of the first to fourth stage arithmetic units 10, 20, 30, and 40 refers directly to the contents of the secondary cache 50. Instead, each refers directly to its own corresponding primary cache 12, 22, 32, or 42 and buffer cache 13, 23, 33, or 43.

このようにデータ処理装置１は、第１〜４段演算部１０、２０、３０、４０の各々が、プログラム（命令）２を用い、第１〜４段演算部１０、２０、３０、４０の各演算器１１、２１、３１、４１が並列的にぼかし処理する。そして、データ処理装置１は、第１〜４段演算部１０、２０、３０、４０の各々の処理結果を用いて、１つの画面全体のぼかし処理を実行する。 As described above, in the data processing device 1, each of the first to fourth stage arithmetic units 10, 20, 30, and 40 uses the program (instruction) 2, and the first to fourth stage arithmetic units 10, 20, 30, and 40 Each computing unit 11, 21, 31, 41 performs blurring processing in parallel. Then, the data processing apparatus 1 executes the blurring process for one entire screen using the processing results of the first to fourth stage arithmetic units 10, 20, 30, and 40.

次に、データ処理装置１のキャッシュ方式の動作について説明する。 Next, the operation of the cache method of the data processing apparatus 1 will be described.

先ず、図３を用いて、図２のプログラム（命令）２の並列処理について、従来の手法を用いた場合に予想される処理手順を説明する。図３において、４つの演算器１１ａ、２１ａ、３１ａ、４１ａの各々が、１次キャッシュ１２ａ、２２ａ、３２ａ、４２ａを持っているとする。 First, with reference to FIG. 3, the processing procedure expected when the conventional method is used for the parallel processing of the program (instruction) 2 in FIG. 2 will be described. In FIG. 3, it is assumed that each of the four arithmetic units 11a, 21a, 31a, and 41a has primary caches 12a, 22a, 32a, and 42a.

この場合、演算器１１ａは、３×３の同一の画素群Ａにおける、中心画素の上側に位置する画素のデータを用いた処理を実行する。演算器２１ａは、その中心画素の下側に位置する画素のデータを用いた処理を実行する。演算器３１ａは、その中心画素の右側に位置する画素のデータを用いた処理を実行する。演算器４１ａは、その中心画素の左側に位置する画素のデータを用いた処理を実行する。 In this case, the computing unit 11a executes processing using data of pixels located above the center pixel in the same 3 × 3 pixel group A. The computing unit 21a executes processing using data of a pixel located below the center pixel. The computing unit 31a executes processing using data of a pixel located on the right side of the central pixel. The computing unit 41a executes processing using data of a pixel located on the left side of the central pixel.

より具体的には、図４に示すように、時刻ｔ＝１において、外部より供給されるデータが１次キャッシュ１２ａに格納される。そして、この１次キャッシュ１２ａに格納されたデータを用いて、演算器１１ａは、３×３の同一の画素群Ａにおける、中心画素の上側に位置する画素のデータを用いた処理を実行する。 More specifically, as shown in FIG. 4, at time t = 1, data supplied from the outside is stored in the primary cache 12a. Then, using the data stored in the primary cache 12a, the computing unit 11a executes processing using data of pixels located above the center pixel in the same 3 × 3 pixel group A.

次に、時刻ｔ＝２において、１次キャッシュ２２ａに格納されたデータを用いて、演算器２１ａは、その中心画素の下側に位置する画素のデータを用いた処理を実行する。この時、原則的には、１次キャッシュ１２ａに格納されているデータのすべてが１次キャッシュ２２ａに転送される必要がある。 Next, at time t = 2, using the data stored in the primary cache 22a, the computing unit 21a executes processing using data of a pixel located below the central pixel. At this time, in principle, all of the data stored in the primary cache 12a needs to be transferred to the primary cache 22a.

同様に、時刻ｔ＝３において、１次キャッシュ３２ａに格納されたデータを用いて、演算器３１ａは、その中心画素の右側に位置する画素のデータを用いた処理を実行する。この時も、原則的には、１次キャッシュ２２ａに格納されているデータのすべてが１次キャッシュ３２ａに転送される必要がある。 Similarly, at time t = 3, using the data stored in the primary cache 32a, the arithmetic unit 31a executes processing using data of a pixel located on the right side of the central pixel. Also at this time, in principle, all of the data stored in the primary cache 22a needs to be transferred to the primary cache 32a.

さらに同様に、時刻ｔ＝４において、１次キャッシュ４２ａに格納されたデータを用いて、演算器４１ａは、その中心画素の左側に位置する画素のデータを用いた処理を実行する。この時も、原則的には、１次キャッシュ３２ａに格納されているデータのすべてが１次キャッシュ４２ａに転送される必要がある。 Further, similarly, at time t = 4, using the data stored in the primary cache 42a, the computing unit 41a executes processing using the data of the pixel located on the left side of the central pixel. Also at this time, in principle, all of the data stored in the primary cache 32a needs to be transferred to the primary cache 42a.

ここで、上記の時刻ｔ＝４において、演算器１１ａ、２１ａ、３１ａの各処理の状態について説明する。 Here, the state of each process of the computing units 11a, 21a, and 31a at the time t = 4 will be described.

図５に示すように、演算器３１ａの処理対象は、演算器４１ａの処理対象である３×３の画素群Ａを水平方向に１画素分だけずらした３×３の画素群Ｂである。 As shown in FIG. 5, the processing target of the computing unit 31a is a 3 × 3 pixel group B obtained by shifting the 3 × 3 pixel group A, which is the processing target of the computing unit 41a, by one pixel in the horizontal direction.

同様に、演算器２１ａの処理対象は、演算器４１ａの処理対象である３×３の画素群Ａを水平方向に２画素分だけずらした３×３の画素群Ｃである。言いかえれば、演算器２１ａの処理対象は、演算器３１ａの処理対象である３×３の画素群Ｂを水平方向に１画素分だけずらした３×３の画素群Ｃである。 Similarly, the processing target of the computing unit 21a is a 3 × 3 pixel group C obtained by shifting the 3 × 3 pixel group A, which is the processing target of the computing unit 41a, by two pixels in the horizontal direction. In other words, the processing target of the computing unit 21a is a 3 × 3 pixel group C obtained by shifting the 3 × 3 pixel group B, which is the processing target of the computing unit 31a, by one pixel in the horizontal direction.

さらに同様に、演算器１１ａの処理対象は、演算器４１ａの処理対象である３×３の画素群Ａを水平方向に３画素分だけずらした３×３の画素群Ｄである。言いかえれば、演算器１１ａの処理対象は、演算器２１ａの処理対象である３×３の画素群Ｃを水平方向に１画素分だけずらした３×３の画素群Ｄである。 Similarly, the processing target of the computing unit 11a is a 3 × 3 pixel group D obtained by shifting the 3 × 3 pixel group A, which is the processing target of the computing unit 41a, by three pixels in the horizontal direction. In other words, the processing target of the computing unit 11a is a 3 × 3 pixel group D obtained by shifting the 3 × 3 pixel group C, which is the processing target of the computing unit 21a, by one pixel in the horizontal direction.

このことから分かるように、１次キャッシュ１２ａのうち毎サイクル更新されるデータは、２次キャッシュ５０ａから送り込まれるデータのみである。このため、実際には、１次キャッシュ１２ａに格納されているデータのすべてを１次キャッシュ２２ａに転送する必要はない。すなわち、１次キャッシュ１２ａは、２次キャッシュ５０ａから送り込まれたデータのみを、１次キャッシュ２２ａに転送すれば良い。 As can be seen from this, the data updated every cycle in the primary cache 12a is only the data sent from the secondary cache 50a. Therefore, in practice, it is not necessary to transfer all of the data stored in the primary cache 12a to the primary cache 22a. That is, the primary cache 12a may transfer only the data sent from the secondary cache 50a to the primary cache 22a.
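
To make the saving concrete, the following Python sketch counts how many blocks move between neighbouring stages in one cycle under the two approaches: forwarding the whole primary cache, or forwarding only the block newly received from upstream. The function name and the example figure of three blocks per primary cache are assumptions chosen for illustration, not values stated in the source.

```python
def blocks_moved_per_cycle(num_stages, blocks_per_primary, forward_whole_cache):
    """Count blocks copied between neighbouring stages in one cycle."""
    links = num_stages - 1  # stage-to-stage connections in the chain
    per_link = blocks_per_primary if forward_whole_cache else 1
    return links * per_link


if __name__ == "__main__":
    # Four stages; three blocks per primary cache is a hypothetical figure.
    print(blocks_moved_per_cycle(4, 3, forward_whole_cache=True))   # 9
    print(blocks_moved_per_cycle(4, 3, forward_whole_cache=False))  # 3
```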

そして、このことは、１次キャッシュ２２ａから１次キャッシュ３２ａへのデータ転送及び、１次キャッシュ３２ａから１次キャッシュ４２ａへのデータ転送についても同様である。 This also applies to data transfer from the primary cache 22a to the primary cache 32a and data transfer from the primary cache 32a to the primary cache 42a.

そこで、本実施の形態に係るデータ処理装置１においては、図６に示すように、１次キャッシュ１２及びバッファキャッシュ１３が、演算器１１にとっての本来の１次キャッシュとしての役割を担っている。同様に、１次キャッシュ２２及びバッファキャッシュ２３が、演算器２１にとっての本来の１次キャッシュとしての役割を、１次キャッシュ３２及びバッファキャッシュ３３が、演算器３１にとっての本来の１次キャッシュとしての役割を、１次キャッシュ４２及びバッファキャッシュ４３が、演算器４１にとって本来の１次キャッシュとしての役割を、それぞれが担っている。 Therefore, in the data processing apparatus 1 according to the present embodiment, as shown in FIG. 6, the primary cache 12 and the buffer cache 13 serve as the original primary cache for the computing unit 11. Similarly, the primary cache 22 and the buffer cache 23 serve as the original primary cache for the computing unit 21, and the primary cache 32 and the buffer cache 33 serve as the original primary cache for the computing unit 31. The primary cache 42 and the buffer cache 43 each have a role as an original primary cache for the computing unit 41.

そして、図１に示したように、第１段演算部１０から第２段演算部２０へのデータ転送はバッファキャッシュ１３とバッファキャッシュ２３との間において実行される。同様に、第２段演算部２０から第３段演算部３０へのデータ転送はバッファキャッシュ２３とバッファキャッシュ３３との間において実行され、第３段演算部３０から第４段演算部４０へのデータ転送はバッファキャッシュ３３とバッファキャッシュ４３との間において実行される。 As shown in FIG. 1, data transfer from the first stage arithmetic unit 10 to the second stage arithmetic unit 20 is executed between the buffer cache 13 and the buffer cache 23. Similarly, the data transfer from the second stage computing unit 20 to the third stage computing unit 30 is executed between the buffer cache 23 and the buffer cache 33, and the data is transferred from the third stage computing unit 30 to the fourth stage computing unit 40. Data transfer is executed between the buffer cache 33 and the buffer cache 43.

具体的には、図６に示したように、例えば３×３の画素における、垂直方向の１行分の画素データ「１７、２７、３７」が２次キャッシュ５０から第１段演算部１０のバッファキャッシュ１３に送り込まれると、それまでのバッファキャッシュ１３に格納されていた垂直方向の１行分の画素データ「１６、２６、３６」が第１段演算部１０の１次キャッシュ１２に転送されると共に、第２段演算部２０のバッファキャッシュ２３に転送される。 Specifically, as shown in FIG. 6, for example, pixel data “17, 27, 37” for one line in the vertical direction in 3 × 3 pixels is transferred from the secondary cache 50 to the first stage arithmetic unit 10. When sent to the buffer cache 13, the pixel data “16, 26, 36” of one line in the vertical direction stored in the buffer cache 13 until then is transferred to the primary cache 12 of the first stage arithmetic unit 10. And transferred to the buffer cache 23 of the second stage arithmetic unit 20.

同様に、垂直方向の１行分の画素データ「１６、２６、３６」が第１段演算部１０のバッファキャッシュ１３から第２段演算部２０のバッファキャッシュ２３に送り込まれると、それまでのバッファキャッシュ２３に格納されていた垂直方向の１行分の画素データ「１５、２５、３５」が第２段演算部２０の１次キャッシュ２２に転送されると共に、第３段演算部３０のバッファキャッシュ３３に転送される。 Similarly, when the one vertical line of pixel data “16, 26, 36” is sent from the buffer cache 13 of the first stage arithmetic unit 10 to the buffer cache 23 of the second stage arithmetic unit 20, the one vertical line of pixel data “15, 25, 35” that had been stored in the buffer cache 23 until then is transferred to the primary cache 22 of the second stage arithmetic unit 20 and also to the buffer cache 33 of the third stage arithmetic unit 30.

また、垂直方向の１行分の画素データ「１５、２５、３５」が第２段演算部２０のバッファキャッシュ２３から第３段演算部３０のバッファキャッシュ３３に送り込まれると、それまでのバッファキャッシュ３３に格納されていた垂直方向の１行分の画素データ「１４、２４、３４」が第３段演算部３０の１次キャッシュ３２に転送されると共に、第４段演算部４０のバッファキャッシュ４３に転送される。 Further, when the one vertical line of pixel data “15, 25, 35” is sent from the buffer cache 23 of the second stage arithmetic unit 20 to the buffer cache 33 of the third stage arithmetic unit 30, the one vertical line of pixel data “14, 24, 34” that had been stored in the buffer cache 33 until then is transferred to the primary cache 32 of the third stage arithmetic unit 30 and also to the buffer cache 43 of the fourth stage arithmetic unit 40.

さらに、垂直方向の１行分の画素データ「１４、２４、３４」が第３段演算部３０のバッファキャッシュ３３から第４段演算部４０のバッファキャッシュ４３に送り込まれると、それまでのバッファキャッシュ４３に格納されていた垂直方向の１行分の画素データ「１３、２３、３３」が第４段演算部４０の１次キャッシュ４２に転送されると共に、例えば後段の演算部（図示省略）のバッファキャッシュ（図示省略）に転送される。 Furthermore, when the one vertical line of pixel data “14, 24, 34” is sent from the buffer cache 33 of the third stage arithmetic unit 30 to the buffer cache 43 of the fourth stage arithmetic unit 40, the one vertical line of pixel data “13, 23, 33” that had been stored in the buffer cache 43 until then is transferred to the primary cache 42 of the fourth stage arithmetic unit 40 and also, for example, to the buffer cache (not shown) of a subsequent arithmetic unit (not shown).
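
The cascade described above can be traced with a small software model. The Python sketch below is only an illustration under simplifying assumptions: each buffer cache holds exactly one block, the primary cache is modelled as an unbounded list, and the class and function names (`Stage`, `push`, `column`) are hypothetical. The hardware performs these transfers between separate memories; the loop here simply reproduces the resulting data movement.

```python
class Stage:
    """One arithmetic stage: a one-block buffer cache plus a primary cache."""

    def __init__(self, name):
        self.name = name
        self.buffer = None  # block currently held by the buffer cache
        self.primary = []   # blocks handed to the primary cache, oldest first

    def receive(self, block):
        """Store `block` in the buffer and return the block it displaces."""
        displaced = self.buffer
        self.buffer = block
        if displaced is not None:
            self.primary.append(displaced)  # copy to this stage's primary cache
        return displaced                    # the same block goes to the next stage


def push(stages, block):
    """Feed one block from the secondary cache into the first stage."""
    for stage in stages:
        block = stage.receive(block)
        if block is None:  # nothing displaced yet, so nothing to forward
            break
    return block  # block leaving the last stage, or None


def column(c):
    """Pixel labels of column c for rows 1..3, e.g. column(7) -> [17, 27, 37]."""
    return [10 * r + c for r in (1, 2, 3)]


if __name__ == "__main__":
    stages = [Stage(f"stage{i}") for i in range(1, 5)]
    for c in range(3, 8):  # columns 3..7 arrive in order
        push(stages, column(c))
    for s in stages:
        print(s.name, "buffer:", s.buffer, "latest primary:", s.primary[-1])
```

After column 7 has been pushed, the four buffers hold columns 7, 6, 5, and 4, and the most recent block in each primary cache is column 6, 5, 4, and 3 respectively, which matches the transfers listed in the text.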

次に、本発明の実施の形態に係るデータ処理装置１のキャッシュ方式の動作に特徴部分について説明する。 Next, a characteristic part of the operation of the cache method of the data processing apparatus 1 according to the embodiment of the present invention will be described.

上述したように、データ処理装置１は、第１〜４段演算部１０、２０、３０、４０の各々が転送制御部１４、２４、３４、４４、第１の転送実行部１５、２５、３５、４５、及び、第２の転送実行部１６、２６、３６、４６を有している。データ処理装置１においては、転送制御部１４、２４、３４、４４及び、第１の転送実行部１５、２５、３５、４５の各々の動作により、バッファキャッシュ１３、２３、３３、４３と１次キャッシュ１２、２２、３２、４２との間におけるデータ転送を制御する。また、転送制御部１４、２４、３４、４４及び、第２の転送実行部１６、２６、３６、４６の各々の動作により、バッファキャッシュ１３、２３、３３、４３間におけるデータ転送を制御する。 As described above, in the data processing apparatus 1, the first to fourth stage arithmetic units 10, 20, 30, and 40 respectively include the transfer control units 14, 24, 34, and 44, the first transfer execution units 15, 25, 35, and 45, and the second transfer execution units 16, 26, 36, and 46. In the data processing apparatus 1, the operations of the transfer control units 14, 24, 34, and 44 and the first transfer execution units 15, 25, 35, and 45 control data transfer between the buffer caches 13, 23, 33, and 43 and the primary caches 12, 22, 32, and 42. Likewise, the operations of the transfer control units 14, 24, 34, and 44 and the second transfer execution units 16, 26, 36, and 46 control data transfer among the buffer caches 13, 23, 33, and 43.

以下では、３つの実施例を用いて、この動作について具体的に説明する。 Hereinafter, this operation will be described in detail using three examples.

（実施例１） (Example 1)
この実施例１は、バッファキャッシュ１３、２３、３３、４３と１次キャッシュ１２、２２、３２、４２との間におけるデータ転送を制御する実施例である。 Example 1 is an example in which data transfer between the buffer caches 13, 23, 33, and 43 and the primary caches 12, 22, 32, and 42 is controlled.

図７において、演算器１１は、３×３の同一の画素群における、中心画素の上側に位置する画素のデータを用いた処理を実行する。演算器２１は、その中心画素の下側に位置する画素のデータを用いた処理を実行する。演算器３１は、その中心画素の右側に位置する画素のデータを用いた処理を実行する。演算器４１は、その中心画素の左側に位置する画素のデータを用いた処理を実行する。 In FIG. 7, the computing unit 11 executes processing using data of pixels located above the center pixel in the same 3 × 3 pixel group. The computing unit 21 executes processing using data of a pixel located below the center pixel. The computing unit 31 executes processing using data of a pixel located on the right side of the central pixel. The computing unit 41 executes processing using data of a pixel located on the left side of the central pixel.

この場合、第１段演算部１０においては、１次キャッシュ１２に格納されるべきデータのうち、演算器１１が必要とするデータは、水平方向に１行分の画素データ「１２、１３、１４、１５、１６、１７」である。一方、図６に示した他の画素データ「２２、２３、２４、２５、２６、２７、３２、３３、３４、３５、３６、３７」は不要となる。 In this case, in the first stage arithmetic unit 10, of the data to be stored in the primary cache 12, the data required by the computing unit 11 is the one horizontal line of pixel data “12, 13, 14, 15, 16, 17”. The other pixel data shown in FIG. 6, “22, 23, 24, 25, 26, 27, 32, 33, 34, 35, 36, 37”, is unnecessary.

このため、転送制御部１４は、バッファキャッシュ１３から送り込まれる１ブロック分の画素データのうち、３×３の画素群における、中心画素の上側に位置する画素のデータのみが１次キャッシュ１２に格納されるよう、第１の転送実行部１５を制御する。 For this reason, the transfer control unit 14 controls the first transfer execution unit 15 so that, of the one block of pixel data sent from the buffer cache 13, only the data of the pixel located above the center pixel in the 3 × 3 pixel group is stored in the primary cache 12.

第１の転送実行部１５は、転送制御部１４からの制御内容に従って、１次キャッシュ１２の全記憶領域のうち、不要となる記憶領域（記憶不要領域）の記憶動作を停止させる。 The first transfer execution unit 15 stops the storage operation of unnecessary storage areas (storage unnecessary areas) among all the storage areas of the primary cache 12 according to the control contents from the transfer control unit 14.

そうすることにより、バッファキャッシュ１３から送り込まれる１ブロック分の画素データのうち、演算器１１が必要とするデータのみが、１次キャッシュ１２に格納されることになる。 By doing so, only the data required by the computing unit 11 among the pixel data for one block sent from the buffer cache 13 is stored in the primary cache 12.

同様に、第２段演算部２０においては、１次キャッシュ２２に格納されるべきデータのうち、演算器２１が必要とするデータは、水平方向に１行分の画素データ「３２、３３、３４、３５、３６」である。一方、図６に示した他の画素データ「１２、１３、１４、１５、１６、２２、２３、２４、２５、２６」は不要となる。 Similarly, in the second stage arithmetic unit 20, of the data to be stored in the primary cache 22, the data required by the computing unit 21 is the one horizontal line of pixel data “32, 33, 34, 35, 36”. The other pixel data shown in FIG. 6, “12, 13, 14, 15, 16, 22, 23, 24, 25, 26”, is unnecessary.

このため、転送制御部２４は、バッファキャッシュ２３から送り込まれる１ブロック分の画素データのうち、３×３の画素群における、中心画素の下側に位置する画素のデータのみが１次キャッシュ２２に格納されるよう、第１の転送実行部２５を制御する。 For this reason, the transfer control unit 24 transfers only the data of the pixel located below the center pixel in the 3 × 3 pixel group to the primary cache 22 among the pixel data for one block sent from the buffer cache 23. The first transfer execution unit 25 is controlled so as to be stored.

第１の転送実行部２５は、転送制御部２４からの制御内容に従って、１次キャッシュ２２の全記憶領域のうち、不要となる記憶領域（記憶不要領域）の記憶動作を停止させる。 The first transfer execution unit 25 stops the storage operation of unnecessary storage areas (storage unnecessary areas) among all the storage areas of the primary cache 22 according to the control content from the transfer control unit 24.

そうすることにより、バッファキャッシュ２３から送り込まれる１ブロック分の画素データのうち、演算器２１が必要とするデータのみが、１次キャッシュ２２に格納されることになる。 By doing so, only the data required by the calculator 21 is stored in the primary cache 22 among the pixel data for one block sent from the buffer cache 23.

また、第３段演算部３０においては、１次キャッシュ３２に格納されるべきデータのうち、演算器３１が必要とするデータは、水平方向に１行分の画素データ「２２、２３、２４、２５」である。一方、図６に示した他の画素データ「１２、１３、１４、１５、３２、３３、３４、３５」は不要となる。 In the third stage arithmetic unit 30, of the data to be stored in the primary cache 32, the data required by the computing unit 31 is the one horizontal line of pixel data “22, 23, 24, 25”. The other pixel data shown in FIG. 6, “12, 13, 14, 15, 32, 33, 34, 35”, is unnecessary.

このため、転送制御部３４は、バッファキャッシュ３３から送り込まれる１ブロック分の画素データのうち、３×３の画素群における、中心画素の右側に位置する画素のデータのみが１次キャッシュ３２に格納されるよう、第１の転送実行部３５を制御する。 For this reason, the transfer control unit 34 controls the first transfer execution unit 35 so that, of the one block of pixel data sent from the buffer cache 33, only the data of the pixel located to the right of the center pixel in the 3 × 3 pixel group is stored in the primary cache 32.

第１の転送実行部３５は、転送制御部３４からの制御内容に従って、１次キャッシュ３２の全記憶領域のうち、不要となる記憶領域（記憶不要領域）の記憶動作を停止させる。 The first transfer execution unit 35 stops the storage operation of unnecessary storage areas (storage unnecessary areas) among all the storage areas of the primary cache 32 in accordance with the control content from the transfer control unit 34.

そうすることにより、バッファキャッシュ３３から送り込まれる１ブロック分の画素データのうち、演算器３１が必要とするデータのみが、１次キャッシュ３２に格納されることになる。 By doing so, only the data required by the calculator 31 is stored in the primary cache 32 among the pixel data for one block sent from the buffer cache 33.

さらに、第４段演算部４０においては、１次キャッシュ４２に格納されるべきデータのうち、演算器４１が必要とするデータは、水平方向に１行分の画素データ「２２、２３、２４」である。一方、図６に示した他の画素データ「１２、１３、１４、３２、３３、３４」は不要となる。 Furthermore, in the fourth stage arithmetic unit 40, of the data to be stored in the primary cache 42, the data required by the computing unit 41 is the one horizontal line of pixel data “22, 23, 24”. The other pixel data shown in FIG. 6, “12, 13, 14, 32, 33, 34”, is unnecessary.

このため、転送制御部４４は、バッファキャッシュ４３から送り込まれる１ブロック分の画素データのうち、３×３の画素群における、中心画素の左側に位置する画素のデータのみが１次キャッシュ４２に格納されるよう、第１の転送実行部４５を制御する。 For this reason, the transfer control unit 44 controls the first transfer execution unit 45 so that, of the one block of pixel data sent from the buffer cache 43, only the data of the pixel located to the left of the center pixel in the 3 × 3 pixel group is stored in the primary cache 42.

第１の転送実行部４５は、転送制御部４４からの制御内容に従って、１次キャッシュ４２の全記憶領域のうち、不要となる記憶領域（記憶不要領域）の記憶動作を停止させる。 The first transfer execution unit 45 stops the storage operation of unnecessary storage areas (storage unnecessary areas) among all the storage areas of the primary cache 42 according to the control content from the transfer control unit 44.

そうすることにより、バッファキャッシュ４３から送り込まれる１ブロック分の画素データのうち、演算器４１が必要とするデータのみが、１次キャッシュ４２に格納されることになる。 By doing so, only the data necessary for the computing unit 41 among the pixel data for one block sent from the buffer cache 43 is stored in the primary cache 42.

このようにして、１次キャッシュ１２、２２、３２、４２の各々における不要な記憶領域の記憶動作を停止させることができる。このため、１次キャッシュ１２、２２、３２、４２の各々の消費電力を、各々の全記憶領域を動作させる場合と比較して、大幅に削減することができる。 In this manner, the storage operation of unnecessary storage areas in each of the primary caches 12, 22, 32, and 42 can be stopped. For this reason, the power consumption of each of the primary caches 12, 22, 32, and 42 can be greatly reduced as compared with the case where all the storage areas are operated.

また、バッファキャッシュ１３、２３、３３、４３と１次キャッシュ１２、２２、３２、４２との間におけるデータ転送量自体も減らすことができ、その結果、各データ転送に要する消費電力も削減することができる。 In addition, the data transfer amount itself between the buffer caches 13, 23, 33, 43 and the primary caches 12, 22, 32, 42 can be reduced, and as a result, the power consumption required for each data transfer can be reduced. Can do.

したがって、この実施例１によれば、データ処理装置１の消費電力を大幅に削減することができる。 Therefore, according to the first embodiment, the power consumption of the data processing apparatus 1 can be greatly reduced.
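
The selective storage of Example 1 can be sketched in Python as follows. Each incoming block is modelled as a three-pixel vertical line, and each stage keeps only the pixel in the row its computing unit needs. The table `NEEDED_ROW` and the function `store_needed` are hypothetical names for this illustration, and the sketch models only which pixels are written, not the power-gating of the unused storage areas.

```python
# Row of each 3-pixel block that a stage's computing unit needs
# (index 0 = top row, 1 = middle row, 2 = bottom row).
NEEDED_ROW = {
    "stage1": 0,  # pixel above the center pixel
    "stage2": 2,  # pixel below the center pixel
    "stage3": 1,  # pixel to the right of the center pixel (middle row)
    "stage4": 1,  # pixel to the left of the center pixel (middle row)
}


def store_needed(primary, stage, block):
    """Append only the needed pixel of `block` to `primary`.

    Returns the number of pixels in the block that were not stored.
    """
    primary.append(block[NEEDED_ROW[stage]])
    return len(block) - 1


if __name__ == "__main__":
    primary_12 = []
    skipped = 0
    for c in range(2, 7):  # vertical lines 2..6 reach primary cache 12
        skipped += store_needed(primary_12, "stage1", [10 + c, 20 + c, 30 + c])
    print(primary_12)  # [12, 13, 14, 15, 16]
    print(skipped)     # 10 pixels were never written
```

In this model two of every three pixels in a block are never written to the primary cache, which is the source of the reduction in storage activity and transfer volume described in the text.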

（実施例２） (Example 2)
この実施例２は、バッファキャッシュ１３、２３、３３、４３と１次キャッシュ１２、２２、３２、４２との間におけるデータ転送を制御する他の実施例である。 Example 2 is another example in which data transfer between the buffer caches 13, 23, 33, and 43 and the primary caches 12, 22, 32, and 42 is controlled.

In Example 2, as shown in FIG. 8, the storage areas whose storage operation is stopped in each of the primary caches 12, 22, 32, and 42 are increased further relative to Example 1.

In Example 2, as in Example 1, the computing unit 11 in FIG. 8 executes processing using the data of the pixel located above the center pixel of a single, shared 3 × 3 pixel group. The computing unit 21 executes processing using the data of the pixel located below that center pixel. The computing unit 31 executes processing using the data of the pixel located to the right of that center pixel. The computing unit 41 executes processing using the data of the pixel located to the left of that center pixel.

Example 2 differs from Example 1 in that, taking the first-stage arithmetic unit 10 as an example, the data actually used by the computing unit 11 is narrowed down to the pixel data “16” in the primary cache 12.

In this case, unlike in Example 1, the first-stage arithmetic unit 10 only needs to be able to store the pixel data “16” shown in FIG. 7.

Furthermore, the transfer control unit 14 controls the first transfer execution unit 15 so that the data sent sequentially from the buffer cache 13 is held in the primary cache 12 only immediately after it is sent.

As a result, of the one block of pixel data sent from the buffer cache 13, only the data actually used by the computing unit 11 is stored in the primary cache 12.

Similarly, in the second-stage arithmetic unit 20, the data actually used by the computing unit 21 is narrowed down to the pixel data “35” in the primary cache 22.

In this case, unlike in Example 1, the second-stage arithmetic unit 20 only needs to be able to store the pixel data “35” shown in FIG. 7.

Furthermore, the transfer control unit 24 controls the first transfer execution unit 25 so that the data sent sequentially from the buffer cache 23 is held in the primary cache 22 only immediately after it is sent.

As a result, of the one block of pixel data sent from the buffer cache 23, only the data actually used by the computing unit 21 is stored in the primary cache 22.

In the third-stage arithmetic unit 30, the data actually used by the computing unit 31 is narrowed down to the pixel data “25” in the buffer cache 33.

In this case, unlike in Example 1, the third-stage arithmetic unit 30 does not need to store pixel data in the primary cache 32.

For this reason, the transfer control unit 34 controls the first transfer execution unit 35 so that the one block of pixel data sent from the buffer cache 33 is not stored in the primary cache 32.

In accordance with the control from the transfer control unit 34, the first transfer execution unit 35 stops the storage operation of the storage areas of the primary cache 32 that are not needed (storage-unnecessary areas), which in this case means all of its storage areas.

None of the one block of pixel data sent from the buffer cache 33 is stored in the primary cache 32.

In the fourth-stage arithmetic unit 40, the data actually used by the computing unit 41 is narrowed down to the pixel data “22” in the primary cache 42.

In this case, unlike in Example 1, the fourth-stage arithmetic unit 40 only needs to be able to store the pixel data “22, 23” shown in FIG. 7.

Furthermore, the transfer control unit 44 controls the first transfer execution unit 45 so that the data sent sequentially from the buffer cache 43 is held in the primary cache 42 from immediately after it is sent until it comes to lie to the left of the center pixel.

As a result, of the one block of pixel data sent from the buffer cache 43, only the data actually used by the computing unit 41 is stored in the primary cache 42.

Therefore, according to Example 2, the power consumption of the data processing apparatus 1 can be greatly reduced.
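The retention behaviour of Example 2 can be sketched with bounded queues. This is only an illustration under stated assumptions: the per-stage retention depths in `RETENTION` (1, 1, 0, and 2 entries) are inferred from the pixel data named in the text, the class name `PrimaryCacheModel` is hypothetical, and for simplicity every stage is fed the same data stream.

```python
from collections import deque

# Assumed number of entries each primary cache keeps active:
# stages 1 and 2 hold data only right after it arrives, stage 3 holds
# nothing (its computing unit reads the buffer cache directly), and
# stage 4 holds data until it reaches the left of the center pixel.
RETENTION = {1: 1, 2: 1, 3: 0, 4: 2}

class PrimaryCacheModel:
    def __init__(self, depth):
        # A deque with maxlen discards the oldest entry automatically;
        # maxlen=0 models a cache whose storage is stopped entirely.
        self.entries = deque(maxlen=depth)

    def receive(self, pixel):
        self.entries.append(pixel)

caches = {stage: PrimaryCacheModel(d) for stage, d in RETENTION.items()}
for pixel in [21, 22, 23]:            # data sent in sequence
    for cache in caches.values():
        cache.receive(pixel)

held = {stage: list(c.entries) for stage, c in caches.items()}
```

After the three transfers, stages 1 and 2 hold only the newest value, stage 3 holds nothing, and stage 4 holds the two most recent values.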

(Example 3)
Unlike Examples 1 and 2, Example 3 controls data transfer among the buffer caches 13, 23, 33, and 43.

In FIG. 9, the computing unit 11 executes processing using the data of the pixel located above the center pixel of a single, shared 3 × 3 pixel group. The computing unit 21 executes processing using the data of the pixel located below that center pixel. The computing unit 31 executes processing using the data of the pixel located to the right of that center pixel. The computing unit 41 executes processing using the data of the pixel located to the left of that center pixel.

Therefore, in the data transfer from the buffer cache 13 to the buffer cache 23, the data “17” of the pixel located above the center pixel in the 3 × 3 pixel group, which is required by the computing unit 11, does not need to be transferred to the buffer cache 23. In other words, what must be transferred is the data “27, 37” of the pixels located below and to the left and right of the center pixel, which are required by the computing units 21, 31, and 41.

For this reason, the transfer control unit 14 controls the second transfer execution unit 16 so that the data sent from the buffer cache 13 to the buffer cache 23 is set to the data of the 3 × 3 pixel group excluding the data of the pixel located above the center pixel.

In accordance with the control from the transfer control unit 14, the second transfer execution unit 16 sets, from the one block of data sent to the buffer cache 13, the data that the buffer cache 13 should send on to the buffer cache 23.

As a result, of the one block of pixel data sent to the buffer cache 13, only the data required by the computing units 21, 31, and 41 is transferred to the buffer cache 23.

Similarly, in the data transfer from the buffer cache 23 to the buffer cache 33, the data “36” of the pixel located below the center pixel in the 3 × 3 pixel group, which is required by the computing unit 21, does not need to be transferred to the buffer cache 33. In other words, what must be transferred is the data “26” of the pixels located to the left and right of the center pixel, which are required by the computing units 31 and 41.

For this reason, the transfer control unit 24 controls the second transfer execution unit 26 so that the data sent from the buffer cache 23 to the buffer cache 33 is set to the data of the 3 × 3 pixel group excluding the data of the pixels located above and below the center pixel.

In accordance with the control from the transfer control unit 24, the second transfer execution unit 26 sets, from the data sent to the buffer cache 23, the data that the buffer cache 23 should send on to the buffer cache 33.

As a result, of the pixel data sent to the buffer cache 23, only the data required by the computing units 31 and 41 is transferred to the buffer cache 33.

In the data transfer from the buffer cache 33 to the buffer cache 43, the data “25” of the pixel located to the left of the center pixel, which is required by the computing unit 41, must be transferred.

For this reason, the transfer control unit 34 controls the second transfer execution unit 36 so that the data sent from the buffer cache 33 to the buffer cache 43 is set to the data of the pixel located to the left of the center pixel in the 3 × 3 pixel group.

In accordance with the control from the transfer control unit 34, the second transfer execution unit 36 sets, from the data sent to the buffer cache 33, the data that the buffer cache 33 should send on to the buffer cache 43.

As a result, of the pixel data sent to the buffer cache 33, only the data required by the computing unit 41 is transferred to the buffer cache 43.

In this way, the amount of data transferred among the buffer caches 13, 23, 33, and 43 can be reduced, which in turn reduces the power consumed by each data transfer.

Therefore, according to Example 3, the power consumption of the data processing apparatus 1 can be greatly reduced.
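The progressive filtering between buffer caches in Example 3 amounts to forwarding, at each stage, only what some later stage still needs. The following is a minimal sketch of that rule; the position labels, the `NEEDED` table, and the function `forward_set` are hypothetical names introduced for illustration.

```python
# Window position used by each stage's computing unit (labels assumed).
NEEDED = {1: {"up"}, 2: {"down"}, 3: {"right"}, 4: {"left"}}

def forward_set(stage, needed=NEEDED):
    """Positions that the buffer cache of `stage` must pass to the next stage.

    This is the union of the positions needed by all later stages; data
    needed only by this stage or earlier ones is not forwarded.
    """
    later = [positions for s, positions in needed.items() if s > stage]
    return set().union(*later)

forwarded = {stage: forward_set(stage) for stage in (1, 2, 3)}
```

With these assumptions the first buffer cache forwards the down, right, and left pixels, the second forwards right and left, and the third forwards only left, so the transfer volume shrinks at every hop.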

In Examples 1 to 3, as described above, in each of the first- to fourth-stage arithmetic units 10, 20, 30, and 40, the transfer control units 14, 24, 34, and 44 and the first transfer execution units 15, 25, 35, and 45 operate to control data transfer between the buffer caches 13, 23, 33, and 43 and the primary caches 12, 22, 32, and 42. Likewise, the transfer control units 14, 24, 34, and 44 and the second transfer execution units 16, 26, 36, and 46 operate to control data transfer among the buffer caches 13, 23, 33, and 43.

The control of the first transfer execution units 15, 25, 35, and 45 and the second transfer execution units 16, 26, 36, and 46 by the transfer control units 14, 24, 34, and 44 may be performed, for example, on the basis of the result of the data access pattern analysis that is used when known automatic parallelization is applied to the program (instructions) processed by the data processing apparatus 1. Here, data access pattern analysis means analyzing the regularity of the data accesses that occur when the program is executed.
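As a toy illustration of what "regularity of data access" can mean, the sketch below checks whether an address trace has a constant stride. It is an assumption made for illustration only and is not the analysis used by the apparatus or by any particular parallelizing compiler; the function name `detect_stride` and the traces are hypothetical.

```python
def detect_stride(addresses):
    """Return the constant stride of an address trace, or None if irregular."""
    if len(addresses) < 2:
        return None
    # Collect the distinct differences between consecutive addresses.
    strides = {b - a for a, b in zip(addresses, addresses[1:])}
    return strides.pop() if len(strides) == 1 else None

trace_regular = [0x100, 0x104, 0x108, 0x10C]   # hypothetical 4-byte steps
trace_irregular = [0x100, 0x104, 0x110]        # no single stride
```

A regular trace of this kind is what would let a controller decide ahead of time which data each stage needs to store or forward.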

In general, processing a single program in parallel requires the following parallelization work: dividing the program into tasks; detecting and specifying parallelism on the basis of an analysis of the data access patterns between tasks; assigning each task to a processor; and inserting data communication code and synchronization code between processors.

Of this work, the data access pattern analysis is, for example, executed automatically by the data processing apparatus 1. The data access patterns of a program written without parallelism in mind are analyzed, and the data processing apparatus 1 executes the program using the analysis result.

After the program has been analyzed, the transfer control units 14, 24, 34, and 44 refer to the analysis result and detect the processing performed by each of the computing units 11, 21, 31, and 41. Using the detection result, the transfer control units 14, 24, 34, and 44 then identify the data to be stored in each of the primary caches 12, 22, 32, and 42 and the buffer caches 13, 23, 33, and 43.

In accordance with the identification result, the transfer control units 14, 24, 34, and 44 control the operations of the first transfer execution units 15, 25, 35, and 45 and the second transfer execution units 16, 26, 36, and 46.

On the other hand, the access pattern analysis described above may already have been carried out when the programmer wrote the program. This is the case when the programmer has explicitly described the access pattern, for example using a description method defined for the data processing apparatus 1. In other words, the data access pattern of the program processed by the data processing apparatus 1 has been analyzed in advance.

In this case, the result of the access pattern analysis is available in advance, so the storage areas of the primary caches 12, 22, 32, and 42 and the buffer caches 13, 23, 33, and 43 can be reduced beforehand on the basis of that result.

Furthermore, the transfer control units 14, 24, 34, and 44, the first transfer execution units 15, 25, 35, and 45, and the second transfer execution units 16, 26, 36, and 46 become unnecessary.

This makes it possible to simplify the configuration of the data processing apparatus 1 and to reduce its power consumption further.

Parallelization of programs, including the data access pattern analysis described above, is described in, for example, Hiroki Honda, “System Software for Parallel Processing - 3. Automatic Parallelizing Compiler,” Information Processing, Vol. 34, No. 9, pp. 1150-1157.

(Embodiment 2)
Next, Embodiment 2 of the present invention will be described. Embodiment 2 concerns a specific configuration of the data processing apparatus 1 according to Embodiment 1.

The data processing apparatus according to the present embodiment is a Linear Array Pipeline Processor, in which n VLIW processors capable of executing known VLIW instructions are arranged in series and a register file, a computing unit, and a cache are coupled to each of the n VLIW processors.

The data processing apparatus according to the present embodiment has two operating states: a normal operating state (non-array operating state) and an array operating state. So that existing program assets can be used, only the first stage operates during normal operation, and it behaves like an existing VLIW processor. For this purpose, the first-stage registers are provided with feedback from the first-stage computing unit and LD/ST unit.

During array operation, on the other hand, VLIW instructions are mapped across all n stages, and the same instruction sequence is executed repeatedly until a termination condition is satisfied.

In this way, up to n times as much parallel processing can be performed as with a single computing unit.
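The "up to n times" figure can be illustrated with a simple cycle count. The model below is an idealization that is not stated in the source: every stage is assumed to take exactly one cycle per data item, and the pipeline is assumed never to stall. The function names are hypothetical.

```python
def single_unit_cycles(items, n_stages):
    # One unit performs all n operations of each iteration in sequence.
    return items * n_stages

def array_cycles(items, n_stages):
    # The first item needs n_stages cycles to pass through the array;
    # after that, one item completes per cycle.
    return n_stages + items - 1

def speedup(items, n_stages):
    return single_unit_cycles(items, n_stages) / array_cycles(items, n_stages)
```

Under these assumptions, `speedup(items, n)` stays below n because of the pipeline fill time and approaches n as the number of items grows, which is consistent with n being an upper bound.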

FIG. 10 is a block diagram showing a schematic configuration of the data processing apparatus according to Embodiment 2 of the present invention.

As shown in FIG. 10, the data processing apparatus 3 according to the present embodiment includes an instruction fetch unit 70, an instruction decode unit 80, reg (register file unit)_1 to reg_n, computing unit_1 to computing unit_n, LD/ST (load/store unit)_1 to LD/ST_n, primary cache_1 to primary cache_n, buffer cache_1 to buffer cache_n, a secondary cache 50, and a main memory 60.

The instruction fetch unit 70 fetches the required instructions from an instruction memory unit (not shown), and the instruction decode unit 80 decodes the fetched instructions. The decoding result of the instruction decode unit 80 determines the processing performed by computing unit_1 to computing unit_n.

The data processing apparatus 3 is premised on a known VLIW processor architecture. It is assumed that the instruction fetch unit 70 fetches, for example, four 32-bit-wide instructions simultaneously and that the instruction decode unit 80 decodes the fetched instructions simultaneously.

In the data processing apparatus 3, the first-stage arithmetic unit includes reg_1, computing unit_1, LD/ST_1, primary cache_1, and buffer cache_1. The first-stage arithmetic unit also includes the transfer control unit, the first transfer execution unit, and the second transfer execution unit of Embodiment 1. To keep the drawing easy to read, the transfer control unit, the first transfer execution unit, and the second transfer execution unit are not shown in FIG. 10. The same applies to the other arithmetic units described below.

Similarly, the second-stage arithmetic unit includes reg_2, computing unit_2, LD/ST_2, primary cache_2, and buffer cache_2. The second-stage arithmetic unit also includes the transfer control unit, the first transfer execution unit, and the second transfer execution unit of Embodiment 1.

The third-stage arithmetic unit includes reg_3, computing unit_3, LD/ST_3, primary cache_3, and buffer cache_3. The third-stage arithmetic unit also includes the transfer control unit, the first transfer execution unit, and the second transfer execution unit of Embodiment 1.

The fourth-stage arithmetic unit includes reg_4, computing unit_4, LD/ST_4, primary cache_4, and buffer cache_4. The fourth-stage arithmetic unit also includes the transfer control unit, the first transfer execution unit, and the second transfer execution unit of Embodiment 1.

The n-th-stage arithmetic unit includes reg_n, computing unit_n, LD/ST_n, primary cache_n, and buffer cache_n. The n-th-stage arithmetic unit also includes the transfer control unit, the first transfer execution unit, and the second transfer execution unit of Embodiment 1.

To keep the drawing easy to read, FIG. 10 does not show the transfer control unit, the first transfer execution unit, or the second transfer execution unit of any of the first- to n-th-stage arithmetic units.

reg_1 to reg_n hold the data required for the arithmetic processing in the corresponding computing unit_1 to computing unit_n. Each of reg_1 to reg_n has a register group (not shown) consisting of a plurality of registers and a transfer device (not shown) for transferring the read data of each register in the register group to the outside.

Reads from and writes to each register of a register group are executed on the basis of the decoding result of the instruction decode unit 80. Each register of a register group is read or written using its own register number as the access key.

When a read register number is specified, the transfer device of each of reg_1 to reg_n transfers the data held in the register with that number to the outside.

The registers in the register groups of reg_1 to reg_n correspond to one another one-to-one. Specifically, registers with the same register number are associated with one another across the register groups of reg_1 to reg_n.

Each of computing unit_1 to computing unit_n performs the actual processing in the data processing apparatus 3. Each of computing unit_1 to computing unit_n corresponds to one of the computing units of Embodiment 1.

Each of computing unit_1 to computing unit_n has a computing unit group (not shown) consisting of a plurality of computing units, a holder group (not shown) consisting of a plurality of holders, and a transfer device (not shown).

The transfer device of each of reg_1 to reg_n can transfer the read data of the registers in its register group to the corresponding computing unit_1 to computing unit_n. Each computing unit in the computing unit groups of computing unit_1 to computing unit_n obtains two items of read data from the registers of reg_1 to reg_n and uses them to execute various operations such as the four arithmetic operations and logical operations. The computing units perform their operations simultaneously.

Each holder in the holder groups of computing unit_1 to computing unit_n stores the operation result of its corresponding computing unit. The holders correspond one-to-one to the computing units.

The transfer device of each of computing unit_1 to computing unit_n transfers the operation results of the computing units, stored in the corresponding holders, to the outside.

Each of LD/ST_1 to LD/ST_n has a load unit group (not shown) consisting of a plurality of LDs (load units) and a store unit group consisting of a plurality of STs (store units).

Each of primary cache_1 to primary cache_n is connected to its corresponding LD/ST_1 to LD/ST_n and is read and written at high speed in accordance with the load and store operations of LD/ST_1 to LD/ST_n. Each of primary cache_1 to primary cache_n is built from a small-capacity cache memory separate from the large-capacity secondary cache 50.

Each of buffer cache_1 to buffer cache_n must have a very small capacity, because in the extreme case its entire contents are propagated to the following stages. For this reason, like primary cache_1 to primary cache_n, each of buffer cache_1 to buffer cache_n is built from a small-capacity cache memory separate from the large-capacity secondary cache 50.

The data processing apparatus 3 is premised on a known VLIW processor architecture. Machine-language instructions in VLIW format are therefore normally executed by reg_1, computing unit_1, LD/ST_1, primary cache_1, and buffer cache_1, which make up the first-stage arithmetic unit. That is, arithmetic processing in the VLIW manner (non-array operation) is executed by the first-stage arithmetic unit.

Accordingly, the register information needed to start the simultaneous operation of arithmetic processing by a plurality of computing units (array operation) described in Embodiment 1 is always held in reg_1.

When an array operation start instruction is detected in the decoding result of the instruction decode unit 80, control information A for computing unit_1 to computing unit_n is set in each of the first- to n-th-stage arithmetic units. Control information A consists of source register numbers, which identify the registers holding the data required for the arithmetic processing of each of computing unit_1 to computing unit_n; the operation type of the arithmetic processing performed by each of computing unit_1 to computing unit_n; and destination register numbers, which identify the registers in which the operation results of each of computing unit_1 to computing unit_n are stored.

Control information A may be placed as additional information attached to the array operation start instruction. In that case, control information A can be obtained all at once when the array operation start instruction is decoded.

Alternatively, control information A may be supplied as the subsequent VLIW instruction sequence itself. In that case, after the array operation start instruction is decoded, the VLIW instructions that follow are decoded in order. Before the backward branch instruction that signifies loop repetition (that is, the instruction corresponding to the final stage of the array operation) is decoded, a forward branch instruction that signifies exit from the loop (that is, an instruction corresponding to the termination condition of the array operation) can be detected and set as the halt condition. This reduces the control information that has to be added to an existing instruction sequence.
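The scan over the subsequent instruction sequence can be sketched as follows. This is a simplified model under stated assumptions: each instruction is represented as a `(mnemonic, branch_target)` tuple, a branch whose target index is larger than its own index is treated as a forward branch, and all mnemonics and the function name `scan_loop_body` are hypothetical.

```python
def scan_loop_body(instructions, start):
    """Scan from `start` up to and including the first backward branch.

    Returns the scanned loop body and the index of the first forward
    branch found in it, which serves as the exit (halt) condition, or
    None if the body contains no forward branch.
    """
    body, exit_index = [], None
    for index in range(start, len(instructions)):
        _mnemonic, target = instructions[index]
        body.append(instructions[index])
        if target is not None and target > index and exit_index is None:
            exit_index = index          # forward branch: leaves the loop
        if target is not None and target <= index:
            return body, exit_index     # backward branch: loop repeats
    raise ValueError("no backward branch found after the start instruction")

program = [
    ("array_start", None),   # hypothetical array operation start instruction
    ("load", None),
    ("branch_if_done", 5),   # forward branch out of the loop
    ("add", None),
    ("branch", 1),           # backward branch to the top of the loop
    ("store", None),
]
body, exit_index = scan_loop_body(program, start=1)
```

Here the loop body is the four instructions at indices 1 to 4, and the forward branch at index 2 is picked up as the exit condition without any extra control fields being added to the instruction sequence.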

Given that the data required for the arithmetic processing of each of computing unit_1 to computing unit_n is propagated sequentially from the preceding stage, the control information does not need to be broadcast to all of computing unit_1 to computing unit_n at once. The apparatus can instead be configured so that the control information arrives at each of computing unit_1 to computing unit_n at the same time as its first data.

Once array operation has started, one iteration of a loop structure, for example, is mapped onto the network of computing units, and a large amount of data is processed by feeding data in sequentially.

That is, once array operation has started, the control information for each of computing unit_1 to computing unit_n does not need to be changed until the array operation ends, and the decoding by the instruction decode unit 80 that was needed during non-array operation no longer has to be performed. The instruction decode unit 80 can therefore be stopped, and the fetch operation of the instruction fetch unit 70 can be stopped as well.

In addition, an array operation termination condition for stopping the array operation of each of computing unit_1 to computing unit_n is added to control information A, so that the apparatus automatically returns to non-array operation when the condition specified in advance is satisfied during array operation.

Specifically, the array operation termination condition is, for example, the number of execution cycles of each of arithmetic unit_1 through arithmetic unit_n.
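The switch between non-array operation and array operation can be modeled as follows. This is a hedged sketch, not the disclosed circuit: the class name, its fields, and the choice of a simple cycle counter as the termination condition are assumptions for illustration. During array operation the fetch and decode flags are cleared, and they are set again when the cycle limit is reached.

```python
class ArrayModeController:
    """Hypothetical model of switching between non-array and array operation."""

    def __init__(self) -> None:
        self.array_mode = False
        self.fetch_enabled = True    # stands in for instruction fetch unit 70
        self.decode_enabled = True   # stands in for instruction decode unit 80
        self.remaining_cycles = 0

    def start_array(self, cycle_limit: int) -> None:
        # cycle_limit stands in for the termination condition carried by control information A.
        self.array_mode = True
        self.fetch_enabled = False
        self.decode_enabled = False
        self.remaining_cycles = cycle_limit

    def tick(self) -> None:
        # Model one execution cycle of the arithmetic units.
        if not self.array_mode:
            return
        self.remaining_cycles -= 1
        if self.remaining_cycles <= 0:
            # Termination condition met: return to non-array operation.
            self.array_mode = False
            self.fetch_enabled = True
            self.decode_enabled = True


controller = ArrayModeController()
controller.start_array(cycle_limit=3)
states = []
for _ in range(3):
    controller.tick()
    states.append(controller.array_mode)
```

With a limit of three cycles, the model stays in array operation for the first two ticks and returns to non-array operation on the third.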

The secondary cache 50 is connected only to buffer cache_1, which is held by LD/ST_1. To the second and subsequent stages, the data of buffer cache_1 is propagated sequentially.

A load instruction refers to primary cache_1 and buffer cache_1 according to the address obtained when arithmetic unit_1 adds to or subtracts from the address information stored in reg_1, and the data obtained is stored in a store unit of the store unit group of LD/ST_1.

In the next cycle, the data stored in this store unit becomes an input to arithmetic unit_2 or reg_2 of the subsequent stage.
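The load path just described can be sketched in a few lines. The sketch rests on assumptions that the text does not state: the caches are modeled as dictionaries keyed by address, the primary cache is consulted before the buffer cache, and a miss in both returns `None`. All names are hypothetical.

```python
from typing import Dict, Optional


def load(reg_value: int, offset: int,
         primary_cache: Dict[int, int],
         buffer_cache: Dict[int, int]) -> Optional[int]:
    """Compute the address in the arithmetic unit, then refer to both caches."""
    address = reg_value + offset           # addition or subtraction (negative offset)
    if address in primary_cache:           # assumed lookup order: primary cache first
        return primary_cache[address]
    return buffer_cache.get(address)       # then the buffer cache; None on a miss


# Hypothetical contents of primary cache_1 and buffer cache_1.
primary_cache_1 = {0x100: 11}
buffer_cache_1 = {0x104: 22}

# The loaded value is held in a store unit of LD/ST_1 ...
store_unit_1 = load(0x100, 4, primary_cache_1, buffer_cache_1)
# ... and becomes an input to the subsequent stage in the next cycle.
next_stage_input = store_unit_1
```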

Next, the operation of the cache system of the data processing apparatus 3 will be described. FIG. 11 is an explanatory diagram of the cache system of the data processing apparatus 3.

As shown in FIG. 11, in the cache system 4 of the data processing apparatus 3, the first stage arithmetic unit includes a primary cache 103 and a buffer cache 104, the second stage arithmetic unit includes a primary cache 203 and a buffer cache 204, the third stage arithmetic unit includes a primary cache 303 and a buffer cache 304, the fourth stage arithmetic unit includes a primary cache 403 and a buffer cache 404, the fifth stage arithmetic unit includes a primary cache 503 and a buffer cache 504, the sixth stage arithmetic unit includes a primary cache 603 and a buffer cache 604, the seventh stage arithmetic unit includes a primary cache 703 and a buffer cache 704, the eighth stage arithmetic unit includes a primary cache 803 and a buffer cache 804, and the ninth stage arithmetic unit includes a primary cache 903 and a buffer cache 904.

The primary cache 103 is connected to the LD/STs 101 and 102, the primary cache 203 to the LD/STs 201 and 202, the primary cache 303 to the LD/STs 301 and 302, the primary cache 403 to the LD/STs 401 and 402, the primary cache 503 to the LD/STs 501 and 502, the primary cache 603 to the LD/STs 601 and 602, the primary cache 703 to the LD/STs 701 and 702, the primary cache 803 to the LD/STs 801 and 802, and the primary cache 903 to the LD/STs 901 and 902.

The secondary cache 50 has four banks, organized as four ways (Way0, Way1, Way2, Way3), and a block made up of the data of Way0, Way1, and Way2 is sent to the buffer cache 104.

In this cache system 4, taking 3 × 3 pixels as an example, when the pixel data "06, 16, 26" for one line in the vertical direction is sent from the secondary cache 50 to the buffer cache 104 of the first stage arithmetic unit, the pixel data "05, 15, 25" for one line in the vertical direction that had been stored in the buffer cache 104 until then is transferred to the primary cache 103 of the first stage arithmetic unit and also to the buffer cache 204 of the second stage arithmetic unit.

Similarly, when the pixel data "05, 15, 25" for one line in the vertical direction is sent from the buffer cache 104 of the first stage arithmetic unit to the buffer cache 204 of the second stage arithmetic unit, the pixel data "04, 14, 24" for one line in the vertical direction that had been stored in the buffer cache 204 until then is transferred to the primary cache 203 of the second stage arithmetic unit and also to the buffer cache 304 of the third stage arithmetic unit.

When the pixel data "04, 14, 24" for one line in the vertical direction is sent from the buffer cache 204 of the second stage arithmetic unit to the buffer cache 304 of the third stage arithmetic unit, the pixel data "03, 13, 23" for one line in the vertical direction that had been stored in the buffer cache 304 until then is transferred to the primary cache 303 of the third stage arithmetic unit and also to the buffer cache 404 of the fourth stage arithmetic unit.

When the pixel data "03, 13, 23" for one line in the vertical direction is sent from the buffer cache 304 of the third stage arithmetic unit to the buffer cache 404 of the fourth stage arithmetic unit, the pixel data "02, 12, 22" for one line in the vertical direction that had been stored in the buffer cache 404 until then is transferred to the primary cache 403 of the fourth stage arithmetic unit and also to the buffer cache 504 of the fifth stage arithmetic unit.

When the pixel data "02, 12, 22" for one line in the vertical direction is sent from the buffer cache 404 of the fourth stage arithmetic unit to the buffer cache 504 of the fifth stage arithmetic unit, the pixel data "01, 11, 21" for one line in the vertical direction that had been stored in the buffer cache 504 until then is transferred to the primary cache 503 of the fifth stage arithmetic unit and also to the buffer cache 604 of the sixth stage arithmetic unit.

When the pixel data "01, 11, 21" for one line in the vertical direction is sent from the buffer cache 504 of the fifth stage arithmetic unit to the buffer cache 604 of the sixth stage arithmetic unit, the pixel data "00, 10, 20" for one line in the vertical direction that had been stored in the buffer cache 604 until then is transferred to the primary cache 603 of the sixth stage arithmetic unit and also to the buffer cache 704 of the seventh stage arithmetic unit.
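The chained transfer described above behaves much like a shift register, and a small simulation makes the resulting state easy to check. The following is only an illustrative sketch: the `Stage` class, `push_line`, and `vertical_line` are hypothetical names, each cache is modeled as holding a single vertical line, and the transfers that the hardware would perform concurrently are performed here one stage at a time within a single call.

```python
from typing import List, Optional, Tuple

Line = Tuple[str, ...]  # one vertical line of pixel labels, e.g. ("06", "16", "26")


class Stage:
    """Hypothetical model of one stage: a buffer cache and its primary cache."""

    def __init__(self) -> None:
        self.buffer_cache: Optional[Line] = None
        self.primary_cache: Optional[Line] = None


def push_line(stages: List[Stage], new_line: Line) -> None:
    """Send one line from the secondary cache into the first-stage buffer cache."""
    incoming: Optional[Line] = new_line
    for stage in stages:
        if incoming is None:
            break                              # nothing left to propagate
        displaced = stage.buffer_cache         # line held in the buffer cache until now
        stage.buffer_cache = incoming
        if displaced is not None:
            stage.primary_cache = displaced    # transfer to this stage's primary cache
        incoming = displaced                   # and to the next stage's buffer cache


def vertical_line(x: int) -> Line:
    """Labels of the three pixels in column x, written as row digit then column digit."""
    return tuple(f"{y}{x}" for y in range(3))


# Nine stages, as in FIG. 11; feed columns 0 through 6 in order.
stages = [Stage() for _ in range(9)]
for x in range(7):
    push_line(stages, vertical_line(x))
```

After the line "06, 16, 26" has been pushed, the model holds "06, 16, 26" in the first-stage buffer cache, "05, 15, 25" in the first-stage primary cache and the second-stage buffer cache, and "00, 10, 20" in the sixth-stage primary cache and the seventh-stage buffer cache, which matches the sequence of transfers described in the text.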

In Embodiment 2 as well, as in Embodiment 1 above, the operations of the transfer control unit, the first transfer execution unit, and the second transfer execution unit of each of the first to ninth stage arithmetic units control data transfer between the buffer caches 104, 204, 304, 404, 504, 604, 704, 804, 904 and the primary caches 103, 203, 303, 403, 503, 603, 703, 803, 903, and control data transfer among the buffer caches 104, 204, 304, 404, 504, 604, 704, 804, 904.

The present invention is not limited to the embodiments described above. Various modifications are possible within the scope of the claims, and embodiments obtained by appropriately combining technical means disclosed in different embodiments are also included in the technical scope of the present invention.

The present invention is well suited to data processing apparatuses that execute a plurality of machine language instructions simultaneously at high speed.

1, 3 Data processing apparatus
2 Blur processing program
4 Cache system
10 First stage arithmetic unit
11, 11a, 21, 21a, 31, 31a, 41, 41a Arithmetic unit
12, 12a, 22, 22a, 32, 32a, 42, 42a, 103, 203, 303, 403, 503, 603, 703, 803, 903 Primary cache (first cache)
13, 23, 33, 43, 104, 204, 304, 404, 504, 604, 704, 804, 904 Buffer cache
14, 24, 34, 44 Transfer control unit (specifying unit)
15, 25, 35, 45 First transfer execution unit (first execution unit)
16, 26, 36, 46 Second transfer execution unit (second execution unit)
20 Second stage arithmetic unit
30 Third stage arithmetic unit
40 Fourth stage arithmetic unit
50, 50a Secondary cache (second cache)
60 Main memory
70 Instruction fetch unit
80 Instruction decode unit
101, 102, 201, 202, 301, 302, 401, 402, 501, 502, 601, 602, 701, 702, 801, 802, 901, 902 LD/ST

Claims

1. A data processing apparatus comprising:
a plurality of arithmetic units;
a plurality of first caches, each provided for a corresponding one of the plurality of arithmetic units and transferring data to the corresponding arithmetic unit;
a second cache that stores data shared by the plurality of arithmetic units and used for the processing of each of the plurality of arithmetic units; and
a plurality of buffer caches, each provided for a corresponding one of the plurality of first caches and transferring data to the corresponding first cache, wherein
the plurality of buffer caches include a first-stage buffer cache that is connected to the second cache and to which data is transferred from the second cache,
the plurality of buffer caches are connected in series in order, starting from the first-stage buffer cache, and
each of the plurality of buffer caches sequentially transfers part of the data transferred from the second cache to the first-stage buffer cache to the buffer cache on its subsequent-stage side, and transfers part of the data stored in that buffer cache to the first cache corresponding to that buffer cache.

2. The data processing apparatus according to claim 1, wherein each of the plurality of buffer caches transfers, to the first cache corresponding to that buffer cache, the data necessary for the processing of the arithmetic unit corresponding to that buffer cache.

3. The data processing apparatus according to claim 2, wherein each of the plurality of first caches stops the storage operation of a storage-unnecessary area, which is the part of the storage area of that first cache that is not needed to store the data transferred from the corresponding buffer cache.

4. The data processing apparatus according to any one of claims 1 to 3, wherein each of the plurality of buffer caches transfers, to the buffer cache on its subsequent-stage side, the data necessary for the processing of the arithmetic unit corresponding to that subsequent-stage buffer cache.

5. The data processing apparatus according to claim 1, wherein each of the plurality of buffer caches stops the storage operation of a storage-unnecessary area, which is the part of the storage area of that buffer cache that is not needed to store the data transferred from the preceding-stage buffer cache.

6. The data processing apparatus according to claim 3, wherein
the data processing apparatus analyzes the data access pattern of a program that it processes, based on execution of that program, and processes the program using the analysis result, and
the data necessary for the processing of each of the plurality of arithmetic units is specified based on the analysis result of the data access pattern of the program processed by the data processing apparatus.

7. The data processing apparatus according to claim 6, further comprising:
a specifying unit that specifies the data necessary for the processing of each of the plurality of arithmetic units based on the analysis result of the data access pattern of the program;
a first execution unit that is provided for each of the plurality of first caches and stops the storage operation of the storage-unnecessary area of the corresponding first cache based on the result specified by the specifying unit; and
a second execution unit that is provided for each of the plurality of buffer caches and stops the storage operation of the storage-unnecessary area of the corresponding buffer cache based on the result specified by the specifying unit.

8. The data processing apparatus according to claim 1, wherein
the data access pattern of the program processed by the data processing apparatus has been analyzed in advance,
the data necessary for the processing of each of the plurality of arithmetic units is specified in advance based on the analysis of the data access pattern of the program,
in each of the plurality of first caches, the storage area of that first cache is set in advance so as to store the data necessary for the processing of each of the plurality of arithmetic units, and
in each of the plurality of buffer caches, the storage area of that buffer cache is set in advance so as to transfer the data necessary for the processing of each of the plurality of arithmetic units to the corresponding first cache.