JP3850599B2 - Parallel image processing apparatus and parallel image processing method

Info

Publication number: JP3850599B2
Application number: JP29011499A
Authority: JP
Inventors: 秀幸和泉; 政治水野; 史子高橋; 和司佐々木; 眞紀濱窪
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1999-10-12
Filing date: 1999-10-12
Publication date: 2006-11-29
Anticipated expiration: 2019-10-12
Also published as: JP2001109880A

Description

【０００１】
【発明の属する技術分野】
この発明は、共有メモリ型の並列計算機上で実行する画像処理において、プロセッサへの共有メモリの割り当てと、プロセッサの共有メモリへのアクセス制御に係る並列画像処理装置及び並列画像処理方法に関するものである。
【０００２】
【従来の技術】
画像処理の高速化手法の１つに、複数のプロセッサを使った並列処理がある。並列処理は、画像処理内での依存関係が少なく、画像サイズが大きいケースや処理負荷が高いケースで有効である。このような画像処理の１つに、ＳＡＲ（ＳｙｎｔｈｅｔｉｃＡｐｅｒｔｕｒｅＲａｄａｒ：合成開口レーダ）画像の再生処理がある。
【０００３】
ＳＡＲ画像再生処理を並列処理する従来例として、特開昭５８−２２９８２号公報の「合成開口レーダの画像処理システム」がある。図５６は従来システムの構成を示すブロック図である。図において、１０１はＳＡＲ受信画像データを記憶している磁気テープ、１０２は磁気テープ１０１に記憶されているＳＡＲ受信画像データをレンジ方向にライン単位に圧縮処理するレンジ方向圧縮装置、１０３−１，１０３−２は、レンジ方向圧縮済画像データをコーナーターン（縦横転置）処理するための２次元画像メモリ、１０４は、２次元画像メモリ１０３−１，１０３−２をアクセスする２次元画像メモリ制御部である。
【０００４】
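Also in Fig. 56, 105 is an azimuth compression device that compresses the corner-turned data in the azimuth direction, 106 is a magnetic tape that stores the compressed image data, 107 is a CPU that controls the overall system, 108 is the main memory of CPU 107, 109 is a common-data FFT (Fast Fourier Transform) device that performs fast Fourier transforms, and 110 is a data bus.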
【０００５】
なお、ＳＡＲ画像再生処理については、特開昭５８−２２９８２号公報に記載されている他、「合成開口レーダ画像ハンドブック」（朝倉書店、飯坂譲二監修、日本写真測量学会編、１９９８年５月２０日初版第１冊）等の市販の書籍にも記載されているので、説明を省略する。
【０００６】
次に動作について説明する。
ここでは、各処理フェーズを実行する処理装置を設けてパイプラインで実行している。また、レンジ方向圧縮装置１０２とアジマス方向圧縮装置１０５も、パイプラインで実行される。また、レンジ方向圧縮装置１０２とアジマス方向圧縮装置１０５では、内部でＦＦＴやＩＦＦＴ（ＩｎｖｅｒｓｅＦａｓｔＦｏｕｒｉｅｒＴｒａｎｓｆｏｒｍ，高速逆フーリエ変換）を行う装置を複数用意して、並列実行を行うことも可能である。
【０００７】
レンジ方向圧縮済画像データに対しデータ転置を行うコーナーターン処理は、２次元画像メモリ１０３−１，１０３−２を使って高速に実行する。ここでは、パイプラインでの処理速度のギャップを埋めるために、２個の２次元画像メモリ１０３−１，１０３−２を設け、交替で利用するための２次元画像メモリ制御部１０４を備えている。
【０００８】
上記のように、この装置は、各処理フェーズ毎に専用の処理装置を設け、処理フェーズごとに、別々の装置で処理を順次パイプライン実行していく、ＳＡＲ画像再生処理の専用システムである。また、２次元画像メモリ１０３−１，１０３−２や２次元画像メモリ制御部１０４のような専用のメモリ装置を使って、コーナーターンを実行する。また、専用のＨ／Ｗ（Ｈａｒｄｗａｒｅ）を利用したシステムとしては、特開昭５８−１９１９７９号公報にも記載されている。上記のような専用のＨ／Ｗと２次元画像メモリを利用したシステムは、特に、コーナーターン処理を高速に行えるが、一般的に実現コストは高くなる。
【０００９】
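On the other hand, one way to realize parallel processing is the tightly coupled shared-memory SMP (Symmetric MultiProcessor), which uses a general-purpose parallel computer and is advantageous in cost. Fig. 57 is a block diagram showing an example of an SMP configuration. In Fig. 57, 121, 122, 123, and 124 are processors and 125 is a shared memory. Four processors are shown here, but the system may be configured with any number of processors. An SMP is a tightly coupled shared-memory parallel computer in which every processor has equivalent functions.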
【００１０】
ＳＭＰでは、図５６に示した２次元メモリ１０３−１，１０３−２を使用せずに、共有メモリ１２５を複数のプロセッサ１２１〜１２４で共有するため、これを有効に利用することで、プロセッサ１２１〜１２４間のデータ転送を効率良く行うことができる。また、画像処理の並列化を行う場合でも、この共有メモリ１２５を有効に利用できるか否かが、処理の効率化を考えるうえで、大きなポイントとなる。
【００１１】
上記の従来システムである特開昭５８−２２９８２号公報や特開昭５８−１９１９７９号公報は、専用システムにおいて処理を並列に実行できるが、ＳＭＰ構成をとっておらず、共有メモリを有効に利用するための手段を有していない。また、ＳＡＲ画像再生処理を並列処理する従来例としては、上記のような専用システム等を用いた例はあるが、共有メモリ型のＳＭＰ構成を有した従来例はなく、コーナーターン処理を行うプロセッサのキャッシュメモリの制御方法に関する従来例もない。
【００１２】
次にＳＭＰ構成でのＳＡＲ画像再生処理の並列化方法について説明する。
まず、ＳＡＲ画像再生処理の特徴を示す。
（１）各処理部分単位では、処理結果への依存性が低く、並列実行可能な部分が多い。
（２）処理にレンジ及びアジマスの方向性があり、各方向別に行と列の処理を割り当てる（例：レンジを行、アジマスを列）。例えば、レンジ方向の処理では、レンジを行、アジマスを列として、行（レンジ）方向に処理を進めていく。
（３）処理の途中でコーナーターン処理と呼ぶ行と列の方向を転置する処理を行う。
（４）１つの画像に対して順次処理を行う。すなわち、処理結果を次々に利用して処理を行う。
（５）画像サイズが大きく、処理に時間がかかる。
【００１３】
上記のような特徴を持つＳＡＲ画像再生処理を、図５７に示すＳＭＰで実装する場合には、次のような実装方法を、容易に考えることができる。
（１）共有メモリ１２５上に、処理前と処理後の領域を確保する。
（２）使用可能なプロセッサ１２１〜１２４の数に応じて、任意の個数で画像を分割して並列実行する。
【００１４】
図５８及び図５９はＳＭＰでのＳＡＲ画像再生処理を実行する方法の一例である。共有メモリ１２５の利用方法としては、図５８に示すように、（１）各処理毎に処理前と処理後の共有メモリ１２５を切り替えて使用する方法や、図５９のように、（２）処理毎に共有メモリ１２５を割り当てて順次利用する方法や、（３）両者を組み合わせる方法等を容易に考えることができる。
【００１５】
次に共有メモリ型計算機でのキャッシュ操作について説明する。
ＳＭＰの各プロセッサ１２１〜１２４では、共有メモリ１２５の内容をキャッシュと呼ぶ高速小容量のメモリにコピーして使用（読み書き）する。そして、プロセッサ１２１〜１２４のキャッシュ内に存在しないデータへの読み込みが発生した時には、共有メモリ１２５からキャッシュへデータをコピーする。キャッシュは通常、複数の等量の記憶領域（キャッシュのラインと呼ぶ）に分割されている。データの読み込みは、このライン単位で実行する。
【００１６】
また、データの読み込み時に、キャッシュのラインが全て埋まっている場合には、１つのラインを選択して、新規のデータと入れ換える。この入れ換え対象の選択方法としては、アクセス頻度や最終アクセス時刻等を利用する方法がある。このキャッシュへのデータの読み込みを、キャッシュのリードミス時の処理と呼ぶ。
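The read-miss behavior described above can be illustrated with a small simulation. The following is a minimal sketch, assuming a fully associative cache with least-recently-used replacement (one of the selection policies the text mentions). The function name `count_read_misses`, the 64 × 64 image, and the 16-line cache are hypothetical values chosen for illustration only.

```python
from collections import OrderedDict


def count_read_misses(addresses, line_bytes, num_lines):
    """Count read misses for a fully associative LRU cache (an assumed model)."""
    cache = OrderedDict()  # keys: line indices, ordered least to most recently used
    misses = 0
    for addr in addresses:
        line = addr // line_bytes
        if line in cache:
            cache.move_to_end(line)        # refresh the last-access order
        else:
            misses += 1                    # read miss: the line must be loaded
            if len(cache) >= num_lines:
                cache.popitem(last=False)  # evict the least recently used line
            cache[line] = True
    return misses


PIXEL_BYTES, LINE_BYTES, WIDTH, HEIGHT = 8, 32, 64, 64  # hypothetical sizes

# Byte addresses visited when walking the image along rows and along columns.
row_major = [(r * WIDTH + c) * PIXEL_BYTES for r in range(HEIGHT) for c in range(WIDTH)]
col_major = [(r * WIDTH + c) * PIXEL_BYTES for c in range(WIDTH) for r in range(HEIGHT)]

row_misses = count_read_misses(row_major, LINE_BYTES, 16)
col_misses = count_read_misses(col_major, LINE_BYTES, 16)
```

Under these assumptions the row-wise walk misses once per cache line, while the column-wise walk misses on every pixel, because each line is evicted before its remaining pixels are used.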
【００１７】
共有メモリ１２５の同じ場所が、異なったプロセッサでキャッシュ（キャッシュメモリ上にデータとして存在していること）されている時に、データの書き込みが起きると、データの一貫性を保つための処理が行われる。この処理方法としては、書き込みデータをそのまま共有メモリ１２５上に書き込むライトスルー方式と、データの書き込まれたキャッシュのデータを、キャッシュのブロック（ライン）単位で共有メモリ１２５に書き戻すライトバック方式がある。このキャッシュのデータの一貫性を保つための処理時間を、キャッシュのライトミス時の処理と呼ぶ。
【００１８】
上記のキャッシュのリードミス時の処理にかかる時間と、ライトミス時の処理にかかる時間は、並列処理の性能低下の要因となるオーバヘッド時間である。
【００１９】
なお、共有メモリ型計算機でのキャッシュ操作については、「並列処理マシン」（富田眞治・末吉敏則著、オーム社、コンピュータアーキテクチャシリーズ・電子情報通信学会編）や、「並列計算機アーキテクチャ」（奥川峻史著、コロナ社）等の市販の書籍に記載されているので、詳細は省略する。
【００２０】
キャッシュ操作によるオーバヘッド時間では、ライトミス時の時間は、一般にリードミス時の時間よりも大きい。ライトミスが頻発すると、そのオーバヘッド時間により、並列処理の性能を著しく低下させる原因となる。このため、ライトミスを低減することが、並列処理を効率良く動作させる時の大きなポイントになる。
【００２１】
従来のＳＭＰでの実装方法を利用した場合、コーナーターン処理で、キャッシュミスを引き起こし易い問題がある。これを図６０，図６１を使って説明する。図６０は、プロセッサ１２１とプロセッサ１２２が割り付けた画像上の領域を示す図である。図６１は、図６０に対する共有メモリ１２５上の位置を示す図である。図６０及び図６１において、斜線の部分はキャッシュの範囲を示している。キャッシュは、各ライン単位で、共有メモリ１２５の連続した領域を対象にデータを格納している。
【００２２】
図６０に示すように、プロセッサ１２１とプロセッサ１２２では、それぞれ異なったレンジ方向の領域を割り付けて処理を行っている。この割付方法は、レンジ方向に処理を進める場合に、連続した領域を順次アクセスしていくので、キャッシュのリードミスを抑える効果がある。また、コーナーターン処理以外のＳＡＲ画像処理では、書き込みも同じ方向（この場合レンジ方向）に行い、ライトミスも起こりにくい。
【００２３】
【発明が解決しようとする課題】
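Because the conventional parallel image processing apparatus is configured as described above, in corner turn processing the write direction (azimuth direction) differs from the read direction (range direction), as shown in Fig. 60. As Fig. 61 shows, the memory regions written by processors 121 and 122 are therefore, unlike the regions they read, likely to fall within the same cache line ranges. Cache write misses then occur easily and the overhead time increases.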
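How often the write regions of different processors land in the same cache line depends on how the rows are assigned and on alignment. The following is a minimal sketch that counts the output cache lines written by more than one processor during a transpose. The function `shared_output_lines`, the row-major layout, and the two assignment rules compared are assumptions for illustration and are not taken from the figures.

```python
def shared_output_lines(num_rows, num_cols, pixels_per_line, owner_of_row):
    """Count output cache lines that more than one processor writes.

    The input image is num_rows x num_cols in row-major order; the output is
    its transpose (num_cols x num_rows), also row-major. owner_of_row maps an
    input row index to the processor that handles that row.
    """
    writers = {}
    for row in range(num_rows):
        proc = owner_of_row(row)
        for col in range(num_cols):
            out_index = col * num_rows + row      # pixel index in the transposed image
            line = out_index // pixels_per_line   # output cache line holding that pixel
            writers.setdefault(line, set()).add(proc)
    return sum(1 for procs in writers.values() if len(procs) > 1)


# Hypothetical 16 x 8 image, 4 pixels per cache line, 2 processors.
cyclic = shared_output_lines(16, 8, 4, lambda row: row % 2)   # rows dealt out alternately
banded = shared_output_lines(16, 8, 4, lambda row: row // 8)  # two aligned bands of 8 rows
```

In this toy case every output line is shared under the alternating assignment, and none is shared when each processor's band is aligned to a whole number of cache lines, which is the property the block-based allocation relies on.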
【００２４】
なお、ここでは、ＳＡＲ画像再生処理の例で説明したが、ＳＡＲ画像再生処理と同様に、各処理部分では、処理に方向性があり、処理結果への依存性が低く、途中でコーナーターン処理を実施する画像処理では、同じ課題が発生する。
【００２５】
この発明は上記のような課題を解決するためになされたもので、キャッシュの情報と、画像サイズと、使用可能なプロセッサ数の情報から、各プロセッサがコーナーターン処理を実行する画像の範囲とアクセス方法を制御することで、キャッシュのライトミスを低減し、オーバーヘッド時間を低減することができる並列画像処理装置及び並列画像処理方法を得ることを目的とする。
【００２６】
また、キャッシュのライトミスだけでなく、リードしたデータを全て使用しない時に発生するリードミスについても考慮し、最終的なオーバーヘッド時間を低減することができる並列画像処理装置及び並列画像処理方法を得ることを目的とする。
【００２７】
【課題を解決するための手段】
この発明に係る並列画像処理装置は、複数のプロセッサと共有メモリを含むプラットホーム上で動作し、画像の行方向と列方向の並びを転置するコーナーターン処理を含む画像の再生処理を実行する画像処理プログラムに上記複数のプロセッサへの上記共有メモリの対象領域の割り付けを指示するものにおいて、処理対象画像のサイズや各画素のデータサイズ等の画像情報を設定している画像情報設定手段と、画像処理で使用する上記プロセッサの個数情報を設定している使用プロセッサ数設定手段と、上記各プロセッサに搭載されているキャッシュの構成やサイズ等のキャッシュ情報を設定しているキャッシュ情報設定手段と、上記画像情報設定手段に設定されている画像情報、上記使用プロセッサ数設定手段に設定されている上記プロセッサの個数情報、上記キャッシュ情報設定手段に設定されているキャッシュ情報を入手し、キャッシュのライトミスとリードミスにかかるオーバーヘッド時間が最小となる１辺の長さが１キャッシュラインサイズの正方形の処理領域の画像ブロックを算出し、算出した画像ブロック単位で、上記各プロセッサが上記コーナーターン処理を実行するように上記共有メモリの対象領域を割り付けるメモリ割付手段とを備えたものである。
【００２８】
この発明に係る並列画像処理装置は、画像サイズが算出した画像ブロックの整数倍にならない時に、上記画像サイズが上記画像ブロックの整数倍になるように、画像領域を拡張して共有メモリ上に確保するメモリ確保手段を備えたものである。
【００２９】
この発明に係る並列画像処理装置は、画像サイズが算出した画像ブロックの整数倍にならない時に、上記画像ブロックの幅の行又は列を画像ブロックの帯として算出し、算出した上記画像ブロックの帯で画像のアクセス方向を指定するアクセス方向指定手段を備えたものである。
【００３０】
この発明に係る並列画像処理装置は、画像サイズが算出した画像ブロックの整数倍にならない時に、上記画像ブロックの幅の行又は列を画像ブロックの帯として算出し、算出した上記画像ブロックの帯で画像のアクセス方向を指定すると共に、算出した画像ブロックの帯の個数が、使用プロセッサ数設定手段に設定されているプロセッサの個数の整数倍にならない時に、上記画像ブロックの帯を分割して、複数のプロセッサで処理するよう指定する多重書き込み対応アクセス方向指定手段を備えたものである。
【００３１】
この発明に係る並列画像処理装置は、画像サイズが算出した画像ブロックの整数倍になるように、画像領域を拡張して共有メモリ上に確保するメモリ確保手段と、上記画像ブロックの幅の行又は列を画像ブロックの帯として算出し、算出した上記画像ブロックの帯で画像のアクセス方向を指定するアクセス方向指定手段と、上記画像ブロックの幅の行又は列を画像ブロックの帯として算出し、算出した上記画像ブロックの帯で画像のアクセス方向を指定すると共に、算出した画像ブロックの帯の個数が、使用プロセッサ数設定手段に設定されているプロセッサの個数の整数倍にならない時に、上記画像ブロックの帯を分割して、複数のプロセッサで処理するよう指定する多重書き込み対応アクセス方向指定手段と、上記画像サイズが上記画像ブロックの整数倍にならない時に、その修正方法の判定条件を設定しているはみ出し修正方法設定手段とを備えたものである。
【００３２】
この発明に係る並列画像処理装置は、要求仕様として与えられる時間制約と各プロセッサの処理時間等の時間制約条件を設定している処理時間制約設定手段と、実際に使用する上記プロセッサの個数を画像処理プログラムに指定する実行プロセッサ数指定手段とを備えたものである。
【００３３】
この発明に係る並列画像処理装置は、画像ブロックの処理に必要なキャッシュのライン数が不足する時に、キャッシュデータの退避回数が最小になる画像へのアクセスパターンを算出し、画像処理プログラムに指定する画素アクセス順序指定手段を備えたものである。
【００３４】
この発明に係る並列画像処理装置は、複数の画像サイズを持つ各画像を対象にコーナーターン処理を行う時に、上記各画像に対して、キャッシュのラインサイズが画素サイズの整数倍にならない場合に画素の補間を行うと共に、算出した画像ブロックの整数倍で、かつ、処理対象画像を包含する最小の画像サイズを算出し、算出した各画像の最小の画像サイズの中から最大の画像サイズを選択し、選択した上記最大の画像サイズの領域を共有メモリ上に確保する複数サイズ対応メモリ確保手段を備えたものである。
【００３５】
この発明に係る並列画像処理装置は、複数サイズ対応メモリ確保手段が確保した共有メモリの対象領域を使用し、確保の対象となった画像よりも小さいサイズの画像処理を行う時に、選択された小さいサイズの画像に対する画像ブロック単位で、コーナーターン処理に使用する共有メモリの使用範囲を算出し、算出した使用範囲で上記コーナーターン処理を実行するように、各プロセッサに処理対象の領域やアクセス手順を指定する複数サイズ対応アクセス制御手段を備えたものである。
【００３６】
この発明に係る並列画像処理装置は、複数のプロセッサと共有メモリを含むプラットホーム上で動作し、画像の行方向と列方向の並びを転置するコーナーターン処理を含む画像の再生処理を実行する画像処理プログラムに上記複数のプロセッサへの上記共有メモリの対象領域の割り付けを指示するものにおいて、処理対象画像のサイズや各画素のデータサイズ等の画像情報を設定している画像情報設定手段と、画像処理で使用する上記プロセッサの個数情報を設定している使用プロセッサ数設定手段と、上記各プロセッサに搭載されている多階層のキャッシュの構成やサイズ等のキャッシュ情報を設定しているキャッシュ情報設定手段と、上記画像情報設定手段に設定されている画像情報、上記キャッシュ情報設定手段に設定されているキャッシュ情報を入手し、１次キャッシュのライトミスとリードミスにかかるオーバーヘッド時間が最小となる１辺の長さが１キャッシュラインサイズの正方形の処理領域の画像ブロックを算出するメモリ割付手段と、上記画像情報設定手段に設定されている画像情報、上記使用プロセッサ数設定手段に設定されている上記プロセッサの個数情報、上記キャッシュ情報設定手段に設定されているキャッシュ情報を入手し、上記メモリ割付手段が算出した１次キャッシュの画像ブロックをもとに、キャッシュの階層ごとに低位キャッシュの複数の画像ブロックを包含し１辺の長さが１キャッシュラインサイズの正方形の処理領域の画像ブロックを算出し、算出した最上位キャッシュの画像ブロック単位でコーナーターン処理を実行し、かつ、上記最上位キャッシュの画像ブロックを実行する個数が上記プロセッサ間で均等になるように、上記各プロセッサへ上記共有メモリの対象領域を割り付けるよう上記画像処理プログラムに指示する多階層キャッシュ対応メモリ割付手段とを備えたものである。
【００３７】
この発明に係る並列画像処理装置は、画像サイズが最上位キャッシュの画像ブロックの整数倍にならない時に、上記画像サイズが上記最上位キャッシュの画像ブロックの整数倍になるように、画像領域を拡張して共有メモリ上に確保する多階層キャッシュ対応メモリ確保手段を備えたものである。
【００３８】
この発明に係る並列画像処理装置は、上位キャッシュのラインサイズが、低位キャッシュの画像ブロックの整数倍にならない時に、上記上位キャッシュのラインサイズが、上記低位キャッシュの画像ブロックの整数倍になるように、上記低位キャッシュの画像ブロックの領域を拡張して共有メモリ上に確保する多階層画像ブロック用メモリ確保手段を備えたものである。
【００３９】
この発明に係る並列画像処理装置は、画像サイズが算出した最上位キャッシュの画像ブロックの整数倍にならない時に、上記最上位キャッシュの画像ブロックの行又は列を上記最上位キャッシュの画像ブロックの帯として算出し、算出した上記画像ブロックの帯で画像のアクセス方向を指定すると共に、算出した画像ブロックの帯の個数が、使用プロセッサ数設定手段に設定されているプロセッサの個数の整数倍にならない時に、上記画像ブロックの帯を分割して、複数のプロセッサで処理するよう指定する多階層キャッシュ対応アクセス方向指定手段を備えたものである。
【００４０】
この発明に係る並列画像処理装置は、算出した画像ブロックの処理に必要なキャッシュのライン数が不足する時に、キャッシュデータの退避回数が最小になる画像又は低次のキャッシュへのアクセスパターンを指定する多階層キャッシュ対応アクセスパターン指定手段を備えたものである。
【００４１】
この発明に係る並列画像処理装置は、複数のプロセッサと共有メモリを含むプラットホーム上で動作し、画像の行方向と列方向の並びを転置するコーナーターン処理を含む画像の再生処理を実行する画像処理プログラムに上記複数のプロセッサへの上記共有メモリの対象領域の割り付けを指示するものにおいて、処理対象画像のサイズや各画素のデータサイズ等の画像情報を設定している画像情報設定手段と、画像処理で使用する上記プロセッサの個数情報を設定している使用プロセッサ数設定手段と、上記各プロセッサに搭載されているキャッシュの構成やサイズ等のキャッシュ情報を設定しているキャッシュ情報設定手段と、上記画像情報設定手段に設定されている画像情報、上記キャッシュ情報設定手段に設定されているキャッシュ情報を入手し、キャッシュのライトミスとリードミスにかかるオーバーヘッド時間が最小となる１辺の長さが１キャッシュラインサイズの正方形の処理領域の画像ブロックを算出するメモリ割付手段と、上記コーナーターン処理の前に実行する事前実行処理の内容を入手すると共に、上記画像情報設定手段に設定されている画像情報、上記使用プロセッサ数設定手段に設定されている上記プロセッサの個数情報、上記キャッシュ情報設定手段に設定されているキャッシュ情報を入手し、上記メモリ割付手段が算出した画像ブロック単位を基準として上記事前実行処理と上記コーナーターン処理を同時に実行し、上記各プロセッサの負荷が均等になるように、上記各プロセッサに上記共有メモリの対象領域を割り付ける前処理対応メモリ割付手段とを備えたものである。
【００４２】
この発明に係る並列画像処理装置は、複数のプロセッサと共有メモリを含むプラットホーム上で動作し、画像の行方向と列方向の並びを転置するコーナーターン処理を含む画像の再生処理を実行する画像処理プログラムに上記複数のプロセッサへの上記共有メモリの対象領域の割り付けを指示するものにおいて、処理対象画像のサイズや各画素のデータサイズ等の画像情報を設定している画像情報設定手段と、画像処理で使用する上記プロセッサの個数情報を設定している使用プロセッサ数設定手段と、上記各プロセッサに搭載されているキャッシュの構成やサイズ等のキャッシュ情報を設定しているキャッシュ情報設定手段と、上記画像情報設定手段に設定されている画像情報、上記キャッシュ情報設定手段に設定されているキャッシュ情報を入手し、キャッシュのライトミスとリードミスにかかるオーバーヘッド時間が最小となる１辺の長さが１キャッシュラインサイズの正方形の処理領域の画像ブロックを算出するメモリ割付手段と、上記コーナーターン処理の後に実行する事後実行処理の内容を入手すると共に、上記画像情報設定手段に設定されている画像情報、上記使用プロセッサ数設定手段に設定されている上記プロセッサの個数情報、上記キャッシュ情報設定手段に設定されているキャッシュ情報を入手し、上記メモリ割付手段が算出した画像ブロック単位を基準として上記コーナーターン処理と上記事後実行処理を同時に実行し、上記各プロセッサの負荷が均等になるように、上記各プロセッサに上記共有メモリの対象領域を割り付ける後処理対応メモリ割付手段とを備えたものである。
【００４３】
この発明に係る並列画像処理装置は、複数のプロセッサと共有メモリを含むプラットホーム上で動作し、画像の行方向と列方向の並びを転置するコーナーターン処理を含む画像の再生処理を実行する画像処理プログラムに上記複数のプロセッサへの上記共有メモリの対象領域の割り付けを指示するものにおいて、処理対象画像のサイズや各画素のデータサイズ等の画像情報を設定している画像情報設定手段と、画像処理で使用する上記プロセッサの個数情報を設定している使用プロセッサ数設定手段と、上記各プロセッサに搭載されているキャッシュの構成やサイズ等のキャッシュ情報を設定しているキャッシュ情報設定手段と、上記画像情報設定手段に設定されている画像情報、上記キャッシュ情報設定手段に設定されているキャッシュ情報を入手し、キャッシュのライトミスとリードミスにかかるオーバーヘッド時間が最小となる１辺の長さが１キャッシュラインサイズの正方形の処理領域の画像ブロックを算出するメモリ割付手段と、上記コーナーターン処理の前に実行する事前実行処理の内容を入手すると共に、上記画像情報設定手段に設定されている画像情報、上記使用プロセッサ数設定手段に設定されている上記プロセッサの個数情報、上記キャッシュ情報設定手段に設定されているキャッシュ情報を入手し、上記メモリ割付手段が算出した画像ブロック単位を基準として上記事前実行処理と上記コーナーターン処理を同時に実行し、上記各プロセッサの負荷が均等になるように、上記各プロセッサに上記共有メモリの対象領域を割り付ける前処理対応メモリ割付手段と、上記コーナーターン処理の後に実行する事後実行処理の内容を入手すると共に、上記画像情報設定手段に設定されている画像情報、上記使用プロセッサ数設定手段に設定されている上記プロセッサの個数情報、上記キャッシュ情報設定手段に設定されているキャッシュ情報を入手し、上記メモリ割付手段が算出した画像ブロック単位を基準として上記コーナーターン処理と上記事後実行処理を同時に実行し、上記各プロセッサの負荷が均等になるように、上記各プロセッサに上記共有メモリの対象領域を割り付ける後処理対応メモリ割付手段と、上記コーナーターン処理と同時に実行する上記事前実行処理又は上記事後実行処理を選択する前後処理選択手段とを備えたものである。
【００４４】
この発明に係る並列画像処理方法は、複数のプロセッサと共有メモリを使用し、画像の行方向と列方向の並びを転置するコーナーターン処理を含む画像の再生処理を実行する際に、上記複数のプロセッサへの上記共有メモリの対象領域の割り付けを指示するものにおいて、処理対象画像のサイズや各画素のデータサイズ等の画像情報を入手し、画像処理で使用する上記プロセッサの個数情報を入手し、上記各プロセッサに搭載されているキャッシュの構成やサイズ等のキャッシュ情報を入手し、入手した画像情報とキャッシュ情報により、キャッシュのライトミスとリードミスにかかるオーバーヘッド時間が最小となる１辺の長さが１キャッシュラインサイズの正方形の処理領域の画像ブロックを算出し、入手した上記プロセッサの個数情報により、算出した上記画像ブロック単位で、上記各プロセッサが上記コーナーターン処理を実行するように上記共有メモリの対象領域を割り付けるものである。
【００４５】
この発明に係る並列画像処理方法は、画像サイズが算出した画像ブロックの整数倍にならない時に、上記画像サイズが上記画像ブロックの整数倍になるように、画像領域を拡張して共有メモリ上に確保するものである。
【００４６】
この発明に係る並列画像処理方法は、画像サイズが算出した画像ブロックの整数倍にならない時に、上記画像ブロックの幅の行又は列を画像ブロックの帯として算出し、算出した上記画像ブロックの帯で画像のアクセス方向を指定するものである。
【００４７】
この発明に係る並列画像処理方法は、画像サイズが算出した画像ブロックの整数倍にならない時に、上記画像ブロックの幅の行又は列を画像ブロックの帯として算出し、算出した上記画像ブロックの帯で画像のアクセス方向を指定すると共に、算出した画像ブロックの帯の個数が、プロセッサの個数の整数倍にならない時に、上記画像ブロックの帯を分割して、複数のプロセッサで処理するよう指定するものである。
【００４８】
この発明に係る並列画像処理方法は、画像ブロックの処理に必要なキャッシュのライン数が不足する時に、キャッシュデータの退避回数が最小になる画像へのアクセスパターンを算出し指定するものである。
【００４９】
この発明に係る並列画像処理方法は、複数の画像サイズを持つ各画像を対象にコーナーターン処理を行う時に、上記各画像に対して、キャッシュのラインサイズが画素サイズの整数倍にならない場合に画素の補間を行うと共に、算出した画像ブロックの整数倍で、かつ、処理対象画像を包含する最小の画像サイズを算出し、算出した各画像の最小の画像サイズの中から最大の画像サイズを選択し、選択した上記最大の画像サイズの領域を共有メモリ上に確保するものである。
【００５０】
この発明に係る並列画像処理方法は、確保した共有メモリの対象領域を使用し、確保の対象となった画像よりも小さいサイズの画像処理を行う時に、選択された小さいサイズの画像に対する画像ブロック単位で、コーナーターン処理に使用する共有メモリの使用範囲を算出し、算出した使用範囲で上記コーナーターン処理を実行するように、各プロセッサに処理対象の領域やアクセス手順を指定するものである。
【００５１】
この発明に係る並列画像処理方法は、複数のプロセッサと共有メモリを使用し、画像の行方向と列方向の並びを転置するコーナーターン処理を含む画像の再生処理を実行する際に、上記複数のプロセッサへの上記共有メモリの対象領域の割り付けを指示するものにおいて、処理対象画像のサイズや各画素のデータサイズ等の画像情報を入手し、画像処理で使用する上記プロセッサの個数情報を入手し、上記各プロセッサに搭載されている多階層のキャッシュの構成やサイズ等のキャッシュ情報を入手し、入手した上記画像情報と上記キャッシュ情報により、１次キャッシュのライトミスとリードミスにかかるオーバーヘッド時間が最小となる１辺の長さが１キャッシュラインサイズの正方形の処理領域の画像ブロックを算出し、入手した上記画像情報と上記キャッシュ情報により、算出した１次キャッシュの画像ブロックをもとに、キャッシュの階層ごとに低位キャッシュの複数の画像ブロックを包含し１辺の長さが１キャッシュラインサイズの正方形の処理領域の画像ブロックを算出し、入手した上記プロセッサの個数情報により、算出した最上位キャッシュの画像ブロック単位でコーナーターン処理を実行し、かつ、上記最上位キャッシュの画像ブロックを実行する個数が上記プロセッサ間で均等になるように、上記各プロセッサへ上記共有メモリの対象領域を割り付けるものである。
【００５２】
この発明に係る並列画像処理方法は、画像サイズが最上位キャッシュの画像ブロックの整数倍にならない時に、上記画像サイズが上記最上位キャッシュの画像ブロックの整数倍になるように、画像領域を拡張して共有メモリ上に確保するものである。
【００５３】
この発明に係る並列画像処理方法は、上位キャッシュのラインサイズが、低位キャッシュの画像ブロックの整数倍にならない時に、上記上位キャッシュのラインサイズが、上記低位キャッシュの画像ブロックの整数倍になるように、上記低位キャッシュの画像ブロックの領域を拡張して共有メモリ上に確保するものである。
【００５４】
この発明に係る並列画像処理方法は、画像サイズが算出した最上位キャッシュの画像ブロックの整数倍にならない時に、上記最上位キャッシュの画像ブロックの行又は列を上記最上位キャッシュの画像ブロックの帯として算出し、算出した上記画像ブロックの帯で画像のアクセス方向を指定すると共に、算出した画像ブロックの帯の個数が、プロセッサの個数の整数倍にならない時に、上記画像ブロックの帯を分割して、複数のプロセッサで処理するよう指定するものである。
【００５５】
この発明に係る並列画像処理方法は、算出した画像ブロックの処理に必要なキャッシュのライン数が不足する時に、キャッシュデータの退避回数が最小になる画像又は低次のキャッシュへのアクセスパターンを指定するものである。
【００５６】
この発明に係る並列画像処理方法は、複数のプロセッサと共有メモリを使用し、画像の行方向と列方向の並びを転置するコーナーターン処理を含む画像の再生処理を実行する際に、上記複数のプロセッサへの上記共有メモリの対象領域の割り付けを指示するものにおいて、処理対象画像のサイズや各画素のデータサイズ等の画像情報を入手し、画像処理で使用する上記プロセッサの個数情報を入手し、上記各プロセッサに搭載されているキャッシュの構成やサイズ等のキャッシュ情報を入手し、入手した上記画像情報と上記キャッシュ情報により、キャッシュのライトミスとリードミスにかかるオーバーヘッド時間が最小となる１辺の長さが１キャッシュラインサイズの正方形の処理領域の画像ブロックを算出し、上記コーナーターン処理の前に実行する事前実行処理の内容を入手し、入手した上記画像情報、上記プロセッサの個数情報、上記キャッシュ情報、上記事前実行処理の内容により、算出した上記画像ブロック単位を基準として上記事前実行処理と上記コーナーターン処理を同時に実行し、上記各プロセッサの負荷が均等になるように、上記各プロセッサに上記共有メモリの対象領域を割り付けるものである。
【００５７】
この発明に係る並列画像処理方法は、複数のプロセッサと共有メモリを使用し、画像の行方向と列方向の並びを転置するコーナーターン処理を含む画像の再生処理を実行する際に、上記複数のプロセッサへの上記共有メモリの対象領域の割り付けを指示するものにおいて、処理対象画像のサイズや各画素のデータサイズ等の画像情報を入手し、画像処理で使用する上記プロセッサの個数情報を入手し、上記各プロセッサに搭載されているキャッシュの構成やサイズ等のキャッシュ情報を入手し、入手した上記画像情報と上記キャッシュ情報により、キャッシュのライトミスとリードミスにかかるオーバーヘッド時間が最小となる１辺の長さが１キャッシュラインサイズの正方形の処理領域の画像ブロックを算出し、上記コーナーターン処理の後に実行する事後実行処理の内容を入手し、入手した上記画像情報、上記プロセッサの個数情報、上記キャッシュ情報、上記事後実行処理の内容により、算出した画像ブロック単位を基準として上記コーナーターン処理と上記事後実行処理を同時に実行し、上記各プロセッサの負荷が均等になるように、上記各プロセッサに上記共有メモリの対象領域を割り付けるものである。
【００５８】
【発明の実施の形態】
以下、この発明の実施の一形態を説明する。
実施の形態１．
図１はこの発明の実施の形態１による並列画像処理装置の構成を示すブロック図である。図において、１はＳＡＲ画像の再生処理を実行する画像処理プログラム、２は、ＳＭＰの並列計算機における、図５７に示す各プロセッサ１２１〜１２４や共有メモリ１２５，キャッシュ等のＨ／Ｗ（Ｈａｒｄｗａｒｅ），ＯＳ（ＯｐｅｒａｔｉｎｇＳｙｓｔｅｍ），並列化ライブラリ等により構成されたプラットホームで、３は、プラットホーム２上で動作し、コーナーターン処理における各プロセッサ１２１〜１２４への共有メモリ１２５の対象領域の割り付け方法等を画像処理プログラム１に指示する並列画像処理装置である。
【００５９】
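Also, in the parallel image processing apparatus 3 of Fig. 1, 11 is image information setting means that holds the image information of the processing target, such as the size (pixel count) of the target image, the data size of each pixel, and information on the processes that can run before and after corner turn processing. 12 is used-processor-count setting means that holds the number of processors used for image processing. 13 is cache information setting means that holds cache information such as the configuration and size of the cache mounted on each processor. 14 is memory allocation means that obtains the information from image information setting means 11, used-processor-count setting means 12, and cache information setting means 13, calculates the image block, that is, the processing region that minimizes the overhead time caused by cache write misses and read misses, and instructs image processing program 1 to allocate the target regions of the shared memory so that each processor executes corner turn processing in units of the calculated image block.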
【００６０】
図２はこの発明の実施の形態１による並列画像処理装置の処理を示すフローチャートで、図３は並列画像処理装置３での基本的なメモリ割り付けの概念を示す図で、図４はこの並列画像処理装置３を利用した時の各プロセッサでのキャッシュアクセスの概念を示す図である。
【００６１】
図５は具体的なメモリ割り付けの実行例を示す図である。ここでは、処理対象画像を、１画素当たり８Ｂｙｔｅで、画像サイズはレンジ方向２０４８画素×アジマス方向１６３８４画素と仮定している。また、プロセッサに搭載されているキャッシュは、データキャッシュ、１ライン当たり３２Ｂｙｔｅ×５１２ライン、１次キャッシュのみを仮定している。
【００６２】
図６は１画像ブロックを複数プロセッサで処理する一例を示す図である。この例では、各プロセッサに割り付けた処理は、途中で退避させないものとする。また、共有メモリ１２５上に処理前と処理後の領域を確保して、レンジ方向の処理（行：レンジ、列：アジマス）からアジマス方向の処理（行：アジマス、列：レンジ）へコーナーターン処理するものとする。
【００６３】
並列画像処理装置３は、プラットホーム２上で動作し、コーナーターン処理でのメモリの割り付け方法等を画像処理プログラム１に指示するが、並列画像処理装置３の実装法としては、並列化支援ライブラリとして独立して提供する方法、プラットホーム２であるＯＳや並列化ライブラリへ組み込む方法、画像処理プログラム１の中へ組み込む方法がある。この実施の形態では、並列化支援ライブラリとして独立して提供する方法で説明する。
【００６４】
次に動作について説明する。
画像情報設定手段１１は、処理対象画像の情報として、画像のサイズ（画素数）、各画素のデータサイズ、コーナーターン処理前後に実行可能な処理情報等を設定している。この例では、１画素当たり８Ｂｙｔｅ，画像サイズはレンジ方向２０４８画素×アジマス方向１６３８４画素という情報を設定している。画像情報設定手段１１への画像情報の設定は、ユーザ入力等により手動で行う他、画像データや画像処理プログラム１から自動的に行うこともできる。
【００６５】
使用プロセッサ数設定手段１２は、画像処理で使用するプロセッサの個数情報を設定している。この個数情報は、最大利用可能なプロセッサの個数であり、基本的に、指定された個数で画像処理を行う。使用プロセッサ数設定手段１２へのプロセッサの個数情報の設定は、画像処理プログラム１の中で指定する方法やユーザ入力等で行う方法が考えられる。
【００６６】
キャッシュ情報設定手段１３は、プロセッサに搭載されているキャッシュの構成（階層型や１次キャッシュのみか等）、各ライン毎のサイズ（容量）、ライン数、一貫性保持方法、データの退避順序、ライトミスとリードミスにかかるオーバーヘッド時間、データをキャッシュへ読み書きするのにかかる時間等のキャッシュ情報を設定している。
【００６７】
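In this example, the cache mounted on each processor is a data cache of 32 bytes per line × 512 lines with a primary cache only, the coherence method is write-back, and data is evicted in order of last access time. The cache information can be set in cache information setting means 13 manually, for example by user input, or automatically, for example through an I/F (interface) provided by the system.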
【００６８】
次にメモリ割付手段１４の動作について説明する。
メモリ割付手段１４は、図２のステップＳＴ１において、キャッシュ情報設定手段１３からキャッシュ情報を入手し、ステップＳＴ２において、画像情報設定手段１１から画像情報を入手し、ステップＳＴ３において、使用プロセッサ数設定手段１２から使用するプロセッサの個数情報を入手する。
【００６９】
ステップＳＴ４において、メモリ割付手段１４は、ステップＳＴ１で入手したキャッシュ情報と、ステップＳＴ２で入手した画像情報に基づき、処理単位である画像ブロックを算出する。画像ブロックは、キャッシュのライトミスとリードミスにかかるオーバーヘッド時間が最小となる処理領域であり、ここでは、１辺の長さが１キャッシュラインサイズの正方形の画像上の領域である。
【００７０】
ステップＳＴ４における画像ブロックの算出後に、ステップＳＴ５において、メモリ割付手段１４は、ステップＳＴ３で入手した使用プロセッサ数の情報に基づき、各プロセッサ１２１〜１２４が算出された画像ブロック単位でコーナーターンを行うように共有メモリ１２５の対象領域を割り付けるよう、画像処理プログラム１に指示する。画像処理プログラム１では、外部から指定された共有メモリ１２５の領域毎に、各プロセッサ１２１〜１２４でコーナーターン処理するようにプログラムされている。
【００７１】
処理前画像の画像ブロックには、コーナーターン処理後に対応する処理結果画像の画像ブロックでの書き込みに必要なデータが全て含まれ、また、読み込まれた画像ブロックのデータは、全て処理結果画像の画像ブロックに書き込まれる。
【００７２】
図５７に示す複数プロセッサ１２１〜１２４で、コーナーターン処理を行う時に、各プロセッサ１２１〜１２４が担当する画像領域について、画像ブロックを重ねなければ書き込みによるキャッシュのライトミスは起こらない。このため、算出された画像ブロック単位で、重なりなくコーナーターン処理を行うことで、書き込み側でのキャッシュのライトミスをなくすことができる。また、データの読み込みも、最小限の１回だけで済むので、キャッシュのリードミスによるオーバヘッド時間を最小に抑えることができる。
【００７３】
図３に、画像ブロック単位での、複数プロセッサによるコーナーターン処理の概要を示す。ここでは、プロセッサ１２１とプロセッサ１２２は、それぞれ異なった画像ブロックを担当している。このため、処理前画像と処理結果画像のそれぞれの画像領域で、キャッシュするデータが重ならずに、コーナーターン処理を実行している。
【００７４】
図４に画像ブロック単位でのコーナーターン処理を行っている時の各プロセッサ内部でのキャッシュの使用状況を示す。ここでは、画像ブロック単位でのコーナーターン処理を行うのに十分なライン数のキャッシュがあると仮定している。このため、読み込みと書き込みに、それぞれ別のキャッシュラインを割り当て、画像ブロック単位でのコーナーターン処理を、必要なデータの退避なしに実行している。
【００７５】
図５を使って具体的な実行例を示す。
図５における１，２，３，…の番号は、各画素の番号であり、処理前と処理結果の画像間では、コーナーターン処理前の読み込み位置と、コーナーターン処理後の書き込み位置で対応している。また、画像とメモリ空間の間では、それぞれ対応する位置を示している。
【００７６】
ここでは、キャッシュのラインサイズが３２ｂｙｔｅなので、画像ブロックは３２ｂｙｔｅ×３２ｂｙｔｅ分の領域となる。１画素のデータが８ｂｙｔｅなので、１つのキャッシュラインに４画素分のデータが格納され、画像ブロックは４画素×４画素の画像上の領域となる。全画像データは、レンジ方向２０４８画素×アジマス方向１６３８４画素なので、５１２×４０９６の画像ブロックとなる。
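The arithmetic in this paragraph can be checked with a short calculation. The following is a minimal sketch using the figures given in the text (32-byte cache line, 8-byte pixel, 2048 × 16384 pixels); the function name `block_geometry` is hypothetical.

```python
def block_geometry(line_bytes, pixel_bytes, range_pixels, azimuth_pixels):
    """Return (block side in pixels, blocks along range, blocks along azimuth)."""
    if line_bytes % pixel_bytes:
        raise ValueError("cache line size is not a multiple of the pixel size")
    side = line_bytes // pixel_bytes  # pixels held by one cache line
    if range_pixels % side or azimuth_pixels % side:
        raise ValueError("image size is not a multiple of the image block")
    return side, range_pixels // side, azimuth_pixels // side


side, blocks_range, blocks_azimuth = block_geometry(32, 8, 2048, 16384)
print(side, blocks_range, blocks_azimuth)  # 4 512 4096
```

The sketch raises an error when the image size is not an integer multiple of the block, since this paragraph only covers the case where it divides evenly.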
【００７７】
画像ブロックは画像上の領域のため、共有メモリ１２５の空間上は不連続の領域になる。図５に示すように、１キャッシュラインのサイズ（４画素）単位の領域が、共有メモリ１２５の空間上に散在している。
【００７８】
画像ブロック単位でのコーナーターン処理を、図５における画素番号１〜１６で示す。読み込み側では、（１，２，３，４），（５，６，７，８），（９，１０，１１，１２），（１３，１４，１５，１６）の４個のキャッシュラインとして読み込まれる。コーナーターン処理後の書き込みでは、（１，５，９，１３），（２，６，１０，１４），（３，７，１１，１５），（４，８，１２，１６）の４個のキャッシュラインとして書き込まれる。
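The read and write groupings for pixels 1 to 16 can be reproduced directly. The following is a minimal sketch: `transpose_block` turns the four lines that are read into the four lines that are written, and `corner_turn_blocked` applies it block by block to a small image. Both function names, and the use of Python lists as a stand-in for cache lines, are assumptions for illustration; the image dimensions are assumed to be multiples of the block side.

```python
def transpose_block(read_lines):
    """Transpose one square block given as a list of cache-line-sized rows."""
    return [list(column) for column in zip(*read_lines)]


def corner_turn_blocked(image, side):
    """Transpose a 2-D list block by block (dimensions must be multiples of side)."""
    rows, cols = len(image), len(image[0])
    result = [[None] * rows for _ in range(cols)]
    for br in range(0, rows, side):
        for bc in range(0, cols, side):
            # Read: 'side' lines of 'side' pixels from the source block.
            block = [image[r][bc:bc + side] for r in range(br, br + side)]
            # Write: each transposed line fills one full line of the result block.
            for i, line in enumerate(transpose_block(block)):
                result[bc + i][br:br + side] = line
    return result


read_lines = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
write_lines = transpose_block(read_lines)
print(write_lines)  # [[1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15], [4, 8, 12, 16]]
```

Because each block is read once and each output line is written in full, the order in which blocks are visited does not affect the result.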
【００７９】
なお、各画像ブロック内のコーナーターン処理では、処理間で依存関係がなく、全てのデータがプロセッサ１２１〜１２４内のキャッシュにあることを前提としているので、どの画素から処理を実行しても良い。
【００８０】
この操作を、５１２×４０９６の画像ブロックで実行することで、画像全体でのコーナーターン処理を実行する。
【００８１】
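In step ST5 shown in Fig. 2, when image blocks are allocated to processors 121 to 124, it is more efficient to equalize the processing load among the processors. Since there are no dependencies between the processing of different image blocks, any image block can be allocated to any of processors 121 to 124 and executed in parallel. Allocation therefore basically looks at the number of image blocks given to each of processors 121 to 124 and assigns them so that the processing loads are equal.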
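One simple way to equalize the block counts is sketched below. It is a minimal illustration, assuming blocks are numbered consecutively and handed out as contiguous ranges; the text only requires that the counts be balanced, so the contiguous numbering and the function name `assign_blocks` are assumptions.

```python
def assign_blocks(num_blocks, num_procs):
    """Split block indices 0..num_blocks-1 into num_procs near-equal ranges."""
    base, extra = divmod(num_blocks, num_procs)
    ranges = []
    start = 0
    for proc in range(num_procs):
        count = base + (1 if proc < extra else 0)  # the first 'extra' processors get one more
        ranges.append(range(start, start + count))
        start += count
    return ranges


# 512 x 4096 image blocks shared among 4 processors, as in the example of Fig. 5.
counts = [len(r) for r in assign_blocks(512 * 4096, 4)]
print(counts)  # [524288, 524288, 524288, 524288]
```

When the block count does not divide evenly, this sketch leaves a difference of one block between processors.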
【００８２】
一方で、この並列画像処理装置では、画像サイズとプロセッサ数を任意に設定できるため、画像ブロックを均等に分割できないケースがある。この場合、図６に示すように、１つの画像ブロックを複数のプロセッサ１２１〜１２４で実行することもできる。
【００８３】
図６では、４個のプロセッサ１２１〜１２４が、それぞれ処理前画像側の画像ブロック内のデータを全て読み込む。処理結果画像への書き込みでは、お互いの書き込み先キャッシュが重ならないように、画像ブロック内の指定されたラインにのみデータを書き込む。このため、読み込んだデータの一部を無駄にするリードミスがあるが、書き込みでのライトミスはおこさない。また、各プロセッサ１２１〜１２４での処理間には依存関係がないので、並列に処理を実行できる。
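The division of one image block among several processors can be sketched as follows. This is a minimal illustration, assuming each processor reads the whole block and writes only the output lines it owns; the function name `split_block_work` and the round-robin ownership of output lines are hypothetical choices.

```python
def split_block_work(read_lines, num_procs):
    """Return, per processor, the output lines it writes for one image block.

    Every processor reads all lines of the block; processor p writes only the
    output lines p, p + num_procs, ... so that no output line has two writers.
    """
    side = len(read_lines)
    plan = {}
    for proc in range(num_procs):
        owned = range(proc, side, num_procs)
        plan[proc] = {i: [line[i] for line in read_lines] for i in owned}
    return plan


block = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
plan = split_block_work(block, 4)
print(plan[0])  # {0: [1, 5, 9, 13]}
```

Each processor uses only part of the data it reads, but the sets of written lines are disjoint.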
【００８４】
通常、１プロセッサで実行する場合は、４ライン分のデータを読み込み、４ライン分のデータを書き込むが、この処理方法では、各プロセッサ１２１〜１２４で４ライン分のデータを読み込み、１ライン分のデータを書き込んでいる。
【００８５】
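Consequently, the time each processor spends on one image block is shorter than when a single processor handles the block, although the number of image-block processing tasks increases. In the example of Fig. 6, each processor writes three fewer lines per image block. If reading a line into the cache and writing a line take the same time, the per-block time falls to 5/8 of the single-processor time (4 reads + 1 write versus 4 reads + 4 writes). On the other hand, the number of image blocks to be processed increases fourfold.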
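The 5/8 figure follows from counting line operations. The following is a minimal sketch, assuming, as the text does, that reading a line and writing a line cost the same; the function names are hypothetical.

```python
from fractions import Fraction


def per_processor_ratio(side, num_procs):
    """Per-block time with the block split among processors, relative to one processor."""
    single = side + side                     # one processor: 'side' reads + 'side' writes
    split = side + Fraction(side, num_procs) # each processor: all reads + its share of writes
    return split / single


def total_line_operations(side, num_procs):
    """Line operations summed over all processors for one split block."""
    return num_procs * (side + Fraction(side, num_procs))


ratio = per_processor_ratio(4, 4)
print(ratio)                        # 5/8
print(total_line_operations(4, 4))  # 20
```

The total work across the four processors rises from 8 to 20 line operations, so splitting a block shortens its elapsed time at the cost of redundant reads.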
【００８６】
ＳＡＲ画像処理では、一般に画像サイズが大きいために、プロセッサ１２１〜１２４の個数と比較して、処理すべき画像ブロックの個数が大きい。この場合には、まず画像ブロック単位で処理を分割し、割り切れない余りとなった画像ブロックだけを、２個以上のプロセッサ１２１〜１２４で分割して実行することが考えられる。
【００８７】
逆に、画像サイズが小さく、プロセッサ１２１〜１２４の個数と比較して、処理すべき画像ブロックの個数が小さい場合には、画像ブロックを複数のプロセッサ１２１〜１２４で分割して実行することにより、処理時間を短縮することもできる。
【００８８】
この実施の形態では、ＳＡＲ画像再生処理に適用しているが、ＳＡＲ画像再生処理以外でも、コーナーターン処理に相当する処理を実行する並列画像処理に対しても適用できる。
【００８９】
この実施の形態では、画像のサイズやキャッシュのサイズを設定しているが、任意の画像（画素サイズや画像の画素数）、任意のキャッシュ（構成、ライン数、サイズ、一貫性保持方法、データの退避順序等）、任意の個数のプロセッサでも良い。
【００９０】
この実施の形態では、（行：レンジ、列：アジマス）から（行：アジマス、列：レンジ）へのコーナーターン処理を実行しているが、（行：アジマス、列：レンジ）から（行：レンジ、列：アジマス）のコーナーターン処理でも良い。また、レンジ、アジマスが規定されていない画像処理であっても、コーナーターン処理と同等の画像処理内容であれば良い。
【００９１】
この実施の形態では、並列画像処理装置３の実装法として、並列化支援ライブラリとして独立して提供されているが、プラットホーム２であるＯＳや並列化ライブラリへ組み込む方法、又は画像処理プログラム１へ組み込む方法でも良い。
【００９２】
この実施の形態では、画像処理プログラム１に処理領域を指定することで、各プロセッサの処理領域を設定していたが、プラットホーム２の機能等を使って、並列画像処理装置３が直接指定する方法で実装しても良い。
【００９３】
この実施の形態では、各プロセッサに割り付けた処理は、途中で退避させられないことを仮定しているが、１画像ブロック単位の処理が途中で退避させられない限り、タイムシェアリングシステムのようなある周期で退避されるようなシステムであっても良い。また、１画像ブロック単位の処理が退避させられるケースでも、退避されたプロセッサと新たに実行を行うプロセッサ間でしか、キャッシュのライトミスをおこさないため、キャッシュのライトミスの確率は低い。
【００９４】
この実施の形態では、コーナーターン処理単独の例で説明したが、一連の画像処理をＳＭＰ上でパイプライン処理する場合にでも、そのコーナーターン処理部分にこの並列画像処理装置３を適用しても良い。
【００９５】
以上のように、この実施の形態１によれば、画像の縦方向と横方向の並びを転置するコーナーターン処理を実行する時に、画像サイズ（画素数）や各画素のデータサイズ等の処理対象の画像情報と、キャッシュの構成やサイズやキャッシュライン数やキャッシュの一貫性保持方法等のキャッシュ情報から、キャッシュのライトミスとリードミスにかかるオーバーヘッド時間が最小となる処理領域の画像ブロックを算出し、画像ブロック単位で処理を実行するように、並列に動作する複数のプロセッサ１２１〜１２４へ処理対象位置の共有メモリ１２５の対象領域を、各プロセッサ１２１〜１２４間で重ねないように割り付けることで、書き込みによるキャッシュのライトミスを起こさず、かつ、データの読み込みも、最小限の１回だけで済ませることができ、キャッシュによるオーバーヘッド時間を低減できるという効果が得られる。
【００９６】
また、画像処理に使用するプロセッサ１２１〜１２４の個数の情報から、処理対象の画像を画像ブロック単位で、各プロセッサ１２１〜１２４に均等に割り振ることにより、並列処理において効果的な負荷分散を行い、効率の良いコーナーターン処理を実現できるという効果が得られる。
【００９７】
さらに、負荷分散を行う時に、１画像ブロックの処理を複数のプロセッサ１２１〜１２４により行うことで、より公平な負荷分散が行えると共に、１画像ブロックの処理を高速化できるという効果が得られる。
【００９８】
実施の形態２．
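Fig. 7 is a block diagram showing the configuration of a parallel image processing apparatus according to Embodiment 2 of this invention. In the figure, 15 is memory securing means that, when the image size is not an integer multiple of the calculated image block, extends the image region and secures it on shared memory 125 so that the image size becomes an integer multiple of the image block. The rest is equivalent to the configuration shown in Fig. 1 of Embodiment 1. Fig. 8 is a flowchart showing the processing of the parallel image processing apparatus according to Embodiment 2, and Fig. 9 shows an example of memory securing. This embodiment differs from Embodiment 1 in that the image size is not an integer multiple of the calculated image block size.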
【００９９】
次に動作について説明する。
図８のステップＳＴ１からＳＴ４の画像ブロックを計算する処理までは、実施の形態１と同じである。ステップＳＴ１１において、メモリ割付手段１４は、画像サイズが、算出した画像ブロックの整数倍になるかをチェックする。チェックの結果、画像サイズが、算出した画像ブロックの整数倍になる場合は、実施の形態１と同様に、ステップＳＴ５において、メモリ割付手段１４は、共有メモリ１２５の領域の各プロセッサ１２１〜１２４への割付を、画像処理プログラム１に指示する。
【０１００】
上記ステップＳＴ１１で、画像サイズが、算出した画像ブロックの整数倍にならない場合には、メモリ確保手段１５は、メモリ割付手段１４の指示により、図９に示すように、画像サイズが画像ブロックの整数倍になるように、画像領域を拡張して共有メモリ１２５上に確保する。
【０１０１】
この例では、メモリ確保手段１５が画像処理プログラム１に画像のサイズを指定することで、画像処理プログラム１により画像領域が拡張される。拡張された画像領域は、コーナーターン処理の際には、他の部分と同様に実行されるが、コーナーターン処理以外の処理では無視される。
【０１０２】
図９（ａ）に、アジマス方向で画像サイズが画像ブロックの整数倍とならない例を示す。ここでは、キャッシュラインに、任意の位置からデータを読み込めると仮定する。整数倍とならないために、このまま処理すると、画像処理後のデータの右端での書き込みが、次の行の左端のキャッシュに影響を与える。例えば、Ａ，Ｅを書き込もうとすると、（Ａ，Ｅ，２，６）という単位でキャッシュに書き込む。一方で、左上端の画像ブロックでは、（２，６，１０，１４）の単位でキャッシュに書き込む。このように、（２，６）を管理するキャッシュが２つあることになり、この２つのキャッシュ間でライトミスが発生する。
【０１０３】
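Even when reads start from an address that is an integer multiple of the line size, the cache lines can only be written in units such as (A, E, 2, 6) and (10, 14, ·, ·), so writing the image block as it was read still causes cache write misses.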
【０１０４】
ここで、図９（ｂ）のように、画像サイズが画像ブロックの整数倍になるように、処理対象の画像領域を拡張して共有メモリ１２５上に確保する。拡張後の画像全体を対象にコーナーターン処理を行うことで、実施の形態１と同じ方法で実行できる。
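The extended size is the image size rounded up to the next multiple of the block side. The following is a minimal sketch; the function name `padded_size` and the sample sizes are hypothetical.

```python
def padded_size(size_pixels, block_side):
    """Round size_pixels up to the next multiple of block_side."""
    return -(-size_pixels // block_side) * block_side  # integer ceiling division


# Hypothetical image: 2048 range pixels, 16385 azimuth pixels, 4 x 4 pixel blocks.
padded_range = padded_size(2048, 4)
padded_azimuth = padded_size(16385, 4)
print(padded_range, padded_azimuth)  # 2048 16388
```

A size that is already a multiple of the block side is returned unchanged, so the same call can be applied to both directions.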
【０１０５】
この実施の形態では、アジマス方向の画像サイズが画像ブロックの整数倍にならない場合であるが、レンジ方向の画像サイズが画像ブロックの整数倍にならない場合は、レンジ方向に画像サイズの領域を増やせば良く、また、アジマスとレンジの両方向の画像サイズが画像ブロックの整数倍にならない場合は、両方向に画像サイズの領域を増やせば良い。
【０１０６】
この実施の形態では、画像サイズの領域を増やす量も任意に設定できる。このため、上記実施の形態以外のキャッシュや画像サイズを指定された場合でも、同様の手法で画像サイズの領域を増やせば良い。
【０１０７】
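In this embodiment, the image region is extended and secured on shared memory 125 by having image processing program 1 secure the specified amount. Alternatively, memory securing means 15 may secure the region of shared memory 125 itself and tell image processing program 1 where to use it. In that case the secured size of shared memory 125 need not be reported to image processing program 1; memory allocation means 14 may simply instruct it to corner-turn the range specified in image blocks, including the extended region.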
【０１０８】
以上のように、この実施の形態２によれば、コーナーターン処理を行う時に、処理対象の画像サイズが画像ブロックの整数倍にならない時に、画像サイズが画像ブロックの整数倍になるように、処理対象の画像領域を拡張して共有メモリ１２５上に確保することにより、キャッシュによるオーバーヘッド時間を低減できるという効果が得られる。
【０１０９】
実施の形態３．
図１０はこの発明の実施の形態３による並列画像処理装置の構成を示すブロック図である。図において、１６は、画像サイズが算出した画像ブロックの整数倍にならない時に、画像ブロック幅の行又は列を画像ブロックの帯として算出し、算出した画像ブロックの帯で画像のアクセス方向を指定するアクセス方向指定手段で、その他は実施の形態１の図１に示す構成と同等である。この実施の形態では、実施の形態２と同様に、画像サイズが画像ブロックサイズの整数倍にならない点が実施の形態１と異なっている。
【０１１０】
図１１はこの実施の形態３による並列画像処理装置の処理を示すフローチャートで、図１２は、アジマス方向の画像サイズが画像ブロックの整数倍にならない時に、アクセス方向を指定した処理内容を示す図であり、図１３は、レンジ方向の画像サイズが画像ブロックの整数倍にならない時に、アクセス方向を指定した処理内容を示す図である。
【０１１１】
次に動作について説明する。
図１１のステップＳＴ１からＳＴ１１までで、画像サイズが、算出した画像ブロックの整数倍になるかをチェックし、チェックの結果、整数倍になった場合のステップＳＴ５の処理は、実施の形態２と同じである。
【０１１２】
ステップＳＴ１１のチェック結果、整数倍にならない時は、ステップＳＴ１３において、メモリ割付手段１４の指示により、アクセス方向指定手段１６は、画像へのアクセス方向と画像ブロックの帯を算出し、算出した画像ブロックの帯での画像へのアクセス方法を、画像処理プログラム１に指定する。
【０１１３】
画像へのアクセス方向と画像ブロックの帯の算出後、図１１のステップＳＴ５において、メモリ割付手段１４は、算出した画像ブロックの帯を基準に、画像ブロックの帯が重ならず、かつ負荷が均等になるように、共有メモリ１２５の領域の各プロセッサ１２１〜１２４への割り付けを、画像処理プログラム１に指示する。
【０１１４】
なお、この例の画像処理プログラム１では、外部から指定されたアクセス順序で、各プロセッサ１２１〜１２４に指示し、コーナーターン処理できるようにプログラムされている。
【０１１５】
次に上記ステップＳＴ１３で、アクセス方向指定手段１６が画像へのアクセス方向と画像ブロックの帯を算出する方法について説明する。
画像へのアクセス方向は、アジマス方向の画像サイズが画像ブロックの整数倍にならない時に、図１２に示すようにアジマス方向になり、逆に、レンジ方向の画像サイズが画像ブロックの整数倍にならない時に、図１３に示すようにレンジ方向になる。
【０１１６】
画像ブロックの帯は、画像ブロック幅で、各方向の画像の行と列になる。図１２の例では、処理前画像で画像ブロック幅の列になり、処理結果画像で画像ブロック幅の行になる。
【０１１７】
アクセス方向指定手段１６は、書き込みが重ならず、かつ、なるべく読み込みの重複を少なくするようにアクセス方法を指定する。この方法の一例を図１２及び図１３に示す。ただし、アクセス方法は、はみ出す画素数により若干異なるために、図１２及び図１３では代表的な一例を示している。また、ここでは、レンジ方向からアジマス方向へのコーナーターン処理の例で説明するが、アジマス方向からレンジ方向へのコーナーターン処理は、アジマスとレンジを入れ換えたものである。
【０１１８】
図１２は、処理結果画像の行方向（アジマス方向）に、画像ブロックの整数倍から１画素はみ出す例である。図１２（ａ）は、画像イメージ上での各画素の読み込みと書き込みの位置関係を、各画素に対応する番号とアルファベットで示している。図１２（ｂ）は画像イメージ上でのアクセス順序を示している。
【０１１９】
画像の両端では、書き込み側で１画素ずつずれている影響のために、読み込みと書き込みを１回ずつで終えることができない。ここでは、なるべく読み込みの重複を少なくするようにする。図１２では、最初に、＜１＞，＜２＞，＜３＞，＜４＞と、＜Ｎ−２＞，＜Ｎ−１＞，＜Ｎ＞のキャッシュデータを読み込む。これを、書き込み側の［１］（画素Ｋ，Ｇ，Ｃ，４），［２］（画素Ｆ，Ｂ，３，７），［３］（画素Ａ，２，６，１０），［４］（１，５，９，１３）の順で書き込む。この時、＜１＞（画素１，２，３，４）に含まれるデータは、全てコーナーターン処理を完了している。
【０１２０】
次のコーナーターン処理では、＜２＞，＜３＞，＜４＞と、＜５＞のキャッシュデータ（画素８，１２，１６，２０）を、書き込み側の［５］のキャッシュに書き込む。ここでは、＜２＞，＜３＞，＜４＞のデータは読み込み済のため、＜５＞のデータだけが新たに読み込まれる。そして、＜２＞（画素５，６，７，８）のデータについてコーナーターン処理を完了する。
【０１２１】
以下順次、次のコーナーターン処理では、＜３＞，＜４＞，＜５＞と、＜６＞のキャッシュデータ（画素１１，１５，１９，２３）を、書き込み側の［６］のキャッシュに書き込み、さらに、次のコーナーターン処理では、＜４＞，＜５＞，＜６＞と、＜７＞のキャッシュデータ（画素１４，１８，２２，２６）を、書き込み側の［７］のキャッシュに書き込むことで、コーナーターン処理を行う。ここでは、＜６＞，＜７＞のデータだけが順次読み込まれ、＜３＞（画素９，１０，１１，１２），＜４＞（画素１３，１４，１５，１６）に含まれるデータのコーナーターン処理が完了する。
【０１２２】
このように、＜１＞〜＜Ｎ−３＞までのデータについては、キャッシュのデータを最小回数（１回ずつ）の読み込みと書き込みを行うことで、コーナーターン処理が実行できる。
【０１２３】
一方、コーナーターン処理の最終段階では、＜Ｎ−５＞，＜Ｎ−４＞，＜Ｎ−３＞，＜Ｎ−２＞のデータを［Ｎ−２］のキャッシュに、＜Ｎ−４＞，＜Ｎ−３＞，＜Ｎ−２＞，＜Ｎ−１＞のデータを［Ｎ−１］のキャッシュに、＜Ｎ−３＞，＜Ｎ−２＞，＜Ｎ−１＞，＜Ｎ＞のデータを［Ｎ］のキャッシュにコーナーターン処理を行う。ここでは、最初に読み込まれた＜Ｎ−２＞，＜Ｎ−１＞，＜Ｎ＞のデータが再度読み込まれる。
【０１２４】
最終的に、図１２の方法では、＜Ｎ−２＞，＜Ｎ−１＞，＜Ｎ＞の＊印のデータを最大２回読み込む（キャッシュのライン数が十分にあれば、１回の読み込みで済む）。＜Ｎ−２＞〜＜Ｎ＞まで読み込んだデータについては、１回目には、図１２の点線部分のデータを書き込まずに、最終段階で再度読み込んだ時に書き込む。このため、各キャッシュに、重複なく１回の書き込みでコーナーターン処理を実行できる。
【０１２５】
図１３は、処理結果画像の列方向（レンジ方向）に、画像ブロックの整数倍から１画素はみ出す例である。図１３（ａ）は画像イメージ上での各画素の読み込みと書き込みの位置関係を、図１３（ｂ）は画像イメージ上でのアクセス順序を示している。
【０１２６】
図１３では、読み込み側で１画素ずつずれている影響のために、読み込みと書き込みを１回ずつで終えることができない。ここでも、なるべく読み込みの重複を少なくするようにする。図１３では、最初に、＜１＞（画素１，２，３，４）と＜Ｎ−２＞（画素Ａ，５，６，７），＜Ｎ−１＞（画素Ｆ，Ｂ，９，１０），＜Ｎ＞（画素Ｋ，Ｇ，Ｃ，１３）のキャッシュデータを読み込む。これを、書き込み側の［１］（画素１，５，９，１３）に書き込む。
【０１２７】
次のコーナーターン処理では、新たに＜２＞（画素１４，１５，１６，ａ１３）のキャッシュデータを読み込み、書き込み側の［２］（画素２，６，１０，１４）のキャッシュに書き込む。ここでは、読み込み済の＜Ｎ−１＞，＜Ｎ＞，＜１＞のデータを利用する。
【０１２８】
以下順次、次のコーナーターン処理では、新たに＜３＞（画素１１，１２，ａ９，ａ１０）のキャッシュデータを読み込んで、書き込み側の［３］（画素３，７，１１，１５）のキャッシュに書き込み、さらに、次のコーナーターン処理では、新たに＜４＞（画素８，ａ５，ａ６，ａ７）のキャッシュデータを読み込んで、書き込み側の［４］（画素４，８，１２，１６）のキャッシュに書き込むことで、コーナーターン処理を行う。
【０１２９】
このように、＜１＞〜＜Ｎ−３＞までのデータについては、キャッシュのデータを最小回数（１回ずつ）の読み込みと書き込みを行うことで、コーナーターン処理を実行できる。
【０１３０】
一方、コーナーターン処理の最終段階では、＜Ｎ−２＞のデータを再度読み込んで［Ｎ−２］のキャッシュに、＜Ｎ−１＞のデータを再度読み込んで［Ｎ−１］のキャッシュに、＜Ｎ＞のデータを再度読み込んで［Ｎ］のキャッシュに、コーナーターン処理を行う。
【０１３１】
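In the end, the method of Fig. 13 reads the data marked * in <N−2>, <N−1>, and <N> at most twice (a single read suffices if the cache has enough lines). For the data read from <N−2> to <N>, the dotted-line portion of Fig. 13 is not written on the first pass; it is written when the data is read again in the final stage. Each cache line is therefore written once, without duplication, to carry out corner turn processing.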
【０１３２】
この実施の形態では、特定のずれ幅の例で説明を行ったが、任意のずれ幅でも良い。また、この実施の形態では、画像処理プログラム１を介して、各プロセッサ１２１〜１２４にアクセス方法を指示していたが、プラットホーム２の機能等を使って、並列画像処理装置３が直接指定する方法で実装しても良い。
【０１３３】
以上のように、この実施の形態３によれば、コーナーターン処理を行う時に、処理対象の画像サイズが画像ブロックの整数倍にならない時でも、画像へのアクセス方向と画像ブロックの帯を算出し、画像ブロックの帯で指定されたアクセス方向へ、読み込みの重複が少なくかつ書き込みの重複がないアクセス方法でコーナーターン処理を行うように、各プロセッサに指定することにより、実施の形態２のように画像サイズを変更せずに、キャッシュのリードミスとライトミスを抑えて、オーバーヘッド時間を低減できるという効果が得られる。
【０１３４】
実施の形態４．
図１４はこの発明の実施の形態４による並列画像処理装置の構成を示すブロック図である。図において、１７は、画像サイズが画像ブロックの整数倍にならない時に、画像ブロック幅の行又は列を画像ブロックの帯として算出し、算出した画像ブロックの帯で画像のアクセス方向を指定すると共に、算出した画像ブロックの帯の個数が、使用プロセッサ数設定手段１２に設定されているプロセッサの個数の整数倍にならない時に、画像ブロックの帯を分割して、複数のプロセッサ１２１〜１２４で処理するよう指定する多重書き込み対応アクセス方向指定手段で、その他は、実施の形態１の図１に示す構成と同等である。この実施の形態では、画像ブロックの帯を基準に各プロセッサ１２１〜１２４に処理を割り振った時に、均等に割り振ることができず、画像ブロックの帯を分割して実行することが、実施の形態３と異なっている。
【０１３５】
図１５はこの実施の形態４による並列画像処理装置の処理を示すフローチャートで、図１６はアジマス方向に画像サイズが画像ブロックの整数倍にならない時の処理内容を示す図であり、図１７はレンジ方向に画像サイズが画像ブロックの整数倍にならない時の処理内容を示す図である。
【０１３６】
次に動作について説明する。
図１５のステップＳＴ１からＳＴ１１までの画像サイズが画像ブロックの整数倍になるかをチェックするまでの処理は、実施の形態３と同様である。画像サイズが画像ブロックの整数倍にならない時に、ステップＳＴ１３において、多重書き込み対応アクセス方向指定手段１７は、メモリ割付手段１４の指示により、画像へのアクセス方向と画像ブロックの帯を算出して、画像処理プログラム１に指定する。
【０１３７】
ステップＳＴ１４において、多重書き込み対応アクセス方向指定手段１７は、画像ブロックの帯の個数が設定されたプロセッサの個数の整数倍になるかをチェックする。チェックした結果、整数倍になる場合は、メモリ割付手段１４が実施の形態３と同様にステップＳＴ５の処理を行う。
【０１３８】
画像ブロックの帯の個数が設定されたプロセッサの個数の整数倍にならない時には、ステップＳＴ１５において、多重書き込み対応アクセス方向指定手段１７は、まず画像ブロックの帯の重なりを許して、各プロセッサ１２１〜１２４の負荷が均等になるように、画像ブロックの帯を各プロセッサ１２１〜１２４に処理を割り当てる。
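The check in step ST14 and the starting point for step ST15 can be sketched as follows. This is a minimal illustration, assuming that each processor first receives the same number of whole bands and that only the leftover bands are split; the function name `plan_bands` and that division rule are assumptions, not details taken from Figs. 16 and 17.

```python
def plan_bands(size_pixels, block_side, num_procs):
    """Count image block bands and how many cannot be dealt out as whole bands."""
    bands = -(-size_pixels // block_side)    # bands of block width, rounded up
    whole, leftover = divmod(bands, num_procs)
    return {
        "bands": bands,
        "whole_bands_per_proc": whole,
        "bands_to_split": leftover,          # non-zero means the ST15 path is needed
    }


# Hypothetical sizes: 18 pixels across, 4-pixel blocks, 2 processors.
plan = plan_bands(18, 4, 2)
print(plan)  # {'bands': 5, 'whole_bands_per_proc': 2, 'bands_to_split': 1}
```

A `bands_to_split` value of zero corresponds to the case where processing proceeds directly to step ST5.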
【０１３９】
次に、画像ブロックの帯が重ならない部分については、多重書き込み対応アクセス方向指定手段１７は、実施の形態３と同様のアクセス方向を指定する。画像ブロックの帯が重なった部分については、多重書き込み対応アクセス方向指定手段１７は、次に示すアクセス方法を指定する。
【０１４０】
多重書き込み対応アクセス方向指定手段１７では、画像サイズがアジマス方向に画像ブロックの整数倍にならない時は、図１６に示すようにアジマスの方向で読み込みと書き込みの範囲を分割し、画像サイズがレンジ方向に画像ブロックの整数倍にならない時は、図１７に示すようにレンジの方向で、読み込みと書き込みの範囲を分割する。分割した範囲について、データの書き込みが重複せず、かつ、読み込み範囲の重複が小さくなるアクセス方法を指示する。この例を図１６と図１７に示す。
【０１４１】
図１６において、プロセッサ１２１，１２２は、基本的に図示された各領域について処理を行うが、処理前画像の点線部分は、画像ブロックの帯の個数が設定されたプロセッサの個数の整数倍にならないために、プロセッサ１２１，１２２のキャッシュがデータを多重に読み込む部分である。また、処理結果画像におけるプロセッサ１２２の範囲の斜線部分は、プロセッサ１２２が書き込む領域であるが、プロセッサ１２１が書き込みの処理を行うことにより、キャッシュのライトミスを抑えている。
【０１４２】
図１７においても、プロセッサ１２１，１２２は、基本的に図示された各領域について処理を行うが、処理前画像の点線部分は、画像ブロックの帯の個数が、設定されたプロセッサの個数の整数倍にならないために、プロセッサ１２１，１２２のキャッシュがデータを多重に読み込む部分である。また、処理結果画像の点線部分については、各プロセッサ１２１，１２２が、それぞれ書き込みの処理を行うことにより、キャッシュのライトミスを抑えている。
【０１４３】
ステップＳＴ１５の処理を終了後に、メモリ割付手段１４が実施の形態３と同様にステップＳＴ５の処理を行う。
【０１４４】
この実施の形態では、特定のずれ幅の例で説明を行ったが、任意のずれ幅でも良い。
【０１４５】
以上のように、この実施の形態４によれば、コーナーターン処理を行う時に、キャッシュのリードミスを最小限に抑えると共にライトミスを抑えて、オーバーヘッド時間を低減できるという効果が得られる。
【０１４６】
また、画像ブロックの帯での処理を分割して実行するため、画像ブロックの帯の個数が、設定されたプロセッサの個数の整数倍にならない時でも、プロセッサに処理負荷を均等に分割することにより、効率の良いコーナーターン処理を実行できるという効果が得られる。
【０１４７】
実施の形態５．
図１８はこの発明の実施の形態５による並列画像処理装置の構成を示すブロック図である。図において、１５は画像サイズが画像ブロックの整数倍になるように画像領域を拡張してメモリ上に確保するメモリ確保手段、１６は、画像サイズが画像ブロックの整数倍にならない時に、画像ブロック幅の行又は列を画像ブロックの帯として算出し、算出した画像ブロックの帯で画像のアクセス方向を指定するアクセス方向指定手段、１７は、画像サイズが画像ブロックの整数倍にならない時に、画像ブロック幅の行又は列を画像ブロックの帯として算出し、算出した画像ブロックの帯で画像のアクセス方向を指定すると共に、算出した画像ブロックの帯の個数が、使用プロセッサ数設定手段１２に設定されているプロセッサの個数の整数倍にならない時に、画像ブロックの帯を分割して、複数のプロセッサ１２１〜１２４で処理するよう指定する多重書き込み対応アクセス方向指定手段である。
【０１４８】
また、図１８において、１８は、画像サイズが画像ブロックの整数倍にならない時の修正方法の判定条件を設定するはみ出し修正方法設定手段であり、その他は実施の形態１の図１に示す構成と同等である。この実施の形態は、画像サイズが画像ブロックの整数倍にならない時に、その修正方法を選択するものである。図１９はこの実施の形態５による並列画像処理装置の処理を示すフローチャートである。
【０１４９】
次に動作について説明する。
はみ出し修正方法設定手段１８は、画像サイズが画像ブロックの整数倍にならない時に、修正方法を選択するための基準となる判定条件を設定している。このはみ出し修正方法設定手段１８への情報の設定は、ユーザ入力等で行う。メモリ割付手段１４は、はみ出し修正方法設定手段１８が保有している条件式の基準に基づき、画像情報設定手段１１の画像情報や、キャッシュ情報設定手段１３のキャッシュ情報等を使って、修正方法を選択する。
【０１５０】
また、はみ出し修正方法設定手段１８に、メモリの確保にかかる時間や、アクセス方向を指定した処理にかかる時間等を、手動や過去の実行情報の履歴から設定して、メモリ割付手段１４が、各修正法を選択した時のコーナーターン処理を含めた画像処理時間等を計算して、最短となる方法を選択することもできる。
【０１５１】
この他、アクセス方向指定手段１６が選択されたケースで、使用プロセッサ数設定手段１２からのプロセッサの個数により、画像ブロックの帯で処理が均等に分割できない場合には、メモリ割付手段１４が、多重書き込み対応アクセス方向指定手段１７に切り替えて処理を行うといった設定も行うことができる。
【０１５２】
図１９のステップＳＴ１からＳＴ４までの処理は、実施の形態１の処理と同じである。ステップＳＴ１１において、メモリ割付手段１４は、画像サイズが、算出した画像ブロックの整数倍になるかをチェックする。チェックの結果、整数倍になる時は、実施の形態１と同様に、ステップＳＴ５の処理を行う。整数倍にならない時には、ステップＳＴ１６において、メモリ割付手段１４は、はみ出し修正方法設定手段１８に問い合わせて、修正方法を選択する。
【０１５３】
メモリ割付手段１４がメモリ確保手段１５を選択した場合には、実施の形態２と同様に、ステップＳＴ１２及びステップＳＴ５の処理を行い、アクセス方向指定手段１６を選択した場合には、実施の形態３と同様に、ステップＳＴ１３及びステップＳＴ５の処理を行い、多重書き込み対応アクセス方向指定手段１７を選択した場合には、実施の形態４と同様に、ステップＳＴ１３，ＳＴ１４，ＳＴ１５，ＳＴ５の処理を行う。
【０１５４】
以上のように、この実施の形態５によれば、コーナーターン処理を行う時に、処理対象の画像サイズが画像ブロックサイズの整数倍にならない時に、キャッシュのリードミスを最小限に抑えると共にライトミスを抑えて、オーバーヘッド時間を低減できると共に、複数の対処方法から最適の方法を選択することで、効率の良いコーナーターン処理を実行できるという効果が得られる。
【０１５５】
実施の形態６．
図２０はこの発明の実施の形態６による並列画像処理装置の構成を示すブロック図である。図において、１９は要求仕様として与えられる時間制約と各プロセッサ１２１〜１２４の処理時間等の時間制約条件を設定している処理時間制約設定手段、２０は実際に使用するプロセッサ１２１〜１２４の個数を画像処理プログラム１に指定する実行プロセッサ数指定手段であり、その他は実施の形態１の図１に示す構成と同等である。この実施の形態では、画像ブロックをプロセッサ１２１〜１２４に分割する前に、使用するプロセッサ数を調整することが実施の形態１と異なる。
図２１はこの実施の形態６による並列画像処理装置の処理を示すフローチャートである。
【０１５６】
次に動作について説明する。
処理時間制約設定手段１９は、要求仕様として与えられるコーナーターン処理に対する時間制約条件の情報を設定している。ここで設定する時間制約条件には、処理完了までの目標時間、最大許容時間等の時間条件の他、処理時間（速度）とシステム資源（プロセッサ）の利用に関するポリシ（処理時間優先やシステム資源の節約優先）等の処理時間に影響を与える各種情報がある。また、コーナーターン処理にかかった実行時間情報を、画像サイズや使用したプロセッサ数や画像の分割処理方法等のパラメータと一緒に記録しておくこともできる。
【０１５７】
この処理時間制約設定手段１９への情報設定は、ユーザ入力等により手動で行う他に、処理実行時に実行時間情報を、画像サイズや使用したプロセッサ数等のパラメータと一緒に蓄積しておくこともできる。
【０１５８】
実行プロセッサ数指定手段２０は、指定されたプロセッサ数で処理を実行するように指定する手段である。この例では、画像処理プログラム１にプロセッサ数を指定する。なお、この例の画像処理プログラム１では、実行プロセッサ数指定手段２０が指定したプロセッサ数でコーナーターン処理を行う。
【０１５９】
図２１のステップＳＴ１からＳＴ４までの画像ブロックを計算するところまでの処理は、実施の形態１と同じである。ただし、この実施の形態では、ステップＳＴ３で、使用プロセッサ数設定手段１２から入手したプロセッサの個数を、プロセッサ数Ａとして記憶しておく。
【０１６０】
ステップＳＴ４で画像ブロックを計算後、ステップＳＴ２１において、メモリ割付手段１４は、処理時間制約設定手段１９から時間制約条件を入手し、ステップＳＴ２２において、メモリ割付手段１４は、入手した時間制約を満たす最適のプロセッサ数Ｂを計算する。この計算では、過去のコーナーターン処理にかかった実行時間情報と、その時の画像サイズ等の処理パラメータを使って、与えられた条件下でのプロセッサ数と処理時間の関係を予測して算出することも含まれている。
【０１６１】
例えば、システム資源の節約優先の場合には、最大許容時間を満たすのに最低限必要な個数のプロセッサ数を算出する。目標時間〜最大許容時間の範囲で処理時間優先の場合には、目標時間内で終了するのに最低限必要な個数のプロセッサ数を算出する。処理時間最優先の場合には、プロセッサ数を無限大∞とする。
【０１６２】
ステップＳＴ２２でプロセッサ数Ｂを算出した後に、ステップＳＴ２３において、メモリ割付手段１４はプロセッサ数Ａの値とプロセッサ数Ｂの値を比較し、同じであれば、実施の形態１と同様にステップＳＴ５の処理を行い、ステップＳＴ２３でＡの値とＢの値が異なり、ステップＳＴ２４において、Ｂの値がＡの値以上の場合には、メモリ割付手段１４はＡの値を最終的なプロセッサ数として、実施の形態１と同様にステップＳＴ５の処理を実施する。
【０１６３】
ステップＳＴ２４において、Ｂの値がＡの値より小さい場合には、ステップＳＴ２５において、メモリ割付手段１４はＢの値を最終的なプロセッサ数とする。この場合は、プロセッサ数Ｂの値を使用し、ステップＳＴ５において、実施の形態１と同じ手順でプロセッサへの処理領域の割り付けを行うと共に、メモリ割付手段１４は実行プロセッサ数指定手段２０に最終的なプロセッサ数を伝達し、実行プロセッサ数指定手段２０は、伝達されたプロセッサ数で処理を実行するように画像処理プログラム１へ指定する。
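The selection of the final processor count in steps ST22 to ST25 can be sketched as follows. This is only an illustrative sketch: the policy names, the `predict_time` model (standing in for the prediction from past execution times) and the `search_limit` cap are hypothetical and do not appear in the specification.

```python
import math

def required_processors(policy, target_s, max_allowed_s, predict_time, search_limit=1024):
    """Processor count B (step ST22) for a given policy.

    predict_time(n) is an assumed model returning the predicted
    corner-turn time in seconds when n processors are used.
    """
    if policy == "fastest":           # processing time has top priority
        return math.inf
    # Resource saving aims at the maximum allowed time,
    # time priority aims at the target time.
    limit = max_allowed_s if policy == "save_resources" else target_s
    for n in range(1, search_limit + 1):
        if predict_time(n) <= limit:
            return n                  # smallest count that meets the limit
    return math.inf                   # limit not reachable within the cap

def final_processor_count(a, b):
    """Steps ST23 to ST25: use A when B >= A, otherwise use B."""
    return a if b >= a else b
```

With an assumed model such as `lambda n: 100.0 / n`, a target of 20 s and a maximum of 40 s, the resource-saving policy yields 3 processors and the time-priority policy yields 5; if only 4 processors are configured (A = 4), the final counts are 3 and 4 respectively.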
【０１６４】
この実施の形態では、使用プロセッサ数設定手段１２が最大利用可能なプロセッサ数を指定する例で説明したが、使用プロセッサ数設定手段１２では、プロセッサ数の設定を行わず、処理時間制約設定手段１９で計算機資源への利用制約の１つとして、最大利用可能プロセッサ数を設定し、プロセッサ数Ａを無限大∞として実行しても良い。
【０１６５】
この実施の形態では、実行プロセッサ数指定手段２０が、画像処理プログラム１にプロセッサ数を指定しているが、実行プロセッサ数指定手段２０が、画像処理プログラム１で使用するプロセッサの個数を、プラットホーム２の機能を使って直接設定するように実装しても良い。
【０１６６】
この実施の形態では、コーナーターン処理について、使用プロセッサ数を実行プロセッサ数指定手段２０で指定しているが、コーナーターン処理以外の画像処理についても、実行プロセッサ数指定手段２０でプロセッサ数を変更できるように実装しても良い。
【０１６７】
この実施の形態では、処理時間制約設定手段１９での処理時間関連の情報は、コーナーターン処理関連の時間を使用しているが、コーナーターン処理を含む画像処理全体についての制約時間情報や実行時間情報を使用して、プロセッサ数Ｂを算出しても良い。
【０１６８】
この実施の形態では、実施の形態１の並列画像処理装置に処理時間制約設定手段１９と実行プロセッサ数指定手段２０を加えているが、実施の形態２〜実施の形態５の並列画像処理装置に対しても、処理時間制約設定手段１９と実行プロセッサ数指定手段２０を加えても良い。また、実施の形態４と実施の形態５に適用した場合には、画像ブロックの帯を複数プロセッサに分割しないという条件を、処理時間制約設定手段１９に加えることで、画像ブロックの帯での重なりを抑制しても良い。
【０１６９】
以上のように、この実施の形態６によれば、キャッシュによるオーバーヘッド時間を低減できると共に、要求仕様として与えられる時間制約条件の情報に基づいて最適のプロセッサ数を算出し、算出したプロセッサ数で効率の良いコーナーターン処理を実行できるという効果が得られる。
【０１７０】
また、最適のプロセッサ数の算出では、過去の実行時間情報等を利用することで、要求仕様として与えられた時間内で、より正確にコーナーターン処理を完了できると共に、利用する資源を節約しながらコーナーターン処理を実行できるという効果が得られる。
【０１７１】
実施の形態７．
図２２はこの発明の実施の形態７による並列画像処理装置の構成を示すブロック図である。図において、２１は、画像ブロックの処理に必要な各プロセッサ１２１〜１２４のキャッシュのライン数が不足する時に、キャッシュデータの退避回数が最小になる画像へのアクセスパターンを算出し、画像処理プログラム１に指定する画素アクセス順序指定手段であり、その他は実施の形態１の図１に示す構成と同等である。この例では、画像ブロックをプロセッサ１２１〜１２４に分割して実行する時に、１つの画像ブロックをコーナーターン処理するのに、各プロセッサ１２１〜１２４のキャッシュライン数が不足することが、実施の形態１と異なる。
【０１７２】
図２３はこの実施の形態７による並列画像処理装置の処理を示すフローチャートであり、図２４はキャッシュのライン数が不足した時のアクセスパターンを示す図である。
【０１７３】
次に動作について説明する。
図２３のステップＳＴ１からＳＴ４における画像ブロックを計算するところまでの処理は、実施の形態１と同じである。画像ブロックを計算後、ステップＳＴ３１において、メモリ割付手段１４は、まず、画像ブロックの処理に必要なキャッシュのライン数を算出する。このライン数は、コーナーターン処理の処理前画像側と処理結果画像側で、画像ブロックのデータを全てキャッシュに収めるのに必要なライン数となる。
【０１７４】
図２４の例では、１キャッシュライン当たり４画素であり、画像ブロックは４画素×４画素の領域になる。この場合、処理前画像側と処理結果画像側で、画像ブロックのデータを全てキャッシュに収めるのに、それぞれ４ラインずつ必要となるため、ここでの必要なキャッシュのライン数は８ラインとなる。
【０１７５】
必要なキャッシュのライン数を算出した後、ステップＳＴ３１において、メモリ割付手段１４は、キャッシュ情報設定手段１３が設定している各プロセッサ１２１〜１２４が搭載しているキャッシュのライン数と比較する。ここで、キャッシュのライン数が不足しない時（プロセッサ１２１〜１２４が搭載しているキャッシュのライン数≧必要なキャッシュのライン数）には、実施の形態１と同様に、ステップＳＴ５の処理を実行する。
【０１７６】
キャッシュのライン数が不足する時（プロセッサ１２１〜１２４が搭載しているキャッシュのライン数＜必要なキャッシュのライン数）には、ステップＳＴ３２において、メモリ割付手段１４の指示により、画素アクセス順序指定手段２１は、画素へのアクセス順序を計算して、画像処理プログラム１に指定する。その後、ステップＳＴ５において、各プロセッサ１２１〜１２４への画像ブロック割り付けを実施の形態１と同様に実行する。
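The line-count check of step ST31 can be illustrated with a small sketch. It assumes that every row of the image block starts on a cache-line boundary; the function names are hypothetical.

```python
def required_cache_lines(block_side_px, pixels_per_line):
    """Lines needed to hold one square block on both the source and result side."""
    lines_per_row = -(-block_side_px // pixels_per_line)  # ceiling division
    one_side = block_side_px * lines_per_row               # lines for one image side
    return 2 * one_side                                    # source side + result side

def needs_access_ordering(available_lines, block_side_px, pixels_per_line):
    """True when the cache is too small and step ST32 has to be taken."""
    return available_lines < required_cache_lines(block_side_px, pixels_per_line)
```

For the 4 x 4 pixel block of Fig. 24 with 4 pixels per line, `required_cache_lines(4, 4)` gives 8, so a cache with fewer than 8 lines takes the ST32 path.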
【０１７７】
ステップＳＴ３２の処理について説明する。
アクセス順序の制御を指示された画素アクセス順序指定手段２１では、キャッシュ情報設定手段１３の情報を利用して、キャッシュデータの退避回数が最小になるアクセスパターンを算出する。ここでは、画素へのアクセス順序を制御することで、キャッシュデータの退避回数が最小になるパターンを算出する。
【０１７８】
図２４に、必要なキャッシュのライン数が「８ライン」であるのに対し、搭載しているキャッシュのライン数が８ラインよりも少ない場合のアクセスパターンの例を示す。なお、ここでは、キャッシュデータの退避順序が、最終アクセス順序であることを仮定している。この仮定のもと、図２４のアクセスパターンでは、アクセス順序を図示のように行い、キャッシュデータの退避回数を最小にしている。
【０１７９】
アクセスパターンの算出後、ステップＳＴ３２において、画素アクセス順序指定手段２１が、メモリ割付手段１４と同様に、画像処理プログラム１に画素へのアクセスパターンを指定することで、各プロセッサ１２１〜１２４での実際のアクセス順序を制御する。ここでは、画像処理プログラム１が外部から指定された順序で各プロセッサ１２１〜１２４の各画素へのアクセス順序を変更できることを仮定している。
【０１８０】
この実施の形態では、画像ブロックのサイズが４画素×４画素であるが、任意のサイズの画像ブロックでも良い。また、キャッシュラインの不足数についても任意の数で良い。さらに、この実施の形態では、キャッシュデータの退避順序が、最終アクセス順序である例で説明したが、他のキャッシュ退避方法で実現されたシステムでは、キャッシュデータの退避回数が最小となるアクセスパターンを指定するように実装しても良い。
【０１８１】
この実施の形態では、画素アクセス順序指定手段２１が、画像処理プログラム１にアクセスパターンを指定することにより、アクセスパターンの制御を行っているが、画素アクセス順序指定手段２１がプラットホーム２の機能を使って直接設定するように実装しても良い。
【０１８２】
この実施の形態では、実施の形態１の並列画像処理装置に画素アクセス順序指定手段２１を追加しているが、実施の形態２〜実施の形態６の並列画像処理装置に対しても、画素アクセス順序指定手段２１を追加しても良い。
【０１８３】
以上のように、この実施の形態７によれば、画像ブロックをプロセッサ１２１〜１２４に分割して実行する時に、１つの画像ブロックをコーナーターン処理するのに、各プロセッサ１２１〜１２４のキャッシュのライン数が不足しても、各画素へのアクセスパターンを制御し、キャッシュデータの退避回数を抑えることにより、オーバーヘッド時間を低減できるという効果が得られる。
【０１８４】
実施の形態８．
図２５はこの発明の実施の形態８による並列画像処理装置の構成を示すブロック図である。図において、２２は、各プロセッサ１２１〜１２４のキャッシュが多階層の構成をとる時に、画像情報設定手段１１に設定されている画像情報、使用プロセッサ数設定手段１２に設定されているプロセッサ１２１〜１２４の個数情報、キャッシュ情報設定手段１３に設定されているキャッシュ情報を入手して、メモリ割付手段１４が算出した１次キャッシュの画像ブロックをもとに、キャッシュの階層ごとに低位キャッシュの画像ブロックを包含するように画像ブロックを算出し、算出した最上位キャッシュの画像ブロック単位でコーナーターン処理を実行し、かつ、最上位キャッシュの画像ブロックを実行する個数がプロセッサ１２１〜１２４間で均等になるように、各プロセッサ１２１〜１２４へ共有メモリ１２５の対象領域を割り付けるよう画像処理プログラム１に指示する多階層キャッシュ対応メモリ割付手段であり、その他は実施の形態１の図１に示す構成と同等である。
図２６はこの実施の形態８による並列画像処理装置の処理を示すフローチャートである。
【０１８５】
図２７はこの実施の形態でのメモリ割り付けの概念を示す図で、図２８は画像ブロックを複数のプロセッサ１２１〜１２４で処理する一例を示す図である。この例では、各プロセッサ１２１〜１２４は、１次キャッシュと２次キャッシュの２階層のキャッシュ構成を取るものと仮定する。１次キャッシュは実施の形態１と同じサイズ（３２Ｂｙｔｅ×５１２ライン）でライトスルー方式、２次キャッシュは１２８Ｂｙｔｅ×１６，３８４ラインのサイズでライトバック方式、２次キャッシュへのライトミスが、他のキャッシュミスのオーバーヘッド時間と比較して特に大きいと仮定する。
【０１８６】
次に動作について説明する。
図２６のステップＳＴ４１において、多階層キャッシュ対応メモリ割付手段２２は、メモリ割付手段１４に、階層の低い１次キャッシュの画像ブロックの算出を指示し、ステップＳＴ４２において、メモリ割付手段１４は、画像情報、使用プロセッサ数、キャッシュ情報を入手して、実施の形態１と同様にして、１次キャッシュの画像ブロックを算出する。
【０１８７】
多階層キャッシュ対応メモリ割付手段２２は、ステップＳＴ４３において、１つ上位の２次キャッシュを選択し、ステップＳＴ４４において、対象の２次キャッシュのキャッシュ情報を入手し、ステップＳＴ４５において、１次キャッシュの画像ブロック（以下、１次画像ブロックとする）を包含するように、２次キャッシュの画像ブロック（以下、２次画像ブロック）を算出する。図２７に示す例では、１次画像ブロック４個で２次キャッシュの幅と一致するため、１次画像ブロック４個×４個のサイズの画像ブロックを２次画像ブロックとする。
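The loop of steps ST42 to ST47 can be sketched as below, taking the block at each level as a square whose side is one cache line of that level. The 8-byte pixel size is an assumption inferred from the 32-byte primary line holding 4 pixels; the function name is hypothetical.

```python
def block_sides(line_sizes_bytes, pixel_bytes):
    """Block side in pixels for each cache level, lowest level first.

    Raises ValueError when an upper-level block cannot be tiled by whole
    lower-level blocks (the situation handled separately by embodiment 10).
    """
    sides = []
    for line_bytes in line_sizes_bytes:
        side = line_bytes // pixel_bytes      # one cache line wide, square block
        if sides and side % sides[-1] != 0:
            raise ValueError("upper block is not a whole multiple of the lower block")
        sides.append(side)
    return sides
```

With the example values (32-byte primary line, 128-byte secondary line, 8-byte pixels) this returns `[4, 16]`, that is, a secondary block of 4 x 4 primary blocks.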
【０１８８】
ステップＳＴ４６において、多階層キャッシュ対応メモリ割付手段２２は、さらに上位のキャッシュが存在するかをチェックし、存在する場合には、ステップＳＴ４７において、１つ上位階層のキャッシュを選択し、ステップＳＴ４４に戻り、２次画像ブロックを求めたのと同様の方法で、画像ブロックを求めていく。例えば、３次キャッシュが存在する場合には、２次画像ブロックを基準に、それを包含する３次キャッシュの画像ブロックを算出する。
【０１８９】
ステップＳＴ４６で、上位のキャッシュが存在しない場合には、最上位のキャッシュの画像ブロックを、全体の画像ブロックとして決定し、ステップＳＴ５の処理を行う。この例では、２次キャッシュまでしか存在しないので、２次画像ブロックを全体の画像ブロックとする。
【０１９０】
ステップＳＴ５において、多階層キャッシュ対応メモリ割付手段２２は、決定した全体の画像ブロックを基準に、実施の形態１のメモリ割付手段１４と同様の方法で、各プロセッサ１２１〜１２４へ処理する対象領域を割り付ける。この例では、２次画像ブロックを１つの単位として領域を割り付け、各プロセッサ１２１〜１２４で処理を実行する。
【０１９１】
また、全体の画像ブロック単位では、画像ブロックを均等に割り付けられない時には、実施の形態１と同様に、画像ブロックの処理を複数のプロセッサ１２１〜１２４で実行する。図２８は多階層キャッシュの画像ブロックを複数のプロセッサ１２１〜１２４で処理する一例を示す図である。
【０１９２】
多階層キャッシュの画像ブロックで、このような複数のプロセッサ１２１〜１２４での実行を行った場合、最上位のキャッシュレベル（この例では２次キャッシュ）では、実施の形態１と同様に、全てのプロセッサ１２１〜１２４が画像ブロック全体のデータを読み込む（この例では、２次画像ブロックのデータは全てのプロセッサ１２１〜１２４が読み込む）。
【０１９３】
一方で、低位のキャッシュについては、必ずしも全てのデータを読み込む必要はない。図２８の例では、プロセッサ１２１は１次画像ブロックの＜１＞，＜５＞，＜９＞，＜１３＞だけを、プロセッサ１２２は１次画像ブロックの＜２＞，＜６＞，＜１０＞，＜１４＞だけを読み込めば良い。
書き込みについては、実施の形態１と同様に、全てのキャッシュで重ならないように実行する。
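One way to reproduce the assignment in the Fig. 28 example is a round-robin distribution of the primary blocks inside one secondary block. The row-major numbering of the primary blocks and the modulo rule are assumptions chosen only because they match the listed block numbers; the figure itself may use a different layout.

```python
def primary_blocks_for(proc_index, num_procs, blocks_per_side=4):
    """1-based primary block numbers handled by processor proc_index (0-based)."""
    total = blocks_per_side * blocks_per_side
    return [n for n in range(1, total + 1) if (n - 1) % num_procs == proc_index]
```

For four processors, index 0 (processor 121) gets blocks 1, 5, 9, 13 and index 1 (processor 122) gets 2, 6, 10, 14. The sets are disjoint, so no two processors write the same block.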
【０１９４】
この実施の形態では、１次キャッシュと２次キャッシュによる構成で説明したが、並列実行する各プロセッサ１２１〜１２４に、３次キャッシュ以上の階層的な構造を取る場合でも適用できる。
【０１９５】
この実施の形態では、１次キャッシュと２次キャッシュについて、上記の所定のサイズと一貫性保持のための方法を割り付けて説明したが、任意のサイズ及び一貫性保持のための方法でも良い。また、処理画像の画素サイズや画像サイズの大きさについても任意で良い。
【０１９６】
この実施の形態では、実施の形態１の並列画像処理装置に多階層キャッシュ対応メモリ割付手段２２を追加したが、実施の形態６の並列画像処理装置に対しても、多階層キャッシュ対応メモリ割付手段２２を追加しても良い。
【０１９７】
以上のように、この実施の形態８によれば、多階層のキャッシュ構成を取る場合に、最上位キャッシュの画像ブロックを算出し、各プロセッサ１２１〜１２４が最上位キャッシュの画像ブロック単位でコーナーターン処理を実行し、かつ、最上位キャッシュの画像ブロックを実行する個数が均等になるように、各プロセッサ１２１〜１２４へ共有メモリ１２５の対象領域を割り付けることにより、キャッシュによるオーバーヘッド時間を低減できるという効果が得られる。
【０１９８】
また、多階層のキャッシュ構成を取る場合でも、１画像ブロックの処理を複数のプロセッサ１２１〜１２４で行うことにより、より公平な負荷分散が行えると共に、１画像ブロックの処理を高速化できるという効果が得られる。
【０１９９】
実施の形態９．
図２９はこの発明の実施の形態９による並列画像処理装置の構成を示すブロック図である。図において、２３は、画像サイズが最上位キャッシュの画像ブロックの整数倍にならない時に、画像サイズが最上位キャッシュの画像ブロックの整数倍になるように、画像領域を拡張して共有メモリ１２５上に確保する多階層キャッシュ対応メモリ確保手段であり、その他は実施の形態８の図２５に示す構成と同等である。この実施の形態では、画像サイズが最上位キャッシュの画像ブロックサイズの整数倍にならない点が、実施の形態８と異なっている。
図３０はこの実施の形態９による並列画像処理装置の処理を示すフローチャートである。
【０２００】
次に動作について説明する。
図３０のステップＳＴ４１からＳＴ４７までの最上位キャッシュの画像ブロックを計算するところまでは、実施の形態８と同じである。
【０２０１】
ステップＳＴ５１において、多階層キャッシュ対応メモリ割付手段２２は、画像サイズが、算出した最上位キャッシュの画像ブロックの整数倍になるかをチェックする。チェックの結果、整数倍になる場合は、実施の形態８と同様にステップＳＴ５の処理を行う。
【０２０２】
整数倍にならない時には、ステップＳＴ５２において、多階層キャッシュ対応メモリ割付手段２２の指示により、多階層キャッシュ対応メモリ確保手段２３が、実施の形態２のメモリ確保手段１５と同様の方法で、画像サイズが最上位キャッシュの画像ブロックの整数倍になるように画像領域を拡張して、共有メモリ１２５上に確保する。そして、多階層キャッシュ対応メモリ割付手段２２は、実施の形態８と同様に、ステップＳＴ５の処理を行う。
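The region expansion of step ST52 amounts to rounding each image dimension up to the next multiple of the top-level block side. A minimal sketch, with a hypothetical function name and a square block assumed:

```python
def padded_size(rows, cols, block_side):
    """Smallest (rows, cols) that contains the image and is a multiple of block_side."""
    def round_up(n):
        return -(-n // block_side) * block_side  # ceiling to a multiple
    return round_up(rows), round_up(cols)
```

For example, a 100 x 64 pixel image with a 16-pixel top-level block would be reserved as 112 x 64.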
【０２０３】
この実施の形態では、実施の形態１の並列画像処理装置に、多階層キャッシュ対応メモリ割付手段２２と多階層キャッシュ対応メモリ確保手段２３を追加しているが、実施の形態６の並列画像処理装置に、多階層キャッシュ対応メモリ割付手段２２と多階層キャッシュ対応メモリ確保手段２３を追加しても良い。
【０２０４】
以上のように、この実施の形態９によれば、多階層のキャッシュ構成を取り、処理対象の画像サイズが最上位キャッシュの画像ブロックサイズの整数倍にならない時に、コーナーターン処理において、画像サイズが最上位キャッシュの画像ブロックの整数倍になるように、処理対象の画像領域を拡張して共有メモリ１２５上に確保するので、キャッシュによるオーバーヘッド時間を低減できるという効果が得られる。
【０２０５】
実施の形態１０．
図３１はこの発明の実施の形態１０による並列画像処理装置の構成を示すブロック図である。図において、２４は、上位キャッシュのラインサイズが、低位キャッシュの画像ブロックの整数倍にならない時に、上位キャッシュのラインサイズが、低位キャッシュの画像ブロックの整数倍になるように、低位キャッシュの画像ブロックの領域を拡張して共有メモリ１２５上に確保する多階層画像ブロック用メモリ確保手段であり、その他は実施の形態８の図２５に示す構成と同等である。この実施の形態では、上位キャッシュのラインサイズが、低位キャッシュの画像ブロックの整数倍にならない点が、実施の形態８と異なっている。
図３２はこの実施の形態１０による並列画像処理装置の処理を示すフローチャートであり、図３３はこの実施の形態での領域拡張方法の概念を示す図である。
【０２０６】
次に動作について説明する。
図３２のステップＳＴ４１からＳＴ４４までの２次キャッシュ以上の各階層で対象キャッシュの情報を入手するところまでは、実施の形態８と同じである。
【０２０７】
ステップＳＴ５１において、多階層キャッシュ対応メモリ割付手段２２は、選択した各階層で、ステップＳＴ４４で入手した対象キャッシュの情報に基づき、選択された階層のキャッシュのラインサイズが、１つ下の低位のキャッシュの画像ブロックサイズの整数倍になるかをチェックする。チェックの結果、整数倍になる時は、実施の形態８と同様に、ステップＳＴ４５以降の処理を行う。
【０２０８】
整数倍にならない時には、ステップＳＴ５２において、多階層キャッシュ対応メモリ割付手段２２の指示に基づき、多階層画像ブロック用メモリ確保手段２４が、整数倍になるように、全ての低位キャッシュの画像ブロックの領域を拡張して、共有メモリ１２５上に確保する。この例を図３３に示す。ここで、領域拡張の指示方法は、実施の形態２や実施の形態９と同様の方法である。ステップＳＴ５２の領域拡張後の処理手順は、実施の形態８と同様に、ステップＳＴ４５以降の処理を行う。
【０２０９】
この領域拡張を行った場合に、画像中にアクセスされない領域ができるため、拡張後も画像処理プログラム１が画像データに正しくアクセスできるようにする必要がある。この方法としては、画像処理プログラム１を多階層画像ブロック用メモリ確保手段２４での領域拡張に対応したプログラムとし、多階層画像ブロック用メモリ確保手段２４から画像処理プログラム１へ領域の拡張を指示することで、画像処理プログラム１が自動的に対応する方法がある。
【０２１０】
また、画像処理プログラム１は、並列画像処理装置３が指定した領域を画像データとして参照しながら実行するプログラムとし、並列画像処理装置３内の多階層画像ブロック用メモリ確保手段２４が、領域の拡張情報と画像データの位置情報を管理し、画像処理プログラム１が正しいデータにアクセスできるように画像データの位置を計算して指定する方法もある。
【０２１１】
この実施の形態では、実施の形態８の並列画像処理装置に多階層画像ブロック用メモリ確保手段２４を追加しているが、実施の形態９の並列画像処理装置に対しても、多階層画像ブロック用メモリ確保手段２４を追加しても良い。また、実施の形態６の並列画像処理装置についても、多階層キャッシュ対応メモリ割付手段２２と多階層画像ブロック用メモリ確保手段２４を追加しても良い。
【０２１２】
以上のように、この実施の形態１０によれば、多階層のキャッシュ構成で、上位キャッシュのラインサイズが、低位キャッシュの画像ブロックの整数倍にならない時に、コーナーターン処理において、上位キャッシュのラインサイズが低位キャッシュの画像ブロックの整数倍になるように、低位キャッシュの画像ブロックの領域を拡張して共有メモリ１２５上に確保することにより、キャッシュによるオーバーヘッド時間を低減できるという効果が得られる。
【０２１３】
実施の形態１１．
図３４はこの発明の実施の形態１１による並列画像処理装置の構成を示すブロック図である。図において、２５は、画像サイズが算出した最上位キャッシュの画像ブロックの整数倍にならない時に、最上位キャッシュの画像ブロックの行又は列を最上位キャッシュの画像ブロックの帯として算出し、算出した画像ブロックの帯で画像のアクセス方向を指定すると共に、算出した画像ブロックの帯の個数が、使用プロセッサ数設定手段１２に設定されているプロセッサ１２１〜１２４の個数の整数倍にならない時に、画像ブロックの帯を分割して、複数のプロセッサ１２１〜１２４で処理するよう指定する多階層キャッシュ対応アクセス方向指定手段であり、その他は実施の形態８の図２５に示す構成と同等である。この実施の形態では、画像サイズが最上位キャッシュの画像ブロックの整数倍にならない点が、実施の形態８と異なっている。
【０２１４】
図３５はこの実施の形態１１による並列画像処理装置の処理を示すフローチャートであり、図３６はこの実施の形態１１でのアクセス方向指定方法の概念を示す図である。
【０２１５】
次に動作について説明する。
図３５のステップＳＴ４１からＳＴ４７までの最上位キャッシュの画像ブロックを計算するところまでは、実施の形態８と同じである。
【０２１６】
ステップＳＴ６１において、多階層キャッシュ対応メモリ割付手段２２は、画像サイズが、算出した最上位キャッシュの画像ブロックの整数倍になるかをチェックする。チェックの結果、整数倍になる時は、実施の形態８と同様にステップＳＴ５の処理を行う。整数倍にならない時には、ステップＳＴ６２において、多階層キャッシュ対応メモリ割付手段２２の指示により、多階層キャッシュ対応アクセス方向指定手段２５は画像ブロックの帯と画像へのアクセス方向を算出する。
【０２１７】
多階層キャッシュ対応アクセス方向指定手段２５が画像ブロックの帯と画像へのアクセス方向を算出する方法は、基本的に、実施の形態３と同じである。画像ブロックの帯は、最上位キャッシュの画像ブロック幅でアジマス又はレンジ方向になる。図３６の例では、画像ブロックの帯は、処理前画像で画像ブロック幅の列、処理後画像で画像ブロック幅の行になり、画像へのアクセス方向は、アジマス側にはみ出した時にはアジマス方向、レンジ側にはみ出した時にはレンジ方向になる。
【０２１８】
ステップＳＴ６３において、多階層キャッシュ対応メモリ割付手段２２は、画像ブロックの帯の個数が、設定されたプロセッサの個数の整数倍になるかをチェックする。チェックの結果、整数倍になる時は、実施の形態８と同様にステップＳＴ５の処理を行うが、この時、画像ブロックの帯を基準に、画像ブロックの帯が重ならず、かつ負荷が均等になるようにプロセッサ１２１〜１２４に処理を割り付ける。
【０２１９】
ステップＳＴ６３で、画像ブロックの帯の個数が、設定されたプロセッサの個数の整数倍にならない時には、ステップＳＴ６４において、多階層キャッシュ対応メモリ割付手段２２は、算出した画像ブロックの帯を低位キャッシュでの画像ブロック幅の整数倍に分割する。図３６にその例を示す。
【０２２０】
図３６において、低位キャッシュ（１次キャッシュ）の画像ブロックは、図における（ａ），（ｂ）と同じサイズである。２分割する場合には、（Ａ）と（Ｂ）に分割し、ステップＳＴ５において、２つのプロセッサで実行するよう割り付ける。また、４分割する場合には、さらに（Ａ）の幅の半分、すなわち（ａ），（ｂ）の単位に分割し、ステップＳＴ５において、４つのプロセッサで実行するよう割り付ける。
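The band split of step ST64 can be sketched as cutting the band width into equal parts that are each a whole number of lower-level block widths. The 16-pixel band and 4-pixel lower block used below are assumed values carried over from the two-level example of embodiment 8; the function name is hypothetical.

```python
def split_band(band_width_px, lower_block_px, parts):
    """Return (start, end) pixel ranges, each a multiple of lower_block_px wide."""
    units = band_width_px // lower_block_px       # lower-level blocks across the band
    if band_width_px % lower_block_px or units % parts:
        raise ValueError("band cannot be split evenly on lower-block boundaries")
    width = (units // parts) * lower_block_px
    return [(i * width, (i + 1) * width) for i in range(parts)]
```

`split_band(16, 4, 2)` corresponds to the two halves (A) and (B), and `split_band(16, 4, 4)` to pieces of the size of (a) and (b). A split into 3 parts is rejected because it would cut through a lower-level block.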
【０２２１】
この実施の形態では、アジマス方向にはみ出す例を示したが、レンジ方向にはみ出す場合でも適用できる。また、この実施の形態では、実施の形態８の並列画像処理装置に、多階層キャッシュ対応アクセス方向指定手段２５を追加しているが、実施の形態９と実施の形態１０の並列画像処理装置に対しても、多階層キャッシュ対応アクセス方向指定手段２５を追加しても良い。
【０２２２】
以上のように、この実施の形態１１によれば、多階層のキャッシュ構成を取る場合で、処理対象の画像サイズが最上位キャッシュの画像ブロックの整数倍にならない時でも、実施の形態１０のように画像ブロックの領域を拡張せずに、画像ブロックの帯と画像へのアクセス方向を算出し、画像ブロックの帯で指定されたアクセス方向へ、読み込みの重複が少なく、かつ書き込みの重複がないアクセス方法でコーナーターン処理を行うことにより、キャッシュによるオーバーヘッド時間を低減できるという効果が得られる。
【０２２３】
また、画像ブロックの帯を複数のプロセッサ１２１〜１２４で分割実行する場合でも、低位キャッシュでの画像ブロック幅の整数倍に分割することで、キャッシュのリードミスとライトミスを抑えて、効率の良いコーナーターン処理を実行できるという効果が得られる。
【０２２４】
実施の形態１２．
図３７はこの発明の実施の形態１２による並列画像処理装置の構成を示すブロック図である。図において、２６は、算出した画像ブロックの処理に必要な各プロセッサ１２１〜１２４のキャッシュのライン数が不足する時に、キャッシュデータの退避回数が最小になる画像又は低次のキャッシュへのアクセスパターンを指定する多階層キャッシュ対応アクセスパターン指定手段であり、その他は実施の形態８の図２５に示す構成と同等である。この実施の形態では、画像ブロックをプロセッサに分割して実行する時に、１つの画像ブロックをコーナーターン処理するのに、各プロセッサ１２１〜１２４のキャッシュのライン数が不足していることが、実施の形態８と異なっている。
【０２２５】
図３８はこの実施の形態１２による並列画像処理装置の処理を示すフローチャートであり、図３９はこの実施の形態１２での２次キャッシュのライン数が不足した時のアクセスパターンを示す図である。
【０２２６】
次に動作について説明する。
図３８のステップＳＴ４１からステップＳＴ４７までの最上位キャッシュの画像ブロックを計算するところまでは、実施の形態８と同じである。最上位キャッシュの画像ブロックを計算後、ステップＳＴ７１において、多階層キャッシュ対応メモリ割付手段２２は、まず、画像ブロックの処理に必要なキャッシュのライン数を各キャッシュの階層毎に算出する。このライン数は、各階層のキャッシュで、最上位キャッシュの画像ブロックを対象に、処理前画像側と処理結果画像側で、データを全てキャッシュに収めるのに必要なライン数となる。このライン数の算出方法は、実施の形態７と同じである。
【０２２７】
必要なキャッシュのライン数を算出した後、多階層キャッシュ対応メモリ割付手段２２は、キャッシュ情報設定手段１３が設定している各プロセッサ１２１〜１２４が搭載しているキャッシュのライン数と比較し、キャッシュのライン数が不足するか否かをチェックする。全ての階層でキャッシュのライン数が不足しない時（プロセッサ１２１〜１２４が搭載しているキャッシュのライン数≧必要なキャッシュのライン数）には、実施の形態８と同様に、ステップＳＴ５の処理を実行する。
【０２２８】
いずれかの階層でキャッシュのライン数が不足する時（プロセッサ１２１〜１２４が搭載しているキャッシュのライン数＜必要なキャッシュのライン数）は、ステップＳＴ７２において、多階層キャッシュ対応メモリ割付手段２２の指示により、多階層キャッシュ対応アクセスパターン指定手段２６は、キャッシュ情報設定手段１３の情報を利用して、キャッシュデータの退避回数が最小になるアクセスパターンを算出する。その場合のアクセス順序を図３９に示している。
【０２２９】
この時、最上位キャッシュからチェックし、ライン数が不足した時には、ライン数が不足したキャッシュの階層で、その階層の画像ブロック単位で処理を行うようにする。自身の階層の画像ブロック単位でもライン数が不足する時には、低位キャッシュの画像ブロック単位を選択し、アクセス順序を制御して実行する。ライン数の不足が解消されるまで、より低位キャッシュの画像ブロックを選択していく。
【０２３０】
図３９の例では、１次キャッシュの画像ブロックである＜１＞，＜２＞，＜３＞といった単位で処理を実行するようにする。この例の場合、任意のアクセスパターンで実行するには、１次キャッシュで１２８ライン必要となる。これは、１次キャッシュの１ラインは４画素で、２５６画素では６４ライン、読み込みと書き込みで２倍必要なためである。一方、１次キャッシュの画像ブロック単位で実行すれば８ラインで実行可能になる。
【０２３１】
この画像ブロック単位での処理でライン不足を解消する方法は、コーナーターン処理を任意のアクセスパターンで実行するケースと同じ効率で実行できる利点がある。
【０２３２】
最終的に、１次キャッシュの画像ブロック単位でも、キャッシュラインが不足する場合には、実施の形態７と同様に、画素単位でのアクセスパターンを制御する。この場合、低位キャッシュの画像ブロック単位ごとに実行する場合と比べて処理効率が低下するが、実施の形態７と同様の方法を取ることで、オーバーヘッドを抑えることができる。
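The fallback from larger to smaller block units described above can be sketched for a single cache level as follows. The function names are hypothetical, and returning `None` stands for the final fallback to pixel-level access ordering as in embodiment 7.

```python
def lines_needed(block_side_px, pixels_per_line):
    """Cache lines to hold a square block on both the source and result side."""
    lines_per_row = -(-block_side_px // pixels_per_line)  # ceiling division
    return 2 * block_side_px * lines_per_row

def choose_block_unit(block_sides_px, available_lines, pixels_per_line):
    """Largest block unit that fits in the cache, or None if none fits."""
    for side in sorted(block_sides_px, reverse=True):     # start from the top level
        if lines_needed(side, pixels_per_line) <= available_lines:
            return side
    return None
```

With 4 pixels per line, a 16 x 16 block (256 pixels) needs 128 lines and a 4 x 4 block needs 8, matching the figures quoted for Fig. 39. A cache with, say, 64 lines therefore processes in 4 x 4 units.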
【０２３３】
アクセスパターンの算出後、ステップＳＴ７２において、多階層キャッシュ対応アクセスパターン指定手段２６は、実施の形態７の画素アクセス順序指定手段２１と同様の方法で各プロセッサでの実際のアクセス順序を計算し、実施の形態８と同様に、ステップＳＴ５の処理を実行する。
【０２３４】
この実施の形態では、キャッシュの構成が１次と２次のキャッシュとなっているが、任意のキャッシュの構成でも良い。また、画素サイズや画素数といった画像情報に対しても、任意の値でも良い。
【０２３５】
この実施の形態では、実施の形態８の並列画像処理装置に多階層キャッシュ対応アクセスパターン指定手段２６を追加しているが、実施の形態９〜実施の形態１１の並列画像処理装置に対しても、多階層キャッシュ対応アクセスパターン指定手段２６を追加しても良い。
【０２３６】
以上のように、この実施の形態１２によれば、多階層のキャッシュ構成を取る場合に、最上位キャッシュの画像ブロックをプロセッサ１２１〜１２４に分割してコーナーターン処理を行う時に、各プロセッサ１２１〜１２４のキャッシュのライン数が不足しても、各画素またはキャッシュへのアクセスパターンを制御することで、キャッシュデータの退避回数等のオーバーヘッド時間を削減できるという効果が得られる。
【０２３７】
また、画像ブロック単位での処理でライン不足を解消する方法で実行した場合には、任意のパターンで実行するケースと同じ効率で実行でき、キャッシュのライン数が不足した状況でも、ライン数が充足した状況と同じ効率でコーナーターン処理を実行できるという効果が得られる。
【０２３８】
実施の形態１３．
図４０はこの発明の実施の形態１３による並列画像処理装置の構成を示すブロック図である。図において、２７は、複数の画像サイズを持つ各画像を対象にコーナーターン処理を行う時に、各画像に対して、キャッシュのラインサイズが画素サイズの整数倍にならない場合に画素の補間を行うと共に、算出した画像ブロックの整数倍で、かつ、処理対象画像を包含する最小の画像サイズを算出し、算出した各画像の最小の画像サイズの中から最大の画像サイズを選択し、選択した最大の画像サイズの領域を共有メモリ１２５上に確保する複数サイズ対応メモリ確保手段であり、その他は実施の形態１の図１に示す構成と同等である。
【０２３９】
図４１はこの実施の形態１３による並列画像処理装置の処理を示すフローチャートであり、図４２は画素のすきまを補間する方法を示す図である。
【０２４０】
次に動作について説明する。
複数サイズ対応メモリ確保手段２７は、図４１のステップＳＴ８１において、キャッシュ情報設定手段１３からキャッシュの情報を得て、ステップＳＴ８２において、処理対象の画像を１つ選択し、ステップＳＴ８３において、選択した画像の情報を画像情報設定手段１１から得る。
【０２４１】
ステップＳＴ８４において、複数サイズ対応メモリ確保手段２７は、入手したキャッシュ情報と画像の情報から、キャッシュのラインサイズが画素サイズの整数倍になるかをチェックする。ここで、画素サイズがキャッシュのラインサイズよりも大きい場合は、画素を包含する最小のライン数を対象にチェックを行う。キャッシュのラインサイズが画素サイズの整数倍になる場合は、ステップＳＴ８６において、メモリ割付手段１４は、複数サイズ対応メモリ確保手段２７の指示により、この画像データに対する画像ブロックを算出する。
【０２４２】
キャッシュのラインサイズが画素サイズの整数倍にならない場合には、ステップＳＴ８５において、複数サイズ対応メモリ確保手段２７は、図４２に示すように、処理対象画像に一定のマージンを取り、キャッシュのラインサイズが画素サイズの整数倍になるように、処理対象画像を変更する。なお、画素サイズがキャッシュのラインサイズよりも大きい場合には、画素を包含する最小のライン数を対象にこの操作を実行する。
【０２４３】
処理対象画像を変更後、ステップＳＴ８６において、メモリ割付手段１４は、複数サイズ対応メモリ確保手段２７の指示により、変更した処理対象画像の画像ブロックを算出する。ステップＳＴ８７において、複数サイズ対応メモリ確保手段２７は、画像ブロックの整数倍で、かつ、処理対象画像全体を包含する最小の画像サイズを算出する。
【０２４４】
最小の画像サイズを算出後、ステップＳＴ８８において、複数サイズ対応メモリ確保手段２７は、残りの処理対象の画像があるかをチェックする。処理対象の画像がある場合には、ステップＳＴ８２に戻り、その中から１つを選択して、上記のステップＳＴ８３からＳＴ８８までの一連の処理を実行する。ステップＳＴ８８で、処理対象の画像がない場合は、ステップＳＴ８９において、複数サイズ対応メモリ確保手段２７は、算出した全ての最小の画像サイズの中から、最大の画像サイズを選択する。ステップＳＴ９０において、複数サイズ対応メモリ確保手段２７は、選択した画像サイズの領域を、共有メモリ１２５上に確保する。ステップＳＴ９１において、複数サイズ対応メモリ確保手段２７は、処理対象の画像を１つ選択する。その後のステップＳＴ１〜ＳＴ５において、メモリ割付手段１４が、実施の形態１と同様に処理を実行する。
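Steps ST82 to ST90 can be sketched as a loop over the images that keeps the largest padded size. Two simplifications are assumed here: the cache line size is a whole multiple of each pixel size (so the margin adjustment of ST85 is not needed), and "largest" is judged by the total byte count of the padded image. Function and parameter names are hypothetical.

```python
def min_padded_size(rows, cols, block_side):
    """Smallest block-aligned (rows, cols) that contains the image (ST87)."""
    def round_up(n):
        return -(-n // block_side) * block_side
    return round_up(rows), round_up(cols)

def reserve_bytes(images, line_bytes):
    """Bytes to reserve in shared memory for images given as (rows, cols, pixel_bytes)."""
    largest = 0
    for rows, cols, pixel_bytes in images:          # ST82 to ST88
        block_side = line_bytes // pixel_bytes      # one cache line wide
        p_rows, p_cols = min_padded_size(rows, cols, block_side)
        largest = max(largest, p_rows * p_cols * pixel_bytes)  # ST89
    return largest                                  # size allocated in ST90
```

For two 8-byte-per-pixel images of 100 x 100 and 30 x 50 pixels with a 32-byte line, the padded sizes are 100 x 100 and 32 x 52, so 80,000 bytes would be reserved.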
【０２４５】
複数サイズ対応メモリ確保手段２７での共有メモリ１２５上の確保や、補間した画像へのアクセス方法には、上記の実施の形態と同様に、画像処理プログラム１を設定可能なプログラムとして実装し、並列画像処理装置３側から設定値を与えて実現する方法がある。
【０２４６】
また、画像処理プログラム１は、指定された領域に対する操作を行うように実装し、並列画像処理装置３側で共有メモリ１２５上の確保や、補間した画像へのアクセス方法を管理し、画像処理プログラム１が意識せずに正しい領域をアクセスできるように、画像処理プログラム１へ操作領域を指示する方法でも実現できる。
【０２４７】
この他、複数サイズ対応メモリ確保手段２７では、特定のアラインメント、すなわち、キャッシュのラインサイズの整数倍のアドレスからしか、キャッシュにデータが読み込めない場合には、画像の先頭からキャッシュに読み込める位置から共有メモリ１２５上に領域を確保する。
【０２４８】
この実施の形態では、複数画像を対象としているが、単一画像を対象としても良い。また、この実施の形態では、実施の形態１の並列画像処理装置に複数サイズ対応メモリ確保手段２７を追加しているが、実施の形態２〜実施の形態１２の並列画像処理装置に対しても、複数サイズ対応メモリ確保手段２７を追加しても良い。
【０２４９】
この実施の形態では、キャッシュが１次キャッシュのみの構成であったが、多階層キャッシュ構成の時でも、実施の形態８の方法で最上位キャッシュの画像ブロックを算出し、最上位キャッシュの画像ブロックを使用して最小の画像サイズを算出しても良い。
【０２５０】
以上のように、この実施の形態１３によれば、複数の画像サイズを対象にコーナーターン処理を行う時に、算出した画像ブロックの整数倍で、かつ、処理対象画像を包含する最小の画像領域を算出し、算出した各画像の最小の画像サイズの中から最大の画像サイズを選択し、選択した最大の画像サイズの領域を共有メモリ１２５上に確保することにより、複数の異なったサイズの画像を対象に処理する場合でも、キャッシュによるオーバーヘッド時間を低減できるという効果が得られる。
【０２５１】
また、キャッシュのラインサイズが画像の画素サイズの整数倍にならない場合には、画素を補間することにより、各画素データへのアクセスがキャッシュ上にあることを保証することができ、画像処理全体を効率良く実行できるという効果が得られる。
【０２５２】
実施の形態１４．
図４３はこの発明の実施の形態１４による並列画像処理装置の構成を示すブロック図である。図において、２８は、複数サイズ対応メモリ確保手段２７が確保した共有メモリ１２５の領域を使用し、確保の対象となった画像よりも小さいサイズの画像処理を行う時に、選択された小さいサイズの画像に対する画像ブロック単位で、コーナーターン処理に使用する共有メモリ１２５の使用範囲を算出し、算出した使用範囲でコーナーターン処理を実行するように、各プロセッサ１２１〜１２４に処理対象の領域やアクセス手順を指定する複数サイズ対応アクセス制御手段であり、その他は実施の形態１３の図４０に示す構成と同等である。
【０２５３】
図４４はこの実施の形態１４による並列画像処理装置の処理を示すフローチャートであり、図４５は実施の形態１４でのメモリの使用範囲の一例を示す図である。
【０２５４】
次に動作について説明する。
この実施の形態は、実施の形態１３の図４１のフローチャートに示す処理を行った後、図４４のフローチャートに示す処理を行う。複数サイズ対応アクセス制御手段２８は、図４４のステップＳＴ１０１において、複数サイズ対応メモリ確保手段２７が確保した共有メモリ１２５の領域のサイズを入手し、ステップＳＴ１０２において、画像サイズの異なる複数の画像の中から、処理対象の画像を選択する。この選択方法としては、ユーザの入力による指示や、画像処理プログラム１から処理対象として選択した画像の通知を受ける方法等がある。
【０２５５】
選択画像が確定した時点で、ステップＳＴ１０３において、メモリ割付手段１４は、複数サイズ対応アクセス制御手段２８の指示により、実施の形態１と同様の方法で、選択された画像に対する画像ブロックを算出する。
【０２５６】
ステップＳＴ１０４において、複数サイズ対応アクセス制御手段２８は、算出された画像ブロックの値を受け取り、前記の複数サイズ対応メモリ確保手段２７が確保したメモリ領域のサイズの中から、コーナーターン処理を実行する共有メモリ１２５の使用範囲を決定する。この決定方法を図４５の例で示す。確保した共有メモリ１２５の領域の先頭から、単純に利用すると、図４５（ａ）に示した「共有メモリ領域の先頭から使用した場合」のように、画像ブロックからはみ出した領域を利用する可能性がある。
【０２５７】
一方、実施の形態１３で示したように、複数サイズ対応メモリ確保手段２７が確保するメモリ領域は、処理対象画像の全ての画像に対して、処理対象画像を包含し、かつ、画像ブロックの整数倍となる領域も包含することを保証している。このため、複数サイズ対応アクセス制御手段２８では、選択された画像に対する画像ブロックを基準に、処理対象画像を包含し、かつ、画像ブロックの整数倍となる領域を、コーナーターン処理を実行する領域として決定する。
【０２５８】
図４５（ｂ）に示した「本装置での利用例」は、選択したコーナーターン処理を実行する領域の一例である。コーナーターン処理を実行する領域の選択後、複数サイズ対応アクセス制御手段２８は、ステップＳＴ１０５において、使用するプロセッサ数の情報を入手し、ステップＳＴ１０６において、各プロセッサに割り振る処理領域とアクセスパターンを計算する。そして、実施の形態１と同様に、ステップＳＴ５の処理を行う。
【０２５９】
この実施の形態では、実施の形態１を基にした実施の形態１３の並列画像処理装置に複数サイズ対応アクセス制御手段２８を追加しているが、実施の形態２〜実施の形態１２の並列画像処理装置に対しても、複数サイズ対応メモリ確保手段２７と複数サイズ対応アクセス制御手段２８を追加しても良い。
【０２６０】
また、実施の形態１３と同様に、多階層キャッシュ構成でも良い。この場合には、実施の形態１０の多階層画像ブロック用メモリ確保手段２４と連動して、下位のキャッシュから順に共有メモリ１２５の領域を設定していくと有効である。
【０２６１】
以上のように、この実施の形態１４によれば、複数の画像を対象に確保した大きいサイズの共有メモリ１２５の領域を使用して、確保の対象となった画像よりも小さいサイズの画像の処理を行う時に、選択された小さいサイズの画像に対する画像ブロック単位で、コーナーターン処理の実行時に使用する共有メモリ１２５の使用範囲を算出し、かつ、算出した使用範囲で効率良くコーナーターン処理を行うように、各プロセッサ１２１〜１２４へ処理対象の領域やアクセス手順を指定することにより、キャッシュによるオーバーヘッド時間を低減できるという効果が得られる。
【０２６２】
また、１度確保した共有メモリ１２５の領域を、複数のサイズの画像で共同して利用しながら、効率良くコーナーターン処理を実行できるという効果が得られる。
【０２６３】
実施の形態１５．
図４６はこの発明の実施の形態１５による並列画像処理装置の構成を示すブロック図である。図において、２９は、コーナーターン処理の前に実行する事前実行処理の内容を入手すると共に、画像情報設定手段１１に設定されている画像情報、使用プロセッサ数設定手段１２に設定されているプロセッサ１２１〜１２４の個数情報、キャッシュ情報設定手段１３に設定されているキャッシュ情報を入手し、メモリ割付手段１４が算出した画像ブロック単位を基準として事前実行処理とコーナーターン処理を同時に実行し、各プロセッサ１２１〜１２４の負荷が均等になるように、各プロセッサ１２１〜１２４に共有メモリ１２５の対象領域を割り付ける前処理対応メモリ割付手段であり、その他は実施の形態１の図１に示す構成と同等である。
【０２６４】
図４７はこの実施の形態１５による並列画像処理装置の処理を示すフローチャートであり、図４８は各プロセッサ１２１〜１２４への処理の割り付けと処理の段数を示す図であり、図４９は各プロセッサ１２１〜１２４での各処理の段で処理を実行する時のアクセスパターンの一例を示す図である。
【０２６５】
次に動作について説明する。
図４７のステップＳＴ１１１において、メモリ割付手段１４は、前処理対応メモリ割付手段２９の指示により、画像ブロックを算出する。ステップＳＴ１１２において、前処理対応メモリ割付手段２９は、コーナーターン処理の前に実行する事前実行処理の内容を入手する。この入手方法としては、ユーザの入力による指示や、画像処理プログラム１から対象の処理について通知を受ける方法等がある。
【０２６６】
ステップＳＴ１１３において、前処理対応メモリ割付手段２９は、入手した事前実行処理が画素単位の処理であるかをチェックし、事前実行処理が、各画素単位の処理である場合は、ステップＳＴ１１４において、前処理対応メモリ割付手段２９は、使用プロセッサ数設定手段１２からプロセッサ１２１〜１２４の個数を入手する。
【０２６７】
ステップＳＴ１１５において、前処理対応メモリ割付手段２９は、処理前の画像に対して、コーナーターン処理と同様に、画像ブロック単位で共有メモリ１２５の領域にアクセスし、アクセスした領域において、各画素の単位で、事前実行処理を実行し、事前実行処理の計算結果を、各画素の単位で処理結果の画像に書き込むことで、結果として、画像ブロック単位を基準として、事前実行処理とコーナーターン処理を同時に行うよう、画像処理プログラム１に指示する。
【０２６８】
上記ステップＳＴ１１３において、事前実行処理が画素単位の処理でない場合には、ステップＳＴ１１６において、前処理対応メモリ割付手段２９は、ライン単位の処理パターンを選択する。そして、前処理対応メモリ割付手段２９は、図４８に示すように、画像ブロック幅の、処理前画像の列と処理結果画像の行を、各プロセッサに割り付ける基準にする担当領域として算出する。図４８ではプロセッサ数をＭ個としている。図４８のプロセッサ（１）、プロセッサ（２）、…、プロセッサ（Ｍ）の担当領域は、仮想的なプロセッサ（１）〜（Ｍ）に割り当てた担当領域である。
【０２６９】
また、前処理対応メモリ割付手段２９は、画像ブロック幅の、処理前画像での１行分と対応する処理結果画像の１列を、各段で実行する処理部分とし、第１段〜第Ｎ段の処理部分として算出する。仮想的なプロセッサ（１）〜（Ｍ）では、第１段から順に、第Ｎ段まで、各段ごとに、事前実行処理とコーナーターン処理を実行すると仮定する。
【０２７０】
各段ごとの処理について図４９を使って説明する。
事前実行処理が、レンジ又はアジマス方向のアクセスの場合、ＳＡＲ画像の処理では、１画素分の計算を行うのに、その方向１行分のデータを必要とする。図４９の例では、画素Ａ１分の計算を行うのに、処理前画像の（ａ）の行の全画素のデータを必要とする。このため、各段ごとの処理では、処理前画像の（ａ）の行の全画素のデータを使って、４回事前実行処理を行うことで、Ａ１，Ａ２，Ａ３，Ａ４の結果を得る。これを、（ｂ），（ｃ），（ｄ）の行についても同様に行うことで、仮想的なプロセッサ（１）の第１段の部分の処理結果が得られる。
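The per-stage processing described above, where each output pixel needs the whole source row and the result is written directly to the transposed position, can be sketched as below. `pixel_op(row, c)` is a hypothetical stand-in for the pre-execution processing. The loop order (one block-wide column strip per virtual processor, one block-high stage at a time) is an illustrative reading of Figs. 48 and 49, not the specified implementation.

```python
def fused_preprocess_and_turn(src, block, pixel_op):
    """Apply pixel_op row-wise and write each result directly in transposed order."""
    rows, cols = len(src), len(src[0])
    dst = [[None] * rows for _ in range(cols)]
    for c0 in range(0, cols, block):            # assigned region of one virtual processor
        for r0 in range(0, rows, block):        # stage 1 .. stage N
            for r in range(r0, min(r0 + block, rows)):
                row = src[r]                    # whole row is needed for each pixel
                for c in range(c0, min(c0 + block, cols)):
                    dst[c][r] = pixel_op(row, c)  # write at the corner-turned position
    return dst
```

The output equals what would be obtained by first applying the row-wise processing to the whole image and then transposing it, but no separate pass over an intermediate image is needed.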
【０２７１】
前処理対応メモリ割付手段２９は、仮想的なプロセッサ（１）〜（Ｍ）に割り付ける担当領域と第１段〜第Ｎ段の処理部分を算出後、ステップＳＴ１１７において、画像情報設定手段１１から画像情報、使用プロセッサ数設定手段１２からプロセッサ１２１〜１２４の個数、キャッシュ情報設定手段１３からキャッシュ情報を、それぞれ入手する。
【０２７２】
ステップＳＴ１１８において、前処理対応メモリ割付手段２９は、各プロセッサへの処理の割り付け方法を算出する。ここでは、まず、仮想的なプロセッサ（１）〜（Ｍ）に割り当てた担当領域の各段の処理を実行するのに、十分なキャッシュのラインサイズがあるかチェックし、全てのデータをキャッシュに入れて処理できるだけの十分な領域があれば、特にアクセス方法についての指示は行わない。キャッシュのラインサイズが不足し、キャッシュデータの退避が発生する場合には、各段ごとの処理での、データのアクセス方法を指示する。
【０２７３】
ここでは、キャッシュデータの退避方法や、キャッシュのリードミスにかかる時間から、オーバーヘッドが最小となるアクセス方法を算出する。この例としては、図４９のように、（ａ）の行を処理するときに（ａ１），（ａ２）の順で交互にアクセスする方法がある。また、各プロセッサに処理負荷が均等になるように、仮想的なプロセッサ（１）〜（Ｍ）に担当領域を算出する。
【０２７４】
ステップＳＴ１１９において、前処理対応メモリ割付手段２９は、上記で算出した割り付け方法に従って、事前実行処理とコーナーターン処理を実行するよう共有メモリ１２５の領域の割り付けを、画像処理プログラム１に指示する。
【０２７５】
この実施の形態では、画像処理としてＳＡＲ画像の再生処理を行っているが、ＳＡＲ画像の再生処理以外でも、画像の行または列方向の処理と、各画素単位の処理で構成され、途中でコーナーターン処理に相当する処理を実行する画像処理にも適用できる。
【０２７６】
この実施の形態では、実施の形態１の並列画像処理装置に前処理対応メモリ割付手段２９を追加しているが、実施の形態２〜実施の形態１４の並列画像処理装置に対しても、前処理対応メモリ割付手段２９を追加しても良い。
【０２７７】
プロセッサ１２１〜１２４への処理の割り付けでは、使用するプロセッサ１２１〜１２４の個数により、画像ブロックや担当領域の個数が均等に分割できないケースがある。この場合には、他の実施の形態で示した画像ブロックや画像ブロックの帯を、複数のプロセッサ１２１〜１２４で実行する方法を応用して、画像ブロックや担当領域を複数のプロセッサ１２１〜１２４で、容易に並列実行することができる。
【０２７８】
以上のように、この実施の形態１５によれば、コーナーターン処理用に算出した画像ブロック単位を基準として、事前実行処理とコーナーターン処理を同時に実行し、かつ、各プロセッサ１２１〜１２４の処理負荷が均等になるように、各プロセッサ１２１〜１２４に共有メモリ１２５の対象領域を割り付けることにより、コーナーターン処理において、事前実行処理の完了待ちをする必要がなくなり、画像処理全体のオーバーヘッド時間を低減できるという効果が得られる。
【０２７９】
実施の形態１６．
図５０はこの発明の実施の形態１６の並列画像処理装置の構成を示すブロック図である。図において、３０は、コーナーターン処理の後に実行する事後実行処理の内容を入手すると共に、画像情報設定手段１１に設定されている画像情報、使用プロセッサ数設定手段１２に設定されているプロセッサ１２１〜１２４の個数情報、キャッシュ情報設定手段１３に設定されているキャッシュ情報を入手し、メモリ割付手段１４が算出した画像ブロック単位を基準としてコーナーターン処理と事後実行処理を同時に実行し、各プロセッサ１２１〜１２４の負荷が均等になるように、各プロセッサ１２１〜１２４に共有メモリ１２５の対象領域を割り付ける後処理対応メモリ割付手段であり、その他は実施の形態１の図１に示す構成と同等である。
【０２８０】
図５１はこの実施の形態１６による並列画像処理装置の処理を示すフローチャートであり、図５２は、各プロセッサへの処理の割り付けと、各プロセッサでのアクセスパターンの一例を示す図である。図５３はライン単位の処理パターンを示す図である。
【０２８１】
次に動作について説明する。
ステップＳＴ１３１において、メモリ割付手段１４は、後処理対応メモリ割付手段３０の指示により画像ブロックを算出する。ステップＳＴ１３２において、後処理対応メモリ割付手段３０は、コーナーターン処理後に実行する事後実行処理の内容を入手する。入手の方法は、実施の形態１５の前処理対応メモリ割付手段２９と同様である。
【０２８２】
ステップＳＴ１３３において、後処理対応メモリ割付手段３０は、入手した事後実行処理が画素単位の処理であるかをチェックし、事後実行処理が、各画素単位の処理である場合は、ステップＳＴ１３４において、後処理対応メモリ割付手段３０は、使用プロセッサ数設定手段１２からプロセッサ１２１〜１２４の個数を入手する。
【０２８３】
ステップＳＴ１３５において、後処理対応メモリ割付手段３０は、処理前の画像に対して、コーナーターン処理と同様に、画像ブロック単位で共有メモリ１２５の領域にアクセスし、アクセスした領域において、各画素の単位で、事後実行処理を実行し、事後実行処理の計算結果を、各画素の単位で処理結果の画像に書き込むことで、結果として、画像ブロック単位を基準として、コーナーターン処理と事後実行処理を同時に行うよう、画像処理プログラム１に指示する。
【０２８４】
上記ステップＳＴ１３３において、事後実行処理が画素単位の処理でない場合には、ステップＳＴ１３６において、後処理対応メモリ割付手段３０は、ライン単位の処理パターンを選択する。そして、後処理対応メモリ割付手段３０は、図５２に示すように、画像ブロック幅の、処理前画像の列と処理結果画像の行を、各プロセッサ１２１〜１２４に割り付ける基準にする担当領域として算出する。図５２ではプロセッサ数をＭ個としている。図５２のプロセッサ（１）、プロセッサ（２）、…、プロセッサ（Ｍ）の担当領域は、仮想的なプロセッサ（１）〜（Ｍ）に割り付けた担当領域である。
【０２８５】
ただし、各段で実行する処理部分については、コーナーターン処理用に書き込んだデータをそのまま使用できる。事後実行処理は、算出した担当領域を基準に、コーナーターン処理結果の書き込み直後に、処理結果画像の１行単位で処理を実行するものとする。
【０２８６】
ステップＳＴ１３７において、後処理対応メモリ割付手段３０は、画像情報設定手段１１から画像情報、使用プロセッサ数設定手段１２からプロセッサ１２１〜１２４の個数、キャッシュ情報設定手段１３からキャッシュ情報を、それぞれ入手する。
【０２８７】
ステップＳＴ１３８において、後処理対応メモリ割付手段３０は、各プロセッサへの処理の割り付け方法を算出する。ここでは、まず、仮想的なプロセッサ（１）〜（Ｍ）に割り当てた担当領域の各段の処理を実行するのに、十分なキャッシュのラインサイズがあるかチェックし、全てのデータをキャッシュに入れて処理できるだけの十分な領域があれば、特にアクセス方法についての指示は行わない。キャッシュのラインサイズが不足し、キャッシュデータの退避が発生する場合には、データのアクセス方法を指示する。
【０２８８】
ここでは、キャッシュデータの退避方法や、キャッシュのリードミスにかかる時間から、図５３に示すように、オーバーヘッドが最小となるアクセス方法を算出する。また、各プロセッサに処理負荷が均等になるように、仮想的なプロセッサ（１）〜（Ｍ）に担当領域を算出する。
【０２８９】
ステップＳＴ１３９において、後処理対応メモリ割付手段３０は、上記で算出した割り付け方法に従って、コーナーターン処理と事後実行処理を実行するよう共有メモリ１２５の領域の割り付けを、画像処理プログラム１に指示する。
【０２９０】
この実施の形態では、実施の形態１の並列画像処理装置に後処理対応メモリ割付手段３０を追加しているが、実施の形態２〜実施の形態１４の並列画像処理装置に対しても、後処理対応メモリ割付手段３０を追加しても良い。
【０２９１】
また、実施の形態１５と同様に、ＳＡＲ画像の再生処理以外の画像処理でも適用できる。また、画像ブロックや担当領域の複数のプロセッサでの並列実行も、実施の形態１５と同様に容易にできる。
【０２９２】
以上のように、この実施の形態１６によれば、コーナーターン処理用に算出した画像ブロック単位を基準として、コーナーターン処理と事後実行処理を同時に実行し、かつ、各プロセッサ１２１〜１２４の処理負荷が均等になるように、各プロセッサ１２１〜１２４に共有メモリ１２５の対象領域を割り付けることにより、事後実行処理において、コーナーターン処理の完了待ちをする必要がなくなり、画像処理全体のオーバーヘッド時間を低減できるという効果が得られる。
【０２９３】
実施の形態１７．
図５４はこの発明の実施の形態１７による並列画像処理装置の構成を示すブロック図である。図において、２９は実施の形態１５における前処理対応メモリ割付手段、３０は実施の形態１６における後処理対応メモリ割付手段、３１は、コーナーターン処理と同時に実行する事前実行処理又は事後実行処理を選択する前後処理選択手段であり、その他は実施の形態１の図１に示す構成と同等である。図５５はこの実施の形態１７による並列画像処理装置の処理を示すフローチャートである。
【０２９４】
次に動作について説明する。
ＳＡＲ画像の再生処理では、画像サイズに加えて、ＳＡＲ画像を撮像する人工衛星や航空機等のプラットホームの種類等により、再生処理でのプラットホームの動揺補整等、再生方法が異なるケースがある。この場合、ＳＡＲ画像の再生処理装置では、複数の再生方法に柔軟に対応することが求められる。
【０２９５】
前後処理選択手段３１は、図５５のステップＳＴ１５１において、複数の種類の事前実行処理と事後実行処理を登録しておく。登録方法は、ユーザの入力による指示や、画像処理プログラム１から通知を受ける方法、画像処理プログラム１を解析して自動的に入手する方法等がある。
【０２９６】
ステップＳＴ１５２において、前後処理選択手段３１は、登録した処理の中から、コーナーターン処理と同時に実行する処理を選択する。この選択方法には、ユーザの入力による指示や、画像処理プログラム１から通知を受ける方法等がある。また、過去の実行時刻と実行方法を記録しておき、その情報から最も時間が短くなると期待されるものを選択する方法もある。
【０２９７】
前後処理選択手段３１は、ステップＳＴ１５３において、選択した処理が事前実行処理か否かをチェックし、事前実行処理の場合には、ステップＳＴ１５４において、前処理対応メモリ割付手段２９に選択した事前実行処理の実行を指示する。その後の処理は、実施の形態１５と同じ処理を行う。
【０２９８】
ステップＳＴ１５３で、事前実行処理でなければ、ステップＳＴ１５５において、前後処理選択手段３１は、後処理対応メモリ割付手段３０に選択した事後実行処理の実行を指示する。その後の処理は、実施の形態１６と同じ処理を行う。
【０２９９】
この実施の形態では、画像処理としてＳＡＲ画像の再生処理を行っているが、ＳＡＲ画像の再生処理以外でも、画像の行または列方向の処理と、各画素単位の処理で構成され、途中でコーナーターン処理に相当する処理を実行する画像処理にも適用できる。
【０３００】
この実施の形態では、実施の形態１の並列画像処理装置に前後処理選択手段３１と前処理対応メモリ割付手段２９と後処理対応メモリ割付手段３０を追加しているが、実施の形態２〜実施の形態１４の並列画像処理装置に対しても、前後処理選択手段３１と前処理対応メモリ割付手段２９と後処理対応メモリ割付手段３０を追加しても良い。
【０３０１】
以上のように、この実施の形態１７によれば、コーナーターン処理と同時に実行する事前実行処理又は事後実行処理を選択できるようにしているので、複数の処理方法から選択して処理を行う必要があるＳＡＲ画像の再生処理でも、任意の処理をコーナーターン処理と同時に実行することができ、完了待ちによるオーバーヘッド時間を低減できるという効果が得られる。
【０３０２】
【発明の効果】
以上のように、この発明によれば、画像情報、プロセッサの個数情報、キャッシュ情報を入手し、キャッシュのライトミスとリードミスにかかるオーバーヘッド時間が最小となる１辺の長さが１キャッシュラインサイズの正方形の処理領域の画像ブロックを算出し、算出した画像ブロック単位で、各プロセッサがコーナーターン処理を実行するように共有メモリの対象領域を割り付けることにより、キャッシュのライトミスとリードミスを削減し、キャッシュによるオーバーヘッド時間を低減できるという効果がある。
【０３０３】
この発明によれば、画像サイズが算出した画像ブロックの整数倍にならない時に、画像サイズが画像ブロックの整数倍になるように、画像領域を拡張して共有メモリ上に確保することにより、キャッシュによるオーバーヘッド時間を低減できるという効果がある。
【０３０４】
この発明によれば、画像サイズが算出した画像ブロックの整数倍にならない時に、画像ブロックの幅の行又は列を画像ブロックの帯として算出し、算出した画像ブロックの帯で画像のアクセス方向を指定することにより、キャッシュによるオーバーヘッド時間を低減できるという効果がある。
【０３０５】
この発明によれば、画像サイズが算出した画像ブロックの整数倍にならない時に、画像ブロックの幅の行又は列を画像ブロックの帯として算出し、算出した上記画像ブロックの帯で画像のアクセス方向を指定すると共に、算出した画像ブロックの帯の個数が、プロセッサの個数の整数倍にならない時に、画像ブロックの帯を分割して、複数のプロセッサで処理するよう指定することにより、キャッシュによるオーバーヘッド時間を低減できると共に、効率の良いコーナーターン処理を実行できるという効果がある。
【０３０６】
この発明によれば、画像サイズが画像ブロックの整数倍にならない時に、複数の修正方法を備え、その修正方法の判定条件を設定していることにより、キャッシュによるオーバーヘッド時間を低減できると共に、複数の対処方法から最適の方法を選択することができ、効率の良いコーナーターン処理を実行できるという効果がある。
【０３０７】
この発明によれば、要求仕様として与えられる時間制約と各プロセッサの処理時間等の時間制約条件を設定し、実際に使用する上記プロセッサの個数を画像処理プログラムに指定することにより、要求仕様として与えられる時間制約条件の情報に基づいて最適のプロセッサ数を算出でき、算出したプロセッサ数で効率の良いコーナーターン処理を実行できるという効果がある。
【０３０８】
この発明によれば、画像ブロックの処理に必要なキャッシュのライン数が不足する時に、キャッシュデータの退避回数が最小になる画像へのアクセスパターンを算出し、画像処理プログラムに指定することにより、キャッシュによるオーバーヘッド時間を低減できるという効果がある。
【０３０９】
この発明によれば、複数の画像サイズを持つ各画像を対象にコーナーターン処理を行う時に、各画像に対して、キャッシュのラインサイズが画素サイズの整数倍にならない場合に画素の補間を行うと共に、算出した画像ブロックの整数倍で、かつ、処理対象画像を包含する最小の画像サイズを算出し、算出した各画像の最小の画像サイズの中から最大の画像サイズを選択し、選択した最大の画像サイズの領域を共有メモリ上に確保することにより、複数の異なったサイズの画像を対象に処理する場合でも、キャッシュによるオーバーヘッド時間を低減できるという効果がある。
【０３１０】
この発明によれば、確保した共有メモリの対象領域を使用し、確保の対象となった画像よりも小さいサイズの画像処理を行う時に、選択された小さいサイズの画像に対する画像ブロック単位で、コーナーターン処理に使用する共有メモリの使用範囲を算出し、算出した使用範囲でコーナーターン処理を実行するように、各プロセッサに処理対象の領域やアクセス手順を指定することにより、キャッシュによるオーバーヘッド時間を低減できると共に、１度確保した共有メモリの領域を、複数のサイズの画像で共同して利用しながら、効率良くコーナーターン処理を実行できるという効果がある。
【０３１１】
この発明によれば、画像情報、キャッシュ情報を入手し、１次キャッシュのライトミスとリードミスにかかるオーバーヘッド時間が最小となる１辺の長さが１キャッシュラインサイズの正方形の処理領域の画像ブロックを算出し、画像情報、プロセッサの個数情報、キャッシュ情報を入手し、算出した１次キャッシュの画像ブロックをもとに、キャッシュの階層ごとに低位キャッシュの複数の画像ブロックを包含し１辺の長さが１キャッシュラインサイズの正方形の処理領域の画像ブロックを算出し、算出した最上位キャッシュの画像ブロック単位でコーナーターン処理を実行し、かつ、最上位キャッシュの画像ブロックを実行する個数がプロセッサ間で均等になるように、各プロセッサへ共有メモリの対象領域を割り付けるよう画像処理プログラムに指示することにより、キャッシュによるオーバーヘッド時間を低減できると共に、効率の良いコーナーターン処理を実行できるという効果がある。
【０３１２】
この発明によれば、画像サイズが最上位キャッシュの画像ブロックの整数倍にならない時に、画像サイズが最上位キャッシュの画像ブロックの整数倍になるように、画像領域を拡張して共有メモリ上に確保することにより、キャッシュによるオーバーヘッド時間を低減できるという効果がある。
【０３１３】
この発明によれば、上位キャッシュのラインサイズが、低位キャッシュの画像ブロックの整数倍にならない時に、上位キャッシュのラインサイズが、低位キャッシュの画像ブロックの整数倍になるように、低位キャッシュの画像ブロックの領域を拡張して共有メモリ上に確保することにより、キャッシュによるオーバーヘッド時間を低減できるという効果がある。
【０３１４】
この発明によれば、画像サイズが算出した最上位キャッシュの画像ブロックの整数倍にならない時に、最上位キャッシュの画像ブロックの行又は列を最上位キャッシュの画像ブロックの帯として算出し、算出した画像ブロックの帯で画像のアクセス方向を指定すると共に、算出した画像ブロックの帯の個数が、プロセッサの個数の整数倍にならない時に、画像ブロックの帯を分割して、複数のプロセッサで処理するよう指定することにより、キャッシュによるオーバーヘッド時間を低減できると共に、効率の良いコーナーターン処理を実行できるという効果がある。
【０３１５】
この発明によれば、算出した画像ブロックの処理に必要なキャッシュのライン数が不足する時に、キャッシュデータの退避回数が最小になる画像又は低次のキャッシュへのアクセスパターンを指定することにより、キャッシュによるオーバーヘッド時間を低減できるという効果がある。
【０３１６】
この発明によれば、画像情報、キャッシュ情報を入手し、キャッシュのライトミスとリードミスにかかるオーバーヘッド時間が最小となる１辺の長さが１キャッシュラインサイズの正方形の処理領域の画像ブロックを算出し、コーナーターン処理の前に実行する事前実行処理の内容を入手すると共に、画像情報、プロセッサの個数情報、キャッシュ情報を入手し、算出した画像ブロック単位を基準として事前実行処理とコーナーターン処理を同時に実行し、各プロセッサの負荷が均等になるように、各プロセッサに共有メモリの対象領域を割り付けることにより、コーナーターン処理において、事前実行処理の完了待ちをする必要がなくなり、画像処理全体のオーバーヘッド時間を低減できるという効果がある。
【０３１７】
この発明によれば、画像情報、キャッシュ情報を入手し、キャッシュのライトミスとリードミスにかかるオーバーヘッド時間が最小となる１辺の長さが１キャッシュラインサイズの正方形の処理領域の画像ブロックを算出し、コーナーターン処理の後に実行する事後実行処理の内容を入手すると共に、画像情報、プロセッサの個数情報、キャッシュ情報を入手し、算出した画像ブロック単位を基準としてコーナーターン処理と事後実行処理を同時に実行し、各プロセッサの負荷が均等になるように、各プロセッサに共有メモリの対象領域を割り付けることにより、事後実行処理において、コーナーターン処理の完了待ちをする必要がなくなり、画像処理全体のオーバーヘッド時間を低減できるという効果がある。
【０３１８】
この発明によれば、コーナーターン処理と同時に実行する事前実行処理又は事後実行処理を選択することにより、任意の処理をコーナーターン処理と同時に実行することができ、完了待ちによるオーバーヘッド時間を低減できるという効果がある。
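[Brief Description of the Drawings]
FIG. 1 is a block diagram showing the configuration of a parallel image processing apparatus according to Embodiment 1 of the present invention.
FIG. 2 is a flowchart showing the processing of the parallel image processing apparatus according to Embodiment 1 of the present invention.
FIG. 3 is a diagram showing the concept of basic memory allocation according to Embodiment 1 of the present invention.
FIG. 4 is a diagram showing the concept of cache access according to Embodiment 1 of the present invention.
FIG. 5 is a diagram showing a concrete example of memory allocation according to Embodiment 1 of the present invention.
FIG. 6 is a diagram showing an example in which one image block is processed by a plurality of processors according to Embodiment 1 of the present invention.
FIG. 7 is a block diagram showing the configuration of a parallel image processing apparatus according to Embodiment 2 of the present invention.
FIG. 8 is a flowchart showing the processing of the parallel image processing apparatus according to Embodiment 2 of the present invention.
FIG. 9 is a diagram showing an example of memory securing according to Embodiment 2 of the present invention.
FIG. 10 is a block diagram showing the configuration of a parallel image processing apparatus according to Embodiment 3 of the present invention.
FIG. 11 is a flowchart showing the processing of the parallel image processing apparatus according to Embodiment 3 of the present invention.
FIG. 12 is a diagram showing the processing in which the access direction is specified when the image size in the azimuth direction is not an integral multiple of the image block, according to Embodiment 3 of the present invention.
FIG. 13 is a diagram showing the processing in which the access direction is specified when the image size in the range direction is not an integral multiple of the image block, according to Embodiment 3 of the present invention.
FIG. 14 is a block diagram showing the configuration of a parallel image processing apparatus according to Embodiment 4 of the present invention.
FIG. 15 is a flowchart showing the processing of the parallel image processing apparatus according to Embodiment 4 of the present invention.
FIG. 16 is a diagram showing the processing performed when the image size in the azimuth direction is not an integral multiple of the image block band, according to Embodiment 4 of the present invention.
FIG. 17 is a diagram showing the processing performed when the image size in the range direction is not an integral multiple of the image block band, according to Embodiment 4 of the present invention.
FIG. 18 is a block diagram showing the configuration of a parallel image processing apparatus according to Embodiment 5 of the present invention.
FIG. 19 is a flowchart showing the processing of the parallel image processing apparatus according to Embodiment 5 of the present invention.
FIG. 20 is a block diagram showing the configuration of a parallel image processing apparatus according to Embodiment 6 of the present invention.
FIG. 21 is a flowchart showing the processing of the parallel image processing apparatus according to Embodiment 6 of the present invention.
FIG. 22 is a block diagram showing the configuration of a parallel image processing apparatus according to Embodiment 7 of the present invention.
FIG. 23 is a flowchart showing the processing of the parallel image processing apparatus according to Embodiment 7 of the present invention.
FIG. 24 is a diagram showing the access pattern used when the number of cache lines is insufficient, according to Embodiment 7 of the present invention.
FIG. 25 is a block diagram showing the configuration of a parallel image processing apparatus according to Embodiment 8 of the present invention.
FIG. 26 is a flowchart showing the processing of the parallel image processing apparatus according to Embodiment 8 of the present invention.
FIG. 27 is a diagram showing the concept of memory allocation according to Embodiment 8 of the present invention.
FIG. 28 is a diagram showing an example in which an image block is processed by a plurality of processors according to Embodiment 8 of the present invention.
FIG. 29 is a block diagram showing the configuration of a parallel image processing apparatus according to Embodiment 9 of the present invention.
FIG. 30 is a flowchart showing the processing of the parallel image processing apparatus according to Embodiment 9 of the present invention.
FIG. 31 is a block diagram showing the configuration of a parallel image processing apparatus according to Embodiment 10 of the present invention.
FIG. 32 is a flowchart showing the processing of the parallel image processing apparatus according to Embodiment 10 of the present invention.
FIG. 33 is a diagram showing the concept of the area expansion direction according to Embodiment 10 of the present invention.
FIG. 34 is a block diagram showing the configuration of a parallel image processing apparatus according to Embodiment 11 of the present invention.
FIG. 35 is a flowchart showing the processing of the parallel image processing apparatus according to Embodiment 11 of the present invention.
FIG. 36 is a diagram showing the concept of the access direction specification method according to Embodiment 11 of the present invention.
FIG. 37 is a block diagram showing the configuration of a parallel image processing apparatus according to Embodiment 12 of the present invention.
FIG. 38 is a flowchart showing the processing of the parallel image processing apparatus according to Embodiment 12 of the present invention.
FIG. 39 is a diagram showing the access pattern used when the number of secondary cache lines is insufficient, according to Embodiment 12 of the present invention.
FIG. 40 is a block diagram showing the configuration of a parallel image processing apparatus according to Embodiment 13 of the present invention.
FIG. 41 is a flowchart showing the processing of the parallel image processing apparatus according to Embodiment 13 of the present invention.
FIG. 42 is a diagram showing a method of interpolating the gaps between pixels according to Embodiment 13 of the present invention.
FIG. 43 is a block diagram showing the configuration of a parallel image processing apparatus according to Embodiment 14 of the present invention.
FIG. 44 is a flowchart showing the processing of the parallel image processing apparatus according to Embodiment 14 of the present invention.
FIG. 45 is a diagram showing an example of the memory usage range according to Embodiment 14 of the present invention.
FIG. 46 is a block diagram showing the configuration of a parallel image processing apparatus according to Embodiment 15 of the present invention.
FIG. 47 is a flowchart showing the processing of the parallel image processing apparatus according to Embodiment 15 of the present invention.
FIG. 48 is a diagram showing the allocation of processing to the processors and the number of processing stages according to Embodiment 15 of the present invention.
FIG. 49 is a diagram showing an example of the access pattern used when each processor executes processing at a processing stage, according to Embodiment 15 of the present invention.
FIG. 50 is a block diagram showing the configuration of a parallel image processing apparatus according to Embodiment 16 of the present invention.
FIG. 51 is a flowchart showing the processing of the parallel image processing apparatus according to Embodiment 16 of the present invention.
FIG. 52 is a diagram showing an example of the allocation of processing to the processors and the access pattern in each processor according to Embodiment 16 of the present invention.
FIG. 53 is a diagram showing the pattern of line-by-line processing according to Embodiment 16 of the present invention.
FIG. 54 is a block diagram showing the configuration of a parallel image processing apparatus according to Embodiment 17 of the present invention.
FIG. 55 is a flowchart showing the processing of the parallel image processing apparatus according to Embodiment 17 of the present invention.
FIG. 56 is a block diagram showing the configuration of a conventional system.
FIG. 57 is a block diagram showing an example of the configuration of a conventional SMP.
FIG. 58 is a diagram showing an example of a method of executing SAR image reproduction processing on a conventional SMP.
FIG. 59 is a diagram showing an example of a method of executing SAR image reproduction processing on a conventional SMP.
FIG. 60 is a diagram showing the areas on an image allocated to a plurality of processors in the conventional method.
FIG. 61 is a diagram showing the positions in memory allocated to a plurality of processors in the conventional method.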
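[Explanation of Reference Numerals]
1 image processing program, 2 platform, 3 parallel image processing apparatus, 11 image information setting means, 12 used processor number setting means, 13 cache information setting means, 14 memory allocation means, 15 memory securing means, 16 access direction specifying means, 17 multiple-write compatible access direction specifying means, 18 overflow correction method setting means, 19 processing time constraint setting means, 20 execution processor number designating means, 21 pixel access order designating means, 22 multi-level cache compatible memory allocation means, 23 multi-level cache compatible memory securing means, 24 multi-level image block memory securing means, 25 multi-level cache compatible access direction specifying means, 26 multi-level cache compatible access pattern designating means, 27 multiple-size compatible memory securing means, 28 multiple-size compatible access control means, 29 pre-processing compatible memory allocation means, 30 post-processing compatible memory allocation means, 31 pre/post-processing selection means, 101 magnetic tape, 102 range direction compression device, 103-1 two-dimensional image memory, 103-2 two-dimensional image memory, 104 two-dimensional image memory control unit, 105 azimuth direction compression device, 106 magnetic tape, 107 CPU, 108 main memory, 109 FFT device, 110 data bus, 121, 122, 123, 124 processors, 125 shared memory.
[0001]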
BACKGROUND OF THE INVENTION
The present invention relates to a parallel image processing apparatus and a parallel image processing method concerning the allocation of shared memory to processors, and the control of processor access to the shared memory, in image processing executed on a shared-memory parallel computer.
[0002]
[Prior art]
One technique for speeding up image processing is parallel processing using a plurality of processors. Parallel processing is effective when there are few dependencies within the image processing and the image size is large or the processing load is high. One example of such image processing is the reproduction processing of SAR (Synthetic Aperture Radar) images.
[0003]
A conventional example of parallel processing of SAR image reproduction processing is the "Synthetic Aperture Radar Image Processing System" disclosed in Japanese Patent Laid-Open No. 58-22982. FIG. 56 is a block diagram showing the configuration of this conventional system. In the figure, 101 is a magnetic tape that stores SAR received image data; 102 is a range direction compression device that compresses the SAR received image data stored on the magnetic tape 101 line by line in the range direction; 103-1 and 103-2 are two-dimensional image memories for corner turn (row/column transposition) processing of the range-compressed image data; and 104 is a two-dimensional image memory control unit that accesses the two-dimensional image memories 103-1 and 103-2.
[0004]
Also in FIG. 56, 105 is an azimuth direction compression device that compresses the corner-turned data in the azimuth direction, 106 is a magnetic tape that stores the compressed image data, 107 is a CPU that controls the entire system, 108 is the main memory of the CPU 107, 109 is a common-data FFT (Fast Fourier Transform) device that performs fast Fourier transforms, and 110 is a data bus.
[0005]
SAR image reproduction processing is described in Japanese Patent Laid-Open No. 58-22982 and also in commercially available books such as the "Synthetic Aperture Radar Image Handbook" (Asakura Shoten, supervised by Joji Iizaka, edited by the Japan Society of Photogrammetry, first edition, first printing, May 20, 1998), so its description is omitted here.
[0006]
Next, the operation will be described.
Here, a processing device is provided for each processing phase, and the phases are executed as a pipeline. The range direction compression device 102 and the azimuth direction compression device 105 are also executed in the pipeline. In addition, the range direction compression device 102 and the azimuth direction compression device 105 can each contain a plurality of devices that perform FFT and IFFT (Inverse Fast Fourier Transform) and execute them in parallel.
[0007]
The corner turn processing, which transposes the range-compressed image data, is executed at high speed using the two-dimensional image memories 103-1 and 103-2. To bridge the gap in processing speed within the pipeline, two two-dimensional image memories 103-1 and 103-2 are provided, together with a two-dimensional image memory control unit 104 that uses them alternately.
[0008]
As described above, this device is a dedicated system for SAR image reproduction processing in which a dedicated processing device is provided for each processing phase and the phases are executed sequentially as a pipeline on separate devices. The corner turn is executed using dedicated memory devices such as the two-dimensional image memories 103-1 and 103-2 and the two-dimensional image memory control unit 104. A system using dedicated H/W (hardware) is also described in Japanese Patent Laid-Open No. 58-191979. Systems that use dedicated H/W and two-dimensional image memories in this way can perform corner turn processing in particular at high speed, but their implementation cost is generally high.
[0009]
On the other hand, one way to realize parallel processing on a general-purpose parallel computer, which is advantageous in terms of cost, is the shared-memory, tightly coupled SMP (Symmetric Multi Processor). FIG. 57 is a block diagram showing an example of an SMP configuration. In FIG. 57, 121, 122, 123, and 124 are processors, and 125 is a shared memory. Four processors are shown here, but an arbitrary number of processors can be configured. An SMP is a shared-memory, tightly coupled parallel computer in which every processor has equivalent functions.
[0010]
In an SMP, the two-dimensional image memories 103-1 and 103-2 shown in FIG. 56 are not used; instead, the shared memory 125 is shared by the plurality of processors 121 to 124, so data transfer among the processors 121 to 124 can be performed efficiently by using the shared memory 125 effectively. When image processing is executed in parallel as well, whether the shared memory 125 can be used effectively is a key factor in processing efficiency.
[0011]
The conventional systems described above, such as those of Japanese Patent Laid-Open Nos. 58-22982 and 58-191979, can execute processing in parallel on a dedicated system, but they do not adopt an SMP configuration and have no means of using a shared memory effectively. Furthermore, although there are conventional examples of parallel SAR image reproduction processing that use such dedicated systems, there is no conventional example that uses a shared-memory SMP configuration, and no conventional example concerning a method of controlling the cache memory of a processor that performs corner turn processing.
[0012]
Next, a method for parallelizing the SAR image reproduction process in the SMP configuration will be described.
First, characteristics of the SAR image reproduction process will be described.
(1) Within each processing stage, there is little dependence on processing results, and many parts can be executed in parallel.
(2) The process has range and azimuth directionality, and row and column processes are assigned to each direction (example: range is row, azimuth is column). For example, in the processing in the range direction, the processing proceeds in the row (range) direction with the range as the row and the azimuth as the column.
(3) A process of transposing the direction of rows and columns, called a corner turn process, is performed during the process.
(4) Processing is sequentially performed on one image. That is, processing is performed using the processing results one after another.
(5) The image size is large and processing takes time.
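As an illustration of characteristic (3), the following minimal Python sketch performs a corner turn on a small image held as a flat, row-major list. The function name `corner_turn` and the flat-list layout are assumptions made for this example and are not taken from the specification.

```python
def corner_turn(src, rows, cols):
    """Transpose a rows x cols image stored row-major in a flat list."""
    dst = [0] * (rows * cols)
    for r in range(rows):              # read along the row (range) direction
        for c in range(cols):
            # write along the column (azimuth) direction of the source
            dst[c * rows + r] = src[r * cols + c]
    return dst


image = [1, 2, 3,
         4, 5, 6]                      # 2 rows x 3 columns
turned = corner_turn(image, 2, 3)      # 3 rows x 2 columns: [1, 4, 2, 5, 3, 6]
```

Reads walk through memory contiguously while writes jump by `rows` elements, which is the access asymmetry the later paragraphs are concerned with.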
[0013]
When the SAR image reproduction process having the above characteristics is implemented by the SMP shown in FIG. 57, the following implementation method can be easily considered.
(1) Areas for the data before and after processing are secured on the shared memory 125.
(2) The image is divided into an arbitrary number of parts according to the number of usable processors 121 to 124, and the parts are processed in parallel.
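Step (2) can be sketched as follows, assuming the image is divided by rows into contiguous ranges of nearly equal size. The helper name `split_rows` and the policy of giving the first processors one extra row are assumptions for illustration only.

```python
def split_rows(num_rows, num_procs):
    """Return one (start, stop) row range per processor, as evenly as possible."""
    base, extra = divmod(num_rows, num_procs)
    ranges = []
    start = 0
    for p in range(num_procs):
        count = base + (1 if p < extra else 0)  # first `extra` processors get one more row
        ranges.append((start, start + count))
        start += count
    return ranges


# 10 image rows shared among 4 processors
parts = split_rows(10, 4)   # [(0, 3), (3, 6), (6, 8), (8, 10)]
```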
[0014]
FIGS. 58 and 59 show examples of methods for executing the SAR image reproduction process on an SMP. As ways of using the shared memory 125, the following readily come to mind: (1) as shown in FIG. 58, a method in which the pre-processing and post-processing areas of the shared memory 125 are swapped for each process; (2) as shown in FIG. 59, a method in which an area of the shared memory 125 is allocated for each process and the areas are used in sequence; and (3) a method combining the two.
[0015]
Next, the cache operation in the shared memory computer will be described.
Each of the SMP processors 121 to 124 copies the contents of the shared memory 125 into a high-speed, small-capacity memory called a cache and uses (reads and writes) the copy. When a processor reads data that is not present in its cache, the data is copied from the shared memory 125 into the cache. A cache is usually divided into a plurality of equally sized storage areas called cache lines, and data is read in units of lines.
[0016]
When all the cache lines are filled when data is read, one line is selected and replaced with new data. As a method for selecting the replacement target, there is a method using the access frequency, the last access time, or the like. This reading of data into the cache is called a cache read miss process.
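The line-based reading and replacement described above can be sketched with a small simulation. This is a simplified model that assumes a fully associative cache and replacement by last access time (LRU); the function name `count_read_misses` and its parameters are hypothetical.

```python
from collections import OrderedDict


def count_read_misses(addresses, line_size, num_lines):
    """Count read misses for a fully associative cache with LRU replacement."""
    cache = OrderedDict()              # line index -> None, oldest access first
    misses = 0
    for addr in addresses:
        line = addr // line_size       # data is fetched in units of lines
        if line in cache:
            cache.move_to_end(line)    # hit: mark as most recently used
        else:
            misses += 1
            if len(cache) == num_lines:
                cache.popitem(last=False)  # evict the least recently used line
            cache[line] = None
    return misses


sequential = count_read_misses(range(64), line_size=16, num_lines=2)          # 4
strided = count_read_misses([0, 16, 32, 0, 16, 32], line_size=16, num_lines=2)  # 6
```

Sequential access misses once per line, whereas a stride that cycles through more lines than the cache holds misses on every access.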
[0017]
When the same location in the shared memory 125 is cached by different processors (that is, it exists as data in their cache memories) and data is written to it, processing is performed to maintain data consistency. Methods for this include the write-through method, in which the written data is written directly to the shared memory 125, and the write-back method, in which the cache data that was written is written back to the shared memory 125 in units of cache blocks (lines). This processing for maintaining the consistency of cache data is referred to as cache write miss processing.
[0018]
The time required for the processing at the time of the cache read miss and the time required for the processing at the time of the write miss are overhead times that cause a decrease in the performance of parallel processing.
[0019]
Cache operation on shared-memory computers is described in commercially available books such as "Parallel Processing Machine" (Toshiji Tomita and Toshinori Sueyoshi, Ohm, Computer Architecture Series, IEICE) and "Parallel Computer Architecture" (Shunfumi Okugawa, Corona), so the details are omitted here.
[0020]
Of the overhead time caused by cache operation, the time for a write miss is generally longer than the time for a read miss. If write misses occur frequently, the overhead time significantly degrades parallel processing performance. Reducing write misses is therefore a key factor in running parallel processing efficiently.
[0021]
When the conventional SMP implementation method is used, the corner turn process tends to cause cache misses. This is explained with reference to FIGS. 60 and 61. FIG. 60 is a diagram showing the areas on the image allocated to the processor 121 and the processor 122. FIG. 61 is a diagram showing the corresponding positions on the shared memory 125. In FIGS. 60 and 61, the hatched portions indicate the cached ranges. Each cache line holds a contiguous region of the shared memory 125.
[0022]
As shown in FIG. 60, the processor 121 and the processor 122 are each allocated a different area in the range direction for processing. With this allocation, contiguous areas are accessed sequentially when processing proceeds in the range direction, which suppresses cache read misses. In SAR image processing other than corner turn processing, writing is also performed in the same direction (here, the range direction), so write misses are unlikely to occur.
[0023]
[Problems to be solved by the invention]
Because the conventional parallel image processing apparatus is configured as described above, in the corner turn process the writing direction (azimuth direction) differs from the reading direction (range direction), as shown in FIG. 60. Consequently, as shown in FIG. 61, unlike the read areas, the write areas of the processors are likely to fall within the same cache range, so cache write misses occur readily and the overhead time becomes large.
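The overlap of write areas can be made concrete with a small count. The sketch below assumes a row-major image, a row-wise division of the source among processors, and a cache line holding a whole number of pixels; `shared_write_lines` and its parameters are hypothetical names, and the model ignores everything about the cache except which line each destination pixel falls in.

```python
def shared_write_lines(rows, cols, num_procs, pixels_per_line):
    """Count destination cache lines written by more than one processor
    when a naive corner turn is divided among processors by source rows.
    Assumes rows is divisible by num_procs."""
    rows_per_proc = rows // num_procs
    writers = {}                                   # line index -> set of processors
    for r in range(rows):
        proc = r // rows_per_proc
        for c in range(cols):
            line = (c * rows + r) // pixels_per_line   # line of the transposed pixel
            writers.setdefault(line, set()).add(proc)
    return sum(1 for procs in writers.values() if len(procs) > 1)


# 8 x 8 image, 4 processors, 8 pixels per cache line:
# every one of the 8 destination lines is written by all 4 processors.
contended = shared_write_lines(8, 8, 4, 8)   # 8
```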
[0024]
The SAR image reproduction process has been used as the example here, but the same problem arises in any image processing that, like the SAR image reproduction process, has directionality in each processing stage, has little dependence on processing results, and performs a corner turn process partway through.
[0025]
The present invention has been made to solve the above problems. Its object is to obtain a parallel image processing apparatus and a parallel image processing method that can reduce cache write misses and overhead time by controlling, on the basis of the cache information, the image size, and the number of usable processors, the range of the image on which each processor performs corner turn processing and the method by which it is accessed.
[0026]
A further object is to obtain a parallel image processing apparatus and a parallel image processing method that can reduce the final overhead time by taking into account not only cache write misses but also the read misses that occur when not all of the data read into the cache is used.
[0027]
[Means for Solving the Problems]
The parallel image processing apparatus according to the present invention operates on a platform including a plurality of processors and a shared memory, and instructs an image processing program, which executes image reproduction processing including corner turn processing that transposes the row and column directions of an image, how to allocate target areas of the shared memory to the plurality of processors. The apparatus includes: image information setting means for setting image information such as the size of the image to be processed and the data size of each pixel; used processor number setting means for setting information on the number of processors used in the image processing; cache information setting means for setting cache information such as the configuration and size of the cache mounted on each processor; and memory allocation means that obtains the image information set in the image information setting means, the processor number information set in the used processor number setting means, and the cache information set in the cache information setting means, calculates an image block, which is a square processing area whose side length is one cache line size and which minimizes the overhead time required for cache write misses and read misses, and allocates the target areas of the shared memory so that each processor executes the corner turn processing in units of the calculated image block.
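A minimal sketch of this idea follows. It assumes that the block side, in pixels, is the cache line size divided by the pixel size, that the image dimensions are multiples of the block side, and that blocks are handed out round-robin; the names `block_side`, `assign_blocks`, and `turn_block` and the round-robin policy are assumptions for illustration, not the allocation rule of the specification.

```python
def block_side(cache_line_bytes, pixel_bytes):
    """Side of the square image block in pixels (assumed: one cache line of pixels)."""
    return cache_line_bytes // pixel_bytes


def assign_blocks(rows, cols, side, num_procs):
    """Give each processor a list of (block_row, block_col) origins, round-robin."""
    blocks = [(br, bc) for br in range(0, rows, side) for bc in range(0, cols, side)]
    return [blocks[p::num_procs] for p in range(num_procs)]


def turn_block(src, dst, rows, cols, side, br, bc):
    """Corner-turn one side x side block of a row-major rows x cols image."""
    for r in range(br, br + side):
        for c in range(bc, bc + side):
            dst[c * rows + r] = src[r * cols + c]


rows, cols = 4, 4
side = block_side(cache_line_bytes=16, pixel_bytes=8)   # 2 pixels per line
src = list(range(rows * cols))
dst = [0] * (rows * cols)
for share in assign_blocks(rows, cols, side, num_procs=2):  # run sequentially here
    for br, bc in share:
        turn_block(src, dst, rows, cols, side, br, bc)
```

Because each block covers whole cache lines in both the source and the destination, no destination line is written by two different processors in this model.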
[0028]
The parallel image processing apparatus according to the present invention includes memory securing means that, when the image size is not an integral multiple of the calculated image block, expands the image area so that the image size becomes an integral multiple of the image block and secures the expanded area on the shared memory.
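The expansion amounts to rounding each image dimension up to the next multiple of the block side. A minimal sketch, with the hypothetical helper name `padded_size`:

```python
def padded_size(size, block):
    """Round size up to the nearest integral multiple of block."""
    return -(-size // block) * block   # ceiling division, then scale back


# a 10 x 7 image with a block side of 4 pixels is secured as 12 x 8
secured = (padded_size(10, 4), padded_size(7, 4))   # (12, 8)
```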
[0029]
The parallel image processing apparatus according to the present invention includes access direction specifying means that, when the image size is not an integral multiple of the calculated image block, calculates a row or column having the width of the image block as an image block band and specifies the image access direction in terms of the calculated image block band.
[0030]
The parallel image processing apparatus according to the present invention includes multiple-write compatible access direction specifying means that, when the image size is not an integral multiple of the calculated image block, calculates a row or column having the width of the image block as an image block band and specifies the image access direction in terms of the calculated image block band, and that, when the calculated number of image block bands is not an integral multiple of the number of processors set in the used processor number setting means, divides an image block band and specifies that it be processed by a plurality of processors.
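One possible way to distribute bands is sketched below. The text above does not say how a band is divided, so the policy here is purely an assumption: bands that fill complete rounds are handed out whole, and each leftover band is cut into equal pieces along its length, one piece per processor. The name `plan_bands` and the `(band, start, stop)` tuples are hypothetical.

```python
def plan_bands(num_bands, band_length, num_procs):
    """Return {processor: [(band, start, stop), ...]} for image block bands."""
    plan = {p: [] for p in range(num_procs)}
    whole = num_bands - num_bands % num_procs      # bands that fill complete rounds
    for b in range(whole):
        plan[b % num_procs].append((b, 0, band_length))
    step = -(-band_length // num_procs)            # piece length for leftover bands
    for b in range(whole, num_bands):
        for p in range(num_procs):
            start = min(p * step, band_length)
            stop = min(start + step, band_length)
            if start < stop:
                plan[p].append((b, start, stop))   # several processors share band b
    return plan


# 5 bands of length 8 on 2 processors: band 4 is split in half
plan = plan_bands(5, 8, 2)
```

A real implementation would presumably cut bands at image block boundaries so that the pieces stay aligned to cache lines; this sketch does not enforce that.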
[0031]
The parallel image processing apparatus according to the present invention includes: memory securing means that expands the image area so that the image size becomes an integral multiple of the calculated image block and secures the expanded area on the shared memory; access direction specifying means that calculates a row or column having the width of the image block as an image block band and specifies the image access direction in terms of the calculated image block band; multiple-write compatible access direction specifying means that calculates a row or column having the width of the image block as an image block band, specifies the image access direction in terms of the calculated image block band, and, when the calculated number of image block bands is not an integral multiple of the number of processors set in the used processor number setting means, divides an image block band and specifies that it be processed by a plurality of processors; and overflow correction method setting means that sets the conditions for deciding which correction method to apply when the image size is not an integral multiple of the image block.
[0032]
The parallel image processing apparatus according to the present invention includes processing time constraint setting means for setting time constraint conditions, such as a time constraint given as a required specification and the processing time of each processor, and execution processor number designating means for designating to the image processing program the number of processors actually used.
[0033]
The parallel image processing apparatus according to the present invention includes pixel access order designating means that, when the number of cache lines needed to process an image block is insufficient, calculates the access pattern to the image that minimizes the number of times cache data is evicted and designates it to the image processing program.
[0034]
The parallel image processing apparatus according to the present invention includes multiple-size compatible memory securing means that, when corner turn processing is performed on images of a plurality of sizes, performs pixel processing for each image when the cache line size is not an integral multiple of the pixel size, calculates for each image the minimum image size that is an integral multiple of the calculated image block and contains the image to be processed, selects the largest of the calculated minimum image sizes, and secures an area of the selected maximum image size on the shared memory.
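The size selection can be sketched as below. Each image is given with its own block side, the minimum containing size is obtained by rounding up, and the secured area takes the larger value in each dimension. Taking the maximum per dimension is an assumption made so that every image fits; the function name `secure_size` and the `(height, width, block)` tuples are hypothetical.

```python
def secure_size(images):
    """images: list of (height, width, block_side) tuples.
    Return the (height, width) of one area large enough for every image
    after each is padded to a multiple of its own block side."""
    padded = [(-(-h // b) * b, -(-w // b) * b) for h, w, b in images]
    return max(p[0] for p in padded), max(p[1] for p in padded)


# a 10 x 7 image with block side 4 pads to 12 x 8;
# a 5 x 13 image with block side 8 pads to 8 x 16
area = secure_size([(10, 7, 4), (5, 13, 8)])   # (12, 16)
```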
[0035]
The parallel image processing apparatus according to the present invention includes multiple-size compatible access control means that, when an image smaller than the image for which the area was secured is processed using the target area of the shared memory secured by the multiple-size compatible memory securing means, calculates, in image block units for the selected smaller image, the range of the shared memory used for the corner turn processing, and designates the processing target area and the access procedure to each processor so that the corner turn processing is executed within the calculated range.
[0036]
The parallel image processing apparatus according to the present invention operates on a platform including a plurality of processors and a shared memory, and instructs an image processing program, which executes image reproduction processing including corner turn processing that transposes the row and column directions of an image, how to allocate target areas of the shared memory to the plurality of processors. The apparatus includes: image information setting means for setting image information such as the size of the image to be processed and the data size of each pixel; used processor number setting means for setting information on the number of processors used in the image processing; cache information setting means for setting cache information such as the configuration and size of the multi-level cache mounted on each processor; memory allocation means that obtains the image information set in the image information setting means and the cache information set in the cache information setting means, and calculates an image block, which is a square processing area whose side length is one cache line size and which minimizes the overhead time required for primary cache write misses and read misses; and multi-level cache compatible memory allocation means that obtains the image information set in the image information setting means, the processor number information set in the used processor number setting means, and the cache information set in the cache information setting means, calculates, for each cache level and on the basis of the primary cache image block calculated by the memory allocation means, an image block that is a square processing area containing a plurality of image blocks of the lower-level cache and having a side length of one cache line size, and instructs the image processing program to allocate the target areas of the shared memory to the processors so that the corner turn processing is executed for each calculated image block of the highest-level cache and the number of highest-level cache image blocks executed is equal among the processors.
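The nesting of image blocks across cache levels can be sketched as follows, assuming the block side at each level is that level's line size divided by the pixel size, with levels listed from the primary cache upward. The function name `level_block_sides` is hypothetical, and the nesting check is only an illustration of the requirement that a higher-level block contain whole lower-level blocks.

```python
def level_block_sides(line_bytes_per_level, pixel_bytes):
    """Return (sides, nests): the block side in pixels for each cache level
    (primary cache first) and, for each adjacent pair, whether the
    higher-level block is an integral multiple of the lower-level one."""
    sides = [line // pixel_bytes for line in line_bytes_per_level]
    nests = [upper % lower == 0 for lower, upper in zip(sides, sides[1:])]
    return sides, nests


# 32-byte primary lines and 128-byte secondary lines with 8-byte pixels:
# a 16 x 16 secondary block contains 4 x 4 primary blocks of 4 x 4 pixels
sides, nests = level_block_sides([32, 128], 8)   # ([4, 16], [True])
```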
[0037]
The parallel image processing apparatus according to the present invention includes multi-level cache compatible memory securing means that, when the image size is not an integral multiple of the image block of the highest-level cache, expands the image area so that the image size becomes an integral multiple of the image block of the highest-level cache and secures the expanded area on the shared memory.
[0038]
The parallel image processing apparatus according to the present invention includes multi-level image block memory securing means that, when the line size of the higher-level cache is not an integral multiple of the image block of the lower-level cache, expands the area of the image block of the lower-level cache so that the line size of the higher-level cache becomes an integral multiple of the image block of the lower-level cache, and secures the expanded area on the shared memory.
[0039]
The parallel image processing apparatus according to the present invention includes multi-level cache compatible access direction specifying means that, when the image size is not an integral multiple of the calculated image block of the highest-level cache, calculates a row or column of highest-level cache image blocks as an image block band of the highest-level cache and specifies the image access direction in terms of the calculated image block band, and that, when the calculated number of image block bands is not an integral multiple of the number of processors set in the used processor number setting means, divides an image block band and specifies that it be processed by a plurality of processors.
[0040]
The parallel image processing apparatus according to the present invention includes multi-level cache compatible access pattern designating means that, when the number of cache lines needed to process the calculated image block is insufficient, designates the access pattern to the image or to the lower-level cache that minimizes the number of times cache data is evicted.
[0041]
The parallel image processing apparatus according to the present invention operates on a platform including a plurality of processors and a shared memory, and instructs an image processing program, which executes image reproduction processing including corner turn processing that transposes the row and column directions of an image, how to allocate target areas of the shared memory to the plurality of processors. The apparatus includes: image information setting means for setting image information such as the size of the image to be processed and the data size of each pixel; used processor number setting means for setting information on the number of processors used in the image processing; cache information setting means for setting cache information such as the configuration and size of the cache mounted on each processor; memory allocation means that obtains the image information set in the image information setting means and the cache information set in the cache information setting means, and calculates an image block, which is a square processing area whose side length is one cache line size and which minimizes the overhead time required for cache write misses and read misses; and pre-processing compatible memory allocation means that obtains the contents of a pre-execution process executed before the corner turn processing, together with the image information set in the image information setting means, the processor number information set in the used processor number setting means, and the cache information set in the cache information setting means, causes the pre-execution process and the corner turn processing to be executed simultaneously in units of the image block calculated by the memory allocation means, and allocates the target areas of the shared memory to the processors so that the loads on the processors are equal.
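Executing a pre-execution process together with the corner turn can be pictured as applying that process to each pixel while the block is being transposed, so that no separate pass over the image has to finish first. The sketch below assumes a per-pixel pre-execution process, which is a simplification (the range compression of SAR processing operates on whole lines); `fused_block` and `pre` are hypothetical names.

```python
def fused_block(src, dst, rows, cols, side, br, bc, pre):
    """Apply pre() to each pixel of one image block and store the result
    transposed, so pre-execution and corner turn share a single pass."""
    for r in range(br, min(br + side, rows)):
        for c in range(bc, min(bc + side, cols)):
            dst[c * rows + r] = pre(src[r * cols + c])


rows, cols, side = 2, 4, 2
src = [1, 2, 3, 4,
       5, 6, 7, 8]
dst = [0] * (rows * cols)
for bc in range(0, cols, side):                    # two blocks along the row
    fused_block(src, dst, rows, cols, side, 0, bc, pre=lambda v: v * 10)
# dst is now the transposed image with every pixel multiplied by 10
```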
[0042]
The parallel image processing apparatus according to the present invention operates on a platform including a plurality of processors and a shared memory, and instructs an image processing program, which executes image reproduction processing including corner turn processing that transposes the row and column directions of an image, how to allocate target areas of the shared memory to the plurality of processors. The apparatus includes: image information setting means for setting image information such as the size of the image to be processed and the data size of each pixel; used processor number setting means for setting information on the number of processors used in the image processing; cache information setting means for setting cache information such as the configuration and size of the cache mounted on each processor; memory allocation means that obtains the image information set in the image information setting means and the cache information set in the cache information setting means, and calculates an image block, which is a square processing area whose side length is one cache line size and which minimizes the overhead time required for cache write misses and read misses; and post-processing compatible memory allocation means that obtains the contents of a post-execution process executed after the corner turn processing, together with the image information set in the image information setting means, the processor number information set in the used processor number setting means, and the cache information set in the cache information setting means, causes the corner turn processing and the post-execution process to be executed simultaneously in units of the image block calculated by the memory allocation means, and allocates the target areas of the shared memory to the processors so that the loads on the processors are equal.
[0043]
The parallel image processing apparatus according to the present invention operates on a platform including a plurality of processors and a shared memory, and instructs an image processing program, which executes image reproduction processing including corner turn processing that transposes the row and column directions of an image, how to allocate target areas of the shared memory to the plurality of processors. The apparatus includes: image information setting means for setting image information such as the size of the image to be processed and the data size of each pixel; used processor number setting means for setting information on the number of processors used in the image processing; cache information setting means for setting cache information such as the configuration and size of the cache mounted on each processor; memory allocation means that obtains the image information set in the image information setting means and the cache information set in the cache information setting means, and calculates an image block, which is a square processing area whose side length is one cache line size and which minimizes the overhead time required for cache write misses and read misses; pre-processing compatible memory allocation means that obtains the contents of a pre-execution process executed before the corner turn processing, together with the image information, the processor number information, and the cache information set in the respective setting means, causes the pre-execution process and the corner turn processing to be executed simultaneously in units of the image block calculated by the memory allocation means, and allocates the target areas of the shared memory to the processors so that the loads on the processors are equal; post-processing compatible memory allocation means that obtains the contents of a post-execution process executed after the corner turn processing, together with the image information, the processor number information, and the cache information set in the respective setting means, causes the corner turn processing and the post-execution process to be executed simultaneously in units of the image block calculated by the memory allocation means, and allocates the target areas of the shared memory to the processors so that the loads on the processors are equal; and pre/post-processing selection means for selecting the pre-execution process or the post-execution process.
[0044]
The parallel image processing method according to the present invention uses a plurality of processors and a shared memory and instructs an image processing program, which executes image reproduction processing including corner turn processing that transposes the row and column directions of an image, how to allocate target areas of the shared memory to the plurality of processors. The method obtains image information such as the size of the image to be processed and the data size of each pixel, obtains information on the number of processors used in the image processing, and obtains cache information such as the configuration and size of the cache mounted on each processor; calculates, from the obtained image information and cache information, an image block, which is a square processing area whose side length is one cache line size and which minimizes the overhead time required for cache write misses and read misses; and, on the basis of the obtained processor number information, allocates the target areas of the shared memory so that each processor executes the corner turn processing in units of the calculated image block.
[0045]
In the parallel image processing method according to the present invention, when the image size is not an integral multiple of the calculated image block, the image area is expanded so that the image size becomes an integral multiple of the image block, and the expanded area is secured on the shared memory.
[0046]
In the parallel image processing method according to the present invention, when the image size is not an integral multiple of the calculated image block, a row or column of image block width is calculated as an image block band, and the image access direction is specified within the calculated image block band.
[0047]
In the parallel image processing method according to the present invention, when the image size is not an integral multiple of the calculated image block, a row or column of image block width is calculated as an image block band, and the image access direction is specified within the calculated image block band; in addition, when the calculated number of image block bands is not an integral multiple of the number of processors, the method specifies that an image block band be divided and processed by a plurality of processors.
[0048]
The parallel image processing method according to the present invention calculates and specifies an access pattern to an image that minimizes the number of cache data saves when the number of cache lines necessary for image block processing is insufficient.
[0049]
In the parallel image processing method according to the present invention, when the corner turn process is performed on each of a plurality of images of different sizes, the pixel size is determined for each image when the cache line size is not an integral multiple of the pixel size, the minimum image size that is an integral multiple of the calculated image block and that contains the image to be processed is calculated for each image, the maximum image size is selected from among the minimum image sizes calculated for the images, and an area of the selected maximum image size is secured on the shared memory.
[0050]
The parallel image processing method according to the present invention uses the secured target area of the shared memory and, when processing an image smaller than the secured image, calculates in units of image blocks the range of the shared memory used for the corner turn process on that smaller image, and designates the processing target area and the access procedure to each processor so that the corner turn process is executed within the calculated range.
[0051]
The parallel image processing method according to the present invention uses a plurality of processors and a shared memory and, when executing image reproduction processing that includes corner turn processing for transposing the row-direction and column-direction arrangement of an image, instructs the allocation of the target area of the shared memory to the processors. The method obtains image information such as the size of the processing target image and the data size of each pixel; obtains information on the number of processors used in the image processing; obtains cache information such as the configuration and size of the multi-level cache mounted on each processor; calculates, from the obtained image information and cache information, the image block of the primary cache, which is a square processing area whose side is one cache line size and which minimizes the overhead time required for write misses and read misses in the primary cache; calculates, from the image information, the cache information, and the calculated image block of the primary cache, for each cache level, an image block that is a square processing area whose side is one cache line size and that contains a plurality of image blocks of the lower-level cache; and, based on the obtained processor number information, allocates the target area of the shared memory to each processor so that the corner turn processing is executed in units of the calculated image block of the highest-level cache and the number of such image blocks executed is equalized among the processors.
[0052]
In the parallel image processing method according to the present invention, when the image size is not an integral multiple of the image block of the highest-level cache, the image area is expanded and secured on the shared memory so that the image size becomes an integral multiple of the image block of the highest-level cache.
[0053]
In the parallel image processing method according to the present invention, when the line size of the upper-level cache is not an integral multiple of the image block of the lower-level cache, the image block area of the lower-level cache is expanded and secured on the shared memory so that the line size of the upper-level cache becomes an integral multiple of the image block of the lower-level cache.
[0054]
In the parallel image processing method according to the present invention, when the image size is not an integral multiple of the calculated image block of the highest-level cache, a row or column of the width of that image block is calculated as the image block band of the highest-level cache, and the image access direction is specified within the calculated image block band; in addition, when the calculated number of image block bands is not an integral multiple of the number of processors, the method specifies that an image block band be divided and processed by a plurality of processors.
[0055]
In the parallel image processing method according to the present invention, when the number of cache lines necessary for processing the calculated image block is insufficient, an access pattern to the image or to the lower-level cache that minimizes the number of cache data saves is designated.
[0056]
The parallel image processing method according to the present invention uses a plurality of processors and a shared memory and, when executing image reproduction processing that includes corner turn processing for transposing the row-direction and column-direction arrangement of an image, instructs the allocation of the target area of the shared memory to the processors. The method obtains image information such as the size of the processing target image and the data size of each pixel; obtains information on the number of processors used in the image processing; obtains cache information such as the configuration and size of the cache mounted on each processor; calculates, from the obtained image information and cache information, an image block, which is a square processing area whose side is one cache line size and which minimizes the overhead time required for cache write misses and read misses; obtains the contents of the pre-execution process executed before the corner turn process; and, based on the obtained image information, processor number information, cache information, and contents of the pre-execution process, allocates the target area of the shared memory to each processor so that the pre-execution process and the corner turn process are executed simultaneously in units of the calculated image block and the load on the processors is equalized.
[0057]
The parallel image processing method according to the present invention uses a plurality of processors and a shared memory and, when executing image reproduction processing that includes corner turn processing for transposing the row-direction and column-direction arrangement of an image, instructs the allocation of the target area of the shared memory to the processors. The method obtains image information such as the size of the processing target image and the data size of each pixel; obtains information on the number of processors used in the image processing; obtains cache information such as the configuration and size of the cache mounted on each processor; calculates, from the obtained image information and cache information, an image block, which is a square processing area whose side is one cache line size and which minimizes the overhead time required for cache write misses and read misses; obtains the contents of the post-execution process executed after the corner turn process; and, based on the obtained image information, processor number information, cache information, and contents of the post-execution process, allocates the target area of the shared memory to each processor so that the corner turn process and the post-execution process are executed simultaneously in units of the calculated image block and the load on the processors is equalized.
[0058]
DETAILED DESCRIPTION OF THE INVENTION
An embodiment of the present invention will be described below.
Embodiment 1.
FIG. 1 is a block diagram showing the configuration of a parallel image processing apparatus according to Embodiment 1 of the present invention. In the figure, 1 is an image processing program that executes SAR image reproduction processing; 2 is a platform composed of H/W (hardware) such as the processors 121 to 124, the shared memory 125, and the caches shown in FIG. 57, together with an OS (Operating System), a parallelization library, and the like; and 3 is a parallel image processing apparatus that operates on the platform 2 and instructs the image processing program 1 on how to allocate the target area of the shared memory 125 to each of the processors 121 to 124 in corner turn processing.
[0059]
In the parallel image processing apparatus 3 of FIG. 1, 11 is image information setting means for setting information on the processing target image, such as its size (number of pixels), the data size of each pixel, and the processing that can be executed before and after corner turn processing; 12 is used processor number setting means for setting the number of processors used for image processing; and 13 is cache information setting means for setting the configuration and size of the cache mounted on each processor. 14 is memory allocation means that obtains the information held by the image information setting means 11, the used processor number setting means 12, and the cache information setting means 13, calculates the image block, that is, the processing area that minimizes the overhead time required for cache write misses and read misses, and instructs the image processing program 1 to allocate the target area of the shared memory so that each processor performs corner turn processing in units of the calculated image block.
[0060]
FIG. 2 is a flowchart showing the processing of the parallel image processing apparatus according to Embodiment 1 of the present invention. FIG. 3 is a diagram showing the concept of basic memory allocation in the parallel image processing apparatus 3. FIG. 4 is a diagram showing the concept of cache access in each processor when the parallel image processing apparatus 3 is used.
[0061]
FIG. 5 is a diagram showing a specific example of memory allocation. Here, it is assumed that the processing target image has 8 bytes per pixel and an image size of 2048 pixels in the range direction × 16384 pixels in the azimuth direction. It is further assumed that the cache mounted on each processor is a data cache of 32 bytes per line × 512 lines, consisting of a primary cache only.
[0062]
FIG. 6 is a diagram showing an example in which one image block is processed by a plurality of processors. In this example, it is assumed that the processing assigned to each processor is not saved (preempted) partway through. It is also assumed that separate areas for the image before and after processing are secured on the shared memory 125, and that the corner turn processing converts data arranged for processing in the range direction (row: range, column: azimuth) into data arranged for processing in the azimuth direction (row: azimuth, column: range).
[0063]
The parallel image processing apparatus 3 operates on the platform 2 and instructs the image processing program 1 on the memory allocation method for corner turn processing. The parallel image processing apparatus 3 can be implemented in three ways: provided independently as a parallelization support library, incorporated into the OS or the parallelization library that forms the platform 2, or incorporated into the image processing program 1. This embodiment describes the case in which it is provided independently as a parallelization support library.
[0064]
Next, the operation will be described.
The image information setting unit 11 sets the image size (number of pixels), the data size of each pixel, processing information that can be executed before and after the corner turn process, and the like as information about the processing target image. In this example, information of 8 bytes per pixel and image size of 2048 pixels in the range direction × 16384 pixels in the azimuth direction is set. The setting of the image information in the image information setting means 11 can be manually performed by user input or the like, or can be automatically performed from the image data or the image processing program 1.
[0065]
The used processor number setting means 12 sets information on the number of processors used in image processing. This number information is the maximum number of processors that can be used, and basically, image processing is performed with a specified number. The setting of the processor number information in the used processor number setting means 12 may be a method specified in the image processing program 1 or a method performed by user input.
[0066]
The cache information setting unit 13 sets cache information such as the configuration of the cache mounted on the processor (hierarchical, or primary cache only), the size (capacity) of each line, the number of lines, the consistency maintaining method, the data saving order, the overhead time required for write misses and read misses, and the time required to read and write data in the cache.
[0067]
In this example, the cache mounted on the processor is a data cache of 32 bytes per line × 512 lines, consisting of a primary cache only; the consistency maintaining method is write back, and data is saved in order of last access time. The cache information in the cache information setting means 13 can be set manually by user input or the like, or automatically via an I/F (Interface) provided by the system.
[0068]
Next, the operation of the memory allocation unit 14 will be described.
The memory allocation means 14 obtains cache information from the cache information setting means 13 in step ST1 of FIG. 2, obtains image information from the image information setting means 11 in step ST2, and obtains information on the number of processors to be used from the used processor number setting means 12 in step ST3.
[0069]
In step ST4, the memory allocation unit 14 calculates the image block, which is the unit of processing, from the cache information obtained in step ST1 and the image information obtained in step ST2. An image block is a processing area that minimizes the overhead time required for cache write misses and read misses. Here, the image block is a square area on the image whose side is one cache line size.
[0070]
After calculating the image block in step ST4, in step ST5 the memory allocation unit 14, based on the processor number information obtained in step ST3, instructs the image processing program 1 to allocate the target area of the shared memory 125 so that each of the processors 121 to 124 performs corner turn processing in units of the calculated image block. The image processing program 1 is written so that each of the processors 121 to 124 performs corner turn processing on the area of the shared memory 125 designated from outside.
[0071]
The image block of the pre-processing image includes all data necessary for writing in the image block of the corresponding processing result image after the corner turn processing, and all the data of the read image block is the image of the processing result image. Written to the block.
[0072]
When the plurality of processors 121 to 124 shown in FIG. 57 perform corner turn processing, cache write misses do not occur as long as the image areas handled by the processors 121 to 124 do not overlap at image block granularity. Performing corner turn processing in units of the calculated image block, without overlap, therefore eliminates cache write misses on the writing side. Further, since each piece of data needs to be read only once, the overhead time due to cache read misses is minimized.
[0073]
FIG. 3 shows an outline of corner turn processing by a plurality of processors in units of image blocks. Here, the processor 121 and the processor 122 are in charge of different image blocks. For this reason, corner turn processing is executed in the image areas of the pre-processing image and the processing result image without overlapping the cached data.
[0074]
FIG. 4 shows a cache usage state in each processor when corner turn processing is performed in units of image blocks. Here, it is assumed that there is a cache having a sufficient number of lines to perform corner turn processing in units of image blocks. For this reason, different cache lines are assigned to reading and writing, and corner turn processing in units of image blocks is executed without saving necessary data.
[0075]
A specific execution example is described with reference to FIG. 5.
The numbers 1, 2, 3, ... in FIG. 5 are pixel numbers; they show the correspondence between the reading position in the pre-processing image before corner turn processing and the writing position in the processing result image after corner turn processing. The figure also shows the corresponding positions between the image and the memory space.
[0076]
Here, since the cache line size is 32 bytes, the image block is an area of 32 bytes × 32 bytes. Since one pixel is 8 bytes, one cache line holds four pixels, and the image block is an area of 4 pixels × 4 pixels on the image. Since the whole image is 2048 pixels in the range direction × 16384 pixels in the azimuth direction, it consists of 512 × 4096 image blocks.
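The arithmetic in this paragraph can be sketched as follows. This is only an illustration under the stated example values; the function name `image_block_geometry` and its parameters are hypothetical and do not appear in the original description.

```python
def image_block_geometry(line_bytes, pixel_bytes, range_pixels, azimuth_pixels):
    """Return (pixels per block side, blocks along range, blocks along azimuth).

    Hypothetical helper: assumes a cache line holds a whole number of pixels
    and that the image size is an integral multiple of the image block.
    """
    if line_bytes % pixel_bytes != 0:
        raise ValueError("cache line size must be a whole number of pixels")
    side = line_bytes // pixel_bytes  # pixels held by one cache line
    if range_pixels % side or azimuth_pixels % side:
        raise ValueError("image size is not an integral multiple of the image block")
    return side, range_pixels // side, azimuth_pixels // side


# Example values from the text: 32-byte lines, 8-byte pixels, 2048 x 16384 image.
side, blocks_range, blocks_azimuth = image_block_geometry(32, 8, 2048, 16384)
print(side, blocks_range, blocks_azimuth)  # 4 512 4096
```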
[0077]
Since the image block is an area on the image, it occupies a discontinuous area in the space of the shared memory 125. As shown in FIG. 5, areas of one cache line size (4 pixels) are scattered through the space of the shared memory 125.
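To illustrate why one image block is scattered in memory, the sketch below lists the byte offsets of the cache lines that make up a block. It assumes a row-major image that starts at offset 0 on a cache line boundary; that layout and the name `block_line_offsets` are assumptions made for illustration, not details given in the description.

```python
def block_line_offsets(block_row, block_col, row_pixels, pixel_bytes=8, line_bytes=32):
    """Byte offsets of the cache lines forming one image block.

    Assumes a row-major image whose first pixel sits at offset 0 (hypothetical layout).
    """
    side = line_bytes // pixel_bytes      # pixels per cache line = block side
    row_bytes = row_pixels * pixel_bytes  # bytes in one image row
    first_row = block_row * side
    return [(first_row + r) * row_bytes + block_col * line_bytes for r in range(side)]


# Block (0, 0) of an image with 2048 pixels per row: the four lines are
# one full image row (16384 bytes) apart.
offsets = block_line_offsets(0, 0, 2048)
print(offsets)  # [0, 16384, 32768, 49152]
```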
[0078]
Corner turn processing in units of image blocks is illustrated by pixel numbers 1 to 16 in FIG. 5. On the reading side, the block is read as four cache lines: (1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12), and (13, 14, 15, 16). After corner turn processing, it is written as four cache lines: (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15), and (4, 8, 12, 16).
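The read and write line groupings above are simply the transpose of a 4 × 4 block. A minimal sketch, using tuples to stand in for cache lines (the name `transpose_block` is hypothetical):

```python
def transpose_block(read_lines):
    """Transpose one image block given as a list of equally long cache lines."""
    return [tuple(column) for column in zip(*read_lines)]


read_lines = [(1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12), (13, 14, 15, 16)]
write_lines = transpose_block(read_lines)
print(write_lines)
# [(1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15), (4, 8, 12, 16)]
```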
[0079]
Within the corner turn processing of one image block there is no dependency between the individual operations, and all the data is assumed to be in the caches of the processors 121 to 124. The processing may therefore start from any pixel.
[0080]
By performing this operation on all 512 × 4096 image blocks, corner turn processing is performed on the entire image.
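As a rough sketch of how the block-wise operation covers the whole image, the code below transposes an image one block at a time and writes each source block (i, j) to result block (j, i). Nested Python lists stand in for the shared memory, a small test image replaces the 2048 × 16384 example, and the function name `corner_turn_by_blocks` is hypothetical.

```python
def corner_turn_by_blocks(image, side):
    """Transpose `image` (list of rows) by visiting it in side x side blocks."""
    rows, cols = len(image), len(image[0])
    if rows % side or cols % side:
        raise ValueError("image size must be an integral multiple of the block")
    result = [[None] * rows for _ in range(cols)]
    for block_row in range(0, rows, side):
        for block_col in range(0, cols, side):
            # Blocks are independent, so they could be handled in any order
            # or by different processors.
            for r in range(block_row, block_row + side):
                for c in range(block_col, block_col + side):
                    result[c][r] = image[r][c]
    return result


image = [[r * 12 + c for c in range(12)] for r in range(8)]
turned = corner_turn_by_blocks(image, 4)
```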
[0081]
In step ST5 shown in FIG. 2, when assigning image blocks to the processors 121 to 124, it is more efficient to equalize the processing load among the processors 121 to 124. Since there is no dependency in the processing between the image blocks, any image block can be assigned to any processor 121 to 124 and executed in parallel. For this reason, basically, attention is paid to the number of image blocks to be assigned to each of the processors 121 to 124, and the processing loads are assigned to be equal.
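One simple way to equalize the number of image blocks per processor is sketched below. Giving each processor a contiguous range of block indices is an assumption made for illustration; the description only requires that the loads be equal, and the name `assign_blocks` is hypothetical.

```python
def assign_blocks(total_blocks, num_processors):
    """Split block indices 0..total_blocks-1 into near-equal half-open ranges."""
    base, extra = divmod(total_blocks, num_processors)
    ranges = []
    start = 0
    for p in range(num_processors):
        count = base + (1 if p < extra else 0)  # first `extra` processors get one more
        ranges.append((start, start + count))
        start += count
    return ranges


# 512 x 4096 image blocks shared by four processors.
ranges = assign_blocks(512 * 4096, 4)
print(ranges[0])  # (0, 524288)
```

When the block count does not divide evenly, this sketch leaves the loads differing by one whole block; it does not model splitting a single block among processors.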
[0082]
On the other hand, in this parallel image processing apparatus, since the image size and the number of processors can be arbitrarily set, there are cases where image blocks cannot be divided equally. In this case, as shown in FIG. 6, one image block can be executed by a plurality of processors 121 to 124.
[0083]
In FIG. 6, the four processors 121 to 124 each read all the data in the image block on the pre-processing image side. In writing to the processing result image, data is written only to the designated line in the image block so that the write destination caches do not overlap each other. For this reason, there is a read miss that wastes a part of the read data, but a write miss in writing does not occur. In addition, since there is no dependency between the processes in the processors 121 to 124, the processes can be executed in parallel.
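The division of one image block among several processors can be sketched as follows: every processor reads the whole block, but each writes only its own output lines, so no two processors write the same cache line. The round-robin assignment of output lines and the name `split_block_writes` are assumptions for illustration.

```python
def split_block_writes(read_lines, num_processors):
    """Map processor index -> {output line index: line contents} for one block.

    Each processor is assumed to read all of `read_lines`; output lines are
    dealt out round-robin so that write destinations never overlap.
    """
    side = len(read_lines)
    plan = {}
    for p in range(num_processors):
        my_lines = range(p, side, num_processors)
        plan[p] = {w: tuple(line[w] for line in read_lines) for w in my_lines}
    return plan


read_lines = [(1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12), (13, 14, 15, 16)]
plan = split_block_writes(read_lines, 4)
print(plan[0])  # {0: (1, 5, 9, 13)}
```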
[0084]
Normally, when an image block is processed by one processor, four lines of data are read and four lines of data are written. In this processing method, each of the processors 121 to 124 reads four lines of data and writes one line of data.
[0085]
For this reason, the time each processor spends on one image block is shorter than when a single processor handles the whole block, although the number of image blocks each processor must handle increases. In the example of FIG. 6, the writing time for three lines is saved per image block. If reading and writing a line of data in the cache take the same time, the time per image block is reduced to 5/8 (four lines read plus one line written, instead of four read plus four written). On the other hand, the number of image blocks each processor handles increases fourfold.
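The 5/8 figure can be reproduced with a small cost model. It assumes, as the text does, that reading and writing one cache line cost the same, and that output lines divide evenly among the processors sharing a block; the function name `split_block_time_ratio` is hypothetical.

```python
from fractions import Fraction


def split_block_time_ratio(lines_per_block, processors_per_block):
    """Per-processor time for a shared block relative to a block handled alone."""
    alone = 2 * lines_per_block  # all lines read + all lines written
    # Every processor reads all lines but writes only its share.
    shared = lines_per_block + Fraction(lines_per_block, processors_per_block)
    return shared / alone


ratio = split_block_time_ratio(4, 4)
print(ratio)  # 5/8
```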
[0086]
In SAR image processing, the image size is generally large, so the number of image blocks to be processed is larger than the number of processors 121 to 124. In this case, the processing can first be divided in units of image blocks, and only the leftover image blocks that cannot be divided evenly are split and executed by two or more of the processors 121 to 124.
[0087]
Conversely, when the image size is small and the number of image blocks to be processed is small compared to the number of processors 121 to 124, the image block is divided and executed by a plurality of processors 121 to 124. Processing time can also be shortened.
[0088]
In this embodiment, the present invention is applied to the SAR image reproduction process. However, the present invention can be applied to a parallel image process that executes a process corresponding to the corner turn process in addition to the SAR image reproduction process.
[0089]
In this embodiment, a specific image size and cache size are assumed, but any image (pixel size, number of pixels) and any cache (configuration, number of lines, size, consistency maintaining method, data saving order) may be used, and the number of processors may be any number.
[0090]
In this embodiment, corner turn processing from (row: range, column: azimuth) to (row: azimuth, column: range) is executed, but corner turn processing from (row: azimuth, column: range) to (row: range, column: azimuth) may be used instead. The invention also applies to image processing that does not define range and azimuth, as long as the processing content is the same as corner turn processing.
[0091]
In this embodiment, the parallel image processing apparatus 3 is provided independently as a parallelization support library. However, it may instead be incorporated into the OS or the parallelization library that forms the platform 2, or into the image processing program 1.
[0092]
In this embodiment, the processing area of each processor is set by designating it to the image processing program 1, but the parallel image processing apparatus 3 may instead designate it directly by using a function of the platform 2 or the like.
[0093]
In this embodiment, it is assumed that the processing assigned to each processor is not saved partway through. However, as long as the processing of a single image block is not interrupted, a system that saves processing at a fixed interval, such as a time sharing system, may be used. Even if the processing of an image block is saved partway through, a cache write miss can occur only between the processor whose processing was saved and the processor that newly executes it, so the probability of a cache write miss is low.
[0094]
In this embodiment, corner turn processing alone is described as an example. However, even when a series of image processing steps is pipelined on an SMP, the parallel image processing apparatus 3 may be applied to the corner turn processing portion.
[0095]
As described above, according to the first embodiment, when corner turn processing for transposing the vertical and horizontal arrangement of an image is executed, the image block, that is, the processing area that minimizes the overhead time for cache write misses and read misses, is calculated from information on the processing target image, such as the image size (number of pixels) and the data size of each pixel, and from cache information such as the cache configuration, size, number of cache lines, and cache coherency maintenance method. The target area of the shared memory 125 is then allocated so that processing is performed in units of image blocks and the processing target positions do not overlap between the processors 121 to 124. As a result, writing does not cause cache write misses and each piece of data needs to be read only once, which has the advantage of reducing the overhead time due to the cache.
[0096]
Further, based on the information on the number of processors 121 to 124 used for image processing, the image to be processed is allocated equally to the processors 121 to 124 in units of image blocks. This distributes the load effectively in parallel processing and realizes efficient corner turn processing.
[0097]
Further, when load distribution is performed, the processing of one image block is performed by the plurality of processors 121 to 124, so that it is possible to achieve more fair load distribution and to speed up the processing of one image block.
[0098]
Embodiment 2.
FIG. 7 is a block diagram showing the configuration of a parallel image processing apparatus according to Embodiment 2 of the present invention. In the figure, reference numeral 15 denotes memory securing means that, when the image size is not an integral multiple of the calculated image block, expands the image area and secures it on the shared memory 125 so that the image size becomes an integral multiple of the image block. The rest of the configuration is the same as that of the first embodiment shown in FIG. 1. FIG. 8 is a flowchart showing the processing of the parallel image processing apparatus according to the second embodiment, and FIG. 9 is a diagram showing an execution example of memory securing. This embodiment differs from the first embodiment in that the image size is not an integral multiple of the calculated image block size.
[0099]
Next, the operation will be described.
The processing from step ST1 to step ST4 in FIG. 8, which calculates the image block, is the same as in the first embodiment. In step ST11, the memory allocation unit 14 checks whether the image size is an integral multiple of the calculated image block. If it is, then in step ST5, as in the first embodiment, the memory allocation unit 14 instructs the image processing program 1 to allocate areas of the shared memory 125 to the processors 121 to 124.
[0100]
If the image size is not an integral multiple of the calculated image block in step ST11, then, in response to an instruction from the memory allocation unit 14, the memory securing unit 15 expands the image area and secures it on the shared memory 125 so that the image size becomes an integral multiple of the image block, as shown in FIG. 9.
[0101]
In this example, the image area is expanded by the image processing program 1 when the memory securing unit 15 designates the image size to the image processing program 1. The expanded image area is executed in the same manner as the other parts in the corner turn process, but is ignored in processes other than the corner turn process.
[0102]
FIG. 9A shows an example in which the image size is not an integral multiple of the image block in the azimuth direction. Here, it is assumed that data can be read from the cache line in an arbitrary position. Since it does not become an integral multiple, if processing is performed as it is, writing at the right end of the data after image processing affects the cache at the left end of the next line. For example, if A and E are to be written, they are written to the cache in units of (A, E, 2, 6). On the other hand, in the upper left image block, data is written to the cache in units of (2, 6, 10, 14). Thus, there are two caches managing (2, 6), and a write miss occurs between the two caches.
[0103]
Even when reading from an address that is an integral multiple of the line size, the cache line can be written only in units of (A, E, 2, 6) and (10, 14,...). If written as is, a cache write miss occurs.
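The conflict described above arises because a cache line in the processing result image straddles two image rows when the row length is not a multiple of the pixels per line. The sketch below finds such lines; it assumes a row-major result image starting on a cache line boundary, and the name `lines_spanning_rows` is hypothetical.

```python
def lines_spanning_rows(width, rows, pixels_per_line):
    """Indices of cache lines that hold pixels from more than one image row.

    Assumes a row-major image of `rows` x `width` pixels whose first pixel is
    aligned to a cache line boundary (an assumption for illustration).
    """
    total_pixels = width * rows
    total_lines = -(-total_pixels // pixels_per_line)  # ceiling division
    spanning = []
    for line in range(total_lines):
        first = line * pixels_per_line
        last = min(first + pixels_per_line, total_pixels) - 1
        if first // width != last // width:  # first and last pixel in different rows
            spanning.append(line)
    return spanning


# Rows of 6 pixels with 4 pixels per line: some lines mix the end of one row
# with the start of the next, so two image blocks would write the same line.
print(lines_spanning_rows(6, 4, 4))  # [1, 4]
print(lines_spanning_rows(8, 4, 4))  # []
```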
[0104]
Here, as shown in FIG. 9B, the image area to be processed is expanded and secured on the shared memory 125 so that the image size is an integral multiple of the image block. By performing the corner turn process on the entire expanded image, it can be executed in the same manner as in the first embodiment.
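The expanded size is the next multiple of the image block side in each direction. A minimal sketch follows; the sample size of 16381 pixels is a made-up value for illustration and does not come from the description, and the name `padded_size` is hypothetical.

```python
def padded_size(pixels, block_side):
    """Smallest multiple of `block_side` that is at least `pixels`."""
    return -(-pixels // block_side) * block_side  # ceiling division, then scale


# Hypothetical azimuth size of 16381 pixels with a 4-pixel block side.
print(padded_size(16381, 4))  # 16384
print(padded_size(2048, 4))   # 2048 (already a multiple, unchanged)
```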
[0105]
In this embodiment, the image size in the azimuth direction is not an integral multiple of the image block. However, if the image size in the range direction is not an integral multiple of the image block, the image size area can be increased in the range direction. If the image size in both directions of azimuth and range is not an integral multiple of the image block, the area of the image size may be increased in both directions.
[0106]
In this embodiment, the amount of increase in the image size area can also be set arbitrarily. For this reason, even when a cache or an image size other than those in the above embodiment is designated, the image size area may be increased by the same method.
[0107]
In this embodiment, the image area is expanded and secured by having the image processing program 1 secure an area of the designated size. Alternatively, the parallel image processing apparatus 3 may secure the area and designate to the image processing program 1 where to use it. In that case, the secured size of the shared memory 125 need not be notified; instead, the memory allocation unit 14 may instruct the image processing program 1 to perform corner turn processing on the range, including the extension area, specified in units of image blocks.
[0108]
As described above, according to the second embodiment, when corner turn processing is performed, when the image size to be processed is not an integral multiple of the image block, the processing is performed so that the image size is an integral multiple of the image block. By expanding the target image area and securing it on the shared memory 125, it is possible to reduce the overhead time due to the cache.
[0109]
Embodiment 3.
FIG. 10 is a block diagram showing the configuration of a parallel image processing apparatus according to Embodiment 3 of the present invention. In the figure, reference numeral 16 denotes access direction designating means that, when the image size is not an integral multiple of the calculated image block, calculates a row or column of image block width as an image block band and designates the image access direction within the calculated image block band. The rest of the configuration is the same as that of the first embodiment shown in FIG. 1. As in the second embodiment, this embodiment differs from the first embodiment in that the image size is not an integral multiple of the image block size.
[0110]
FIG. 11 is a flowchart showing the processing of the parallel image processing apparatus according to the third embodiment, and FIG. 12 is a diagram showing the processing contents in which the access direction is specified when the image size in the azimuth direction does not become an integral multiple of the image block. FIG. 13 is a diagram showing the processing contents in which the access direction is designated when the image size in the range direction is not an integral multiple of the image block.
[0111]
Next, the operation will be described.
Steps ST1 to ST11 in FIG. 11, up to the check of whether the image size is an integral multiple of the calculated image block, and step ST5, which is executed when the check finds an integral multiple, are the same as in the second embodiment.
[0112]
If the check in step ST11 finds that the image size is not an integral multiple, then in step ST13, in response to an instruction from the memory allocation unit 14, the access direction designating unit 16 calculates the access direction to the image and the band of the image block, and designates to the image processing program 1 the image access method within the calculated band.
[0113]
After the access direction to the image and the band of the image block have been calculated, in step ST5 of FIG. 11, the memory allocation unit 14 instructs the image processing program 1 to allocate areas of the shared memory 125 to the processors 121 to 124, using the calculated image block band as the unit, so that the bands do not overlap and the load is equal.
[0114]
Note that the image processing program 1 of this example is programmed so that corner turns can be processed by instructing the processors 121 to 124 in the access order designated from the outside.
[0115]
Next, a method in which the access direction designating unit 16 calculates the access direction to the image and the band of the image block in step ST13 will be described.
When the image size in the azimuth direction is not an integral multiple of the image block, the image access direction is the azimuth direction, as shown in FIG. 12. Conversely, when the image size in the range direction is not an integral multiple of the image block, it is the range direction, as shown in FIG. 13.
[0116]
The band of the image block has the image block width and forms a row or a column of the image depending on the direction. In the example of FIG. 12, the band is a column of image block width in the pre-processing image and a row of image block width in the processing result image.
[0117]
The access direction designating unit 16 designates an access method in which writes do not overlap and duplicate reads are reduced as much as possible. Examples of this method are shown in FIGS. 12 and 13. Because the access method differs slightly depending on the number of protruding pixels, these figures show typical examples. Corner turn processing from the range direction to the azimuth direction is described here; corner turn processing from the azimuth direction to the range direction is obtained by exchanging azimuth and range.
[0118]
FIG. 12 is an example in which one pixel protrudes beyond an integral multiple of the image block in the row direction (azimuth direction) of the processing result image. FIG. 12A shows the positional relationship between reading and writing of each pixel on the image, with numbers and letters identifying each pixel. FIG. 12B shows the access order on the image.
[0119]
At both ends of the image, reading and writing cannot each be completed in a single pass because the writing side is shifted by one pixel. Here, duplicate reads are reduced as much as possible. In FIG. 12, the cache data <1>, <2>, <3>, <4> and <N-2>, <N-1>, <N> are read first. These are written on the writing side to [1] (pixels K, G, C, 4), [2] (pixels F, B, 3, 7), [3] (pixels A, 2, 6, 10), and [4] (pixels 1, 5, 9, 13), in this order. At this point, corner turn processing is complete for all the data included in <1> (pixels 1, 2, 3, 4).
[0120]
In the next corner turn process, pixels 8, 12, 16, and 20 from the cache data <2>, <3>, <4>, and <5> are written into the cache [5] on the writing side. Since the data <2>, <3>, and <4> have already been read, only the data <5> is newly read. The corner turn process is then complete for the data of <2> (pixels 5, 6, 7, 8).
[0121]
Subsequently, in the next corner turn process, pixels 11, 15, 19, and 23 from the cache data <3>, <4>, <5>, and <6> are written into the cache [6] on the writing side, and in the process after that, pixels 14, 18, 22, and 26 from the cache data <4>, <5>, <6>, and <7> are written into the cache [7] on the writing side. Here, only the data <6> and <7> are newly read, in sequence, and corner turn processing is completed for the data included in <3> (pixels 9, 10, 11, 12) and <4> (pixels 13, 14, 15, 16).
[0122]
Thus, for the data from <1> to <N-3>, the corner turn process can be executed with the minimum number of cache reads and writes (once each).
[0123]
On the other hand, in the final stage of the corner turn process, the data of <N-5>, <N-4>, <N-3>, and <N-2> are corner-turned into the cache [N-2], the data of <N-4>, <N-3>, <N-2>, and <N-1> into the cache [N-1], and the data of <N-3>, <N-2>, <N-1>, and <N> into the cache [N]. Here, the data of <N-2>, <N-1>, and <N>, which were read at the start, are read again.
[0124]
Finally, in the method of FIG. 12, the data marked with * in <N-2>, <N-1>, and <N> is read at most twice (a single read suffices if there are enough cache lines). When the data <N-2> to <N> is first read, the dotted-line portion in FIG. 12 is not written; it is written when the data is read again in the final stage. For this reason, corner turn processing can be executed by writing to each cache once, without duplication.
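The read counts described for FIG. 12 can be checked with a small simulation. The following Python sketch rests on a simplified model that is an assumption of this illustration, not a statement of the disclosed implementation: write-side line [j] needs read-side lines <j-3> to <j>, wrapping around at the image edge, and the cache evicts the least recently used read line. The function name and parameters are hypothetical.

```python
from collections import Counter, OrderedDict


def count_line_reads(n_lines, window=4, capacity=4):
    """Count how often each read-side cache line is loaded.

    Assumed model: producing write line j (1-based) touches read lines
    j-window+1 .. j, wrapping around the image edge, and at most
    `capacity` read lines stay cached (least recently used is evicted).
    """
    reads = Counter()
    cache = OrderedDict()  # keys are cached read lines, oldest first
    for j in range(1, n_lines + 1):
        for i in range(window):
            line = (j - window + i) % n_lines + 1
            if line in cache:
                cache.move_to_end(line)  # hit: mark as most recently used
                continue
            reads[line] += 1  # miss: the line is read from memory
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict least recently used
    return reads
```

Under this model, with N = 10 and room for only four read lines, lines <1> to <7> are each read once and lines <8> to <10> twice; with room for all ten lines, every line is read once, which is consistent with the remark about a sufficient number of cache lines.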
[0125]
FIG. 13 shows an example in which one pixel protrudes beyond an integral multiple of the image block in the column direction (range direction) of the processing result image. FIG. 13A shows the positional relationship between reading and writing of each pixel on the image, and FIG. 13B shows the access order on the image.
[0126]
In FIG. 13, reading and writing cannot each be completed in a single pass because the reading side is shifted by one pixel. Here too, duplicate reads are reduced as much as possible. In FIG. 13, the cache data of <1> (pixels 1, 2, 3, 4), <N-2> (pixels A, 5, 6, 7), <N-1> (pixels F, B, 9, 10), and <N> (pixels K, G, C, 13) are read first. These are written into [1] (pixels 1, 5, 9, 13) on the writing side.
[0127]
In the next corner turn process, the cache data of <2> (pixels 14, 15, 16, a13) is newly read and written into the cache of [2] (pixels 2, 6, 10, 14) on the writing side. Here, the already-read data <N-1>, <N>, and <1> are reused.
[0128]
Subsequently, in the next corner turn process, the cache data of <3> (pixels 11, 12, a9, a10) is newly read and written into the cache of [3] (pixels 3, 7, 11, 15) on the writing side. In the process after that, the cache data of <4> (pixels 8, a5, a6, a7) is newly read and written into the cache of [4] (pixels 4, 8, 12, 16) on the writing side, thereby performing corner turn processing.
[0129]
As described above, for the data from <1> to <N-3>, the corner turn process can be executed with the minimum number of cache reads and writes (once each).
[0130]
On the other hand, in the final stage of the corner turn process, the data <N-2> is read again and corner-turned into the cache [N-2], the data <N-1> is read again and corner-turned into the cache [N-1], and the data <N> is read again and corner-turned into the cache [N].
[0131]
Finally, in the method of FIG. 13, the data marked with * in <N-2>, <N-1>, and <N> is read at most twice (a single read suffices if there are enough cache lines). When the data <N-2> to <N> is first read, the dotted-line portion in FIG. 13 is not written; it is written when the data is read again in the final stage. For this reason, corner turn processing can be executed by writing to each cache once, without duplication.
[0132]
In this embodiment, an example with a specific deviation width has been described, but any deviation width may be used. Also, in this embodiment the access method is designated to each of the processors 121 to 124 via the image processing program 1; however, the parallel image processing device 3 may instead be implemented to designate it directly using a function of the platform 2 or the like.
[0133]
As described above, according to the third embodiment, when corner turn processing is performed and the image size to be processed is not an integral multiple of the image block, the access direction to the image and the band of the image block are calculated, and each processor is directed to perform corner turn processing, in the access direction designated by the band of the image block, using an access method with little duplication of reading and no duplication of writing. As a result, cache read misses and write misses can be suppressed without changing the image size as is done in the second embodiment, and the overhead time can be reduced.
[0134]
Embodiment 4.
FIG. 14 is a block diagram showing the configuration of a parallel image processing apparatus according to Embodiment 4 of the present invention. In the figure, reference numeral 17 denotes multiple-write compatible access direction designating means. When the image size is not an integral multiple of the image block, it calculates a row or column of image block width as the band of the image block and designates the image access direction by the calculated band; when the calculated number of image block bands is not an integral multiple of the number of processors set in the used processor number setting means 12, it designates that an image block band be divided and processed by a plurality of the processors 121 to 124. The rest is the same as the configuration shown in FIG. This embodiment differs in that, when processing cannot be assigned equally to the processors 121 to 124 on the basis of image block bands, a band of the image block is divided and executed.
[0135]
FIG. 15 is a flowchart showing the processing of the parallel image processing apparatus according to the fourth embodiment. FIG. 16 is a diagram showing the processing contents when the image size is not an integral multiple of the image block in the azimuth direction. FIG. 17 is a diagram showing the processing contents when the image size is not an integral multiple of the image block in the range direction.
[0136]
Next, the operation will be described.
The processing from step ST1 to step ST11 in FIG. 15, up to checking whether the image size is an integral multiple of the image block, is the same as in the third embodiment. When the image size is not an integral multiple of the image block, in step ST13, the multiple-write compatible access direction designating unit 17 calculates the access direction to the image and the band of the image block according to the instruction from the memory allocating unit 14, and designates them to the image processing program 1.
[0137]
In step ST14, the multiple writing compatible access direction designating unit 17 checks whether the number of image block bands is an integral multiple of the set number of processors. If the result of the check indicates an integral multiple, the memory allocation unit 14 performs the process of step ST5 as in the third embodiment.
[0138]
When the number of image block bands is not an integral multiple of the set number of processors, in step ST15, the multiple-write compatible access direction designating unit 17 first assigns processing to the processors 121 to 124 so that their loads are equal, allowing image block bands to overlap between processors.
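The equal-load assignment of steps ST14 and ST15 can be sketched as follows in Python. This is a minimal illustration under assumed names (`assign_bands`, `shared_bands`); it measures each processor's share in units of bands using exact fractions, so a fractional boundary marks a band that is divided between processors.

```python
from fractions import Fraction


def assign_bands(n_bands, n_procs):
    """Return one (start, end) range per processor, in units of bands.

    Hypothetical helper: every processor receives n_bands / n_procs
    bands, so the load is equal even when the division is not exact.
    """
    share = Fraction(n_bands, n_procs)
    return [(p * share, (p + 1) * share) for p in range(n_procs)]


def shared_bands(ranges):
    """Return the 0-based indices of bands split between processors."""
    # A range that ends at a non-integer position cuts through a band.
    return sorted({int(end) for _, end in ranges if end.denominator != 1})
```

With five bands and two processors, each processor receives 2.5 bands and the third band (index 2) is the divided one; with eight bands and four processors no band is divided, which corresponds to the case in which step ST5 is performed directly.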
[0139]
Next, for the portion where the band of the image block does not overlap, the multiple writing compatible access direction designating unit 17 designates the same access direction as in the third embodiment. For the overlapping portion of the image block bands, the multiple writing compatible access direction designating unit 17 designates the following access method.
[0140]
When the image size is not an integral multiple of the image block in the azimuth direction, the multiple-write compatible access direction designating unit 17 divides the reading and writing ranges in the azimuth direction, as shown in FIG. 16. When the image size is not an integral multiple of the image block in the range direction, the reading and writing ranges are divided in the range direction, as shown in FIG. 17. For the divided ranges, an access method is designated in which data writes do not overlap and overlap of the reading ranges is reduced. Examples are shown in FIGS. 16 and 17.
[0141]
In FIG. 16, the processors 121 and 122 basically process the respective areas shown in the figure. However, because the number of image block bands is not an integral multiple of the set number of processors, the dotted-line portion of the pre-processing image is a portion whose data is read redundantly by the caches of both processors 121 and 122. The shaded area within the range of the processor 122 in the processing result image is an area that falls within the writing range of the processor 122, but the processor 121 performs the writing there in order to suppress cache write misses.
[0142]
In FIG. 17 as well, the processors 121 and 122 basically process the respective areas shown in the figure. However, because the number of image block bands is not an integral multiple of the set number of processors, the dotted-line portion of the pre-processing image is a portion whose data is read redundantly by the caches of both processors 121 and 122. For the dotted-line portion of the processing result image, each of the processors 121 and 122 performs its own writing so as to suppress cache write misses.
[0143]
After the process of step ST15 is completed, the memory allocation unit 14 performs the process of step ST5 as in the third embodiment.
[0144]
In this embodiment, an example of a specific deviation width has been described, but an arbitrary deviation width may be used.
[0145]
As described above, according to the fourth embodiment, when corner turn processing is performed, the cache read miss can be minimized, the write miss can be suppressed, and the overhead time can be reduced.
[0146]
In addition, since the processing within image block bands is divided and executed, the processing load is divided equally among the processors even when the number of image block bands is not an integral multiple of the set number of processors, so that efficient corner turn processing can be performed.
[0147]
Embodiment 5.
FIG. 18 is a block diagram showing a configuration of a parallel image processing apparatus according to Embodiment 5 of the present invention. In the figure, reference numeral 15 denotes memory securing means for expanding and securing the image area on the memory so that the image size is an integral multiple of the image block. Reference numeral 16 denotes access direction designating means that, when the image size is not an integral multiple of the image block, calculates a row or column of image block width as a band of the image block and designates the image access direction by the calculated band. Reference numeral 17 denotes multiple-write compatible access direction designating means that, when the image size is not an integral multiple of the image block, calculates a row or column of image block width as a band of the image block and designates the image access direction by the calculated band, and that, when the calculated number of image block bands is not an integral multiple of the number of processors set in the used processor number setting means 12, designates that a band of the image block be divided and processed by a plurality of the processors 121 to 124.
[0148]
Also, in FIG. 18, reference numeral 18 denotes protrusion correction method setting means for setting the condition used to decide the correction method when the image size is not an integral multiple of the image block. The rest is equivalent to the configuration shown in FIG. 1 of the first embodiment. In this embodiment, the correction method is selected when the image size is not an integral multiple of the image block. FIG. 19 is a flowchart showing the processing of the parallel image processing apparatus according to the fifth embodiment.
[0149]
Next, the operation will be described.
The protrusion correction method setting means 18 sets the determination condition that serves as the criterion for selecting a correction method when the image size is not an integral multiple of the image block. Information is set in the protrusion correction method setting means 18 by user input or the like. The memory allocation unit 14 selects the correction method by applying the conditional expression held by the protrusion correction method setting unit 18 to the image information of the image information setting unit 11 and the cache information of the cache information setting unit 13.
[0150]
Further, information such as the time required for securing the memory and the time required for the processing that designates the access direction may be set in the protrusion correction method setting unit 18 from execution information and past history. The memory allocation unit 14 can then calculate, for each correction method, the image processing time including the corner turn process that would result from selecting it, and select the method with the shortest time.
[0151]
In addition, a setting can also be made such that, when the access direction designating unit 16 has been selected but the processing cannot be divided equally in units of image block bands given the number of processors from the used processor number setting unit 12, the memory allocation unit 14 switches to the multiple-write compatible access direction designating unit 17 to perform the processing.
[0152]
The processing from step ST1 to ST4 in FIG. 19 is the same as the processing in the first embodiment. In step ST11, the memory allocation unit 14 checks whether the image size is an integral multiple of the calculated image block. If the result of the check is an integer multiple, the process of step ST5 is performed as in the first embodiment. When it does not become an integral multiple, in step ST16, the memory allocation means 14 inquires the protrusion correction method setting means 18 and selects a correction method.
[0153]
When the memory allocating unit 14 selects the memory securing unit 15, the processes of steps ST12 and ST5 are performed as in the second embodiment. When it selects the access direction designating unit 16, steps ST13 and ST5 are performed as in the third embodiment. When it selects the multiple-write compatible access direction designating unit 17, steps ST13, ST14, ST15, and ST5 are performed as in the fourth embodiment.
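The selection in step ST16 and the step sequences that follow it can be summarized in a short Python sketch. The dictionary keys, the function names, and the use of estimated times as the selection criterion are assumptions made for illustration; the step sequences themselves are those listed above.

```python
# Step sequences that follow each correction method (from the text above).
STEPS_BY_METHOD = {
    "memory_securing_15": ["ST12", "ST5"],
    "access_direction_16": ["ST13", "ST5"],
    "multi_write_access_direction_17": ["ST13", "ST14", "ST15", "ST5"],
}


def select_method(estimated_times):
    """Pick the method with the shortest estimated processing time.

    `estimated_times` maps a method name to a predicted time; how the
    prediction is obtained is outside this sketch.
    """
    return min(estimated_times, key=estimated_times.get)


def steps_after_st11(is_integral_multiple, estimated_times):
    """Return the steps executed after the check in step ST11."""
    if is_integral_multiple:
        return ["ST5"]
    return ["ST16"] + STEPS_BY_METHOD[select_method(estimated_times)]
```

Selecting by shortest estimated time is only one possible determination condition; a condition based on the image information and cache information could be substituted for `select_method` without changing the rest of the sketch.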
[0154]
As described above, according to the fifth embodiment, when corner turn processing is performed and the size of the image to be processed is not an integral multiple of the image block size, cache read misses are minimized, write misses are suppressed, and the overhead time is reduced. Moreover, selecting the optimum method from among a plurality of coping methods makes it possible to execute efficient corner turn processing.
[0155]
Embodiment 6.
FIG. 20 is a block diagram showing the configuration of a parallel image processing apparatus according to Embodiment 6 of the present invention. In the figure, reference numeral 19 denotes processing time constraint setting means for setting time constraint conditions, such as a time constraint given as a required specification and the processing time of each of the processors 121 to 124, and reference numeral 20 denotes execution processor number designating means for designating to the image processing program 1 the number of processors 121 to 124 that are actually used. The rest is the same as the configuration shown in FIG. This embodiment differs from the first embodiment in that the number of processors to be used is adjusted before the image blocks are distributed among the processors 121 to 124.
FIG. 21 is a flowchart showing the processing of the parallel image processing apparatus according to the sixth embodiment.
[0156]
Next, the operation will be described.
The processing time constraint setting means 19 sets time constraint condition information for corner turn processing given as a required specification. The time constraints set here include time conditions such as the target time to completion of processing and the maximum allowable time, policies on processing time (speed) and system resource (processor) usage (processing time priority, system resource saving priority, and so on), and various other kinds of information that affect the processing time. The execution time information required for the corner turn process can be recorded together with parameters such as the image size, the number of processors used, and the image division processing method.
[0157]
In addition to being set manually by user input or the like, information in the processing time constraint setting means 19 can be stored during processing execution, as execution time information together with parameters such as the image size and the number of processors used.
[0158]
The execution processor number designating means 20 is a means for designating the processing to be executed with the designated number of processors. In this example, the number of processors is specified in the image processing program 1. In the image processing program 1 of this example, corner turn processing is performed with the number of processors designated by the execution processor number designation means 20.
[0159]
The processing up to the calculation of image blocks from step ST1 to ST4 in FIG. 21 is the same as in the first embodiment. However, in this embodiment, the number of processors obtained from the used processor number setting means 12 is stored as the processor number A in step ST3.
[0160]
After the image block is calculated in step ST4, in step ST21, the memory allocation unit 14 obtains the time constraint condition from the processing time constraint setting unit 19, and in step ST22, the memory allocation unit 14 calculates the number of processors B that satisfies the obtained time constraint. This calculation may include using the execution time information of past corner turn processing and the processing parameters at that time, such as the image size, to predict the relationship between the number of processors and the processing time under the given conditions.
[0161]
For example, in the case of system resource saving priority, the minimum number of processors required to satisfy the maximum allowable time is calculated. When processing time is prioritized within the range from the target time to the maximum allowable time, the minimum number of processors required to finish within the target time is calculated. If processing time has the highest priority, the number of processors is set to infinity (∞).
[0162]
After the number of processors B is calculated in step ST22, in step ST23, the memory allocation means 14 compares the value of the number of processors A with the value of the number of processors B and, if they are the same, performs the process of step ST5 as in the first embodiment. If the value of A differs from the value of B in step ST23 and the value of B is greater than or equal to the value of A in step ST24, the memory allocating unit 14 takes the value of A as the final number of processors and performs the process of step ST5 as in the first embodiment.
[0163]
If the value of B is smaller than the value of A in step ST24, the memory allocation unit 14 takes the value of B as the final number of processors in step ST25. In this case, the value of the number of processors B is used in step ST5 to allocate the processing areas to the processors in the same procedure as in the first embodiment. Finally, the memory allocation means 14 passes the final number of processors to the execution processor number designation means 20, and the execution processor number designating unit 20 instructs the image processing program 1 to execute the processing with that number of processors.
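Steps ST22 to ST25 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the policy names, the `predict_time` callback (standing in for a prediction built from past execution time information), and the `search_limit` bound are all hypothetical.

```python
import math


def processors_b(policy, predict_time, target_time, max_time, search_limit=1024):
    """Step ST22 (sketch): number of processors B meeting the constraint.

    policy is one of the hypothetical names "resource_saving",
    "time_priority", or "time_highest_priority"; predict_time(p) returns
    the predicted processing time with p processors.
    """
    if policy == "time_highest_priority":
        return math.inf
    limit = max_time if policy == "resource_saving" else target_time
    for p in range(1, search_limit + 1):
        if predict_time(p) <= limit:
            return p  # smallest count that satisfies the time limit
    # Assumption of this sketch: if no count satisfies the limit,
    # fall back to "as many as available".
    return math.inf


def final_processors(a, b):
    """Steps ST23 to ST25 (sketch): A if B >= A, otherwise B."""
    return a if b >= a else b
```

For instance, with a hypothetical prediction of 120 / p seconds, a target time of 20 seconds, and a maximum allowable time of 40 seconds, resource saving priority gives B = 3 and time priority gives B = 6; with A = 4 available processors the final counts are 3 and 4 respectively.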
[0164]
In this embodiment, the example in which the used processor number setting unit 12 specifies the maximum number of available processors has been described. However, the number of processors may be left unset in the used processor number setting unit 12, and the maximum number of available processors may instead be set in the processing time constraint setting unit 19 as one of the usage constraints on computer resources, with the number of processors A set to infinity (∞).
[0165]
In this embodiment, the execution processor number designating unit 20 designates the number of processors to the image processing program 1, but it may instead be implemented so that the execution processor number designating unit 20 directly sets the number of processors used by the image processing program 1 using a function of the platform 2.
[0166]
In this embodiment, the number of processors used is designated by the execution processor number designating means 20 for corner turn processing, but it may also be implemented so that the execution processor number designating means 20 can change the number of processors for image processing other than corner turn processing.
[0167]
In this embodiment, the processing time-related information in the processing time constraint setting means 19 is the time related to corner turn processing, but the number of processors B may instead be calculated using the constraint time information and execution time information for the entire image processing including corner turn processing.
[0168]
In this embodiment, the processing time constraint setting means 19 and the execution processor number specifying means 20 are added to the parallel image processing apparatus of the first embodiment, but they may also be added to the parallel image processing apparatuses of the second to fifth embodiments. Further, when applied to the fourth embodiment or the fifth embodiment, overlapping of image block bands may be suppressed by adding to the processing time constraint setting means 19 the condition that an image block band is not divided among a plurality of processors.
[0169]
As described above, according to the sixth embodiment, the overhead time due to the cache can be reduced, and the optimum number of processors is calculated based on the time constraint condition information given as the required specification, so that efficient corner turn processing can be executed.
[0170]
Also, by using past execution time information and the like in calculating the optimal number of processors, corner turn processing can be completed more reliably within the time given as the required specification while saving the resources used.
[0171]
Embodiment 7.
FIG. 22 is a block diagram showing a configuration of a parallel image processing apparatus according to Embodiment 7 of the present invention. In the figure, reference numeral 21 denotes pixel access order designating means that, when the number of cache lines of each of the processors 121 to 124 necessary for image block processing is insufficient, calculates an access pattern to the image that minimizes the number of cache data saves and designates it to the image processing program 1. The rest is the same as the configuration shown in FIG. 1 of the first embodiment. This embodiment differs in that, when image blocks are distributed among the processors 121 to 124 and executed, the number of cache lines of each of the processors 121 to 124 is insufficient to perform corner turn processing for one image block.
[0172]
FIG. 23 is a flowchart showing the processing of the parallel image processing apparatus according to the seventh embodiment, and FIG. 24 is a diagram showing an access pattern when the number of cache lines is insufficient.
[0173]
Next, the operation will be described.
The processing from the step ST1 to the step ST4 in FIG. 23 for calculating the image block is the same as that in the first embodiment. After calculating the image block, in step ST31, the memory allocation unit 14 first calculates the number of cache lines necessary for processing the image block. This number of lines is the number of lines necessary to store all image block data in the cache on the pre-processing image side and the processing result image side of the corner turn processing.
[0174]
In the example of FIG. 24, there are 4 pixels per cache line, and the image block is an area of 4 pixels × 4 pixels. In this case, four lines are required to store all the image block data in the cache on the pre-processing image side and the processing result image side, so the number of cache lines required here is eight.
[0175]
After calculating the required number of cache lines, in step ST31, the memory allocation unit 14 compares it with the number of cache lines installed in the processors 121 to 124, as set by the cache information setting unit 13. When the number of cache lines is sufficient (the number of cache lines installed in the processors 121 to 124 ≧ the number of required cache lines), the process of step ST5 is executed as in the first embodiment.
[0176]
When the number of cache lines is insufficient (the number of cache lines installed in the processors 121 to 124 <the number of necessary cache lines), in step ST32, in accordance with an instruction from the memory allocation unit 14, the pixel access order designating unit 21 calculates the order of access to the pixels and designates it to the image processing program 1. Thereafter, in step ST5, image block allocation to each of the processors 121 to 124 is executed in the same manner as in the first embodiment.
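The calculation and comparison of step ST31 can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that every row of the image block starts on a cache line boundary on both the pre-processing side and the processing result side.

```python
def required_cache_lines(block_rows, block_cols, pixels_per_line):
    """Cache lines needed to hold one image block on both sides.

    Assumes each block row is aligned to a cache line boundary
    (an assumption of this sketch).
    """
    def lines_per_row(n_pixels):
        return -(-n_pixels // pixels_per_line)  # ceiling division

    read_side = block_rows * lines_per_row(block_cols)
    # After the corner turn the block is transposed on the write side.
    write_side = block_cols * lines_per_row(block_rows)
    return read_side + write_side


def needs_access_order_control(installed_lines, required_lines):
    """True when step ST32 (access order control) would be needed."""
    return installed_lines < required_lines
```

For the example of FIG. 24 (4 pixels per cache line and a 4 x 4 pixel image block), `required_cache_lines(4, 4, 4)` gives 8, matching the count stated above.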
[0177]
The process of step ST32 will be described.
The pixel access order designating unit 21 instructed to control the access order uses the information in the cache information setting unit 13 to calculate an access pattern that minimizes the number of times cache data is saved. Here, a pattern that minimizes the number of times cache data is saved is calculated by controlling the access order to the pixels.
[0178]
FIG. 24 shows an example of an access pattern when the required number of cache lines is 8 and the number of installed cache lines is less than 8. Here, it is assumed that cache data is saved in order of last access. Under this assumption, accessing the pixels in the order shown in FIG. 24 minimizes the number of cache data saves.
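One way to compare candidate access patterns is to count the cache data saves each would cause. The following Python sketch does this under the assumption that "saved in order of last access" corresponds to a least-recently-used replacement policy. The function names are hypothetical, and `row_major_order` is merely a naive candidate for comparison; it is not the pattern of FIG. 24.

```python
from collections import OrderedDict


def count_saves(accesses, capacity):
    """Count cache line saves (evictions) under an assumed LRU policy."""
    cache = OrderedDict()  # cached line ids, least recently used first
    saves = 0
    for line in accesses:
        if line in cache:
            cache.move_to_end(line)  # hit: refresh recency
            continue
        if len(cache) >= capacity:
            cache.popitem(last=False)  # save the least recently used line
            saves += 1
        cache[line] = True
    return saves


def row_major_order(block=4):
    """Naive candidate: visit source pixels row by row.

    Pixel (r, c) is read from read-side line ("R", r) and written to
    write-side line ("W", c), one pixel at a time.
    """
    order = []
    for r in range(block):
        for c in range(block):
            order += [("R", r), ("W", c)]
    return order
```

With 8 available lines the naive order causes no saves, whereas with fewer lines it does; an access order designating step could evaluate several candidate orders with `count_saves` and keep the one with the smallest count.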
[0179]
After calculating the access pattern, in step ST32, the pixel access order designating unit 21 designates the access pattern to the pixels to the image processing program 1, in the same manner as the memory allocation unit 14 does, and thereby controls the order in which each of the processors 121 to 124 actually accesses the pixels. Here, it is assumed that the image processing program 1 can change the pixel access order of the processors 121 to 124 to an order designated from the outside.
[0180]
In this embodiment, the size of the image block is 4 pixels × 4 pixels, but an image block of any size may be used. The number of cache lines by which the cache falls short may also be any number. Furthermore, although this embodiment describes the case in which cache data is saved in order of last access, a system that uses another cache save method may be implemented so that an access pattern that minimizes the number of cache data saves under that method is designated.
[0181]
In this embodiment, the pixel access order designating unit 21 controls the access pattern by designating the access pattern to the image processing program 1, but the pixel access order designating unit 21 uses the function of the platform 2. It may be implemented to set directly.
[0182]
In this embodiment, the pixel access order specifying means 21 is added to the parallel image processing apparatus of the first embodiment, but the pixel access order specifying means 21 may also be added to the parallel image processing apparatuses of the second to sixth embodiments.
[0183]
As described above, according to the seventh embodiment, when an image block is divided among the processors 121 to 124 and executed, even if the number of cache lines of each of the processors 121 to 124 is insufficient for the corner turn processing of one image block, the overhead time can be reduced by controlling the access pattern to each pixel and suppressing the number of times cache data is saved.
[0184]
Embodiment 8.
FIG. 25 is a block diagram showing a configuration of a parallel image processing apparatus according to Embodiment 8 of the present invention. In the figure, reference numeral 22 denotes a multi-level cache compatible memory allocation means. When the caches of the processors 121 to 124 have a multi-level configuration, it obtains the image information set in the image information setting means 11, the number of processors 121 to 124 set in the used processor number setting means 12, and the cache information set in the cache information setting means 13. Based on the image block of the primary cache calculated by the memory allocation means 14, it calculates, for each cache level, an image block that contains the image block of the cache one level below. It then instructs the image processing program 1 to allocate the target areas of the shared memory 125 to the processors 121 to 124 so that the corner turn process is executed for each calculated image block of the highest cache and the number of highest-cache image blocks executed is equalized among the processors 121 to 124. The rest is equivalent to the configuration shown in FIG. 1 of the first embodiment.
FIG. 26 is a flowchart showing the processing of the parallel image processing apparatus according to the eighth embodiment.
[0185]
FIG. 27 is a diagram showing the concept of memory allocation in this embodiment, and FIG. 28 is a diagram showing an example in which an image block is processed by the plurality of processors 121 to 124. In this example, it is assumed that each of the processors 121 to 124 has a two-level cache configuration consisting of a primary cache and a secondary cache. The primary cache has the same size as in the first embodiment (32 bytes × 512 lines) and uses the write-through method; the secondary cache is 128 bytes × 16,384 lines and uses the write-back method. It is also assumed that the overhead time of a write miss to the secondary cache is particularly large compared with the other cache miss overhead times.
[0186]
Next, the operation will be described.
In step ST41 of FIG. 26, the multi-level cache compatible memory allocation means 22 instructs the memory allocation means 14 to calculate the image block of the lowest-level (primary) cache. In step ST42, the memory allocation means 14 obtains the number of processors used and the cache information, and calculates the image block of the primary cache in the same manner as in the first embodiment.
[0187]
In step ST43, the multi-level cache compatible memory allocation means 22 selects the secondary cache, which is one level higher, and in step ST44 obtains the cache information of that secondary cache. In step ST45, it calculates the image block of the secondary cache (hereinafter, the secondary image block) so that it contains the image block of the primary cache (hereinafter, the primary image block). In the example shown in FIG. 27, four primary image blocks match the width of the secondary cache, so an image block of 4 × 4 primary image blocks is set as the secondary image block.
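The relationship between the primary and secondary image blocks can be illustrated with a minimal sketch. It assumes that the side of an image block at each level equals the number of pixels that fit in one cache line of that level, and that the pixel size is 8 bytes (inferred from the statement elsewhere in this description that one primary cache line holds 4 pixels). The function name `image_block_sides` is hypothetical and is not part of the described apparatus.

```python
def image_block_sides(pixel_size, line_sizes):
    """Return the image block side, in pixels, for each cache level.

    line_sizes is ordered from the primary cache upward (bytes per line).
    Assumption: the block side at a level is the number of pixels per line.
    """
    sides = []
    for level, line_size in enumerate(line_sizes, start=1):
        if line_size % pixel_size != 0:
            raise ValueError(f"level {level}: line size is not a multiple of the pixel size")
        side = line_size // pixel_size
        # A higher-level block must contain a whole number of lower-level blocks.
        if sides and side % sides[-1] != 0:
            raise ValueError(f"level {level}: block does not contain whole lower-level blocks")
        sides.append(side)
    return sides


# Values from this embodiment: 32-byte primary lines, 128-byte secondary lines.
sides = image_block_sides(pixel_size=8, line_sizes=[32, 128])
primary_per_secondary = sides[1] // sides[0]
print(sides, primary_per_secondary)
```

With these values the primary image block is 4 × 4 pixels and the secondary image block spans 4 primary image blocks per side, which agrees with the 4 × 4 arrangement described for FIG. 27.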
[0188]
In step ST46, the multi-tier cache-compatible memory allocation unit 22 checks whether there is a higher-level cache, and if so, in step ST47, selects one higher-level cache and returns to step ST44. The image block is obtained in the same manner as the secondary image block is obtained. For example, when a tertiary cache exists, an image block of a tertiary cache including the secondary image block is calculated on the basis of the secondary image block.
[0189]
If there is no higher cache in step ST46, the image block of the highest cache is determined as the entire image block, and the process of step ST5 is performed. In this example, since only the secondary cache exists, the secondary image block is set as the entire image block.
[0190]
In step ST5, the multi-tier cache compliant memory allocating unit 22 uses the same method as the memory allocating unit 14 of the first embodiment on the basis of the determined entire image block to determine the target areas to be processed by the processors 121 to 124. Assign. In this example, a region is allocated with the secondary image block as one unit, and the processing is executed by each of the processors 121 to 124.
[0191]
Further, when the image blocks cannot be allocated uniformly in the entire image block unit, the processing of the image blocks is executed by the plurality of processors 121 to 124 as in the first embodiment. FIG. 28 is a diagram illustrating an example of processing an image block of a multilevel cache by a plurality of processors 121 to 124.
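The allocation in whole-image-block units, with leftover blocks handled jointly, can be sketched as follows. This is only an illustration: the contiguous assignment of block indices and the name `allocate_blocks` are assumptions, since the actual allocation method is that of the first embodiment.

```python
def allocate_blocks(num_blocks, num_processors):
    """Give each processor the same number of whole blocks; return leftovers separately.

    Leftover blocks are the ones that cannot be shared out evenly and would
    therefore each be processed jointly by several processors.
    """
    per_processor, leftover = divmod(num_blocks, num_processors)
    whole = {
        p: list(range(p * per_processor, (p + 1) * per_processor))
        for p in range(num_processors)
    }
    first_shared = per_processor * num_processors
    shared = list(range(first_shared, first_shared + leftover))
    return whole, shared


whole, shared = allocate_blocks(num_blocks=10, num_processors=4)
print(whole)
print(shared)
```

In this hypothetical case of 10 blocks and 4 processors, each processor receives 2 whole blocks and the remaining 2 blocks are the ones divided among the processors.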
[0192]
When an image block of a multi-level cache is processed by the plurality of processors 121 to 124 in this way, for the highest-level cache (the secondary cache in this example), all of the processors 121 to 124 read the data of the entire image block, as in the first embodiment (in this example, all of the processors 121 to 124 read the data of the secondary image block).
[0193]
On the other hand, a low-level cache does not always need to read all of the data. In the example of FIG. 28, the processor 121 only needs to read primary image blocks <1>, <5>, <9>, and <13>, and the processor 122 only needs to read primary image blocks <2>, <6>, <10>, and <14>.
As in the first embodiment, writes are performed so that they do not overlap in any cache.
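The read pattern given for the processors 121 and 122 can be reproduced with a short sketch. It assumes that the 16 primary image blocks of a secondary image block are numbered <1> to <16> and that every processor follows the same stride as the two processors named in the text; the extension to the processors 123 and 124 and the name `primary_blocks_read_by` are assumptions.

```python
def primary_blocks_read_by(processor_index, num_processors=4, blocks_per_side=4):
    """List the 1-based primary image block numbers read by one processor.

    processor_index 0 corresponds to processor 121, 1 to processor 122, and so on.
    """
    total = blocks_per_side * blocks_per_side
    return [n for n in range(1, total + 1)
            if (n - 1) % num_processors == processor_index]


for index in range(4):
    print(121 + index, primary_blocks_read_by(index))
```

Each primary image block is read into the primary cache of exactly one processor under this assumption, so no primary image block is fetched twice at the low level.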
[0194]
In this embodiment, the configuration using the primary cache and the secondary cache has been described. However, the present invention can also be applied when the processors 121 to 124 executing in parallel have a cache hierarchy of three or more levels.
[0195]
In this embodiment, the primary cache and the secondary cache have been described with the sizes and consistency-maintenance methods given above, but any sizes and any consistency-maintenance methods may be used. The pixel size and the image size of the image to be processed may also be arbitrary.
[0196]
In this embodiment, the multi-level cache compatible memory allocation means 22 is added to the parallel image processing apparatus of the first embodiment, but it may also be added to the parallel image processing apparatus of the sixth embodiment.
[0197]
As described above, according to the eighth embodiment, when a multi-level cache configuration is adopted, the image block of the highest cache is calculated, and the target areas of the shared memory 125 are allocated to the processors 121 to 124 so that each processor executes the corner turn for each image block of the highest cache and the number of highest-cache image blocks executed is equalized among the processors. This has the effect of reducing the overhead time due to the cache.
[0198]
Further, even when a multi-level cache configuration is adopted, processing one image block with the plurality of processors 121 to 124 has the effect that the load can be distributed more evenly and the processing of one image block can be sped up.
[0199]
Embodiment 9.
FIG. 29 is a block diagram showing a configuration of a parallel image processing apparatus according to Embodiment 9 of the present invention. In the figure, reference numeral 23 denotes a multi-level cache compatible memory securing means that, when the image size is not an integral multiple of the image block of the highest cache, expands the image area so that the image size becomes an integral multiple of that image block and secures the expanded area on the shared memory 125. The rest is the same as the configuration shown in FIG. 25 of the eighth embodiment. This embodiment differs from the eighth embodiment in that the image size is not an integral multiple of the image block size of the highest cache.
FIG. 30 is a flowchart showing the processing of the parallel image processing apparatus according to the ninth embodiment.
[0200]
Next, the operation will be described.
The processing up to calculating the image block of the highest cache from steps ST41 to ST47 in FIG. 30 is the same as in the eighth embodiment.
[0201]
In step ST51, the multilevel cache-compatible memory allocation unit 22 checks whether the image size is an integral multiple of the calculated image block of the highest-level cache. If the result of the check is an integer multiple, the process of step ST5 is performed as in the eighth embodiment.
[0202]
When it is not an integral multiple, in step ST52, on instruction from the multi-level cache compatible memory allocation means 22, the multi-level cache compatible memory securing means 23 expands the image area so that the image size becomes an integral multiple of the image block of the highest cache, in the same manner as the memory securing means 15 of the second embodiment, and secures the expanded area on the shared memory 125. Then, the multi-level cache compatible memory allocation means 22 performs the process of step ST5 as in the eighth embodiment.
[0203]
In this embodiment, the multi-level cache compatible memory allocation means 22 and the multi-level cache compatible memory securing means 23 are added to the parallel image processing apparatus of the first embodiment, but they may also be added to the parallel image processing apparatus of the sixth embodiment.
[0204]
As described above, according to the ninth embodiment, when a multi-level cache configuration is adopted and the image size to be processed is not an integral multiple of the image block size of the highest cache, the image area to be processed is expanded and secured on the shared memory 125 so that, in the corner turn processing, the image size becomes an integral multiple of the image block of the highest cache. This has the effect of reducing the overhead time due to the cache.
[0205]
Embodiment 10.
FIG. 31 is a block diagram showing a configuration of a parallel image processing apparatus according to Embodiment 10 of the present invention. In the figure, reference numeral 24 denotes a multi-level image block memory securing means that, when the line size of the upper cache is not an integral multiple of the image block of the lower cache, expands the area of the image block of the lower cache so that the line size of the upper cache becomes an integral multiple of it, and secures the expanded area on the shared memory 125. The rest is equivalent to the configuration shown in FIG. 25 of the eighth embodiment. This embodiment differs from the eighth embodiment in that the line size of the upper cache is not an integral multiple of the image block of the lower cache.
FIG. 32 is a flowchart showing the processing of the parallel image processing apparatus according to the tenth embodiment, and FIG. 33 is a diagram showing the concept of the region expansion method in this embodiment.
[0206]
Next, the operation will be described.
The processing up to obtaining the target cache information at each level of the secondary cache and higher from steps ST41 to ST44 in FIG. 32 is the same as in the eighth embodiment.
[0207]
In step ST51, based on the cache information obtained in step ST44, the multi-level cache compatible memory allocation means 22 checks whether the line size of the cache at the selected level is an integral multiple of the image block size of the cache one level below. If it is an integral multiple, the processing from step ST45 onward is performed as in the eighth embodiment.
[0208]
If it is not an integral multiple, in step ST52, on instruction from the multi-level cache compatible memory allocation means 22, the multi-level image block memory securing means 24 expands the areas of all the low-level cache image blocks so that it becomes an integral multiple, and secures them on the shared memory 125. This example is shown in the figure. The method of instructing the area expansion is the same as in the second and ninth embodiments. After the area expansion in step ST52, the processing from step ST45 onward is performed, as in the eighth embodiment.
[0209]
When this area expansion is performed, areas that are not accessed arise within the image, so the image processing program 1 must still access the image data correctly after the expansion. One way to achieve this is to make the image processing program 1 a program that supports the area expansion performed by the multi-level image block memory securing means 24: the multi-level image block memory securing means 24 notifies the image processing program 1 of the area expansion, and the image processing program 1 handles it automatically.
[0210]
Another way is to make the image processing program 1 a program that runs while referring to the area designated by the parallel image processing apparatus 3 as image data. In this case, the multi-level image block memory securing means 24 in the parallel image processing apparatus 3 manages the area expansion information and the position information of the image data, and calculates and designates the position of the image data so that the image processing program 1 can access the correct data.
[0211]
In this embodiment, the multi-level image block memory securing means 24 is added to the parallel image processing apparatus of the eighth embodiment, but it may also be added to the parallel image processing apparatus of the ninth embodiment. The multi-level cache compatible memory allocation means 22 and the multi-level image block memory securing means 24 may also be added to the parallel image processing apparatus of the sixth embodiment.
[0212]
As described above, according to the tenth embodiment, in a multi-level cache configuration, when the line size of the upper cache is not an integral multiple of the image block of the lower cache, the area of the image block of the lower cache is expanded and secured on the shared memory 125 so that, in the corner turn process, the line size of the upper cache becomes an integral multiple of the image block of the lower cache. This has the effect of reducing the overhead time due to the cache.
[0213]
Embodiment 11.
FIG. 34 is a block diagram showing the configuration of a parallel image processing apparatus according to Embodiment 11 of the present invention. In the figure, reference numeral 25 denotes a multi-level cache compatible access direction designating means. When the image size is not an integral multiple of the calculated image block of the highest cache, it calculates image block bands, each being a row or column of image blocks of the highest cache, and designates the access direction to the image together with the calculated image block bands. When the calculated number of image block bands is not an integral multiple of the number of processors 121 to 124 set in the used processor number setting means 12, the image block bands are divided and designated for processing by the plurality of processors 121 to 124. The rest is the same as the configuration shown in FIG. 25 of the eighth embodiment. This embodiment differs from the eighth embodiment in that the image size is not an integral multiple of the image block of the highest cache.
[0214]
FIG. 35 is a flowchart showing the processing of the parallel image processing apparatus according to the eleventh embodiment, and FIG. 36 is a diagram showing the concept of the access direction designating method in the eleventh embodiment.
[0215]
Next, the operation will be described.
The steps up to calculating the image block of the highest cache from steps ST41 to ST47 in FIG. 35 are the same as those in the eighth embodiment.
[0216]
In step ST61, the multi-level cache compatible memory allocation means 22 checks whether the image size is an integral multiple of the calculated image block of the highest cache. If it is an integral multiple, the process of step ST5 is performed as in the eighth embodiment. If it is not an integral multiple, in step ST62, the multi-level cache compatible access direction designating means 25 calculates the image block bands and the access direction to the image in accordance with an instruction from the multi-level cache compatible memory allocation means 22.
[0217]
The method by which the multi-level cache compatible access direction designating means 25 calculates the image block bands and the access direction to the image is basically the same as in the third embodiment. An image block band runs in the azimuth or range direction and has the width of the image block of the highest cache. In the example of FIG. 36, the image block band is a column one image block wide in the pre-processing image and a row one image block wide in the post-processing image. The access direction to the image is the azimuth direction when the image protrudes on the azimuth side, and the range direction when it protrudes on the range side.
[0218]
In step ST63, the multi-level cache compatible memory allocation means 22 checks whether the number of image block bands is an integral multiple of the set number of processors. If it is an integral multiple, the process of step ST5 is performed in the same manner as in the eighth embodiment. At this time, the processing is assigned to the processors 121 to 124 in units of image block bands so that the image block bands do not overlap and the load is equal.
[0219]
When the number of image block bands is not an integral multiple of the set number of processors in step ST63, in step ST64, the multi-level cache compatible memory allocation means 22 divides the calculated image block bands into parts whose widths are integral multiples of the image block width of the low-level cache. An example is shown in FIG. 36.
[0220]
In FIG. 36, the image block of the low-level cache (primary cache) has the same size as (a) and (b) in the figure. When dividing into two, the band is divided into (A) and (B) and, in step ST5, assigned so that it is executed by two processors. When dividing into four, (A) is further divided into halves of its width, that is, into units of (a) and (b), and in step ST5 the band is assigned so that it is executed by four processors.
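The division of an image block band can be sketched as below. The sketch measures the band width in low-level (primary) image blocks and assumes a band that is 4 primary image blocks wide, as in the two-level example of the eighth embodiment; the name `split_band` and the choice to reject uneven splits are assumptions.

```python
def split_band(band_width_blocks, parts):
    """Split a band into equal parts, each a whole number of low-level blocks wide.

    Returns (start, end) offsets in units of low-level image blocks, end exclusive.
    """
    if parts <= 0 or band_width_blocks % parts != 0:
        raise ValueError("the band cannot be split into whole low-level image blocks")
    width = band_width_blocks // parts
    return [(i * width, (i + 1) * width) for i in range(parts)]


print(split_band(4, 2))  # two halves, corresponding to (A) and (B)
print(split_band(4, 4))  # four parts, each one primary image block wide
```

Keeping every cut on a low-level image block boundary is what allows each processor to work on whole primary image blocks, which is the property the text relies on to suppress read and write misses.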
[0221]
In this embodiment, an example of protruding in the azimuth direction is shown, but the present invention can also be applied when the image protrudes in the range direction. In this embodiment, the multi-level cache compatible access direction designating means 25 is added to the parallel image processing apparatus of the eighth embodiment, but it may also be added to the parallel image processing apparatuses of the ninth and tenth embodiments.
[0222]
As described above, according to the eleventh embodiment, when a multi-level cache configuration is adopted and the image size to be processed is not an integral multiple of the image block of the highest cache, the image block bands and the access direction to the image are calculated without expanding the image block area as in the tenth embodiment, and the corner turn processing is performed for each image block band in the designated access direction by a method that incurs few cache misses and no overlapping writes. This has the effect of reducing the overhead time due to the cache.
[0223]
Even when an image block band is divided and executed by the plurality of processors 121 to 124, dividing the band into parts that are integral multiples of the image block width of the low-level cache suppresses cache read misses and write misses, so that efficient corner turn processing can be performed.
[0224]
Embodiment 12.
FIG. 37 is a block diagram showing a configuration of a parallel image processing apparatus according to Embodiment 12 of the present invention. In the figure, reference numeral 26 denotes a multi-level cache compatible access pattern designating means that, when the number of cache lines of each of the processors 121 to 124 necessary for processing the calculated image block is insufficient, designates an access pattern to the image or to the lower-level cache that minimizes the number of cache data saves. The rest is the same as the configuration shown in FIG. 25 of the eighth embodiment. This embodiment differs from Embodiment 8 in that, when an image block is divided among the processors and executed, the number of cache lines of each of the processors 121 to 124 is insufficient to perform the corner turn processing of one image block.
[0225]
FIG. 38 is a flowchart showing the processing of the parallel image processing apparatus according to the twelfth embodiment, and FIG. 39 is a diagram showing an access pattern when the number of secondary cache lines is insufficient in the twelfth embodiment.
[0226]
Next, the operation will be described.
The processing up to the calculation of the image block of the highest cache from step ST41 to step ST47 in FIG. 38 is the same as in the eighth embodiment. After calculating the image block of the highest cache, in step ST71, the multi-level cache compatible memory allocation means 22 first calculates the number of cache lines necessary for processing the image block for each cache level. For the cache at each level, this is the number of lines needed to hold in that cache all the data of the highest-cache image block on both the pre-processing image side and the processing result image side. The method for calculating the number of lines is the same as in the seventh embodiment.
[0227]
After calculating the required number of cache lines, the multi-level cache compatible memory allocation means 22 compares it with the number of cache lines provided in each of the processors 121 to 124, as set in the cache information setting unit 13, and checks whether the number of cache lines is insufficient. When the number of cache lines is sufficient at all levels (the number of cache lines provided in the processors 121 to 124 ≧ the required number of cache lines), the processing in step ST5 is executed as in the eighth embodiment.
[0228]
When the number of cache lines is insufficient at any level (the number of cache lines provided in the processors 121 to 124 < the required number of cache lines), in step ST72, on instruction from the multi-level cache compatible memory allocation means 22, the multi-level cache compatible access pattern designating means 26 uses the information in the cache information setting unit 13 to calculate an access pattern that minimizes the number of cache data saves. FIG. 39 shows the access order in that case.
[0229]
At this time, the check is performed starting from the highest cache. When the number of lines is insufficient at some cache level, processing is performed in units of the image block of that level. When the number of lines is still insufficient even in units of that level's own image block, the image block of the next lower cache is selected as the unit, and the access order is controlled and executed accordingly. Lower-level cache image blocks continue to be selected until the line shortage is resolved.
[0230]
In the example of FIG. 39, processing is executed in units of <1>, <2>, <3>, which are image blocks of the primary cache. In this example, 128 lines are required in the primary cache to execute with an arbitrary access pattern. This is because one line of the primary cache holds 4 pixels, so 256 pixels require 64 lines, and twice that number is required to cover both reading and writing. On the other hand, if processing is executed in units of image blocks of the primary cache, it can be executed with 8 lines.
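The line counts in this example follow from simple arithmetic, sketched below. The sketch assumes square image blocks (16 × 16 pixels for the secondary image block and 4 × 4 pixels for the primary image block, consistent with the 256 pixels and 4 pixels per line mentioned above); the name `required_cache_lines` is hypothetical.

```python
def required_cache_lines(block_side_pixels, pixels_per_line):
    """Cache lines needed to hold one square block for both reading and writing."""
    pixels = block_side_pixels * block_side_pixels
    lines_one_side = -(-pixels // pixels_per_line)  # ceiling division
    return 2 * lines_one_side  # one copy for the source side, one for the result side


secondary_block_lines = required_cache_lines(block_side_pixels=16, pixels_per_line=4)
primary_block_lines = required_cache_lines(block_side_pixels=4, pixels_per_line=4)
print(secondary_block_lines, primary_block_lines)
```

Under these assumptions the two results are 128 and 8 lines, matching the figures stated in the text.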
[0231]
This method of eliminating the shortage of lines by processing in units of image blocks has the advantage that it can be executed with the same efficiency as the case where corner turn processing is executed with an arbitrary access pattern.
[0232]
Finally, if the cache line is insufficient even in the image block unit of the primary cache, the access pattern in the pixel unit is controlled as in the seventh embodiment. In this case, the processing efficiency is reduced as compared with the case of executing for each image block unit of the low-level cache, but overhead can be suppressed by adopting the same method as in the seventh embodiment.
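The fallback order described above (highest-cache image block first, then lower-level image blocks, and finally pixel-level control) can be sketched for a single cache as follows. This is a simplified illustration rather than the procedure of FIG. 38: it takes the required line counts as given, checks them against one available line count, and uses hypothetical unit names.

```python
def choose_processing_unit(required_lines_by_unit, available_lines):
    """Pick the largest processing unit whose line requirement fits.

    required_lines_by_unit lists (unit_name, required_lines) from the
    highest-cache image block down to the primary image block.
    """
    for unit_name, required_lines in required_lines_by_unit:
        if required_lines <= available_lines:
            return unit_name
    # No image block unit fits: fall back to controlling the pixel access order.
    return "pixel access pattern"


units = [("secondary image block", 128), ("primary image block", 8)]
print(choose_processing_unit(units, available_lines=512))
print(choose_processing_unit(units, available_lines=64))
print(choose_processing_unit(units, available_lines=4))
```

The available line counts 64 and 4 are made-up values chosen only to exercise the two fallback cases.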
[0233]
After calculating the access pattern, in step ST72, the multi-level cache compatible access pattern designating means 26 calculates and applies the actual access order of each processor by the same method as the pixel access order designating means 21 of the seventh embodiment. The process of step ST5 is then executed as in the eighth embodiment.
[0234]
In this embodiment, the cache configuration is the primary and secondary caches, but any cache configuration may be used. Also, any value may be used for image information such as the pixel size and the number of pixels.
[0235]
In this embodiment, the multi-level cache compatible access pattern designating means 26 is added to the parallel image processing apparatus of the eighth embodiment, but it may also be added to the parallel image processing apparatuses of the ninth to eleventh embodiments.
[0236]
As described above, according to the twelfth embodiment, when a multi-level cache configuration is adopted and the image block of the highest cache is divided among the processors 121 to 124 for corner turn processing, even if the number of cache lines of each of the processors 121 to 124 is insufficient, overhead time such as that caused by saving cache data can be reduced by controlling the access pattern to each pixel or cache.
[0237]
In addition, when the line shortage is eliminated by processing in units of image blocks, the processing can be executed with the same efficiency as with an arbitrary access pattern, so that even when the number of cache lines is insufficient, the corner turn process can be executed with the same efficiency as when the number of lines is sufficient.
[0238]
Embodiment 13.
FIG. 40 is a block diagram showing the structure of a parallel image processing apparatus according to Embodiment 13 of the present invention. In the figure, reference numeral 27 denotes a multiple size compatible memory securing means. When corner turn processing is performed on images having a plurality of image sizes, it interpolates pixels for each image for which the cache line size is not an integral multiple of the pixel size, calculates for each image the minimum image size that is an integral multiple of the calculated image block and contains the processing target image, selects the maximum image size from among the calculated minimum image sizes, and secures an area of the selected maximum image size on the shared memory 125. The rest is the same as the configuration shown in FIG.
[0239]
FIG. 41 is a flowchart showing the processing of the parallel image processing apparatus according to the thirteenth embodiment, and FIG. 42 is a diagram showing a method for interpolating pixel gaps.
[0240]
Next, the operation will be described.
In step ST81 of FIG. 41, the multiple size compatible memory securing means 27 obtains cache information from the cache information setting unit 13, selects one processing target image in step ST82, and in step ST83 obtains the image information of the selected image from the image information setting means 11.
[0241]
In step ST84, the multiple size correspondence memory securing unit 27 checks whether the cache line size is an integral multiple of the pixel size from the obtained cache information and image information. Here, if the pixel size is larger than the cache line size, the check is performed for the minimum number of lines including the pixel. When the cache line size is an integral multiple of the pixel size, in step ST86, the memory allocation unit 14 calculates an image block for the image data in accordance with an instruction from the multiple size correspondence memory securing unit 27.
[0242]
If the cache line size is not an integral multiple of the pixel size, in step ST85, the multiple size compatible memory securing means 27 inserts a fixed gap into the processing target image, as shown in the figure, and changes the processing target image so that the cache line size becomes an integral multiple of the pixel size. When the pixel size is larger than the cache line size, this operation is performed with respect to the minimum number of lines that contains a pixel.
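One possible reading of this gap insertion is sketched below: each pixel is padded up to the smallest size that divides the cache line size evenly, and a pixel larger than one line is padded against the smallest whole number of lines that contains it. The padding rule and the name `padded_pixel_size` are assumptions made for illustration; the text itself does not state how the gap size is chosen.

```python
def padded_pixel_size(pixel_size, line_size):
    """Smallest padded pixel size (bytes) that divides the covering line span."""
    lines_per_pixel = -(-pixel_size // line_size)  # minimum lines containing a pixel
    span = lines_per_pixel * line_size
    for candidate in range(pixel_size, span + 1):
        if span % candidate == 0:
            return candidate  # always reached, because span divides itself


print(padded_pixel_size(6, 32))   # hypothetical 6-byte pixel, 32-byte line
print(padded_pixel_size(8, 32))   # already divides the line size, so no gap
print(padded_pixel_size(48, 32))  # pixel larger than one line
```

In the first hypothetical case a 2-byte gap per pixel makes four padded pixels fill one 32-byte line exactly, so no pixel straddles a line boundary.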
[0243]
After changing the processing target image, in step ST86, the memory allocation unit 14 calculates an image block of the changed processing target image in accordance with an instruction from the multiple size compatible memory securing means 27. In step ST87, the multiple size compatible memory securing means 27 calculates a minimum image size that is an integral multiple of the image block and includes the entire processing target image.
[0244]
After calculating the minimum image size, in step ST88, the multiple size compatible memory securing means 27 checks whether any image to be processed remains. If one remains, the process returns to step ST82, one of the remaining images is selected, and the series of processes from steps ST83 to ST88 is executed. If no image to be processed remains in step ST88, in step ST89, the multiple size compatible memory securing means 27 selects the maximum image size from all the calculated minimum image sizes. In step ST90, the multiple size compatible memory securing means 27 secures an area of the selected image size on the shared memory 125. In step ST91, the multiple size compatible memory securing means 27 selects one image to be processed. In the subsequent steps ST1 to ST5, the memory allocation unit 14 performs the same process as in the first embodiment.
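Steps ST87 and ST89 can be illustrated with a minimal sketch. It assumes square image blocks whose side is given in pixels, and it interprets "maximum image size" as the per-axis maximum so that every image fits in the secured area; both the interpretation and the function names are assumptions, and the image sizes are made-up values.

```python
def minimum_image_size(image_size, block_side):
    """Smallest (width, height) that contains the image and is a multiple of the block side."""
    def round_up(n):
        return -(-n // block_side) * block_side  # ceiling to a multiple of block_side

    width, height = image_size
    return round_up(width), round_up(height)


def area_to_secure(images):
    """images: list of ((width, height), block_side) pairs, all in pixels."""
    minimum_sizes = [minimum_image_size(size, side) for size, side in images]
    # Assumption: take the maximum per axis so that every image fits in the area.
    return (max(w for w, _ in minimum_sizes), max(h for _, h in minimum_sizes))


images = [((1000, 600), 4), ((2050, 1030), 4), ((510, 510), 4)]
secured = area_to_secure(images)
print(secured)
```

Securing one area sized for the largest rounded-up image means the same region of the shared memory 125 can be reused for every image in the set.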
[0245]
As in the above-described embodiments, one way to realize this is to implement the image processing program 1 as a program in which the securing of the area on the shared memory 125 by the multiple size compatible memory securing means 27 and the method of accessing the interpolated image can be set, and to set the setting values from the parallel image processing apparatus 3 side.
[0246]
Alternatively, the image processing program 1 may be implemented so as to operate on a designated area, with the parallel image processing apparatus 3 side managing the securing of the area on the shared memory 125 and the method of accessing the interpolated image. The apparatus then designates the operation area to the image processing program 1 so that the correct area is accessed without the image processing program 1 needing to be aware of these details.
[0247]
In addition, when data can be read into the cache only from a specific alignment, that is, from an address that is an integral multiple of the cache line size, the multiple size compatible memory securing means 27 secures the area on the shared memory 125 so that the beginning of the image lies at a position from which it can be read into the cache.
[0248]
In this embodiment, a plurality of images are targeted, but a single image may be targeted instead. Also, in this embodiment the multi-size compatible memory securing unit 27 is added to the parallel image processing apparatus of the first embodiment, but it may also be added to the parallel image processing apparatuses of the second to twelfth embodiments.
[0249]
In this embodiment, the cache consists of a primary cache only. However, even with a multi-level cache configuration, the image block of the highest cache may be calculated by the method of Embodiment 8 and used to calculate the minimum image size.
[0250]
As described above, according to the thirteenth embodiment, when corner turn processing is performed on images of a plurality of sizes, the minimum image size that is an integral multiple of the calculated image block and includes the processing target image is calculated for each image, the maximum image size is selected from among the calculated minimum image sizes, and an area of the selected maximum image size is secured on the shared memory 125. As a result, the overhead time due to the cache can be reduced even when a plurality of images of different sizes are processed.
[0251]
Also, when the cache line size is not an integer multiple of the pixel size, interpolating the pixels ensures that each access to pixel data is handled on the cache, so that the image processing as a whole can be executed efficiently.
[0252]
Embodiment 14.
FIG. 43 is a block diagram showing the configuration of a parallel image processing apparatus according to Embodiment 14 of the present invention. In the figure, reference numeral 28 denotes a multi-size compatible access control unit. When the area of the shared memory 125 secured by the multi-size compatible memory securing unit 27 is used to process an image smaller than the image for which the area was secured, this unit calculates, for the selected small-size image, the range of the shared memory 125 to be used for corner turn processing in units of image blocks, and specifies the processing target area and the access procedure to each of the processors 121 to 124 so that the corner turn processing is executed within the calculated range. The other components are the same as those shown in FIG. 40 of the thirteenth embodiment.
[0253]
FIG. 44 is a flowchart showing the processing of the parallel image processing apparatus according to the fourteenth embodiment, and FIG. 45 is a diagram showing an example of the memory use range in the fourteenth embodiment.
[0254]
Next, the operation will be described.
In this embodiment, the process shown in the flowchart of FIG. 44 is performed after the process shown in the flowchart of FIG. 41 of the thirteenth embodiment. In step ST101 of FIG. 44, the multi-size compatible access control unit 28 obtains the size of the area of the shared memory 125 secured by the multi-size compatible memory securing unit 27. In step ST102, the multi-size compatible access control unit 28 selects an image to be processed. Selection methods include an instruction by user input and receiving notification from the image processing program 1 of the image selected as the processing target.
[0255]
When the selected image is confirmed, in step ST103, the memory allocating unit 14 calculates an image block for the selected image in the same manner as in the first embodiment in accordance with an instruction from the multiple size compatible access control unit 28.
[0256]
In step ST104, the multi-size compatible access control unit 28 receives the calculated image block value and, from the size of the memory area secured by the multi-size compatible memory securing unit 27, determines the range of the shared memory 125 to be used for executing the corner turn processing. FIG. 45 shows an example of this determination. If the secured area of the shared memory 125 is simply used from its beginning, an area that protrudes from the image blocks may be used, as shown in “when used from the beginning of the shared memory area” in FIG. 45.
[0257]
On the other hand, as shown in the thirteenth embodiment, the memory area secured by the multi-size compatible memory securing unit 27 is guaranteed, for every image to be processed, to contain an area that includes that image and is an integral multiple of the image block. For this reason, the multi-size compatible access control unit 28 determines, as the area in which to perform the corner turn processing for the selected image, an area that includes the processing target image and is an integral multiple of the image block.
[0258]
The “use example in this apparatus” shown in FIG. 45B is an example of the area selected for executing the corner turn processing. After selecting this area, the multi-size compatible access control unit 28 obtains the number of processors to be used in step ST105, and in step ST106 calculates the processing area and access pattern to be allocated to each processor. Then, as in Embodiment 1, the processing of step ST5 is performed.
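Steps ST104 to ST106 can be sketched roughly as below. The sketch assumes that the secured area is described by its element count, that the use range is the selected image padded to whole image blocks, and that blocks are handed to processors round-robin; all names are hypothetical, and the access pattern actually produced by the access control unit 28 may differ.

```python
def use_range(secured_elems, rows, cols, block):
    """Block-aligned range (rows, cols) used for the selected image."""
    def pad(n):
        return -(-n // block) * block  # round up to a multiple of block
    used_rows, used_cols = pad(rows), pad(cols)
    if used_rows * used_cols > secured_elems:
        raise ValueError("selected image does not fit in the secured area")
    return used_rows, used_cols


def assign_blocks(used_rows, used_cols, block, n_procs):
    """Give each processor a list of block origins (row, col), round-robin."""
    blocks = [(r, c)
              for r in range(0, used_rows, block)
              for c in range(0, used_cols, block)]
    return [blocks[p::n_procs] for p in range(n_procs)]


rows_used, cols_used = use_range(112 * 64, 30, 200, 16)
plan = assign_blocks(rows_used, cols_used, 16, 4)
print(rows_used, cols_used, [len(p) for p in plan])
```

Because the use range is a whole number of image blocks, no processor is handed a partial block that would straddle the edge of the range.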
[0259]
In this embodiment, the multi-size compatible access control unit 28 is added to the parallel image processing apparatus of the thirteenth embodiment, which is based on the first embodiment. However, the multi-size compatible memory securing unit 27 and the multi-size compatible access control unit 28 may also be added to the parallel image processing apparatuses of the second to twelfth embodiments.
[0260]
Further, as in the thirteenth embodiment, a multi-level cache configuration may be used. In this case, it is effective to set the areas of the shared memory 125 in order from the lower-level cache, in conjunction with the multi-layer image block memory securing unit 24 of the tenth embodiment.
[0261]
As described above, according to the fourteenth embodiment, when the large area of the shared memory 125 secured for a plurality of images is used to process an image smaller than the image for which the area was secured, the range of the shared memory 125 to be used for corner turn processing in units of image blocks is calculated for the selected small-size image, and the processing target area and the access procedure are specified to each of the processors 121 to 124 so that the corner turn processing is executed efficiently within the calculated range. As a result, the overhead time due to the cache can be reduced.
[0262]
Further, corner turn processing can be executed efficiently while the area of the shared memory 125, once secured, is shared by images of a plurality of sizes.
[0263]
Embodiment 15.
FIG. 46 is a block diagram showing the configuration of a parallel image processing apparatus according to Embodiment 15 of the present invention. In the figure, reference numeral 29 denotes a preprocessing-compatible memory allocation unit that obtains the contents of the pre-execution process executed before the corner turn process, the image information set in the image information setting unit 11, the information on the number of processors 121 to 124 set in the used processor number setting unit 12, and the cache information set in the cache information setting unit 13, and that allocates the target area of the shared memory 125 to each of the processors 121 to 124 so that the pre-execution process and the corner turn process are executed simultaneously based on the image block unit calculated by the memory allocation unit 14 and the loads of the processors 121 to 124 are equalized. The other components are the same as those shown in FIG. 1 of the first embodiment.
[0264]
FIG. 47 is a flowchart showing the processing of the parallel image processing apparatus according to the fifteenth embodiment, FIG. 48 is a diagram showing the allocation of processing to the processors 121 to 124 and the processing stages, and FIG. 49 is a diagram showing an example of the access pattern when each of the processors 121 to 124 performs the processing of each stage.
[0265]
Next, the operation will be described.
In step ST111 of FIG. 47, the memory allocation unit 14 calculates an image block in accordance with an instruction from the preprocessing-compatible memory allocation unit 29. In step ST112, the pre-processing corresponding memory allocation unit 29 obtains the contents of the pre-execution process executed before the corner turn process. As the obtaining method, there are an instruction by a user input, a method of receiving a notification from the image processing program 1 about a target process, and the like.
[0266]
In step ST113, the preprocessing-compatible memory allocation unit 29 checks whether the obtained pre-execution process is a pixel-unit process. If it is, in step ST114 the preprocessing-compatible memory allocation unit 29 obtains the number of processors 121 to 124 from the used processor number setting unit 12.
[0267]
In step ST115, the preprocessing-compatible memory allocation unit 29 instructs the image processing program 1 to access the area of the shared memory 125 for the unprocessed image in units of image blocks, as in the corner turn process, to execute the pre-execution process on each pixel in the accessed area, and to write the result of the pre-execution process to the processing result image pixel by pixel, so that the pre-execution process and the corner turn process are performed simultaneously based on the image block unit.
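As an illustration of step ST115, the following sketch fuses a pixel-unit pre-execution process with a blocked transpose (corner turn). It is a single-threaded sketch on nested lists; `pixel_op` stands in for whatever pixel-unit process is registered, and the function name is hypothetical.

```python
def fused_pixel_corner_turn(src, block, pixel_op):
    """Visit src in block x block tiles, apply pixel_op to each pixel,
    and write the result to the transposed position in dst."""
    rows, cols = len(src), len(src[0])
    dst = [[None] * rows for _ in range(cols)]
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            # One image block: read, process, and write transposed.
            for r in range(r0, min(r0 + block, rows)):
                for c in range(c0, min(c0 + block, cols)):
                    dst[c][r] = pixel_op(src[r][c])
    return dst


image = [[1, 2, 3],
         [4, 5, 6]]
print(fused_pixel_corner_turn(image, 2, lambda v: v * 10))
```

Each pixel is read once and written once, so the pre-execution process does not need a separate pass over the image before the corner turn starts.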
[0268]
If the pre-execution process is not a pixel-unit process in step ST113, then in step ST116 the preprocessing-compatible memory allocation unit 29 selects a line-unit processing pattern. Then, as shown in FIG. 48, the preprocessing-compatible memory allocation unit 29 calculates, as the assigned area that serves as the unit of allocation to each processor, a column of the pre-processing image and a row of the processing result image, each having the width of the image block. In FIG. 48, the number of processors is M. The assigned areas of processor (1), processor (2), ..., processor (M) in FIG. 48 are the areas assigned to virtual processors (1) to (M).
[0269]
Further, the preprocessing-compatible memory allocation unit 29 calculates, as the processing portion to be executed in each stage, one column of the processing result image corresponding to one row of the pre-processing image within the image block width, and defines these as the processing portions of the first to Nth stages. The virtual processors (1) to (M) are assumed to execute the pre-execution process and the corner turn process stage by stage, in order from the first stage to the Nth stage.
[0270]
The processing of each stage will be described with reference to FIG. 49.
When the pre-execution process accesses data in the range or azimuth direction, SAR image processing requires the data of one full line in that direction to calculate one pixel. In the example of FIG. 49, the data of all pixels in row (a) of the pre-processing image are required to calculate pixel A1. Therefore, in the processing of each stage, the results A1, A2, A3, and A4 are obtained by performing the pre-execution process four times using the data of all pixels in row (a) of the pre-processing image. Doing the same for rows (b), (c), and (d) yields the processing result for the first-stage portion of virtual processor (1).
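The per-stage processing described above can be sketched as follows, assuming a line-unit pre-execution process modeled as a function `line_op(line, c)` that computes output pixel `c` from a full line. The names and the prefix-sum example are hypothetical and serve only to show the data flow.

```python
def line_stage(src, dst, row0, col0, block, line_op):
    """One stage: for each of `block` rows starting at row0, compute
    `block` output pixels from the whole row and write them transposed."""
    for r in range(row0, min(row0 + block, len(src))):
        line = src[r]  # the full line is needed for every output pixel
        for c in range(col0, min(col0 + block, len(line))):
            dst[c][r] = line_op(line, c)


def prefix_sum(line, c):
    """Example line-unit process: sum of the line up to and including c."""
    return sum(line[:c + 1])


src = [[1, 2],
       [3, 4]]
dst = [[None, None], [None, None]]
line_stage(src, dst, 0, 0, 2, prefix_sum)
print(dst)
```

The same line is reused for all output pixels of a row within the block, which is why keeping that line in the cache for the duration of the stage matters.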
[0271]
After calculating the assigned areas to be allocated to the virtual processors (1) to (M) and the processing portions of the first to Nth stages, the preprocessing-compatible memory allocation unit 29 obtains, in step ST117, the image information from the image information setting unit 11, the number of processors 121 to 124 to be used from the used processor number setting unit 12, and the cache information from the cache information setting unit 13.
[0272]
In step ST118, the preprocessing-compatible memory allocation unit 29 calculates how processing is to be allocated to each processor. Here, it is first checked whether the cache line size is sufficient to execute the processing of each stage of the assigned areas allocated to the virtual processors (1) to (M). If there is enough room to hold all the data in the cache while processing, no instruction is given regarding the access method. If the cache line size is insufficient and cache data would be saved out of the cache, the data access method is specified for the processing of each stage.
[0273]
Here, the access method that minimizes the overhead is calculated from the cache data saving method and the time required for a cache read miss. As an example, as shown in FIG. 49, row (a) may be processed by alternately accessing (a1) and (a2). Further, the assigned areas of the virtual processors (1) to (M) are allocated so that the processing load is evenly distributed among the processors.
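One simple way to spread the assigned areas of the virtual processors evenly over the physical processors is to give each processor a contiguous run whose length differs by at most one. This is only a sketch under that assumption, with a hypothetical function name; the specification does not state the exact distribution rule.

```python
def split_evenly(n_items, n_procs):
    """Return (start, end) index ranges, one per processor, whose
    lengths differ by at most one and that cover 0..n_items."""
    base, extra = divmod(n_items, n_procs)
    ranges, start = [], 0
    for p in range(n_procs):
        count = base + (1 if p < extra else 0)  # first `extra` get one more
        ranges.append((start, start + count))
        start += count
    return ranges


# Ten assigned areas spread over four processors.
print(split_evenly(10, 4))
```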
[0274]
In step ST119, the preprocessing memory allocation unit 29 instructs the image processing program 1 to allocate the area of the shared memory 125 to execute the pre-execution process and the corner turn process according to the allocation method calculated above.
[0275]
In this embodiment, SAR image reproduction processing is used as the image processing. However, the present invention can also be applied to image processing other than SAR image reproduction processing, provided that it consists of processing in the row or column direction of the image and pixel-by-pixel processing and executes processing corresponding to corner turn processing.
[0276]
In this embodiment, the preprocessing-compatible memory allocation unit 29 is added to the parallel image processing apparatus of the first embodiment. However, the preprocessing-compatible memory allocation unit 29 may also be added to the parallel image processing apparatuses of the second to fourteenth embodiments.
[0277]
In the allocation of processing to the processors 121 to 124, the image blocks and assigned areas sometimes cannot be divided equally, depending on the number of processors used. In this case, by applying the methods shown in the other embodiments, in which an image block or a band of image blocks is executed by a plurality of the processors 121 to 124, the image blocks and assigned areas can easily be processed in parallel by a plurality of processors.
[0278]
As described above, according to the fifteenth embodiment, the pre-execution process and the corner turn process are executed simultaneously based on the image block unit calculated for the corner turn process, and the target area of the shared memory 125 is allocated to each of the processors 121 to 124 so that their processing loads are equalized. As a result, the corner turn process does not need to wait for completion of the pre-execution process, and the overhead time of the entire image processing can be reduced.
[0279]
Embodiment 16.
FIG. 50 is a block diagram showing the configuration of a parallel image processing apparatus according to the sixteenth embodiment of the present invention. In the figure, reference numeral 30 denotes a post-processing-compatible memory allocation unit that obtains the contents of the post-execution process executed after the corner turn process, the image information set in the image information setting unit 11, the information on the number of processors 121 to 124 set in the used processor number setting unit 12, and the cache information set in the cache information setting unit 13, and that allocates the target area of the shared memory 125 to each of the processors 121 to 124 so that the corner turn process and the post-execution process are executed simultaneously based on the image block unit calculated by the memory allocation unit 14 and the loads of the processors 121 to 124 are equalized. The other components are the same as those shown in FIG. 1 of the first embodiment.
[0280]
FIG. 51 is a flowchart showing the processing of the parallel image processing apparatus according to the sixteenth embodiment, and FIG. 52 is a diagram showing an example of processing allocation to each processor and an access pattern in each processor. FIG. 53 is a diagram showing a processing pattern in units of lines.
[0281]
Next, the operation will be described.
In step ST131, the memory allocation unit 14 calculates an image block in accordance with an instruction from the post-processing correspondence memory allocation unit 30. In step ST132, the post-processing correspondence memory allocating means 30 obtains the contents of the post-execution process executed after the corner turn process. The acquisition method is the same as that of the preprocessing-compatible memory allocation unit 29 of the fifteenth embodiment.
[0282]
In step ST133, the post-processing-compatible memory allocation unit 30 checks whether the obtained post-execution process is a pixel-unit process. If it is, in step ST134 the post-processing-compatible memory allocation unit 30 obtains the number of processors 121 to 124 from the used processor number setting unit 12.
[0283]
In step ST135, the post-processing-compatible memory allocation unit 30 instructs the image processing program 1 to access the area of the shared memory 125 for the unprocessed image in units of image blocks, as in the corner turn process, to execute the post-execution process on each pixel in the accessed area, and to write the result of the post-execution process to the processing result image pixel by pixel, so that the corner turn process and the post-execution process are performed simultaneously based on the image block unit.
[0284]
If the post-execution process is not a pixel-unit process in step ST133, then in step ST136 the post-processing-compatible memory allocation unit 30 selects a line-unit processing pattern. Then, as shown in FIG. 52, the post-processing-compatible memory allocation unit 30 calculates, as the assigned area that serves as the unit of allocation to each of the processors 121 to 124, a column of the pre-processing image and a row of the processing result image, each having the width of the image block. In FIG. 52, the number of processors is M. The assigned areas of processor (1), processor (2), ..., processor (M) in FIG. 52 are the areas assigned to virtual processors (1) to (M).
[0285]
Note that, for the processing portion executed in each stage, the data written by the corner turn process can be used as it is. The post-execution process is assumed to be executed in units of one line of the processing result image, immediately after the result of the corner turn process has been written for the calculated assigned area.
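A minimal sketch of this ordering is given below: a band of source columns is corner-turned into result lines, and a line-unit post-execution process is applied to each line as soon as it has been written. `post_line_op` and the function name are hypothetical placeholders.

```python
def corner_turn_then_post(src, col0, block, post_line_op):
    """Corner-turn columns col0..col0+block-1 of src into result lines
    and apply post_line_op to each line right after it is written."""
    n_rows, n_cols = len(src), len(src[0])
    result = {}
    for c in range(col0, min(col0 + block, n_cols)):
        # Corner turn: source column c becomes result line c.
        line = [src[r][c] for r in range(n_rows)]
        # Post-execution process reuses the freshly written line.
        result[c] = post_line_op(line)
    return result


src = [[1, 2, 3],
       [4, 5, 6]]
print(corner_turn_then_post(src, 0, 2, lambda line: line[::-1]))
```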
[0286]
In step ST137, the post-processing correspondence memory allocation unit 30 acquires the image information from the image information setting unit 11, the number of processors 121 to 124 from the used processor number setting unit 12, and the cache information from the cache information setting unit 13.
[0287]
In step ST138, the post-processing-compatible memory allocation unit 30 calculates how processing is to be allocated to each processor. Here, it is first checked whether the cache line size is sufficient to execute the processing of each stage of the assigned areas allocated to the virtual processors (1) to (M). If there is enough room to hold all the data in the cache while processing, no instruction is given regarding the access method. If the cache line size is insufficient and cache data would be saved out of the cache, the data access method is specified.
[0288]
Here, as shown in FIG. 53, an access method that minimizes the overhead is calculated from the cache data saving method and the time required for the cache read miss. Further, the assigned areas are calculated for the virtual processors (1) to (M) so that the processing load is evenly distributed to the processors.
[0289]
In step ST139, the post-processing correspondence memory allocation unit 30 instructs the image processing program 1 to allocate the area of the shared memory 125 to execute the corner turn process and the post-execution process according to the allocation method calculated above.
[0290]
In this embodiment, the post-processing-compatible memory allocation unit 30 is added to the parallel image processing apparatus of the first embodiment. However, the post-processing-compatible memory allocation unit 30 may also be added to the parallel image processing apparatuses of the other embodiments.
[0291]
As in the fifteenth embodiment, the present invention can also be applied to image processing other than SAR image reproduction processing. Further, parallel execution by a plurality of processors in the image block and the assigned area can be easily performed as in the fifteenth embodiment.
[0292]
As described above, according to the sixteenth embodiment, the corner turn process and the post-execution process are executed simultaneously based on the image block unit calculated for the corner turn process, and the target area of the shared memory 125 is allocated to each of the processors 121 to 124 so that their processing loads are equalized. As a result, the post-execution process does not need to wait for completion of the corner turn process, and the overhead time of the entire image processing can be reduced.
[0293]
Embodiment 17.
FIG. 54 is a block diagram showing the configuration of a parallel image processing apparatus according to Embodiment 17 of the present invention. In the figure, 29 is the preprocessing-compatible memory allocation unit of the fifteenth embodiment, 30 is the post-processing-compatible memory allocation unit of the sixteenth embodiment, and 31 is a pre/post-process selection unit that selects the pre-execution process or post-execution process to be executed simultaneously with the corner turn process. The other components are the same as those shown in FIG. 1 of the first embodiment. FIG. 55 is a flowchart showing the processing of the parallel image processing apparatus according to the seventeenth embodiment.
[0294]
Next, the operation will be described.
In SAR image reproduction processing, in addition to the image size, the reproduction method may differ depending on the type of platform that captures the SAR image, such as an artificial satellite or an aircraft, for example in how the motion of the platform is handled during reproduction. In this case, the SAR image reproduction processing apparatus is required to flexibly support a plurality of reproduction methods.
[0295]
In step ST151 of FIG. 55, the pre/post-process selection unit 31 registers a plurality of types of pre-execution processes and post-execution processes. Registration methods include an instruction by user input, receiving a notification from the image processing program 1, and obtaining the processes automatically by analyzing the image processing program 1.
[0296]
In step ST152, the pre/post-process selection unit 31 selects, from the registered processes, the process to be executed simultaneously with the corner turn process. Selection methods include an instruction by user input and receiving a notification from the image processing program 1. Another method is to record past execution times and execution methods and to select, from this information, the process expected to take the shortest time.
[0297]
In step ST153, the pre/post-process selection unit 31 checks whether the selected process is a pre-execution process. If it is, in step ST154 the pre/post-process selection unit 31 instructs the preprocessing-compatible memory allocation unit 29 to execute the selected pre-execution process. Subsequent processing is the same as in the fifteenth embodiment.
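The history-based selection mentioned in step ST152 could be sketched as follows. The sketch assumes that the expected time of a process is estimated as the mean of its recorded execution times; this estimator and the names used are assumptions, not something the specification prescribes.

```python
def select_fastest(history):
    """history maps a registered process name to its past execution
    times (seconds); return the name with the smallest mean time."""
    means = {name: sum(times) / len(times)
             for name, times in history.items() if times}
    if not means:
        raise ValueError("no execution history recorded")
    return min(means, key=means.get)


past = {"pre_execution": [2.0, 2.4], "post_execution": [1.9, 2.1]}
print(select_fastest(past))
```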
[0298]
If it is not a pre-execution process in step ST153, the pre- and post-process selection means 31 instructs the post-processing corresponding memory allocation means 30 to execute the selected post-execution process in step ST155. Subsequent processing is the same as that in the sixteenth embodiment.
[0299]
In this embodiment, SAR image reproduction processing is used as the image processing. However, the present invention can also be applied to image processing other than SAR image reproduction processing, provided that it consists of processing in the row or column direction of the image and pixel-by-pixel processing and executes processing corresponding to corner turn processing.
[0300]
In this embodiment, the pre- and post-processing selection means 31, the pre-processing corresponding memory allocation means 29, and the post-processing corresponding memory allocation means 30 are added to the parallel image processing apparatus of the first embodiment. Also in the parallel image processing apparatus of the fourteenth embodiment, the pre- and post-processing selection means 31, the pre-processing corresponding memory allocation means 29, and the post-processing corresponding memory allocation means 30 may be added.
[0301]
As described above, according to the seventeenth embodiment, the pre-execution process or the post-execution process to be executed simultaneously with the corner turn process can be selected. Therefore, even in SAR image reproduction processing that requires selecting among a plurality of processing methods, an arbitrary process can be executed simultaneously with the corner turn process, and the overhead time due to waiting for completion can be reduced.
[0302]
【Effect of the Invention】
As described above, according to the present invention, image information, information on the number of processors, and cache information are obtained; an image block of a square processing area whose side length is one cache line size and which minimizes the overhead time required for cache write misses and read misses is calculated; and the target area of the shared memory is allocated so that each processor performs corner turn processing for each calculated image block. This reduces cache write misses and read misses, and therefore has the effect of reducing the overhead time due to the cache.
[0303]
According to the present invention, when the image size is not an integral multiple of the calculated image block, the image area is expanded and secured on the shared memory so that the image size becomes an integral multiple of the image block. This has the effect of reducing the overhead time due to the cache.
[0304]
According to the present invention, when the image size does not become an integral multiple of the calculated image block, the row or column of the image block width is calculated as the band of the image block, and the access direction of the image is designated by the calculated image block band. By doing so, the cache overhead time can be reduced.
[0305]
According to the present invention, when the image size is not an integral multiple of the calculated image block, a row or column of image block width is calculated as a band of image blocks, and the access direction of the image is specified by the calculated band. When the number of calculated bands is not an integral multiple of the number of processors, a band is divided and specified to be processed by a plurality of processors. This has the effect that, in addition to reducing the overhead time due to the cache, efficient corner turn processing can be executed.
[0306]
According to the present invention, when the image size does not become an integral multiple of the image block, a plurality of correction methods are provided, and determination conditions for the correction method are set, so that the overhead time due to the cache can be reduced, and a plurality of correction methods are provided. An optimum method can be selected from the coping methods, and an efficient corner turn process can be executed.
[0307]
According to the present invention, the time constraint given as the requirement specification and the time constraint condition such as the processing time of each processor are set, and the number of the processors to be actually used is specified in the image processing program. The optimum number of processors can be calculated based on the information on the time constraint condition, and an efficient corner turn process can be executed with the calculated number of processors.
[0308]
According to the present invention, when the number of cache lines necessary for image block processing is insufficient, an access pattern to an image that minimizes the number of times cache data is saved is calculated and specified in the image processing program. This has the effect of reducing the overhead time.
[0309]
According to the present invention, when corner turn processing is performed on images of a plurality of sizes, pixel interpolation is performed for each image when the cache line size is not an integral multiple of the pixel size, the minimum image size that is an integral multiple of the calculated image block and includes the processing target image is calculated, the maximum image size is selected from among the calculated minimum image sizes, and an area of the selected maximum image size is secured on the shared memory. This has the effect of reducing the overhead time due to the cache even when a plurality of images of different sizes are processed.
[0310]
According to the present invention, when the secured area of the shared memory is used to process an image smaller than the image for which the area was secured, the range of the shared memory used for corner turn processing in units of image blocks is calculated for the selected small-size image, and the processing target area and access procedure are specified to each processor so that corner turn processing is executed within the calculated range. This has the effect that the overhead time due to the cache can be reduced and that corner turn processing can be executed efficiently while the shared memory area, once secured, is shared by images of a plurality of sizes.
[0311]
According to the present invention, image information and cache information are obtained, and an image block of the primary cache is calculated as a square processing area whose side length is one cache line size and which minimizes the overhead time required for primary cache write misses and read misses. Image information, processor number information, and cache information are then obtained, and, based on the calculated image block of the primary cache, an image block is calculated for each cache level as a square processing area that contains a plurality of image blocks of the lower-level cache and whose side length is one cache line size. The image processing program is instructed to allocate the target area of the shared memory to each processor so that corner turn processing is executed for each calculated image block of the highest cache and the number of image blocks of the highest cache executed is equal among the processors. This has the effect of reducing the overhead time due to the cache and enabling efficient corner turn processing.
[0312]
According to the present invention, when the image size does not become an integer multiple of the image block of the highest cache, the image area is expanded and secured on the shared memory so that the image size becomes an integer multiple of the image block of the highest cache. By doing so, the cache overhead time can be reduced.
[0313]
According to the present invention, when the line size of the upper cache is not an integer multiple of the image block of the lower cache, the area of the lower cache's image block is expanded and secured on the shared memory so that the line size of the upper cache becomes an integer multiple of the image block of the lower cache, which reduces the overhead time due to the cache.
[0314]
According to the present invention, when the image size is not an integral multiple of the calculated image block of the highest-level cache, a row or column of highest-level cache image blocks is calculated as an image block band, and the image access direction is specified in units of the calculated bands. When the calculated number of image block bands is not an integral multiple of the number of processors, an image block band is specified to be divided and processed by a plurality of processors. As a result, the overhead time due to the cache is reduced and corner turn processing can be executed efficiently.
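One way to picture the band handling is the Python sketch below. It is an interpretation rather than the claimed procedure: the names `band_ranges` and `schedule_bands` are hypothetical, and the policy of giving whole bands to processors in full rounds and marking only the leftover bands for division is an assumption.

```python
from typing import List, Tuple


def band_ranges(size: int, block_side: int) -> List[Tuple[int, int]]:
    """Split one image dimension into bands of block width; the last band may be narrower."""
    return [(start, min(start + block_side, size))
            for start in range(0, size, block_side)]


def schedule_bands(n_bands: int, n_procs: int) -> Tuple[List[List[int]], List[int]]:
    """Return (whole bands per processor, leftover bands to be divided among processors)."""
    whole = n_bands - n_bands % n_procs
    per_proc = [list(range(p, whole, n_procs)) for p in range(n_procs)]
    to_divide = list(range(whole, n_bands))
    return per_proc, to_divide
```

For five bands and two processors, each processor would take two whole bands and the fifth band would be divided between them.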
[0315]
According to the present invention, when the number of cache lines necessary for processing the calculated image block is insufficient, an access pattern to the image or to the lower-level cache that minimizes the number of cache data saves is specified, which reduces the overhead time.
[0316]
According to the present invention, image information and cache information are obtained, and an image block is calculated as a square processing area whose side length is one cache line size and which minimizes the overhead time required for cache write misses and read misses. The contents of the pre-execution process to be executed before the corner turn process are obtained together with the image information, the processor number information, and the cache information, and the target area of the shared memory is allocated to each processor so that the pre-execution process and the corner turn process are executed simultaneously in units of the calculated image block and the load on each processor is equalized. The corner turn process therefore does not need to wait for completion of the pre-execution process, and the overhead time of the entire image processing is reduced.
[0317]
According to the present invention, image information and cache information are obtained, and an image block is calculated as a square processing area whose side length is one cache line size and which minimizes the overhead time required for cache write misses and read misses. The contents of the post-execution process to be executed after the corner turn process are obtained together with the image information, the processor number information, and the cache information, and the target area of the shared memory is allocated to each processor so that the corner turn process and the post-execution process are executed simultaneously in units of the calculated image block and the load on each processor is equalized. The post-execution process therefore does not need to wait for completion of the corner turn process, and the overhead time of the entire image processing is reduced.
[0318]
According to the present invention, by selecting whether the pre-execution process or the post-execution process is executed simultaneously with the corner turn process, either process can be executed simultaneously with the corner turn process, and the overhead time due to waiting for completion is reduced.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a parallel image processing apparatus according to Embodiment 1 of the present invention.
FIG. 2 is a flowchart showing processing of the parallel image processing apparatus according to Embodiment 1 of the present invention.
FIG. 3 is a diagram showing a concept of basic memory allocation according to the first embodiment of the present invention.
FIG. 4 is a diagram showing a concept of cache access according to the first embodiment of the present invention.
FIG. 5 is a diagram showing a specific example of memory allocation according to the first embodiment of the present invention.
FIG. 6 is a diagram showing an example of processing one image block by a plurality of processors according to Embodiment 1 of the present invention.
FIG. 7 is a block diagram showing a configuration of a parallel image processing apparatus according to Embodiment 2 of the present invention.
FIG. 8 is a flowchart showing processing of a parallel image processing apparatus according to Embodiment 2 of the present invention.
FIG. 9 is a diagram showing an execution example of memory allocation according to Embodiment 2 of the present invention.
FIG. 10 is a block diagram showing a configuration of a parallel image processing apparatus according to Embodiment 3 of the present invention.
FIG. 11 is a flowchart showing processing of a parallel image processing apparatus according to Embodiment 3 of the present invention.
FIG. 12 is a diagram showing the processing contents in which the access direction is designated when the image size in the azimuth direction is not an integral multiple of the image block according to the third embodiment of the present invention.
FIG. 13 is a diagram showing the processing contents in which the access direction is specified when the image size in the range direction is not an integral multiple of the image block according to the third embodiment of the present invention.
FIG. 14 is a block diagram showing a configuration of a parallel image processing apparatus according to Embodiment 4 of the present invention.
FIG. 15 is a flowchart showing processing of a parallel image processing apparatus according to Embodiment 4 of the present invention.
FIG. 16 is a diagram showing the processing contents when the image size does not become an integral multiple of the band of the image block in the azimuth direction according to the fourth embodiment of the present invention.
FIG. 17 is a diagram showing processing contents when the image size does not become an integral multiple of the band of the image block in the range direction according to the fourth embodiment of the present invention.
FIG. 18 is a block diagram showing a configuration of a parallel image processing apparatus according to Embodiment 5 of the present invention.
FIG. 19 is a flowchart showing processing of a parallel image processing apparatus according to Embodiment 5 of the present invention.
FIG. 20 is a block diagram showing a configuration of a parallel image processing apparatus according to Embodiment 6 of the present invention.
FIG. 21 is a flowchart showing processing of a parallel image processing apparatus according to Embodiment 6 of the present invention.
FIG. 22 is a block diagram showing a configuration of a parallel image processing apparatus according to Embodiment 7 of the present invention.
FIG. 23 is a flowchart showing processing of the parallel image processing apparatus according to Embodiment 7 of the present invention.
FIG. 24 is a diagram showing an access pattern when the number of cache lines is insufficient according to the seventh embodiment of the present invention.
FIG. 25 is a block diagram showing a configuration of a parallel image processing apparatus according to an eighth embodiment of the present invention.
FIG. 26 is a flowchart showing processing of the parallel image processing apparatus according to the eighth embodiment of the present invention.
FIG. 27 shows a concept of memory allocation according to an eighth embodiment of the present invention.
FIG. 28 is a diagram showing an example in which an image block according to an eighth embodiment of the present invention is processed by a plurality of processors.
FIG. 29 is a block diagram showing a configuration of a parallel image processing apparatus according to Embodiment 9 of the present invention.
FIG. 30 is a flowchart showing processing of a parallel image processing apparatus according to Embodiment 9 of the present invention.
FIG. 31 is a block diagram showing a configuration of a parallel image processing apparatus according to Embodiment 10 of the present invention.
FIG. 32 is a flowchart showing processing of a parallel image processing apparatus according to Embodiment 10 of the present invention.
FIG. 33 shows a concept of a region expansion direction according to the tenth embodiment of the present invention.
FIG. 34 is a block diagram showing a configuration of a parallel image processing apparatus according to Embodiment 11 of the present invention.
FIG. 35 is a flowchart showing processing of the parallel image processing apparatus according to Embodiment 11 of the present invention.
FIG. 36 is a diagram showing the concept of an access direction designating method according to Embodiment 11 of the present invention.
FIG. 37 is a block diagram showing a configuration of a parallel image processing apparatus according to Embodiment 12 of the present invention.
FIG. 38 is a flowchart showing processing of a parallel image processing apparatus according to Embodiment 12 of the present invention.
FIG. 39 is a diagram showing an access pattern when the number of secondary cache lines is insufficient according to the twelfth embodiment of the present invention.
FIG. 40 is a block diagram showing a configuration of a parallel image processing apparatus according to Embodiment 13 of the present invention.
FIG. 41 is a flowchart showing processing of a parallel image processing apparatus according to Embodiment 13 of the present invention.
FIG. 42 is a diagram showing a method of interpolating a pixel gap according to Embodiment 13 of the present invention.
FIG. 43 is a block diagram showing a configuration of a parallel image processing apparatus according to Embodiment 14 of the present invention.
FIG. 44 is a flowchart showing processing of the parallel image processing apparatus according to Embodiment 14 of the present invention.
FIG. 45 is a diagram showing an example of a use range of a memory according to a fourteenth embodiment of the present invention.
FIG. 46 is a block diagram showing a configuration of a parallel image processing apparatus according to Embodiment 15 of the present invention.
FIG. 47 is a flowchart showing processing of the parallel image processing apparatus according to the fifteenth embodiment of the present invention.
FIG. 48 is a diagram showing processing allocation to each processor and the number of processing stages according to Embodiment 15 of the present invention.
FIG. 49 is a diagram showing an example of an access pattern when processing is executed at each processing stage in each processor according to Embodiment 15 of the present invention.
FIG. 50 is a block diagram showing a configuration of a parallel image processing apparatus according to Embodiment 16 of the present invention.
FIG. 51 is a flowchart showing processing of a parallel image processing apparatus according to Embodiment 16 of the present invention.
FIG. 52 is a diagram showing an example of processing assignment to each processor and an access pattern in each processor according to Embodiment 16 of the present invention.
FIG. 53 is a diagram showing a processing pattern in units of lines according to the sixteenth embodiment of the present invention.
FIG. 54 is a block diagram showing a configuration of a parallel image processing apparatus according to Embodiment 17 of the present invention.
FIG. 55 is a flowchart showing processing of the parallel image processing apparatus according to Embodiment 17 of the present invention.
FIG. 56 is a block diagram showing a configuration of a conventional system.
FIG. 57 is a block diagram illustrating an example of a configuration of a conventional SMP.
FIG. 58 is a diagram illustrating an example of a method for executing a SAR image reproduction process in a conventional SMP.
FIG. 59 is a diagram illustrating an example of a method for executing a SAR image reproduction process in a conventional SMP.
FIG. 60 is a diagram illustrating areas on an image allocated by a plurality of conventional processors.
FIG. 61 is a diagram illustrating positions on a memory allocated by a plurality of conventional processors.
[Explanation of symbols]
1 Image processing program, 2 Platform, 3 Parallel image processing apparatus, 11 Image information setting means, 12 Used processor number setting means, 13 Cache information setting means, 14 Memory allocation means, 15 Memory securing means, 16 Access direction designation means, 17 Multiple-write compatible access direction designation means, 18 Protrusion correction method setting means, 19 Processing time constraint setting means, 20 Execution processor number designation means, 21 Pixel access order designation means, 22 Multi-tier cache compatible memory allocation means, 23 Multi-tier cache compatible memory securing means, 24 Multi-tier image block memory securing means, 25 Multi-tier cache compatible access direction designation means, 26 Multi-tier cache compatible access pattern designation means, 27 Multiple-size compatible memory securing means, 28 Multiple-size compatible access control means, 29 Pre-processing compatible memory allocation means, 30 Post-processing compatible memory allocation means, 31 Pre- and post-processing selection means, 101 Magnetic tape, 102 Range direction compression device, 103-1, 103-2 Two-dimensional image memory, 104 Two-dimensional image memory control unit, 105 Azimuth direction compression device, 106 Magnetic tape, 107 CPU, 108 Main memory, 109 FFT device, 110 Data bus, 121, 122, 123, 124 Processor, 125 Shared memory.

Claims

A parallel image processing apparatus that instructs an image processing program, which operates on a platform including a plurality of processors and a shared memory and performs image reproduction processing including corner turn processing for transposing the row-direction and column-direction arrangement of an image, to allocate a target area of the shared memory to the plurality of processors, the apparatus comprising:
image information setting means for setting image information such as the size of the processing target image and the data size of each pixel;
used processor number setting means for setting information on the number of processors used in image processing;
cache information setting means for setting cache information such as the configuration and size of the cache installed in each processor; and
memory allocation means for obtaining the image information set in the image information setting means, the processor number information set in the used processor number setting means, and the cache information set in the cache information setting means, calculating an image block that is a square processing area whose side length is one cache line size and that minimizes the overhead time required for cache write misses and read misses, and allocating the target area of the shared memory so that each processor executes the corner turn processing in units of the calculated image block.

The parallel image processing apparatus according to claim 1, further comprising memory securing means for, when the image size is not an integral multiple of the calculated image block, expanding the image area and securing it on the shared memory so that the image size becomes an integral multiple of the image block.

The parallel image processing apparatus according to claim 1, further comprising access direction designation means for, when the image size is not an integral multiple of the calculated image block, calculating a row or column having the width of the image block as an image block band and designating the access direction of the image in units of the calculated image block band.

The parallel image processing apparatus according to claim 1, further comprising multiple-write compatible access direction designation means for, when the image size is not an integral multiple of the calculated image block, calculating a row or column having the width of the image block as an image block band, designating the access direction of the image in units of the calculated image block band, and, when the calculated number of image block bands is not an integral multiple of the number of processors set in the used processor number setting means, designating that an image block band be divided and processed by a plurality of processors.

The parallel image processing apparatus according to claim 1, further comprising:
memory securing means for expanding the image area and securing it on the shared memory so that the image size becomes an integral multiple of the calculated image block;
access direction designation means for calculating a row or column having the width of the image block as an image block band and designating the access direction of the image in units of the calculated image block band;
multiple-write compatible access direction designation means for calculating a row or column having the width of the image block as an image block band, designating the access direction of the image in units of the calculated image block band, and, when the calculated number of image block bands is not an integral multiple of the number of processors set in the used processor number setting means, designating that an image block band be divided and processed by a plurality of processors; and
protrusion correction method setting means for setting a determination condition for selecting the correction method used when the image size is not an integral multiple of the image block.

The parallel image processing apparatus according to claim 1, further comprising:
processing time constraint setting means for setting time constraint conditions such as a time constraint given as a required specification and the processing time of each processor; and
execution processor number designation means for designating the number of processors actually used in the image processing program.

The parallel image processing apparatus according to claim 1, further comprising pixel access order designation means for, when the number of cache lines required for processing an image block is insufficient, calculating an access pattern to the image that minimizes the number of cache data saves and designating it to the image processing program.

The parallel image processing apparatus according to claim 1, further comprising multiple-size compatible memory securing means for, when corner turn processing is performed on each of images having a plurality of image sizes, performing pixel interpolation on each image when the cache line size is not an integral multiple of the pixel size, calculating for each image the minimum image size that is an integral multiple of the calculated image block and contains the processing target image, selecting the maximum image size from among the calculated minimum image sizes, and securing an area of the selected maximum image size on the shared memory.

The parallel image processing apparatus according to claim 8, further comprising multiple-size compatible access control means for, when the target area of the shared memory secured by the multiple-size compatible memory securing means is used to process an image smaller than the secured image, calculating for the selected smaller image the usage range of the shared memory used for corner turn processing in units of image blocks, and designating to each processor the processing target area and access procedure so that the corner turn processing is executed within the calculated usage range.

A parallel image processing apparatus that instructs an image processing program, which operates on a platform including a plurality of processors and a shared memory and performs image reproduction processing including corner turn processing for transposing the row-direction and column-direction arrangement of an image, to allocate a target area of the shared memory to the plurality of processors, the apparatus comprising:
image information setting means for setting image information such as the size of the processing target image and the data size of each pixel;
used processor number setting means for setting information on the number of processors used in image processing;
cache information setting means for setting cache information such as the configuration and size of the multi-tier cache installed in each processor;
memory allocation means for obtaining the image information set in the image information setting means and the cache information set in the cache information setting means, and calculating an image block that is a square processing area whose side length is one cache line size and that minimizes the overhead time required for primary cache write misses and read misses; and
multi-tier cache compatible memory allocation means for obtaining the image information set in the image information setting means, the processor number information set in the used processor number setting means, and the cache information set in the cache information setting means, calculating for each cache level, based on the primary cache image block calculated by the memory allocation means, an image block that is a square processing area containing a plurality of lower-level cache image blocks and having a side length of one cache line size, and instructing the image processing program to allocate the target area of the shared memory to each processor so that the corner turn processing is executed for each calculated image block of the highest-level cache and the number of highest-level cache image blocks executed is equal among the processors.

The parallel image processing apparatus according to claim 10, further comprising multi-tier cache compatible memory securing means for, when the image size is not an integral multiple of the image block of the highest-level cache, expanding the image area and securing it on the shared memory so that the image size becomes an integral multiple of the image block of the highest-level cache.

The parallel image processing apparatus according to claim 10, further comprising multi-tier image block memory securing means for, when the line size of the upper cache is not an integral multiple of the image block of the lower cache, expanding the area of the lower cache's image block and securing it on the shared memory so that the line size of the upper cache becomes an integral multiple of the image block of the lower cache.

The parallel image processing apparatus according to claim 10, further comprising multi-tier cache compatible access direction designation means for, when the image size is not an integral multiple of the calculated image block of the highest-level cache, calculating a row or column of highest-level cache image blocks as an image block band of the highest-level cache, designating the access direction of the image in units of the calculated image block band, and, when the calculated number of image block bands is not an integral multiple of the number of processors set in the used processor number setting means, designating that an image block band be divided and processed by a plurality of processors.

The parallel image processing apparatus according to claim 10, further comprising multi-tier cache compatible access pattern designation means for, when the number of cache lines required for processing the calculated image block is insufficient, designating an access pattern to the image or to the lower-level cache that minimizes the number of cache data saves.

A parallel image processing apparatus that instructs an image processing program, which operates on a platform including a plurality of processors and a shared memory and performs image reproduction processing including corner turn processing for transposing the row-direction and column-direction arrangement of an image, to allocate a target area of the shared memory to the plurality of processors, the apparatus comprising:
image information setting means for setting image information such as the size of the processing target image and the data size of each pixel;
used processor number setting means for setting information on the number of processors used in image processing;
cache information setting means for setting cache information such as the configuration and size of the cache installed in each processor;
memory allocation means for obtaining the image information set in the image information setting means and the cache information set in the cache information setting means, and calculating an image block that is a square processing area whose side length is one cache line size and that minimizes the overhead time required for cache write misses and read misses; and
pre-processing compatible memory allocation means for obtaining the contents of the pre-execution process executed before the corner turn processing, obtaining the image information set in the image information setting means, the processor number information set in the used processor number setting means, and the cache information set in the cache information setting means, and allocating the target area of the shared memory to each processor so that the pre-execution process and the corner turn processing are executed simultaneously in units of the image block calculated by the memory allocation means and the load on each processor is equalized.

A parallel image processing apparatus that instructs an image processing program, which operates on a platform including a plurality of processors and a shared memory and performs image reproduction processing including corner turn processing for transposing the row-direction and column-direction arrangement of an image, to allocate a target area of the shared memory to the plurality of processors, the apparatus comprising:
image information setting means for setting image information such as the size of the processing target image and the data size of each pixel;
used processor number setting means for setting information on the number of processors used in image processing;
cache information setting means for setting cache information such as the configuration and size of the cache installed in each processor;
memory allocation means for obtaining the image information set in the image information setting means and the cache information set in the cache information setting means, and calculating an image block that is a square processing area whose side length is one cache line size and that minimizes the overhead time required for cache write misses and read misses; and
post-processing compatible memory allocation means for obtaining the contents of the post-execution process executed after the corner turn processing, obtaining the image information set in the image information setting means, the processor number information set in the used processor number setting means, and the cache information set in the cache information setting means, and allocating the target area of the shared memory to each processor so that the corner turn processing and the post-execution process are executed simultaneously in units of the image block calculated by the memory allocation means and the load on each processor is equalized.

A parallel image processing apparatus that instructs an image processing program, which operates on a platform including a plurality of processors and a shared memory and performs image reproduction processing including corner turn processing for transposing the row-direction and column-direction arrangement of an image, to allocate a target area of the shared memory to the plurality of processors, the apparatus comprising:
image information setting means for setting image information such as the size of the processing target image and the data size of each pixel;
used processor number setting means for setting information on the number of processors used in image processing;
cache information setting means for setting cache information such as the configuration and size of the cache installed in each processor;
memory allocation means for obtaining the image information set in the image information setting means and the cache information set in the cache information setting means, and calculating an image block that is a square processing area whose side length is one cache line size and that minimizes the overhead time required for cache write misses and read misses;
pre-processing compatible memory allocation means for obtaining the contents of the pre-execution process executed before the corner turn processing, obtaining the image information set in the image information setting means, the processor number information set in the used processor number setting means, and the cache information set in the cache information setting means, and allocating the target area of the shared memory to each processor so that the pre-execution process and the corner turn processing are executed simultaneously in units of the image block calculated by the memory allocation means and the load on each processor is equalized;
post-processing compatible memory allocation means for obtaining the contents of the post-execution process executed after the corner turn processing, obtaining the image information set in the image information setting means, the processor number information set in the used processor number setting means, and the cache information set in the cache information setting means, and allocating the target area of the shared memory to each processor so that the corner turn processing and the post-execution process are executed simultaneously in units of the image block calculated by the memory allocation means and the load on each processor is equalized; and
pre- and post-processing selection means for selecting whether the pre-execution process or the post-execution process is executed simultaneously with the corner turn processing.

A parallel image processing method for instructing allocation of a target area of a shared memory to a plurality of processors when image reproduction processing including corner turn processing for transposing the row-direction and column-direction arrangement of an image is executed using the plurality of processors and the shared memory, the method comprising:
obtaining image information such as the size of the processing target image and the data size of each pixel;
obtaining information on the number of processors used in image processing;
obtaining cache information such as the configuration and size of the cache installed in each processor;
calculating, based on the obtained image information and cache information, an image block that is a square processing area whose side length is one cache line size and that minimizes the overhead time required for cache write misses and read misses; and
allocating, based on the obtained processor number information, the target area of the shared memory so that each processor executes the corner turn processing in units of the calculated image block.

The parallel image processing method according to claim 18, wherein, when the image size is not an integral multiple of the calculated image block, the image area is expanded and secured on the shared memory so that the image size becomes an integral multiple of the image block.

The parallel image processing method according to claim 18, wherein, when the image size is not an integral multiple of the calculated image block, a row or column having the width of the image block is calculated as an image block band, and the access direction of the image is designated in units of the calculated image block band.

The parallel image processing method according to claim 18, wherein, when the image size is not an integral multiple of the calculated image block, a row or column having the width of the image block is calculated as a band of image blocks and the access direction of the image is designated in units of the calculated band, and, when the number of calculated bands is not an integral multiple of the number of processors, the bands are divided and designated so that they are processed by a plurality of processors.
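
One way to picture the band division is sketched below. The policy shown (whole bands are dealt out evenly, and each leftover band is cut into one segment per processor) is an assumption chosen for illustration; the claim only requires that bands be divided among several processors when their number is not a multiple of the processor count.

```python
from typing import List, Tuple

Segment = Tuple[int, int, int]  # (band index, first block, end block exclusive)

def assign_bands(num_bands: int, blocks_per_band: int,
                 num_procs: int) -> List[List[Segment]]:
    """Return, per processor, the band segments it should process."""
    work: List[List[Segment]] = [[] for _ in range(num_procs)]
    whole = num_bands // num_procs
    # Bands that divide evenly are given to processors as complete bands.
    for p in range(num_procs):
        for band in range(p * whole, (p + 1) * whole):
            work[p].append((band, 0, blocks_per_band))
    # Each leftover band is split into one segment per processor.
    for band in range(whole * num_procs, num_bands):
        for p in range(num_procs):
            start = p * blocks_per_band // num_procs
            end = (p + 1) * blocks_per_band // num_procs
            if end > start:
                work[p].append((band, start, end))
    return work
```

With 5 bands of 8 blocks and 4 processors, each processor receives one whole band plus a 2-block segment of the fifth band, so the block counts are equal.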

The parallel image processing method according to claim 18, wherein, when the number of cache lines necessary for processing an image block is insufficient, an access pattern to the image that minimizes the number of times cache data is evicted is calculated and specified.

The parallel image processing method according to claim 18, wherein, when corner turn processing is performed on each of a plurality of images having different image sizes, pixel interpolation is performed for each of the images when the cache line size is not an integral multiple of the pixel size, the minimum image size that is an integral multiple of the calculated image block and that contains the image to be processed is calculated for each image, the maximum image size is selected from the calculated minimum image sizes of the respective images, and an area of the selected maximum image size is secured on the shared memory.
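
The size selection for several images can be sketched as follows. Taking the maximum separately for rows and columns is an assumption (it guarantees that every padded image fits in the secured area); the pixel interpolation step is not modeled, and the function names are hypothetical.

```python
from typing import Iterable, Tuple

def round_up(n: int, side: int) -> int:
    """Round n up to the next multiple of the block side."""
    return -(-n // side) * side

def shared_area_size(image_sizes: Iterable[Tuple[int, int]],
                     side: int) -> Tuple[int, int]:
    """Area to secure: the largest of the per-image minimum padded sizes."""
    padded = [(round_up(r, side), round_up(c, side)) for r, c in image_sizes]
    return max(r for r, _ in padded), max(c for _, c in padded)
```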

The parallel image processing method according to claim 23, wherein, when the secured target area of the shared memory is used to process an image smaller than the secured image size, the range of the shared memory used for corner turn processing in units of image blocks is calculated for the smaller image, and the area to be processed and the access procedure are designated to each processor so that the corner turn process is executed within the calculated usage range.

A parallel image processing method for instructing allocation of target areas of a shared memory to a plurality of processors when an image reproduction process, including a corner turn process that transposes the row direction and the column direction of an image, is executed using the plurality of processors and the shared memory, the method comprising:
obtaining image information such as the size of the image to be processed and the data size of each pixel;
obtaining information on the number of processors used in the image processing;
obtaining cache information such as the configuration and size of the multi-level cache installed in each of the processors;
calculating, based on the obtained image information and cache information, an image block of the primary cache that is a square processing area whose side length is one cache line size and that minimizes the overhead time for primary cache write misses and read misses;
calculating, for each cache level, based on the obtained image information and cache information and on the calculated image block of the primary cache, an image block that is a square processing area whose side length is one cache line size of that level and that contains a plurality of image blocks of the lower-level cache; and
allocating, based on the obtained information on the number of processors, the target area of the shared memory to each of the processors so that the corner turn process is executed in units of the calculated image block of the highest-level cache and the number of image blocks of the highest-level cache executed is equalized among the processors.
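
The nesting of image blocks across cache levels might look like the sketch below. It assumes that the block side at each level is that level's cache line size divided by the pixel size, and that each higher-level side is an exact multiple of the level below; both the function names and these assumptions are illustrative only.

```python
from typing import Iterator, List, Tuple

def level_block_sides(line_sizes_bytes: List[int], pixel_bytes: int) -> List[int]:
    """Block side in pixels per cache level, primary cache first."""
    if any(ls % pixel_bytes for ls in line_sizes_bytes):
        raise ValueError("each line size must be a multiple of the pixel size")
    sides = [ls // pixel_bytes for ls in line_sizes_bytes]
    for lower, upper in zip(sides, sides[1:]):
        if upper % lower:
            raise ValueError("upper block must be a multiple of the lower block")
    return sides

def nested_blocks(r0: int, c0: int,
                  sides: List[int]) -> Iterator[Tuple[int, int]]:
    """Yield primary-cache block origins inside the top-level block at (r0, c0)."""
    if len(sides) == 1:
        yield (r0, c0)
        return
    outer, inner = sides[-1], sides[-2]
    for r in range(r0, r0 + outer, inner):
        for c in range(c0, c0 + outer, inner):
            yield from nested_blocks(r, c, sides[:-1])
```

Visiting all primary-cache blocks of one higher-level block before moving to the next keeps the working set within the higher-level cache line footprint, which is the motivation for the hierarchy described in the claim.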

The parallel image processing method according to claim 25, wherein, when the image size is not an integral multiple of the image block of the highest-level cache, the image area is expanded and secured on the shared memory so that the image size becomes an integral multiple of the image block of the highest-level cache.

The parallel image processing method according to claim 25, wherein, when the line size of the upper-level cache is not an integral multiple of the image block of the lower-level cache, the area of the image block of the lower-level cache is expanded and secured on the shared memory so that the line size of the upper-level cache becomes an integral multiple of the image block of the lower-level cache.

The parallel image processing method according to claim 25, wherein, when the image size is not an integral multiple of the calculated image block of the highest-level cache, a row or column of image blocks of the highest-level cache is calculated as a band of image blocks of the highest-level cache and the access direction of the image is designated in units of the calculated band, and, when the number of calculated bands is not an integral multiple of the number of processors, the bands are divided and designated so that they are processed by a plurality of processors.

The parallel image processing method according to claim 25, wherein, when the number of cache lines necessary for processing the calculated image block is insufficient, an access pattern to the image or to the lower-level cache that minimizes the number of times cache data is evicted is calculated and specified.

A parallel image processing method for instructing allocation of target areas of a shared memory to a plurality of processors when an image reproduction process, including a corner turn process that transposes the row direction and the column direction of an image, is executed using the plurality of processors and the shared memory, the method comprising:
obtaining image information such as the size of the image to be processed and the data size of each pixel;
obtaining information on the number of processors used in the image processing;
obtaining cache information such as the configuration and size of the cache installed in each of the processors;
calculating, based on the obtained image information and cache information, an image block that is a square processing area whose side length is one cache line size and that minimizes the overhead time for cache write misses and read misses;
obtaining the contents of the pre-execution process to be executed before the corner turn process; and
allocating, based on the obtained image information, the number information of the processors, the cache information, and the contents of the pre-execution process, the target area of the shared memory to each of the processors so that the pre-execution process and the corner turn process are executed simultaneously in units of the calculated image block and the load on the processors is equalized.
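
Executing another process together with the corner turn in units of image blocks can be sketched as below. The sketch assumes a pre-execution process that can be applied pixel by pixel (the callable `pre` is hypothetical); the actual pre-execution processing in the specification may operate on whole lines, and the load-balancing calculation is not modeled.

```python
from typing import Callable, List

def fused_corner_turn(image: List[List[float]], side: int,
                      pre: Callable[[float], float]) -> List[List[float]]:
    """Apply `pre` to each pixel and transpose, one image block at a time."""
    rows, cols = len(image), len(image[0])
    if rows % side or cols % side:
        raise ValueError("image size must be a multiple of the block side")
    out = [[0.0] * rows for _ in range(cols)]
    for r0 in range(0, rows, side):
        for c0 in range(0, cols, side):
            # The block is read once, processed, and written transposed,
            # so the data is touched while it is still in the cache.
            for r in range(r0, r0 + side):
                for c in range(c0, c0 + side):
                    out[c][r] = pre(image[r][c])
    return out
```

The same structure would serve for a post-execution process by applying the callable after the transposed read instead of before the write.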

A parallel image processing method for instructing allocation of target areas of a shared memory to a plurality of processors when an image reproduction process, including a corner turn process that transposes the row direction and the column direction of an image, is executed using the plurality of processors and the shared memory, the method comprising:
obtaining image information such as the size of the image to be processed and the data size of each pixel;
obtaining information on the number of processors used in the image processing;
obtaining cache information such as the configuration and size of the cache installed in each of the processors;
calculating, based on the obtained image information and cache information, an image block that is a square processing area whose side length is one cache line size and that minimizes the overhead time for cache write misses and read misses;
obtaining the contents of the post-execution process to be executed after the corner turn process; and
allocating, based on the obtained image information, the number information of the processors, the cache information, and the contents of the post-execution process, the target area of the shared memory to each of the processors so that the corner turn process and the post-execution process are executed simultaneously in units of the calculated image block and the load on the processors is equalized.