JP3588394B2

JP3588394B2 - Compile processing method

Info

Publication number: JP3588394B2
Application number: JP21571495A
Authority: JP
Inventors: 栄次山中
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1995-08-24
Filing date: 1995-08-24
Publication date: 2004-11-10
Anticipated expiration: 2015-08-24
Also published as: JPH0962636A

Description

【０００１】
【発明の属する技術分野】
本発明は、ローカルメモリを持つ複数のプロセッサエレメントから構成される分散メモリ型並列計算機用のオブジェクトコードを生成するコンパイル処理方法に関し、特に、完全自動並列化を実現するコンパイル処理方法に関する。
【０００２】
【従来の技術】
分散メモリ型並列計算機は、図１０に示すように、メモリを有する演算器（プロセッサエレメント／ＰＥ）がネットワーク機構で複数結合された形式のアーキテクチャを持つ計算機である。この分散メモリ型並列計算機では、自ＰＥ上のデータについては、通常のアクセス方式でアクセスすることが可能であるが、他のＰＥ上のデータについては、ネットワーク機構を経由してアクセスしなければならない。
【０００３】
このようなメモリアーキテクチャを対象にコンパイラで自動並列化を行う場合、データをＰＥ上に非常にうまく分散して配置する必要があるが、現在のコンパイラの自動並列化機能では、手続き部分については自動並列化を実現できるものの、データの自動分散配置を行うことは非常に難しいというのが実情である。
【０００４】
この問題点を解決するために、ＰＥ間で共有される仮想的なメモリ空間を、各ＰＥのメモリとネットワーク機構を用いて仮想的に構成することが行われている。この仮想的なメモリ空間は仮想グローバル空間と呼ばれ、各ＰＥ毎に分かれた通常のメモリ空間はローカル空間と呼ばれている。図１１に、このメモリアーキテクチャを図示する。
【０００５】
この仮想グローバル空間は、全てのＰＥからアクセスすることが可能であるが、ネットワーク機構を介するため、アクセス速度が遅い。これに対して、ローカル空間は、自ＰＥからだけアクセスすることが可能であり、他ＰＥからアクセスすることは不可能であるが、アクセス速度は速い。
【０００６】
これから、コンパイラの自動並列化において、データの自動分散配置の問題点を回避するために、仮想グローバル空間を使用する構成を採ると、データへのアクセス速度が遅くなり、性能が著しく低下するという問題点が出ることになる。一方、性能を重視して、ローカル空間を使用すると、自動並列化において困難なデータの自動分散配置を実現する技術を構築しなければならないという問題点がある。
【０００７】
そこで、従来では、自動並列化を行うために、ユーザに対してデータの分散配置形態を指示させて、この指示に基づいてデータを分散配置する構成を採って、自動並列化機能で、なるべくローカル空間を使用するようにしつつ、ローカル空間を使用できない部分だけ、仮想グローバル空間を使用するようにして並列化を行うという構成を採っている。
【０００８】
【発明が解決しようとする課題】
しかしながら、このような従来技術に従っていると、ユーザがデータの分散配置を指示しなければならないことから、完全な自動並列化になっていないという問題点があった。そして、プログラムの内容によって仮想グローバル空間を使用せざるを得ない場合があり、当たり外れが激しく、コンスタントな性能向上を期待できないという問題点があった。そして、データを分散配置することから、並列化の対象となるループが、回転間のデータ依存関係の無いものに限定されてしまうという問題点もあった。
【０００９】
本発明はかかる事情に鑑みてなされたものであって、ローカルメモリを持つ複数のプロセッサエレメントから構成される分散メモリ型並列計算機用のオブジェクトコードを完全自動並列化でもって生成する新たなコンパイル処理方法の提供を目的とする。
【００１０】
【課題を解決するための手段】
図１に本発明の原理構成を図示する。
図中、１はデータ処理装置であって、本発明を実現するコンパイラ１０を展開するものである。
【００１１】
このコンパイラ１０は、ソースプログラムをコンパイルすることで分散メモリ型並列計算機用のオブジェクトコードを生成するものであって、ソースプログラムを字句の列に変換して、その字句列からソースプログラムの文法的な構造を再構成し、更に意味解析を実行する字句・構文解析部１１と、字句・構文解析部１１の解析結果を使って、オブジェクトコードを生成するコード生成部１２と、字句・構文解析部１１とコード生成部１２との間に位置して、コード生成の最適化を行う最適化部１３と、全体の制御処理を実行する制御部１４とを備える。
【００１２】
この最適化部１３は、本発明を実現するために、ソースプログラムの持つループ関数ごとに、配列アクセスの添字からデータ依存関係を解析し、該ループ関数が並列実行可能か否かを解析して、並列実行可能なループ関数についてはループ分割情報を生成する第１の解析部１５と、配列アクセスの要素が全てのプロセッサエレメントのローカルメモリに展開されることを想定して、並列実行可能なループ関数ごとに、配列アクセスの添字から各プロセッサエレメントについてのデータアクセス範囲を解析して、そのデータアクセス範囲情報を生成する第２の解析部１６と、配列アクセスの要素が全てのプロセッサエレメントのローカルメモリに展開されることを想定して、第１の解析部１５の生成したループ分割情報と第２の解析部１６の生成したデータアクセス範囲情報とに基づいて、各プロセッサエレメントと他のプロセッサエレメントとの間で必要となるデータ転送を解析して、転送対象と転送位置と転送先とを指定するデータ転送情報を生成する第３の解析部１７とを備える。
【００１３】
そして、コード生成部１２は、第１の解析部１５の解析結果と第３の解析部１７の解析結果とに基づいて、ループ分割情報に従って分割されるループ関数を持つオブジェクトコードを生成するとともに、データ転送情報に従って作成されるデータ転送命令及びデータ受信命令を持つオブジェクトコードを生成していくように処理する。
【００１４】
このように構成される本発明では、第１の解析部１５は、ソースプログラムの持つ手続き部分のみを処理対象として、それらの手続き部分が並列化できるのか否かを解析して、並列実行可能な手続き部分については、各プロセッサエレメントに担当させる部分を解析する。すなわち、ソースプログラムの持つデータの配列については並列化対象としない。これから、各プロセッサエレメントは、オブジェクトコードがローディングされると、データについては、ソースプログラムに記述される配列宣言文に従って、データファイルからその全てをコピーして持つことになる。
【００１５】
一方、第２の解析部１６は、各プロセッサエレメントの担当する手続き部分に対してデータを共通に割り当てつつ、第１の解析部１５で解析される各プロセッサエレメントの担当する手続き部分のデータアクセス範囲を解析し、これを受けて、第３の解析部１７は、各プロセッサエレメントの担当する手続き部分に対してデータを共通に割り当てつつ、第２の解析部１６で解析されるデータアクセス範囲と、分割された手続き部分のプログラム記述とに基づき、プロセッサエレメント間で転送すべきデータの転送形態を解析する。
【００１６】
例えば、第３の解析部１７は、プロセッサエレメントにより個別更新される配列部分については、その配列部分を他の全てのプロセッサエレメントに転送するようにと転送先を決定することで、ローディング後のデータの整合性を保つように処理する。また、後方で使用されるデータについては、その値が確定した時点から転送開始に入るようにと転送位置を決定することで、データ転送処理と演算処理とを重ね合わせてデータ転送処理を隠蔽するように処理する。また、特定のプロセッサエレメントのみが必要とするデータについては、そのデータをそのプロセッサエレメントにのみ転送するようにと転送先を決定することで、転送コストの削減を図るように処理するのである。
【００１７】
このように、本発明のコンパイラ１０では、分散メモリ型並列計算機でデータ処理を実行する場合に、手続き部分のみを自動的に並列化し、データについては、分散配置させずに全プロセッサエレメントがそのコピーを持つ構成を採って、必要に応じてデータを一致させつつ、必要なデータをやり取りする構成を採ることで、完全自動並列化を実現するものである。
【００２１】
【発明の実施の形態】
以下、実施の形態に従って本発明を詳細に説明する。
図２に、本発明のコンパイラ１０が実行する最適化処理の処理構成を図示する。
【００２２】
この図に示すように、本発明のコンパイラ１０は、最適化処理に入ると、先ず最初に、手続き部並列化解析処理を実行する。この手続き部並列化解析処理では、ソースプログラムの持つ手続き部（主にＤＯループである）が並列化できるのか否かを解析して、並列化できるＤＯループについては、ループ分割情報を生成する。すなわち、本発明のコンパイラ１０では、ソースプログラムの持つデータについては分割しない構成を採るのである。
【００２３】
この手続き部並列化解析処理を終了すると、続いて、データアクセス範囲解析処理を実行する。このデータアクセス範囲解析処理では、ソースプログラムの持つデータについては分割しないという前提の下に、手続き部並列化解析処理で求められた並列化できるＤＯループのデータアクセス範囲を解析して、データアクセス範囲情報を生成する。
【００２４】
このデータアクセス範囲解析処理を終了すると、続いて、最適データ転送解析処理を実行する。この最適データ転送解析処理では、ソースプログラムの持つデータについては分割しないという前提の下に、データアクセス範囲解析処理で生成されたデータアクセス範囲情報と、手続き部並列化解析処理で生成されたループ分割情報とに基づき、プロセッサエレメント間で転送すべきデータの転送情報を生成してから、その転送情報を最適化することで最適なデータ転送形態を決定する。
【００２５】
図３に、この最適化処理の更に詳細な処理フローを図示する。
すなわち、本発明のコンパイラ１０は、最適化処理に入ると、手続き毎に、この処理フローに示すように、手続き部並列化解析処理に入って、先ず最初に、ソースプログラムに記述されるデータフローを解析し、続いて、ソースプログラムに記述される制御フローを解析し、続いて、ＤＯループ毎に、配列アクセスの添字からデータ依存関係を解析し、続いて、これらの解析結果に基づき、ＤＯループが並列実行可能か否かを解析して、並列実行可能なＤＯループについてはループ分割情報を生成する。
【００２６】
そして、手続き部並列化解析処理を終了すると、データアクセス範囲解析処理に入って、配列アクセスの要素が全てのプロセッサエレメントのローカルメモリに展開されることを想定して、並列実行可能なＤＯループ毎に、配列アクセスの添字から各プロセッサエレメントについてのデータアクセス範囲を解析して、そのデータアクセス範囲情報を生成する。
【００２７】
そして、データアクセス範囲解析処理を終了すると、最適データ転送解析処理に入って、配列アクセスの要素が全てのプロセッサエレメントのローカルメモリに展開されることを想定して、先ず最初に、ＤＯループ毎に、生成したループ分割情報とデータアクセス範囲情報とに基づいて、各プロセッサエレメントと他のプロセッサエレメントとの間で必要となるデータ転送を解析して、転送対象／転送位置／転送先を指定するデータ転送情報を生成し、続いて、その生成したデータ転送情報を最適化する。
【００２８】
この最適化処理を実行すると、本発明のコンパイラ１０は、生成したループ分割情報に従って分割されるＤＯループを持つオブジェクトコードを生成するとともに、生成した最適化データ転送情報に従って作成されるデータ転送命令及びデータ受信命令を持つオブジェクトコードを生成する。
【００２９】
次に、具体例に従って、本発明のコンパイラ１０が実行する最適化処理について説明する。
図４に、ソースプログラムの一例を図示する。
【００３０】
このソースプログラムは、１０１個のデータ要素を持つ配列ａと、１００個のデータ要素を持つ配列ｂと、１０１個のデータ要素を持つ配列ｃと、１００個のデータ要素を持つ配列ｄとを定義して、先ず最初に、
ｂ（ｉ）＝ｓｑｒｔ（ａ（ｉ＋１））ｉ＝１〜１００
に従って、データ要素ａ（ｉ＋１）の平方根で定義されるデータ要素ｂ（ｉ）を求めることで配列ｂを算出し、続いて、
ｃ（ｉ）＝ａ（ｉ＋１）＋ｂ（ｉ）ｉ＝１〜１００
に従ってデータ要素ｃ（ｉ）を求めることで配列ｃを算出し、続いて、配列ｂを用いるサブルーチン「ｓｕｂ（ｂ）」を呼び出した後、それに続いて、
ｄ（ｉ）＝ｃｏｓ（ｃ（ｉ＋１））ｉ＝１〜１００
に従って、データ要素ｃ（ｉ＋１）のコサイン関数値で定義されるデータ要素ｄ（ｉ）を求めることで配列ｄを算出していくプログラムである。
【００３１】
このようなソースプログラムをコンパイルする場合、分散メモリ型並列計算機が４台のプロセッサエレメントで構成されるときには、本発明のコンパイラ１０は、最適化処理に入ると、配列ａ／配列ｂ／配列ｃ／配列ｄについては分割しないようにしながら、配列ｂを求めるＤＯループと、配列ｃを求めるＤＯループと、配列ｄを求めるＤＯループについては、並列化が可能であることを解析して、それらのＤＯループを４台のプロセッサエレメントで並列処理すべく、「ｉ＝１〜２５」、「ｉ＝２６〜５０」、「ｉ＝５１〜７５」、「ｉ＝７６〜１００」という添字のグループに分割する。
【００３２】
続いて、「ｉ＝１〜２５」を実行するプロセッサエレメントは、配列ｂの算出にあたって、自エレメントに展開される配列ａのデータ要素ａ（２）〜ａ（２６）を使用し、配列ｃの算出にあたって、自エレメントに展開される配列ａのデータ要素ａ（２）〜ａ（２６）と、自エレメントで算出した配列ｂのデータ要素ｂ（１）〜（２５）とを使用し、配列ｄの算出にあたって、自エレメントで算出した配列ｃのデータ要素ｃ（２）〜（２５）と、隣接プロセッサエレメントの算出した配列ｃのデータ要素ｃ（２６）とを使用するということを解析する。
【００３３】
また、「ｉ＝２６〜５０」を実行するプロセッサエレメントは、配列ｂの算出にあたって、自エレメントに展開される配列ａのデータ要素ａ（２７）〜ａ（５１）を使用し、配列ｃの算出にあたって、自エレメントに展開される配列ａのデータ要素ａ（２７）〜ａ（５１）と、自エレメントで算出した配列ｂのデータ要素ｂ（２６）〜（５０）とを使用し、配列ｄの算出にあたって、自エレメントで算出した配列ｃのデータ要素ｃ（２７）〜（５０）と、隣接プロセッサエレメントの算出した配列ｃのデータ要素ｃ（５１）とを使用するということを解析する。
【００３４】
また、「ｉ＝５１〜７５」を実行するプロセッサエレメントは、配列ｂの算出にあたって、自エレメントに展開される配列ａのデータ要素ａ（５２）〜ａ（７６）を使用し、配列ｃの算出にあたって、自エレメントに展開される配列ａのデータ要素ａ（５２）〜ａ（７６）と、自エレメントで算出した配列ｂのデータ要素ｂ（５１）〜（７５）とを使用し、配列ｄの算出にあたって、自エレメントで算出した配列ｃのデータ要素ｃ（５２）〜（７５）と、隣接プロセッサエレメントの算出した配列ｃのデータ要素ｃ（７６）とを使用するということを解析する。
【００３５】
また、「ｉ＝７６〜１００」を実行するプロセッサエレメントは、配列ｂの算出にあたって、自エレメントに展開される配列ａのデータ要素ａ（７７）〜ａ（１０１）を使用し、配列ｃの算出にあたって、自エレメントに展開される配列ａのデータ要素ａ（７７）〜ａ（１０１）と、自エレメントで算出した配列ｂのデータ要素ｂ（７６）〜（１００）とを使用し、配列ｄの算出にあたって、自エレメントで算出した配列ｃのデータ要素ｃ（７６）〜（１００）と、自エレメントに展開される配列ｃのデータ要素ｃ（１０１）の初期値とを使用するということを解析する。
【００３６】
そして、サブルーチン「ｓｕｂ（ｂ）」の実行前に、各プロセッサエレメントが配列ｂを持つ必要があることを解析して、その「ｓｕｂ（ｂ）」の前までに、各プロセッサエレメントで生成された配列ｂの配列部分を、他の全てのプロセッサエレメントに転送することを指示するデータ転送情報を生成する。なお、サブルーチンの他、ソースプログラムに入出力命令が記述されているときや、多重ＤＯループが記述されているときなどには、このように、各プロセッサエレメントで生成された配列部分を他の全てのプロセッサエレメントに転送していく必要が起こる。
【００３７】
更に、配列ｄの算出前に、「ｉ＝１〜２５」を実行するプロセッサエレメントが、「ｉ＝２６〜５０」を実行するプロセッサエレメントからデータ要素ｃ（２６）を受け取る必要があり、「ｉ＝２６〜５０」を実行するプロセッサエレメントが、「ｉ＝５１〜７５」を実行するプロセッサエレメントからデータ要素ｃ（５１）を受け取る必要があり、「ｉ＝５１〜７５」を実行するプロセッサエレメントが、「ｉ＝７６〜１００」を実行するプロセッサエレメントからデータ要素ｃ（７６）を受け取る必要があることを解析して、その配列ｄの算出時点に、「ｉ＝２６〜５０」を実行するプロセッサエレメントの持つデータ要素ｃ（２６）を、「ｉ＝１〜２５」を実行するプロセッサエレメントに転送し、「ｉ＝５１〜７５」を実行するプロセッサエレメントの持つデータ要素ｃ（５１）を、「ｉ＝２６〜５０」を実行するプロセッサエレメントに転送し、「ｉ＝７６〜１００」を実行するプロセッサエレメントの持つデータ要素ｃ（７６）を、「ｉ＝５１〜７５」を実行するプロセッサエレメントに転送することを指示するデータ転送情報を生成する。
【００３８】
本発明のコンパイラ１０が、ここで最適化処理を終了して、データ転送情報についてはこれ以上の最適化処理を実行しないときには、この最適化処理に従って生成されるオブジェクトコードが４台のプロセッサエレメントにローディングされることで、各プロセッサエレメントは、図５に示すようなデータ処理を実行する。
【００３９】
すなわち、ＰＥ１ないしＰＥ４で示される４台のプロセッサエレメントは、配列ａ／配列ｂ／配列ｃ／配列ｄのデータ要素の実体を図示しないデータファイルから読み込むことで、ソースプログラムの記述する配列宣言文の指すデータのコピーを展開する。
【００４０】
そして、コンパイラ１０の最適化処理で生成されたＤＯループの分割情報により、ＰＥ１のプロセッサエレメントは、並列実行処理に従いつつ、
ｂ（ｉ）＝ｓｑｒｔ（ａ（ｉ＋１））ｉ＝１〜２５
に従って、データ要素ｂ（ｉ）を求め、
ｃ（ｉ）＝ａ（ｉ＋１）＋ｂ（ｉ）ｉ＝１〜２５
に従ってデータ要素ｃ（ｉ）を求め、
ｄ（ｉ）＝ｃｏｓ（ｃ（ｉ＋１））ｉ＝１〜２５
に従ってデータ要素ｄ（ｉ）を求める。
【００４１】
また、コンパイラ１０の最適化処理で生成されたＤＯループの分割情報により、ＰＥ２のプロセッサエレメントは、並列実行処理に従いつつ、
ｂ（ｉ）＝ｓｑｒｔ（ａ（ｉ＋１））ｉ＝２６〜５０
に従って、データ要素ｂ（ｉ）を求め、
ｃ（ｉ）＝ａ（ｉ＋１）＋ｂ（ｉ）ｉ＝２６〜５０
に従ってデータ要素ｃ（ｉ）を求め、
ｄ（ｉ）＝ｃｏｓ（ｃ（ｉ＋１））ｉ＝２６〜５０
に従ってデータ要素ｄ（ｉ）を求める。
【００４２】
また、コンパイラ１０の最適化処理で生成されたＤＯループの分割情報により、ＰＥ３のプロセッサエレメントは、並列実行処理に従いつつ、
ｂ（ｉ）＝ｓｑｒｔ（ａ（ｉ＋１））ｉ＝５１〜７５
に従って、データ要素ｂ（ｉ）を求め、
ｃ（ｉ）＝ａ（ｉ＋１）＋ｂ（ｉ）ｉ＝５１〜７５
に従ってデータ要素ｃ（ｉ）を求め、
ｄ（ｉ）＝ｃｏｓ（ｃ（ｉ＋１））ｉ＝５１〜７５
に従ってデータ要素ｄ（ｉ）を求める。
【００４３】
また、コンパイラ１０の最適化処理で生成されたＤＯループの分割情報により、ＰＥ４のプロセッサエレメントは、並列実行処理に従いつつ、
ｂ（ｉ）＝ｓｑｒｔ（ａ（ｉ＋１））ｉ＝７６〜１００
に従って、データ要素ｂ（ｉ）を求め、
ｃ（ｉ）＝ａ（ｉ＋１）＋ｂ（ｉ）ｉ＝７６〜１００
に従ってデータ要素ｃ（ｉ）を求め、
ｄ（ｉ）＝ｃｏｓ（ｃ（ｉ＋１））ｉ＝７６〜１００
に従ってデータ要素ｄ（ｉ）を求める。
【００４４】
このとき、コンパイラ１０の最適化処理で生成されたデータ転送情報により、ＰＥ１のプロセッサエレメントは、図６上段に示すように、サブルーチン「ｓｕｂ（ｂ）」の実行に入る前に、自エレメントで算出した配列ｂのデータ要素ｂ（１）〜ｂ（２５）を、ＰＥ２／ＰＥ３／ＰＥ４のプロセッサエレメントに転送し、ＰＥ２のプロセッサエレメントは、図６下段に示すように、サブルーチン「ｓｕｂ（ｂ）」の実行に入る前に、自エレメントで算出した配列ｂのデータ要素ｂ（２６）〜ｂ（５０）を、ＰＥ１／ＰＥ３／ＰＥ４のプロセッサエレメントに転送し、ＰＥ３のプロセッサエレメントは、図７上段に示すように、サブルーチン「ｓｕｂ（ｂ）」の実行に入る前に、自エレメントで算出した配列ｂのデータ要素ｂ（５１）〜ｂ（７５）を、ＰＥ１／ＰＥ２／ＰＥ４のプロセッサエレメントに転送し、ＰＥ４のプロセッサエレメントは、図７下段に示すように、サブルーチン「ｓｕｂ（ｂ）」の実行に入る前に、自エレメントで算出した配列ｂのデータ要素ｂ（７６）〜ｂ（１００）を、ＰＥ１／ＰＥ２／ＰＥ３のプロセッサエレメントに転送していく。
【００４５】
そして、コンパイラ１０の最適化処理で生成されたデータ転送情報により、ＰＥ１のプロセッサエレメントは、ＰＥ２／ＰＥ３／ＰＥ４のプロセッサエレメントから、配列ｂのデータ要素ｂ（２６）〜ｂ（５０）と、配列ｂのデータ要素ｂ（５１）〜ｂ（７５）と、配列ｂのデータ要素ｂ（７６）〜ｂ（１００）とを受け取ることを確認すると、サブルーチン「ｓｕｂ（ｂ）」の実行に入り、ＰＥ２のプロセッサエレメントは、ＰＥ１／ＰＥ３／ＰＥ４のプロセッサエレメントから、配列ｂのデータ要素ｂ（１）〜ｂ（２５）と、配列ｂのデータ要素ｂ（５１）〜ｂ（７５）と、配列ｂのデータ要素ｂ（７６）〜ｂ（１００）とを受け取ることを確認すると、サブルーチン「ｓｕｂ（ｂ）」の実行に入り、ＰＥ３のプロセッサエレメントは、ＰＥ１／ＰＥ２／ＰＥ４のプロセッサエレメントから、配列ｂのデータ要素ｂ（１）〜ｂ（２５）と、配列ｂのデータ要素ｂ（２６）〜ｂ（５０）と、配列ｂのデータ要素ｂ（７６）〜ｂ（１００）とを受け取ることを確認すると、サブルーチン「ｓｕｂ（ｂ）」の実行に入り、ＰＥ４のプロセッサエレメントは、ＰＥ１／ＰＥ２／ＰＥ３のプロセッサエレメントから、配列ｂのデータ要素ｂ（１）〜ｂ（２５）と、配列ｂのデータ要素ｂ（２６）〜ｂ（５０）と、配列ｂのデータ要素ｂ（５１）〜ｂ（７５）とを受け取ることを確認すると、サブルーチン「ｓｕｂ（ｂ）」の実行に入っていく。
【００４６】
そして、コンパイラ１０の最適化処理で生成されたデータ転送情報により、ＰＥ２のプロセッサエレメントは、サブルーチン「ｓｕｂ（ｂ）」の実行が終了すると、図８上段に示すように、ＰＥ１のプロセッサエレメントが配列ｄの算出に入る前に、その算出で必要となる配列ｃのデータ要素ｃ（２６）を転送し、ＰＥ４のプロセッサエレメントは、サブルーチン「ｓｕｂ（ｂ）」の実行が終了すると、図８上段に示すように、ＰＥ３のプロセッサエレメントが配列ｄの算出に入る前に、その算出で必要となる配列ｃのデータ要素ｃ（７６）を転送し、ＰＥ３のプロセッサエレメントは、サブルーチン「ｓｕｂ（ｂ）」の実行が終了すると、図８下段に示すように、ＰＥ２のプロセッサエレメントが配列ｄの算出に入る前に、その算出で必要となる配列ｃのデータ要素ｃ（５１）を転送していく。
【００４７】
このようにして、本発明のコンパイラ１０では、分散メモリ型並列計算機において、手続き部分のみを自動的に並列化し、データについては、分散配置させずに全プロセッサエレメントがそのコピーを持つ構成を採って、必要に応じてデータを一致させつつ、必要なデータをやり取りする構成を採ることで、完全自動並列化を実現するものである。
【００４８】
この実施例では、配列ｄの算出で必要となる配列ｃの境界部分のデータ要素については、そのデータ要素のみをプロセッサエレメント間で転送していくという最適化されたデータ転送方法を用いる構成を開示した。この構成に従うと転送コストの削減を図ることができるが、このような最適化を行わずに、配列ｃの全データ要素を転送対象とするデータ転送方法を採ることも可能である。
【００４９】
また、分散メモリ型並列計算機がデータ転送処理と演算処理とを並列処理できる機能を持つ場合には、後方で使用されるデータについては、その値が確定した時点から転送開始に入るようにと転送位置を最適化することで、データ転送処理と演算処理とを重ね合わせてデータ転送処理を隠蔽することが可能である。
【００５０】
例えば、図４のソースプログラムをコンパイルする場合、図９に示すように、サブルーチン「ｓｕｂ（ｂ）」の実行時点で、その演算に必要となる配列ｂのデータ要素の転送を実行するのではなくて、配列ｂのデータ要素の値が確定した時点から転送処理に入るようにし、また、配列ｄの算出時点で、その算出に必要となる配列ｃのデータ要素の転送を実行するのではなくて、その配列ｃのデータ要素の値が確定した時点から転送処理に入るようにすることで、データ転送処理を隠蔽するのである。
【００５１】
この本発明のコンパイラ１０を用いることで、データ処理実行に必要となる全データを、各プロセッサエレメントのローカルメモリに展開し、データ処理実行に必要となる手続きを分割して、各プロセッサエレメントのローカルメモリに展開し、そして、各プロセッサエレメントが、自ローカルメモリに展開される手続きに従い、自ローカルメモリに展開されるデータを使ってデータ処理を実行するという分散メモリ型並列計算機上での新たなプログラム実行方法を実現できるようになる。
【００５２】
この新たなプログラム実行方法では、各プロセッサエレメントは、自ローカルメモリに展開されるデータのアクセス範囲を限定するよう処理する。そして、各プロセッサエレメントは、手続きの実行途中で、自ローカルメモリに展開されるデータを更新し、そのデータを、そのデータを必要とする他プロセッサエレメントに転送していくよう処理することになる。
【００５３】
なお、この新たなプログラム実行方法は、本発明のコンパイラ１０を用いることで実現できるものであるが、本発明のコンパイラ１０を用いなくても実現可能である。
【００５４】
【発明の効果】
以上説明したように、本発明のコンパイラでは、分散メモリ型並列計算機でデータ処理を実行する場合に、手続き部分のみを自動的に並列化し、データについては、分散配置させずに全プロセッサエレメントがそのコピーを持つ構成を採って、必要に応じてデータを一致させつつ、必要なデータをやり取りする構成を採る。
【００５５】
これにより、完全自動並列化を実現できるようになる。そして、プログラマのノウハウによらないで完全自動並列化を実現できるようになることから、当たり外れがなくなり、コンスタントな性能向上を期待できるようになる。そして、データを分散配置することから、並列化の対象となるループの範囲が拡大できるようになる。
【図面の簡単な説明】
【図１】本発明の原理構成図である。
【図２】本発明のコンパイラが実行する最適化処理の処理構成図である。
【図３】最適処理の詳細な処理フローである。
【図４】ソースプログラムの一例である。
【図５】本発明のコンパイル処理説明図である。
【図６】本発明のコンパイル処理説明図である。
【図７】本発明のコンパイル処理説明図である。
【図８】本発明のコンパイル処理説明図である。
【図９】本発明のコンパイル処理説明図である。
【図１０】分散メモリ型並列計算機の説明図である。
【図１１】メモリアーキテクチャの説明図である。
【符号の説明】
１データ処理装置
１０コンパイラ
１１字句・構文解析部
１２コード生成部
１３最適化部
１４制御部
１５第１の解析部
１６第２の解析部
１７第３の解析部

[0001]
[Technical Field of the Invention]
The present invention relates to a compile processing method that generates object code for a distributed-memory parallel computer composed of a plurality of processor elements each having local memory, and in particular to a compile processing method that achieves fully automatic parallelization.
[0002]
[Prior Art]
As shown in FIG. 10, a distributed-memory parallel computer is a computer whose architecture couples a plurality of arithmetic units (processor elements, or PEs), each with its own memory, through a network mechanism. In such a computer, a PE can access data in its own memory by the ordinary access method, but must go through the network mechanism to access data on another PE.
[0003]
When a compiler performs automatic parallelization for such a memory architecture, the data must be distributed across the PEs very carefully. In practice, however, the automatic parallelization functions of current compilers can parallelize the procedure part automatically, but automatic distributed placement of data is very difficult.
[0004]
To address this problem, a virtual memory space shared among the PEs is constructed from the memory of each PE and the network mechanism. This virtual memory space is called the virtual global space, and the ordinary memory space partitioned per PE is called the local space. FIG. 11 illustrates this memory architecture.
[0005]
The virtual global space can be accessed from every PE, but access is slow because it goes through the network mechanism. The local space, by contrast, can be accessed only from its own PE and not from any other PE, but access is fast.
[0006]
Consequently, if the compiler's automatic parallelization uses the virtual global space in order to sidestep the problem of automatic data distribution, data access becomes slow and performance drops markedly. If, on the other hand, the local space is used for the sake of performance, a technique for automatic distributed placement of data, which is difficult in automatic parallelization, must be built.
[0007]
Conventionally, therefore, automatic parallelization has the user specify how the data is to be distributed and places the data accordingly. The automatic parallelization function then uses the local space wherever possible and uses the virtual global space only for the parts where the local space cannot be used.
[0008]
[Problems to be solved by the invention]
However, with this conventional technique the user must specify the distributed placement of data, so the parallelization is not fully automatic. Furthermore, depending on the content of the program, the virtual global space sometimes has to be used, so results vary widely from program to program and a consistent performance improvement cannot be expected. In addition, because the data is distributed, the loops eligible for parallelization are limited to those with no data dependence between iterations.
[0009]
The present invention has been made in view of these circumstances, and its object is to provide a new compile processing method that generates, by fully automatic parallelization, object code for a distributed-memory parallel computer composed of a plurality of processor elements each having local memory.
[0010]
[Means for Solving the Problems]
FIG. 1 illustrates the principle configuration of the present invention.
In the figure, reference numeral 1 denotes a data processing device on which a compiler 10 that implements the present invention is loaded.
[0011]
The compiler 10 compiles a source program to generate object code for a distributed-memory parallel computer. It comprises a lexical/syntax analysis unit 11 that converts the source program into a sequence of tokens, reconstructs the grammatical structure of the source program from that token sequence, and further performs semantic analysis; a code generation unit 12 that generates object code using the analysis results of the lexical/syntax analysis unit 11; an optimization unit 13, located between the lexical/syntax analysis unit 11 and the code generation unit 12, that optimizes code generation; and a control unit 14 that performs overall control.
[0012]
To implement the present invention, the optimization unit 13 comprises three analysis units. A first analysis unit 15 analyzes, for each loop function of the source program, the data dependences from the subscripts of its array accesses, determines whether the loop function can be executed in parallel, and generates loop division information for each loop function that can. A second analysis unit 16, assuming that the elements accessed through arrays are expanded in the local memory of every processor element, analyzes for each parallel-executable loop function the data access range of each processor element from the array subscripts and generates data access range information. A third analysis unit 17, under the same assumption, analyzes the data transfers required between each processor element and the other processor elements on the basis of the loop division information generated by the first analysis unit 15 and the data access range information generated by the second analysis unit 16, and generates data transfer information specifying the transfer target, the transfer position, and the transfer destination.
[0013]
The code generation unit 12 then, on the basis of the analysis results of the first analysis unit 15 and the third analysis unit 17, generates object code containing the loop functions divided according to the loop division information, together with object code containing the data transfer instructions and data reception instructions created according to the data transfer information.
[0014]
In the present invention configured in this way, the first analysis unit 15 treats only the procedure parts of the source program as its target: it analyzes whether those procedure parts can be parallelized and, for each procedure part that can be executed in parallel, determines the portion to be assigned to each processor element. The data arrays of the source program are not subject to parallelization. Consequently, when the object code is loaded, each processor element holds a complete copy of the data, copied from the data file in accordance with the array declaration statements written in the source program.
[0015]
The second analysis unit 16, allocating the data in common to the procedure parts assigned to the processor elements, analyzes the data access range of the procedure part assigned to each processor element by the first analysis unit 15. The third analysis unit 17, likewise allocating the data in common, then analyzes the form of the data transfers required between processor elements, on the basis of the data access ranges analyzed by the second analysis unit 16 and the program description of the divided procedure parts.
[0016]
For example, for an array portion that is updated individually by a processor element, the third analysis unit 17 sets the transfer destination so that the portion is transferred to all other processor elements, thereby keeping the data consistent after loading. For data that is used later in the program, it sets the transfer position so that the transfer starts as soon as the value is determined, which overlaps data transfer with computation and hides the transfer time. For data needed only by a specific processor element, it sets the transfer destination so that the data is transferred only to that processor element, which reduces the transfer cost.
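The choice of transfer position can be illustrated with a minimal Python sketch. It assumes a hypothetical four-step program in which each step records the arrays it writes and reads; the `Step` record, the function name, and the step names are illustrative and do not come from the patent. The value of an array is final once the last step that writes it has finished, so the send can be issued right after that step and overlapped with whatever runs before the use.

```python
from typing import NamedTuple

class Step(NamedTuple):
    name: str
    defs: frozenset   # arrays written by this step
    uses: frozenset   # arrays read by this step

def earliest_transfer_slot(steps, array, use_index):
    """Return the index of the step after which `array` may first be sent.

    Scans backwards from the using step for the last step that writes
    `array`. Returns -1 if no earlier step writes it (initial data only).
    """
    for k in range(use_index - 1, -1, -1):
        if array in steps[k].defs:
            return k
    return -1

# Hypothetical program: two loops, a call that reads b, then a loop that reads c.
# It is an assumption here that the call does not modify b.
program = [
    Step("loop_b", frozenset({"b"}), frozenset({"a"})),
    Step("loop_c", frozenset({"c"}), frozenset({"a", "b"})),
    Step("call_sub_b", frozenset(), frozenset({"b"})),
    Step("loop_d", frozenset({"d"}), frozenset({"c"})),
]

b_slot = earliest_transfer_slot(program, "b", use_index=2)  # send b after step 0
c_slot = earliest_transfer_slot(program, "c", use_index=3)  # send c after step 1
```

In this sketch the transfer of b can overlap with `loop_c`, and the transfer of c can overlap with the call, which is the kind of hiding the paragraph describes.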
[0017]
In this way, when data processing is executed on a distributed-memory parallel computer, the compiler 10 of the present invention automatically parallelizes only the procedure parts. The data is not distributed; every processor element holds a copy of it, the copies are brought into agreement as needed, and the necessary data is exchanged. Fully automatic parallelization is thereby achieved.
[0021]
[Embodiments of the Invention]
Hereinafter, the present invention will be described in detail according to embodiments.
FIG. 2 shows a processing configuration of the optimization processing executed by the compiler 10 of the present invention.
[0022]
As shown in the figure, on entering the optimization processing the compiler 10 of the present invention first executes procedure-part parallelization analysis. This analysis determines whether each procedure part of the source program (mainly DO loops) can be parallelized and generates loop division information for each DO loop that can. In other words, the compiler 10 of the present invention does not divide the data of the source program.
[0023]
When the procedure-part parallelization analysis is complete, data access range analysis is executed. On the premise that the data of the source program is not divided, this analysis determines the data access range of each parallelizable DO loop found by the procedure-part parallelization analysis and generates data access range information.
[0024]
When the data access range analysis is complete, optimal data transfer analysis is executed. On the same premise that the data of the source program is not divided, this analysis generates transfer information for the data to be transferred between processor elements, based on the data access range information generated by the data access range analysis and the loop division information generated by the procedure-part parallelization analysis, and then optimizes that transfer information to determine the optimal form of data transfer.
[0025]
FIG. 3 shows a more detailed processing flow of this optimization processing.
That is, on entering the optimization processing, the compiler 10 of the present invention performs the following for each procedure, as shown in the processing flow. It enters the procedure-part parallelization analysis, where it first analyzes the data flow described in the source program, then analyzes the control flow described in the source program, then analyzes, for each DO loop, the data dependences from the subscripts of the array accesses, and finally, based on these results, determines whether each DO loop can be executed in parallel and generates loop division information for the DO loops that can.
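The patent does not spell out the dependence test it applies to the array subscripts. As a rough illustration only, the Python sketch below checks one array inside one loop for a loop-carried dependence, assuming every subscript has the simple form i + k; the function name and the restriction to constant offsets are assumptions made for this example. Whether a detected dependence actually rules out parallel execution under the replicated-data scheme is a separate decision that the sketch does not model.

```python
def has_loop_carried_dependence(write_offsets, read_offsets, trip_count):
    """Check subscripts of the form i + k on one array inside one loop.

    write_offsets / read_offsets list the constants k of every write and
    read of the array. Two different iterations i1 != i2 touch the same
    element when i1 + kw == i2 + k, which requires 0 < |kw - k| < trip_count.
    Write/write pairs are checked the same way as write/read pairs.
    """
    for kw in write_offsets:
        for k in list(read_offsets) + list(write_offsets):
            if 0 < abs(kw - k) < trip_count:
                return True
    return False

# d(i) = cos(c(i+1)): d is only written at offset 0, c is only read at offset 1.
d_dep = has_loop_carried_dependence([0], [], 100)
c_dep = has_loop_carried_dependence([], [1], 100)
# Hypothetical loop x(i) = x(i-1) + 1: write offset 0, read offset -1.
x_dep = has_loop_carried_dependence([0], [-1], 100)
```

Here `d_dep` and `c_dep` are false because no element is both written and touched again by another iteration, while the hypothetical recurrence on `x` is flagged.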
[0026]
When the procedure-part parallelization analysis is complete, the compiler enters the data access range analysis. Assuming that the elements accessed through arrays are expanded in the local memory of every processor element, it analyzes, for each parallel-executable DO loop, the data access range of each processor element from the array subscripts and generates the data access range information.
[0027]
When the data access range analysis is complete, the compiler enters the optimal data transfer analysis. Assuming that the elements accessed through arrays are expanded in the local memory of every processor element, it first analyzes, for each DO loop, the data transfers required between each processor element and the other processor elements on the basis of the generated loop division information and data access range information, and generates data transfer information specifying the transfer target, transfer position, and transfer destination. It then optimizes the generated data transfer information.
[0028]
After this optimization processing, the compiler 10 of the present invention generates object code containing the DO loops divided according to the generated loop division information, together with object code containing the data transfer instructions and data reception instructions created according to the optimized data transfer information.
[0029]
Next, an optimization process executed by the compiler 10 of the present invention will be described according to a specific example.
FIG. 4 illustrates an example of the source program.
[0030]
This source program defines an array a with 101 data elements, an array b with 100 data elements, an array c with 101 data elements, and an array d with 100 data elements. It first computes array b by obtaining each data element b(i) as the square root of data element a(i+1), according to
b(i) = sqrt(a(i+1)), i = 1 to 100
It then computes array c by obtaining each data element c(i) according to
c(i) = a(i+1) + b(i), i = 1 to 100
It then calls a subroutine "sub(b)" that uses array b, and finally computes array d by obtaining each data element d(i) as the cosine of data element c(i+1), according to
d(i) = cos(c(i+1)), i = 1 to 100
[0031]
When such a source program is compiled for a distributed-memory parallel computer with four processor elements, the compiler 10 of the present invention, on entering the optimization processing, leaves arrays a, b, c, and d undivided, determines that the DO loops computing arrays b, c, and d can be parallelized, and, so that the four processor elements process them in parallel, divides those DO loops into the subscript groups "i = 1 to 25", "i = 26 to 50", "i = 51 to 75", and "i = 76 to 100".
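The division into subscript groups can be reproduced with a short Python sketch. The function name is hypothetical, and the handling of trip counts that do not divide evenly is an assumption, since the text only shows the even case of 100 iterations over four processor elements.

```python
def block_partition(lower, upper, num_pes):
    """Split the inclusive range lower..upper into num_pes contiguous blocks.

    Earlier blocks receive one extra iteration when the trip count is not
    divisible by num_pes (an assumption; the text only shows the even case).
    """
    total = upper - lower + 1
    base, extra = divmod(total, num_pes)
    blocks = []
    start = lower
    for pe in range(num_pes):
        size = base + (1 if pe < extra else 0)
        blocks.append((start, start + size - 1))
        start += size
    return blocks

# The example in the text: i = 1 to 100 over four processor elements.
blocks = block_partition(1, 100, 4)
```

For the example loop this yields the four groups 1 to 25, 26 to 50, 51 to 75, and 76 to 100.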
[0032]
Next, the compiler determines that the processor element executing "i = 1 to 25" uses data elements a(2) to a(26) of the copy of array a expanded in that element to compute array b; uses those same elements a(2) to a(26) together with data elements b(1) to b(25), which it computed itself, to compute array c; and uses data elements c(2) to c(25), which it computed itself, together with data element c(26), computed by the adjacent processor element, to compute array d.
[0033]
Likewise, it determines that the processor element executing "i = 26 to 50" uses data elements a(27) to a(51) of the copy of array a expanded in that element to compute array b; uses those same elements a(27) to a(51) together with data elements b(26) to b(50), which it computed itself, to compute array c; and uses data elements c(27) to c(50), which it computed itself, together with data element c(51), computed by the adjacent processor element, to compute array d.
[0034]
Likewise, it determines that the processor element executing "i = 51 to 75" uses data elements a(52) to a(76) of the copy of array a expanded in that element to compute array b; uses those same elements a(52) to a(76) together with data elements b(51) to b(75), which it computed itself, to compute array c; and uses data elements c(52) to c(75), which it computed itself, together with data element c(76), computed by the adjacent processor element, to compute array d.
[0035]
Finally, it determines that the processor element executing "i = 76 to 100" uses data elements a(77) to a(101) of the copy of array a expanded in that element to compute array b; uses those same elements a(77) to a(101) together with data elements b(76) to b(100), which it computed itself, to compute array c; and uses data elements c(77) to c(100), which it computed itself, together with the initial value of data element c(101) of the copy of array c expanded in that element, to compute array d.
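The access ranges listed above follow directly from the subscript and the iteration block. The Python sketch below is a minimal illustration that assumes subscripts of the form i + offset; the helper name and variable names are hypothetical.

```python
def access_range(block, offset):
    """Elements touched by subscript i + offset when i runs over `block`."""
    lo, hi = block
    return (lo + offset, hi + offset)

blocks = [(1, 25), (26, 50), (51, 75), (76, 100)]

# Reads of a(i+1) in the loops for b and c.
a_reads = [access_range(blk, 1) for blk in blocks]
# Reads of c(i+1) in the loop for d.
c_reads = [access_range(blk, 1) for blk in blocks]
# Writes of b(i), c(i), and d(i) cover exactly the iteration block.
writes = [access_range(blk, 0) for blk in blocks]
```

Each read range of c extends one element past the block that the same processor element wrote, and that last element is the one that has to come from elsewhere.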
[0036]
The compiler then determines that every processor element must hold array b before subroutine "sub(b)" is executed, and generates data transfer information instructing each processor element to transfer the portion of array b it has produced to all other processor elements before "sub(b)". Besides subroutine calls, the array portions produced by each processor element must similarly be transferred to all other processor elements when, for example, the source program contains input/output statements or nested DO loops.
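One way to picture the data transfer information for array b is as a list of records, one per sender and receiver pair. The Python sketch below is illustrative only: the record layout with the keys "array", "range", "from", and "to" is an assumption, not the format used by the compiler in the patent.

```python
def broadcast_transfers(array, blocks):
    """Each PE sends the block it wrote to every other PE.

    PEs are numbered from 1; blocks[k] is the range written by PE k+1.
    """
    transfers = []
    for src, (lo, hi) in enumerate(blocks, start=1):
        for dst in range(1, len(blocks) + 1):
            if dst != src:
                transfers.append({"array": array, "range": (lo, hi),
                                  "from": src, "to": dst})
    return transfers

b_transfers = broadcast_transfers("b", [(1, 25), (26, 50), (51, 75), (76, 100)])
```

With four processor elements this produces twelve records, three per sender, after which every copy of b is complete.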
[0037]
Furthermore, the compiler determines that, before array d is computed, the processor element executing "i = 1 to 25" must receive data element c(26) from the processor element executing "i = 26 to 50", the processor element executing "i = 26 to 50" must receive data element c(51) from the processor element executing "i = 51 to 75", and the processor element executing "i = 51 to 75" must receive data element c(76) from the processor element executing "i = 76 to 100". It therefore generates data transfer information instructing that, at the point where array d is computed, data element c(26) held by the processor element executing "i = 26 to 50" be transferred to the processor element executing "i = 1 to 25", data element c(51) held by the processor element executing "i = 51 to 75" be transferred to the processor element executing "i = 26 to 50", and data element c(76) held by the processor element executing "i = 76 to 100" be transferred to the processor element executing "i = 51 to 75".
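These boundary transfers can be derived mechanically by comparing what each processor element reads with what the others wrote. The following Python sketch is a simplified illustration under the assumption that the earlier loop writes array(i) and the later loop reads array(i + read_offset) over the same blocks; the function name and tuple layout are hypothetical.

```python
def needed_transfers(array, blocks, read_offset):
    """Find elements that a PE reads but that another PE wrote.

    blocks[k] is both the iteration block of PE k+1 and the range of
    `array` that PE wrote in an earlier loop. Returns tuples of
    (array, element, source PE, destination PE), with PEs numbered from 1.
    """
    transfers = []
    for dst, (lo, hi) in enumerate(blocks, start=1):
        for elem in range(lo + read_offset, hi + read_offset + 1):
            for src, (wlo, whi) in enumerate(blocks, start=1):
                if src != dst and wlo <= elem <= whi:
                    transfers.append((array, elem, src, dst))
    return transfers

c_transfers = needed_transfers("c", [(1, 25), (26, 50), (51, 75), (76, 100)], 1)
```

Element c(101) produces no transfer because no processor element writes it; the last processor element uses the initial value in its own copy.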
[0038]
If the compiler 10 of the present invention ends the optimization processing here and performs no further optimization of the data transfer information, the object code generated by this optimization processing is loaded onto the four processor elements, and each processor element executes the data processing shown in FIG. 5.
[0039]
That is, the four processor elements, denoted PE1 to PE4, read the contents of the data elements of arrays a, b, c, and d from a data file (not shown) and thereby each expand a copy of the data designated by the array declaration statements of the source program.
[0040]
Then, in accordance with the DO loop division information generated by the optimization processing of the compiler 10, processor element PE1, executing in parallel with the others, obtains data elements b(i) according to
b(i) = sqrt(a(i+1)), i = 1 to 25
obtains data elements c(i) according to
c(i) = a(i+1) + b(i), i = 1 to 25
and obtains data elements d(i) according to
d(i) = cos(c(i+1)), i = 1 to 25
[0041]
Likewise, in accordance with the DO loop division information generated by the optimization processing of the compiler 10, processor element PE2, executing in parallel with the others, obtains data elements b(i) according to
b(i) = sqrt(a(i+1)), i = 26 to 50
obtains data elements c(i) according to
c(i) = a(i+1) + b(i), i = 26 to 50
and obtains data elements d(i) according to
d(i) = cos(c(i+1)), i = 26 to 50
[0042]
Likewise, based on the DO loop division information generated by the optimization process of the compiler 10, the processor element PE3 obtains the data element b(i) according to
b(i) = sqrt(a(i+1)), i = 51 to 75
obtains the data element c(i) according to
c(i) = a(i+1) + b(i), i = 51 to 75
and obtains the data element d(i) according to
d(i) = cos(c(i+1)), i = 51 to 75
[0043]
Further, based on the DO loop division information generated by the optimization process of the compiler 10, the processor element PE4, following the parallel execution process, obtains the data element b(i) according to
b(i) = sqrt(a(i+1)), i = 76 to 100
obtains the data element c(i) according to
c(i) = a(i+1) + b(i), i = 76 to 100
and obtains the data element d(i) according to
d(i) = cos(c(i+1)), i = 76 to 100
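As a rough illustration of why data must be exchanged between these steps, the following sketch lets each processor element work on its own full copy of the arrays and update only its own index block. The array size, the input values of a, and the names `RANGES`, `make_local_memory`, and `run_b_and_c` are assumptions made for the example; the patent reads the actual data from a data file.

```python
import math

N = 100
RANGES = {1: (1, 25), 2: (26, 50), 3: (51, 75), 4: (76, 100)}  # PE -> loop block


def make_local_memory():
    # Index 0 is unused so that list indices match the 1-based subscripts.
    # a(i) = i is placeholder input data, not taken from the patent.
    a = [float(i) for i in range(N + 2)]
    return {"a": a, "b": [0.0] * (N + 2), "c": [0.0] * (N + 2)}


def run_b_and_c(mem, lo, hi):
    """Execute the divided loop body for i = lo to hi on one local copy."""
    for i in range(lo, hi + 1):
        mem["b"][i] = math.sqrt(mem["a"][i + 1])
        mem["c"][i] = mem["a"][i + 1] + mem["b"][i]


memories = {pe: make_local_memory() for pe in RANGES}
for pe, (lo, hi) in RANGES.items():
    run_b_and_c(memories[pe], lo, hi)

# PE1 holds a fresh c(25) but a stale c(26); PE2 holds the fresh c(26).
print(memories[1]["c"][25], memories[1]["c"][26], memories[2]["c"][26])
```

PE1 needs c(26) to compute d(25), yet its local copy of c(26) was never updated. This is the situation that the data transfer information resolves.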
[0044]
At this time, based on the data transfer information generated by the optimization process of the compiler 10, the processor element PE1, as shown in the upper part of FIG., transfers the data elements b(1) to b(25) of the array b that it has calculated to the processor elements PE2, PE3, and PE4 before executing the subroutine "sub(b)"; the processor element PE2, as shown in the lower part of FIG., transfers the data elements b(26) to b(50) of the array b that it has calculated to the processor elements PE1, PE3, and PE4 before executing the subroutine "sub(b)"; the processor element PE3, as shown, transfers the data elements b(51) to b(75) of the array b that it has calculated to the processor elements PE1, PE2, and PE4 before executing the subroutine "sub(b)"; and the processor element PE4, as shown in the lower part of FIG., transfers the data elements b(76) to b(100) of the array b that it has calculated to the processor elements PE1, PE2, and PE3 before executing the subroutine "sub(b)".
[0045]
Then, based on the data transfer information generated by the optimization process of the compiler 10, the processor element PE1 starts executing the subroutine "sub(b)" once it has confirmed receipt of the data elements b(26) to b(50), b(51) to b(75), and b(76) to b(100) of the array b from the processor elements PE2, PE3, and PE4. The processor element PE2 starts executing the subroutine "sub(b)" once it has confirmed receipt of the data elements b(1) to b(25), b(51) to b(75), and b(76) to b(100) of the array b from the processor elements PE1, PE3, and PE4. The processor element PE3 starts executing the subroutine "sub(b)" once it has confirmed receipt of the data elements b(1) to b(25), b(26) to b(50), and b(76) to b(100) of the array b from the processor elements PE1, PE2, and PE4. The processor element PE4 starts executing the subroutine "sub(b)" once it has confirmed receipt of the data elements b(1) to b(25), b(26) to b(50), and b(51) to b(75) of the array b from the processor elements PE1, PE2, and PE3.
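The exchange of array b before "sub(b)" amounts to every processor element sending its own block to all the others and waiting until all three foreign blocks have arrived. A minimal single-process sketch follows. The `inbox` lists stand in for the network mechanism, and `send_block` and `receive_all` are hypothetical names, not instructions defined by the patent.

```python
RANGES = {1: (1, 25), 2: (26, 50), 3: (51, 75), 4: (76, 100)}  # PE -> block of b

# Each PE has a full local copy of b, but only its own block holds results.
local_b = {pe: [None] * 101 for pe in RANGES}  # index 0 unused
for pe, (lo, hi) in RANGES.items():
    for i in range(lo, hi + 1):
        local_b[pe][i] = float(i)  # placeholder for the computed b(i)

inbox = {pe: [] for pe in RANGES}  # stand-in for the network mechanism


def send_block(src):
    """Transfer the block computed by `src` to every other PE."""
    lo, hi = RANGES[src]
    payload = local_b[src][lo:hi + 1]
    for dst in RANGES:
        if dst != src:
            inbox[dst].append((src, lo, payload))


def receive_all(dst):
    """Store received blocks; True once all other PEs have delivered theirs."""
    senders = set()
    for src, lo, payload in inbox[dst]:
        local_b[dst][lo:lo + len(payload)] = payload
        senders.add(src)
    inbox[dst].clear()
    return senders == set(RANGES) - {dst}


for pe in RANGES:
    send_block(pe)
ready = {pe: receive_all(pe) for pe in RANGES}  # gate for starting sub(b)
print(ready)
```

After the exchange every local copy of b is complete, so "sub(b)" can run on any processor element as if the array had been computed locally.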
[0046]
When the processor element PE2 finishes executing the subroutine "sub(b)", then, based on the data transfer information generated by the optimization process of the compiler 10 and as shown in the upper part of FIG. 8, it transfers the data element c(26) of the array c required for the calculation before the processor element PE1 starts calculating the array d. When the processor element PE4 finishes executing the subroutine "sub(b)", it transfers, as shown, the data element c(76) of the array c required for the calculation before the processor element PE3 starts calculating the array d. When the processor element PE3 finishes executing the subroutine "sub(b)", it transfers, as shown in the lower part of FIG. 8, the data element c(51) of the array c required for the calculation before the processor element PE2 starts calculating the array d.
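The effect of these three boundary transfers can be checked with a small sketch: after they are applied, each processor element can compute its block of d from its own local copy of c and obtain the same values as a sequential run. The values of c, the `TRANSFERS` table layout, and the single gathered list `d` (used only for checking) are assumptions made for the example.

```python
import math

RANGES = {1: (1, 25), 2: (26, 50), 3: (51, 75), 4: (76, 100)}
TRANSFERS = [(26, 2, 1), (51, 3, 2), (76, 4, 3)]  # (element of c, source PE, destination PE)

# Placeholder values of c as a sequential run would produce them (indices 1..100).
reference_c = [0.0] + [0.5 * i for i in range(1, 101)] + [0.0]

# Each PE's local copy of c is up to date only inside its own block.
local_c = {}
for pe, (lo, hi) in RANGES.items():
    c = [0.0] * 102
    c[lo:hi + 1] = reference_c[lo:hi + 1]
    local_c[pe] = c

# Apply the boundary transfers before array d is calculated.
for element, src, dst in TRANSFERS:
    local_c[dst][element] = local_c[src][element]

# Each PE computes its block of d from its own local copy of c.
d = [0.0] * 101
for pe, (lo, hi) in RANGES.items():
    for i in range(lo, hi + 1):
        d[i] = math.cos(local_c[pe][i + 1])

expected = [0.0] + [math.cos(reference_c[i + 1]) for i in range(1, 101)]
print(d == expected)
```

Without the loop over `TRANSFERS`, d(25), d(50), and d(75) would be computed from stale boundary values.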
[0047]
In this way, the compiler 10 of the present invention realizes fully automatic parallelization on a distributed memory type parallel computer by adopting a configuration in which only the procedure part is automatically parallelized, the data is not distributed and every processor element holds a copy of it, and the necessary data is exchanged so as to keep the copies consistent as needed.
[0048]
This embodiment has disclosed a configuration that uses an optimized data transfer method in which only the data elements at the block boundaries of the array c that are required for the calculation of the array d are transferred between processor elements. This configuration reduces the transfer cost, but it is also possible to adopt a data transfer method in which all data elements of the array c are transferred without performing such optimization.
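To give a feel for the difference in transfer cost, the sketch below counts transferred array elements for both methods. It assumes that "transferring all data elements" means each processor element sends its own block of c to every other processor element, as is done for array b; that reading, and the function names, are assumptions rather than statements from the patent.

```python
RANGES = {1: (1, 25), 2: (26, 50), 3: (51, 75), 4: (76, 100)}


def full_exchange_volume(ranges):
    """Elements moved when every PE sends its whole block to all other PEs."""
    others = len(ranges) - 1
    return sum((hi - lo + 1) * others for lo, hi in ranges.values())


def boundary_volume(ranges, offset=1):
    """Elements moved when only elements read across a block boundary are sent."""
    owner = {i: pe for pe, (lo, hi) in ranges.items() for i in range(lo, hi + 1)}
    count = 0
    for pe, (lo, hi) in ranges.items():
        needed = {i + offset for i in range(lo, hi + 1)}
        # Elements nobody computes (such as c(101)) are not transferred.
        count += sum(1 for index in needed if owner.get(index, pe) != pe)
    return count


full = full_exchange_volume(RANGES)
boundary = boundary_volume(RANGES)
print(full, boundary)
```

Under these assumptions the unoptimized method moves 300 elements, while the boundary method moves 3.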
[0049]
Also, if the distributed memory type parallel computer has a function for performing data transfer processing and arithmetic processing in parallel, the transfer position of data that is used later can be optimized so that the transfer starts at the time its value is determined. Overlapping the data transfer processing with the arithmetic processing in this way makes it possible to hide the data transfer processing.
[0050]
For example, when compiling the source program of FIG. 4, as shown in FIG. 9, the transfer of the data elements of the array b required by the subroutine "sub(b)" is not executed at the time the subroutine is executed; instead, the transfer is started at the time the values of those data elements of the array b are determined. Likewise, the transfer of the data elements of the array c required for calculating the array d is not executed at the time of that calculation; instead, the transfer is started at the time the values of those data elements of the array c are determined. The data transfer processing is thereby hidden.
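The idea of starting a transfer as soon as a value is final, and continuing to compute while it is in flight, can be sketched with a background thread that plays the role of the network mechanism. This is only an illustration from the point of view of one processor element (here the one executing "i = 26 to 50"); the names `sender`, `outbox`, and `channel` and the placeholder values are assumptions, not part of the patent.

```python
import queue
import threading


def sender(outbox, channel):
    """Background 'network': forwards each value as soon as it is queued."""
    while True:
        item = outbox.get()
        if item is None:  # sentinel: nothing more to send
            break
        channel.append(item)


outbox = queue.Queue()
channel = []  # stands in for the receiving PE's local memory
worker = threading.Thread(target=sender, args=(outbox, channel))
worker.start()

c = {}
for i in range(26, 51):
    c[i] = float(i)  # placeholder for a(i+1) + b(i)
    if i == 26:
        # c(26) is final here, so its transfer starts now and overlaps
        # with the remaining iterations of the loop.
        outbox.put((i, c[i]))

outbox.put(None)
worker.join()  # corresponds to confirming receipt before array d is calculated
print(channel)
```

In this sketch the boundary element is already on its way while c(27) to c(50) are still being computed, which is the overlap that hides the transfer.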
[0051]
By using the compiler 10 of the present invention, a new program execution method on a distributed memory type parallel computer can be realized, in which all the data necessary for executing the data processing is expanded in the local memory of each processor element, the procedure necessary for executing the data processing is divided and expanded in the local memory of each processor element, and each processor element executes the data processing according to the procedure expanded in its own local memory, using the data expanded in its own local memory.
[0052]
In this new program execution method, each processor element performs processing so as to restrict the range that it accesses within the data expanded in its own local memory. During execution of the procedure, each processor element updates the data expanded in its own local memory and transfers that data to the other processor elements that need it.
[0053]
This new program execution method can be realized by using the compiler 10 of the present invention, but it can also be realized without using it.
[0054]
[Effects of the invention]
As described above, when data processing is executed by a distributed memory type parallel computer, the compiler of the present invention adopts a configuration in which only the procedure part is automatically parallelized, the data is not distributed and every processor element holds a copy of it, and the necessary data is exchanged so as to keep the copies consistent as needed.
[0055]
This makes it possible to realize fully automatic parallelization. Further, since fully automatic parallelization can be realized without relying on the programmer's know-how, results vary little from program to program and a consistent performance improvement can be expected. In addition, since the data is not distributed, the range of loops that can be parallelized is expanded.
[Brief description of the drawings]
FIG. 1 is a principle configuration diagram of the present invention.
FIG. 2 is a processing configuration diagram of an optimization processing executed by a compiler of the present invention.
FIG. 3 is a detailed processing flow of the optimization process.
FIG. 4 is an example of a source program.
FIG. 5 is an explanatory diagram of a compile process of the present invention.
FIG. 6 is an explanatory diagram of a compile process of the present invention.
FIG. 7 is an explanatory diagram of a compile process of the present invention.
FIG. 8 is an explanatory diagram of a compile process of the present invention.
FIG. 9 is an explanatory diagram of a compile process of the present invention.
FIG. 10 is an explanatory diagram of a distributed memory type parallel computer.
FIG. 11 is an explanatory diagram of a memory architecture.
[Explanation of symbols]
1 Data processing device
10 Compiler
11 Lexical/syntax analysis unit
12 Code generator
13 Optimizer
14 Control unit
15 First analysis unit
16 Second analysis unit
17 Third analysis unit

Claims

A compile processing method executed in a compile processing apparatus having first generating means, second generating means, third generating means, and fourth generating means, for compiling a source program to be executed by a distributed memory type parallel computer composed of a plurality of processor elements each having a local memory, characterized in that:
the first generating means, for each loop function of the source program, analyzes data dependencies from the subscripts of array accesses, analyzes whether the loop function can be executed in parallel, and generates loop division information for each loop function to be executed in parallel;
the second generating means, on the assumption that the elements of the accessed arrays are expanded into the local memories of all processor elements, analyzes, for each loop function to be executed in parallel, the data access range of each processor element from the subscripts of the array accesses, and generates data access range information;
the third generating means, on the assumption that the elements of the accessed arrays are expanded into the local memories of all processor elements, analyzes, based on the loop division information and the data access range information, the data transfer required between each processor element and the other processor elements, and generates data transfer information specifying the transfer target, the transfer position, and the transfer destination; and
the fourth generating means generates object code having the loop functions divided according to the loop division information and having data transfer instructions and data reception instructions generated according to the data transfer information.

The compile processing method according to claim 1, characterized in that
the third generating means, for data that is used later, determines the transfer position so that the transfer starts from the time the value is determined.

The compile processing method according to claim 1, characterized in that
the third generating means, for an array portion that is updated separately by each processor element, determines the transfer destination so that the array portion is transferred to all of the other processor elements.

The compile processing method according to claim 1, characterized in that
the third generating means, for data that is needed only by a specific processor element, determines the transfer destination so that the data is transferred only to that processor element.