JP3673809B2

JP3673809B2 - Multiple repetitive processing massively parallelized source code automatic generation program, automatic generation apparatus and automatic generation method

Info

Publication number: JP3673809B2
Application number: JP2001283957A
Authority: JP
Inventors: 浩通山本; 幸造本間; 正廣吉田; 純一大久保
Original assignee: Japan Aerospace Exploration Agency JAXA; Mitsubishi Space Software Co Ltd
Current assignee: Japan Aerospace Exploration Agency JAXA; Mitsubishi Space Software Co Ltd
Priority date: 2001-09-18
Filing date: 2001-09-18
Publication date: 2005-07-20
Anticipated expiration: 2021-09-18
Also published as: JP2003091422A

Description

[0001]
BACKGROUND OF THE INVENTION
The present invention relates to a multiple iterative processing massively parallelized source code automatic generation program, automatic generation apparatus, and automatic generation method for automatically generating parallelized source code (a parallelized source program) that can be executed in parallel by a plurality of processors (CPUs) from ordinary non-parallel source code (a source program).
[0002]
[Prior art]
Numerical modeling, simulation calculations, and the like in the natural sciences and engineering require an enormous number of iterative calculations.
[0003]
In such cases, executing the calculation on a single processor (CPU) would take an enormous amount of processing time, so parallel processing on a plurality of processors is used to shorten the execution time to a practical level.
[0004]
Parallel processing on a plurality of processors requires parallelized source code that the processors can execute in parallel. One method that has been considered for producing such code is to convert ordinary non-parallel source code into parallelized source code that a plurality of processors can process in parallel.
[0005]
[Problems to be solved by the invention]
However, conventional methods for generating parallelized source code do not disclose any specific method for automatically converting non-parallelized source code having a multiply nested loop structure into parallelized source code that can be executed by a plurality of processors. The development of an automatic conversion program, apparatus, and method capable of automatically converting non-parallelized source code having such a multiple loop structure into parallelized source code has therefore been awaited.
[0006]
The present invention has been made in view of the above circumstances. Its first object is to contribute to promoting the use of parallel processing by providing a specific automatic generation program, automatic generation apparatus, and automatic generation method for automatically converting non-parallelized source code having a multiply nested loop structure into parallelized source code that can be executed in parallel by a plurality of processors.
[0007]
In addition to the first object, a second object of the present invention is to improve the efficiency of parallel processing by making full use of the plurality of processors to perform the iterations of the multiple loops in parallel while equalizing the number of loop iterations assigned to each processor.
[0008]
[Means for Solving the Problems]
According to a first aspect of the present invention for achieving the above objects, there is provided a parallelized source code automatic generation program that causes a computer to realize a function of automatically generating, from non-parallel source code including an n-fold nested loop (n is an integer of 2 or more), parallelized source code that can be executed in parallel by m processors (m is an integer of 2 or more). The program causes the computer to realize a function of rewriting the initial value expression of each of the n loops of the non-parallelized source code into an initial value expression S_j expressed by the following formula, using m consecutive integers ia_k (k = 0, ..., m−1) starting from 0 that are given to the m processors and uniquely identify each processor, and an increment value δ_j determined for each loop j (j = 1, ..., n),
[Formula 1]
$$S_j = 1 + \mathrm{Int}\left\{ \frac{\mathrm{mod}\left(ia_k,\ \prod_{l=1}^{j} \delta_l\right)}{\prod_{l=0}^{j-1} \delta_l} \right\}, \qquad \delta_0 = 1$$
and of converting the n-fold loop structure portion into a structure that can be shared among the m processors by using the rewritten initial value expression S_j and the increment value δ_j.
[0009]
In the first aspect of the present invention, the number of iterations of the n-fold loop structure portion assigned to each of the m processors is equal.
[0010]
According to a second aspect of the present invention for achieving the above objects, there is provided a parallelized source code automatic generation apparatus that automatically generates, from non-parallel source code including an n-fold nested loop (n is an integer of 2 or more), parallelized source code that can be executed in parallel by m processors (m is an integer of 2 or more). The apparatus comprises means for rewriting the initial value expression of each of the n loops of the non-parallelized source code into an initial value expression S_j expressed by the following formula, using m consecutive integers ia_k (k = 0, ..., m−1) starting from 0 that are given to the m processors and uniquely identify each processor, and an increment value δ_j determined for each loop j (j = 1, ..., n),
[Formula 2]
$$S_j = 1 + \mathrm{Int}\left\{ \frac{\mathrm{mod}\left(ia_k,\ \prod_{l=1}^{j} \delta_l\right)}{\prod_{l=0}^{j-1} \delta_l} \right\}, \qquad \delta_0 = 1$$
and for converting the n-fold loop structure portion into a structure that can be shared among the m processors by using the rewritten initial value expression S_j and the increment value δ_j.
[0011]
Furthermore, according to a third aspect of the present invention for achieving the above objects, there is provided a parallelized source code automatic generation method by which parallelized source code that can be executed in parallel by m processors (m is an integer of 2 or more) is automatically generated from non-parallel source code including an n-fold nested loop (n is an integer of 2 or more). In the method, the initial value expression of each of the n loops of the non-parallelized source code is rewritten into an initial value expression S_j expressed by the following formula, using m consecutive integers ia_k (k = 0, ..., m−1) starting from 0 that are given to the m processors and uniquely identify each processor, and an increment value δ_j determined for each loop j (j = 1, ..., n),
[Formula 3]
$$S_j = 1 + \mathrm{Int}\left\{ \frac{\mathrm{mod}\left(ia_k,\ \prod_{l=1}^{j} \delta_l\right)}{\prod_{l=0}^{j-1} \delta_l} \right\}, \qquad \delta_0 = 1$$
and the n-fold loop structure portion is thereby converted into a structure that is shared among the m processors.
[0012]
DETAILED DESCRIPTION OF THE INVENTION
Embodiments of the multiple iterative processing massively parallelized source code automatic generation program, automatic generation apparatus, and automatic generation method of the present invention will be described with reference to the accompanying drawings.
[0013]
As shown in FIG. 1, the multiple iterative processing massively parallelized source code automatic generation apparatus 1 includes a CPU 2 and a memory 3 that are communicably connected to each other.
[0014]
The CPU 2 operates in accordance with a multiple iterative processing massively parallelized source code automatic generation program (including a compiler) P stored in advance in the memory 3, and thereby realizes the following functions: a function of reading non-parallelized source code SC containing an n-fold nested loop (n is an integer of 2 or more) from a storage device 4 such as a CD-ROM, hard disk, or DVD-ROM and storing it in the memory 3; a function of automatically generating, from the non-parallelized source code SC stored in the memory 3, parallelized source code PSC that m processors ia_0, ..., ia_m−1 (m is an integer of 2 or more) can execute in parallel; and a function of compiling the generated parallelized source code PSC to generate parallelized object code POC and loading it onto each of the processors ia_0, ..., ia_m−1 (a parallelization function based on the SIMD (Single Instruction Multiple Data) scheme).
[0015]

The processor numbers are associated with m consecutive integers starting from 0. For example, in the present embodiment, the processor numbers of the processors ia_0, ..., ia_m−1 (denoted ia_0, ..., ia_m−1) are set to the m consecutive integers 0, ..., m−1.
[0016]
In the present embodiment, the number of processors is m = 30 as an example. The m processors may be built into a single computer, or may be distributed across a plurality of computers.
[0017]
Specifically, for example, when applied to a parallel machine (computer) including the m processors ia_0, ..., ia_m−1, the multiple iterative processing massively parallelized source code automatic generation apparatus 1 functions as a preprocessor for the compiler of the parallel machine. It can also function, for example, as a preprocessor for a compiler targeting a plurality of computers (processors) of the same or different architectures connected by a network.
[0018]
FIG. 2 is a diagram schematically showing an example of the non-parallelized source code SC.
[0019]
As shown in FIG. 2, the non-parallelized source code SC is written in, for example, the programming language FORTRAN and includes a non-parallelized n-fold loop (do loop) structure NP nested n (n = 3) deep. The number of iterations (final value expression) of the first loop (loop variable i_1) is "5", that of the second loop (loop variable i_2) is "4", and that of the third loop (loop variable i_3) is "3". If the scanning space PAN formed by this triple loop is expressed as the product of the iteration counts of the individual loops, the scanning space PAN of the non-parallelized triple loop structure NP is 5 × 4 × 3, and the number of processors m = 30 is set in advance to be a divisor of the scanning space PAN = 5 × 4 × 3.
[0020]
FIG. 3 is a schematic flowchart showing an example of the overall processing, including the parallelized source code automatic generation processing, that the CPU 2 performs in accordance with the multiple iterative processing massively parallelized source code automatic generation program P.
[0021]
As shown in FIG. 3, the CPU 2 reads the non-parallelized source code SC stored in the storage device 4 and stores it in the memory 3 (step S1), then searches the stored non-parallelized source code SC to find the non-parallelized n-fold (n = 3) loop structure NP in the code SC (step S2).
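The search in step S2 is not detailed in the text. As a rough illustration only, the following Python sketch scans FORTRAN-like source for the deepest do-loop nest. The sample source, the regular expressions, and the name `find_loop_nest` are all assumptions made for this sketch (FIG. 2 is not reproduced here), and a real preprocessor would need a proper parser.

```python
import re

# Hypothetical stand-in for the source of FIG. 2 (not reproduced in the text).
SAMPLE = """
      do i1 = 1, 5
        do i2 = 1, 4
          do i3 = 1, 3
            call work(i1, i2, i3)
          end do
        end do
      end do
"""

DO_RE = re.compile(r"^\s*do\s+(\w+)\s*=\s*([^,]+),\s*([^,]+)", re.IGNORECASE)
END_RE = re.compile(r"^\s*end\s*do\s*$", re.IGNORECASE)


def find_loop_nest(source):
    """Return (variable, initial value, final value) for the deepest do-loop nest."""
    stack, deepest = [], []
    for line in source.splitlines():
        match = DO_RE.match(line)
        if match:
            var, start, end = (g.strip() for g in match.groups())
            stack.append((var, start, end))
            if len(stack) > len(deepest):
                deepest = list(stack)  # remember the deepest nest seen so far
        elif END_RE.match(line) and stack:
            stack.pop()
    return deepest


print(find_loop_nest(SAMPLE))
```

The initial values returned here ("1" for every loop in the sample) are the expressions that the next step rewrites.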
[0022]
Next, the CPU 2 rewrites the initial value expression of each loop in the non-parallelized triple loop structure NP thus found (i_1 = 1, i_2 = 1, i_3 = 1 in FIG. 2) into an initial value expression S_j expressed by the following formula, using the processor numbers ia_k (k = 0, ..., 29) respectively assigned to the 30 processors ia_0, ..., ia_29 and the increment value δ_j determined for each loop j (j = 1, ..., 3),
[Formula 4]
$$S_j = 1 + \mathrm{Int}\left\{ \frac{\mathrm{mod}\left(ia_k,\ \prod_{l=1}^{j} \delta_l\right)}{\prod_{l=0}^{j-1} \delta_l} \right\}, \qquad \delta_0 = 1$$
and thereby converts the n-fold loop structure portion into a structure that can be shared among the m processors (step S3).
[0023]
Here, "mod (x, y)" in this specification is a function that gives the remainder; for example, mod (7, 4) = 3 (the remainder of 7 ÷ 4 is 3), and similarly mod (10, 5) = 0.
"Int (z)" in this specification is a function that converts to an integer, that is, it rounds the real number z to an integer, which takes discrete values. For the processor assignments described in this embodiment to hold, the fractional part must be discarded, as the FORTRAN INT function does; for example, Int (0.8) = 0.
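The two functions can be sketched in Python as follows. The names `mod` and `int_part` are chosen for this illustration, and the sketch assumes that Int discards the fractional part (as the FORTRAN INT function does) and that the arguments are non-negative, as they are in the formulas of this specification.

```python
def mod(x: int, y: int) -> int:
    """Remainder of x divided by y, for non-negative x and positive y."""
    return x % y


def int_part(z: float) -> int:
    """Convert a real number to an integer by discarding the fractional part."""
    return int(z)


print(mod(7, 4), mod(10, 5), int_part(0.8))
```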
[0024]
Further, in this specification,
[Formula 5]
$$\prod_{l=1}^{j} \delta_l$$
represents "δ_1 × ... × δ_j". Note that δ_0 = 1.
[0025]
As a result, the initial value expressions S_1, S_2, and S_3 of the three loops are expressed by the following formulas, using the processor numbers ia_k (k = 0, ..., 29) respectively assigned to the m = 30 processors and the increment values δ_1, δ_2, and δ_3.
[0026]
[Formula 6]
S1 = 1 + Int {mod (iak, δ1) / δ0}
S2 = 1 + Int {mod (iak, δ2 × δ1) / δ1}
S3 = 1 + Int {mod (iak, δ3 × δ2 × δ1) / (δ2 × δ1)}
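The pattern in S1 to S3 extends to any nesting depth. The following Python sketch computes all n initial values for one processor; the function name `initial_values` is an assumption of this sketch, integer floor division stands in for Int { ... / ... } (the two agree for non-negative operands), and the increments (5, 2, 3) are used only as sample input.

```python
from math import prod


def initial_values(ia_k, deltas):
    """Return [S_1, ..., S_n] for processor number ia_k and increments [δ_1, ..., δ_n]."""
    starts = []
    for j in range(1, len(deltas) + 1):
        upper = prod(deltas[:j])      # δ_1 × ... × δ_j
        lower = prod(deltas[:j - 1])  # δ_0 × ... × δ_(j−1); the empty product gives δ_0 = 1
        starts.append(1 + (ia_k % upper) // lower)
    return starts


for ia_k in (0, 14, 29):
    print(ia_k, initial_values(ia_k, [5, 2, 3]))
```

In effect the processor number is read as a mixed-radix number whose digits, one per loop, select the starting index of that loop.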
[0027]
Subsequently, the CPU 2 uses the rewritten initial value expressions S_1, S_2, S_3, the increment values δ_1, δ_2, δ_3, and the iteration counts 5, 4, 3 of the corresponding loops to convert the non-parallelized source code SC including the non-parallelized triple loop structure NP into a structure that can be shared among the 30 processors, that is, into the parallelized source code PSC including the parallelized triple loop structure PNP shown in FIG. 4 (step S4).
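FIG. 4 is not reproduced in the text, so the exact form of the generated code is unknown. As a hedged sketch of what step S4 might emit, the Python function below prints a FORTRAN-like loop nest in which each loop starts at its rewritten initial value and advances by its increment. The function name, the variable `ia` holding the processor number, the loop body `call work(...)`, and the assumption that the first loop is outermost are all illustrative.

```python
def emit_parallel_loops(loop_vars, trip_counts, deltas, body="call work(i1, i2, i3)"):
    """Emit FORTRAN-like text for the parallelized loop nest (illustrative only)."""
    lines = []
    lower = 1  # δ_0 = 1
    for depth, (var, end, delta) in enumerate(zip(loop_vars, trip_counts, deltas)):
        upper = lower * delta  # δ_1 × ... × δ_j
        start = f"1 + int(mod(ia, {upper}) / {lower})"
        lines.append("  " * depth + f"do {var} = {start}, {end}, {delta}")
        lower = upper
    lines.append("  " * len(loop_vars) + body)
    for depth in reversed(range(len(loop_vars))):
        lines.append("  " * depth + "end do")
    return "\n".join(lines)


print(emit_parallel_loops(["i1", "i2", "i3"], [5, 4, 3], [5, 2, 3]))
```

Every processor receives the same text; only the value of `ia` differs, which is what lets a single object code be loaded onto all processors.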
[0028]
At this time, the increment values δ_1, δ_2, δ_3 determined for each loop j (j = 1, ..., 3) in the triple loop structure, in other words, the division numbers δ_1, δ_2, δ_3 of each loop j, are set to values that are divisors of the iteration counts 5, 4, 3 of the corresponding loops and whose product i_div = δ_1 × δ_2 × δ_3 equals the number of processors m = 30, that is, δ_1 = 5, δ_2 = 2, and δ_3 = 3.
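The text states the two conditions on the increments but not how they are found. One simple possibility, assumed here purely for illustration, is an exhaustive search over the divisors of each iteration count; the names `divisors` and `choose_increments` are hypothetical.

```python
from itertools import product
from math import prod


def divisors(n):
    """All positive divisors of n."""
    return [d for d in range(1, n + 1) if n % d == 0]


def choose_increments(trip_counts, m):
    """Return every (δ_1, ..., δ_n) with δ_j dividing trip_counts[j-1] and product equal to m."""
    return [
        combo
        for combo in product(*(divisors(t) for t in trip_counts))
        if prod(combo) == m
    ]


print(choose_increments((5, 4, 3), 30))
```

For the iteration counts 5, 4, 3 and m = 30, the only combination satisfying both conditions is (5, 2, 3), which matches the values given above.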
[0029]
As a result, the initial value expressions S_1, S_2, and S_3 of the three loops are expressed by the following formulas, using the processor numbers ia_k (k = 0, ..., 29) respectively assigned to the m = 30 processors and the increment values δ_1 = 5, δ_2 = 2, and δ_3 = 3.
[0030]
[Formula 7]
S1 = 1 + Int {mod (iak, δ1) / δ0}
= 1 + Int {mod (iak, 5)}
S2 = 1 + Int {mod (iak, δ2 × δ1) / δ1}
= 1 + Int {mod (iak, 10) / 5}
S3 = 1 + Int {mod (iak, δ3 × δ2 × δ1) / (δ2 × δ1)}
= 1 + Int {mod (iak, 30) / 10}
[0031]
Therefore, the loops (i_1, i_2, i_3) {= (1, 1, 1) to (5, 4, 3)} of the parallelized triple loop structure PNP in the parallelized source code PSC shown in FIG. 4 are assigned to the processors having the processor numbers shown in FIG. 5. The numerical values shown in the ia column of FIG. 5 are consecutive, non-overlapping integers starting from 0 that are associated with the respective processors; in the example of FIG. 5 there are 30 processors, so the values are 0 to 29.
Here, as an example, the case is shown in which the value of K identifying the processor ia_K equals the numerical value associated with that processor. For example, the processing of loop (i_1, i_2, i_3) = (1, 3, 1) is assigned to the processor ia_0 ("0" in FIG. 5), and the processing of loop (i_1, i_2, i_3) = (5, 3, 2) is assigned to the processor ia_14 ("14" in FIG. 5).
[0032]
The concept behind allocating (sharing) each loop iteration of the three-dimensional scanning space (triple loop) structure (scanning space → 5 × 4 × 3) of the present embodiment among the m = 30 processors ia_0, ..., ia_m−1 will now be described with reference to FIG. 6.
[0033]
That is, the idea of the sharing scheme of the present embodiment is to fill the three-dimensional scanning space (i_1, i_2, i_3) {= (1, 1, 1) to (5, 4, 3)} with processor blocks corresponding to the division numbers (δ_1, δ_2, δ_3) = (5, 2, 3) in the respective dimensions. As shown in FIG. 6, the processor block (δ_1 × δ_2 × δ_3) = (5 × 2 × 3), that is, the processors ia_0 to ia_29, is first assigned in order to the three-dimensional scanning space (i_1, i_2, i_3) = (1, 1, 1) to (5, 2, 3), and the same processor block is then assigned in order to (i_1, i_2, i_3) = (1, 3, 1) to (5, 4, 3). In this way, each loop iteration of the three-dimensional scanning space (triple loop) structure (scanning space → 5 × 4 × 3) is assigned to the m = 30 processors ia_0, ..., ia_m−1 so that every processor is assigned an equal number of iterations (two).
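The block-filling idea can be checked with a short Python sketch that enumerates, for every processor number, the index tuples it visits when each loop starts at its rewritten initial value and advances by its increment. The function name `assign_iterations` is an assumption of this sketch; the numbers are those of the embodiment.

```python
from collections import Counter
from itertools import product
from math import prod


def assign_iterations(trip_counts, deltas):
    """Map every loop index tuple to the processor number that executes it."""
    owner = {}
    for ia_k in range(prod(deltas)):
        # Rewritten initial value of each loop for this processor.
        starts = [
            1 + (ia_k % prod(deltas[: j + 1])) // prod(deltas[:j])
            for j in range(len(deltas))
        ]
        # Each loop runs from its start to its final value in steps of its increment.
        ranges = [
            range(s, end + 1, d) for s, end, d in zip(starts, trip_counts, deltas)
        ]
        for point in product(*ranges):
            owner[point] = ia_k
    return owner


owner = assign_iterations((5, 4, 3), (5, 2, 3))
per_processor = Counter(owner.values())
print(owner[(1, 3, 1)], owner[(5, 3, 2)], set(per_processor.values()))
```

The sketch reproduces the two assignments cited from FIG. 5 (processors 0 and 14), covers all 60 index tuples exactly once, and gives each of the 30 processors two iterations.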
[0034]
When the parallelized source code PSC including the parallelized triple loop structure PNP has been generated in this way from the non-parallelized source code SC including the non-parallelized triple loop structure NP, the CPU 2 compiles the parallelized source code PSC into parallelized object code POC (step S5), and loads the parallelized object code POC, the number m of processors sharing the work, and each processor's own processor number onto each of the processors ia_0, ..., ia_29 (step S6).
[0035]
As a result, each of the processors ia_0, ..., ia_29 can execute, in parallel with the others, the loop iterations assigned to it, based on the loaded parallelized object code POC, the number of processors sharing the work, and its own processor number.
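To illustrate that the processors together perform exactly the work of the original loop, the following Python sketch runs the same routine once per processor number and compares the combined result with a serial run. The loop body (summing a value derived from the indices) is a hypothetical stand-in, and the processors are simulated sequentially rather than run concurrently.

```python
def serial_sum():
    """Original triple loop over (1, 1, 1) to (5, 4, 3)."""
    total = 0
    for i1 in range(1, 6):
        for i2 in range(1, 5):
            for i3 in range(1, 4):
                total += 100 * i1 + 10 * i2 + i3
    return total


def share_sum(ia_k):
    """Same loop body, restricted to the iterations assigned to processor ia_k."""
    total = 0
    for i1 in range(1 + ia_k % 5, 6, 5):
        for i2 in range(1 + (ia_k % 10) // 5, 5, 2):
            for i3 in range(1 + (ia_k % 30) // 10, 4, 3):
                total += 100 * i1 + 10 * i2 + i3
    return total


partial = [share_sum(ia_k) for ia_k in range(30)]
print(sum(partial) == serial_sum())
```

Each call depends only on the processor number, so the 30 calls are independent of one another and could run concurrently.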
[0036]
As described above, by using the multiple iterative processing massively parallelized source code automatic generation apparatus 1 operating in accordance with the multiple iterative processing massively parallelized source code automatic generation program P of the present embodiment, the parallelized source code PSC including the parallelized triple loop structure PNP, which can be processed in parallel by the m (m = 30) processors ia_0, ..., ia_29, can be generated automatically and easily from the non-parallelized source code SC including the n-fold (n = 3) nested loop structure. The use and introduction of parallel processing can therefore be promoted.
[0037]
Further, according to the present embodiment, when the parallelized source code PSC including the parallelized triple loop structure PNP that can be processed in parallel by the m (m = 30) processors ia_0, ..., ia_29 is generated from the non-parallelized source code SC including the n-fold (n = 3) nested loop structure, the number of loop iterations assigned to the processors ia_0, ..., ia_29 can be set evenly among all of the processors. Therefore, even when, as illustrated in the present embodiment, the iteration count of each loop constituting the multiple loop (5, 4, 3) is very small compared with the number of processors m (m = 30), the calculation load can be distributed across all of the processors ia_0, ..., ia_29, which improves processor utilization in parallel processing and further increases the parallel processing speed.
[0038]
In the present embodiment, the processor numbers of the m processors ia_0, ..., ia_m−1 may also be set to m arbitrary numbers. If the correspondence between the processor numbers so set and the m consecutive integers ia_0, ..., ia_m−1 (0, ..., m−1, in any order) is defined, the above equation (1) can be used.
[0039]
In the present embodiment, the case of parallelizing a triple loop structure has been described as an example of an n-fold loop structure in non-parallelized source code. However, the present invention is not limited to this and is applicable to parallelizing any multiple loop structure.
[0040]
Furthermore, the number of processors and the number of iterations of each loop are merely examples, and various numbers of processors and iterations can be used.
[0041]
Furthermore, in the present embodiment, the non-parallelized source code to be parallelized is written in the programming language FORTRAN and its multiple loop structure is a do loop structure. However, the present invention is not limited to this and can also be applied to other loop structures written in other programming languages (for example, a loop structure using an if statement and a "goto" statement, or a loop structure using a for statement).
[0042]
In the present embodiment, setting the number of processors m = 30 to a divisor of the scanning space PAN of the n-fold loop, that is, of the product of the iteration counts of the n loops (5 × 4 × 3), was taken as a condition for generating the parallelized source code. However, the present invention is not limited to this. For example, when the number of processors m = 30 is not a divisor of the scanning space PAN of the n-fold loop, the scanning space PAN may be virtually extended to a value divisible by the number of processors m = 30, which makes it possible to use the multiple iterative processing massively parallelized source code automatic generation technique of the present embodiment.
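The text does not say how the virtual extension is carried out. One possible reading, sketched below in Python purely as an assumption, is to round each iteration count up to a multiple of its increment and have each processor skip index tuples that fall outside the real scanning space. The names `padded` and `assign_with_padding`, and the sample values m = 8 with increments (2, 2, 2), are illustrative.

```python
from itertools import product
from math import prod


def padded(count, delta):
    """Smallest multiple of delta that is greater than or equal to count."""
    return -(-count // delta) * delta


def assign_with_padding(trip_counts, deltas):
    """Assign real index tuples to processors over a virtually extended scanning space."""
    virtual = [padded(t, d) for t, d in zip(trip_counts, deltas)]
    owner = {}
    for ia_k in range(prod(deltas)):
        starts = [
            1 + (ia_k % prod(deltas[: j + 1])) // prod(deltas[:j])
            for j in range(len(deltas))
        ]
        ranges = [range(s, v + 1, d) for s, v, d in zip(starts, virtual, deltas)]
        for point in product(*ranges):
            # Skip tuples that exist only in the virtual extension.
            if all(i <= t for i, t in zip(point, trip_counts)):
                owner[point] = ia_k
    return virtual, owner


virtual, owner = assign_with_padding((5, 4, 3), (2, 2, 2))
print(virtual, len(owner), sorted(set(owner.values())))
```

Under this reading every processor visits the same number of virtual iterations, but the number of real iterations per processor can differ slightly, because the skipped tuples are not spread evenly.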
[0043]
Further, in the present embodiment, the multiple iterative processing massively parallelized source code automatic generation apparatus has been described as also compiling the generated parallelized source code to generate the object code. However, the present invention is not limited to this, and the compilation may be performed by another apparatus or by each processor.
[0044]
[Effects of the invention]
As described above, according to the multiple iterative processing massively parallelized source code automatic generation program, automatic generation apparatus, and automatic generation method of the present invention, the initial value expressions of non-parallelized source code including an n-fold nested loop are rewritten into initial value expressions S_j that use the processor numbers ia_k (k = 0, ..., m−1) respectively assigned to the m processors and the increment value δ_j determined for each loop j (j = 1, ..., n). The n-fold loop structure portion of the non-parallelized source code can thereby be converted into a structure that can be shared among the m processors, so that non-parallelized source code can be converted into parallelized source code very easily and the use and introduction of parallel processing can be promoted.
[0045]
Further, according to the multiple iterative processing massively parallelized source code automatic generation program, automatic generation apparatus, and automatic generation method of the present invention, when parallelized source code that can be processed in parallel by m processors is generated from non-parallelized source code including an n-fold nested loop, the number of loop iterations assigned to each processor can be set evenly among all of the processors, so that the calculation load can be distributed across all of the processors. As a result, processor utilization in parallel processing can be improved and the parallel processing speed can be further increased.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a schematic configuration of a multiple iterative process massively parallelized source code automatic generation apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram schematically showing an example of non-parallelized source code.
FIG. 3 is a schematic flowchart showing an example of the overall processing, including the parallelized source code automatic generation processing, performed by the CPU shown in FIG. 1 in accordance with the multiple iterative processing massively parallelized source code automatic generation program.
FIG. 4 is a diagram schematically showing an example of the parallelized source code generated by the parallelized source code automatic generation processing of the CPU in the present embodiment.
FIG. 5 is a diagram showing the processors to which the loops (i_1, i_2, i_3) {= (1, 1, 1) to (5, 4, 3)} of the parallelized triple loop structure are assigned.
FIG. 6 is a diagram for explaining the concept of allocating (sharing) each loop iteration of the three-dimensional scanning space (triple loop) structure (scanning space → 5 × 4 × 3) of the present embodiment among the m = 30 processors ia_0, ..., ia_m−1.
[Explanation of symbols]
1 Multiple iterative processing massively parallelized source code automatic generation apparatus (including compiler)
2 CPU
3 Memory
4 Storage device
ia_0, ..., ia_m−1 Processors
P Multiple iterative processing massively parallelized source code automatic generation program (including a compiler that generates object code)
SC Non-parallelized source code
POC Parallelized object code
NP Non-parallelized n-fold loop structure of the non-parallelized source code
PSC Parallelized source code
PNP Parallelized n-fold loop structure of the parallelized source code

Claims

A multiple iterative processing massively parallelized source code automatic generation program that causes a computer to realize a function of automatically generating, from non-parallel source code containing an n-fold nested loop (n is an integer of 2 or more), parallelized source code that can be executed in parallel by m processors (m is an integer of 2 or more),
wherein the program causes the computer to realize a function of rewriting the initial value expression of each of the n loops of the non-parallelized source code into an initial value expression Sj represented by the following expression, using m consecutive integers iak (k = 0, ..., m−1) that start from 0, are assigned to the m processors, and uniquely identify each processor, and an increment value δj defined for each loop j (j = 1, ..., n),

and of converting the n-fold loop structure portion into a structure that the m processors can process in a shared manner by using the rewritten initial value expression Sj and the increment value δj.

The multiple iterative processing massively parallelized source code automatic generation program according to claim 1,
wherein each of the increment values δ1, ..., δn of the n loops is a divisor of the number of iterations of the corresponding loop, and, when the product δ1 × ... × δn is denoted idiv, idiv is equal to the number of processors m.

The multiple iterative processing massively parallelized source code automatic generation program according to claim 1,
wherein the number of iterations of the n-fold loop structure portion assigned to each of the m processors is equal.

The multiple iterative processing massively parallelized source code automatic generation program according to claim 1,
wherein, when the product of the numbers of iterations of the n loops is denoted as the scanning space PAN, the number of processors m is a divisor of the scanning space PAN.

The multiple iterative processing massively parallelized source code automatic generation program according to claim 1,
wherein, when the product of the numbers of iterations of the n loops is denoted as the scanning space PAN and the number of processors m is not a divisor of the scanning space PAN, the scanning space PAN is virtually expanded so as to be divisible by the number of processors m.

A multiple iterative processing massively parallelized source code automatic generation apparatus that automatically generates, from non-parallel source code containing an n-fold nested loop (n is an integer of 2 or more), parallelized source code that can be executed in parallel by m processors (m is an integer of 2 or more), the apparatus comprising:
means for rewriting the initial value expression of each of the n loops of the non-parallelized source code into an initial value expression Sj represented by the following expression, using m consecutive integers iak (k = 0, ..., m−1) that start from 0, are assigned to the m processors, and uniquely identify each processor, and an increment value δj defined for each loop j (j = 1, ..., n);

and means for converting the n-fold loop structure portion into a structure that the m processors can process in a shared manner by using the rewritten initial value expression Sj and the increment value δj.

A multiple iterative processing massively parallelized source code automatic generation method for automatically generating, from non-parallel source code containing an n-fold nested loop (n is an integer of 2 or more), parallelized source code that can be executed in parallel by m processors (m is an integer of 2 or more),
wherein the initial value expression of each of the n loops of the non-parallelized source code is rewritten into an initial value expression Sj represented by the following expression, using m consecutive integers iak (k = 0, ..., m−1) that start from 0, are assigned to the m processors, and uniquely identify each processor, and an increment value δj defined for each loop j (j = 1, ..., n),

and the n-fold loop structure portion is thereby converted into a structure that the m processors can process in a shared manner.
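
The virtual expansion of the scanning space recited above, for the case where m is not a divisor of PAN, can be pictured with the following Python sketch. It is an assumption-laden illustration, not the method of the embodiment: the per-loop rounding in `expanded_counts`, the guard that skips out-of-range indices, and the offset rule inside `real_iterations` are all hypothetical choices, and the example values (m = 8, strides (2, 2, 2)) are invented for the demonstration.

```python
from itertools import product
from math import prod


def expanded_counts(counts, deltas):
    """Round each loop's iteration count up to the next multiple of its stride
    (assumed way of 'virtually expanding' the scanning space)."""
    return tuple(-(-c // d) * d for c, d in zip(counts, deltas))


def real_iterations(k, counts, deltas):
    """Index tuples processor k actually executes: it walks the expanded
    space with stride delta_j and skips indices beyond the original bounds."""
    offs = []
    for d in deltas:          # hypothetical offset rule: mixed-radix digits of k
        offs.append(k % d)
        k //= d
    virt = expanded_counts(counts, deltas)
    ranges = [range(1 + o, v + 1, d) for o, v, d in zip(offs, virt, deltas)]
    return [t for t in product(*ranges)
            if all(i <= c for i, c in zip(t, counts))]


counts = (5, 4, 3)            # original scanning space, PAN = 60
deltas = (2, 2, 2)            # assumed strides, so m = 8 does not divide 60
m = prod(deltas)
virtual = expanded_counts(counts, deltas)

work = {k: real_iterations(k, counts, deltas) for k in range(m)}
print("expanded space:", virtual, "size", prod(virtual))
print("real iterations per processor:", [len(work[k]) for k in range(m)])
```

In this sketch the expanded space 6 × 4 × 4 = 96 is divisible by m = 8, so every processor walks the same number of virtual iterations (12). The real iterations are still all covered exactly once, but their count per processor is no longer perfectly equal, because some virtual indices fall outside the original bounds.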