JP3628782B2

JP3628782B2 - Parallel distributed processing system

Info

Publication number: JP3628782B2
Application number: JP31017395A
Authority: JP
Inventors: 英樹山中
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1995-11-29
Filing date: 1995-11-29
Publication date: 2005-03-16
Anticipated expiration: 2015-11-29
Also published as: JPH09146900A

Description

【０００１】
【発明の属する技術分野】
本発明は，複数のプロセッサが協調して処理を進める並列分散処理システムに関する。
【０００２】
現在の並列分散処理の環境は，それ自身の複雑さおよび並列処理のためのプログラミングの困難さから一部の専門家の独占物となってしまっているが，単一のＣＰＵによる処理のボトルネックが顕在化している現在，一般の計算機ユーザにも容易で高性能な並列処理を可能とする環境の提供が急務となっている。
【０００３】
【発明が解決しようとする課題】
複数のプロセッサが協調して処理を進める並列処理分散システム，特に，ＬＡＮ，ＷＡＮ環境でヘテロジーニァスな複数のプロセッサを１クラスタとして協調させながら一つのタスクを並列分散処理させるようなシステム環境が考えられている。このような並列処理環境を一般ユーザに提供する際に問題となるのは，簡易性と習得のし易さとであるが，これは性能との間にトレードオフの関係を生ずる。
【０００４】
従来の技術水準では，性能のために簡易性をかなりの程度犠牲にするか，または簡易性のために大幅な性能の低下を甘受せざるを得ない。
性能に関し，プロセッサ間のデータ転送の遅延とスループットが問題になるが，これは，本質的にはデータ転送の遅延の問題に還元できる。並列度の高い計算では，転送するデータの単位が小さく，転送量に関係のない転送回数だけに依存する遅延が主だからである。転送回数に依存する遅延を全体として減らすためには，転送するデータをある程度バッファに蓄積しておいて，まとめて一度に転送する必要があるが，このバッファの大きさをどの程度にすると最適であるのかは，因子が複雑に絡み合っているため，事実上実験してみないことには分からない。
【０００５】
また，プログラムを並列実行するためには，それを並列実行の単位に分割しなければならない。しかし，より小さな単位に分割すればそれだけ多くのＣＰＵが利用可能になる代わりに，実行の単位が小さくなることによる同期のオーバヘッド，コンテキスト・スイッチの増加によるオーバヘッド，データの遅延，データ転送量の増加，メモリのフラグメンテーションによるページングの増加等を招くことになり，ここでもまた，トレードオフを生じる。
【０００６】
エンドユーザに対しても並列処理によるプログラムの高速化，大規模化のメリットを享受できるようにすることが望まれているが，現状では，ある程度の性能を得るためには，エンドユーザでもエキスパート・ユーザの持つ並列処理の煩雑なノウハウを獲得してプログラムのチューニングをしなければならないという矛盾に直面する。
【０００７】
これらの解決の手段として，従来，並列処理のための高級言語，例えば，手続き型として，Ｏｃｃａｍ（Ａ．Ｂｕｒｎｓ，ＰＲＯＧＲＡＭＭＩＮＧＩＮｏｃｃａｍ２，Ａｄｄｉｓｏｎ−Ｗｅｓｌｅｙ，１９８８），ＨＰＦ（ＨｉｇｈＰｅｒｆｏｒｍａｎｃｅＦｏｒｔｒａｎＦｏｒｕｍ，ＨｉｇｈＰｅｒｆｏｒｍａｎｃｅＦｏｒｔｒａｎＬａｎｇｕａｇｅＳｐｅｃｉｆｉｃａｔｉｏｎ，１９９４），関数型として，ＣＬＥＡＮ（Ｒ．ＰｌａｓｍｅｉｊｅｒａｎｄＭ．ｖａｎＥｅｋｅｌｅｎ，ＦｕｎｃｔｉｏｎａｌＰｒｏｇｒａｍｍｉｎｇａｎｄＰａｒａｌｌｅｌＧｒａｐｈＲｅｗｒｉｔｉｎｇ，Ａｄｄｉｓｏｎ−Ｗｅｓｌｅｙ，１９９３），論理型として，ＰＡＲＬＯＧ（Ｔ．Ｃｏｎｌｏｎ，ＰｒｏｇｒａｍｍｉｎｇｉｎＰＡＲＬＯＧ，Ａｄｄｉｓｏｎ−Ｗｅｓｌｅｙ，１９８９）のような高級言語が開発されている。
【０００８】
また，高レベルのライブラリ・インタフェースとして，例えばＰＶＭ（Ａ．Ｇｅｉｓｔ，Ａ．Ｂｅｇｕｅｌｉｎ，Ｊ．Ｄｏｎｇａｒｒａ，Ｗ．Ｊｉａｎｇ，Ｒ．ＭａｎｃｈｅｋａｎｄＶ．Ｓｕｎｄｅｒａｍ，ＰＶＭ：ＰａｒａｌｌｅｌＶｉｒｔｕａｌＭａｃｈｉｎｅ − ＡＵｓｅｒｓ’ ＧｕｉｄｅａｎｄＴｕｔｏｒｉａｌｆｏｒＮｅｔｗｏｒｋｅｄＰａｒａｌｌｅｌＣｏｍｐｕｔｉｎｇ −，ＭＩＴｐｒｅｓｓ，１９９４），ＭＰＩ（ＭｅｓｓａｇｅＰａｓｓｉｎｇＩｎｔｅｒｆａｃｅＦｏｒｕｍ，ＭＰＩ：ＡＭｅｓｓａｇｅ−ＰａｓｓｉｎｇＩｎｔｅｒｆａｃｅＳｔａｎｄａｒｄ，Ｍａｙ５，１９９４）が開発されている。
【０００９】
しかし，高級言語では十分な性能がでないか，性能を出すためにはエキスパート並みのノウハウが必要であり，また，高レベルのライブラリは，未だにエンドユーザが使えるようなレベルに達していない。
【００１０】
他の中間的な解決手段として，比較的低レベルの手続き型の逐次言語と並列処理のための命令言語を組み合わせる方法（Ｉ．Ｆｏｓｔｅｒ，Ｒ．ＯｌｓｏｎａｎｄＳ．Ｔｕｅｃｋｅ，ＰｒｏｄｕｃｔｉｖｅＰａｒａｌｌｅｌＰｒｏｇｒａｍｍｉｎｇ，ＳｃｉｅｎｔｉｆｉｃＰｒｏｇｒａｍｍｉｎｇ，Ｖｏｌ．１，ｐｐ．５１−６６，１９９２；Ｌ．Ａ．ＣｒｏｗｌａｎｄＴ．Ｊ．ＬｅＢｌａｎｃ，ＰａｒａｌｌｅｌＰｒｏｇｒａｍｍｉｎｇｗｉｔｈＣｏｎｔｒｏｌＡｂｓｔｒａｃｔｉｏｎ，ＡＣＭＴｒａｎｓａｃｔｉｏｎｓｏｎＰｒｏｇｒａｍｍｉｎｇＬａｎｇｕａｇｅｓａｎｄＳｙｓｔｅｍｓ，Ｖｏｌ．１６，Ｎｏ．３，ｐｐ．５２４−５７６，１９９４）が提案されている。
【００１１】
これらの方法は，全ての面にわたってエンドユーザであるのではなく，逐次処理のエキスパートであるが並列処理に関しては比較的エンドユーザに近い人を対象として，低レベルの逐次言語のチューニングと並列処理のチューニングとを分離し，並列処理のインタフェース部分だけに簡易で画一化されたチューニング・スタイルを導入するものである。
【００１２】
本発明が対象とするシステムは，後者の考え方にもとづくものであるが，並列処理インタフェース部分のさらなる画一化とエキスパート・ユーザのための汎用性とを推進し，逐次処理部分とのインタフェースに柔軟性を持たせるために，全体を手続き型言語の意味論で統一することを図っている。
【００１３】
例えば，ネットワーク上のワークステーション群あるいは専用のマルチＣＰＵの並列計算機上で並列処理プログラムを開発する場合，通常，ワークステーションを１台だけ使って，あるいは１台のワークステーション上に構築された並列計算機のシミュレータを使って，プログラムを開発するところから始める。また，多くの場合，過去に開発されたプログラムのためのルーティンがライブラリ化されているので，利用可能なライブラリが既に存在すれば，それを再利用するところからプログラムの開発を始める。
【００１４】
このようにしてプログラムの開発が進み，実際にプログラムを実行して全体のデバッグをする段階において，従来は，新たに開発したルーティンの部分に既存のライブラリをリンクしてできるバイナリのコードを，実行時に全てのワークステーション，あるいは全てのＣＰＵに毎回転送することを繰り返していた。この方法では，デバッグを必要としないライブラリの部分までデバッグの度に毎回転送することになり，プログラム開発の効率を下げてしまうという問題があった。また，実際にプログラムを実行して運用する段階においても，同様にライブラリの部分を毎回転送するので，プログラムの起動に掛かる時間が長くなってしまう要因となっていた。
【００１５】
特に，ネットワーク上のワークステーション群を使って処理を行う場合，異なったタイプのワークステーションを使用したいという要求もある。この場合，一つのアプローチとしては，例えば前述したＰＶＭ，ＭＰＩ等の並列処理のメッセージ・パッシング・インタフェースを使って機能単位で逐次処理の部分プログラムを作り，その間を通信によって結び付けて並列処理させる方法がある。この方法では，各プログラムのネィテイブ・コードはワークステーションのタイプごとに違っていても構わない。しかし，ワークステーションのタイプごとに，プログラムをコンパイルしてネイティブ・コードを作成しなければならず，作業が煩雑になるという欠点がある。
【００１６】
他のアプローチとしては，仮想コードを設定して，そのインタプリタを各々のワークステーションに配して並列処理させる方法がある。この方法は１種類のコードしか使用しないので，コンパイルその他の作業が単純であるという利点があるが，仮想コードの実行時におけるインタプリートにかかるオーバヘッドが大きいという欠点がある。
【００１７】
この欠点を克服する方法として，仮想コードからネィティブ・コードを呼び出すインタフェースを設けて，ネイティブ・コードの部分の実行によってインタプリートのオーバヘッドを低減させる方法がある。しかし，この方法はインタプリタに予め標準的なネイティブ・コード・ライブラリをリンクさせておくものであり，ユーザの作成した一般のネイティブ・コードをリンクして使用できるわけではない。
【００１８】
本発明は上記問題点の解決を図り，エンドユーザが容易に並列分散計算を記述できる環境を提供するとともに，仮想コードからの柔軟なネイティブ・コードの利用を可能にすることにより高性能なプログラム実行環境を実現することを目的としている。
【００１９】
【課題を解決するための手段】
図１は，本発明の並列分散処理システムの原理説明図である。
図１において，プロセッサ１０，１０’は，それぞれ仮想コード実行手段１１，１１’，ライブラリ・テーブル１３，１３’，動的リンク手段１４，１４’，ネイティブ・コード・ライブラリ１５，１５’，スケジューラ１６，１６’を備える。プロセッサ１０，１０’上で動作するプログラム１２，１２’は，各プロセッサ１０，１０’に共通の書式（構文）で記述された仮想コードと，各プロセッサ１０，１０’に特有の命令コード群で記述されたネイティブ・コードとからなる。
【００２０】
本発明では，ネットワーク２０上で複数のプロセッサ１０，１０’が協調して一つの計算を進めるようなユーザのプログラムを，各プロセッサ１０，１０’に共通の書式で記述された仮想コードと，各プロセッサ１０，１０’に特有の命令コード群で記述されたネイティブ・コードからなるネイティブ・コード・ライブラリ１５，１５’とに分離し，仮想コードの実行に必要なネイティブ・コード・ライブラリ１５，１５’は，各プロセッサ１０，１０’において動的リンク手段１４，１４’により仮想コードに動的にリンクする。ネイティブ・コード・ライブラリ１５，１５’は，それぞれローカルなファイルシステムに保存される。
【００２１】
仮想コード実行手段１１，１１’は，それぞれプログラム１２，１２’の仮想コードを解釈し実行する手段である。仮想コードの実行中に，まだプログラム１２，１２’中にマッピングされていないネイティブ・コードの呼び出しを検出すると，そのネイティブ・コード・ライブラリ１５，１５’を動的リンク手段１４，１４’によってマッピングし，仮想コード実行手段１１，１１’中のネイティブ・コード・インタフェースによってネイティブ・コードを実行する。
【００２２】
ライブラリ・テーブル１３，１３’は，動的リンク手段１４，１４’がネイティブ・コード・ライブラリ１５，１５’のマッピングを行って動的にリンクする場合に，対象とするネイティブ・コード・ライブラリ１５，１５’の格納場所情報を得るためのテーブルである。
【００２３】
スケジューラ１６，１６’は，プログラム１２，１２’を実行する単位であるプロセス（またはスレッド）に対してＣＰＵ実行権を与える制御を行う手段であり，他のプロセッサと通信を行いながら仮想コード実行手段１１，１１’を制御する。また，負荷分散等のために，プロセス（またはスレッド）を他のプロセッサとの間で送受するマイグレート（移動）の機能を持つ。
【００２４】
例えば，プロセッサ１０のスケジューラ１６がプロセス（またはスレッド）を他のプロセッサ１０’にマイグレートする場合には，スケジューラ１６は，仮想コードの部分と動的リンクに必要な情報，例えばライブラリ・テーブル１３を送り，ネイティブ・コードの部分は送らない。マイグレート先のプロセッサ１０’では，受け取った仮想コードを仮想コード実行手段１１’により先頭の実行開始点または実行中断点から実行し，仮想コードの実行においてネイティブ・コードの呼び出しを検出すると，動的リンク手段１４’によりプログラム１２’にネイティブ・コード・ライブラリ１５’をマッピングして実行を進める。
【００２５】
仮想コード実行手段１１，１１’による仮想コードの実行においてネイティブ・コードの呼び出しを検出したときに，ネイティブ・コード・ライブラリ１５，１５’を動的にリンクする代わりに，スケジューラ１６，１６’間でプロセス（またはスレッド）のマイグレートを行い，スケジューラ１６，１６’が他のプロセッサから仮想コードと動的リンクに必要な情報を受け取ったときに，最初にまとめて一度にネイティブ・コード・ライブラリ１５，１５’を動的にリンクするようにしてもよい。
【００２６】
【発明の実施の形態】
本発明に係る並列分散処理システムでは，並列化が容易で高性能な実行環境を提供するために，例えばネットワーク化されたヘテロジーニァスなクラスタ上で，実行中の任意のプロセスを適度にマイグレート（移動）させながら全体の処理を進めることを可能にすることと，ネットワーク上のプロセス間を高速なストリーム通信で結び付けることを実現する。
【００２７】
すなわち，あるプロセスが通信の応答を待っている時間にＣＰＵを他のプロセスの処理に割り当てることで，ＣＰＵの利用効率を上げると共に，他のプロセスの実行による通信をも時間的にオーバラップさせて全体として平均した場合の通信のレイテンシを低下させることを可能とするシステムの提供を実現する。
【００２８】
本発明の実施の一形態として，ネットワーク上のタイプの違う複数のワークステーションで並列計算用のインタプリタを起動し，その間に仮想コードと動的（ダイナミック）リンクに必要な情報を送り，ネイティブ・コードのライブラリを使って並列計算させる場合を説明する。
【００２９】
マスタプロセッサ，スレーブプロセッサに対応するものを，特定のシステムでのプロセスとして実現し，その内部で複数の仮想コード・インタプリタをスケジューリングしながら実行するようにし，このプロセスにより，ワークステーション間で入出力データ，仮想コード，ライブラリ・テーブル，インタプリタの制御情報，その他を通信する。
【００３０】
（１）要求駆動方式
図２は，本発明の実施の一形態である要求駆動方式による実現例を示す図である。図２において，プロセッサ３０は，ダイナミック・リンカ３２およびネイティブ・コード・インタフェース３３を持つインタプリタ３１，仮想コードで記述されたプログラム３４，動的リンクの情報を格納し，ルックアップするためのライブラリ・テーブル３５，ローカル・ファイルシステムに格納されたネイティブ・コード・ライブラリ３６，他のプロセッサと通信を行いながらインタプリタ３１を制御するスケジューラ３７を備える。インタプリタ３１は，図１に示す仮想コード実行手段１１および動的リンク手段１４に対応する。プロセッサ３０’の構成も同様である。
【００３１】
ここで，プロセッサ３０からプロセッサ３０’へプロセスを移動させる事象が発生したとする。この事象は，ユーザの明示的な移動指示によるものでも，負荷分散制御によりスケジューラ３７が自律的に生じさせるものでもよい。プロセッサ３０のスケジューラ３７は，プログラム３４中の仮想コードとライブラリ・テーブル３５をプロセッサ３０’へ転送する。プロセッサ３０’のスケジューラ３７’は，プロセッサ３０から上記の情報を受け取り，プログラム３４’，ライブラリ・テーブル３５’としてメモリ上に設定する。
【００３２】
ここで，ネイティブ・コードの呼出命令は，呼び出す関数名を「ｆｕｎｃ」とすると，「ｌｉｂ＿ｃａｌｌ ”ｆｕｎｃ”」である。関数名ｆｕｎｃの関数を含むライブラリを格納するファイル名を，ここでは「／ｏｐｔ／Ｘ１１／ｌｉｂ／ｌｉｂＸ１１．ｓｏ」とする。
【００３３】
インタプリタ３１’は，プログラム３４’の仮想コードを実行し，ネイティブ・コードの呼出命令「ｌｉｂ＿ｃａｌｌ ”ｆｕｎｃ”」を検出し，呼び出す関数名ｆｕｎｃを取り出す。これが，後で説明するハッシュ・テーブルに登録されていれば，それを使って直接ネイティブ・コード・インタフェース３３’で関数名ｆｕｎｃの関数を実行する。ハッシュ・テーブルに登録されていない場合には，ダイナミック・リンカ３２’を起動し，ライブラリ・テーブル３５’に登録してあるライブラリ名から，「／ｏｐｔ／Ｘ１１／ｌｉｂ／ｌｉｂＸ１１．ｓｏ」のローカル・ファイル中の対応するネイティブ・コード・ライブラリ３６’の中身をアクセスし，関数名のハッシュ・テーブルを構築し，これをライブラリ・テーブル３５’のライブラリ名と結合する。
【００３４】
関数名ｆｕｎｃの関数が見つかるまでライブラリのハッシュ・テーブルの構築を続け，ｆｕｎｃが見つかったときには，ｆｕｎｃを含むライブラリをプログラム３４’中のネイティブ・コードの記憶領域にマッピングし，そのライブラリが利用しているその他の標準ライブラリとリンク・エディットし，ハッシュ・テーブルに各関数のエントリ・ポイントを登録し，そのライブラリの各関数を実行可能な状態にする。
【００３５】
最後にネイティブ・コード・インタフェース３３’を使って，ｆｕｎｃに対応するライブラリ関数のエントリ・ポイントを検索し，その関数を実行する。
図３は，図２の構成例による要求駆動方式の処理フローチャートである。
【００３６】
図２に示すプロセッサ３０がマスタＣＰＵ，プロセッサ３０’がスレーブＣＰＵであったとする。プロセッサ３０’が，最初何も処理するプロセス（スレッド）がない状況であったところに，図３のステップＳ１１において，マスタのプロセッサ３０からプロセス（スレッド）のマイグレートがあると，処理を開始する。
【００３７】
まず，ステップＳ１２では，仮想コードとライブラリ・テーブルをマスタのプロセッサ３０から受信する。ステップＳ１３では，インタプリタ３１’により，受け取った仮想コードを実行する。実行の結果，ステップＳ１４によりネイティブ・コード・ライブラリ呼び出し命令を検出した場合には，ステップＳ１５の処理へ進み，実行終了を指示するｅｘｉｔ命令を検出した場合には処理を中止する。その他の命令である場合には，ネイティブ・コード・ライブラリ呼び出し命令またはｅｘｉｔ命令のいずれかを検出するまで，ステップＳ１３〜Ｓ１４を繰り返す。
【００３８】
ステップＳ１５では，インタプリタ３１’は，該当するネイティブ・コード・ライブラリ３６’がマッピングされているかどうかを，関数名のハッシュ・テーブルをもとに判定する。ネイティブ・コード・ライブラリ３６’がマッピングされていなければ，ステップＳ１６の処理を行い，ネイティブ・コード・ライブラリ３６’がマッピングされていれば，ステップＳ１７の処理へ進む。
【００３９】
ステップＳ１６では，ライブラリ・テーブル３５’からマッピングするネイティブ・コード・ライブラリ３６’のファイル名を得て，ダイナミック・リンカ３２’により，そのネイティブ・コード・ライブラリ３６’をマッピングし，リンク・エディットを行う。
【００４０】
その後，ステップＳ１７では，インタプリタ３１’のネイティブ・コード・インタフェース３３’に制御を渡してネイティブ・コードを実行する。ネイティブ・コードの実行が終了したならば，ステップＳ１３へ戻り，同様に仮想コードの実行を続ける。
【００４１】
（２）前処理方式
図４は，本発明の実施の一形態である前処理方式による実現例を示す図である。図４に示す実現例は，図２に示す実現例とほぼ同じ構成であるが，ダイナミック・リンカ５４，５４’がインタプリタ５１，５１’内ではなく，スケジューラ５３，５３’の内部にある点が相違する。
【００４２】
マスタのプロセッサ５０から仮想コードとライブラリ・テーブルを転送し，スレーブのプロセッサ５０’が受信するところまでは前述の要求駆動方式と同様である。しかし，この方式では，スレーブのプロセッサ５０’がマスタのプロセッサ５０から仮想コードとライブラリ・テーブルとを受信した後，インタプリタ５１’を起動する前に，受信したライブラリ・テーブル５６’の全てのライブラリについて関数名のハッシュ・テーブルをあらかじめ作り，ライブラリをプログラム５５’のネイティブ・コードの記憶領域にマッピングし，標準ライブラリなどとともにリンク・エディットし，ハッシュ・テーブルにライブラリ関数のエントリ・ポイントを登録する処理を最初にまとめて行う。
【００４３】
インタプリタ５１’は，プログラム５５’の仮想コードを実行し，ネイティブ・コードの呼び出し命令「ｌｉｂ＿ｃａｌｌ ”ｆｕｎｃ”」を検出すると，直接ネイティブ・コード・インタフェース５２’を起動して，ネイティブ・コードで作成された関数ｆｕｎｃを実行する。
【００４４】
図５は，図４に示す構成例による前処理方式の処理フローチャートである。
図４に示すスレーブのプロセッサ５０’が，最初何も処理するプロセス（スレッド）がない状況であったところに，図５のステップＳ２１において，マスタのプロセッサ５０からプロセス（スレッド）のマイグレートがあると，処理を開始する。ステップＳ２２では，プロセッサ５０’は，仮想コードとライブラリ・テーブルをマスタのプロセッサ５０から受信する。
【００４５】
次に，ステップＳ２３では，スケジューラ５３’内のダイナミック・リンカ５４’により，受信したライブラリ・テーブル５６’に登録された全てのネイティブ・コード・ライブラリ５７’をマッピングし，リンク・エディットを行う。
【００４６】
ステップＳ２４では，インタプリタ５１’により仮想コードを実行する。実行の結果，ステップＳ２５により，ネイティブ・コード・ライブラリ呼び出し命令を検出した場合には，ステップＳ２６の処理へ進み，実行終了を指示するｅｘｉｔ命令を検出した場合には処理を中止する。その他の命令である場合には，ネイティブ・コード・ライブラリ呼び出し命令またはｅｘｉｔ命令のいずれかを検出するまで，ステップＳ２４〜Ｓ２５を繰り返す。
【００４７】
ステップＳ２６では，インタプリタ５１’のネイティブ・コード・インタフェース５２’に処理を渡してネイティブ・コードを実行する。ネイティブ・コードの実行が終了したならば，ステップＳ２４へ戻り，同様に仮想コードの実行を続ける。
【００４８】
入力データなどに依存して実行時に様々なライブラリが使用され，ライブラリの登録数が多い割には実行時に使用されるライブラリの数が少ない場合には，図２に示す要求駆動方式のほうが有利である。
【００４９】
一方，実行時に登録されたほとんどのライブラリがプロセッサに使用される場合には，図４に示す前処理方式のほうが，個々のネイティブ・コード呼び出し命令でライブラリが既にマッピングされているかどうかをチェックする必要がないので有利である。
【００５０】
【発明の効果】
本発明によれば，中間コードとしての仮想コードと，各プロセッサに特有のネイティブ・コードとを分離し，仮想コードの実行時または仮想コードの受信時に，ネイティブ・コードを動的にリンクする仕組みを設けることにより，ヘテロジーニァスな複数のプロセッサ間においても，並列処理の処理速度の高速性を保持し，かつ，エンドユーザの使用にも適した平易な並列分散処理システムの提供が可能となる。
【００５１】
例えば，ネットワーク上のワークステーション群あるいは専用のマルチＣＰＵの並列計算機上で並列処理プログラムを開発する場合，システムの標準ライブラリ，他で既に開発を終えたユーザ作成のライブラリ，デバッグが終了したユーザ作成のライブラリ等を更新する必要がない。
【００５２】
また，各プロセッサに特有のネイティブ・コードをマスタのプロセッサから毎回実行の度に，あるいは実行に先立って転送する必要がないので，デバッグその他の開発作業が軽減され，開発が迅速になる。また，開発が終了し，実際にプログラムを起動して使用するときにも，プログラムの起動時間が短縮されるという効果を奏する。
【００５３】
さらに，プログラムを変更する場合，仮想コードの部分のみの変更によりプログラム全体としての変更を柔軟に吸収し，ネイティブ・コード・ライブラリを利用することで性能を引き出すことが可能となるため，チューニングや保守が容易になるという効果を奏する。
【図面の簡単な説明】
【図１】本発明の原理説明図である。
【図２】本発明の実施の一形態である要求駆動方式による実現例を示す図である。
【図３】要求駆動方式による処理フローチャートである。
【図４】本発明の実施の一形態である前処理方式による実現例を示す図である。
【図５】前処理方式による処理フローチャートである。
【符号の説明】
１０，１０’ プロセッサ
１１，１１’ 仮想コード実行手段
１２，１２’ プログラム
１３，１３’ ライブラリ・テーブル
１４，１４’ 動的リンク手段
１５，１５’ ネイティブ・コード・ライブラリ
１６，１６’ スケジューラ
２０ ネットワーク

[0001]
[Technical field to which the invention belongs]
The present invention relates to a parallel distributed processing system in which a plurality of processors cooperate to advance processing.
[0002]
The current environment for parallel distributed processing has become the preserve of a few experts because of its inherent complexity and the difficulty of programming for parallel processing. Now that the bottleneck of processing on a single CPU has become apparent, there is an urgent need to provide an environment that makes easy, high-performance parallel processing available to ordinary computer users as well.
[0003]
[Problems to be solved by the invention]
Parallel distributed processing systems in which a plurality of processors cooperate to advance processing have been considered, in particular system environments in which a plurality of heterogeneous processors in a LAN or WAN environment cooperate as one cluster to process a single task in a parallel, distributed manner. The issues in providing such a parallel processing environment to general users are simplicity and ease of learning, but these stand in a trade-off relationship with performance.
[0004]
With the conventional state of the art, one must either sacrifice simplicity to a considerable degree for the sake of performance, or accept a significant drop in performance for the sake of simplicity.
With regard to performance, the latency and throughput of data transfer between processors are at issue, but this essentially reduces to the problem of data transfer latency. This is because, in computations with a high degree of parallelism, the unit of data transferred is small, and the dominant delay is the one that depends only on the number of transfers, irrespective of the amount transferred. To reduce the delay that depends on the number of transfers overall, the data to be transferred must be accumulated in a buffer to some extent and transferred all at once; however, because the factors involved are intricately intertwined, the optimal buffer size cannot in practice be known without experimentation.
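The effect of buffering on the per-transfer delay can be illustrated with a deliberately simple cost model. The sketch below is not taken from the patent: the function name, the parameters, and the numbers are assumptions chosen for illustration, and the model ignores the time items spend waiting in the buffer, which is one of the factors that makes the optimal buffer size hard to predict.

```python
import math

def total_transfer_delay(n_items, batch_size, per_transfer_latency, per_item_time):
    """Delay to send n_items when batch_size items are buffered per transfer.

    Hypothetical model: every transfer pays a fixed latency regardless of its
    size, plus a cost proportional to the number of items sent.
    """
    transfers = math.ceil(n_items / batch_size)
    return transfers * per_transfer_latency + n_items * per_item_time

# 1000 small items, 1 ms fixed latency per transfer, 1 microsecond per item
# (illustrative values only).
unbuffered = total_transfer_delay(1000, 1, 1e-3, 1e-6)    # one transfer per item
buffered = total_transfer_delay(1000, 100, 1e-3, 1e-6)    # 100 items per transfer
```

Under these assumed values the count-dependent term dominates the unbuffered case, which is the situation the paragraph above describes.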
[0005]
To execute a program in parallel, it must be divided into units of parallel execution. However, while dividing it into smaller units makes more CPUs usable, it also brings synchronization overhead from the smaller execution units, overhead from increased context switching, data latency, increased data transfer volume, increased paging due to memory fragmentation, and so on; here, too, a trade-off arises.
[0006]
It is desirable that end users, too, be able to enjoy the benefits of faster and larger-scale programs through parallel processing. At present, however, this runs into a contradiction: to obtain a reasonable level of performance, even end users must acquire the intricate parallel-processing know-how of expert users and tune their programs.
[0007]
As means of solving these problems, high-level languages for parallel processing have been developed: for example, procedural languages such as Occam (A. Burns, Programming in occam 2, Addison-Wesley, 1988) and HPF (High Performance Fortran Forum, High Performance Fortran Language Specification, 1994); functional languages such as CLEAN (R. Plasmeijer and M. van Eekelen, Functional Programming and Parallel Graph Rewriting, Addison-Wesley, 1993); and logic languages such as PARLOG (T. Conlon, Programming in PARLOG, Addison-Wesley, 1989).
[0008]
In addition, high-level library interfaces have been developed, for example PVM (A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek and V. Sunderam, PVM: Parallel Virtual Machine - A Users' Guide and Tutorial for Networked Parallel Computing -, MIT Press, 1994) and MPI (Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, May 5, 1994).
[0009]
However, high-level languages either do not deliver sufficient performance or require expert-level know-how to do so, and high-level libraries have not yet reached a level at which end users can use them.
[0010]
As another, intermediate solution, methods have been proposed that combine a relatively low-level procedural sequential language with a command language for parallel processing (I. Foster, R. Olson and S. Tuecke, Productive Parallel Programming, Scientific Programming, Vol. 1, pp. 51-66, 1992; L. A. Crowl and T. J. LeBlanc, Parallel Programming with Control Abstraction, ACM Transactions on Programming Languages and Systems, Vol. 16, No. 3, pp. 524-576, 1994).
[0011]
These methods are aimed not at people who are end users in every respect, but at people who are experts in sequential processing yet relatively close to end users where parallel processing is concerned. They separate the tuning of the low-level sequential language from the tuning of parallel processing, and introduce a simple, uniform tuning style only for the parallel-processing interface portion.
[0012]
The system addressed by the present invention is based on the latter idea, but it seeks to unify the whole under the semantics of a procedural language in order to further standardize the parallel-processing interface portion, promote generality for expert users, and give flexibility to the interface with the sequential-processing portion.
[0013]
For example, when developing a parallel processing program on a group of workstations on a network or on a dedicated multi-CPU parallel computer, development usually begins with a single workstation, or with a simulator of the parallel computer built on a single workstation. In many cases, routines from previously developed programs have been made into libraries, so if a usable library already exists, program development begins by reusing it.
[0014]
As development proceeds in this way and reaches the stage of actually running the program and debugging it as a whole, the conventional practice was to transfer the binary code, produced by linking existing libraries to the newly developed routines, to all workstations or all CPUs at every execution. With this method, even library portions that need no debugging are transferred at every debugging run, which lowers the efficiency of program development. Likewise, at the stage of actually running the program in operation, the library portions are transferred every time, which lengthens program start-up time.
[0015]
In particular, when processing is performed using a group of workstations on a network, there is also a demand to use workstations of different types. In this case, one approach is to use a message-passing interface for parallel processing, such as the aforementioned PVM or MPI, to create sequential partial programs per functional unit and connect them by communication so that they run in parallel. With this method, the native code of each program may differ for each type of workstation. However, the program must be compiled into native code for each type of workstation, which makes the work cumbersome.
[0016]
Another approach is to define a virtual code and place an interpreter for it on each workstation for parallel processing. Because this method uses only one kind of code, compilation and other work are simple, but the overhead of interpreting the virtual code at run time is large.
[0017]
One way to overcome this drawback is to provide an interface for calling native code from virtual code, so that executing the native-code portions reduces the interpretation overhead. In this method, however, a standard native code library is linked into the interpreter in advance; general native code written by the user cannot be linked and used.
[0018]
The present invention aims to solve the above problems: to provide an environment in which end users can easily describe parallel distributed computations, and to realize a high-performance program execution environment by enabling flexible use of native code from virtual code.
[0019]
[Means for Solving the Problems]
FIG. 1 is an explanatory diagram of the principle of the parallel distributed processing system of the present invention.
In FIG. 1, processors 10 and 10' each comprise virtual code execution means 11, 11', a library table 13, 13', dynamic link means 14, 14', a native code library 15, 15', and a scheduler 16, 16'. The programs 12, 12' running on the processors 10, 10' consist of virtual code described in a format (syntax) common to the processors 10, 10' and native code described in an instruction code set specific to each processor 10, 10'.
[0020]
In the present invention, a user program in which a plurality of processors 10, 10' cooperate on the network 20 to advance a single computation is separated into virtual code described in a format common to the processors 10, 10' and native code libraries 15, 15' consisting of native code described in an instruction code set specific to each processor 10, 10'. The native code libraries 15, 15' needed to execute the virtual code are dynamically linked to the virtual code by the dynamic link means 14, 14' in each processor 10, 10'. The native code libraries 15, 15' are each stored in a local file system.
[0021]
The virtual code execution means 11, 11' interpret and execute the virtual code of the programs 12, 12', respectively. When, during execution of the virtual code, a call to native code not yet mapped into the program 12, 12' is detected, the native code library 15, 15' is mapped by the dynamic link means 14, 14', and the native code is executed through the native code interface in the virtual code execution means 11, 11'.
[0022]
The library tables 13, 13' are tables from which the dynamic link means 14, 14' obtain the storage location information of the target native code library 15, 15' when mapping and dynamically linking it.
[0023]
The schedulers 16, 16' are means that control the granting of CPU execution rights to processes (or threads), the units in which the programs 12, 12' are executed, and they control the virtual code execution means 11, 11' while communicating with other processors. They also have a migration function that sends processes (or threads) to and receives them from other processors, for load balancing and similar purposes.
[0024]
For example, when the scheduler 16 of the processor 10 migrates a process (or thread) to another processor 10', the scheduler 16 sends the virtual code portion and the information needed for dynamic linking, for example the library table 13, and does not send the native code portion. The migration-destination processor 10' executes the received virtual code with the virtual code execution means 11', from the initial execution start point or from the point where execution was suspended; when a call to native code is detected during execution of the virtual code, the dynamic link means 14' maps the native code library 15' into the program 12' and execution proceeds.
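The division between what is sent on migration and what stays local can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Process` class, its field names, the tuple encoding of instructions, and the explicit `pc` resume point are all assumptions introduced here.

```python
from dataclasses import dataclass, field

@dataclass
class Process:
    """A migratable process (hypothetical representation)."""
    virtual_code: list     # processor-independent instructions, as (opcode, operand)
    library_table: list    # storage locations of the native code libraries
    pc: int = 0            # 0 = start point, >0 = point where execution was suspended
    # Native code already mapped on this processor; never leaves the processor.
    native_cache: dict = field(default_factory=dict)

def migration_payload(proc):
    """Build what the sending scheduler transmits: no native code is included."""
    return {
        "virtual_code": list(proc.virtual_code),
        "library_table": list(proc.library_table),
        "pc": proc.pc,
    }

def receive_migration(payload):
    """Recreate the process at the destination with an empty native-code cache."""
    return Process(payload["virtual_code"], payload["library_table"], payload["pc"])

src = Process([("lib_call", "func"), ("exit", None)],
              ["/opt/X11/lib/libX11.so"], pc=1,
              native_cache={"func": object()})
dst = receive_migration(migration_payload(src))
```

The destination then links its own copy of each library from its local file system, which is what allows the two processors to use different native instruction sets.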
[0025]
Instead of dynamically linking the native code libraries 15, 15' when a call to native code is detected during execution of the virtual code by the virtual code execution means 11, 11', the native code libraries 15, 15' may be dynamically linked all at once at the outset, when a process (or thread) is migrated between the schedulers 16, 16' and the scheduler 16, 16' receives the virtual code and the information needed for dynamic linking from another processor.
[0026]
[Embodiments of the invention]
To provide a high-performance execution environment in which parallelization is easy, the parallel distributed processing system according to the present invention makes it possible, for example on a networked heterogeneous cluster, to advance the overall processing while migrating (moving) arbitrary running processes as appropriate, and connects processes on the network by high-speed stream communication.
[0027]
That is, by assigning the CPU to the processing of other processes while a given process waits for a communication response, CPU utilization is raised, and the communication arising from the execution of other processes is overlapped in time, which makes it possible to provide a system that lowers communication latency averaged over the whole.
[0028]
As one embodiment of the present invention, a case is described in which interpreters for parallel computation are started on a plurality of workstations of different types on a network, the virtual code and the information needed for dynamic linking are sent between them, and parallel computation is carried out using native code libraries.
[0029]
The counterparts of a master processor and slave processors are realized as processes on a particular system; within each such process, a plurality of virtual code interpreters are executed under scheduling, and these processes communicate input/output data, virtual code, library tables, interpreter control information, and so on between workstations.
[0030]
(1) Request-driven method
FIG. 2 shows an implementation example using the request-driven method, which is one embodiment of the present invention. In FIG. 2, the processor 30 comprises an interpreter 31 having a dynamic linker 32 and a native code interface 33; a program 34 written in virtual code; a library table 35 for storing and looking up dynamic-link information; a native code library 36 stored in the local file system; and a scheduler 37 that controls the interpreter 31 while communicating with other processors. The interpreter 31 corresponds to the virtual code execution means 11 and the dynamic link means 14 shown in FIG. 1. The processor 30' has the same configuration.
[0031]
Suppose now that an event occurs that causes a process to be moved from the processor 30 to the processor 30'. This event may result from an explicit move instruction by the user, or may be generated autonomously by the scheduler 37 under load-balancing control. The scheduler 37 of the processor 30 transfers the virtual code in the program 34 and the library table 35 to the processor 30'. The scheduler 37' of the processor 30' receives this information from the processor 30 and sets it up in memory as the program 34' and the library table 35'.
[0032]
Here, the instruction that calls native code is “lib_call "func"”, where “func” is the name of the function to call. The file that stores the library containing the function func is assumed here to be “/opt/X11/lib/libX11.so”.
[0033]
The interpreter 31' executes the virtual code of the program 34', detects the native code call instruction “lib_call "func"”, and extracts the name func of the function to call. If this name is registered in the hash table described below, the interpreter uses that entry to execute the function func directly through the native code interface 33'. If it is not registered in the hash table, the dynamic linker 32' is started; using the library name registered in the library table 35', it accesses the contents of the corresponding native code library 36' in the local file “/opt/X11/lib/libX11.so”, builds a hash table of function names, and associates it with the library name in the library table 35'.
[0034]
Hash tables of the libraries continue to be built until the function func is found. When func is found, the library containing func is mapped into the native code storage area in the program 34' and link-edited with the other standard libraries that it uses; the entry point of each function is registered in the hash table, and each function of the library is made executable.
[0035]
Finally, the native code interface 33' is used to look up the entry point of the library function corresponding to func, and the function is executed.
FIG. 3 is a processing flowchart of the request-driven method for the configuration example of FIG. 2.
[0036]
Assume that the processor 30 shown in FIG. 2 is the master CPU and the processor 30' is a slave CPU. The processor 30' initially has no process (thread) to run; when a process (thread) is migrated from the master processor 30 in step S11 of FIG. 3, it begins processing.
[0037]
First, in step S12, the virtual code and the library table are received from the master processor 30. In step S13, the received virtual code is executed by the interpreter 31 ′. As a result of execution, if a native code library call instruction is detected in step S14, the process proceeds to step S15, and if an exit instruction instructing the end of execution is detected, the process is stopped. If it is another instruction, steps S13 to S14 are repeated until either the native code library call instruction or the exit instruction is detected.
[0038]
In step S15, the interpreter 31 ′ determines whether or not the corresponding native code library 36 ′ is mapped based on the hash table of function names. If the native code library 36 'is not mapped, the process of step S16 is performed. If the native code library 36' is mapped, the process proceeds to step S17.
[0039]
In step S16, the file name of the native code library 36' to be mapped is obtained from the library table 35', and the dynamic linker 32' maps that native code library 36' and performs link editing.
[0040]
Thereafter, in step S17, control is passed to the native code interface 33 ′ of the interpreter 31 ′ to execute the native code. When the execution of the native code is completed, the process returns to step S13, and the execution of the virtual code is continued in the same manner.
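Steps S13 to S17 can be sketched as a small interpreter loop that links libraries only when a call first needs them. This is an illustrative simulation under stated assumptions, not the patent's implementation: native code libraries are stood in for by dictionaries of Python callables in `LOCAL_LIBRARIES`, the paths under `/hypothetical/` and the function `square` are invented for the example, and the instruction encoding is hypothetical. A real system would map shared objects from the local file system instead.

```python
# Stand-in for the local file system: library path -> {function name: callable}.
LOCAL_LIBRARIES = {
    "/hypothetical/libA.so": {"square": lambda x: x * x},
    "/opt/X11/lib/libX11.so": {"func": lambda x: x + 1},
    "/hypothetical/libB.so": {"negate": lambda x: -x},
}

def run_request_driven(virtual_code, library_table, value):
    """Interpret virtual code, linking a native library only on first use."""
    name_to_library = {}   # hash table of function names built so far
    entry_points = {}      # function name -> entry point, mapped libraries only
    unscanned = list(library_table)
    mapped = []            # libraries actually mapped, in order
    pc = 0
    while True:
        op, operand = virtual_code[pc]              # S13: execute next instruction
        pc += 1
        if op == "exit":                            # S14: exit ends the process
            return value, mapped
        if op == "lib_call":                        # S14: native call detected
            if operand not in entry_points:         # S15: library not mapped yet
                # S16: build name tables library by library until func is found.
                while operand not in name_to_library:
                    if not unscanned:
                        raise LookupError(f"{operand} not in any listed library")
                    path = unscanned.pop(0)
                    for name in LOCAL_LIBRARIES[path]:
                        name_to_library.setdefault(name, path)
                path = name_to_library[operand]
                entry_points.update(LOCAL_LIBRARIES[path])   # map and register
                mapped.append(path)
            value = entry_points[operand](value)    # S17: run the native code
        # Any other opcode would be interpreted here before looping to S13.

program = [("lib_call", "func"), ("lib_call", "func"), ("exit", None)]
table = ["/hypothetical/libA.so", "/opt/X11/lib/libX11.so", "/hypothetical/libB.so"]
result, mapped_libraries = run_request_driven(program, table, 1)
```

In this run only the library that defines func is mapped, and the third library in the table is never even scanned, which is the saving the request-driven method is after.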
[0041]
(2) Pre-processing method
FIG. 4 shows an implementation example using the pre-processing method, which is one embodiment of the present invention. The implementation example shown in FIG. 4 has almost the same configuration as that of FIG. 2, but differs in that the dynamic linkers 54, 54' are located inside the schedulers 53, 53' rather than inside the interpreters 51, 51'.
[0042]
Up to the point where the virtual code and the library table are transferred from the master processor 50 and received by the slave processor 50', the procedure is the same as in the request-driven method described above. In this method, however, after the slave processor 50' receives the virtual code and the library table from the master processor 50 and before it starts the interpreter 51', the following are all done together at the outset for every library in the received library table 56': a hash table of function names is built, the library is mapped into the native code storage area of the program 55', it is link-edited together with the standard libraries and the like, and the entry points of the library functions are registered in the hash table.
[0043]
The interpreter 51' executes the virtual code of the program 55' and, on detecting the native code call instruction “lib_call "func"”, directly invokes the native code interface 52' to execute the function func written in native code.
[0044]
FIG. 5 is a processing flowchart of the pre-processing method for the configuration example shown in FIG. 4.
The slave processor 50' shown in FIG. 4 initially has no process (thread) to run; when a process (thread) is migrated from the master processor 50 in step S21 of FIG. 5, it begins processing. In step S22, the processor 50' receives the virtual code and the library table from the master processor 50.
[0045]
Next, in step S23, all the native code libraries 57 'registered in the received library table 56' are mapped by the dynamic linker 54 'in the scheduler 53', and link editing is performed.
[0046]
In step S24, the virtual code is executed by the interpreter 51 '. As a result of execution, if a native code library call instruction is detected in step S25, the process proceeds to step S26, and if an exit instruction instructing the end of execution is detected, the process is stopped. If it is another instruction, steps S24 to S25 are repeated until either the native code library call instruction or the exit instruction is detected.
[0047]
In step S26, the native code is executed by passing processing to the native code interface 52 'of the interpreter 51'. When the execution of the native code is completed, the process returns to step S24 and the execution of the virtual code is continued in the same manner.
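Steps S23 to S26 can be sketched in the same style, with all linking done before interpretation starts. As before, this is a simulation under assumptions: libraries are dictionaries of Python callables, the `/hypothetical/` path and the function `square` are invented, and resolving duplicate names in favour of the first library in the table is a choice made for this sketch rather than something the patent specifies.

```python
# Stand-in for the local file system: library path -> {function name: callable}.
LOCAL_LIBRARIES = {
    "/hypothetical/libA.so": {"square": lambda x: x * x},
    "/opt/X11/lib/libX11.so": {"func": lambda x: x + 1},
}

def prelink(library_table):
    """S23: map and link every library in the table before the interpreter starts."""
    entry_points = {}
    for path in library_table:
        for name, fn in LOCAL_LIBRARIES[path].items():
            entry_points.setdefault(name, fn)   # first library listed wins (assumption)
    return entry_points

def run_preprocessed(virtual_code, library_table, value):
    """Interpret virtual code against entry points that were all linked up front."""
    entry_points = prelink(library_table)       # S23
    pc = 0
    while True:
        op, operand = virtual_code[pc]          # S24: execute next instruction
        pc += 1
        if op == "exit":                        # S25: exit ends the process
            return value
        if op == "lib_call":                    # S25 -> S26: no "is it mapped?" check
            value = entry_points[operand](value)
        # Any other opcode would be interpreted here before looping to S24.

program = [("lib_call", "square"), ("lib_call", "func"), ("exit", None)]
table = ["/hypothetical/libA.so", "/opt/X11/lib/libX11.so"]
result = run_preprocessed(program, table, 3)
```

The call path is shorter than in the request-driven sketch because the mapped-or-not test disappears, at the price of linking every listed library whether or not it is used.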
[0048]
When various libraries are used at run time depending on the input data and the like, so that few libraries are actually used at run time relative to the number registered, the request-driven method shown in FIG. 2 is more advantageous.
[0049]
On the other hand, when most of the registered libraries are used by the processor at run time, the pre-processing method shown in FIG. 4 is more advantageous, because there is no need to check at each native code call instruction whether the library has already been mapped.
[0050]
[Effects of the invention]
According to the present invention, virtual code serving as intermediate code is separated from native code specific to each processor, and a mechanism is provided that dynamically links the native code when the virtual code is executed or when it is received. This makes it possible to provide a parallel distributed processing system that maintains high parallel-processing speed even among a plurality of heterogeneous processors and is simple enough for end users.
[0051]
For example, when developing a parallel processing program on a group of workstations on a network or on a dedicated multi-CPU parallel computer, there is no need to update the system's standard libraries, user-created libraries whose development has already been completed elsewhere, user-created libraries that have already been debugged, and the like.
[0052]
Moreover, since the native code specific to each processor need not be transferred from the master processor at every execution or prior to execution, debugging and other development work is reduced and development becomes faster. Also, once development is finished and the program is actually started and used, program start-up time is shortened.
[0053]
Furthermore, when a program is changed, changes to the program as a whole can be absorbed flexibly by changing only the virtual code portion, while performance can still be drawn out by using the native code libraries, so tuning and maintenance become easier.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating the principle of the present invention.
FIG. 2 is a diagram illustrating an implementation example using the request-driven method, which is one embodiment of the present invention.
FIG. 3 is a processing flowchart of the request-driven method.
FIG. 4 is a diagram illustrating an implementation example using the pre-processing method, which is one embodiment of the present invention.
FIG. 5 is a processing flowchart of the pre-processing method.
[Explanation of symbols]
10, 10' processor
11, 11' virtual code execution means
12, 12' program
13, 13' library table
14, 14' dynamic link means
15, 15' native code library
16, 16' scheduler
20 network

Claims

A parallel distributed processing system in which a plurality of processors cooperate to advance processing, characterized in that
each processor comprises:
library storage means for storing a library of native code described in an instruction code set specific to each processor;
virtual code execution means for interpreting and executing virtual code, which is a program described in a format common to the processors;
dynamic link means for dynamically linking the library of native code stored in the library storage means when a call to native code is detected during execution of the virtual code; and
means for transmitting, to another processor, the virtual code to be executed by that processor and storage location information of the target native code library required for dynamic linking.

A parallel distributed processing system in which a plurality of processors cooperate to advance processing, characterized in that
each processor comprises:
library storage means for storing a library of native code described in an instruction code set specific to each processor;
virtual code execution means for interpreting and executing virtual code, which is a program described in a format common to the processors;
means for transmitting, to another processor, the virtual code to be executed by that processor and storage location information of the target native code library required for dynamic linking; and
dynamic link means for dynamically linking the libraries of native code stored in the library storage means when virtual code received from another processor is to be executed.