JP2012252591A

JP2012252591A - Process allocation system, process allocation method, and process allocation program

Info

Publication number: JP2012252591A
Application number: JP2011125566A
Authority: JP
Inventors: Junichi Oba (淳一大庭); Hiroki Hasebe (浩樹長谷部)
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2011-06-03
Filing date: 2011-06-03
Publication date: 2012-12-20

Abstract

PROBLEM TO BE SOLVED: To provide automatic optimal placement of processes that shortens the execution time of a parallel program using an MPI library on a large-scale PC cluster system with a three-dimensional torus structure. SOLUTION: In a large-scale PC cluster system in which network switches, each connected to a plurality of nodes, are interconnected in a three-dimensional torus structure, a job management node 11, which allocates processes for executing a parallel MPI program, holds the position coordinates of each node in the three-dimensional torus structure. Based on these position coordinates, it calculates the number of network switches on the path through the torus between each calculation node 12 connected to a network switch and each I/O node 13 that the calculation nodes access during file input/output; it identifies the pair of I/O node 13 and calculation node 12 for which the calculated value is smallest; and it determines the MPI rank of the calculation node 12 in the identified pair.

Description

The present invention relates to a process allocation system, a process allocation method, and a process allocation program, and more specifically to a technique that enables automatic optimal placement of processes in order to shorten the execution time of a parallel program using an MPI library on a large-scale PC cluster system having a three-dimensional torus structure.

In recent large-scale PC cluster systems, calculation nodes are connected by a network topology with a three-dimensional torus structure, and a parallel program is executed by using many calculation nodes simultaneously. The computing resources that execute the parallel program are determined by a job management node. The job management node reserves the user-specified number of calculation nodes that have the computing power and memory capacity needed to execute the parallel program. The parallel program performs its calculations using the CPU resources of the calculation nodes reserved in this way, and advances the overall calculation while performing data input/output and file input/output with other nodes.

Most parallel programs in scientific and technical computing are implemented using an MPI (Message Passing Interface) library. When a parallel program runs, messages (data) are exchanged among multiple processes in accordance with the MPI protocol, and the calculation as a whole proceeds.

In relation to this, a scheduling technique has been proposed (see Patent Document 1) that, when a large-scale MPI program is submitted as a job to a PC cluster system whose calculation nodes are connected by a three-dimensional torus network topology, reserves resources so that the distance between calculation nodes is minimized.

Patent Document 1: JP 2006-146864 A

As described above, a parallel program executed on a large-scale PC cluster system advances the overall calculation while exchanging messages (data) among multiple processes. The calculation nodes that execute the processes are determined by the job management program, using the conventional technique described above or the like. The time required to execute a parallel program is the sum of the calculation time and the communication time. Of these, the calculation time is optimized so that it is roughly the same for every process. The communication time, on the other hand, varies with the distance between the calculation nodes running the processes that send and receive messages.

Therefore, to shorten the total time required to execute a parallel program, a technique is needed that optimally places processes, in addition to allocating calculation nodes, so that the communication time is also reduced as far as possible. In the prior art, however, reducing the communication time was not considered when executing a parallel program; users manually collected communication logs and tried to work out which calculation nodes the processes should be assigned to.

Accordingly, an object of the present invention is to provide a technique that enables automatic optimal placement of processes in order to shorten the execution time of a parallel program using an MPI library on a large-scale PC cluster system having a three-dimensional torus structure.

A process allocation system of the present invention that solves the above problem is an information processing system that allocates processes for executing a parallel MPI program in a large-scale PC cluster system in which network switches, each connected to a plurality of nodes, are interconnected in a three-dimensional torus structure. The system comprises: a storage unit that holds the position coordinates of each node in the three-dimensional torus structure; and an arithmetic unit that, based on the position coordinates held in the storage unit, calculates the number of network switches on the path through the three-dimensional torus structure between each calculation node, which is connected to a network switch and receives a process allocation to perform calculation when the parallel program is executed, and each I/O node, which is an access target of the calculation nodes during file input/output, identifies the pair of I/O node and calculation node for which the calculated value is smallest, and determines the MPI rank of the calculation node in the identified pair.
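The calculation described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names, the dictionary layout, the rule that the switch count equals the torus hop count plus one (both end switches counted), and the choice to give rank 0 to the calculation node of the closest pair and order the remaining ranks by distance are all assumptions made for the example.

```python
from itertools import product

def ring_hops(a, b, size):
    """Shortest hop count between positions a and b on a ring of `size` switches."""
    d = abs(a - b) % size
    return min(d, size - d)

def switch_count(src, dst, dims):
    """Network switches on a shortest torus path between two switch coordinates.

    Assumption: both end switches are counted, so two nodes on the same
    switch give a count of 1.
    """
    hops = sum(ring_hops(s, d, n) for s, d, n in zip(src, dst, dims))
    return hops + 1

def closest_pair(calc_nodes, io_nodes, dims):
    """Return the (calculation node, I/O node) names with the smallest switch count."""
    return min(
        product(calc_nodes, io_nodes),
        key=lambda p: switch_count(calc_nodes[p[0]], io_nodes[p[1]], dims),
    )

def assign_ranks(calc_nodes, io_nodes, dims):
    """Hypothetical rank policy: rank 0 goes to the calculation node of the
    closest pair; the others follow in order of distance to that I/O node."""
    best_cn, best_io = closest_pair(calc_nodes, io_nodes, dims)
    others = sorted(
        (cn for cn in calc_nodes if cn != best_cn),
        key=lambda cn: (switch_count(calc_nodes[cn], io_nodes[best_io], dims), cn),
    )
    return {cn: rank for rank, cn in enumerate([best_cn] + others)}

dims = (4, 4, 4)  # torus size along x, y, z (example values)
calc_nodes = {"c0": (0, 0, 0), "c1": (3, 0, 0), "c2": (2, 2, 2)}
io_nodes = {"io0": (3, 3, 0)}
ranks = assign_ranks(calc_nodes, io_nodes, dims)
```

In this example `c1` is one hop from `io0` thanks to the wraparound along the y axis, so it forms the closest pair and receives rank 0.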

In a process allocation method of the present invention, in order to allocate processes for executing a parallel MPI program in a large-scale PC cluster system in which network switches, each connected to a plurality of nodes, are interconnected in a three-dimensional torus structure, a computer that holds, in a storage unit, the position coordinates of each node in the three-dimensional torus structure executes processing that: calculates, based on the position coordinates held in the storage unit, the number of network switches on the path through the three-dimensional torus structure between each calculation node, which is connected to a network switch and receives a process allocation to perform calculation when the parallel program is executed, and each I/O node, which is an access target of the calculation nodes during file input/output; identifies the pair of I/O node and calculation node for which the calculated value is smallest; and determines the MPI rank of the calculation node in the identified pair.

A process allocation program of the present invention, in order to allocate processes for executing a parallel MPI program in a large-scale PC cluster system in which network switches, each connected to a plurality of nodes, are interconnected in a three-dimensional torus structure, causes a computer that holds, in a storage unit, the position coordinates of each node in the three-dimensional torus structure to execute processing that: calculates, based on the position coordinates held in the storage unit, the number of network switches on the path through the three-dimensional torus structure between each calculation node, which is connected to a network switch and receives a process allocation to perform calculation when the parallel program is executed, and each I/O node, which is an access target of the calculation nodes during file input/output; identifies the pair of I/O node and calculation node for which the calculated value is smallest; and determines the MPI rank of the calculation node in the identified pair.

According to the present invention, automatic optimal placement of processes becomes possible, shortening the execution time of a parallel program using an MPI library on a large-scale PC cluster system having a three-dimensional torus structure.

A diagram showing a configuration example of a large-scale PC cluster system including the process allocation system (job management node) of this embodiment.
A diagram showing a configuration example of the calculation node of this embodiment.
A diagram showing a configuration example of the job management node of this embodiment.
A diagram showing the one-dimensional torus structure in this embodiment.
A diagram showing a configuration example of the network switch and calculation nodes in this embodiment.
A diagram showing the two-dimensional torus structure of this embodiment.
A diagram showing part of the three-dimensional torus structure of this embodiment.
A diagram showing the large-scale PC cluster system based on the three-dimensional torus of this embodiment.
A diagram showing parallel processing by a collective communication function of an MPI program in this embodiment.
A diagram showing the space in which a parallel program is executed on the large-scale PC cluster system based on the three-dimensional torus of this embodiment.
A diagram showing the size of the large-scale PC cluster system based on the three-dimensional torus of this embodiment.
A diagram showing the calculation objects that make up the large-scale PC cluster system based on the three-dimensional torus of this embodiment.
A diagram showing the I/O objects that make up the large-scale PC cluster system based on the three-dimensional torus of this embodiment.
A diagram showing the naming convention for network switches and nodes in this embodiment.
A diagram showing the distance between objects in the torus of this embodiment.
A diagram showing the distance between nodes in the large-scale PC cluster system based on the three-dimensional torus of this embodiment.
A flowchart showing processing procedure example 1 of the process allocation method of this embodiment.
A flowchart showing processing procedure example 2 of the process allocation method of this embodiment.
A diagram showing the model used to calculate the minimum distance between an I/O node and a calculation node in this embodiment.
A flowchart showing processing procedure example 3 of the process allocation method of this embodiment.
A diagram showing the work area used in the minimum distance calculation for MPI collective communication function execution in this embodiment.
A flowchart showing processing procedure example 4 of the process allocation method of this embodiment.
A diagram showing the function specification (API) of the recursive function that performs the minimum distance calculation for MPI collective communication function execution in this embodiment.
A flowchart showing processing procedure example 5 of the process allocation method of this embodiment.
A flowchart showing processing procedure example 6 of the process allocation method of this embodiment.

Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 is a network configuration diagram including the job management node 11, which serves as the process allocation system of this embodiment. The job management node 11 shown in FIG. 1 is a computer that enables automatic optimal placement of processes, shortening the execution time of a parallel program using an MPI library on a large-scale PC cluster system having a three-dimensional torus structure.

The large-scale PC cluster system 1, to which many calculation nodes 12 and other nodes are connected, has a configuration in which a login node 10, the job management node 11, calculation nodes 12, and I/O nodes 13 are connected to a network 14. Each node is a single computer and communicates with the other nodes through the network 14. A user of the PC cluster system 1 connects to the login node 10 using terminal emulator software or the like and performs program editing, compilation, data editing, and so on. To execute a parallel MPI program that uses multiple calculation nodes 12 simultaneously on the PC cluster system 1, the user submits a job by an operation on the login node 10.

FIG. 2 shows a configuration example of the calculation node of this embodiment, and FIG. 3 shows a configuration example of the job management node of this embodiment. When a job is submitted as described above, the login node 10 notifies the job management program 304 running on the job management node 11. The scheduling function 306, one of the functions of the job management program 304, determines the calculation node candidates that will execute the parallel program and stores that information in the job queue 307.

The job control function 305, another function of the job management program 304, constantly tracks the state of the calculation nodes 12 that are running jobs. Once it confirms that the calculation node candidates determined by the scheduling function 306 are available, it creates the processes of the parallel program on those candidates and starts their execution simultaneously.

The processes of the parallel program running on the calculation nodes 12 advance the execution of the parallel program as a whole while, through the network 14, accessing files on the I/O nodes 13 and exchanging messages with processes of the parallel program running on other calculation nodes 12.

When the job control function 305 of the job management program 304 detects that all processes of a running parallel program have ended, all calculation nodes 12 that were executing the parallel program change from the program-running state to the idle state. The job control function 305 can then use the idle calculation nodes 12 as computing resources for executing new programs. The job control function 305 of the job management program 304 sends information about the finished job, by e-mail or similar means, to the user (or the user's terminal) who performed the operation on the login node 10. Having received this notification, the user can check the output, execution status, error information, and so on of the submitted program by operating the login node 10.

The hardware configurations of the calculation node 12 and the job management node 11 are now described in more detail with reference to FIGS. 2 and 3.

The calculation node 12 comprises a storage unit 201 composed of memory, a hard disk drive, and the like; a CPU unit 202 that executes the programs stored in the storage unit 201, such as the parallel program 204, and performs overall control of the computer and various calculations; and a network interface unit 203 that exchanges data with other devices via the network 14. The storage unit 201, the CPU unit 202, and the network interface unit 203 are connected by a bus.

When the parallel program 204 is executed on the calculation node 12, the program is loaded into the storage unit 201 and a given operating system (not shown) controls its execution. The parallel program 204 consists of a calculation function 205, a file input/output function 206, a data input/output function 207, and so on, and uses the CPU unit 202 and the storage unit 201 of the calculation node 12. The file input/output function 206 and the data input/output function 207 of the parallel program 204 need to communicate outside the calculation node, and do so using the network interface unit 203.

The job management node 11 comprises a storage unit 301 composed of memory, a hard disk drive, and the like; a CPU unit 302 that executes the job management program 304 and other programs held in the storage unit 301 and performs overall control of the computer as well as various determination, calculation, and control processing; and a network interface unit 303 that connects to other devices via the network 14 and exchanges data with them.

On the job management node 11, the job management program 304 is loaded into the storage unit 301 and runs continuously using the CPU unit 302 and the storage unit 301. The job management program 304 obtains job submission information from the login node 10 through the network interface unit 303. Using its job scheduling function 306, the job management program 304 determines the calculation node candidates that will execute the parallel program. The job control function 305 of the job management program 304 uses the network interface unit 303 to create the processes of the parallel program to be executed on the calculation nodes 12 and to obtain the status of the programs running on the calculation nodes 12. The job scheduling function 306 refers to the latest execution information for all calculation nodes 12, obtained from the job control function 305, when determining the calculation node candidates. The scheduling function 306 registers the resulting job scheduling information in the job queue 307. When the job control function 305 detects that the calculation node candidates for a job are available, the scheduling function 306 creates the processes of the parallel program on those calculation nodes 12 and starts their execution.
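The interplay between the scheduling function, the job queue, and the job control function can be sketched as follows. This is only an illustrative model: the class and method names are hypothetical, and the first-in-first-out dispatch order and the "free nodes with the lowest numbers first" candidate choice are assumptions, since the text does not specify them.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Job:
    name: str
    nodes_needed: int
    candidates: list = field(default_factory=list)

class JobManager:
    """Toy model of the job management program (names are hypothetical)."""

    def __init__(self, node_ids):
        self.busy = {n: False for n in node_ids}  # execution state per calculation node
        self.queue = deque()                      # stands in for the job queue

    def schedule(self, job):
        """Scheduling function: pick candidate nodes and register the job in the queue."""
        # Assumption: prefer nodes that are currently free, lowest id first.
        ordered = sorted(self.busy, key=lambda n: (self.busy[n], n))
        job.candidates = ordered[:job.nodes_needed]
        self.queue.append(job)

    def dispatch(self):
        """Job control: start queued jobs whose candidate nodes are all available."""
        started = []
        while self.queue and not any(self.busy[n] for n in self.queue[0].candidates):
            job = self.queue.popleft()
            for n in job.candidates:
                self.busy[n] = True   # processes are created on every candidate at once
            started.append(job.name)
        return started

    def finish(self, job):
        """All processes of the job ended: its nodes become available again."""
        for n in job.candidates:
            self.busy[n] = False

mgr = JobManager(range(4))
job_a, job_b = Job("a", 3), Job("b", 2)
mgr.schedule(job_a)
mgr.schedule(job_b)
first = mgr.dispatch()    # job "a" starts; job "b" waits for nodes 0 and 1
mgr.finish(job_a)
second = mgr.dispatch()   # job "b" starts once its candidates are free
```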

FIG. 4 shows a one-dimensional torus structure. A torus structure has the property that travelling in one direction from a reference point eventually returns to that reference point. In the structure of FIG. 4, taking object 401 as the reference point and moving clockwise from object to object leads back to object 401. Moving counterclockwise likewise leads back to object 401, so the structure of FIG. 4 is a torus structure. When the torus structure of FIG. 4 is used as the network structure of a computer system, each object 401 contains a network switch, and the network switches of adjacent objects are connected to each other by network cables. In a network with this structure, if one network switch fails or a network cable is disconnected, an object 401 can still communicate by using the network switches and cables in the opposite direction, so the structure is redundant.
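The wraparound and the redundancy of the one-dimensional torus can be illustrated with a short Python sketch. The object numbering (0 to size-1, increasing clockwise) and the convention that link `i` joins object `i` to object `(i + 1) % size` are assumptions introduced for the example.

```python
def ring_paths(src, dst, size):
    """Hop counts from src to dst going clockwise and counterclockwise."""
    clockwise = (dst - src) % size
    counterclockwise = (src - dst) % size
    return clockwise, counterclockwise

def hops_with_failure(src, dst, size, failed_link=None):
    """Fewest hops from src to dst on a ring, avoiding one failed link.

    Assumption: link i connects object i and object (i + 1) % size.
    Returns None only if no direction survives.
    """
    cw, ccw = ring_paths(src, dst, size)
    cw_links = {(src + k) % size for k in range(cw)}    # links used going clockwise
    ccw_links = {(dst + k) % size for k in range(ccw)}  # links used going the other way
    options = []
    if failed_link not in cw_links:
        options.append(cw)
    if failed_link not in ccw_links:
        options.append(ccw)
    return min(options) if options else None

normal = hops_with_failure(0, 3, 8)                   # clockwise path, 3 hops
rerouted = hops_with_failure(0, 3, 8, failed_link=1)  # cable between objects 1 and 2 is cut
```

With the cable between objects 1 and 2 cut, the clockwise route is unusable and the traffic takes the five-hop counterclockwise route instead.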

FIG. 5 shows the internal structure of an object (network component) when a computer system is built with a torus network topology. An object 401 consists of a network switch 501 and a plurality of nodes 502. Network cables are connected to the network switch 501 in all six directions along the x, y, and z axes, centered on the switch, and the switch communicates with the network switches 501 of the adjacent objects (network components).

The network switch 501 is connected using network cables 503 for the x-axis direction, network cables 504 for the y-axis direction, network cables 505 for the z-axis direction, and network cables 506 to the nodes 502 contained in the object 401.

The network switch 501 switches the packets it receives over these links and forwards them either toward a node inside the object or toward an external object (network component). When nodes 502 connected to the same network switch 501 communicate, the data transfer between them is completed by switching at that one network switch 501 alone. When a node 502 communicates with a node 502 connected to a different network switch 501, the source node 502 sends the data to the network switch 501 it belongs to, which forwards the data to the network switch 501 of another object 401. Forwarding to the adjacent network switch 501 is repeated until the data reaches the object 401 containing the destination node 502; finally, the network switch 501 of that object sends the data to the destination node 502, completing the data transfer between the two nodes.
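The switch-to-switch forwarding described above can be sketched as a path computation. The text does not say how the next switch is chosen, so this sketch assumes dimension-order routing (resolve x, then y, then z, each time going the shorter way around the ring); the function names are hypothetical.

```python
def step_toward(a, b, size):
    """Direction (-1, 0 or +1) of the shorter way from a to b around a ring."""
    if a == b:
        return 0
    forward = (b - a) % size
    return 1 if forward <= size - forward else -1

def route(src, dst, dims):
    """List of switch coordinates visited from the source switch to the destination switch.

    Assumption: dimension-order routing (x first, then y, then z).
    """
    path = [tuple(src)]
    cur = list(src)
    for axis, size in enumerate(dims):
        while cur[axis] != dst[axis]:
            cur[axis] = (cur[axis] + step_toward(cur[axis], dst[axis], size)) % size
            path.append(tuple(cur))
    return path

same_switch = route((1, 1, 1), (1, 1, 1), (4, 4, 4))
two_hops = route((0, 0, 0), (3, 1, 0), (4, 4, 4))
```

`len(route(...))` is the number of network switches that handle the transfer: 1 when both nodes hang off the same switch, and one more for every hop between adjacent switches.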

FIG. 6 shows a two-dimensional torus structure. A two-dimensional torus structure has the property that travelling in one direction from a reference point returns to that reference point, and this holds along two axes at once. In the structure of FIG. 6, taking object 401 as the reference point, moving either right or left leads back to object 401. Likewise, moving either up or down from object 401 leads back to object 401. Because it has both of these properties, the structure of FIG. 6 meets the requirements of a two-dimensional torus structure. Each object 401 consists of a network switch 501 and nodes 502, and connecting network cables to the network switch 501 in the x-axis and y-axis directions forms a network with a two-dimensional torus structure.

FIG. 7 shows a three-dimensional torus structure. A three-dimensional torus structure has the property that travelling in one direction from a reference point returns to that reference point, and this holds along three axes at once. In the structure of FIG. 7, taking object 401 as the reference point, moving along the x axis, along the y axis, or along the z axis leads back to object 401 in each case. Because it has all of these properties, the structure of FIG. 7 meets the requirements of a three-dimensional torus structure. Each object 401 consists of a network switch 501 and nodes 502, and connecting network cables to the network switch 501 in the x-axis, y-axis, and z-axis directions forms a network with a three-dimensional torus structure.
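As a small illustration of the three-dimensional torus property, the sketch below lists the six neighboring objects of a switch and checks that walking along any one axis returns to the starting object. Integer coordinates starting at 0 and the function names are assumptions made for the example.

```python
def neighbors(coord, dims):
    """The six adjacent objects (+x, -x, +y, -y, +z, -z) with wraparound."""
    result = []
    for axis in range(3):
        for delta in (1, -1):
            n = list(coord)
            n[axis] = (n[axis] + delta) % dims[axis]
            result.append(tuple(n))
    return result

def steps_to_return(start, axis, dims):
    """Number of moves along one axis before arriving back at the start."""
    cur = list(start)
    steps = 0
    while True:
        cur[axis] = (cur[axis] + 1) % dims[axis]
        steps += 1
        if tuple(cur) == tuple(start):
            return steps

dims = (4, 4, 4)  # example torus size
around_origin = neighbors((0, 0, 0), dims)
laps = [steps_to_return((2, 1, 3), axis, dims) for axis in range(3)]
```

The corner object (0, 0, 0) has (3, 0, 0), (0, 3, 0), and (0, 0, 3) as direct neighbors, which is exactly what the wraparound cables provide.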

FIG. 8 shows a configuration example of a large-scale PC cluster system with a three-dimensional torus structure. The large-scale PC cluster system 1 connects objects (network components) using a network topology with a three-dimensional torus structure. Objects 401 fall into two types: objects 803, each consisting of one network switch 501 and a plurality of calculation nodes 12; and objects 804, each containing, in addition to one network switch 501 and a plurality of calculation nodes 12, one or more I/O nodes 13, a login node 10, and a job management node 11. To execute a parallel program as a job on the large-scale PC cluster system 1, the user connects to the login node 10 contained in one of the objects 804 using terminal emulator software or the like and submits the job by a given operation.

The calculation node candidates that will execute the parallel program are determined by the scheduling function 306 of the job management program 304, which runs at all times on the job management node 11 in one of the objects 804. Scheduling is performed so that the calculation node candidates executing the parallel program are gathered into a region of the three-dimensional space where they are packed as densely as possible.

As a result of this scheduling, each calculation node 12 contained in the plurality of objects 401 within the job execution space 802, which is a portion of the three-dimensional space of the torus, executes the parallel program.

When a program running on a calculation node 12 performs file input/output, and an I/O node 13 connected to the same network switch 501 holds the target file, the file input/output requires no communication between network switches. Otherwise, the file input/output targets an I/O node in another object 804 and is carried out through communication that passes between network switches several times.

FIG. 9 shows an example of the data flow when a collective communication function such as MPI_BCAST is executed in an MPI program. The MPI_BCAST function delivers data owned by a single process to all other processes. MPI_BCAST is a synchronous function: when all processes execute it, they synchronize internally and then perform the data-transfer communication in parallel, so that the data owned by one process is delivered to all remaining processes.

When an MPI program runs with N processes and all processes call MPI_BCAST, the function does not use a scheme in which the one process that owns the data performs N-1 separate transfers to the remaining N-1 processes. Instead, with the scheme shown in FIG. 9, each process of the MPI program performs at most log₂ N communications (rounded up to the next integer), and the data reaches every process.

FIG. 9 shows the data flow during MPI_BCAST when the MPI program runs as eight parallel processes. Here N = 8, so data transfer to all processes completes in at most log₂ N = 3 transfers per process. Because the degree of parallelism is 8, the eight processes are assigned MPI ranks 0 through 7.

The data is initially held by the process with MPI rank 0, and MPI_BCAST delivers it to the seven processes with MPI ranks 1 through 7. In the first transfer, the MPI rank 0 process sends the data to the MPI rank 4 process. In the second transfer, the MPI rank 0 process sends the data to the MPI rank 2 process. The MPI rank 4 process has already received the data, so in parallel it immediately forwards the data received from MPI rank 0 to the MPI rank 6 process. Once the second transfer is complete, four MPI ranks own the data: 0, 2, 4, and 6. In the third transfer, the MPI rank 0 process sends to the MPI rank 1 process. Likewise, every process that already has the data sends to a process that does not: MPI rank 2 sends to MPI rank 3, MPI rank 4 sends to MPI rank 5, and MPI rank 6 sends to MPI rank 7. When all of these transfers are complete, every process holds the data. While the MPI rank 0 process performs its three transfers, the other ranks repeatedly forward the data to processes that have not yet received it, so the transfer to all processes completes in a short time. Because this scheme completes in log₂ N transfer steps, executing MPI_BCAST in an MPI program with 65536 processes, for example, gives all 65536 processes the same data in the time needed for 16 transfers (log₂ 65536 = 16).
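The transfer pattern described for FIG. 9 can be sketched as a small simulation. The function name `bcast_schedule` is hypothetical, and the sketch assumes that N is a power of two and that rank 0 is the root, as in the eight-process example; it illustrates the pattern in the text and is not a statement of how any particular MPI library implements its broadcast.

```python
def bcast_schedule(n):
    """Return, per transfer step, the (sender, receiver) rank pairs.

    Assumption: n is a power of two and rank 0 initially owns the data.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("this sketch assumes n is a power of two")
    rounds = []
    owners = [0]          # ranks that currently hold the data
    stride = n // 2       # rank offset between sender and receiver
    while stride >= 1:
        # Every current owner forwards to the rank `stride` above it.
        pairs = [(src, src + stride) for src in owners]
        rounds.append(pairs)
        owners = sorted(owners + [dst for _, dst in pairs])
        stride //= 2
    return rounds


if __name__ == "__main__":
    for step, pairs in enumerate(bcast_schedule(8), start=1):
        print(step, pairs)
```

The number of steps returned equals log₂ N for the power-of-two case, which matches the count given in the text for both 8 and 65536 processes.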

FIG. 10 illustrates the distance between a calculation node 12 and an I/O node 13 in the PC cluster system 1 with a three-dimensional torus structure. The PC cluster system 1 is composed of objects that contain only calculation nodes 12 (hereinafter, calculation objects) and objects that contain an I/O node 13 (hereinafter, I/O objects). The job execution space 802 is made up of some of these calculation objects and I/O objects.

The job execution space 802 includes calculation object 1001 and calculation object 1002. The distance between objects is defined here not as the distance between the two objects in physical space, but as the difference between the three-dimensional coordinates of the network switches 501 to which the two objects 401 are connected. With this definition, the distance between I/O object 1003 and calculation object 1001 is compared with the distance between I/O object 1003 and calculation object 1002. Consider the case in which calculation object 1001 and calculation object 1002 have the same x and y coordinates and differ only in z. Because the PC cluster system 1 as a whole is a three-dimensional torus, the bottom xy plane of the PC cluster system 1 in FIG. 10, where I/O object 1003 resides, is adjacent in the z direction to the top xy plane. Therefore, the calculation object in the job execution space 802 that is closer to I/O object 1003 is calculation object 1001, its neighbor in the z direction. The job execution space 802 also includes I/O object 1004, which itself contains calculation nodes. Among calculation object 1001, calculation object 1002, and I/O object 1004 in the job execution space 802, the object whose calculation nodes are closest to an I/O node is therefore I/O object 1004, and that distance is taken to be 0.

FIG. 11 illustrates the size of the PC cluster system 1 with a three-dimensional torus structure. Because travelling in one direction from a reference point in a torus returns to the same reference point, it is possible to keep travelling in the same direction indefinitely. Let Mx be the number of objects 401 encountered when travelling in the x direction until the reference point is reached again. Likewise, let My and Mz be the number of objects 401 encountered when travelling in the y and z directions until the reference point is reached. An origin of the three-dimensional space is set as the common reference for the x, y, and z directions, and the object 401 at coordinate (0,0,0) is defined. Because there are Mx objects in the x direction, My in the y direction, and Mz in the z direction, every object 401 in the PC cluster system 1 can be identified by three-dimensional coordinates. Every object 401 lies at some coordinate from (0,0,0) to (Mx-1, My-1, Mz-1), and the total number of objects is Mx × My × Mz.

FIG. 12 shows the relationship between the three-dimensional coordinate position of a calculation object and the internal structure of the object. A calculation object consists of a network switch 501 and a plurality of calculation nodes 12. The coordinate position of a calculation object is the same as that of its network switch 501, and it differs from the position of an adjacent network switch 501 by plus or minus 1 along the x, y, or z direction. Because the structure is a three-dimensional torus, the x coordinate can take values from 0 to Mx-1. Likewise, the y coordinate ranges from 0 to My-1 and the z coordinate from 0 to Mz-1.

When the x coordinate of a network switch 501 is 0 or Mx-1, the coordinate of the neighboring network switch 501 never takes the value -1 or Mx. Instead, subtracting 1 from an x coordinate of 0 gives Mx-1, and adding 1 to an x coordinate of Mx-1 gives 0. The same applies when the y coordinate is 0 or My-1 and when the z coordinate is 0 or Mz-1.
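The wraparound rule for neighboring switch coordinates amounts to modular arithmetic on each axis. The following is a minimal sketch; the function name `neighbors` and the tuple representation of coordinates and sizes are assumptions made for illustration.

```python
def neighbors(coord, dims):
    """Return the six neighbouring switch coordinates on a 3D torus.

    coord is (x, y, z); dims is (Mx, My, Mz). Both are illustrative
    representations, not names taken from the source.
    """
    result = []
    for axis in range(3):
        for step in (-1, 1):
            c = list(coord)
            # Modulo maps -1 to M-1 and M to 0, so no coordinate
            # ever falls outside the range 0 .. M-1.
            c[axis] = (c[axis] + step) % dims[axis]
            result.append(tuple(c))
    return result
```

For example, with `dims = (10, 10, 10)` the switch at `(0, 0, 9)` has `(9, 0, 9)` as its minus-x neighbor and `(0, 0, 0)` as its plus-z neighbor.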

Node numbers are also assigned to the calculation nodes 12 inside a calculation object. In FIG. 12 the calculation object has six calculation nodes 12, numbered 01 through 06. The node number is set so that it is unique within the object. A node in the PC cluster system 1 can therefore be identified by the combination of the three-dimensional coordinates of its object 401 and its node number.

FIG. 13 shows the relationship between the three-dimensional coordinate position of an I/O object and the internal structure of the object. An I/O object is almost the same as a calculation object, except that it contains an I/O node 13 in addition to a plurality of calculation nodes 12. The coordinate position of an I/O object follows the same rule as for a calculation object, and so do the node numbers. In the example of FIG. 13, the I/O object has six calculation nodes 12 and one I/O node 13. So that the node numbers are unique within the I/O object, the calculation nodes 12 are numbered 01 through 06 and the I/O node 13 takes the next number, 07.

FIG. 14 is a table showing how the names of the network switch 501 and the nodes 502 that make up an object correspond to the three-dimensional coordinates and node numbers. Each object 401 in the PC cluster system 1 has three-dimensional coordinates, and each calculation node 12 and I/O node 13 has a node number that is unique within its object.

Each object has one network switch 501. Including the object's three-dimensional coordinates in the network switch name gives every switch a unique name within the three-dimensional torus space. In this example a network switch name has the form "S + coordinate position": the leading character S indicates a switch, and the remaining characters give the coordinate position of the network switch 501. The coordinate information uses one character for each of the x, y, and z values, three characters in total. A coordinate value below 10 is written as a single decimal digit from 0 to 9. A value below 16 is written as a single hexadecimal digit from 0 to F. A value of 16 or more is handled by treating it as a digit in a higher base and using the letters after F (16 = G, 17 = H, 18 = I, and so on).

A node name uses the node number in addition to the coordinate position of the object 401. A node name has the form "N + coordinate position + _ + node number": the leading character N indicates a node, and the remaining characters give the object coordinates and the node number. Because the coordinate position in the node name is the same as the coordinates of the network switch 501 to which the node 502 is connected, the switch a node is connected to is evident from the node name. If the coordinate portions of two node names are identical, the nodes are connected to the same network switch 501 and the distance between them is 0. If the coordinate portions differ, the nodes are connected to different network switches 501.
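The naming rules for switches and nodes can be sketched as follows. All function names are hypothetical. Two details are assumptions rather than statements from the text: the digit alphabet is taken to run through Z (the text only says the letters after F continue "and so on"), and node numbers are taken to be zero-padded to two digits, as in the examples 01 through 07.

```python
# Assumed digit alphabet: 0-9, then A-F, then G onward up to Z.
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def encode_coord(value):
    """Encode one coordinate value as a single character."""
    if not 0 <= value < len(DIGITS):
        raise ValueError("coordinate outside the assumed alphabet")
    return DIGITS[value]


def switch_name(x, y, z):
    """Build a name of the form 'S' + three coordinate characters."""
    return "S" + "".join(encode_coord(v) for v in (x, y, z))


def node_name(x, y, z, node_number):
    """Build a name of the form 'N' + coordinates + '_' + node number."""
    coords = "".join(encode_coord(v) for v in (x, y, z))
    return f"N{coords}_{node_number:02d}"


def same_switch(name_a, name_b):
    """Two nodes share a switch when their coordinate parts are equal."""
    return name_a[1:4] == name_b[1:4]
```

With these helpers, `node_name(9, 0, 1, 6)` yields `N901_06`, and `same_switch("N000_07", "N000_01")` is true because both names carry the coordinate part `000`.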

FIG. 15 illustrates the distance between objects. As described above, travelling in one direction from a reference point in a torus returns to the original reference point. The one-dimensional torus illustrated in FIG. 15 consists of Mx objects 401, with object 1501 as the reference point. For the distance from this reference point to object 1502, which lies n objects away in the clockwise direction, there are two candidates: the distance travelled clockwise and the distance travelled counterclockwise.

The distance between object 1501 and object 1502 is n when travelling clockwise and Mx-n when travelling counterclockwise. When n is smaller than Mx/2 the clockwise distance is shorter, and when n is larger than Mx/2 the counterclockwise distance is shorter. The distance thus takes two values depending on the path, and the smaller one, which is at most Mx/2, is defined as the distance between the objects. An object 401 consists of a network switch 501 and a plurality of nodes 502, so the distance between two objects equals the distance between their two network switches and the distance between nodes in the two objects. Distance here means the number of network switches 501 traversed in the PC cluster system 1 with a three-dimensional torus structure.
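The one-dimensional rule, taking the smaller of the clockwise and counterclockwise distances, can be written as a short function. The name `ring_distance` and its parameters are illustrative assumptions.

```python
def ring_distance(a, b, size):
    """Shortest distance between positions a and b on a ring of `size` objects."""
    n = (b - a) % size          # steps when travelling in one direction
    return min(n, size - n)     # the other direction takes size - n steps
```

The result never exceeds `size / 2`, which matches the bound stated in the text.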

FIG. 16 shows how the distance between nodes is calculated in the PC cluster system 1 with a three-dimensional torus structure. The objects 401 of the PC cluster system 1 form a torus in each of the x, y, and z directions. The example system in FIG. 16 is assumed to consist of 10 objects in the x direction (Mx = 10), 10 objects in the y direction (My = 10), and 10 objects in the z direction (Mz = 10).

If node 1601 is in the object 401 at coordinate (0,0,0) and its node number is 07, its node name is N000_07 under the definition illustrated in FIG. 14. If node 1602 is in the object 401 at coordinate (1,1,1) and its node number is 01, its node name is likewise N111_01. If node 1603 is in the object 401 at coordinate (9,0,1) and its node number is 06, its node name is N901_06.

Consider the distance between node 1601 and node 1602. As its node name indicates, node 1601 is at coordinate (0,0,0), and node 1602 is at coordinate (1,1,1). In the x direction, the distance travelling to the right is 1-0 = 1. Travelling to the left, since Mx = 10, the distance is 10-1 = 9. The smaller value is the distance, so the x-direction distance between node 1601 and node 1602 is 1. Likewise the y-direction distance is 1 and the z-direction distance is 1. The distance between nodes is the sum of the x, y, and z distances, so the distance between node 1601 and node 1602 is 3.

The distance between node 1601 and node 1603 is found in the same way. Node 1601 is at coordinate (0,0,0) and node 1603 is at coordinate (9,0,1). In the x direction, the distance travelling to the right is 9-0 = 9. Travelling to the left, since Mx = 10, the distance is 10-9 = 1. Taking the smaller value, the x-direction distance between node 1601 and node 1603 is 1. Likewise the y-direction distance is 0 and the z-direction distance is 1. The distance between node 1601 and node 1603 is therefore 2. Finally, consider the distance between node 1602 and node 1603. Node 1602 is at coordinate (1,1,1) and node 1603 is at coordinate (9,0,1). In the x direction, the distance travelling to the right is 9-1 = 8. Travelling to the left, since Mx = 10, the distance is 10-9+1 = 2. Taking the smaller value, the x-direction distance between node 1602 and node 1603 is 2. Likewise the y-direction distance is 1 and the z-direction distance is 0. The distance between node 1602 and node 1603 is therefore 3.
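The three worked distances can be reproduced with a short sketch that sums the per-axis wraparound distances. The function name `torus_distance` and the constant names are assumptions for illustration; the coordinates and sizes are those of the FIG. 16 example.

```python
def torus_distance(a, b, dims):
    """Sum of per-axis shortest distances between two torus coordinates.

    Assumes every coordinate lies in the range 0 .. M-1 for its axis.
    """
    total = 0
    for p, q, m in zip(a, b, dims):
        d = abs(p - q)
        total += min(d, m - d)   # go whichever way round is shorter
    return total


DIMS = (10, 10, 10)              # Mx = My = Mz = 10
NODE_1601 = (0, 0, 0)            # N000_07
NODE_1602 = (1, 1, 1)            # N111_01
NODE_1603 = (9, 0, 1)            # N901_06

if __name__ == "__main__":
    print(torus_distance(NODE_1601, NODE_1602, DIMS))
    print(torus_distance(NODE_1601, NODE_1603, DIMS))
    print(torus_distance(NODE_1602, NODE_1603, DIMS))
```

The node number plays no part in the calculation, because nodes on the same switch are at distance 0 from each other.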

The actual procedure of the process allocation method in this embodiment is described below with reference to the drawings. The operations corresponding to the process allocation method described below are realized mainly by the job management program 304 executed by the job management node 11. This program consists of code for performing the operations described below.

FIG. 17 shows an example of the processing flow of the scheduling function 306 and the job control function 305 of the job management program 304 on the job management node 11, together with the user's operating procedure, when a parallel program is executed.

The user logs in to the login node 10 from a network outside the large-scale PC cluster system 1 using terminal emulator software or the like (S1701), and works interactively on the login node 10. The user creates the input data for the parallel program on the file system from the login node 10 (S1702), and then performs the job submission operation to instruct execution of the parallel program (S1703). The execution instruction issued on the login node 10 (S1703) is notified immediately, via the network 14, to the job management program 304 that runs at all times on the job management node 11.

Meanwhile, the scheduling function 306 of the job management program 304 running on the job management node 11 obtains, through a file, a database, or the like, the queue information that defines the configuration of calculation nodes on which jobs run (S1707), and compares the program execution requirements specified by the user with the queue configuration. Using that result together with the usage status of the calculation nodes 12 provided by the job control function 305, it performs job scheduling (S1708) and manages the information needed to execute the parallel program in units called jobs. The job management program 304 assigns each job a unique identification number called a job ID.

The scheduling function 306 of the job management node 11 registers the job generated by the job scheduling (S1708) in a job queue that is basically FIFO (First In, First Out) (S1709). A registered job can be retrieved from the job queue 307 using its job ID as the key.

The scheduling function 306 also returns the job ID of the job, as a response message, to the user (that is, to the user's terminal) who submitted the job through interactive operation on the login node 10.

When the job control function 305 detects, among all calculation node candidates that meet the execution requirements, nodes on which no other job's program is running and on which the program can be executed, it initializes the execution environment (S1710) on the calculation node candidates that will run the program. It then changes the status of the job from "waiting for execution" to "execution started" in the information registered in the job queue 307 by the job registration (S1709).

The job control function 305 then creates the processes of the job's executable program on the calculation node candidates and starts program execution as job execution (S1711). At this point it changes the status of the job in the job queue 307 from "execution started" to "running". The user on the login node 10 can check the execution status of a job by its job ID; in that case, for example, the job control function 305 reads the status of the job with that job ID from the job queue 307 and displays it. When the job control function 305 detects that the program running on the calculation nodes 12 has ended, it performs job completion processing (S1712). In the job completion processing (S1712), it sends the program execution results by e-mail or the like to the account of the user on the login node 10, and changes the status of the job in the job queue 307 from "running" to "finished". The user on the login node 10 (at a given terminal) receives the e-mail sent by the job management program 304 and checks the results (S1705) by viewing its contents and the result files output by the program on the login node 10.
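The job queue behavior described above (FIFO registration, lookup by job ID, and the status sequence from waiting through finished) can be sketched as follows. The class and method names are hypothetical, the status strings are English renderings of the statuses in the text, and the use of sequential integers as job IDs is an assumption; the text states only that each job ID is unique.

```python
from enum import Enum
from itertools import count


class Status(Enum):
    WAITING = "waiting for execution"
    STARTED = "execution started"
    RUNNING = "running"
    FINISHED = "finished"


# Allowed transitions, in the order S1709 -> S1710 -> S1711 -> S1712.
_NEXT = {
    Status.WAITING: Status.STARTED,
    Status.STARTED: Status.RUNNING,
    Status.RUNNING: Status.FINISHED,
}


class JobQueue:
    def __init__(self):
        self._jobs = {}          # dict keeps insertion order, giving FIFO
        self._ids = count(1)     # assumed: sequential integer job IDs

    def register(self, spec):
        """Register a job and return its job ID."""
        job_id = next(self._ids)
        self._jobs[job_id] = {"spec": spec, "status": Status.WAITING}
        return job_id

    def status(self, job_id):
        """Look up a job's status using the job ID as the key."""
        return self._jobs[job_id]["status"]

    def advance(self, job_id):
        """Move a job to its next status and return the new status."""
        current = self._jobs[job_id]["status"]
        if current not in _NEXT:
            raise ValueError("job has already finished")
        self._jobs[job_id]["status"] = _NEXT[current]
        return self._jobs[job_id]["status"]

    def next_waiting(self):
        """Return the oldest job ID still waiting, or None."""
        for job_id, job in self._jobs.items():
            if job["status"] is Status.WAITING:
                return job_id
        return None
```

A status query from the login node would correspond to calling `status(job_id)` with the job ID returned at submission time.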

FIG. 18 shows the processing flow when an MPI rank determination function is added to the scheduling function 306 in order to minimize communication time in MPI programs. The scheduling function 306 of the job management program 304 executed on the job management node 11 obtains information on the job queue 307 from a file, a database, or the like, as the definition of the calculation node configuration on which jobs run (S1801). Next, the scheduling function 306 uses the calculation node status acquisition function of the job control function 305 to obtain the execution status of the calculation nodes 12 that make up the job queue 307 (S1802).

The scheduling function 306 then performs job scheduling (S1803) based on information such as the execution status of the calculation nodes 12, the constraints of the job queue 307, the priorities of jobs that have already been scheduled, and the requirements specified for executing the parallel program, and generates the information (the job) to be registered in the job queue 307.

In the conventional scheduling function 306, the processing of the scheduling function 306 of the job management program 304 ends with job registration (S1804). In the present scheme, before job registration (S1804), MPI rank determination (S1805) is performed on the calculation nodes chosen by job scheduling (S1803), so as to shorten the communication time of a typical MPI program.

The MPI rank determination processing (S1805) added to the scheduling function 306 consists of four steps. The first step determines the calculation node 12 whose MPI rank is 0 (S1806). This decides where to place the process with MPI rank 0, which in typical MPI programs often plays a special role different from that of the other processes.

The next step of the MPI rank determination processing (S1805) initializes the table that holds the information used to minimize the communication paths when MPI collective communication functions are executed (S1807). The step after that determines the calculation nodes whose MPI rank is other than 0 (S1808). These steps decide the MPI process placement that shortens the communication time of MPI collective communication functions. With the steps added so far, the MPI ranks that shorten the communication time of the MPI program as a whole have been determined. The last step added to the MPI rank determination processing (S1805) outputs the determined MPI ranks externally, for example to a file (S1809). In the job registration processing (S1804), the information to be registered in the job queue 307 is associated with the information on the file to which the MPI ranks were output, and both are then registered in the job queue 307. For a job registered in the job queue 307, rescheduling may change the calculation nodes 12 that were previously decided, because of a change in the state of the calculation nodes 12 (node failure, node addition, and so on) or a change in job execution status (forced termination, job suspension, priority change, and so on). When such rescheduling changes the calculation nodes 12, the steps added by this scheme, S1806 through S1809, are performed again on the newly scheduled calculation node candidates, generating MPI rank information consistent with the job registered in the job queue 307.

FIG. 19 is a configuration diagram of the PC cluster system 1 with a three-dimensional torus structure, used to explain the reasoning behind the process that determines the computation node 12 for MPI rank "0". The process with MPI rank "0" often performs special processing compared with the other processes. For example, the rank "0" process may access input/output files on behalf of all processes, or it may distribute the input data it has read to all processes by using an MPI collective communication function such as MPI_BCAST.

For a computation node 12 that performs file input/output, a short distance to the I/O node 13 is advantageous both for shortening input/output time and for using network resources efficiently. In the three-dimensional torus PC cluster system 1 of FIG. 19, assume that the I/O node 1902 is node number 07 at coordinate position (0, 0, 0). The parallel program, meanwhile, is executed on computation nodes 12 in objects within the job execution space 802. Which computation node 12 in the job execution space 802 should be given MPI rank "0" is decided by calculating the distances between objects and identifying the node with the smallest distance. The node identified in this way can perform file input/output most efficiently, with the least network traffic.

The distance between objects is calculated with formula 1903. Under the node naming rule illustrated in FIG. 14, let the coordinate position contained in the I/O node name be (x1, y1, z1) and the coordinate position contained in the computation node name be (x2, y2, z2). The job management node 11 applies these two coordinates to formula 1903 to calculate the distance between the objects. The distance is obtained for every combination of computation node 12 and I/O node 13, and the computation node 12 with the smallest distance is chosen as the node that executes the MPI rank "0" process.
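Formula 1903 itself is not reproduced in this text. As a minimal sketch, assuming it is the sum of the per-axis coordinate differences (which is how the specification later describes the number of network switches between two nodes), the distance could be computed as follows. The name `switch_distance` and the sample computation node position are hypothetical, and torus wraparound is deliberately not modeled because the text does not state how it is handled.

```python
from typing import Tuple

Coord = Tuple[int, int, int]


def switch_distance(a: Coord, b: Coord) -> int:
    """Sum of absolute per-axis differences between two switch coordinates.

    Assumption: this stands in for formula 1903; wraparound links of the
    torus are not taken into account.
    """
    return sum(abs(p - q) for p, q in zip(a, b))


io_node = (0, 0, 0)        # I/O node position used in the FIG. 19 example
compute_node = (2, 1, 3)   # hypothetical computation node position
print(switch_distance(io_node, compute_node))
```

Two nodes attached to the same network switch share the same coordinates, so their distance under this sketch is 0.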

The job execution space 802 in FIG. 19 consists solely of computation objects. If it contains an I/O object, a computation node 12 in that I/O object has the smallest distance, because the I/O node and the computation node are connected to the same network switch 501, and MPI rank "0" is therefore assigned to a computation node 12 in the I/O object. An I/O object contains multiple computation nodes 12; the one given MPI rank "0" is the computation node 12 with the smallest node number in the object.

FIG. 20 shows the flow of the minimum distance calculation for MPI collective communication function execution in this embodiment; it is the flowchart by which the job management node 11 finds the computation node 12 that executes the MPI rank "0" process. This flow gives the detailed processing of the rank "0" node determination (S1806) in FIG. 18.

The job management node 11 first initializes the minimum inter-node distance and the node numbers and three-dimensional coordinate positions of the computation node 12 and the I/O node 13 (S2002). It then acquires the coordinate information of the I/O nodes 13 that make up the PC cluster system 1 from, for example, a definition file or database in the storage unit 301 (S2003). Likewise, it acquires the coordinate information of the computation nodes 12 from the definition file or database (S2004). With these steps, the job management node 11 has a list holding all I/O node names and a list holding all computation node names.

Next, the job management node 11 finds the computation node 12 for MPI rank "0" with a doubly nested loop over the list of I/O node names and the list of computation node names. The loop S2005 iterates over the list of I/O node names; inside it, the job management node 11 parses each I/O node name to obtain the coordinates and node number of the I/O node 13 (S2006). The inner loop S2007 iterates over the list of computation node names; inside it, the job management node 11 parses each computation node name to obtain the coordinates and node number of the computation node 12 (S2008).

The job management node 11 then calculates the inter-node distance from the I/O node coordinates and computation node coordinates obtained in S2006 and S2008 (S2009). In the decision step S2010 it compares the distance calculated in S2009 with the minimum inter-node distance found so far. If the calculated distance is smaller than the current minimum (S2010: yes), it updates the minimum distance, the computation node coordinates and node number, and the I/O node coordinates and node number with those of the current pair (S2011). The loop ends for the computation nodes 12 (S2012) and for the I/O nodes 13 (S2013) complete the iteration, so that the above processing is applied to every combination of I/O node 13 and computation node 12. The result is the I/O node coordinates and node number and the computation node coordinates and node number with the smallest distance, and the computation node 12 found in this way becomes the computation node with MPI rank "0".
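The flow of FIG. 20 can be sketched as a plain nested loop. This is an illustration only: the function and type names are hypothetical, nodes are given as already parsed (coordinates, node number) pairs because the name format of FIG. 14 is not reproduced here, and the distance is assumed to be the sum of per-axis coordinate differences.

```python
from typing import List, Optional, Tuple

Coord = Tuple[int, int, int]
Node = Tuple[Coord, int]  # (switch coordinates, node number)


def switch_distance(a: Coord, b: Coord) -> int:
    # Assumed form of the distance: sum of per-axis coordinate differences.
    return sum(abs(p - q) for p, q in zip(a, b))


def find_rank0_node(
    io_nodes: List[Node], compute_nodes: List[Node]
) -> Optional[Tuple[Node, Node, int]]:
    """Return (I/O node, computation node, distance) for the closest pair."""
    best = None  # S2002: no minimum recorded yet
    for io_coord, io_no in io_nodes:                 # S2005
        for cn_coord, cn_no in compute_nodes:        # S2007
            d = switch_distance(io_coord, cn_coord)  # S2009
            if best is None or d < best[2]:          # S2010
                best = ((io_coord, io_no), (cn_coord, cn_no), d)  # S2011
    return best


io_nodes = [((0, 0, 0), 7)]
compute_nodes = [((1, 0, 0), 0), ((0, 1, 1), 1), ((1, 0, 0), 2)]
print(find_rank0_node(io_nodes, compute_nodes))
```

Because the comparison is strict, the first pair encountered wins a tie. If the computation node list is ordered by node number, this picks the smallest node number among equally close nodes.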

FIG. 21 shows the table of computation nodes used in the MPI rank determination process. The computation node list table 2101 has one element, that is, one record, per process that executes the parallel program. Each process has three fields. The first is the name of the computation node that executes the process (nodename). The node name contains two pieces of information: the three-dimensional coordinate position of the object containing the computation node, that is, of the network switch 501, and the node number. The group ID column (groupid) is working storage for the MPI rank determination process, and its initial value is "0" for every record. The MPI rank column (mpirank) holds the determined MPI rank, and its initial value is "-1" for every record. A negative value means that the MPI rank is undetermined; a value of 0 or greater means that it has been determined.

The computation node list table 2101 in the upper part of FIG. 21 is the table in its initial state, with every group ID set to "0" and every MPI rank set to "-1". Once the processing described for FIG. 20 has determined the name of the node with MPI rank "0", the job management node 11 sets the MPI rank column of that computation node 12 to "0".

FIG. 22 is a flowchart for initializing the computation node list table 2101 and gives the detailed processing of the rank table initialization (S1807) in FIG. 18. First, the job management node 11 acquires the list of computation node candidates determined by the scheduling function 306 and stores it in "Nodelist" in a predetermined storage area (S2201). It then allocates the computation node list table 2101 in memory and initializes a counter i, held in a predetermined storage area, to zero (S2202). Inside the loop S2203, it sets the node name (nodename) of the table entry to the computation node name obtained from the computation node list at index i, and initializes the entry's group ID (groupid) to zero (S2204). In the comparison step S2205, it compares the computation node name obtained at index i with the name of the computation node that was given MPI rank "0" in the rank "0" node determination (S1806).

If the node names do not match (S2205: no), the job management node 11 sets the MPI rank (mpirank) of the table entry to "-1", meaning that the rank is undetermined (S2206). If the node names match (S2205: yes), it sets the MPI rank (mpirank) of the table entry to "0" (S2207).

The job management node 11 then advances the counter i (S2208) and, through the end test of the loop S2203, repeats the processing for as long as index i is valid for the computation node list (S2209). When all processing is complete, every element of the computation node list table 2101 has its node name set to a computation node name determined by the scheduling function 306 and its group ID (groupid) set to "0". Only the computation node 12 chosen for MPI rank 0 has its MPI rank (mpirank) set to "0"; all other computation nodes have it set to "-1".
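The initialization of FIG. 22 amounts to building one record per scheduled node. The sketch below assumes a simple in-memory representation; the field names follow the column names of FIG. 21 (nodename, groupid, mpirank), while the class name `RankEntry`, the function name `init_rank_table`, and the sample node names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RankEntry:
    nodename: str
    groupid: int = 0    # working field, starts at 0
    mpirank: int = -1   # -1 means "MPI rank not yet determined"


def init_rank_table(nodelist: List[str], rank0_nodename: str) -> List[RankEntry]:
    """Build the table from the scheduled nodes and mark the rank 0 node."""
    table: List[RankEntry] = []
    for name in nodelist:                                # S2203 loop
        entry = RankEntry(nodename=name, groupid=0)      # S2204
        # S2205 to S2207: only the node chosen in S1806 receives rank 0.
        entry.mpirank = 0 if name == rank0_nodename else -1
        table.append(entry)
    return table


table = init_rank_table(["node-a", "node-b", "node-c"], "node-b")
for entry in table:
    print(entry)
```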

FIG. 23 shows the function specification of the processing by which the job management node 11 determines the MPI ranks other than "0". The processing for determining the ranks other than rank 0 (S1808 in FIG. 18) is implemented as a single function, which determines the computation nodes that execute the processes with MPI ranks other than "0" while updating the information stored in the table of FIG. 21. The function is named "RankDecision" and takes five arguments. It calls itself recursively, and the recursion depth is passed in the fifth argument, "depth" (the initial depth is 0).

The first argument of the RankDecision function is the computation node list table shown in FIG. 21. The second argument is the number of elements (the number of processes) in that table. The third argument is the reference rank; on the first call only the node with MPI rank "0" has been determined, so "0" is specified. The fourth argument is the target group ID; on the first call it is "0", the value set by the initialization in FIG. 22. The fifth argument is the recursion depth of the function; on the first call it is "0". Within the specified group ID, the RankDecision function finds the MPI rank of the peer that communicates with the reference rank (base node) when collective communication at depth "depth" is executed. In doing so it splits the group in two. For one half, it calls RankDecision with the group ID that communicates with the current reference rank at the next depth. For the other half, it assigns a new group ID, takes the peer MPI rank just determined as the new reference rank, and calls RankDecision with the new group ID that communicates with the new reference rank at the next depth.

FIGS. 24 and 25 are flowcharts of the processing that determines the MPI ranks other than "0" and give the detailed processing of the determination of nodes other than rank 0 (S1808) in FIG. 18. The job management node 11 uses the RankDecision function 2301 to determine the computation nodes for the MPI ranks other than "0". The RankDecision function 2301 operates according to the function specification shown in FIG. 23. The "basenode" argument carries the MPI rank of a computation node 12 whose rank has already been determined; on the first call it is MPI rank "0", determined by the processing of FIG. 20.

The RankDecision function first obtains the name of the computation node 12 on which the process with the MPI rank specified by "basenode" runs (S2401). It then calculates the distance between that computation node and each computation node with the matching group ID, builds a list of these distances, and saves the name of the node with the largest distance (S2402). The node with the largest distance is the computation node 12 for which data transfer from the "basenode" computation node 12 passes through the most network switches 501.

Next, the RankDecision function calculates and saves the distance between that farthest node and every node in the same group (S2403). Each node with the matching group ID thus has two distances: its distance from "basenode" and its distance from the node farthest from "basenode". The RankDecision function then splits the group in two based on the two distances obtained up to S2403 (S2404); nodes that until now belonged to a single group are divided into two groups. Nodes close to "basenode" form, say, group A, and nodes close to the node farthest from "basenode" form group B. If the number of nodes n is even, groups A and B each contain n/2 nodes. If n is odd, the one remaining node joins whichever group gives the smaller value when its distance from "basenode" is compared with its distance from the farthest node.
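The text fixes the group sizes (n/2 each, with a leftover node going to the nearer side) but not the order in which nodes are assigned. The sketch below is one possible reading, not the procedure of FIG. 24: groups A and B alternately claim the unassigned node nearest to their reference node. The names `split_group` and `switch_distance`, the node labels, the sum-of-differences distance, and the tie-breaking toward group A are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int, int]


def switch_distance(a: Coord, b: Coord) -> int:
    # Assumed distance: sum of per-axis coordinate differences.
    return sum(abs(p - q) for p, q in zip(a, b))


def split_group(base: str, coords: Dict[str, Coord]) -> Tuple[List[str], List[str]]:
    """Split the nodes in `coords` (which includes `base`) into groups A and B."""
    # S2402: find the node farthest from the base node.
    far = max(coords, key=lambda n: switch_distance(coords[base], coords[n]))
    unassigned = sorted(coords)  # sorted for deterministic tie-breaking
    group_a: List[str] = []
    group_b: List[str] = []
    # S2404: alternate picks so both groups receive the same number of nodes.
    while len(unassigned) > 1:
        a_pick = min(unassigned, key=lambda n: switch_distance(coords[base], coords[n]))
        unassigned.remove(a_pick)
        group_a.append(a_pick)
        b_pick = min(unassigned, key=lambda n: switch_distance(coords[far], coords[n]))
        unassigned.remove(b_pick)
        group_b.append(b_pick)
    if unassigned:  # odd node count: the leftover joins the nearer side
        last = unassigned.pop()
        d_a = switch_distance(coords[base], coords[last])
        d_b = switch_distance(coords[far], coords[last])
        (group_a if d_a <= d_b else group_b).append(last)
    return group_a, group_b


coords = {"n0": (0, 0, 0), "n1": (1, 0, 0), "n2": (3, 0, 0), "n3": (4, 0, 0)}
print(split_group("n0", coords))
```

With the four sample nodes on a line, the base node and its neighbor end up in group A, and the farthest node and its neighbor in group B.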

Next, the RankDecision function sets the new group IDs (S2405). The group close to "basenode" takes twice the current group ID as its new group ID. On the first call the group ID is "0", so the doubled group ID is also "0". The group containing the node farthest from "basenode" takes twice the current group ID plus 1 as its new group ID. On the first call the group ID is "0", so this group's new group ID is "1".

When S2405 is performed for group ID "0", the nodes are therefore divided into two groups with the new group IDs "0" and "1".
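The group ID update of S2405 is simple arithmetic: the near group keeps twice the current ID and the far group takes twice the current ID plus 1. A small sketch follows; the function name `child_group_ids` is hypothetical.

```python
from typing import Tuple


def child_group_ids(groupid: int) -> Tuple[int, int]:
    """Return (new ID of the group near basenode, new ID of the far group)."""
    return 2 * groupid, 2 * groupid + 1


# Group IDs produced by the first two levels of splitting.
level1 = child_group_ids(0)
level2 = [child_group_ids(g) for g in level1]
print(level1, level2)
```

Since every group is relabeled at each split, the IDs in use at a given depth stay distinct: the groups 0 and 1 of the first split become 0, 1 and 2, 3 at the next.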

The RankDecision function then searches the group containing the node farthest from "basenode" for the node closest to "basenode" (S2406). The MPI rank of the node found in S2406 is set in the table 2102. The MPI rank set here is a value uniquely determined by the arguments "basenode", "tablecnt", and "depth". After setting the MPI rank, if the decision step S2407 finds that the number of nodes in group B is greater than "0" (S2407: yes), the RankDecision function calls itself for group B (S2408). The recursively called RankDecision function performs S2401 through S2406, splitting the group and determining one MPI rank, and calls RankDecision again as needed. After S2407, if the decision step S2409 finds that the number of nodes in group A is greater than 0 (S2409: yes), the function calls itself for group A (S2410). That recursive call likewise performs S2401 through S2406, splitting the group and determining one MPI rank, and calls RankDecision again as needed. In the recursive call, the arguments passed to RankDecision are the table updated by the MPI rank determination and its size, the determined MPI rank, the group ID of the group close to "basenode", and "depth" plus 1.

The best mode for carrying out the present invention has been described above in specific terms, but the present invention is not limited to it and can be modified in various ways without departing from its gist.

According to this embodiment, processes can be placed automatically and optimally so as to shorten the execution time of a parallel program using an MPI library on a large-scale PC cluster system with a three-dimensional torus structure.

The description in this specification makes at least the following clear. In the process allocation system, the network topology of the three-dimensional torus structure may have a ring structure in each of the x, y, and z axis directions, with communication between two nodes passing through one or more network switches; the position coordinate information of each node in the storage unit may define three-dimensional coordinates as the position of each network switch relative to the origin, with a predetermined network switch taken as the three-dimensional origin; and the calculation unit may obtain the number of network switches on the path between communicating nodes by summing the differences in the x, y, and z coordinates between the three-dimensional coordinates of the nodes.

In the process allocation system, the calculation unit may also calculate, for each computation node, the number of network switches on the path to the node with MPI rank "0", and select as many computation nodes as correspond to the number of processes specified when the MPI program job was submitted, in ascending order of the calculated number of network switches.
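Selecting nodes in ascending order of switch count is a sort followed by a cut. A minimal sketch, assuming the switch count is the sum of per-axis coordinate differences and that candidates are given as (name, coordinates) pairs; the names `select_nodes` and `switch_distance` and the sample data are hypothetical.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]


def switch_distance(a: Coord, b: Coord) -> int:
    # Assumed switch count: sum of per-axis coordinate differences.
    return sum(abs(p - q) for p, q in zip(a, b))


def select_nodes(
    rank0: Coord, candidates: List[Tuple[str, Coord]], nprocs: int
) -> List[str]:
    """Pick the `nprocs` candidates with the fewest switches to the rank 0 node."""
    # sorted() is stable, so equally distant candidates keep their input order.
    ordered = sorted(candidates, key=lambda c: switch_distance(rank0, c[1]))
    return [name for name, _ in ordered[:nprocs]]


candidates = [("a", (2, 0, 0)), ("b", (0, 0, 0)), ("c", (1, 1, 0)), ("d", (0, 1, 0))]
print(select_nodes((0, 0, 0), candidates, 3))
```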

In the process allocation system, when an MPI collective communication function is executed, the calculation unit may also execute: a grouping process that divides the computation nodes selected according to the number of processes into two groups by the number of network switches on the path to the node with MPI rank "0", namely group A, the group with the smaller number of network switches, whose base node is the node with MPI rank "0", and group B, the group with the larger number of network switches; a process that identifies, among the computation nodes in group B, the one with the smallest number of network switches as the base node of group B, which is the first to receive data from the node with MPI rank "0" when the MPI collective communication function is executed; and a process that repeats the grouping process and the base node identification process for each of groups A and B for as long as computation nodes other than the base node remain.

In the process allocation system, when MPI is implemented, the calculation unit may also output to a file, as MPI rank information, at least the information on the computation nodes included in the computation node group selected according to the number of processes and on the base node of each group.

1 Large-scale PC cluster system
10 Login node
11 Job management node (process allocation system)
12 Computation node
13 I/O node
14 Network
201 Storage unit of computation node
202 CPU unit of computation node
203 Network interface unit of computation node
204 Parallel program
301 Storage unit of job management node
302 CPU unit of job management node (calculation unit)
303 Network interface unit of job management node
304 Job management program
401 Object
501 Network switch
802 Node space secured by job management program
803 Object consisting only of computation nodes
804 Object consisting of computation nodes and I/O nodes

Claims

A process allocation system, being an information processing system that performs process allocation for parallel program execution targeting MPI programs in a large-scale PC cluster system in which network switches connected to a plurality of nodes are connected to each other in a three-dimensional torus structure, the process allocation system comprising:
a storage unit that holds the position coordinates of each node in the three-dimensional torus structure; and
a calculation unit that executes a process of: calculating, based on the position coordinates of each node in the three-dimensional torus structure in the storage unit, the number of network switches on the path in the three-dimensional torus structure between each computation node connected to a network switch, which is a target of process allocation for parallel program execution, and each I/O node that is the access target when the computation node performs file input/output; identifying the pair of I/O node and computation node that minimizes the calculated value; and determining the computation node of the identified pair as the computation node with MPI rank "0", which serves as the reference for parallel program execution.

The process allocation system according to claim 1, wherein:
the network topology of the three-dimensional torus structure has a ring structure in each of the x, y, and z axis directions, and communication between two nodes passes through one or more network switches;
the position coordinate information of each node in the storage unit defines three-dimensional coordinates as the position of each network switch relative to the origin, with a predetermined network switch taken as the three-dimensional origin; and
the calculation unit obtains the number of network switches on the path between communicating nodes by summing the differences in the x, y, and z coordinates between the three-dimensional coordinates of the nodes.

The process allocation system according to claim 1, wherein the calculation unit calculates, for each computation node, the number of network switches on the path to the node with MPI rank "0", and selects as many computation nodes as correspond to the number of processes specified when the MPI program job is submitted, in ascending order of the calculated number of network switches.

The process allocation system according to claim 3, wherein, when an MPI collective communication function is executed, the calculation unit executes:
a grouping process that divides the computation nodes selected according to the number of processes into two groups by the number of network switches on the path to the reference node with MPI rank "0", namely group A, the group with the smaller number of network switches, whose base node is the node with MPI rank "0", and group B, the group with the larger number of network switches;
a process that identifies, among the computation nodes in group B, the one with the smallest number of network switches as the base node of group B, which is the first to receive data from the node with MPI rank "0" when the MPI collective communication function is executed; and
a process that recursively repeats the grouping process and the base node identification process for each of groups A and B for as long as computation nodes other than the base node remain.

The process allocation system according to claim 4, wherein, when MPI is implemented, the calculation unit outputs to a file, as MPI rank information, at least the information on the node with MPI rank "0" and on the base node of each group included in the computation node group selected according to the number of processes.

A process allocation method, wherein a computer that holds, in a storage unit, the position coordinates of each node in a three-dimensional torus structure, in order to perform process allocation for parallel program execution targeting MPI programs in a large-scale PC cluster system in which network switches connected to a plurality of nodes are connected to each other in the three-dimensional torus structure,
executes a process of: calculating, based on the position coordinates of each node in the three-dimensional torus structure in the storage unit, the number of network switches on the path in the three-dimensional torus structure between each computation node connected to a network switch, which is a target of process allocation for parallel program execution, and each I/O node that is the access target when the computation node performs file input/output; identifying the pair of I/O node and computation node that minimizes the calculated value; and determining the MPI rank of the computation node of the identified pair.

A process allocation program that causes a computer that holds, in a storage unit, the position coordinates of each node in a three-dimensional torus structure, in order to perform process allocation for parallel program execution targeting MPI programs in a large-scale PC cluster system in which network switches connected to a plurality of nodes are connected to each other in the three-dimensional torus structure,
to execute a process of: calculating, based on the position coordinates of each node in the three-dimensional torus structure in the storage unit, the number of network switches on the path in the three-dimensional torus structure between each computation node connected to a network switch, which is a target of process allocation for parallel program execution, and each I/O node that is the access target when the computation node performs file input/output; identifying the pair of I/O node and computation node that minimizes the calculated value; and determining the MPI rank of the computation node of the identified pair.