JP5577745B2

JP5577745B2 - Cluster system, process allocation method, and program

Info

Publication number: JP5577745B2
Application number: JP2010040632A
Authority: JP
Inventors: 佑基水野
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2010-02-25
Filing date: 2010-02-25
Publication date: 2014-08-27
Anticipated expiration: 2030-02-25
Also published as: JP2011175573A

Description

The present invention relates to a cluster system, a process arrangement method, and a program.

In high-performance computing, which handles workloads with a very large amount of calculation, a computing system with high computational performance is required.
Conventionally, calculation clusters connecting a small number of calculation nodes with high individual performance have been used. In this case, the topology is a single-stage crossbar in which all nodes can communicate with each other directly, which yields uniform interconnect communication performance.

Today, computers have become commoditized, and calculation clusters that connect a large number of calculation nodes with relatively low individual performance are increasingly common. If a single-stage crossbar is adopted in this case, the number of connections grows roughly as the square of the number of nodes, so the number of network components increases explosively as nodes are added.
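As a rough illustration of this growth, the sketch below counts the links of a full single-stage crossbar, assuming one link per unordered pair of nodes. The function name is hypothetical and is not part of the patent.

```python
def crossbar_links(num_nodes: int) -> int:
    """Links needed when every pair of nodes is directly connected."""
    if num_nodes < 0:
        raise ValueError("num_nodes must be non-negative")
    return num_nodes * (num_nodes - 1) // 2


if __name__ == "__main__":
    # The count grows roughly with the square of the node count.
    for n in (8, 64, 1024):
        print(n, crossbar_links(n))
```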

For this reason, calculation clusters that connect a large number of calculation nodes generally use a multistage topology such as a fat tree or a mesh. A multistage topology limits the growth in network components to some extent, but the number of stages traversed varies depending on the pair of nodes, so interconnect communication performance is not uniform.

In a calculation cluster with a large number of connected calculation nodes, the calculation nodes are generally managed centrally by a batch server. The batch server accepts job execution requests from the front end and decides job assignment according to the availability of resources such as the CPU and memory of each calculation node. In this way the batch server balances the load across the calculation nodes and achieves efficient operation of the calculation cluster as a whole.

However, current batch servers consider only the availability of each calculation node's own resources, such as CPU and memory, when deciding job assignment, and rarely consider the interconnect communication performance between calculation nodes. As a result, the interconnect communication performance of the group of calculation nodes assigned to a distributed parallel job may be uneven.

One example of a job is a distributed parallel job. A distributed parallel job divides the calculation among individual processes and advances the calculation in parallel while the processes exchange data as needed. By assigning the individual processes to different calculation nodes, a distributed parallel job can achieve high computational performance through parallelization.

When a distributed parallel job is executed, the communication between processes may show a characteristic pattern depending on the implementation of the calculation cluster. For example, when a lattice model with X, Y, and Z axes is calculated in finer detail in the planar direction, processes arranged in the X-Y plane communicate with each other more than processes arranged along the Z direction.

When inter-process communication has such a characteristic pattern, the distributed parallel job may not reach its full performance. This is because the batch server assigns nodes without considering the interconnect communication performance between calculation nodes or the communication characteristics of the distributed parallel job. The unevenness in interconnect communication performance then fails to match the communication characteristics between the job's processes, and the best interconnect communication performance cannot be obtained.

Patent Document 1 describes a lattice-type computer system in which a plurality of nodes, each having a processor, are connected in a lattice. A lattice model made up of logical nodes, created according to the way the nodes and the inter-node connection devices are connected, is divided into rectangular regions, each containing one or more logical nodes associated with one or more externally issued service requests. A scheduler running on one of the logical nodes in a rectangular region assigns programs for processing tasks to the other logical nodes in that region, based on the degree of parallelism and the degree of seriality of the tasks that make up the job of the corresponding service request.

Patent Document 1: JP 2007-206987 A

Patent Document 1 describes a technique for optimally placing tasks on calculation nodes, but it is applicable only to configurations in which the nodes are connected in a lattice.

Accordingly, an object of the present invention is to optimize communication time and improve job execution performance by optimally arranging processes on the calculation nodes, regardless of how the calculation nodes are connected.

A cluster system according to the present invention comprises a plurality of calculation nodes and a batch server that assigns batch processing requested via a front-end device to the plurality of calculation nodes. The batch server comprises an interconnect communication performance table creation unit that creates a table containing interconnect communication performance information between the calculation nodes, and an information distribution unit that transmits the interconnect communication performance table to each calculation node at the start of operation and, when batch processing is requested, transmits to each calculation node the job included in the batch processing together with the communication characteristics of the job. Each calculation node comprises a process arrangement calculation unit that determines the processes to be arranged on each calculation node by matching the communication characteristics of the job against the interconnect communication performance.

According to the present invention, processes are optimally arranged on the calculation nodes based on the communication characteristics of the job, regardless of how the calculation nodes are connected, which optimizes communication time and improves job execution performance.

FIG. 1 is a diagram showing the configuration of a cluster system according to an embodiment of the present invention.
FIG. 2 is a diagram showing details of the configuration of the front end, the batch server, and the calculation node according to the embodiment of the present invention.
FIG. 3 is a flowchart of the operation of the cluster system according to the embodiment of the present invention.
FIG. 4 is a diagram showing an example of the interconnect communication performance table.
FIG. 5 is a flowchart of the operation of the cluster system according to the embodiment of the present invention.
FIG. 6 is a diagram showing an example of the communication characteristics included in a batch request.
FIG. 7 is a diagram schematically showing the process arrangement of a job.
FIG. 8 is a flowchart of the operation of the cluster system according to the embodiment of the present invention.
FIG. 9 is a diagram showing an example of the allocation node list.
FIG. 10 is a flowchart of the operation of the cluster system according to the embodiment of the present invention.
FIG. 11 is a diagram explaining the process arrangement determination processing.
FIG. 12 is a diagram showing an example of the division of a node group.
FIG. 13 is a flowchart of the operation of the cluster system according to the embodiment of the present invention.
FIG. 14 is a diagram showing an example in which the optimization process has been performed on node groups.
FIG. 15 is a diagram explaining the matching of the process arrangement with the node groups.

Next, the best mode for carrying out the present invention will be described in detail with reference to the drawings.
FIG. 1 is a diagram showing the configuration of a cluster system 10 according to the present embodiment. As shown in the figure, the cluster system 10 includes front ends 100, a batch server 200, and calculation nodes 300. The front ends 100, the batch server 200, and the calculation nodes 300 are connected via a communication network. The interconnect communication performance between the calculation nodes 300 is not uniform, and communication with a non-adjacent calculation node 300 must pass through other calculation nodes 300.

FIG. 2 is a diagram showing details of the configuration of the front end 100, the batch server 200, and the calculation node 300.
As shown in the figure, the front end 100 includes a batch request unit 101. The batch request unit 101 transmits a batch request to the batch server 200. The batch request includes a calculation job and the communication characteristics of the job. The batch request unit 101 is a functional block realized by execution on a computer processor.

The batch server 200 includes a job reception unit 201, a job management unit 202, an interconnect communication performance table creation unit 203, an information distribution unit 204, and a batch request storage unit 205.

The job reception unit 201 receives batch requests transmitted from the front end 100.
The job management unit 202 refers to the batch request storage unit 205 and determines the assignment of calculation nodes 300.
The interconnect communication performance table creation unit 203 acquires the interconnect communication performance of all the calculation nodes 300 and creates the interconnect communication performance table.

The information distribution unit 204 transmits the interconnect communication performance table to each calculation node 300. It also transmits the batch request and the allocation node list to each calculation node 300.
The batch request storage unit 205 stores the batch requests received by the job reception unit 201.

The job reception unit 201, the job management unit 202, the interconnect communication performance table creation unit 203, and the information distribution unit 204 are functional blocks realized by execution on a computer processor. The batch request storage unit 205 is realized by a storage device such as a memory or a hard disk.

The calculation node 300 includes a job execution unit 301. The job execution unit 301 includes an information acquisition unit 302, a process arrangement calculation unit 303, a process activation unit 304, and an interconnect communication performance table storage unit 305.

The information acquisition unit 302 receives the interconnect communication performance table and stores it in the interconnect communication performance table storage unit 305. It also receives the batch request and the allocation node list and provides them to the process arrangement calculation unit 303.

The process arrangement calculation unit 303 calculates the process arrangement with reference to the allocation node list, the job communication characteristics, and the interconnect communication performance table.

The process activation unit 304 generates the processes arranged on its own node, based on the process arrangement calculated by the process arrangement calculation unit 303, and starts executing them.

The job execution unit 301, the information acquisition unit 302, the process arrangement calculation unit 303, and the process activation unit 304 are functional blocks realized by execution on a computer processor. The interconnect communication performance table storage unit 305 is realized by a storage device such as a memory or a hard disk.

Next, the operation of the cluster system 10 will be described.
The operation of the cluster system 10 can be divided, in order of execution, into three phases: the start of system operation, a batch processing request, and the start of job execution.

First, the processing at the start of operation of the cluster system 10 will be described with reference to the flowchart of FIG. 3.
When operation starts, the interconnect communication performance table creation unit 203 of the batch server 200 acquires the interconnect communication performance of all connected calculation nodes 300 and creates the interconnect communication performance table (step S101).

FIG. 4 is a diagram showing an example of the interconnect communication performance table. As shown in the figure, the interconnect communication performance between each pair of calculation nodes 300 is expressed as a numerical value. The smaller the value, the higher the interconnect communication performance.
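One way to picture such a table is as a square matrix indexed by node. The sketch below is only an illustration: the node names and values are hypothetical (the contents of FIG. 4 are not reproduced here), and the symmetry of the matrix is an assumption rather than something the description states.

```python
# Hypothetical node names and values; smaller value = higher performance.
NODES = ["node0", "node1", "node2", "node3"]
COST = [
    [0, 1, 2, 3],
    [1, 0, 1, 2],
    [2, 1, 0, 1],
    [3, 2, 1, 0],
]


def link_cost(a: str, b: str) -> int:
    """Look up the interconnect performance value between two nodes."""
    return COST[NODES.index(a)][NODES.index(b)]


def is_symmetric(matrix) -> bool:
    """Check the assumed property cost(a, b) == cost(b, a)."""
    size = len(matrix)
    return all(matrix[i][j] == matrix[j][i]
               for i in range(size) for j in range(size))


if __name__ == "__main__":
    print(link_cost("node0", "node3"), is_symmetric(COST))
```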

Next, the information distribution unit 204 transmits the interconnect communication performance table to each calculation node 300 (step S102).

Next, each calculation node 300 receives the interconnect communication performance table at the information acquisition unit 302 of the job execution unit 301 and stores it in the interconnect communication performance table storage unit 305 (step S103).

Next, the processing when batch processing is requested will be described with reference to the flowchart of FIG. 5.
First, a batch request is transmitted from the batch request unit 101 of the front end 100 to the batch server 200 (step S201). The batch request includes a job execution request and the communication characteristics of the job.

FIG. 6 is a diagram showing an example of the communication characteristics of a job included in a batch request transmitted from the batch request unit 101. FIG. 7 is a diagram schematically showing the process arrangement of a job.
The communication characteristics of a job are specified by treating the job's process arrangement as a grid, as shown in FIG. 7, and setting the priority of each dimension of the grid and the number of processes arranged along each dimension axis. The process numbers in FIG. 7 identify the individual processes included in the job. Numbers are assigned starting from 0, in order from the lowest dimension.
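The following sketch shows one possible encoding of these communication characteristics and of the process numbering. It assumes that "in order from the lowest dimension" means the first dimension varies fastest; the class and function names, the two-dimensional example, and the priority values are hypothetical and do not reproduce FIG. 6 or FIG. 7.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Dimension:
    priority: int        # priority of this grid dimension (illustrative value)
    num_processes: int   # number of processes arranged along this axis


# Hypothetical 3 x 2 grid of processes.
EXAMPLE_DIMS = [Dimension(priority=1, num_processes=3),
                Dimension(priority=2, num_processes=2)]


def process_number(coords: Sequence[int], dims: Sequence[Dimension]) -> int:
    """Number a grid position from 0, with the first dimension varying fastest."""
    if len(coords) != len(dims):
        raise ValueError("one coordinate per dimension is required")
    number, stride = 0, 1
    for coord, dim in zip(coords, dims):
        if not 0 <= coord < dim.num_processes:
            raise ValueError("coordinate out of range")
        number += coord * stride
        stride *= dim.num_processes
    return number


if __name__ == "__main__":
    for y in range(2):
        print([process_number((x, y), EXAMPLE_DIMS) for x in range(3)])
```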

The batch server 200 receives the batch request at the job reception unit 201 and stores it in the batch request storage unit 205 (step S202).

Next, the job management unit 202 refers to the batch request storage unit 205 and determines the assignment of calculation nodes 300 (step S203).

Next, the processing at the start of job execution will be described with reference to the flowchart of FIG. 8.
When the execution time of a job stored in the batch request storage unit 205 arrives, the batch server 200 uses the information distribution unit 204 to transmit the batch request and the allocation node list to each calculation node 300 (step S301).

FIG. 9 is a diagram showing an example of the allocation node list. As shown in FIG. 9, the allocation node list enumerates the calculation nodes 300 assigned to the job.

When a calculation node 300 receives the batch request and the allocation node list at the information acquisition unit 302, its process arrangement calculation unit 303 calculates the process arrangement with reference to the allocation node list, the job communication characteristics, and the interconnect communication performance table (step S302).

Next, the process activation unit 304 of each calculation node 300 generates the processes arranged on that node, based on the process arrangement calculated in step S302, and starts executing them (step S303).

The process arrangement calculation of step S302 will be described in detail with reference to the flowchart of FIG. 10.
Here, a case in which the nodes shown in FIG. 9 are assigned to a job having the communication characteristics shown in FIG. 6 is described as an example. The interconnect communication performance between the nodes is as shown in FIG. 4.

First, the process arrangement calculation unit 303 refers to the communication characteristics of the job and the allocation node list and determines the arrangement order of the processes (step S401). As shown in FIG. 7, the process numbers are assigned in order from the lowest dimension.

As shown in FIG. 11, the process arrangement calculation unit 303 determines the process arrangement by grouping the process numbers, in units of the number of processes, along the direction of the dimension with the highest priority in the job's communication characteristics.

Next, the process arrangement calculation unit 303 gathers the calculation nodes 300 included in the allocation node list into a single node group (step S402).

Next, the process arrangement calculation unit 303 refers to the job communication characteristics and divides the node group by the number of processes of the dimension with the lowest priority (step S404). Here, as shown in FIG. 12, the node group of the allocation node list is divided by "3", the number of processes of the first dimension, which is the dimension with the lowest priority in the communication characteristics shown in FIG. 6.
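The division step can be sketched as follows. The description does not say how nodes are distributed among the initial groups, so this sketch assumes equal-sized, contiguous groups taken in list order; the function name and node names are hypothetical.

```python
from typing import List, Sequence


def split_node_group(nodes: Sequence[str], num_groups: int) -> List[List[str]]:
    """Split one node group into num_groups equal-sized contiguous groups."""
    if num_groups <= 0 or len(nodes) % num_groups != 0:
        raise ValueError("node count must be a positive multiple of num_groups")
    size = len(nodes) // num_groups
    return [list(nodes[i * size:(i + 1) * size]) for i in range(num_groups)]


if __name__ == "__main__":
    allocated = ["n0", "n1", "n2", "n3", "n4", "n5"]
    # Dividing by a process count of 3 yields three node groups.
    print(split_node_group(allocated, 3))
```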

Next, the process arrangement calculation unit 303 performs an optimization process between the divided node groups (step S405).

The inter-node-group optimization process of step S405 will be described with reference to the flowchart of FIG. 13.

First, the process arrangement calculation unit 303 selects two node groups (step S501). Hereinafter, the selected node groups are referred to as Ga and Gb.

Next, the process arrangement calculation unit 303 selects one node from each of Ga and Gb (step S502).

Next, the process arrangement calculation unit 303 refers to the interconnect communication performance table and calculates, for each of Ga and Gb, the total of the interconnect communication performance values within the node group (step S503). Hereinafter, the calculated totals are referred to as Ta and Tb.

Next, the process arrangement calculation unit 303 creates new node groups Ga' and Gb' in which the nodes selected from Ga and Gb are swapped (step S504).

Next, the process arrangement calculation unit 303 calculates, for the node groups Ga' and Gb', the totals Ta' and Tb' of the interconnect communication performance values within each node group (step S505).

Next, if (Ta' + Tb') < (Ta + Tb) (step S506: Y), the process arrangement calculation unit 303 updates Ga and Gb with Ga' and Gb' (step S507).

The process arrangement calculation unit 303 performs the processing of steps S501 to S507 for all combinations of node groups (step S508), and ends the optimization process when no node group is updated any more (step S509).
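Steps S501 to S509 amount to a pairwise-swap local search, which can be sketched as below. This is a minimal sketch under stated assumptions, not the patented implementation: nodes are represented as integer indices into a cost matrix, every node pair of the two selected groups is tried, a swap that does not lower the combined total is reverted, and passes repeat until a full pass accepts no swap. The function names and the example matrix are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence


def group_total(group: Sequence[int], cost: Sequence[Sequence[int]]) -> int:
    """Sum of pairwise values inside one node group (smaller is better)."""
    return sum(cost[a][b] for a, b in combinations(group, 2))


def optimize_groups(groups: Sequence[Sequence[int]],
                    cost: Sequence[Sequence[int]]) -> List[List[int]]:
    """Swap nodes between groups while the combined intra-group total drops."""
    result = [list(g) for g in groups]
    updated = True
    while updated:                                   # S509: stop when no update
        updated = False
        for i, j in combinations(range(len(result)), 2):     # S501 / S508
            ga, gb = result[i], result[j]
            for x in range(len(ga)):                 # S502: one node from each
                for y in range(len(gb)):
                    before = group_total(ga, cost) + group_total(gb, cost)  # S503
                    ga[x], gb[y] = gb[y], ga[x]      # S504: swap gives Ga', Gb'
                    after = group_total(ga, cost) + group_total(gb, cost)   # S505
                    if after < before:               # S506
                        updated = True               # S507: keep the swap
                    else:
                        ga[x], gb[y] = gb[y], ga[x]  # otherwise revert
    return result


if __name__ == "__main__":
    # Hypothetical values: four nodes on a line, value = hop distance.
    line_cost = [[abs(a - b) for b in range(4)] for a in range(4)]
    print(optimize_groups([[0, 2], [1, 3]], line_cost))
```

Because every accepted swap strictly lowers the overall total and the number of possible groupings is finite, the loop terminates. Like any local search of this kind, it may stop at a grouping that is not globally optimal.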

FIG. 14 is a diagram showing an example in which the optimization process has been performed on the three divided node groups shown in FIG. 12.

When the inter-node-group optimization process of step S405 in FIG. 10 is finished, the process arrangement calculation unit 303 divides each of the divided node groups by the number of processes of the dimension with the next lowest priority.

When only one unprocessed dimension remains, the process arrangement calculation unit 303 ends the division process and merges the divided node groups (step S403).

Further, as shown in FIG. 15, the process arrangement calculation unit 303 matches the process arrangement against the node groups and determines the processes to be assigned to each calculation node 300 (step S406).
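The final matching can be pictured as pairing an ordered list of processes with the ordered, merged node groups. The sketch below assumes one process per node and a simple positional pairing; FIG. 15 is not reproduced here, so the pairing rule, the function name, and the sample data are all hypothetical.

```python
from typing import Dict, List, Sequence


def assign_processes(process_order: Sequence[int],
                     node_groups: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Pair each node, in merged group order, with the next process number."""
    nodes: List[str] = [node for group in node_groups for node in group]
    if len(nodes) != len(process_order):
        raise ValueError("this sketch assumes exactly one process per node")
    return dict(zip(nodes, process_order))


if __name__ == "__main__":
    groups = [["n3", "n2"], ["n1", "n0"]]
    print(assign_processes([0, 1, 2, 3], groups))
```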

As described above, according to the present embodiment, the batch server 200 transmits the interconnect communication performance table for the calculation nodes 300 to each calculation node 300 at the start of operation of the cluster system 10. At the time of a batch request, the batch server 200 transmits the job together with its communication characteristics to the calculation nodes 300, and each calculation node 300 matches the communication characteristics of the job against the interconnect communication performance to determine the processes to be arranged on each calculation node 300.
As a result, regardless of how the calculation nodes are connected, processes can be optimally arranged on the calculation nodes, communication time can be optimized, and job execution performance can be improved.

Some or all of the above embodiment can also be described as in the following supplementary notes, but is not limited to them.
(Supplementary note 1) A cluster system comprising a plurality of calculation nodes and
a batch server that assigns batch processing requested via a front-end device to the plurality of calculation nodes, wherein
the batch server comprises:
an interconnect communication performance table creation unit that creates a table containing interconnect communication performance information between the calculation nodes; and
an information distribution unit that transmits the interconnect communication performance table to each of the calculation nodes at the start of operation and, when the batch processing is requested, transmits to each of the calculation nodes the job included in the batch processing and the communication characteristics of the job, and
each calculation node comprises
a process arrangement calculation unit that determines the processes to be arranged on each calculation node by matching the communication characteristics of the job against the interconnect communication performance.

(Supplementary note 2) The cluster system according to supplementary note 1, wherein
the communication characteristics of the job include information on the priority of each dimension constituting a grid, and on the number of processes arranged along each dimension axis, when the processes included in the job are arranged in the grid based on the communication characteristics between the processes, and
the process arrangement calculation unit
determines the processes to be arranged on each calculation node, based on the priority and the number of processes arranged along each dimension axis, so that the communication characteristics between the processes match the interconnect communication performance between the calculation nodes.

(Supplementary note 3) A process arrangement method comprising:
a step of creating a table containing interconnect communication performance information between calculation nodes;
a step of transmitting the interconnect communication performance table to each of the calculation nodes;
a step of receiving a request for batch processing and transmitting, to each of the calculation nodes, the job included in the batch processing and the communication characteristics of the job; and
a step in which the calculation nodes determine the processes to be arranged on each calculation node by matching the communication characteristics of the job against the interconnect communication performance.

(Supplementary note 4) A program that causes a computer to function as:
an interconnect communication performance table creation unit that creates a table containing interconnect communication performance information between a plurality of calculation nodes; and
an information distribution unit that transmits the interconnect communication performance table to each of the calculation nodes at the start of operation and, when batch processing is requested, transmits to each of the calculation nodes the job included in the batch processing and the communication characteristics of the job.

10 cluster system, 100 front end, 101 batch request unit, 200 batch server, 201 job reception unit, 202 job management unit, 203 interconnect communication performance table creation unit, 204 information distribution unit, 205 batch request storage unit, 300 calculation node, 301 job execution unit, 302 information acquisition unit, 303 process arrangement calculation unit, 304 process activation unit, 305 interconnect communication performance table storage unit

Claims

A cluster system comprising:
a plurality of calculation nodes; and
a batch server that assigns batch processing requested via a front-end device to the plurality of calculation nodes,
wherein the batch server comprises:
an interconnect communication performance table creation unit that creates a table containing interconnect communication performance information between the calculation nodes; and
an information distribution unit that, at the start of operation, transmits the interconnect communication performance table to each of the calculation nodes and, when the batch processing is requested, transmits to each of the calculation nodes a job included in the batch processing and the communication characteristics of the job,
wherein each calculation node comprises a process arrangement calculation unit that determines the processes to be placed on each calculation node by matching the communication characteristics of the job against the interconnect communication performance,
wherein the communication characteristics of the job include, for the case where the processes included in the job are arranged in a grid based on the communication characteristics between the processes, information on the priority of each dimension constituting the grid and on the number of processes arranged on each dimension axis, and
wherein the process arrangement calculation unit determines, based on the priority and the number of processes arranged on each dimension axis, the processes to be placed on each calculation node so that the communication characteristics between the processes match the interconnect communication performance between the calculation nodes.
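
As an illustration of how the priority of each dimension and the number of processes on each dimension axis could drive the placement, the following is a minimal sketch and not the claimed implementation. The function name `place_processes`, the `slots_per_node` parameter, the convention that a smaller priority value means heavier communication, and the assumption that `nodes` is already ordered so that adjacent entries are close in the interconnect (for example, derived from the interconnect communication performance table) are all hypothetical.

```python
from itertools import product


def place_processes(dim_sizes, priorities, nodes, slots_per_node):
    """Map each grid coordinate of a job to a calculation node.

    dim_sizes[i]  : number of processes along dimension i
    priorities[i] : smaller value = heavier communication along dimension i
    nodes         : node names, assumed ordered so that neighbouring
                    entries have good interconnect performance
    """
    total = 1
    for size in dim_sizes:
        total *= size
    if total > len(nodes) * slots_per_node:
        raise ValueError("not enough slots for the job")

    # itertools.product varies its LAST axis fastest, so list the
    # dimensions from lowest to highest priority: neighbours along the
    # highest-priority dimension then receive consecutive slots.
    order = sorted(range(len(dim_sizes)),
                   key=lambda i: priorities[i], reverse=True)

    placement = {}
    ranges = (range(dim_sizes[i]) for i in order)
    for slot, reordered in enumerate(product(*ranges)):
        coord = [0] * len(dim_sizes)
        for axis, value in zip(order, reordered):
            coord[axis] = value
        placement[tuple(coord)] = nodes[slot // slots_per_node]
    return placement


# Hypothetical job: a 2 x 4 grid in which dimension 1 has the highest priority.
placement = place_processes((2, 4), (2, 1), ["n0", "n1", "n2", "n3"], 2)
```

With these inputs, processes that are neighbours along dimension 1, such as (0, 0) and (0, 1), share a node, while the lower-priority dimension 0 is spread across more distant nodes.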

A process arrangement method comprising:
a step of creating a table containing interconnect communication performance information between the calculation nodes;
a step of transmitting the interconnect communication performance table to each of the calculation nodes;
a step of receiving a batch processing request and transmitting, to each of the calculation nodes, a job included in the batch processing and the communication characteristics of the job; and
a step in which the calculation node determines the processes to be placed on each calculation node by matching the communication characteristics of the job against the interconnect communication performance,
wherein the communication characteristics of the job include, for the case where the processes included in the job are arranged in a grid based on the communication characteristics between the processes, information on the priority of each dimension constituting the grid and on the number of processes arranged on each dimension axis, and
wherein, in the step of determining the processes to be placed on each calculation node, the processes to be placed on each calculation node are determined, based on the priority and the number of processes arranged on each dimension axis, so that the communication characteristics between the processes match the interconnect communication performance between the calculation nodes.

A program that causes a computer to function as:
an interconnect communication performance table creation unit that creates a table containing interconnect communication performance information between a plurality of calculation nodes; and
an information distribution unit that, at the start of operation, transmits the interconnect communication performance table to each of the calculation nodes and, when batch processing is requested, transmits to each of the calculation nodes a job included in the batch processing and the communication characteristics of the job,
wherein the communication characteristics of the job include, for the case where the processes included in the job are arranged in a grid based on the communication characteristics between the processes, information on the priority of each dimension constituting the grid and on the number of processes arranged on each dimension axis.

A program that causes a computer to function as a calculation node that executes batch processing assigned to a plurality of calculation nodes by a batch server,
wherein the program causes the computer to function as a process arrangement calculation unit that determines the processes to be placed on each calculation node by matching the communication characteristics of a job included in the batch processing received from the batch server against interconnect communication performance information between the plurality of calculation nodes,
wherein the communication characteristics of the job include, for the case where the processes included in the job are arranged in a grid based on the communication characteristics between the processes, information on the priority of each dimension constituting the grid and on the number of processes arranged on each dimension axis, and
wherein the process arrangement calculation unit determines, based on the priority and the number of processes arranged on each dimension axis, the processes to be placed on each calculation node so that the communication characteristics between the processes match the interconnect communication performance between the calculation nodes.