JP2011197896A

JP2011197896A - Computer system and task management method

Info

Publication number: JP2011197896A
Application number: JP2010062647A
Authority: JP
Inventors: Akihiro Ito; 昭博伊藤
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2010-03-18
Filing date: 2010-03-18
Publication date: 2011-10-06

Abstract

PROBLEM TO BE SOLVED: To provide a method for efficiently moving a task constituting a parallel calculation job between computers. SOLUTION: A computer system includes a plurality of computers connected via a network, and each computer executes a plurality of tasks constituting a job in accordance with scheduling. When a task is moved between the computers, the execution states of the tasks are referred to, and a task whose processing has ended is preferentially selected as the task to be moved.

Description

The present invention relates to a computer system that performs parallel distributed processing using a plurality of computers, and more particularly to a technique for moving a task constituting a parallel calculation job from the computer executing it to another computer.

In general, when a large-scale calculation is performed, the processing performance of a computer system is improved by performing the calculation in parallel on a plurality of computers. In such a parallel computer system, the load may become unevenly distributed among the computers. For this reason, process migration, in which a process running on one computer is moved to another computer, is known as a technique for leveling the load on the computers.

For example, in the technique of Patent Document 1, a copy of a transfer source process is created on a transfer destination computer while the transfer source process continues to execute, and an execution record of the transfer source process is created on the transfer source computer. After this first transfer process ends, the execution record is transferred to the transfer destination computer. The transfer source process continues to execute during this second transfer process as well, and its execution record is created on the transfer source computer. The transfer of the execution record is repeated in this way, and when the estimated transfer time falls to or below a predetermined length, execution of the process is switched from the transfer source process to the transfer destination process. The process can thereby be moved with a short stop time.

Patent Document 2 describes a technique in which a mobile terminal receives a service from a computer server near the base station with which it is communicating, thereby improving the response from the computer server and saving computer resources. In the invention described in Patent Document 2, the destination of the mobile terminal is predicted, the data in the address space of the process is moved in advance to the computer at the predicted destination, and execution of the process is switched over once it becomes clear that the mobile terminal is moving to the predicted destination.

JP 2004-78465 A
JP 2004-88200 A

Because large-scale calculations take a long time, it is important to shorten job execution time, and this requires improving the utilization efficiency of computer resources such as the network and the CPU. With the technique described in Patent Document 1, untransferred differential data is updated repeatedly, so the data exchanged between the computers exceeds the amount of data managed by the process being transferred, and the network load increases. In addition, because data is transferred multiple times, the overhead of the transfer processing increases. For these reasons, moving a process consumes a large amount of computer resources such as the network and the CPU, and job execution time becomes longer as a result.

The technique described in Patent Document 2 optimizes process movement specifically for services provided to mobile terminals, and cannot be applied to large-scale calculations that use a plurality of computers.

An object of the present invention is to provide a method for efficiently moving tasks constituting a parallel calculation job between computers.

A representative example of the present invention is as follows. Namely, a computer system comprises a plurality of computers connected by a network, each computer executing a plurality of tasks constituting a job in accordance with scheduling, wherein, when a task is moved between the computers, the execution states of the tasks are referred to and a task whose processing has ended is preferentially selected as the task to be moved.
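The selection rule stated above can be illustrated with a minimal sketch. It is not the procedure of the embodiment: the state names `FINISHED` and `RUNNING`, the `Task` class, and the fallback of picking the first remaining task are all assumptions made for illustration, since the actual execution states are defined by the state transition diagram of the embodiment.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    job_id: str
    task_no: int
    state: str  # hypothetical state names, e.g. "RUNNING" or "FINISHED"


def select_task_to_move(tasks: List[Task]) -> Optional[Task]:
    """Prefer a task whose processing has ended; otherwise fall back to any task."""
    # Tasks that have finished their processing are the preferred candidates.
    finished = [t for t in tasks if t.state == "FINISHED"]
    if finished:
        return finished[0]
    # Assumed fallback: take the first task if none has finished.
    return tasks[0] if tasks else None
```

Moving a task that has already finished its processing does not interrupt useful computation, which is consistent with the stated aim of limiting the drop in CPU utilization during a move.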

According to the present invention, a decrease in the CPU utilization rate of the system during task movement processing can be suppressed. The overhead of task movement processing can thereby be kept low.

It is a block diagram of the computer system of the embodiment of this invention.
It is a hardware block diagram of the computer of the embodiment of this invention.
It is an internal block diagram of the job management process of the embodiment of this invention.
It is an internal block diagram of the intra-node job management process of the embodiment of this invention.
It is an internal block diagram of the job execution process of the embodiment of this invention.
It is a block diagram of the slave management table in the job management process of the embodiment of this invention.
It is a block diagram of the job management table in the job management process of the embodiment of this invention.
It is a block diagram of the task management table in the job management process of the embodiment of this invention.
It is a block diagram of the moving task management table in the job management process of the embodiment of this invention.
It is a block diagram of an internal table of the intra-node job management process of the embodiment of this invention.
It is a block diagram of an internal table of the intra-node job management process of the embodiment of this invention.
It is a block diagram of an internal table of the intra-node job management process of the embodiment of this invention.
It is a block diagram of an internal table of the job execution process of the embodiment of this invention.
It is a block diagram of an internal table of the job execution process of the embodiment of this invention.
It is a state transition diagram of a task of the embodiment of this invention.
It is a sequence diagram showing the inter-process communication performed when a task of the embodiment of this invention is executed.
It is a flowchart showing the task scheduling method of the embodiment of this invention.
It is a flowchart showing the task swap-out method of the embodiment of this invention.
It is a flowchart showing the method of selecting the task to be moved in the embodiment of this invention.
It is a flowchart showing the task movement method of the embodiment of this invention.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 is a configuration diagram showing an example of a computer system that executes a parallel calculation job using a plurality of nodes (computers) according to the embodiment of this invention.

The computer system of this embodiment includes a client node 10, a master node 20, and slave nodes 30. The client node 10 is a computer that instructs execution of a job. The master node 20 is a computer that manages job execution. Each slave node 30 is a computer that executes the job itself. The computer system includes one master node 20 and as many slave nodes 30 as required. Increasing the number of slave nodes 30 increases the computing capacity of the computer system.

This computer system executes parallel calculation jobs, each composed of a plurality of tasks. When the client node 10 instructs the master node 20 to execute a job, the master node 20 assigns each task constituting the job to a slave node 30, and the job is executed by the slave nodes 30 executing the tasks assigned to them.

The client node 10 executes a client process P1 that instructs execution of a job. The master node 20 executes a job management process P2 that controls execution of the job as a whole, and a distributed file system (FS) process P5. Each slave node 30 executes an intra-node job management process P3 that controls execution of the jobs running in that node, a job execution process P4 that executes a job, and a distributed FS process P5.

The job execution process P4 is created dynamically when the client process P1 instructs execution of a job. Within a node, one instance of the job execution process P4 is created per job. Therefore, when one node executes a plurality of jobs simultaneously, a plurality of job execution processes are created in that node. The job execution process P4 terminates after processing of the job ends.

The distributed FS is a file system in which file contents are distributed across a plurality of nodes and every file can be accessed from any node by the same path name. When a file is large, the single file is divided and placed on a plurality of nodes. The distributed FS process P5 is the process that implements the distributed FS, and each process accesses files on the distributed FS via the distributed FS process. Hereinafter, accessing the distributed FS via the distributed FS process is described simply as accessing the distributed FS, reading a file from the distributed FS, and so on.

FIG. 2 is a hardware configuration diagram of each node (computer) according to the embodiment of this invention.

As shown in FIG. 2, each node includes a processor (CPU) 101, a network interface (LAN I/F) 102, an input/output interface 103, a memory 104, and a storage interface 105, and these units are connected to one another inside the node.

The processor 101 executes programs stored in the memory. A keyboard 106, a mouse 107, and a display 108 are connected to the input/output interface 103. A user of the system instructs execution of a job using these devices connected to the input/output interface 103. A storage device 109 such as a hard disk drive is connected to the storage interface 105. The storage device 109 stores an operating system, the programs of the processes that run on each node, and the data to be analyzed by parallel calculation.

When the system starts, programs are loaded from the storage device 109 into the memory 104, and the CPU 101 executes the programs stored in the memory 104. Because users do not perform input/output operations directly on the master node 20 and the slave nodes 30, these nodes need not include the keyboard 106, the mouse 107, or the display 108.

FIGS. 3A to 3C are diagrams showing the internal configuration of the processes executed in each node according to the embodiment of this invention.

The job management process P2 shown in FIG. 3A includes a job control program P21, a slave management table P22, a job management table P23, a task management table P24, and a moving task management table P25. For example, the job management process P2 is stored in the storage device 109, read into the memory 104, and executed by the CPU 101.

The job control program P21 is a subprogram that controls jobs. Accordingly, the job control program P21 is stored in the memory, and is read from the memory and executed by the processor 101. The slave management table P22 holds information for managing the slave nodes 30 in the system (see FIG. 4A). The job management table P23 holds information for managing the jobs being executed (see FIG. 4B). The task management table P24 holds information for managing the tasks of the jobs being executed (see FIG. 4C). The moving task management table P25 holds information for managing the tasks that are being moved (see FIG. 4D).

FIGS. 4A to 4D are configuration diagrams of the tables included in the job management process P2 according to the embodiment of this invention.

The slave management table P22 shown in FIG. 4A includes a slave ID, a node address, a maximum number of tasks, a maximum memory amount, and a task ID list, and stores information on all slave nodes in the system. Each entry in the slave management table P22 corresponds to one slave node.

The slave ID is an identifier that uniquely identifies a slave node 30. The node address is the network address of the slave node 30. The maximum number of tasks is the number of tasks that can be executed simultaneously on that slave node. The maximum memory amount is the amount of memory available to tasks on that slave node, expressed, for example, in MB (megabytes). The task ID list holds the identifiers of the tasks being executed on that slave node.

Each item in the task ID list is written, for example, as a combination of a job identifier (job ID) and a task number. Specifically, as shown in FIG. 4A, the task ID list "J01:1, J01:2" of the first entry (slave ID = 01) indicates that the task with job ID "J01" and task number "1" and the task with job ID "J01" and task number "2" are being executed.
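As an illustration of this notation, the following sketch splits a task ID list string such as "J01:1, J01:2" into (job ID, task number) pairs. The function name `parse_task_id_list` and the choice of a comma-separated string as the stored form are assumptions for illustration only; the document specifies the notation shown in the figure, not how the list is held in memory.

```python
from typing import List, Tuple


def parse_task_id_list(text: str) -> List[Tuple[str, int]]:
    """Split a list such as "J01:1, J01:2" into (job_id, task_no) pairs."""
    pairs = []
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate an empty list or a trailing comma
        job_id, task_no = item.split(":")
        pairs.append((job_id.strip(), int(task_no)))
    return pairs
```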

The job management table P23 shown in FIG. 4B includes a job ID, a job program, initial data, and a task number list, and stores information on all jobs being executed in the computer system. Each entry in the job management table P23 corresponds to one job being executed.

The job ID is an identifier that uniquely identifies a job. The job program is the executable program of the job and is executed within the job execution process P4. The initial data is the data read by each task when the task is initialized, and is represented by the name of a file stored in the distributed FS. The task number list holds the task numbers of all tasks constituting the job. Because the job management table P23 has one entry per job ID, the task number list in the job management table P23 does not include the job ID.

The task management table P24 shown in FIG. 4C includes a job ID, a task number, an execution state, a total data amount, a swap amount, and a swap flag, and stores information on all tasks being executed in the system. Each entry in the task management table P24 corresponds to one task being executed.

The job ID is an identifier that uniquely identifies a job. The task number is a number that identifies a task within the job. The execution state indicates the execution state of the task and is one of the states shown in the state transition diagram of FIG. 7. The total data amount is the total amount of data used by the task, expressed, for example, in MB (megabytes). The total data amount includes the amount of data that is in the swapped state. The swap amount is the amount of swapped data, expressed, for example, in MB (megabytes). The swap flag indicates the status of swap processing for the task's data and is one of IDLE (no swap processing in progress), BUSY (swap processing in progress), or PRE_OUT (immediately before swap-out processing).
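One way to picture an entry of this table is the sketch below. The class and field names are illustrative assumptions, as is the helper `resident_mb`, which simply applies the statement that the total data amount includes swapped data; the document does not define such a derived value.

```python
from dataclasses import dataclass
from enum import Enum


class SwapFlag(Enum):
    IDLE = "IDLE"        # no swap processing in progress
    BUSY = "BUSY"        # swap processing in progress
    PRE_OUT = "PRE_OUT"  # immediately before swap-out processing


@dataclass
class TaskEntry:
    job_id: str
    task_no: int
    exec_state: str      # one of the states in the state transition diagram
    total_mb: int        # total data amount, including swapped data
    swapped_mb: int      # amount of swapped data
    swap_flag: SwapFlag = SwapFlag.IDLE

    def resident_mb(self) -> int:
        """Data still held in memory: total minus swapped (derived, illustrative)."""
        return self.total_mb - self.swapped_mb
```

For example, an entry with a total data amount of 100 MB and a swap amount of 30 MB would have 70 MB resident in memory under this reading.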

Swap-out is the process of moving data that a task holds in the memory 104 to the storage device 109, and swap-in is the process of returning data that was moved to the storage device 109 back to the memory 104.

The moving task management table P25 shown in FIG. 4D includes a job ID, a task number, a source slave ID, a destination slave ID, and an execution state, and stores information on all tasks that are being moved. Each entry in the moving task management table P25 corresponds to one task being moved.

The job ID is an identifier that uniquely identifies a job. The task number is a number that identifies a task within the job. The source slave ID and the destination slave ID are the identifier of the slave node 30 from which the task is moved and the identifier of the slave node 30 to which it is moved, respectively. The execution state holds the state the task was in before the move; it is saved so that, after the move, the execution state of the task can be restored to what it was before the move.
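The save-and-restore role of the execution state column can be sketched as follows. This is an assumption-laden illustration: the dictionaries, the function names `begin_move` and `end_move`, and the placeholder state `"MOVING"` are hypothetical and are not taken from the document, which defines the actual states in the state transition diagram.

```python
def begin_move(task_states, moving_table, key, src_slave, dst_slave):
    """Record a move and remember the state the task had before it started."""
    moving_table[key] = {
        "source": src_slave,
        "destination": dst_slave,
        "saved_state": task_states[key],  # state before the move
    }
    task_states[key] = "MOVING"  # hypothetical placeholder state


def end_move(task_states, moving_table, key):
    """Remove the move record and restore the saved execution state."""
    entry = moving_table.pop(key)
    task_states[key] = entry["saved_state"]
    return entry["destination"]
```

Here `key` would be a (job ID, task number) pair, matching how tasks are identified in the tables above.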

Next, the configuration of the intra-node job management process P3 shown in FIG. 3B is described. For example, the intra-node job management process P3 is stored in the storage device 109, read into the memory 104, and executed by the CPU 101.

The intra-node job management process P3 includes an intra-node job control program P31, an intra-node job management table P32, a task management table P33, and a memory management table P34. The intra-node job control program P31 is a program that controls job execution. Accordingly, the intra-node job control program P31 is stored in the memory, and is read from the memory and executed by the processor 101. The intra-node job management table P32 holds information for managing the jobs being executed on the own node (see FIG. 5A). The task management table P33 holds information for managing the tasks being executed (see FIG. 5B). The memory management table P34 holds information for managing the memory usage of the tasks being executed on the own node (see FIG. 5C).

FIGS. 5A to 5C are configuration diagrams of the tables included in the intra-node job management process P3.

The intra-node job management table P32 shown in FIG. 5A includes a job ID, a job program, initial data, and a task number list, and contains information only on the jobs being executed on this node. Each entry in the intra-node job management table P32 corresponds to one job being executed on the node.

The intra-node job management table P32 has the same structure as the job management table P23 held by the job management process P2, and items with the same name hold the same kind of data. However, the intra-node job management table P32 contains information only on the jobs being executed on that node.

The task management table P33 shown in FIG. 5B includes a job ID, a task number, an execution state, and a node address, and stores information on tasks being executed in the system. The task management table P33 contains information not only on tasks being executed on this node but also on tasks being executed on other nodes; however, it does not contain information on tasks belonging to jobs other than those being executed on this node. Each entry in the task management table P33 corresponds to one task being executed.

The job ID is an identifier that uniquely identifies a job. The task number is a number that identifies a task within the job. The execution state indicates the execution state of the task and is one of the states shown in the state transition diagram of FIG. 7. The execution state is stored only for tasks being executed on this node. The node address is the network address of the slave node 30 that is executing this job. The node address is stored only for tasks being executed on other nodes.

The memory management table P34 shown in FIG. 5C includes a job ID, a task number, a data name, a data amount, and a swap state, and contains information on the memory areas allocated to all tasks being executed on this node. Each entry in the memory management table P34 corresponds to one memory area.

The job ID is an identifier that uniquely identifies a job. The task number is a number that identifies a task within the job. The data name is the name of data held by the task. The data amount is the amount of that data, expressed, for example, in MB (megabytes). The swap state indicates the state of the memory area with respect to swapping (in particular, swap-out). Specifically, the swap state holds one of "NONE" (not swapped), "PRE_OUT" (immediately before swap-out), "OUT" (swap-out in progress), "SWAP" (swapped out), or "IN" (swap-in in progress).
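The five swap states can be written as an enumeration, and the per-area entries can be aggregated per task, as in the sketch below. The dictionary layout and the function `swapped_mb` are assumptions for illustration; in particular, counting only areas in the "SWAP" state as swapped data is one possible reading and is not stated in the document.

```python
from enum import Enum


class SwapState(Enum):
    NONE = "NONE"        # not swapped
    PRE_OUT = "PRE_OUT"  # immediately before swap-out
    OUT = "OUT"          # swap-out in progress
    SWAP = "SWAP"        # swapped out
    IN = "IN"            # swap-in in progress


def swapped_mb(entries, job_id, task_no):
    """Sum the data amounts of a task's memory areas that are swapped out.

    Assumption: only areas in the SWAP state are counted.
    """
    return sum(
        e["mb"]
        for e in entries
        if e["job_id"] == job_id
        and e["task_no"] == task_no
        and e["state"] is SwapState.SWAP
    )
```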

The intra-node job management process P3 uniquely identifies a task by its job ID and task number, and stores information on the data held by the identified task in the data name, data amount, and swap state items. Each task holds three kinds of data: internal data, a buffer, and a reception buffer. The internal data is a work memory area, and a task can hold a plurality of work memory areas by giving each memory area a name. When a task holds a work memory area, the name of that area is stored in the data name item of the memory management table P34. For the buffer memory area, "buf" should be stored in the data name item of the memory management table P34, and for the reception buffer, "buf_recv".

Next, the configuration of the job execution process P4 shown in FIG. 3C is described. For example, the job execution process P4 is stored in the storage device 109, read into the memory 104, and executed by the CPU 101.

The job execution process P4 includes a task control program P41, a data movement program P42, a task movement program P43, a task management table P44, a memory management table P45, and tasks P46. The task control program P41 is a program that controls the execution of tasks. The data movement program P42 is a program that performs data swap processing. The task movement program P43 is a program that moves tasks between nodes. Accordingly, the task control program P41, the data movement program P42, and the task movement program P43 are stored in the memory, and are read from the memory and executed by the processor 101. The task management table P44 manages information on the tasks being executed on the own node (see FIG. 6A). The memory management table P45 manages the memory usage of the tasks being executed on the own node (see FIG. 6B).

A plurality of tasks P46 can be created in the job execution process P4. The upper limit on the number of tasks P46 is the value given as the maximum number of tasks in the slave management table P22. A task includes a task processing code P47, which is a user program, internal data (data), a buffer (buf), and a reception buffer (buf_recv).

The internal data is a work memory area that the task uses internally. Each task can use a plurality of memory areas by giving each memory area a name. The task processing code P47 uses internal data by requesting a memory area from the task control program P41. The task control program P41 manages the memory areas used by the task processing code P47 and, when necessary, requests the data movement program P42 to perform swap processing.

Tasks can also communicate with one another by exchanging messages. A message received by a task is first stored in the reception buffer (buf_recv). The task processing code P47, on the other hand, takes messages out of the buffer (buf) and uses them. The system operates on a processing model that alternates between parallel processing of a plurality of tasks and synchronization processing. A message sent in a given parallel processing phase is stored in the reception buffer (buf_recv) of the destination task during that phase. Then, in the synchronization phase, the message is moved from the reception buffer (buf_recv) to the buffer (buf), and in the next parallel processing phase the task processing code P47 of the destination task can use the message that was moved to the buffer. The alternation of parallel processing and synchronization processing is described in detail later.
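The two-buffer message flow described above can be sketched as follows. The class `Task` and the functions `send` and `synchronize` are illustrative names, not interfaces from the document, and appending moved messages to any that remain in buf (rather than replacing them) is an assumption.

```python
class Task:
    def __init__(self, task_no):
        self.task_no = task_no
        self.buf = []       # messages visible to the task processing code
        self.buf_recv = []  # messages received during the current parallel phase


def send(dest, message):
    """Parallel phase: a sent message lands in the destination's reception buffer."""
    dest.buf_recv.append(message)


def synchronize(tasks):
    """Synchronization phase: move received messages from buf_recv to buf."""
    for task in tasks:
        task.buf.extend(task.buf_recv)
        task.buf_recv = []
```

Under this sketch, a message sent in one parallel phase is not visible in `buf` until `synchronize` has run, so the destination task only sees it in the next parallel phase.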

図６Ａ、図６Ｂは、ジョブ実行プロセスＰ４に含まれるテーブルの構成図である。 6A and 6B are configuration diagrams of tables included in the job execution process P4.

図６Ａに示すタスク管理テーブルＰ４４は、ジョブＩＤ、タスク番号、実行状態、及びノードアドレスを含み、このジョブ実行プロセスＰ４によって実行されるジョブに対応するすべてのタスクの情報が記載される。タスク管理テーブルＰ４４の各エントリは、そのノードで実行中の各タスクに対応する。 The task management table P44 shown in FIG. 6A includes a job ID, a task number, an execution state, and a node address, and describes information on all tasks corresponding to the job executed by the job execution process P4. Each entry in the task management table P44 corresponds to each task being executed on the node.

タスク管理テーブルＰ４４は、ノード内ジョブ管理プロセスＰ３に含まれるタスク管理テーブルＰ３３と同様の構成であり、各項目の意味も同じである。但し、タスク管理テーブルＰ４４には、このジョブ実行プロセスＰ４によって実行されるジョブ以外のジョブに対応するタスクの情報は記載されない。 The task management table P44 has the same configuration as the task management table P33 included in the intra-node job management process P3, and the meaning of each item is also the same. However, the task management table P44 does not include information on tasks corresponding to jobs other than jobs executed by the job execution process P4.

The memory management table P45 shown in FIG. 6B includes a job ID, a task number, a data name, a data amount, a data ID, and a swap file name, and records information on the memory areas held by all tasks executed by this job execution process P4. Each entry in the memory management table P45 corresponds to one memory area.

メモリ管理テーブルＰ４５のジョブＩＤ、タスク番号、データ名、データ量は、ノード内ジョブ管理プロセスＰ３に保持されるメモリ管理テーブルＰ３４の各項目と同じ意味である。データＩＤは、メモリ領域の識別子であり、タスク処理コードＰ４７はデータＩＤを使用してメモリ領域にアクセスする。具体的には、データＩＤとしてメモリ領域のハンドル番号やアドレス等を利用する。スワップファイル名は、このメモリ領域がスワップアウトされている場合のスワップファイルの名前である。このスワップファイルは、分散ＦＳ上ではなく、各スレーブノードのローカルファイルシステム上に格納される。 The job ID, task number, data name, and data amount in the memory management table P45 have the same meaning as the items in the memory management table P34 held in the in-node job management process P3. The data ID is an identifier of the memory area, and the task processing code P47 accesses the memory area using the data ID. Specifically, the handle number or address of the memory area is used as the data ID. The swap file name is the name of the swap file when this memory area is swapped out. This swap file is not stored on the distributed FS but on the local file system of each slave node.

次に、タスクの実行状態の遷移を説明する。 Next, the transition of the task execution state will be described.

FIG. 7 is a task state transition diagram according to the embodiment of this invention. The task state names described below are recorded in the execution state field of the task management table P24 included in the job management process P2, the task management table P33 included in the intra-node job management process P3, and the task management table P44 included in the job execution process P4.

新規にジョブ実行プロセスＰ４内にタスクが生成された時には、タスクの状態は「ＩＮＩＴ」になり、ジョブ管理テーブルＰ２３及びノード内ジョブ管理テーブルＰ３２の初期データに記載されたファイルを分散ＦＳから読み込む。ファイルが読み込まれた後、タスクの状態は「ＷＡＩＴ」になる。これは待機状態を意味しており、クライアントプロセスＰ１からタスク実行指示があるまで待機する。 When a task is newly generated in the job execution process P4, the task state becomes “INIT”, and the files described in the initial data of the job management table P23 and the intra-node job management table P32 are read from the distributed FS. After the file is read, the task state becomes “WAIT”. This means a standby state, and waits until there is a task execution instruction from the client process P1.

タスク実行指示を受けた場合、タスクＰ４６内の受信バッファ（ｂｕｆ＿ｒｅｃｖ）内のメッセージをバッファ（ｂｕｆ）に移動した後、タスクの状態は「ＰＲＥ＿ＲＵＮ」になる。この状態はタスクがスケジューリングされるのを待っている状態である。 When a task execution instruction is received, after the message in the reception buffer (buf_recv) in the task P46 is moved to the buffer (buf), the task state becomes “PRE_RUN”. This state is a state waiting for a task to be scheduled.

Thereafter, when the task is scheduled, any swapped-out data is read back into the memory area, the task state becomes “RUN”, and execution of the task processing code P47 starts. While the task processing code P47 is executing, the data managed by the task P46 is not swapped out. After execution of the task processing code P47 ends, the task state returns to “WAIT”. When the task state is “WAIT” and a task end instruction arrives from the client process P1, the task releases the memory areas it used and disappears. FIG. 7 labels this as “END” for convenience only; because the task actually disappears, no “END” state is recorded in the tables.

タスクの状態が「ＷＡＩＴ」又は「ＰＲＥ＿ＲＵＮ」の時、タスクを他のスレーブノード上のジョブ実行プロセスに移動することが可能である。タスクの移動が開始すると、移動元タスクの状態は「ＭＩＧ＿ＳＥＮＤ」になり、タスクの移動が終了すると、移動元タスクは消滅する。移動元タスクが移動を開始すると、移動先スレーブノードのジョブ実行プロセス上に、「ＭＩＧ＿ＲＥＣＶ」という状態のタスクが生成される。移動処理が完了すると、タスクの状態は「ＷＡＩＴ」又は「ＰＲＥ＿ＲＵＮ」になり、移動前の状態を復元する。従って、タスクの移動中は同一ジョブＩＤとタスク番号を持つ二つのタスク（移動元タスク及び移動先タスク）存在する。 When the task state is “WAIT” or “PRE_RUN”, it is possible to move the task to a job execution process on another slave node. When the movement of the task starts, the state of the movement source task becomes “MIG_SEND”, and when the movement of the task ends, the movement source task disappears. When the source task starts to move, a task with a state of “MIG_RECV” is generated on the job execution process of the destination slave node. When the movement process is completed, the task state becomes “WAIT” or “PRE_RUN”, and the state before the movement is restored. Therefore, there are two tasks (movement source task and movement destination task) having the same job ID and task number during task movement.
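The state transitions described for FIG. 7 can be summarized as a small transition table. The following is a sketch only: the names `TaskState`, `ALLOWED`, `can_transition`, and `can_migrate` are hypothetical, and the table encodes just the transitions named in the text. Task disappearance (“END”) is not modeled as a state because the text says it is never recorded.

```python
from enum import Enum


class TaskState(Enum):
    INIT = "INIT"
    WAIT = "WAIT"
    PRE_RUN = "PRE_RUN"
    RUN = "RUN"
    MIG_SEND = "MIG_SEND"
    MIG_RECV = "MIG_RECV"


# Transitions taken from the description of FIG. 7.
ALLOWED = {
    TaskState.INIT: {TaskState.WAIT},                        # initial data has been read
    TaskState.WAIT: {TaskState.PRE_RUN, TaskState.MIG_SEND},
    TaskState.PRE_RUN: {TaskState.RUN, TaskState.MIG_SEND},
    TaskState.RUN: {TaskState.WAIT},                         # task processing code finished
    TaskState.MIG_SEND: set(),                               # source task disappears afterwards
    TaskState.MIG_RECV: {TaskState.WAIT, TaskState.PRE_RUN}, # pre-move state is restored
}


def can_transition(current, new):
    """Return True if the description allows moving from `current` to `new`."""
    return new in ALLOWED[current]


def can_migrate(state):
    """Only tasks in WAIT or PRE_RUN may be moved to another slave node."""
    return state in (TaskState.WAIT, TaskState.PRE_RUN)
```

For example, `can_migrate(TaskState.RUN)` is false, which matches the statement that a running task cannot be moved.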

FIG. 8 is a sequence diagram showing inter-process communication during task execution according to the embodiment of this invention, and shows how the processes cooperate when a task changes state. Each process is executed by the CPU of the computer to which it belongs.

ジョブの処理を開始する時、クライアントプロセスＰ１はジョブ開始要求（Ｍ０１）をジョブ管理プロセスＰ２に送信する。ジョブ開始要求はジョブプログラムのファイル名、初期データのファイル名を含む。ジョブプログラム及び初期データは、分散ＦＳ上に格納されており、上記ファイル名は分散ＦＳ上のファイル名である。 When starting job processing, the client process P1 transmits a job start request (M01) to the job management process P2. The job start request includes the file name of the job program and the file name of the initial data. The job program and initial data are stored on the distributed FS, and the file name is a file name on the distributed FS.

ジョブ管理プロセスＰ２は、ジョブ開始要求（Ｍ０１）を受けると、実行されるジョブにジョブＩＤを割り当て、ノード内ジョブ管理プロセスＰ３にプロセス生成要求（Ｍ０２）を送信する。プロセス生成要求（Ｍ０２）は、ジョブＩＤ、ジョブプログラムのファイル名、及び初期データのファイル名を含む。 Upon receiving the job start request (M01), the job management process P2 assigns a job ID to the job to be executed and transmits a process generation request (M02) to the intra-node job management process P3. The process generation request (M02) includes a job ID, a file name of the job program, and a file name of initial data.

Upon receiving the process generation request (M02), the intra-node job management process P3 generates the job execution process P4 (M03). Specifically, it reads from the distributed FS the files identified by the job program file name and the initial data file name included in the process generation request (M02), and starts the job execution process P4, which contains the program file, with the read initial data as a parameter. After starting, the job execution process P4 transmits a process generation notification (M04) to the intra-node job management process P3.

Upon receiving the process generation notification (M04), the intra-node job management process P3 adds information on the newly generated job to the intra-node job management table P32 and forwards the process generation notification (M05) to the job management process P2. At this point, one entry containing the job ID, the job program, and the initial data has been created in the intra-node job management table P32. However, because no task has been generated yet, the task number list is empty.

Upon receiving the process generation notification (M05), the job management process P2 adds the job information to the job management table P23, in the same way as the intra-node job management process P3, and transmits a job start notification (M06) to the client process P1.

Upon receiving the job start notification (M06), the client process P1 transmits a task generation request (M07) to the job management process P2.

Upon receiving the task generation request (M07), the job management process P2 assigns a task number to the task to be generated and forwards the task generation request to the job execution process P4 via the intra-node job management process P3 (M08, M09). The task number assigned by the job management process P2 is attached to the task generation request (M08, M09) and sent to the job execution process P4.

Upon receiving the task generation request (M09), the job execution process P4 generates a task P46. After generation is complete, it adds the information on the generated task to the task management table P44 and transmits a state transition notification [INIT] (M10), which includes the job ID, task number, and task state (INIT) of the generated task, to the intra-node job management process P3.

ノード内ジョブ管理プロセスＰ３は、状態遷移通知（Ｍ１０）を受けると、受信したタスクの情報（ジョブＩＤ、タスク番号、実行状態）をタスク管理テーブルＰ３３に追加し、状態遷移通知（Ｍ１０）をジョブ管理プロセスＰ２に転送（Ｍ１１）する。 Upon receiving the state transition notification (M10), the intra-node job management process P3 adds the received task information (job ID, task number, execution state) to the task management table P33, and sends the state transition notification (M10) to the job. Transfer (M11) to the management process P2.

Upon receiving the state transition notification (M11), the job management process P2 adds the task information to the task management table P24, in the same way as the intra-node job management process P3. Because the task has not yet read any data, the total data amount and the swap amount in the task management table P24 are set to 0, and the swap flag is set to “IDLE”. At this point, the execution state of this task in the task management tables of the processes P2, P3, and P4 is “INIT”.

さらに、ジョブ管理プロセスＰ２は、スレーブ管理テーブルＰ２２のノードアドレスが、受信した状態遷移通知［ＩＮＩＴ］の送信元ノードアドレスに一致するエントリを検索し、当該エントリのタスクＩＤ一覧に新規に作成されたタスクの情報（ジョブＩＤ及びタスク番号）を追加する。 Further, the job management process P2 searches for an entry in which the node address of the slave management table P22 matches the transmission source node address of the received state transition notification [INIT], and is newly created in the task ID list of the entry. Task information (job ID and task number) is added.

Although not shown in the figure, upon receiving the state transition notification [INIT] (M11), the job management process P2 searches the slave management table P22 for entries whose task ID list contains the job ID included in the notification. The entries found correspond to the nodes that are executing the job that started the task. To each of these nodes other than the sender of the state transition notification [INIT] (M11), the job management process P2 transmits a state transition notification [INIT] to which the address of the sender node has been added. An intra-node job management process P3 that receives this notification adds an entry for the generated task to its task management table P33. Because the task is running on a remote node, the execution state field of that entry is left blank, and the node address field is set to the address that the job management process added to the notification.

このように、新規にタスクを生成した場合、ジョブ実行プロセスＰ４からタスクの情報をジョブ管理プロセスＰ２に転送し、さらに他ノードのノード内ジョブ管理プロセスＰ３へタスクの情報を転送することによって、当該ジョブに関係するノード内ジョブ管理プロセスＰ３へタスクの情報を伝える。 As described above, when a new task is generated, the task information is transferred from the job execution process P4 to the job management process P2, and further transferred to the in-node job management process P3 of the other node. The task information is transmitted to the intra-node job management process P3 related to the job.

次に、ジョブ実行プロセスＰ４は、初期データを分散ＦＳから読み込み、読み込んだデータをタスクの内部データ（ｄａｔａ）としてメモリ領域に格納する。初期データのファイル名は、ジョブ開始要求（Ｍ０１）、プロセス生成要求（Ｍ０２）に記載されており、ジョブ実行プロセスの起動オプションによってジョブ実行プロセスＰ４に送られる。 Next, the job execution process P4 reads initial data from the distributed FS, and stores the read data in the memory area as internal data (data) of the task. The file name of the initial data is described in the job start request (M01) and the process generation request (M02), and is sent to the job execution process P4 by the start option of the job execution process.

After reading of the initial data is complete, the job execution process P4 updates the execution state of the entry for the task in the task management table P44 to “WAIT” and transmits a state transition notification [WAIT] (M12), which includes the job ID, task number, and task state (WAIT), to the intra-node job management process P3.

ノード内ジョブ管理プロセスＰ３は、状態遷移通知［ＷＡＩＴ］（Ｍ１２）を受けると、ジョブ実行プロセスＰ４と同様に、タスク管理テーブルＰ３３の当該タスクのエントリの実行状態を「ＷＡＩＴ」に更新する。そして、ノード内ジョブ管理プロセスＰ３は、状態遷移通知［ＷＡＩＴ］をジョブ管理プロセスＰ２に転送（Ｍ１３）する。 Upon receiving the state transition notification [WAIT] (M12), the intra-node job management process P3 updates the execution state of the entry of the task in the task management table P33 to “WAIT”, as in the job execution process P4. Then, the intra-node job management process P3 transfers the state transition notification [WAIT] to the job management process P2 (M13).

ジョブ管理プロセスＰ２は、状態遷移通知［ＷＡＩＴ］（Ｍ１３）を受けると、ジョブＩＤとタスク番号が一致するエントリをタスク管理テーブルＰ２４から検索し、検索されたエントリの実行状態を「ＷＡＩＴ」に更新する。 Upon receiving the state transition notification [WAIT] (M13), the job management process P2 searches the task management table P24 for an entry having a matching job ID and task number, and updates the execution state of the searched entry to “WAIT”. To do.

このように、タスク実行状態が変更されたときに、状態遷移通知は、ジョブ実行プロセスＰ４から、ノード内ジョブ管理プロセスＰ３、ジョブ管理プロセスＰ２へ順に転送される。ノード内ジョブ管理プロセスＰ３及びジョブ管理プロセスＰ２は、状態遷移通知を受けると、各々、ジョブＩＤ及びタスク番号をキーとしてタスク管理テーブルＰ３３及びＰ２４から当該タスクに対応するエントリを検索し、検索されたエントリの実行状態を通知された値に変更する。 Thus, when the task execution state is changed, the state transition notification is sequentially transferred from the job execution process P4 to the in-node job management process P3 and the job management process P2. Upon receipt of the state transition notification, the in-node job management process P3 and the job management process P2 search the entry corresponding to the task from the task management tables P33 and P24 using the job ID and task number as a key, respectively. Change the execution status of the entry to the reported value.
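The forwarding of a state transition notification from P4 through P3 to P2 can be sketched as follows. This is an illustration under stated assumptions: each task management table is modeled as a plain dictionary keyed by the pair (job ID, task number), and the function names `apply_notification` and `propagate` as well as the sample table contents are hypothetical.

```python
def apply_notification(table, notification):
    """Update the execution state of the entry keyed by (job ID, task number)."""
    key = (notification["job_id"], notification["task_number"])
    if key not in table:
        raise KeyError(f"no entry for task {key}")
    table[key]["state"] = notification["state"]


def propagate(notification, tables):
    """Apply one notification to each table in forwarding order."""
    for table in tables:
        apply_notification(table, notification)


# Hypothetical contents: one task (job 1, task 0) that has just read its initial data.
p44 = {(1, 0): {"state": "INIT"}}  # table held by the job execution process P4
p33 = {(1, 0): {"state": "INIT"}}  # table held by the intra-node job management process P3
p24 = {(1, 0): {"state": "INIT"}}  # table held by the job management process P2

propagate({"job_id": 1, "task_number": 0, "state": "WAIT"}, [p44, p33, p24])
```

In the actual system each table lives in a different process and the notification travels as a message; the loop here only stands in for that hop-by-hop forwarding.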

タスクのメモリ使用量又はスワップ状態が変更された場合も、タスク実行状態が変更された場合と同様に、状態遷移通知がジョブ実行プロセスＰ４から、ジョブ管理プロセスＰ２へ転送される。この場合は、ノード内ジョブ管理プロセスＰ３は、状態遷移通知に従ってメモリ管理テーブルＰ３４のエントリの追加／削除、データ量、及びスワップ状態を変更する。ジョブ管理プロセスＰ２も、状態遷移通知に従ってタスク管理テーブルＰ２４の全データ量、スワップ量、及びスワップフラグを変更する。 Even when the memory usage amount or the swap state of the task is changed, the state transition notification is transferred from the job execution process P4 to the job management process P2, similarly to the case where the task execution state is changed. In this case, the intra-node job management process P3 changes the addition / deletion of entries in the memory management table P34, the data amount, and the swap state in accordance with the state transition notification. The job management process P2 also changes the total data amount, the swap amount, and the swap flag in the task management table P24 according to the state transition notification.

After transmitting the task generation request (M07), the client process P1 transmits a synchronization request (M14) to the job management process P2. The synchronization request (M14) includes a list of the task IDs of the tasks to wait for. Upon receiving the synchronization request (M14), the job management process P2 waits until all tasks listed in it are in the “WAIT” state. If a task listed in the synchronization request (M14) has already ended, or ends while synchronization is pending, the job management process P2 returns an error notification to the client process P1. In FIG. 8, the task state has become “WAIT”, so a synchronization completion notification (M15) is returned to the client.

FIG. 8 shows only one task being generated and synchronized, but in practice multiple tasks are generated and synchronized. Specifically, the task generation request (M07) describes information on multiple tasks, and the job management process P2 requests the intra-node job management process P3 to generate them. At this time, the job management process P2 refers to the maximum number of tasks and the task ID list in the slave management table P22 and checks whether the node that would be asked to generate the tasks has room for new tasks. If the requested number of tasks cannot be generated on one slave node, the job management process P2 transmits task generation requests to the intra-node job management processes P3 of other slave nodes so that the tasks are generated there.
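The capacity check against the slave management table P22 might look like the sketch below. The text only says that a node must have room and that other nodes are used when it does not; the order in which nodes are tried, the behavior when no node has room, and all names here (`place_tasks`, `max_tasks`, `task_ids`, the node addresses) are assumptions made for illustration.

```python
def place_tasks(slave_table, task_ids):
    """Assign each task ID to a node address without exceeding any node's maximum."""
    placement = {}
    pending = list(task_ids)
    for address, entry in slave_table.items():
        # Room left on this node: maximum task count minus tasks already listed.
        room = max(entry["max_tasks"] - len(entry["task_ids"]), 0)
        for task_id in pending[:room]:
            placement[task_id] = address
        pending = pending[room:]
        if not pending:
            break
    if pending:
        # Assumption: the sketch simply fails when every node is full.
        raise RuntimeError(f"no capacity left for tasks {pending}")
    return placement


# Hypothetical slave management table with two nodes.
slave_table = {
    "node-a": {"max_tasks": 2, "task_ids": [(1, 0)]},
    "node-b": {"max_tasks": 2, "task_ids": []},
}
placement = place_tasks(slave_table, [(1, 1), (1, 2), (1, 3)])
```

Here the first node has room for only one more task, so the remaining two tasks are placed on the second node.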

従って、タスク実行するジョブ実行プロセスＰ４は、複数のノードで動作することになるが、各タスクからの状態遷移通知は必ずジョブ管理プロセスＰ２に送信される。ジョブ管理プロセスＰ２は、クライアントプロセスＰ１から同期要求（Ｍ１４）を受けると、同期要求（Ｍ１４）に記載されたすべてのタスクの状態を確認し、すべてのタスクの状態が「ＷＡＩＴ」になったら、クライアントプロセスＰ１に同期完了通知（Ｍ１５）を返信する。 Therefore, the job execution process P4 for executing the task operates on a plurality of nodes, but the state transition notification from each task is always transmitted to the job management process P2. When the job management process P2 receives the synchronization request (M14) from the client process P1, the job management process P2 confirms the status of all tasks described in the synchronization request (M14), and when the status of all tasks becomes “WAIT”, A synchronization completion notification (M15) is returned to the client process P1.
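The decision that the job management process P2 makes for a synchronization request can be sketched as a single check over its task table. The function name `check_synchronization`, the result strings, and the representation of an ended task as a missing table entry are assumptions for illustration.

```python
def check_synchronization(task_states, requested_ids):
    """Classify a synchronization request as ERROR, COMPLETE, or PENDING.

    task_states maps (job ID, task number) to an execution state string.
    A task that has ended is assumed to have no entry at all.
    """
    for task_id in requested_ids:
        if task_id not in task_states:
            return "ERROR"      # a listed task has already ended
    if all(task_states[task_id] == "WAIT" for task_id in requested_ids):
        return "COMPLETE"       # send the synchronization completion notification
    return "PENDING"            # keep waiting for further state transition notifications


states = {(1, 0): "WAIT", (1, 1): "RUN"}
result_pending = check_synchronization(states, [(1, 0), (1, 1)])
states[(1, 1)] = "WAIT"
result_complete = check_synchronization(states, [(1, 0), (1, 1)])
result_error = check_synchronization(states, [(1, 0), (1, 2)])
```

In practice the check would be re-evaluated each time a state transition notification arrives, since those notifications are the only way the table changes.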

Upon receiving the synchronization completion notification (M15), the client process P1 transmits a task execution instruction (M16) to the job management process P2. The task execution instruction (M16) lists the task IDs of the tasks to be executed. The job management process P2 refers to the task ID list in the slave management table P22 and obtains the node address of the slave on which the task with each specified task ID is running. The job management process P2 then transmits a task execution instruction (M17) to the intra-node job management process of that node. The task execution instruction (M17) lists the task IDs of the tasks to be executed on that node. If the tasks listed in the task execution instruction (M16) are running on different nodes, a task execution instruction (M17) is transmitted to the intra-node job management process of each of those nodes.

Upon receiving the task execution instruction (M17), the intra-node job management process P3 extracts the job ID from the task ID and forwards a task execution instruction (M18) including the extracted job ID to the job execution process responsible for that job.

Upon receiving the task execution instruction (M18), the job execution process P4 extracts the task number from the task ID included in the instruction, searches the task management table P44 for the entry for that task using the task number as a key, and changes the execution state of the entry found to “PRE_RUN”. The job execution process P4 then moves the messages accumulated in the reception buffer (buf_recv) of the task P46 to the buffer (buf) and clears the reception buffer (buf_recv). After that, the job execution process P4 returns a state transition notification [PRE_RUN] to the intra-node job management process P3.

ノード内ジョブ管理プロセスＰ３は、状態遷移通知［ＰＲＥ＿ＲＵＮ］（Ｍ２０）を受けると、受信したタスク実行指示（Ｍ１７）に含まれるタスクＩＤからジョブＩＤとタスク番号を取り出し、取り出されたジョブＩＤ及びタスク番号をキーとしてタスク管理テーブルＰ３３を検索し、検索されたエントリの実行状態を「ＰＲＥ＿ＲＵＮ」に変更する。そして、ノード内ジョブ管理プロセスＰ３は状態遷移通知（Ｍ２１）をジョブ管理プロセスＰ２に転送する。 Upon receiving the state transition notification [PRE_RUN] (M20), the in-node job management process P3 extracts the job ID and task number from the task ID included in the received task execution instruction (M17), and extracts the extracted job ID and task. The task management table P33 is searched using the number as a key, and the execution state of the searched entry is changed to “PRE_RUN”. Then, the intra-node job management process P3 transfers the state transition notification (M21) to the job management process P2.

ジョブ管理プロセスＰ２は、状態遷移通知（Ｍ２１）を受けると、ノード内ジョブ管理プロセスＰ３と同様に、タスク管理テーブルＰ２４の、このタスクに対応するエントリの実行状態を「ＰＲＥ＿ＲＵＮ」に変更する。 Upon receiving the state transition notification (M21), the job management process P2 changes the execution state of the entry corresponding to this task in the task management table P24 to “PRE_RUN”, as in the in-node job management process P3.

Next, the intra-node job management process P3 checks the task management table P33 and counts the entries for tasks with no node address set (that is, tasks running on the local node) whose execution state is “RUN”. This count is the number of tasks actually scheduled on the node. If the count is smaller than the maximum number of tasks of the local node, the intra-node job management process P3 permits execution of the task and transmits a scheduling request (M22) to the job execution process P4. The scheduling request (M22) includes the task number of the task to be scheduled.
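The admission check in this step reduces to counting local entries in the “RUN” state. In the sketch below, the table is modeled as a list of dictionaries, a local task is represented by a `node_address` of `None`, and the names `count_running_local` and `may_schedule` are hypothetical.

```python
def count_running_local(task_table):
    """Count entries with no node address (local tasks) whose state is RUN."""
    return sum(
        1
        for entry in task_table
        if entry["node_address"] is None and entry["state"] == "RUN"
    )


def may_schedule(task_table, max_tasks):
    """Permit another task only while fewer than max_tasks local tasks are running."""
    return count_running_local(task_table) < max_tasks


# Hypothetical task management table on one slave node.
task_table = [
    {"task_number": 0, "state": "RUN", "node_address": None},
    {"task_number": 1, "state": "PRE_RUN", "node_address": None},
    {"task_number": 2, "state": "RUN", "node_address": "node-b"},  # remote task, not counted
]
allowed_with_two_slots = may_schedule(task_table, max_tasks=2)
allowed_with_one_slot = may_schedule(task_table, max_tasks=1)
```

Remote entries are excluded because they describe tasks that consume CPU on another node.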

Upon receiving the scheduling request (M22), the job execution process P4 searches the task management table P44 using the task number included in the request as a key and changes the execution state of the entry found to “RUN”.

The job execution process P4 then returns a state transition notification [RUN] to the intra-node job management process P3 and starts executing the task processing code P47 of the task P46. A thread is assigned to each task, and the task processing code P47 runs asynchronously. The task processing code P47 sends messages to other tasks and reads messages received from other tasks from the buffer (buf) of the task P46.

タスク処理コードＰ４７の実行が終了すると、ジョブ実行プロセスＰ４は、実行が終了したタスクのタスク番号をキーとしてタスク管理テーブルＰ４４を検索し、検索されたエントリの実行状態を「ＷＡＩＴ」に変更する。そして、この実行が終了したタスクのジョブＩＤとタスク番号を含んだ状態遷移通知［ＷＡＩＴ］（Ｍ２５）をノード内ジョブ管理プロセスＰ３に送信する。 When the execution of the task processing code P47 is completed, the job execution process P4 searches the task management table P44 using the task number of the task that has been executed as a key, and changes the execution state of the searched entry to “WAIT”. Then, a state transition notification [WAIT] (M25) including the job ID and task number of the task whose execution has been completed is transmitted to the intra-node job management process P3.

ノード内ジョブ管理プロセスＰ３は、状態遷移通知［ＷＡＩＴ］（Ｍ２５）を受けると、状態遷移通知［ＷＡＩＴ］（Ｍ２５）に含まれるタスクのジョブＩＤ及びタスク番号をキーとしてタスク管理テーブルＰ３３を検索し、検索されたエントリの実行状態を「ＷＡＩＴ」に変更する。さらに、ノード内ジョブ管理プロセスＰ３は、状態遷移通知［ＷＡＩＴ］（Ｍ２６）をジョブ管理プロセスＰ２に転送する。 Upon receiving the state transition notification [WAIT] (M25), the intra-node job management process P3 searches the task management table P33 using the job ID and task number of the task included in the state transition notification [WAIT] (M25) as keys. Then, the execution state of the retrieved entry is changed to “WAIT”. Furthermore, the intra-node job management process P3 transfers the state transition notification [WAIT] (M26) to the job management process P2.

ジョブ管理プロセスＰ２は、状態遷移通知［ＷＡＩＴ］（Ｍ２６）を受けると、ノード内ジョブ管理プロセスＰ３と同様に、タスク管理テーブルＰ２４の、このタスクに対応するエントリの実行状態を「ＷＡＩＴ」に変更する。その後、ジョブ管理プロセスＰ２は、クライアントプロセスＰ１に同期完了通知（Ｍ２７）を返信する。 Upon receiving the state transition notification [WAIT] (M26), the job management process P2 changes the execution state of the entry corresponding to this task to “WAIT” in the task management table P24, as in the in-node job management process P3. To do. Thereafter, the job management process P2 returns a synchronization completion notification (M27) to the client process P1.

クライアントプロセスＰ１は、タスク実行指示（Ｍ１６）を送信した後に、同期要求（Ｍ１９）をジョブ管理プロセスに送信し、実行したタスクの状態が「ＷＡＩＴ」になるまで待つ。この処理は、Ｍ１４で示した同期処理と同様の処理である。 After transmitting the task execution instruction (M16), the client process P1 transmits a synchronization request (M19) to the job management process and waits until the state of the executed task becomes “WAIT”. This process is the same as the synchronization process indicated by M14.

このように、クライアントプロセスＰ１からのタスク実行指示によるタスク処理の開始と、タスクの同期を交互に行うことによって、タスクの実行が進んでいく。 In this way, task execution proceeds by alternately performing task processing start by task execution instructions from the client process P1 and task synchronization.

上記説明では、ノード内ジョブ管理プロセスＰ３はタスクが「ＰＲＥ＿ＲＵＮ」状態になった時に、そのタスクをスケジューリングしたが、クライアントプロセスが多数のタスクに対してタスク実行指示を行うと、スレーブノードの最大タスク数を超える数のタスクが同時に「ＰＲＥ＿ＲＵＮ」状態となり、「ＰＲＥ＿ＲＵＮ」状態になったタスクをすぐにはスケジューリングできなくなる。また、タスクの内部データ（ｄａｔａ）、バッファ（ｂｕｆ）はスワップアウト可能であり、タスク処理コードを実行する際に上記メモリ領域がスワップアウト済みであればスワップインする必要がある。また、ノード内のメモリが不足する場合、タスク処理コードを実行する前に、スワップアウト可能な別のタスクが管理する内部データ（ｄａｔａ）、バッファ（ｂｕｆ）をスワップアウトする必要もある。以下では、図９、図１０を参照し、スケジューリング処理の詳細を説明する。 In the above description, the intra-node job management process P3 schedules the task when the task is in the “PRE_RUN” state. However, if the client process issues a task execution instruction to many tasks, the maximum task of the slave node The number of tasks exceeding the number simultaneously enter the “PRE_RUN” state, and the task in the “PRE_RUN” state cannot be scheduled immediately. The internal data (data) and buffer (buf) of the task can be swapped out. If the memory area is already swapped out when executing the task processing code, it is necessary to swap in. If the memory in the node is insufficient, it is necessary to swap out the internal data (data) and the buffer (buf) managed by another task that can be swapped out before executing the task processing code. Hereinafter, the scheduling process will be described in detail with reference to FIGS. 9 and 10.

When the intra-node job control program P31 starts processing, it checks the task management table P33 and searches for entries whose execution state is “PRE_RUN” (S101).

「ＰＲＥ＿ＲＵＮ」状態のエントリがない場合、「ＰＲＥ＿ＲＵＮ」状態のタスクが現れるまで待機する（Ｓ１０２）。 If there is no entry in the “PRE_RUN” state, the process waits until a task in the “PRE_RUN” state appears (S102).

Ｓ１０１において、「ＰＲＥ＿ＲＵＮ」状態のエントリが存在した場合、Ｓ１０１で見つけられた「ＰＲＥ＿ＲＵＮ」状態のエントリのジョブＩＤ及びタスク番号の一組を選択し、選択されたエントリとジョブＩＤ及びタスク番号が一致する、メモリ管理テーブルＰ３４のエントリを検索する。これらのエントリが、このタスクによって利用されるメモリ領域に対応する。 In S101, if there is an entry in the “PRE_RUN” state, a set of job ID and task number of the entry in “PRE_RUN” state found in S101 is selected, and the selected entry matches the job ID and task number. The entry of the memory management table P34 is searched. These entries correspond to the memory area used by this task.

そして、これらのエントリのスワップ状態を確認し、すべてのエントリのスワップ状態が「ＮＯＮＥ」であれば、そのタスクをスケジューリング対象として選択する。一方、スワップ状態が「ＮＯＮＥ」以外である場合、Ｓ１０１で見つけた別のエントリのジョブＩＤ及びタスク番号の組を選択し、前述した手順によってメモリ管理テーブルＰ３４を確認する。 Then, the swap state of these entries is confirmed, and if the swap state of all entries is “NONE”, the task is selected as a scheduling target. On the other hand, if the swap state is other than “NONE”, a combination of the job ID and task number of another entry found in S101 is selected, and the memory management table P34 is confirmed by the above-described procedure.

When the swap state in the memory management table P34 is “NONE”, the memory area has not been swapped out and is not scheduled to be swapped out in the near future. This step therefore preferentially schedules tasks that do not require swap-in (S103). Such control minimizes the idle time of the slave node's CPU and enables efficient scheduling. In addition, as described later, a task with a large amount of swapped-out data is preferentially selected as the target when a task is moved. Because a task in the “RUN” state cannot be moved, this control also has the effect of leaving tasks with a large amount of swapped-out data unscheduled, so that they remain available as movement targets.

Ｓ１０３においてタスクが選択できた場合、選択されたタスクをスケジューリングする。このため、ノード内ジョブ制御プログラムＰ３１は、当該タスクに対応するタスク管理テーブルＰ３３の実行状態を「ＲＵＮ」に変更する。そして、自ノードのメモリ使用状況を確認し、必要があれば他のタスクをスワップアウトする（Ｓ１０５）。このスワップアウト処理の詳細は図１０を用いて後述する。この処理が終わればＳ１１１に進む。 If a task can be selected in S103, the selected task is scheduled. Therefore, the intra-node job control program P31 changes the execution state of the task management table P33 corresponding to the task to “RUN”. Then, the memory usage status of the own node is confirmed, and if necessary, other tasks are swapped out (S105). Details of the swap-out process will be described later with reference to FIG. When this process ends, the process proceeds to S111.

If no task that needs no swap-in could be selected in S103, another task found in S101 is selected. For each task, the data amounts of the memory management table P34 entries whose swap state is “SWAP” (already swapped out) are summed, and the task with the smallest total is selected as the scheduling target (S106). The data amount in the memory management table P34 is the data size of the memory area, so choosing the task with the smallest swapped-out data size minimizes the swap-in work required for scheduling. As in S103, this also leaves tasks with a large swapped-out data size unscheduled, so that they remain available as movement targets.
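The two selection rules described so far (prefer a task that needs no swap-in, otherwise take the task with the least swapped-out data) can be sketched as follows. This is a minimal illustration, not the patented implementation: the `MemEntry` record, its field names, and the function `pick_task_to_schedule` are hypothetical stand-ins for the memory management table P34 and the logic of the intra-node job control program P31.

```python
from dataclasses import dataclass


@dataclass
class MemEntry:
    """One row of a simplified memory management table (hypothetical layout)."""
    job_id: int
    task_no: int
    swap_state: str  # "NONE", "PRE_OUT", "OUT" or "SWAP"
    size: int        # data amount of the memory area


def pick_task_to_schedule(pre_run_tasks, mem_table):
    """Return the (job_id, task_no) pair to schedule, or None if there is none."""
    def entries(task):
        return [e for e in mem_table if (e.job_id, e.task_no) == task]

    # First choice: a task whose memory areas are all "NONE" (no swap-in needed).
    for task in pre_run_tasks:
        if all(e.swap_state == "NONE" for e in entries(task)):
            return task

    # Fallback: the task with the smallest total of swapped-out ("SWAP") data.
    def swapped_size(task):
        return sum(e.size for e in entries(task) if e.swap_state == "SWAP")

    return min(pre_run_tasks, key=swapped_size, default=None)
```

A side effect of both rules is that tasks with a lot of swapped-out data are the last to be scheduled, which matches the stated intent of keeping them available for movement.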

その後、ノード内ジョブ制御プログラムＰ３１は、当該タスクに対応するタスク管理テーブルＰ３３の実行状態を「ＲＵＮ」に変更する。そして、Ｓ１０５と同様の手順で他のタスクをスワップアウトする（Ｓ１０７）。この処理の詳細は後述する。 Thereafter, the intra-node job control program P31 changes the execution state of the task management table P33 corresponding to the task to “RUN”. Then, other tasks are swapped out in the same procedure as S105 (S107). Details of this processing will be described later.

Next, the intra-node job control program P31 searches the memory management table P34 using the task number of the task selected as the scheduling target as a key, and looks among the entries found for those whose swap state is “PRE_OUT” or “OUT” (S108). An entry in the “PRE_OUT” state is a memory area that is scheduled to be swapped out soon; if such an entry exists, its swap state is changed to “NONE”. An entry in the “OUT” state is a memory area that is currently being swapped out. If such an entry exists, the intra-node job control program P31 requests the job execution process P4 to cancel the swap-out of that memory area and, once the cancellation has completed, changes the swap state of the entry to “NONE” (S109).

Next, the intra-node job control program P31 searches the memory management table P34 using the task number of the task selected as the scheduling target as a key, and looks among the entries found for those whose swap state is “SWAP”. Because the memory area of an entry in the “SWAP” state has already been swapped out, the intra-node job control program P31 requests the job execution process P4 to swap that memory area in (S110). When the job execution process P4 has finished the swap-in, the process proceeds to S111.
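The handling of the three non-resident swap states in S108 to S110 amounts to a small state dispatch, sketched below under stated assumptions. The dictionary layout and the callbacks `cancel_swap_out` and `swap_in` are hypothetical; they stand in for the requests sent to the job execution process P4. Resetting the state to “NONE” after a swap-in is also an assumption, since the text states this only for the “PRE_OUT” and “OUT” cases.

```python
def prepare_memory_for_run(entries, cancel_swap_out, swap_in):
    """Make every memory area of the task being scheduled resident again."""
    for entry in entries:
        state = entry["swap_state"]
        if state == "PRE_OUT":
            # Swap-out was only planned: withdrawing the plan is enough.
            entry["swap_state"] = "NONE"
        elif state == "OUT":
            # Swap-out is in progress: cancel it, then mark the area resident.
            cancel_swap_out(entry)
            entry["swap_state"] = "NONE"
        elif state == "SWAP":
            # Area is already swapped out: it must be swapped in first.
            swap_in(entry)
            entry["swap_state"] = "NONE"  # assumption: not stated in the text
```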

次に、ノード内ジョブ制御プログラムＰ３１は、スケジューリング対象として選択したタスクのジョブＩＤとタスク番号を含むスケジューリング要求を作成し、上記ジョブＩＤに対応するジョブ実行プロセスＰ４にスケジューリング要求を送信する。そして、ジョブ実行プロセスＰ４から状態遷移通知が返信されるのを待つ。状態遷移通知を受信すると、スケジューリングできたことが確認できるので、Ｓ１０１に戻る（Ｓ１１１）。 Next, the intra-node job control program P31 creates a scheduling request including the job ID and task number of the task selected as the scheduling target, and transmits the scheduling request to the job execution process P4 corresponding to the job ID. Then, it waits for a status transition notification to be returned from the job execution process P4. When the state transition notification is received, it can be confirmed that scheduling has been performed, and the process returns to S101 (S111).

次に、スワップアウト処理（Ｓ１０５）の詳細について説明する。 Next, details of the swap-out process (S105) will be described.

図１０は、本発明の実施の形態のタスクをスワップアウトする手順を示すフローチャートである。 FIG. 10 is a flowchart showing a procedure for swapping out a task according to the embodiment of this invention.

The intra-node job control program P31 obtains the amount of memory installed in its own node and the amount of memory in use, and determines whether the amount in use is at or below a fixed threshold (for example, 80% of the installed memory) (S201). The amount of memory in use is obtained by summing the data amounts of the memory management table P34 entries whose swap state is other than “SWAP”, “OUT”, or “PRE_OUT”. If the amount in use is at or below the threshold, free memory is judged to be sufficient and the process ends. If it exceeds the threshold, free memory is judged to be insufficient, and the process proceeds to the next step to perform a swap-out.
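The S201 check can be written in a few lines. The sketch below assumes a list of dictionaries as a simplified memory management table; the function name and the `limit_ratio` parameter are hypothetical, with 0.8 taken from the 80% example in the text.

```python
def memory_is_sufficient(mem_table, installed, limit_ratio=0.8):
    """Return True when the memory in use is at or below the threshold."""
    # Areas that are swapped out, being swapped out, or about to be swapped
    # out are not counted as memory in use, as described for S201.
    in_use = sum(
        e["size"]
        for e in mem_table
        if e["swap_state"] not in ("SWAP", "OUT", "PRE_OUT")
    )
    return in_use <= installed * limit_ratio
```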

The intra-node job control program P31 refers to the task management table P33 and searches for entries whose execution state is “WAIT”. If such entries are found, it searches the memory management table P34 for entries whose job ID and task number match an entry found in the task management table P33 and whose swap state is “SWAP”. If entries satisfying this condition are found, one of their memory areas is selected as the swap target.

If no entry satisfying this condition is found, the task management table P33 is searched for entries whose execution state is “PRE_RUN”, and the memory management table P34 is searched for entries whose job ID and task number match an entry found and whose swap state is “SWAP”. If entries satisfying this condition are found, one of their memory areas is selected as the swap target.

In this way, tasks whose execution state is “WAIT” are swapped out preferentially. Tasks that will be scheduled soon (those whose execution state is “PRE_RUN”) are thereby kept resident as far as possible, which reduces the swap-in work needed when they are scheduled. In addition, as described later, the present invention preferentially treats tasks that have been swapped out, and tasks that are scheduled to be swapped out, as movement target tasks. The control described above therefore also has the effect of preferentially making tasks that will not be scheduled for some time (those whose execution state is “WAIT”) movement target tasks (S202).

Ｓ２０２でスワップ対象を選択できなかった場合、スワップアウトすることはできないので処理を終了する。一方、スワップ対象を選択できたら、Ｓ２０４に進む（Ｓ２０３）。 If the swap target cannot be selected in S202, the process is terminated because it cannot be swapped out. On the other hand, if the swap target can be selected, the process proceeds to S204 (S203).

Next, the intra-node job control program P31 sends the job management process P2 a swap advance notice containing the job ID and task number of the memory area selected as the swap target in S202 (S204), and then waits for a fixed time (S205). The job management process P2 searches the task management table P24 using the job ID and task number in the received swap advance notice as keys, and changes the swap flag of the entry found to “PRE_OUT”. The purpose of this step is to inform the job management process P2 that the memory area has become a swap target, which raises the likelihood that the task is selected as a movement target. The task movement process is described in detail later.

ノード内ジョブ制御プログラムＰ３１は、一定時間待機した後、Ｓ２０２でスワップ対象として選択されたメモリ領域のジョブＩＤ及びタスク番号をキーとしてタスク管理テーブルＰ３３を検索する。そして、該当するエントリが検索された場合、検索されたエントリの実行状態が、「ＰＲＥ＿ＲＵＮ」又は「ＲＵＮ」であることを確認する。さらに、選択されたメモリ領域のジョブＩＤ、タスクＩＤ及びデータ名をキーとしてメモリ管理テーブルＰ３４を検索し、検索されたエントリのスワップ状態が「ＰＲＥ＿ＲＵＮ」であることを確認する。前述した全ての条件が確認できた場合、そのメモリ領域はスワップアウト可能であるので、次のステップに進む。一方、前述した一部の条件が確認できない場合、スワップアウトできないため、Ｓ２０１へ戻る（Ｓ２０６）。 The intra-node job control program P31 waits for a predetermined time, and then searches the task management table P33 using the job ID and task number of the memory area selected as the swap target in S202 as a key. When the corresponding entry is searched, it is confirmed that the execution state of the searched entry is “PRE_RUN” or “RUN”. Further, the memory management table P34 is searched using the job ID, task ID, and data name of the selected memory area as a key, and it is confirmed that the swap state of the searched entry is “PRE_RUN”. If all the above conditions are confirmed, the memory area can be swapped out, and the process proceeds to the next step. On the other hand, if some of the conditions described above cannot be confirmed, swap-out cannot be performed, and the process returns to S201 (S206).

The intra-node job control program P31 sends the job execution process P4 a swap request containing the job ID, task number, and data name of the memory area to be swapped out, thereby instructing the job execution process P4 to swap it out. The intra-node job control program P31 also changes the swap state of the corresponding memory management table P34 entry to “OUT” and sends the job management process P2 a swap start notification containing the job ID and task number of the memory area. The job management process P2 searches the task management table P24 using the job ID and task number in the swap start notification as keys, and changes the swap flag of the entry found to “BUSY”. After the swap process has been started, the process returns to S201 (S207).

Next, the task movement process is described in detail. When the load (memory usage rate, CPU usage rate) is unevenly distributed among the slave nodes of the system, the load is leveled by moving tasks from heavily loaded slave nodes to lightly loaded ones. A heavily loaded slave node is usually running several tasks, so there are several movement candidates. In the present invention, movement candidates are selected in the following order of priority.
(1) Tasks that are about to be swapped out.
(2) Tasks that will not be scheduled soon (tasks in the WAIT state).
(3) Tasks with part of their memory swapped out.
(4) Tasks that have not been swapped out.
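One compact way to read this priority list is as a lexicographic sort key. The sketch below is an interpretation, not the flowchart itself: the field names `exec_state` and `swap_flag` and both function names are hypothetical, and treating a swap flag of “SWAP” or “BUSY” as "part of the memory is swapped out" follows the swap flag values the description uses for the task management table P24.

```python
def move_priority(task):
    """Sort key: smaller tuples are moved first (False sorts before True)."""
    return (
        task["swap_flag"] != "PRE_OUT",             # (1) about to be swapped out
        task["exec_state"] != "WAIT",               # (2) not scheduled soon
        task["swap_flag"] not in ("SWAP", "BUSY"),  # (3) partly swapped out
    )                                               # (4) everything else


def pick_task_to_move(candidates):
    """Return the highest-priority movement candidate, or None."""
    return min(candidates, key=move_priority, default=None)
```

Because Python compares tuples element by element, a WAIT task with swapped-out memory ranks ahead of a WAIT task without it, and both rank ahead of any PRE_RUN task that is not about to be swapped out.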

スワップアウトされたメモリ領域を持つタスクをスケジューリングするにはスワップインが必要となる。このため、スワップアウトが発生するとＣＰＵの利用効率が低下する。また、本実施の形態では、移動後のタスクによって保持される全てのメモリ領域は、実メモリ上に割り当てられる。そこで、スワップアウト直前のタスクを優先的に移動することによって、スワップアウトの発生を抑制し、ＣＰＵの利用効率の低下を防止する。 To schedule a task with a swapped-out memory area, swap-in is required. For this reason, when swap-out occurs, the CPU utilization efficiency decreases. Further, in the present embodiment, all memory areas held by the moved task are allocated on the real memory. Therefore, by preferentially moving the task immediately before the swap-out, the occurrence of the swap-out is suppressed and a decrease in the CPU utilization efficiency is prevented.

Furthermore, if a task that is about to be scheduled is swapped out, a swap-in follows almost immediately, which lowers CPU utilization. A task that is being moved cannot be scheduled either, so selecting a task that is about to be scheduled as a movement target may also lower CPU utilization. Tasks that will not be scheduled soon are therefore preferentially selected as movement targets.

In the present embodiment, all memory areas held by a task after it has been moved are allocated in real memory, so moving a swapped-out task performs a swap-in at the same time. Preferentially moving swapped-out tasks therefore performs the swap-in, in effect, during the move and shortens the processing time.

次に、タスク移動の制御方法について詳しく説明する。このタスクの移動は、ジョブ管理プロセスＰ２のジョブ管理部Ｐ２１が制御する。 Next, a method for controlling task movement will be described in detail. The movement of this task is controlled by the job management unit P21 of the job management process P2.

図１１は、ジョブ制御プログラムＰ２１が移動対象タスクを選択する手順を示すフローチャートである。 FIG. 11 is a flowchart illustrating a procedure by which the job control program P21 selects a movement target task.

When the execution state of a task changes, the load on the slave node changes. When the number of slave nodes in the system changes, the load balance among the slave nodes changes. In either case it must be determined whether task movement should be started. For example, when a slave node is added, the job control program P21 registers information on the added slave node in the slave management table P22 and proceeds to the next step to determine whether task movement should be started (S301).

Next, the job control program P21 determines whether there are both slave nodes with spare computer resources and slave nodes that are short of computer resources (S302). Specifically, a slave node that is short of either CPU resources or memory resources is judged to be short of computer resources, whereas a slave node with a margin in both is judged to have spare computer resources.

ＣＰＵ資源については、スレーブ管理テーブルＰ２２に記載された最大タスク数と実行タスク数を比較することによって、ＣＰＵ資源に余裕があるか否かを判定する。具体的には、最大タスク数が実行タスク数より小さい場合、ＣＰＵ資源が不足すると判定する。また、実行タスク数が最大タスク数の一定割合（例えば、３０％）以下であれば、ＣＰＵ資源に余裕があると判定する。 As for the CPU resource, it is determined whether or not the CPU resource has a margin by comparing the maximum number of tasks described in the slave management table P22 with the number of execution tasks. Specifically, when the maximum number of tasks is smaller than the number of execution tasks, it is determined that CPU resources are insufficient. Further, if the number of execution tasks is equal to or less than a certain ratio (for example, 30%) of the maximum number of tasks, it is determined that there is a margin in CPU resources.

実行タスク数は、以下のように計算する。まず、スレーブ管理テーブルＰ２２を参照し、実行タスク数を調べたいスレーブノードに対応するエントリを選択する。次に、そのエントリのタスクＩＤ一覧（タスクＩＤセット）を取得し、タスクＩＤセットを一時的に保持する。タスクＩＤはジョブＩＤ及びタスク番号を含むので、それらを分離する。タスクＩＤに含まれるジョブＩＤ及びタスク番号をキーとして移動中タスク管理テーブルＰ２５を検索し、検索されたエントリの移動元スレーブＩＤが最初に選択されたスレーブ管理テーブルＰ２２のエントリのスレーブＩＤに一致している場合、そのタスクは他のノードに移動中であるため、タスクＩＤセットから取り除く。さらに、移動中タスク管理テーブルＰ２５の移動先スレーブＩＤに、最初に選択されたスレーブ管理テーブルＰ２２のエントリのスレーブＩＤに一致するエントリが存在すれば、そのタスクは他のノードから自身へ移動中であるため、タスクＩＤセットに追加する。このように作成されたタスクＩＤセットに含まれるタスクＩＤの数が実行タスク数である。 The number of execution tasks is calculated as follows. First, referring to the slave management table P22, an entry corresponding to the slave node whose number of execution tasks is to be checked is selected. Next, the task ID list (task ID set) of the entry is acquired, and the task ID set is temporarily held. Since the task ID includes a job ID and a task number, they are separated. The moving task management table P25 is searched using the job ID and task number included in the task ID as keys, and the source slave ID of the searched entry matches the slave ID of the entry of the first selected slave management table P22. If so, it is removed from the task ID set because it is moving to another node. Furthermore, if there is an entry that matches the slave ID of the entry of the slave management table P22 selected first in the movement destination slave ID of the moving task management table P25, the task is moving from another node to itself. Therefore, it is added to the task ID set. The number of task IDs included in the task ID set created in this way is the number of execution tasks.
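The adjustment for tasks that are in the middle of a move is a pair of set operations, sketched below. The function name and the dictionary keys (`src`, `dst`, `job_id`, `task_no`) are hypothetical stand-ins for the columns of the moving task management table P25; task IDs are modeled as (job ID, task number) tuples.

```python
def effective_task_ids(slave_id, task_ids, moving_table):
    """Task IDs that count toward a slave node once pending moves finish."""
    ids = set(task_ids)
    for move in moving_table:
        task_id = (move["job_id"], move["task_no"])
        if move["src"] == slave_id:
            ids.discard(task_id)  # leaving this node: do not count it
        if move["dst"] == slave_id:
            ids.add(task_id)      # arriving at this node: count it already
    return ids
```

The number of executing tasks for the node is then `len(effective_task_ids(...))`.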

メモリ資源については、スレーブ管理テーブルＰ２２に記載された最大メモリ量と使用メモリ量を比較することによって、メモリ資源に余裕があるか否かを判定する。具体的には、最大メモリ量が使用メモリ量より小さい場合、メモリ資源が不足すると判定する。また、使用メモリ量が最大メモリ量の一定割合（例えば、３０％）以下であれば、メモリ資源に余裕があると判定する。 For the memory resource, it is determined whether or not there is a margin in the memory resource by comparing the maximum memory amount described in the slave management table P22 with the used memory amount. Specifically, when the maximum memory amount is smaller than the used memory amount, it is determined that memory resources are insufficient. Further, if the used memory amount is equal to or less than a certain percentage (for example, 30%) of the maximum memory amount, it is determined that there is a margin in memory resources.

使用メモリ量は以下のように計算する。ＣＰＵ資源の確認処理で求めたタスクＩＤセットを、ジョブＩＤとタスク番号に分離し、分離されたジョブＩＤ及びタスク番号をキーとしてタスク管理テーブルＰ２４を検索する。そして、検索されたエントリがそのスレーブノードで動作する（実際には、タスク移動完了後に動作する）タスクに対応する。これらエントリの全データ量の和が、当該ノードにおける使用メモリ量である。 The amount of memory used is calculated as follows. The task ID set obtained in the CPU resource confirmation process is separated into a job ID and a task number, and the task management table P24 is searched using the separated job ID and task number as a key. The retrieved entry corresponds to a task that operates on the slave node (actually, it operates after completion of task movement). The sum of the total data amounts of these entries is the used memory amount in the node.

このように、計算機資源に余裕があるスレーブノードと、計算機資源が不足しているスレーブノードとの両方がシステム内に存在する場合、Ｓ３０３に進む。一方、計算機資源に余裕があるスレーブノードと、計算機資源が不足しているスレーブノードとの一方しかシステム内に存在しない場合、タスクを移動せず、処理を終える（Ｓ３０２）。 As described above, when both the slave node having sufficient computer resources and the slave node having insufficient computer resources exist in the system, the process proceeds to S303. On the other hand, if only one of the slave node having sufficient computer resources and the slave node having insufficient computer resources exists in the system, the task is not moved and the process is finished (S302).
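The S302 decision can be summarized as a per-resource classification followed by a check that both kinds of node exist. The following is a minimal sketch with hypothetical names (`level`, `classify_node`, `should_consider_moving`); the 0.3 default mirrors the 30% example given for both CPU and memory resources.

```python
def level(used, maximum, spare_ratio=0.3):
    """Classify one resource as "short", "spare" or "normal"."""
    if maximum < used:
        return "short"
    if used <= maximum * spare_ratio:
        return "spare"
    return "normal"


def classify_node(tasks, max_tasks, mem, max_mem):
    """A node is short if either resource is short, spare only if both are."""
    cpu, memory = level(tasks, max_tasks), level(mem, max_mem)
    if "short" in (cpu, memory):
        return "short"
    if cpu == memory == "spare":
        return "spare"
    return "normal"


def should_consider_moving(nodes):
    """True only when a short node and a spare node both exist."""
    kinds = {classify_node(**node) for node in nodes}
    return {"short", "spare"} <= kinds
```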

次に、ジョブ制御プログラムＰ２１は、Ｓ３０２で調査した計算機資源が不足するスレーブノードで実行中のタスクのうち、移動可能なタスクを抽出する（Ｓ３０３）。具体的には、スレーブ管理テーブルＰ２２を参照し、計算機資源が不足するスレーブノードに対応するエントリのタスクＩＤ一覧を取得する。さらに、タスクＩＤ一覧に記載されたタスクＩＤのジョブＩＤ及びタスク番号をキーとしてタスク管理テーブルＰ２４を検索し、ジョブＩＤ及びタスク番号が一致するエントリ群を抽出する。このエントリ群が、そのスレーブノードで実行中のタスクに対応するエントリである。このエントリ群の中で、実行状態が「ＰＲＥ＿ＲＵＮ」または「ＷＡＩＴ」であるタスクを抽出する。実行状態が上記のいずれかであるタスクは移動可能である。 Next, the job control program P21 extracts a moveable task among the tasks being executed in the slave node in which the computer resources investigated in S302 are insufficient (S303). Specifically, the slave management table P22 is referred to, and a task ID list of entries corresponding to slave nodes having insufficient computer resources is acquired. Further, the task management table P24 is searched using the job ID and task number of the task ID described in the task ID list as a key, and an entry group having the matching job ID and task number is extracted. This entry group is an entry corresponding to a task being executed on the slave node. In this entry group, a task whose execution state is “PRE_RUN” or “WAIT” is extracted. A task whose execution state is one of the above can be moved.

Ｓ３０３において、条件に一致するエントリが存在しないと判定された場合、移動可能なタスクが存在しないため、タスクを移動せず、処理を終える。一方、Ｓ３０３において、条件に一致するエントリを抽出できた場合、Ｓ３０５に進む（Ｓ３０４）。 If it is determined in S303 that there is no entry that matches the condition, there is no task that can be moved, so the task is not moved and the process ends. On the other hand, if an entry matching the condition can be extracted in S303, the process proceeds to S305 (S304).

次に、ジョブ制御プログラムＰ２１は、Ｓ３０３において抽出されたタスク管理テーブルＰ２４のエントリの中で、スワップフラグが「ＰＲＥ＿ＯＵＴ」であるエントリがあるか否かを判定する（Ｓ３０５）。 Next, the job control program P21 determines whether or not there is an entry whose swap flag is “PRE_OUT” among the entries of the task management table P24 extracted in S303 (S305).

スワップフラグが「ＰＲＥ＿ＯＵＴ」であるエントリが存在する場合、このエントリに対応するタスクは、スワップアウト直前の状態なので、移動対象タスクに選択する。さらに、Ｓ３０２で求めた計算機資源に余裕があるスレーブノードの一つを移動先ノードに選択する。そして、移動対象タスクのジョブＩＤ及びタスク番号を移動中タスク管理テーブルＰ２５に追加し、タスクを実行中のノードのスレーブＩＤを移動中タスク管理テーブルＰ２５の移動元スレーブＩＤに記載し、移動先ノードのスレーブＩＤを移動先スレーブＩＤに記載し、移動対象タスクの実行状態を移動中タスク管理テーブルＰ２５の実行状態に記載する。その後、処理をＳ３０２に戻る（Ｓ３０６）。 If there is an entry whose swap flag is “PRE_OUT”, the task corresponding to this entry is in the state immediately before the swap-out, so it is selected as the task to be moved. Further, one of the slave nodes having sufficient computer resources obtained in S302 is selected as the movement destination node. Then, the job ID and task number of the task to be moved are added to the moving task management table P25, the slave ID of the node that is executing the task is described in the movement source slave ID of the moving task management table P25, and the movement destination node Is described in the destination slave ID, and the execution state of the movement target task is described in the execution state of the moving task management table P25. Thereafter, the process returns to S302 (S306).

Ｓ３０５において、スワップフラグが「ＰＲＥ＿ＯＵＴ」であるエントリが存在しない場合、Ｓ３０３で抽出されたエントリ群から実行状態が「ＷＡＩＴ」であるエントリを抽出する。実行状態が「ＷＡＩＴ」であるエントリが存在する場合、Ｓ３０８へ進む。実行状態が「ＷＡＩＴ」であるエントリが存在しない場合、Ｓ３１１へ進む（Ｓ３０７）。 If there is no entry whose swap flag is “PRE_OUT” in S305, an entry whose execution state is “WAIT” is extracted from the entry group extracted in S303. If there is an entry whose execution state is “WAIT”, the process proceeds to S308. If there is no entry whose execution state is “WAIT”, the process proceeds to S311 (S307).

In S308, the job control program P21 checks the swap flag of each entry in the entry group extracted in S307 and determines whether there is a task whose swap flag is “SWAP” or “BUSY” (S308).

Ｓ３０８において、スワップフラグが「ＳＷＡＰ」又は「ＢＵＳＹ」のタスクが存在すると判定された場合、そのタスクのいずれか一つを移動対象タスクにて選択する。続いて、Ｓ３０６と同様の手順で、移動対象タスクの移動先ノードを決定し、移動対象タスクの情報を移動中タスク管理テーブルＰ２５に登録し、Ｓ３０２へ戻る（Ｓ３０９）。 If it is determined in S308 that there is a task whose swap flag is “SWAP” or “BUSY”, one of the tasks is selected as the movement target task. Subsequently, in the same procedure as in S306, the movement destination node of the movement target task is determined, the information of the movement target task is registered in the moving task management table P25, and the process returns to S302 (S309).

Ｓ３０８において、スワップフラグが「ＳＷＡＰ」のタスクも、「ＢＵＳＹ」のタスクも存在しないと判定された場合、スワップ中のタスクは存在しないので、Ｓ３０７で取り出したエントリ群の一つを移動対象タスクに選択する。その後、Ｓ３０６と同様の手順によって、移動対象タスクの移動先ノードを決定し、移動対象タスクの情報を移動中タスク管理テーブルＰ２５に登録し、Ｓ３０２へ戻る（Ｓ３１０）。 If it is determined in S308 that there is no task whose swap flag is “SWAP” or “BUSY”, there is no task being swapped, so one of the entry groups extracted in S307 is set as the movement target task. select. Thereafter, the destination node of the movement target task is determined by the same procedure as S306, the information of the movement target task is registered in the moving task management table P25, and the process returns to S302 (S310).

Ｓ３０７で実行状態が「ＷＡＩＴ」であるエントリが存在しないと判定された場合、Ｓ３０３において抽出されたエントリ群のすべてのエントリの実行状態は「ＰＲＥ＿ＲＵＮ」である。これらエントリ群のエントリのスワップフラグを確認し、スワップフラグが「ＳＷＡＰ」又は「ＢＵＳＹ」のタスクが存在するか否かを判定する（Ｓ３１１）。 If it is determined in S307 that there is no entry whose execution state is “WAIT”, the execution state of all entries in the entry group extracted in S303 is “PRE_RUN”. The swap flags of the entries of these entry groups are confirmed, and it is determined whether or not there is a task whose swap flag is “SWAP” or “BUSY” (S311).

Ｓ３１１において、スワップフラグが「ＳＷＡＰ」又は「ＢＵＳＹ」のタスクが存在すると判定された場合、そのタスクの一つを移動対象タスクに選択する。その後、Ｓ３０６と同様の手順で、移動対象タスクの移動先ノードを決定し、移動対象タスクの情報を移動中タスク管理テーブルＰ２５に登録し、Ｓ３０２へ戻る（Ｓ３１２）。 If it is determined in S311 that there is a task whose swap flag is “SWAP” or “BUSY”, one of the tasks is selected as a movement target task. Thereafter, the destination node of the movement target task is determined in the same procedure as in S306, the information of the movement target task is registered in the moving task management table P25, and the process returns to S302 (S312).

Ｓ３１１において、スワップフラグが「ＳＷＡＰ」のタスクも、「ＢＵＳＹ」のタスクも存在しないと判定された場合、スワップ中のタスクは存在しないので、Ｓ３０３で抽出したエントリ群のいずれかを移動対象タスクとして選択する。続いて、Ｓ３０６と同様の手順で、移動対象タスクの移動先ノードを決定し、移動対象タスクの情報を移動中タスク管理テーブルＰ２５に登録し、Ｓ３０２へ戻る（Ｓ３１３）。 If it is determined in S311 that there is no task whose swap flag is “SWAP” or “BUSY”, there is no task being swapped, and therefore one of the entry groups extracted in S303 is set as the movement target task. select. Subsequently, a destination node of the movement target task is determined in the same procedure as in S306, information on the movement target task is registered in the moving task management table P25, and the process returns to S302 (S313).

このような手順によって、移動中タスク管理テーブルＰ２５に移動対象タスクの情報が作成される。移動対象タスク選択処理が完了した後、ジョブ制御プログラムＰ２１は、移動中タスク管理テーブルＰ２５に記載された各タスクを移動する。 By such a procedure, information on the movement target task is created in the moving task management table P25. After the movement target task selection process is completed, the job control program P21 moves each task described in the moving task management table P25.

次に、タスク移動処理の詳細を説明する。 Next, details of the task movement process will be described.

図１２は、本発明の実施形態のタスクを移動する手順を示すフローチャートである。 FIG. 12 is a flowchart illustrating a procedure for moving a task according to the embodiment of this invention.

ジョブ制御プログラムＰ２１は、移動中タスク管理テーブルＰ２５からエントリを一つ抽出し（Ｓ４０１）、以下の手順でタスクの移動処理を行う。なお、Ｓ４０１において抽出されたエントリに対応するタスクが、移動対象タスクである。 The job control program P21 extracts one entry from the moving task management table P25 (S401), and performs a task moving process according to the following procedure. Note that the task corresponding to the entry extracted in S401 is the movement target task.

まず、移動中タスク管理テーブルＰ２５に記載された全てのタスクについて、移動が完了しているか否かを判定する（Ｓ４０２）。その結果、全てのタスクの移動が完了している場合、タスク移動処理を終了する（Ｓ４０２）。 First, it is determined whether or not the movement has been completed for all tasks described in the moving task management table P25 (S402). As a result, when all the tasks have been moved, the task moving process is terminated (S402).

Ｓ４０３では、ジョブ制御プログラムＰ２１は、移動対象タスクのジョブＩＤとタスク番号を含む移動開始要求を、移動対象タスクを実行中のスレーブノードのノード内ジョブ管理プロセスＰ３に送信する。ジョブＩＤとタスク番号をキーとしてスレーブ管理テーブルＰ２２のタスクＩＤ一覧を検索する。検索されたエントリのノードアドレスが、移動対象タスクを実行中のノードのアドレスである。 In S403, the job control program P21 transmits a movement start request including the job ID and task number of the movement target task to the intra-node job management process P3 of the slave node that is executing the movement target task. The task ID list in the slave management table P22 is searched using the job ID and task number as keys. The node address of the retrieved entry is the address of the node that is executing the movement target task.

ノード内ジョブ管理プロセスＰ３は、移動開始要求を受けると、移動開始要求に含まれるジョブＩＤ及びタスク番号をキーとしてタスク管理テーブルＰ３３を検索し、検索されたエントリの実行状態を確認する。 Upon receiving the movement start request, the intra-node job management process P3 searches the task management table P33 using the job ID and task number included in the movement start request as keys, and confirms the execution status of the searched entry.

実行状態が「ＰＲＥ＿ＲＵＮ」又は「ＷＡＩＴ」であれば、状態を「ＭＩＧ＿ＳＥＮＤ」に変更する。さらに、ノード内ジョブ管理プロセスＰ３は、このタスクを実行中のジョブ実行プロセスＰ４に、移動対象タスクの実行状態を「ＭＩＧ＿ＳＥＮＤ」に変更するよう要求する。これを受けたジョブ実行プロセスＰ４は、移動対象タスクに対応するタスク管理テーブルＰ４４のエントリの実行状態を「ＭＩＧ＿ＳＥＮＤ」に変更する。その後、ノード内ジョブ管理プロセスＰ３は、移動開始通知をジョブ管理プロセスＰ２に返信する。 If the execution state is “PRE_RUN” or “WAIT”, the state is changed to “MIG_SEND”. Further, the intra-node job management process P3 requests the job execution process P4 that is executing this task to change the execution state of the migration target task to “MIG_SEND”. Receiving this, the job execution process P4 changes the execution state of the entry of the task management table P44 corresponding to the movement target task to “MIG_SEND”. Thereafter, the intra-node job management process P3 returns a movement start notification to the job management process P2.

On receiving the movement start notification from the intra-node job management process P3, the job control program P21 changes the execution state of the entry in the task management table P24 corresponding to the movement target task to “MIG_SEND” (S403) and proceeds to S405 (S404).

移動対象タスクに対応するタスク管理テーブルＰ３３のエントリの実行状態が「ＰＲＥ＿ＲＵＮ」、「ＷＡＩＴ」以外である場合、ノード内ジョブ管理プロセスＰ３は、ジョブ管理プロセスＰ２にエラーを返信する。ジョブ制御プログラムＰ２１は、このエラーを受けた場合、移動対象タスクを移動することができないので、移動処理を中断し、Ｓ４０１に戻る（Ｓ４０４）。 If the execution state of the entry in the task management table P33 corresponding to the movement target task is other than “PRE_RUN” or “WAIT”, the intra-node job management process P3 returns an error to the job management process P2. When the job control program P21 receives this error, it cannot move the task to be moved, so the movement process is interrupted and the process returns to S401 (S404).
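The state check performed on the source node when a movement start request arrives can be sketched as a guarded state transition. The exception class, the function name, the dictionary-based table, and the returned string are all hypothetical; they only illustrate the accept-or-reject behavior described for the intra-node job management process P3.

```python
class MoveRefused(Exception):
    """Raised when the task is in a state from which it cannot be moved."""


def start_move(task_table, job_id, task_no):
    """Switch a movable task to "MIG_SEND", or refuse with an error."""
    entry = task_table[(job_id, task_no)]
    if entry["exec_state"] not in ("PRE_RUN", "WAIT"):
        # For example, a task in the "RUN" state: report an error to the caller.
        raise MoveRefused(f"task {job_id}/{task_no} is {entry['exec_state']}")
    entry["exec_state"] = "MIG_SEND"
    return "MOVE_STARTED"  # stands in for the movement start notification
```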

ジョブ制御プログラムＰ２１は、移動対象タスクに対応する移動中タスク管理テーブルＰ２５のエントリの移動先スレーブＩＤを参照し、このスレーブＩＤをキーとしてスレーブ管理テーブルＰ２２を検索する。検索されたエントリのノードアドレスが、移動先タスクを生成するノードのアドレスである。ジョブ制御プログラムＰ２１は、スレーブ管理テーブルＰ２２から検索されたエントリのタスクＩＤ一覧に、移動対象タスクのジョブＩＤに一致するジョブＩＤを持つタスクＩＤが含まれているか否かを判定する。 The job control program P21 refers to the movement destination slave ID of the entry in the moving task management table P25 corresponding to the movement target task, and searches the slave management table P22 using this slave ID as a key. The node address of the retrieved entry is the address of the node that generates the destination task. The job control program P21 determines whether or not the task ID list of the entry retrieved from the slave management table P22 includes a task ID having a job ID that matches the job ID of the movement target task.

If the determination finds no such task ID, no job execution process P4 for the job is running on that node. The job control program P21 therefore sends a process generation request (M02 in FIG. 8) to the intra-node job management process P3 of the destination node to generate a job execution process P4 for executing the task to be moved.
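The lookup and the decision whether a process generation request is needed can be illustrated with a short sketch. The function name `find_destination`, the dictionary layout standing in for the slave management table P22, and the representation of a task ID as a `(job_id, task_number)` pair are assumptions made for the example. The description only states that a task ID carries a job ID.

```python
def find_destination(slave_table, dest_slave_id, job_id):
    """Return (node_address, needs_process) for the destination slave.

    slave_table is an assumed stand-in for slave management table P22:
    a dict mapping slave ID to {"node_address": str, "task_ids": [...]},
    where each task ID is assumed to be a (job_id, task_number) pair.
    """
    entry = slave_table[dest_slave_id]
    job_present = any(jid == job_id for jid, _task_no in entry["task_ids"])
    # If no task of this job is on the node, a job execution process
    # must be generated there before the destination task is created.
    return entry["node_address"], not job_present
```

When `needs_process` is true, the caller would issue the process generation request before the task generation request.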

The job control program P21 then sends a task generation request (M08 in FIG. 8) to the intra-node job management process P3 of the destination node to generate the destination task (S405). The task generation request specifies the job ID, the task number, and an initial state of "MIG_RECV". On receiving the task generation request, the intra-node job management process P3 forwards it to the job execution process P4, and the destination task is generated within the job execution process P4. At this time, an entry whose execution state is "MIG_RECV" and whose job ID and task number match those of the task to be moved is created in both the task management table P44 of the job execution process P4 and the task management table P33 of the intra-node job management process P3.

Next, the job control program P21 creates an entry corresponding to the destination task in the task management table P24. The job ID and task number of the new entry match those of the destination task, its execution state is "MIG_RECV", its total data amount and swap amount are 0, and its swap flag is "IDLE".

Next, the job control program P21 sends a destination task generation notification to all job execution processes P4 that execute the job to which the task to be moved belongs. First, the job control program P21 refers to the task ID lists in the slave management table P22 and extracts the entries that contain the job ID of the task to be moved. Because the job is running on the nodes corresponding to these entries, the job control program P21 sends, to the node address of each of these entries, a destination task generation notification containing the job ID and task number of the task to be moved and the address of the destination node.

When the intra-node job management process P3 running on each node receives the destination task generation notification, it registers the information on the destination task in the task management table P33. The execution state of the newly created entry is "MIG_RECV". Next, the intra-node job management process P3 forwards the destination task generation notification to the job execution process P4 that is executing the job corresponding to the job ID contained in the notification.

When the job execution process P4 receives the destination task generation notification, it registers the information on the destination task in the task management table P44. The execution state of the newly created entry is "MIG_RECV". Once the entry has been created, the job execution process P4 returns a movement preparation completion notification to the intra-node job management process P3.

When the intra-node job management process P3 receives the movement preparation completion notification, it forwards the notification to the job management process P2. The job control program P21 proceeds to S407 after it has received the movement preparation completion notification from every node to which it sent the destination task generation notification (S406).
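The notification fan-out and the wait for all completion notifications (S406) amount to selecting the nodes that run the job and then checking that every one of them has answered. The sketch below shows only that bookkeeping. The function names and the table layout are hypothetical, task IDs are again assumed to be `(job_id, task_number)` pairs, and the actual message transport is left out.

```python
def nodes_running_job(slave_table, job_id):
    """Addresses of all nodes whose task ID list contains the job."""
    return sorted(
        entry["node_address"]
        for entry in slave_table.values()
        if any(jid == job_id for jid, _task_no in entry["task_ids"])
    )


def all_nodes_ready(notified, ready):
    """True once every notified node has reported preparation complete."""
    return set(notified) <= set(ready)
```

In this sketch the job control program would keep collecting completion notifications into `ready` and move on to the data movement step only when `all_nodes_ready` returns true.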

The job control program P21 sends a data movement request containing the job ID and task number of the task to be moved and the address of the destination node to the job execution process P4 that is executing the source task. Specifically, the data movement request is sent via the intra-node job management process P3, in the same way as in the procedure described for S403.

When the task movement program P43 of the job execution process P4 receives the data movement request, it searches the memory management table P45 using the task number contained in the request as a key. The retrieved entries correspond to the memory areas that need to be moved. The task movement program P43 refers to the data ID of each retrieved entry, reads the data from the corresponding memory area, and sends it to the job execution process P4 of the destination node. When the job execution process P4 of the destination node receives the data, it registers information on the received data in its memory management table P45. However, because a data ID identifies a memory area managed separately on each node, its value is not the same as in the memory management table P45 of the source node.

If the data has already been swapped out, the data ID in the memory management table P45 is blank and the swap file name field contains the name of the swap file. In this case, the contents of the swap file are sent. When data is sent from a swap file, no swap file is created on the destination node, and the data is allocated to a real memory area. After all data has been sent, the job execution process P4 of the source node deletes the entries corresponding to the sent data from the memory management table P45 and releases the corresponding memory areas. It then sends a data movement completion notification to the job management process P2 via the intra-node job management process P3 (S407).
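The data movement step (S407) can be sketched as a pair of functions, one for each node. Everything about the representation is an assumption made for illustration: the memory management table P45 is modeled as a list of dictionaries, memory areas and swap files as in-process dictionaries, and the network transfer as a plain return value. The handling of a blank data ID, the allocation to real memory on the destination, and the cleanup after sending follow the description.

```python
import itertools


def send_task_data(src_table, src_memory, swap_files, task_no):
    """Collect the task's data at the source, then drop it from the source."""
    mine = [e for e in src_table if e["task_no"] == task_no]
    blocks = []
    for entry in mine:
        if entry["data_id"] is None:
            # Swapped out: the data ID is blank, so send the swap file contents.
            blocks.append(swap_files[entry["swap_file"]])
        else:
            blocks.append(src_memory[entry["data_id"]])
    # After all data has been sent, delete the entries and release memory.
    for entry in mine:
        src_table.remove(entry)
        if entry["data_id"] is not None:
            del src_memory[entry["data_id"]]
    return blocks


def receive_task_data(dst_table, dst_memory, next_id, task_no, blocks):
    """Register received data in real memory under node-local data IDs."""
    for block in blocks:
        data_id = next(next_id)  # data IDs are assigned per node
        dst_memory[data_id] = block
        # No swap file is created on the destination node.
        dst_table.append({"task_no": task_no, "data_id": data_id, "swap_file": None})


# Usage with made-up data: task 3 has one resident block and one swapped-out block.
src_table = [
    {"task_no": 3, "data_id": 1, "swap_file": None},
    {"task_no": 3, "data_id": None, "swap_file": "swap_3.bin"},
    {"task_no": 4, "data_id": 2, "swap_file": None},
]
src_memory = {1: b"abc", 2: b"xyz"}
swap_files = {"swap_3.bin": b"def"}
dst_table, dst_memory = [], {}

blocks = send_task_data(src_table, src_memory, swap_files, task_no=3)
receive_task_data(dst_table, dst_memory, itertools.count(100), 3, blocks)
```

The sketch leaves the source swap file itself in place, since the description does not say what happens to it after the transfer.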

When the job control program P21 receives the data movement completion notification, it sends a task discard request containing the job ID and task number of the task to be moved to the job execution process P4 that is executing the source task. Specifically, the task discard request is sent via the intra-node job management process P3, in the same way as in the procedure described for S403. When the job execution process P4 receives the task discard request, it deletes the task whose discarding was requested and deletes the entry corresponding to the deleted task from the task management table P44.

Further, the job control program P21 searches the moving task management table P25 for the entry corresponding to the task to be moved and obtains the execution state of the retrieved entry. The job control program P21 then sends a moved task return request containing this execution state and the job ID and task number of the task to be moved to the job execution process P4 of the destination node. Specifically, the moved task return request is sent via the intra-node job management process P3, in the same way as in the procedure described for S403.

When the job execution process P4 receives the moved task return request, it searches the task management table P44 using the task number contained in the request as a key and changes the execution state of the retrieved entry to the execution state contained in the request (S408).
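The final two requests, discarding the source task and returning the destination task to its recorded execution state (S408), can be sketched as below. The function name `finish_movement` and the dictionary layouts are hypothetical. The sketch also assumes that the entry in the moving task management table P25 holds the execution state the task should resume in, which the description implies but does not spell out.

```python
def finish_movement(moving_table, src_tasks, dst_tasks, task_key):
    """Discard the source task and restore the recorded state at the destination.

    moving_table stands in for the moving task management table P25.
    src_tasks and dst_tasks stand in for table P44 on the source and
    destination nodes; all three map (job_id, task_number) to a dict.
    """
    saved_state = moving_table[task_key]["state"]
    # Task discard request: remove the source task and its table entry.
    del src_tasks[task_key]
    # Moved task return request: replace MIG_RECV with the recorded state.
    dst_tasks[task_key]["state"] = saved_state
    return saved_state
```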

The above steps complete the movement of one task. The job control program P21 returns to S401, moves the other tasks listed in the moving task management table P25, and repeats this processing until all tasks have been moved.

The scope of the present invention includes modifications that a person skilled in the art could make based on the content disclosed in this specification. For example, although this embodiment uses three types of computers (the client node 10, the master node 20, and the slave nodes 30), these computers may be shared. For instance, the client process P1 may run on the master node 20, so that the master node 20 also serves as the client node 10. Likewise, the intra-node job management process P3 and the job execution process P4 may run on the master node 20, so that the master node 20 is operated as one of the slave nodes 30.

In addition, although the embodiment of the present invention uses the distributed FS as the storage location for initial data, any other file system may be used as long as it can be accessed from all nodes under the same path name. For example, NFS (Network File System) may be used.

10 Client node
20 Master node
30 Slave node
P1 Client process
P2 Job management process
P21 Job control program
P22 Slave management table
P23 Job management table
P24 Task management table
P25 Moving task management table
P3 Intra-node job management process
P31 Intra-node job control program
P32 In-host job management table
P33 Task management table
P34 Memory management table
P4 Job execution process
P41 Task control program
P42 Data movement program
P43 Task movement program
P44 Task management table
P45 Memory management table
P46 Task
P47 Task processing code
P5 Distributed FS
101 CPU
102 LAN I/F
103 Input/output I/F
104 Memory
105 Storage I/F
106 Keyboard
107 Mouse
108 Display
109 Storage device

Claims

1. A computer system comprising a plurality of computers connected by a network, each computer executing, according to scheduling, a plurality of tasks constituting a job,
wherein, when a task is moved between the computers, the execution states of the tasks are referred to and a task whose processing has ended is preferentially selected as the task to be moved.

2. The computer system according to claim 1, wherein
the computer system constitutes a parallel execution environment in which each computer executes a plurality of tasks in parallel,
the processing of each task and the processing of waiting for completion of the processing of all tasks constituting the job are defined as one processing phase, and
when a task is moved between the computers, the execution states of the tasks are referred to and a task whose processing in the processing phase has ended is preferentially selected as the task to be moved.

3. The computer system according to claim 1, wherein
each of the computers has a swap-out function for saving data stored in a memory area used by a task to a secondary storage device and a swap-in function for storing the saved data in memory, and
a task that uses the saved data is preferentially selected as the task to be moved.

4. The computer system according to claim 3, wherein each of the computers schedules tasks so that a task with less data to be swapped in is preferentially executed.

5. The computer system according to claim 3, wherein each of the computers preferentially selects, as data to be swapped out, data used by a task whose processing has ended.

6. The computer system according to claim 3, wherein
each of the computers sets a standby time before swapping out the data stored in the memory area used by a task, and
a task in the standby state is preferentially selected as the task to be moved.

7. A task management method in a computer system in which a plurality of computers connected to each other via a network execute, according to scheduling, a plurality of tasks constituting a job, the method comprising:
determining, when a task is moved between the computers, the execution states of the tasks; and
preferentially selecting, based on the determined execution states, a task whose processing has ended as the task to be moved.

8. The task management method according to claim 7, wherein
the computer system constitutes a parallel execution environment in which each computer executes a plurality of tasks in parallel,
the processing of each task and the processing of waiting for completion of the processing of all tasks constituting the job are defined as one processing phase, and
in the step of selecting the task, a task whose processing in the processing phase has ended is preferentially selected as the task to be moved.

9. The task management method according to claim 7, wherein
each of the computers has a swap-out function for saving data stored in a memory area used by a task to a secondary storage device and a swap-in function for storing the saved data in memory, and
the method further comprises preferentially selecting a task that uses the saved data as the task to be moved.

10. The task management method according to claim 9, further comprising scheduling tasks so that a task with less data to be swapped in is preferentially executed.

11. The task management method according to claim 9, further comprising preferentially selecting, as data to be swapped out, data used by a task whose processing has ended.

12. The task management method according to claim 9, further comprising:
setting a standby time before swapping out data stored in the memory area used by a task; and
preferentially selecting a task in the standby state as the task to be moved.