JP2011076304A - Device, method and program for allocating job

Info

Publication number: JP2011076304A
Application number: JP2009226106A
Authority: JP
Inventors: Yuki Kondo; 佑樹近藤; Kaname Takemoto; 要武本; Yuichi Mori; 森　　有一; Shuichi Tanaka; 修一田中; Takashi Oshima; 敬志大島
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2009-09-30
Filing date: 2009-09-30
Publication date: 2011-04-14

Abstract

PROBLEM TO BE SOLVED: To improve the use efficiency of calculating nodes by avoiding a bottleneck caused by waiting for I/O processing on the same data in a parallel batch processing system including a plurality of nodes.

SOLUTION: When a sub-job executed by a sub-job execution part 302 of a job allocation device 3 is locked out of access to shared data 422, processing is interrupted without being retried, the processing state is suspended and stored in a shared data I/O processing standby queue 314, and execution of a new sub-job is started. When the shared data processing of a new sub-job ends, the sub-jobs that include processing on the same shared data are extracted from the shared data I/O processing standby queue 314, and the shared data processing that was waiting is executed in a batch. When the suspended processing of a sub-job is resumed, it starts from the point where the sub-job was interrupted.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a job allocation device, a job allocation method, and a job allocation program used in batch processing systems in fields such as finance, distribution, and industry.

In recent years, grid technology, which connects a plurality of computers via a network to provide powerful computing resources, has attracted attention, and methods that apply this technology to batch processing to speed it up have been realized. To speed up batch processing further, a method has been proposed that controls the execution order of jobs so as to make effective use of the computation nodes (see, for example, Patent Document 1).

In the technique described in Patent Document 1, when a job is processed in parallel by a plurality of computers, the tasks of a low-priority job are saved to a waiting task queue and a high-priority job is allocated first, which improves the utilization efficiency of the computation nodes.

Patent Document 1: JP 2008-226023 A

However, the technique for controlling the job execution order described in Patent Document 1 is aimed at parallelizing processes and does not consider the parallelism of data access. When much of the processing targets shared data that all computation nodes read and write, the I/O (Input/Output) processing of the shared data becomes a bottleneck, and the CPUs (Central Processing Units) of the computation nodes cannot be fully used.

The present invention has been made in view of this background. Its object is to provide a job allocation device, a job allocation method, and a job allocation program that avoid the bottleneck caused by waiting for I/O processing and improve the utilization efficiency of the computation nodes.

The present invention controls the processing performed when access to shared data is locked out in a parallel batch processing system in which jobs are parallelized, distributed to a plurality of computers (job allocation devices), and executed with accompanying data access.
When a job is locked out of access to shared data, its processing is interrupted without retrying, its processing state is suspended and saved in a queue, and execution of a new job is started.
When the shared data processing of a new job is completed, the jobs that include processing on the same shared data are extracted from the suspended jobs by referring to the queue table, and the shared data processing that was waiting is executed in a batch.

According to the present invention, it is possible to provide a job allocation device, a job allocation method, and a job allocation program that avoid the bottleneck caused by waiting for I/O processing and improve the utilization efficiency of the computation nodes.

FIG. 1 is a functional block diagram showing a configuration example of a job allocation system including the job allocation device according to the present embodiment.
FIG. 2 is a diagram showing an example of the data structure of the sub-job management table according to the present embodiment.
FIG. 3 is a diagram showing an example of the data structure of the sub-job progress management table according to the present embodiment.
FIG. 4 is a diagram showing an example of the data structure of the shared data I/O processing wait queue table according to the present embodiment.
FIG. 5 is a diagram showing an example of the data structure of the shared data I/O processing wait queue according to the present embodiment.
FIG. 6 is a diagram showing an example of the data structure of the lock table according to the present embodiment.
FIG. 7 is a sequence diagram for explaining the job allocation method according to the present embodiment.
FIG. 8 is a sequence diagram for explaining the job allocation method according to the present embodiment.
FIG. 9 is a sequence diagram for explaining the job allocation method according to the present embodiment.
FIG. 10 is a sequence diagram for explaining the job allocation method according to the present embodiment.
FIG. 11 is a diagram for explaining how the location of the sub-jobs changes over time according to the present embodiment.
FIG. 12 is a diagram for explaining the aggregation processing of a financial institution, which is an example to which the present invention is applied.

Next, a mode for carrying out the present invention (referred to as the "embodiment") will be described in detail with reference to the drawings as appropriate.

In the present embodiment, aggregation processing that computes the total value of specific items is described as an example of batch processing in which every sub-job performs shared data processing. This aggregation batch processing is executed in a grid environment.

FIG. 12 is a diagram for explaining the aggregation processing of a financial institution, which is an example to which the present invention is applied.
Here, the job net shown in FIG. 12(a) is a collection of jobs associated with an execution order. When the job net is executed, the jobs in it are executed in that order. The job shown in FIG. 12(b) is the minimum unit of the elements that make up a business operation; specifically, it is a collection of commands, shell scripts, Windows (registered trademark) executable files, and the like.
FIG. 12(a) shows, as a job net, the processing order of the jobs by which, for example, a branch of a financial institution (bank) aggregates the transaction results of the business day (aggregation processing), creates slips and invoice data (slip creation and invoice creation), calculates sales (sales calculation), and outputs forms (form output). FIG. 12(b) shows the contents of "aggregation processing", one of the jobs in the job net of FIG. 12(a). "Aggregation processing" is a loop that repeats the finer job steps "data extraction", "accounting calculation", "personal master update", and "branch master update" for each customer. The sub-jobs shown in FIG. 12(c) are obtained by dividing the job into predetermined processing units, focusing on this loop in order to parallelize the job. For example, a job that processes customers 0001 to 3000 is divided into sub-jobs for customers 0001 to 1000, 1001 to 2000, and 2001 to 3000, which are then processed in parallel.
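The division into sub-jobs described above can be illustrated with a minimal Python sketch. The names `SubJob` and `split_job`, the identifier format, and the fixed unit size of 1000 customers are assumptions made for illustration; the publication only states that the job is divided into predetermined processing units.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubJob:
    """Hypothetical record for one sub-job covering a range of customer numbers."""
    job_id: str
    sub_job_id: str
    first_customer: int
    last_customer: int


def split_job(job_id: str, first: int, last: int, unit: int) -> list[SubJob]:
    """Divide the customer loop [first, last] into consecutive ranges of `unit` customers."""
    sub_jobs = []
    start = first
    index = 1
    while start <= last:
        end = min(start + unit - 1, last)  # the last range may be shorter
        sub_jobs.append(SubJob(job_id, f"{job_id}-{index:02d}", start, end))
        start = end + 1
        index += 1
    return sub_jobs


if __name__ == "__main__":
    for sj in split_job("J001", 1, 3000, 1000):
        print(f"{sj.sub_job_id}: {sj.first_customer:04d} to {sj.last_customer:04d}")
```

With the values from the example in the text (customers 0001 to 3000, unit 1000), this yields three sub-jobs with the same ranges as in FIG. 12(c).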

The present invention applies to a parallel batch processing system that divides a job into sub-jobs, distributes the sub-jobs to a plurality of computers (job allocation devices), and executes processing involving a large amount of data access. In such a system, the invention controls the processing performed when access to shared data in the DB server that manages the input/output data of the job is locked out.
For example, the aggregation processing of a financial institution needs to access two kinds of data held in the DB server: divided data, which manages items such as the transactions of individual customers, and shared data, which manages items such as the daily processing totals of each branch. In conventional parallel aggregation processing, when a computer accessed shared data and was locked out because another computer was accessing the same data, it retried repeatedly and then ended in a processing error. In the present invention, when a sub-job on a computer is locked out of access to shared data, the processing is interrupted without retrying, the processing state is suspended (temporarily interrupted) and saved in a queue, and execution of the next new sub-job is started.
When the new sub-job succeeds in accessing the shared data, the sub-jobs that request access to the same shared data are extracted from the suspended sub-jobs by referring to the saved queue table, and the shared data processing that was waiting is executed in a batch to carry out the aggregation.

FIG. 1 is a functional block diagram showing a configuration example of a job allocation system 5 including the job allocation device 3 according to the present embodiment.
The job allocation system 5 is a system that executes batch processing. In it, a job input device 1, a sub-job distribution device 2, a plurality of grid nodes (job allocation devices) 3 (3a, 3b, 3c), and a DB (DataBase) server 4 are connected via a communication network 6. The number of grid nodes (job allocation devices) 3 (3a, 3b, 3c) is not limited to three; it is sufficient that a plurality of grid nodes 3 are provided.

First, an outline of the job allocation processing of the job allocation system 5 will be described.
The sub-job distribution device 2, which receives a job from the job input device 1, divides the job into predetermined processing units and distributes the resulting sub-jobs to the grid nodes 3 (3a, 3b, 3c). Each grid node 3 that receives a sub-job executes it and accesses the divided data 421 and the shared data 422 of the DB server 4.
When a sub-job is locked out of access to the shared data 422, the grid node 3 interrupts the processing without retrying, thereby suspending (temporarily interrupting) the processing state, registers the sub-job in the shared data I/O processing wait queue table (shared data I/O processing wait queue information) 313, and starts executing the next new sub-job.
Next, when the grid node 3 succeeds in processing the shared data for the new sub-job, it refers to the shared data I/O processing wait queue table 313, extracts from the suspended sub-jobs those that include processing on the same shared data, and executes the waiting shared data processing in a batch. At this time, a suspended sub-job is resumed from the point at which its processing was interrupted, not from the beginning of the sub-job. This avoids the bottleneck caused by waiting for I/O processing.
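The suspend-without-retry and batch-resume behaviour outlined above can be sketched as a small single-process simulation. This is only an illustration under stated assumptions: `SharedStore` stands in for the DB server, `GridNode` for one grid node, the shared data is modelled as a running total, and all class, method, and identifier names are hypothetical rather than taken from the publication.

```python
from collections import deque


class SharedStore:
    """Stand-in for the DB server: shared totals plus one lock holder per shared data ID."""

    def __init__(self):
        self.totals = {}
        self.lock_owner = {}

    def try_lock(self, data_id, queue_id):
        owner = self.lock_owner.get(data_id)
        if owner is not None and owner != queue_id:
            return False  # locked by another node: the caller is locked out
        self.lock_owner[data_id] = queue_id
        return True

    def unlock(self, data_id, queue_id):
        if self.lock_owner.get(data_id) == queue_id:
            del self.lock_owner[data_id]

    def add(self, data_id, amount):
        self.totals[data_id] = self.totals.get(data_id, 0) + amount


class GridNode:
    def __init__(self, queue_id, store):
        self.queue_id = queue_id
        self.store = store
        self.suspended = deque()  # entries: (sub_job_id, data_id, pending_amount)

    def run(self, sub_job_id, data_id, amounts):
        """Run one sub-job; return the IDs of all sub-jobs completed by this call."""
        partial = sum(amounts)  # first half: processing of the divided data
        if not self.store.try_lock(data_id, self.queue_id):
            # Locked out: do not retry, suspend and let the caller start a new sub-job.
            self.suspended.append((sub_job_id, data_id, partial))
            return []
        self.store.add(data_id, partial)  # second half: shared data processing
        done = [sub_job_id]
        remaining = deque()
        while self.suspended:  # batch-resume waiters on the same shared data
            waiting_id, waiting_data, pending = self.suspended.popleft()
            if waiting_data == data_id:
                self.store.add(data_id, pending)  # resume from the interrupted point
                done.append(waiting_id)
            else:
                remaining.append((waiting_id, waiting_data, pending))
        self.suspended = remaining
        self.store.unlock(data_id, self.queue_id)
        return done


if __name__ == "__main__":
    store = SharedStore()
    node = GridNode("3a_SpdQ", store)
    store.try_lock("branch_001", "3b_SpdQ")            # another node holds the lock
    print(node.run("SJ1", "branch_001", [100, 200]))   # suspended
    print(node.run("SJ2", "branch_001", [50]))         # suspended
    store.unlock("branch_001", "3b_SpdQ")
    print(node.run("SJ3", "branch_001", [5]))          # completes SJ3, then SJ1 and SJ2
    print(store.totals)
```

The point of the design is that a locked-out sub-job keeps the result of its divided-data processing, so the node's CPU moves on to new work instead of spinning on retries, and the deferred shared-data updates are applied together once the lock is obtained.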

Next, each device will be described in detail.
The job input device 1 stores jobs and transmits a stored job to the sub-job distribution device 2. The job input device 1 includes a control unit 10, a memory unit 11, a storage unit 12, an input/output unit 13, and a communication unit 14.

The control unit 10 includes a job input unit 101. The job input unit 101 extracts a job from the job net stored in the job net storage unit 122 in the storage unit 12, stores it in the job storage unit 121, and transmits the stored job to the sub-job distribution device 2 via the communication unit 14.
The memory unit 11 consists of storage means such as a RAM (Random Access Memory) and temporarily stores information necessary for the processing of the control unit 10.
The storage unit 12 includes the job storage unit 121 and the job net storage unit 122. The storage unit 12 consists of storage means such as a hard disk or flash memory and stores, among other things, the programs used for the processing of the control unit 10.
The input/output unit 13 receives instructions concerning the processing of the control unit 10 from an external device and outputs information generated by the control unit 10 to an external device. The input/output unit 13 consists of an input interface and an output interface.
The communication unit 14 consists of a communication interface for transmitting and receiving information to and from each device via a LAN (Local Area Network) or a dedicated line.
The functions of the control unit 10 are realized, for example, by a CPU loading a program stored in the storage unit 12 of the job input device 1 into the memory unit 11 and executing it.

Next, the sub-job distribution device 2 divides the job acquired from the job input device 1 into predetermined processing units and distributes the resulting sub-jobs to the grid nodes 3. The sub-job distribution device 2 includes a control unit 20, a memory unit 21, a storage unit 22, an input/output unit 23, and a communication unit 24.
The memory unit 21, the input/output unit 23, and the communication unit 24 have the same functions as the memory unit 11, the input/output unit 13, and the communication unit 14 of the job input device 1, so their description is omitted.

The control unit 20 controls the sub-job distribution device 2 as a whole and includes a job dividing unit 201, a sub-job distribution unit 202, and a sub-job management unit 203. The functions of the control unit 20 are realized, for example, by a CPU loading a program stored in the storage unit 22 of the sub-job distribution device 2 into the memory unit 21 and executing it.
The job dividing unit 201 divides the job acquired from the job input device 1 into predetermined processing units to generate sub-jobs and stores them in the sub-job storage unit 221 in the storage unit 22.
The sub-job distribution unit 202 determines the grid node 3 (3a, 3b, 3c) to which each sub-job is distributed and transmits the sub-job to that grid node 3 via the communication unit 24. A predetermined number of jobs that may be suspended (temporarily interrupted) is set in the sub-job distribution unit 202 for use when deciding the distribution, and the sub-job distribution unit 202 distributes jobs within a range that does not exceed that number. This suppresses variation in job completion times among the grid nodes 3.
The sub-job management unit 203 stores and manages the progress information on the data processing of each sub-job distributed to the grid nodes 3 (3a, 3b, 3c) in the sub-job management table 222 (see FIG. 2, described later) in the storage unit 22.

The storage unit 22 includes the sub-job storage unit 221, which stores the sub-jobs produced by the job dividing unit 201, and the sub-job management table 222. It consists of storage means such as a hard disk or flash memory.

FIG. 2 is a diagram showing an example of the data structure of the sub-job management table 222 according to the present embodiment.
The sub-job management table 222 holds one record per sub-job. As shown in FIG. 2, each record includes the data items job ID 1001, sub-job ID 1002, distribution destination node 1003, and progress status 1004.

The job ID 1001 is a unique identifier of each job before it is divided into sub-jobs. The sub-job ID 1002 is a unique identifier assigned to each sub-job produced by the job dividing unit 201. The distribution destination node 1003 is the node name of the grid node 3 (3a, 3b, 3c) to which the sub-job distribution unit 202 distributed the sub-job. In the progress status 1004, the sub-job management unit 203 registers one of "unprocessed", "running", or "completed" according to the progress of the sub-job's data processing.
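One possible in-memory representation of a sub-job management table record is sketched below in Python. The four fields follow the data items named in the text; the class names, the dictionary keyed by sub-job ID, and the sample values are assumptions for illustration only.

```python
from dataclasses import dataclass
from enum import Enum


class Progress(Enum):
    """The three progress values named for progress status 1004."""
    UNPROCESSED = "unprocessed"
    RUNNING = "running"
    COMPLETED = "completed"


@dataclass
class SubJobRecord:
    job_id: str              # job ID 1001
    sub_job_id: str          # sub-job ID 1002
    destination_node: str    # distribution destination node 1003
    progress: Progress = Progress.UNPROCESSED  # progress status 1004


class SubJobManagementTable:
    """Hypothetical table holding one record per sub-job, keyed by sub-job ID."""

    def __init__(self):
        self.records = {}

    def register(self, job_id, sub_job_id, destination_node):
        self.records[sub_job_id] = SubJobRecord(job_id, sub_job_id, destination_node)

    def update(self, sub_job_id, progress):
        self.records[sub_job_id].progress = progress

    def count(self, progress):
        return sum(1 for r in self.records.values() if r.progress is progress)


if __name__ == "__main__":
    table = SubJobManagementTable()
    table.register("J001", "J001-01", "3a")
    table.register("J001", "J001-02", "3b")
    table.update("J001-01", Progress.RUNNING)
    print(table.count(Progress.UNPROCESSED))
```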

Returning to FIG. 1, the grid node (job allocation device) 3 will be described. The grid node 3 is a device that executes sub-jobs by accessing the DB server 4. It includes a control unit 30, a memory unit 31, a storage unit 32, an input/output unit 33, and a communication unit 34.
The input/output unit 33 and the communication unit 34 have the same functions as the input/output unit 13 and the communication unit 14 of the job input device 1, so their description is omitted.

The control unit 30 controls the grid node 3 as a whole and includes a sub-job control unit 301, a sub-job execution unit 302, a queue operation unit 303, and a queue management unit 304.
The functions of the control unit 30 are realized, for example, by a CPU loading a program stored in the storage unit 32 of the grid node 3 into the memory unit 31 and executing it.

The sub-job control unit 301 refers to the sub-job progress management table (sub-job progress management information) 312 (FIG. 3) in the memory unit 31, described later, and searches for unprocessed sub-jobs. It then assigns each unprocessed sub-job it finds to the sub-job execution unit 302 via the queue operation unit 303. The sub-job control unit 301 also updates the sub-job progress management table 312.

The sub-job execution unit 302 executes data input, data processing, data output, and so on according to the description of the sub-job. In doing so, it first executes the first half of the sub-job, which is the processing of the "divided data", and then executes the second half, which is the processing of the "shared data". The sub-job execution unit 302 also updates the shared data I/O processing wait queue table 313 (see FIG. 4), described later.

The queue operation unit 303 issues instructions to store and retrieve sub-jobs and, when a sub-job enters the state of waiting for shared data processing, collects information on the execution state of that sub-job.
Specifically, the queue operation unit 303 stores the sub-jobs received from the sub-job distribution device 2 via the communication unit 34 in the sub-job queue 311 in the memory unit 31. On receiving a command from the sub-job control unit 301, it retrieves a sub-job from the sub-job queue 311 and passes it to the sub-job execution unit 302. When the sub-job execution unit 302 fails to process shared data, the queue operation unit 303 collects the information needed to suspend the execution state of the sub-job and stores it in the shared data I/O processing wait queue 314 (see FIG. 5, described later). When the sub-job execution unit 302 succeeds in processing shared data, the queue operation unit 303 retrieves from the shared data I/O processing wait queue 314 the sub-jobs that request access to the same shared data as the sub-job that just processed it, and restores their execution state.

The queue management unit 304 uses the sub-job progress management table 312 in the memory unit 31 to check the number of sub-jobs that are unprocessed or waiting for shared processing, and transmits that information to the sub-job distribution device 2 via the communication unit 34.

Next, the memory unit 31 includes the sub-job queue 311, which stores the sub-jobs transmitted from the sub-job distribution device 2, the sub-job progress management table 312 (see FIG. 3), the shared data I/O processing wait queue table 313 (see FIG. 4), and the shared data I/O processing wait queue 314 (see FIG. 5). The information stored in the memory unit 31 may instead be stored in the storage unit 32.

FIG. 3 is a diagram showing an example of the data structure of the sub-job progress management table (sub-job progress management information) 312 according to the present embodiment.
The sub-job progress management table 312 holds one record per sub-job placed on the grid node 3. As shown in FIG. 3, each record includes the data items job ID 1001, sub-job ID 1002, distribution source 1005, and sub-job progress status 1006.

Here, the distribution source 1005 is the node name of the sub-job distribution device 2 from which the sub-job was received. In the sub-job progress status 1006, the sub-job control unit 301 stores one of "unprocessed", "running", "waiting for shared processing", or "completed" according to the progress of the sub-job's data processing. "Waiting for shared processing" indicates that the sub-job execution unit 302 could not process the shared data and that the sub-job is stored in the shared data I/O processing wait queue 314 (see FIG. 5).
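The four progress values, and the count of pending sub-jobs that the queue management unit 304 reports to the sub-job distribution device 2, might be modelled as follows. This is a sketch under assumptions: the enum, the record class, and the function `count_pending` are hypothetical names, and the sample rows are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class SubJobProgress(Enum):
    """The four values named for sub-job progress status 1006."""
    UNPROCESSED = "unprocessed"
    RUNNING = "running"
    WAITING_SHARED = "waiting for shared processing"
    COMPLETED = "completed"


@dataclass
class ProgressRecord:
    job_id: str                  # job ID 1001
    sub_job_id: str              # sub-job ID 1002
    distribution_source: str     # distribution source 1005
    progress: SubJobProgress     # sub-job progress status 1006


def count_pending(table):
    """Count sub-jobs that are unprocessed or waiting for shared processing."""
    pending = (SubJobProgress.UNPROCESSED, SubJobProgress.WAITING_SHARED)
    return sum(1 for record in table if record.progress in pending)


if __name__ == "__main__":
    table = [
        ProgressRecord("J001", "J001-01", "distributor-2", SubJobProgress.COMPLETED),
        ProgressRecord("J001", "J001-02", "distributor-2", SubJobProgress.WAITING_SHARED),
        ProgressRecord("J001", "J001-03", "distributor-2", SubJobProgress.RUNNING),
        ProgressRecord("J001", "J001-04", "distributor-2", SubJobProgress.UNPROCESSED),
    ]
    print(count_pending(table))
```

A distributor could compare this count with its configured limit on suspendable jobs before sending more work to the node, although the exact use of the count is not spelled out in this part of the description.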

FIG. 4 is a diagram showing an example of the data structure of the shared data I/O processing wait queue table (shared data I/O processing wait queue information) 313 according to the present embodiment.
The shared data I/O processing wait queue table 313 holds one record per sub-job that was locked out of access to shared data. As shown in FIG. 4, each record includes the data items job ID 1001, sub-job ID 1002, queue ID 1007, and shared data ID 1008.

Here, the queue ID 1007 is a queue identifier that distinguishes the queue on the grid node 3 to which the sub-job was distributed from the queues on other grid nodes 3. For example, sub-jobs distributed to grid node 3 (3a) are given the queue ID 1007 "3a_SpdQ", and sub-jobs distributed to grid node 3 (3b) are given "3b_SpdQ". The shared data ID 1008 is the identifier of the shared data 422 that the sub-job requested and is now waiting to process.
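A lookup over the queue table of the kind described, finding the suspended sub-jobs on a node's own queue that wait for a given shared data ID, could look like the following Python sketch. The `_SpdQ` suffix comes from the examples in the text; the row type, the helper names, and the shared data IDs are assumptions for illustration.

```python
from typing import NamedTuple


class QueueTableRow(NamedTuple):
    job_id: str           # job ID 1001
    sub_job_id: str       # sub-job ID 1002
    queue_id: str         # queue ID 1007
    shared_data_id: str   # shared data ID 1008


def queue_id_for(node_name: str) -> str:
    """Build a queue ID in the style of the examples, e.g. '3a' -> '3a_SpdQ'."""
    return f"{node_name}_SpdQ"


def find_waiting(rows, queue_id, shared_data_id):
    """Return the IDs of sub-jobs on the given queue that wait for the given shared data."""
    return [
        row.sub_job_id
        for row in rows
        if row.queue_id == queue_id and row.shared_data_id == shared_data_id
    ]


if __name__ == "__main__":
    rows = [
        QueueTableRow("J001", "J001-01", queue_id_for("3a"), "branch_001"),
        QueueTableRow("J001", "J001-02", queue_id_for("3a"), "branch_002"),
        QueueTableRow("J001", "J001-03", queue_id_for("3b"), "branch_001"),
        QueueTableRow("J001", "J001-04", queue_id_for("3a"), "branch_001"),
    ]
    print(find_waiting(rows, queue_id_for("3a"), "branch_001"))
```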

FIG. 5 is a diagram showing an example of the data structure of the shared data I/O processing wait queue 314 according to the present embodiment.
The shared data I/O processing wait queue 314 holds the information used to save the suspended state of a sub-job. Each entry has a header part 1009 and a body part 1010.

The header part 1009 contains the data items job ID 1001, sub-job ID 1002, queue ID 1007, and shared data ID 1008. These are the same data items, with the same contents, as one record of the shared data I/O processing wait queue table 313 shown in FIG. 4.

The body part 1010 stores the following data: sub-job 1011, checkpoint 1012, input source data name 1013, output destination data name 1014, and unoutput data value 1015.
Here, the sub-job 1011 stores the calculation instructions and processing described in the sub-job. The checkpoint 1012 is information indicating how far through the job steps the sub-job 1011 has progressed. The input source data name 1013 is the name of the data the sub-job read as input. The output destination data name 1014 is the name of the data the sub-job writes as output. The unoutput data value 1015 holds the data items that had not yet been output at the time of suspension, together with their values.
Saving this information in the body part 1010 allows a sub-job that was suspended partway through to be resumed and processed from the point of suspension.
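A queue entry with the header and body fields listed above, and a resume step that continues from the saved checkpoint, can be sketched as follows. The field types are assumptions: the sub-job is modelled as a list of job step names, the checkpoint as the number of steps already completed, and the unoutput data values as a dictionary. The data names in the usage example are invented.

```python
from dataclasses import dataclass, field


@dataclass
class QueueHeader:
    job_id: str           # job ID 1001
    sub_job_id: str       # sub-job ID 1002
    queue_id: str         # queue ID 1007
    shared_data_id: str   # shared data ID 1008


@dataclass
class QueueBody:
    sub_job: list                 # sub-job 1011, modelled as an ordered list of job steps
    checkpoint: int               # checkpoint 1012, modelled as the count of completed steps
    input_data_name: str          # input source data name 1013
    output_data_name: str         # output destination data name 1014
    unoutput_values: dict = field(default_factory=dict)  # unoutput data value 1015


@dataclass
class QueueEntry:
    header: QueueHeader
    body: QueueBody


def remaining_steps(entry: QueueEntry) -> list:
    """Return the job steps still to run, starting from the saved checkpoint."""
    return entry.body.sub_job[entry.body.checkpoint:]


if __name__ == "__main__":
    steps = ["data extraction", "accounting calculation",
             "personal master update", "branch master update"]
    entry = QueueEntry(
        QueueHeader("J001", "J001-01", "3a_SpdQ", "branch_001"),
        QueueBody(steps, 3, "transactions_0001_1000", "branch_master",
                  {"daily_total": 300}),
    )
    print(remaining_steps(entry))
```

In this sketch the suspended sub-job has finished the three steps that touch only divided data, so only the shared-data step remains, and the value it still needs to write is carried in `unoutput_values`.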

Returning to FIG. 1, the DB server 4 will be described. The DB server 4 manages the input and output data of jobs, and includes a control unit 40, a memory unit 41, a storage unit 42, an input/output unit 43, and a communication unit 44.
The memory unit 41, the input/output unit 43, and the communication unit 44 have the same functions as the memory unit 11, the input/output unit 13, and the communication unit 14 of the job input device 1, so their description is omitted.

The control unit 40 includes a data control unit 401 and controls the DB server 4 as a whole. The functions of the control unit 40 are realized, for example, by the CPU loading a program stored in the storage unit 42 of the DB server 4 into the memory unit 41 and executing it.
The data control unit 401 processes the divided data 421 and the shared data 422 in the storage unit 42 according to sub job execution instructions from the sub job execution unit 302 of a grid node 3. It manages data input and output using the data management table 423 in the storage unit 42, and updates the lock table 424 (see FIG. 6, described later) held in the data management table 423.
Specifically, when the shared data 422 is locked (inaccessible) by one grid node 3 and another grid node 3 requests access to it, the data control unit 401 applies exclusive control to the requesting sub job and records information on that sub job in the lock table 424 (see FIG. 6). When the shared data 422 is not locked, the data control unit 401 registers in the lock table 424 the queue ID 1007 of the grid node 3 that sent the sub job accessing the shared data 422.
The presence of a queue ID 1007 in the lock table 424 is how the data control unit 401 confirms that the shared data 422 is locked by that grid node 3.

FIG. 6 is a diagram showing an example of the data structure of the lock table 424 according to the present embodiment.
For each item of shared data 422, the lock table 424 stores the queue that acquired the lock, if the shared data 422 is locked, together with information on the sub jobs that are requesting a lock on that shared data 422. The lock table 424 contains the data items shared data ID 1008, lock acquisition queue 1016, and lock request sub job 1017.

Here, the shared data ID 1008 is the identifier of the shared data 422 for which processing is waiting. The lock acquisition queue 1016 stores the queue ID 1007 of the grid node 3 that holds the lock on the shared data 422. The lock request sub job 1017 registers the job ID, sub job ID, and node name of each sub job that is waiting for the lock to be released so that it can process the shared data 422. These entries, one per sub job that is waiting because the shared data is locked, are stored in the order in which the sub jobs began waiting.
Because the authority to lock and unlock the shared data 422 is managed through the queue ID 1007 of a grid node 3 stored in the lock acquisition queue 1016, interruption by other grid nodes 3 can be avoided.
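The behavior of the lock table 424 described above can be sketched as follows. This is an illustrative model, not the patent's implementation: the class and method names are hypothetical, and the rule that a request from the queue ID already holding the lock is granted again is an assumption inferred from the lock being owned by a queue rather than by a single sub job.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, Optional, Tuple

Waiter = Tuple[str, str, str]  # (job ID, sub job ID, node name)


@dataclass
class LockRecord:
    lock_holder: Optional[str] = None  # lock acquisition queue 1016 (a queue ID 1007)
    waiters: Deque[Waiter] = field(default_factory=deque)  # lock request sub job 1017


class LockTable:
    """One LockRecord per shared data ID 1008."""

    def __init__(self) -> None:
        self.records: Dict[str, LockRecord] = {}

    def request(self, shared_data_id: str, queue_id: str, waiter: Waiter) -> bool:
        """Return True if access is granted, False if the sub job must wait."""
        rec = self.records.setdefault(shared_data_id, LockRecord())
        # Assumption: the queue that already holds the lock is granted again.
        if rec.lock_holder is None or rec.lock_holder == queue_id:
            rec.lock_holder = queue_id
            return True
        rec.waiters.append(waiter)  # kept in the order the sub jobs began waiting
        return False

    def unlock(self, shared_data_id: str, queue_id: str, node_name: str) -> None:
        """Clear the holder and drop that node's entries from the waiters."""
        rec = self.records[shared_data_id]
        if rec.lock_holder != queue_id:
            raise PermissionError("only the holding queue may unlock")
        rec.lock_holder = None
        rec.waiters = deque(w for w in rec.waiters if w[2] != node_name)


# Illustrative use with made-up IDs: node 3b locks first, node 3a must wait.
table = LockTable()
first = table.request("D1", "Q3b", ("J1", "2", "3b"))
second = table.request("D1", "Q3a", ("J1", "1", "3a"))
table.unlock("D1", "Q3b", "3b")
third = table.request("D1", "Q3a", ("J1", "1", "3a"))
```

Rejecting an unlock from any queue other than the holder is one way to express the statement that interruption by other grid nodes is avoided.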

Returning to FIG. 1, the storage unit 42 of the DB server 4 contains the divided data 421, which one of the grid nodes 3 (3a, 3b, 3c) processes exclusively; the shared data 422, which all the grid nodes 3 read, process, and write in common; and the data management table 423. The data management table 423 stores the lock table 424 described above, which the data control unit 401 uses when performing exclusive control of data.

Next, the job allocation method performed by the job allocation system 5, which includes the grid nodes (job allocation apparatuses) 3 according to the present embodiment, will be described in detail using FIGS. 7 to 10, with reference to FIGS. 1 to 6 as needed.

FIGS. 7 to 10 are sequence diagrams for explaining the job allocation method according to the present embodiment. The description assumes that the job input device 1 has already sent the job to the sub job distribution device 2, and that the job division unit 201 of the sub job distribution device 2 has already divided the job into sub jobs and saved them in the sub job storage unit 221.

First, as shown in FIG. 7, the queue management unit 304 of the grid node 3 (3a) refers to the sub job progress management table 312 (see FIG. 3) and checks the number of sub jobs that are unprocessed or waiting for shared processing (step S701). The queue management unit 304 then sends the result to the sub job distribution unit 202 of the sub job distribution device 2 (step S702).

Next, if the result obtained in step S701 does not exceed a predetermined upper limit, the sub job distribution unit 202 determines the grid node to which sub jobs will be distributed (step S703). If the upper limit is exceeded, the sub job distribution unit 202 does not perform the distribution processing.
The sub job distribution unit 202 then acquires sub jobs from the sub job storage unit 221 (step S704), allocates them among the grid nodes 3 (3a, 3b, 3c) (step S705), and sends the allocated sub jobs to the respective grid nodes 3 (3a, 3b, 3c) (step S706). The description here assumes that a sub job has been distributed to the grid node 3 (3a). The sub job distribution unit 202 also updates the sub job management table 222 (see FIG. 2) via the sub job management unit 203 (step S707). For example, in the record for the distributed sub job, the distribution destination node 1003 is set to the destination grid node "3a", and the progress status 1004 is changed from "unprocessed" to "running".
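The gating in steps S703 to S706 can be sketched as below. The description only says that distribution is skipped when the reported count exceeds a predetermined upper limit. The function name, the policy of one sub job per eligible node per round, and the sample values are assumptions made for illustration.

```python
from collections import deque
from typing import Deque, Dict


def distribute(
    stored: Deque[str],
    pending_counts: Dict[str, int],
    upper_limit: int,
) -> Dict[str, str]:
    """Map sub job ID -> node name for one distribution round.

    pending_counts holds, per node, the number of sub jobs reported as
    unprocessed or waiting for shared processing (steps S701, S702).
    """
    assignments: Dict[str, str] = {}
    for node, pending in pending_counts.items():
        if not stored:
            break  # nothing left in the sub job storage unit
        if pending > upper_limit:
            continue  # limit exceeded: no distribution to this node
        # Assumption: one sub job per eligible node per round.
        assignments[stored.popleft()] = node
    return assignments


stored_sub_jobs = deque(["1", "2", "3"])
result = distribute(stored_sub_jobs, {"3a": 0, "3b": 5, "3c": 2}, upper_limit=2)
```

Here node 3b is skipped because its count of 5 exceeds the limit of 2, while node 3c is still eligible because a count equal to the limit does not exceed it.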

Next, the queue operation unit 303 of the grid node 3 (3a) saves the sub job received in step S706 in the sub job queue 311 (see FIG. 1) (step S708), and asks the sub job control unit 301 to update its information to reflect the placement of a new sub job on the grid node 3 (3a) (step S709). The sub job control unit 301 adds a new record to the sub job progress management table 312 (see FIG. 3) and, based on the data received, registers the job ID 1001, sub job ID 1002, distribution source 1005, and sub job progress status 1006 (step S710). At this point the sub job progress status 1006 of the newly added record is "unprocessed".

The sub job control unit 301 then refers to the sub job progress management table 312 (see FIG. 3) and searches for unprocessed sub jobs (step S711). If there is an unprocessed sub job, the sub job control unit 301 generates a command instructing the queue operation unit 303 to assign the sub job to the sub job execution unit 302 (step S712) and sends this sub job assignment command (step S713). The sub job control unit 301 also updates the sub job progress status 1006 of the corresponding record in the sub job progress management table 312 from "unprocessed" to "running" (step S714).
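Steps S711 to S714 amount to a scan of the progress table followed by a status update. The following sketch shows that scan under assumed names: the function `dispatch_next`, the dictionary keys, and the sample records are all hypothetical, and the table is modeled as a plain list.

```python
from typing import Dict, List, Optional


def dispatch_next(progress_table: List[Dict[str, str]]) -> Optional[Dict[str, str]]:
    """Pick the first unprocessed sub job and mark it as running."""
    for record in progress_table:
        if record["status"] == "unprocessed":  # S711: search for unprocessed work
            record["status"] = "running"       # S714: update progress status 1006
            return record                      # S712, S713: hand over for assignment
    return None  # no unprocessed sub job on this node


progress_table = [
    {"job_id": "J1", "sub_job_id": "1", "status": "waiting for shared processing"},
    {"job_id": "J1", "sub_job_id": "3", "status": "unprocessed"},
    {"job_id": "J1", "sub_job_id": "5", "status": "unprocessed"},
]
picked = dispatch_next(progress_table)
```

Suspended sub jobs are skipped by this scan; in the description they re-enter execution only through the wait queue handling that follows a successful shared data operation.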

The queue operation unit 303 then acquires the sub job specified in the command of step S713 from the sub job queue 311 (see FIG. 1) (step S715) and sends it to the sub job execution unit 302 (step S716).

Next, the sub job execution unit 302 starts the sub job (step S717), begins the divided data processing that forms the first half of the sub job (step S718), and requests access to the divided data 421 (see FIG. 1) via the data control unit 401 of the DB server 4 (step S719). Because the divided data 421 can be used exclusively by the grid node 3 (3a), no lock contention arises; the data is sent promptly to the sub job execution unit 302 (step S720), which then processes it (step S721).

Next, as shown in FIG. 8, the sub job execution unit 302 begins the shared data processing that forms the second half of the sub job (step S722) and requests access to the shared data 422 (see FIG. 1) via the data control unit 401 of the DB server 4 (step S723). If the shared data 422 is locked by another grid node 3 (3b, 3c), the data control unit 401 of the DB server 4 performs exclusive control (step S724). The data control unit 401 then registers, in the lock request sub job 1017 of the lock table 424 (see FIG. 6), the job ID, sub job ID, and node name of the sub job that requested access to the shared data 422 (step S725). Next, the data control unit 401 sends a lock wait notification to the sub job execution unit 302 of the grid node 3 (3a), informing it that the shared data 422 is locked (step S726).

On receiving the lock wait notification, the sub job execution unit 302 stops executing the sub job (step S727) and registers the job ID 1001, sub job ID 1002, queue ID 1007, and shared data ID 1008 in the shared data I/O processing wait queue table 313 (see FIG. 4) (step S728). The sub job execution unit 302 then generates a message indicating that the shared data processing failed (step S729) and sends it to the sub job control unit 301 (step S730).

Next, the sub job control unit 301 generates sub job registration command information whose arguments are the job ID 1001, sub job ID 1002, queue ID 1007, and shared data ID 1008 of the sub job whose shared data processing failed, as obtained in step S730 (step S731), and sends it to the queue operation unit 303 (step S732).

The queue operation unit 303 then collects the information needed to suspend the execution state of the sub job specified in the sub job registration command information of step S732 and save it in the queue, namely the checkpoint 1012, the input source data name 1013 and output destination data name 1014 of the sub job, and the unoutput data value 1015 of the sub job (step S733). It generates an entry consisting of the header part 1009 and the body part 1010 and registers it in the shared data I/O processing wait queue 314 (see FIG. 5) (step S734). The queue operation unit 303 then notifies the sub job control unit 301 that registration in the shared data I/O processing wait queue 314 has completed (step S735).

Next, on receiving the registration completion notification of step S735, the sub job control unit 301 updates the sub job progress status 1006 in the sub job progress management table 312 (see FIG. 3) from "running" to "waiting for shared processing" (step S736).
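The suspension path of steps S728 to S736 can be summarized in one function. In the description these steps are split across the sub job execution unit 302, the sub job control unit 301, and the queue operation unit 303. Collapsing them into a single call, and modeling the tables as Python lists and dictionaries with the names used here, are simplifying assumptions.

```python
from typing import Any, Dict, List, Tuple

Key = Tuple[str, str]  # (job ID 1001, sub job ID 1002)


def suspend_sub_job(
    running: Dict[str, Any],
    queue_id: str,
    shared_data_id: str,
    wait_table: List[Dict[str, str]],
    wait_queue: Dict[Key, Dict[str, Any]],
    progress: Dict[Key, str],
) -> None:
    key = (running["job_id"], running["sub_job_id"])
    # S728: register the header items in wait queue table 313.
    header = {
        "job_id": key[0],
        "sub_job_id": key[1],
        "queue_id": queue_id,
        "shared_data_id": shared_data_id,
    }
    wait_table.append(header)
    # S733: collect the state needed to resume later.
    body = {
        "checkpoint": running["checkpoint"],
        "input_data_name": running["input_data_name"],
        "output_data_name": running["output_data_name"],
        "unwritten_values": dict(running["unwritten_values"]),
    }
    # S734: store header and body together in wait queue 314.
    wait_queue[key] = {"header": header, "body": body}
    # S736: progress status 1006 goes from running to waiting.
    progress[key] = "waiting for shared processing"


wait_table: List[Dict[str, str]] = []
wait_queue: Dict[Key, Dict[str, Any]] = {}
progress: Dict[Key, str] = {("J1", "1"): "running"}
suspend_sub_job(
    {
        "job_id": "J1", "sub_job_id": "1", "checkpoint": 1,
        "input_data_name": "in.dat", "output_data_name": "out.dat",
        "unwritten_values": {"total": 120},
    },
    queue_id="Q3a",
    shared_data_id="D422",
    wait_table=wait_table,
    wait_queue=wait_queue,
    progress=progress,
)
```

No retry of the shared data access appears anywhere in this path, which is the point of the scheme: the node is immediately free to start another sub job.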

With the above processing, one sub job has been suspended because its shared data processing failed, so processing of the next new sub job is started (step S737).

FIG. 9 shows the processing that follows, for the new sub job started in step S737 of FIG. 8, once steps S711 to S721 of FIG. 7 have finished, that is, once the divided data 421 processing in the first half of the sub job has been executed. The processing for the case where the shared data 422 of the DB server 4 is not locked is described below with reference to FIGS. 9 and 10.

As shown in FIG. 9, the sub job execution unit 302 begins the shared data processing in the second half of the sub job in the same way as in FIG. 7 (step S722) and requests access to the shared data 422 via the data control unit 401 of the DB server 4 (step S723). If the shared data 422 is not locked, the data control unit 401 of the DB server 4 registers the queue ID 1007 of the grid node 3 (3a) in the lock acquisition queue 1016 of the lock table 424 (see FIG. 6) (step S738). This is equivalent to the queue ID 1007 of the grid node 3 (3a) having placed a lock on the shared data 422 of the DB server 4. The data control unit 401 then sends the shared data 422 to the sub job execution unit 302 (step S739).

On receiving the shared data 422, the sub job execution unit 302 executes the shared data processing (step S740). When the sub job finishes on completion of the shared data processing, the sub job execution unit 302 generates a sub job processing completion message (step S741) and sends it to the sub job control unit 301 (step S742).

Next, the sub job control unit 301 updates the sub job progress status 1006 in the sub job progress management table 312 (see FIG. 3) from "running" to "completed" (step S743). The sub job control unit 301 then generates a sub job completion notification (step S744) and sends it to the sub job distribution unit 202 of the sub job distribution device 2 (step S745).

On receiving the sub job completion notification, the sub job distribution unit 202 of the sub job distribution device 2 updates the progress status 1004 in the sub job management table 222 (see FIG. 2) from "running" to "completed" via the sub job management unit 203 (step S746).

Meanwhile, the sub job control unit 301 of the grid node 3 (3a) searches the shared data I/O processing wait queue table 313 (see FIG. 4) to determine whether its shared data ID 1008 entries include an access request for the same shared data 422 as that acquired in step S738 (step S747). In other words, it checks whether the table contains a shared data ID 1008 identical to that of the shared data 422 whose processing has just finished. If a sub job with a matching shared data ID 1008 exists, the sub job control unit 301 generates sub job acquisition command information (step S748) and sends it to the queue operation unit 303 (step S749). The sub job acquisition command information contains the job ID, sub job ID, queue ID, and shared data ID as arguments.

Next, the queue operation unit 303 acquires the sub job specified in the sub job acquisition command information of step S749 from the shared data I/O processing wait queue 314 (see FIG. 5) and restores its execution state (step S750). Specifically, it refers to the checkpoint 1012 saved in the body part 1010 of the shared data I/O processing wait queue 314 and skips the execution position of the job steps forward to the point reached before the suspension. It then sets the input source data name 1013, output destination data name 1014, and unoutput data value 1015 stored in the body part 1010 in the restored sub job. Finally, the queue operation unit 303 notifies the sub job control unit 301 that restoration of the sub job's execution state has completed (step S751).
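Resuming from the checkpoint 1012 can be illustrated with a sub job modeled as a list of step functions. This representation, the exception `SharedDataLocked`, and the helper `run_from` are assumptions made for the sketch. The description itself only states that the execution position is skipped forward to the point reached before suspension and that the unoutput values are restored.

```python
from typing import Callable, Dict, List, Tuple

Step = Callable[[Dict[str, int]], None]


class SharedDataLocked(Exception):
    """Raised by a step when the shared data is locked by another node."""


def run_from(steps: List[Step], checkpoint: int, values: Dict[str, int]) -> Tuple[bool, int]:
    """Run steps starting at index `checkpoint`; return (finished, new checkpoint)."""
    for index in range(checkpoint, len(steps)):
        try:
            steps[index](values)
        except SharedDataLocked:
            return False, index  # suspend: steps before `index` are done
    return True, len(steps)


shared = {"locked": True, "total": 0}  # stand-in for shared data 422


def divided_step(values: Dict[str, int]) -> None:
    # First half: work on divided data; the result is not yet output.
    values["partial"] = values.get("partial", 0) + 10


def shared_step(values: Dict[str, int]) -> None:
    # Second half: needs the shared data.
    if shared["locked"]:
        raise SharedDataLocked()
    shared["total"] += values["partial"]


steps = [divided_step, shared_step]
values: Dict[str, int] = {}
first_run = run_from(steps, 0, values)             # suspended at the shared step
shared["locked"] = False                           # the other node releases the lock
second_run = run_from(steps, first_run[1], values) # resumes without redoing step 0
```

Because the second call starts at the saved index, the divided data step is not repeated and the saved partial value is written exactly once.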

Next, as shown in FIG. 10, on being notified that restoration of the sub job has completed, the sub job control unit 301 instructs the queue operation unit 303 to assign the restored sub job to the sub job execution unit 302 (step S752). From this point, the shared data 422 processing in the second half of the sub job is executed in the same way as steps S713 to S717 of FIG. 7 and steps S722 to S746 of FIG. 9 (step S753). In addition, after sending the sub job completion notification in step S745, the sub job control unit 301 updates the shared data I/O processing wait queue table 313 (see FIG. 4) by deleting the record for the sub job whose processing has completed (step S754).

Thereafter, the shared data I/O processing of steps S747 to S754 is repeated until no matching sub job remains in the shared data I/O processing wait queue table 313 (step S755). When the search of the shared data I/O processing wait queue table 313 in step S747 finds no matching sub job, the shared data I/O loop ends (step S756).
The sub job control unit 301 then generates unlock command information for the shared data 422 (step S757) and sends it to the data control unit 401 of the DB server 4 (step S758).

Next, the data control unit 401 updates the lock table 424 (see FIG. 6) according to the unlock command information it received (step S759). Specifically, the data control unit 401 clears the lock acquisition queue 1016 in the record for the shared data ID 1008 that the sub job accessed, and deletes the sub jobs whose node name is 3a from the lock request sub job 1017. The data control unit 401 then notifies the sub job control unit 301 of the grid node 3 (3a) that the unlock processing has completed (step S760).

On receiving notice that the unlock processing has completed, the sub job control unit 301 resumes processing of new sub jobs (step S761). Specifically, it returns to step S711 of FIG. 7 and continues from there.
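The loop of steps S747 to S758 can be sketched as a single function: after one sub job succeeds on a given item of shared data, every suspended sub job waiting on that same item is resumed in turn, and only then is the lock released. The function name, the callback parameters `resume` and `unlock`, and the dictionary layout of the table records are assumptions for illustration.

```python
from typing import Callable, Dict, List


def drain_waiting(
    shared_data_id: str,
    wait_table: List[Dict[str, str]],
    resume: Callable[[Dict[str, str]], None],
    unlock: Callable[[str], None],
) -> List[str]:
    """Resume every suspended sub job waiting on shared_data_id, then unlock."""
    resumed: List[str] = []
    while True:
        # S747: look for a suspended sub job that needs the same shared data.
        match = next(
            (r for r in wait_table if r["shared_data_id"] == shared_data_id),
            None,
        )
        if match is None:
            break  # S756: no matching sub job remains, leave the loop
        resume(match)             # S748 to S753: restore and run the second half
        wait_table.remove(match)  # S754: delete the completed record
        resumed.append(match["sub_job_id"])
    unlock(shared_data_id)  # S757, S758: ask the DB server to release the lock
    return resumed


wait_table = [
    {"sub_job_id": "1", "shared_data_id": "D1"},
    {"sub_job_id": "2", "shared_data_id": "D2"},
    {"sub_job_id": "3", "shared_data_id": "D1"},
]
unlocked: List[str] = []
resumed_order = drain_waiting(
    "D1", wait_table, resume=lambda record: None, unlock=unlocked.append
)
```

Holding the lock across the whole loop is what lets the node process its waiting sub jobs as a batch without competing with other nodes between them.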

The sub job distribution unit 202 of the sub job distribution device 2 receives, at predetermined intervals, the count of sub jobs that are unprocessed or waiting for shared processing from the queue management unit 304 of each grid node 3 (step S702), and repeats the sub job distribution processing of steps S703 to S707. If in step S704 there are no sub jobs left to distribute, that is, if processing of the job has finished, it sends a job end notification to the job input device 1 and waits for the next sub job distribution.

Next, the flow of sub job processing when the job allocation program according to the present embodiment is executed will be described by showing where the data for each sub job resides.
FIG. 11 is a diagram for explaining how the location of each sub job changes over time in the present embodiment.

T0 to T13 in FIG. 11 form a time axis running from T0, before the job allocation program is executed, to T13, after execution has finished. The blocks in FIG. 11 represent, from left to right, the sub job storage unit 221 of the sub job distribution device 2, and the sub job queue 311, the shared data I/O processing wait queue 314, and the sub job execution unit 302 of the grid node 3 (3a). Each circled number represents a sub job held on a computer, the number being its sub job ID 1002.
FIG. 11 uses these circled numbers to show where each sub job resides at a given time. An example of the flow of sub job processing is described below along the time axis.

At time T0, sub jobs "1" to "10" are held in the sub job storage unit 221; that is, no sub job has yet been distributed to a grid node 3.
At time T1, sub jobs "1" and "3" have been distributed to the grid node 3 (3a) and saved in the sub job queue 311 (see FIG. 1).
At time T2, sub job "1" is being executed by the sub job execution unit 302, while sub job "5" has additionally been saved in the sub job queue 311.
At time T3, sub job "1" could not process its shared data, so it has been suspended and saved in the shared data I/O processing wait queue 314 (see FIG. 5). The next sub job, "3", is now being processed by the sub job execution unit 302.
At time T4, sub job "3" could not process the shared data 422, so it has been suspended and saved in the shared data I/O processing wait queue 314. The next sub job, "5", is now being processed by the sub job execution unit 302.
At time T5, sub job "5" has completed its shared data processing and disappears, and sub job "1" has been reassigned to the sub job execution unit 302.
At time T6, sub job "1" has completed its shared data processing and disappears. Sub job "3" is then processed by the sub job execution unit 302, and shared data processing is repeated until the shared data I/O processing wait queue 314 is empty.
At time T7, a new sub job "7" is processed by the sub job execution unit 302. The processing of times T8 to T12 follows in the same way, and at time T13 all sub jobs have been processed.
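The ordering seen at times T3 to T6 can be reproduced with a small single-node simulation. It is a deliberately simplified model with several assumptions: all sub jobs use the same shared data, the lock held by another node becomes free at a fixed attempt number, and the function name and parameters are hypothetical. Sub jobs still waiting when the queue runs out are not handled.

```python
from collections import deque
from typing import Deque, List


def simulate(sub_jobs: List[str], lock_free_from_attempt: int) -> List[str]:
    """Return the order in which sub jobs complete on one grid node."""
    sub_job_queue: Deque[str] = deque(sub_jobs)  # sub job queue 311
    waiting: Deque[str] = deque()                # wait queue 314
    completed: List[str] = []
    attempts = 0
    while sub_job_queue:
        job = sub_job_queue.popleft()
        attempts += 1
        if attempts < lock_free_from_attempt:
            waiting.append(job)  # shared data locked elsewhere: suspend, no retry
            continue
        completed.append(job)    # shared data processing succeeded
        while waiting:           # batch the suspended sub jobs behind it
            completed.append(waiting.popleft())
    return completed


completion_order = simulate(["1", "3", "5"], lock_free_from_attempt=3)
```

With the lock becoming free at the third attempt, sub jobs "1" and "3" are suspended and sub job "5" completes first, followed by "1" and then "3", which matches the sequence described for times T3 to T6.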

As described above, the job allocation apparatus, job allocation method, and job allocation program according to the present invention make it possible, in a parallel batch processing system composed of a plurality of nodes, to avoid the bottleneck caused by waiting for I/O processing on the same data and to improve the utilization efficiency of the computation nodes.

1 Job input device
2 Job distribution device
3 Grid node (job allocation apparatus)
4 DB server
5 Job allocation system
6 Communication network
10, 20, 30, 40 Control unit
11, 21, 31, 41 Memory unit
12, 22, 32, 42 Storage unit
13, 23, 33, 43 Input/output unit
14, 24, 34, 44 Communication unit
101 Job input unit
121 Job storage unit
122 Job net storage unit
201 Job division unit
202 Sub job distribution unit
203 Sub job management unit
221 Sub job storage unit
222 Sub job management table
301 Sub job control unit
302 Sub job execution unit
303 Queue operation unit
304 Queue management unit
311 Sub job queue
312 Sub job progress management table (sub job progress management information)
323 Shared data I/O processing wait queue table (shared data I/O processing wait queue information)
324 Shared data I/O processing wait queue
401 Data control unit
421 Divided data
422 Shared data
423 Data management table
424 Lock table

Claims

A job allocation device provided in a job allocation system in which a job distribution device that divides a job into sub-jobs of a predetermined processing unit and distributes the sub-jobs to a plurality of job allocation devices, the plurality of job allocation devices that process the distributed sub-jobs by accessing divided data and shared data stored in a DB server, and the DB server that manages access to the divided data and the shared data for each job allocation device are connected via a communication network, the job allocation device comprising:
a storage unit that stores a sub-job queue in which the sub-jobs are stored, sub-job progress management information in which the progress of the processing state of each sub-job is stored, and shared data I/O processing wait queue information in which those sub-jobs that are waiting for shared data processing are stored;
a sub-job control unit that updates the sub-job progress management information and searches for an unprocessed sub-job by referring to the sub-job progress management information;
a sub-job execution unit that acquires the retrieved unprocessed sub-job from the sub-job queue and executes the sub-job by accessing the divided data and the shared data of the DB server; and
a queue operation unit that, when the sub-job execution unit fails to process the shared data, temporarily suspends execution of the sub-job and registers the progress of the suspended sub-job, which is now waiting for processing, in the shared data I/O processing wait queue information,
wherein the sub-job execution unit, after execution of the sub-job has been temporarily suspended, acquires a new unprocessed sub-job from the sub-job queue and processes the shared data,
when the processing succeeds, the sub-job control unit extracts, from the shared data I/O processing wait queue information, those sub-jobs waiting for processing that request access to the same shared data as the shared data for which the processing succeeded, and
the sub-job execution unit executes the shared data processing of the extracted sub-jobs.
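
The control flow recited in this claim can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the implementation of the embodiment: the names `run_sub_jobs`, `make_try_process`, and `wait_info` are hypothetical, the DB server is simulated by a function that refuses access a fixed number of times, and the batch of waiting sub-jobs is assumed to succeed because access to the shared data is already held.

```python
from collections import defaultdict, deque


def make_try_process(busy_attempts):
    """Simulated DB server: each shared data name refuses access N times."""
    remaining = dict(busy_attempts)

    def try_process(shared_data):
        if remaining.get(shared_data, 0) > 0:
            remaining[shared_data] -= 1
            return False  # access is excluded; the caller does not retry
        return True

    return try_process


def run_sub_jobs(sub_jobs, try_process):
    """sub_jobs is a list of (sub_job_id, shared_data_name) pairs."""
    sub_job_queue = deque(sub_jobs)
    progress = {job_id: "unprocessed" for job_id, _ in sub_jobs}
    wait_info = defaultdict(list)  # shared data name -> suspended sub-job IDs
    completed = []

    while sub_job_queue:
        job_id, shared_data = sub_job_queue.popleft()
        if not try_process(shared_data):
            # Suspend without retrying and register the sub-job as waiting.
            progress[job_id] = "waiting"
            wait_info[shared_data].append(job_id)
            continue
        progress[job_id] = "done"
        completed.append(job_id)
        # Success: extract every sub-job waiting for the same shared data
        # and process them in a batch (assumed to succeed, see lead-in).
        for waiter in wait_info.pop(shared_data, []):
            progress[waiter] = "done"
            completed.append(waiter)

    # Sub-jobs still waiting when the queue runs dry are returned as-is;
    # handling them is outside the scope of this sketch.
    return completed, progress, dict(wait_info)


try_process = make_try_process({"A": 1})  # "A" is busy for one attempt
completed, progress, still_waiting = run_sub_jobs(
    [("s1", "A"), ("s2", "B"), ("s3", "A")], try_process
)
print(completed)  # ['s2', 's3', 's1']
```

In this run, sub-job `s1` is suspended when shared data `A` is busy, `s2` proceeds on unrelated data, and `s1` is processed immediately after `s3` succeeds on `A`.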

The job allocation device according to claim 1, wherein
the storage unit further stores a shared data I/O processing wait queue that holds, when execution of a sub-job is temporarily suspended, information on the execution state of the suspended sub-job,
the queue operation unit, when the sub-job execution unit fails to process the shared data, collects the information on the execution state of the sub-job and stores it in the shared data I/O processing wait queue, and, when the sub-job control unit extracts the sub-job waiting for processing from the shared data I/O processing wait queue information, acquires the information on the execution state of the extracted sub-job from the shared data I/O processing wait queue and restores the execution state of the sub-job as of the temporary suspension, and
the sub-job execution unit executes the processing of the sub-job waiting for processing from the point at which its execution was temporarily suspended.
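
Saving and restoring the execution state can be sketched as follows. This is a hedged illustration only: a sub-job is modeled as a list of step functions, and `SharedDataBusy`, `execute_from`, and the state keys `next_step` and `results` are hypothetical names that do not appear in the specification.

```python
class SharedDataBusy(Exception):
    """Raised by a step when the shared data cannot be accessed."""


def execute_from(steps, state):
    """Run steps from state['next_step']; return (finished, state)."""
    results = list(state["results"])
    for index in range(state["next_step"], len(steps)):
        try:
            results.append(steps[index]())
        except SharedDataBusy:
            # Collect the execution state so it can be stored in the wait queue.
            return False, {"next_step": index, "results": results}
    return True, {"next_step": len(steps), "results": results}


calls = []
shared_busy = {"flag": True}


def read_divided_data():
    calls.append("read")
    return 10


def update_shared_data():
    if shared_busy["flag"]:
        raise SharedDataBusy()
    calls.append("update")
    return 20


steps = [read_divided_data, update_shared_data]

# First attempt: the shared data is busy, so the state is saved.
finished, saved = execute_from(steps, {"next_step": 0, "results": []})

# Later, once the shared data is available, resume from the saved state.
shared_busy["flag"] = False
resumed, final = execute_from(steps, saved)
```

Because execution restarts at the saved step index, `read_divided_data` runs only once even though the sub-job is executed in two parts.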

The job allocation device according to claim 2, wherein the information on the execution state of the temporarily suspended sub-job comprises a checkpoint indicating the degree of progress of the overall processing of the sub-job, an input source data name and an output destination data name of the sub-job, and the data items and data values that had not yet been output at the time of the temporary suspension.
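
One possible shape for such a record is shown below as a Python dataclass. The class name, field names, and sample values are assumptions made for illustration; the claim specifies only what kinds of information are held, not how they are laid out.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class SuspendedSubJobState:
    checkpoint: int            # degree of progress of the overall processing
    input_source: str          # input source data name
    output_destination: str    # output destination data name
    # Data items not yet output, mapped to their not-yet-output values.
    pending_output: dict = field(default_factory=dict)


state = SuspendedSubJobState(
    checkpoint=3,
    input_source="divided_data_07",      # hypothetical name
    output_destination="shared_totals",  # hypothetical name
    pending_output={"amount": 1200},
)

# A plain dict form is convenient for storing the record in a queue,
# and the record can be rebuilt from it when the sub-job is resumed.
record = asdict(state)
restored = SuspendedSubJobState(**record)
```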

A job allocation method used by a job allocation device provided in a job allocation system in which a job distribution device that divides a job into sub-jobs of a predetermined processing unit and distributes the sub-jobs to a plurality of job allocation devices, the plurality of job allocation devices that process the distributed sub-jobs by accessing divided data and shared data stored in a DB server, and the DB server that manages access to the divided data and the shared data for each job allocation device are connected via a communication network, wherein
the job allocation device includes
a storage unit that stores a sub-job queue in which the sub-jobs are stored, sub-job progress management information in which the progress of the processing state of each sub-job is stored, and shared data I/O processing wait queue information in which those sub-jobs that are waiting for shared data processing are stored, and the job allocation device executes:
a step of searching for an unprocessed sub-job by referring to the sub-job progress management information;
a step of acquiring the retrieved unprocessed sub-job from the sub-job queue and executing the sub-job by accessing the divided data and the shared data of the DB server;
a step of, when the processing of the shared data fails, temporarily suspending execution of the sub-job and registering the progress of the suspended sub-job, which is now waiting for processing, in the shared data I/O processing wait queue information;
a step of, after execution of the sub-job has been temporarily suspended, acquiring a new unprocessed sub-job from the sub-job queue and processing the shared data;
a step of extracting, when the processing succeeds, from the shared data I/O processing wait queue information, those sub-jobs waiting for processing that request access to the same shared data as the shared data for which the processing succeeded; and
a step of executing the shared data processing of the extracted sub-jobs.
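
The suspending step presupposes an access attempt that reports failure at once instead of blocking or retrying. A minimal sketch of such an attempt follows, assuming an in-process stand-in for the lock table held by the DB server; `LockTable`, `try_lock`, and `unlock` are hypothetical names, and only `threading.Lock.acquire(blocking=False)` is an actual standard library call.

```python
import threading


class LockTable:
    """Hypothetical stand-in for a lock table: one lock per shared data name."""

    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()  # protects the dictionary itself

    def try_lock(self, shared_data):
        """Return True if the lock was obtained, False if it is already held."""
        with self._guard:
            lock = self._locks.setdefault(shared_data, threading.Lock())
        # Non-blocking attempt: fail immediately rather than wait or retry.
        return lock.acquire(blocking=False)

    def unlock(self, shared_data):
        self._locks[shared_data].release()


table = LockTable()
first = table.try_lock("rates")   # obtained
second = table.try_lock("rates")  # refused: the lock is already held
table.unlock("rates")
third = table.try_lock("rates")   # obtained again after the release
```

A refused attempt is the point at which the sub-job would be suspended and registered as waiting, leaving the device free to acquire a new unprocessed sub-job.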

The job allocation method according to claim 4, wherein
the storage unit further stores a shared data I/O processing wait queue that holds, when execution of a sub-job is temporarily suspended, information on the execution state of the suspended sub-job, and the job allocation device further executes:
a step of collecting, when the processing of the shared data fails, the information on the execution state of the sub-job and storing it in the shared data I/O processing wait queue;
a step of acquiring, when the sub-job waiting for processing is extracted from the shared data I/O processing wait queue information, the information on the execution state of the extracted sub-job from the shared data I/O processing wait queue and restoring the execution state of the sub-job as of the temporary suspension; and
a step of executing the processing of the sub-job waiting for processing from the point at which its execution was temporarily suspended.

A job allocation program for causing a computer to execute the job allocation method according to claim 4 or 5.