JPH07168794A

JPH07168794A - Job managing method for computer system

Info

Publication number: JPH07168794A
Application number: JP5313019A
Authority: JP
Inventors: Taro Inoue; 太郎井上; Toshiaki Arai; 利明新井; Takao Maeda; 多可雄前田; Masaaki Oya; 雅章大矢
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1993-12-14
Filing date: 1993-12-14
Publication date: 1995-07-04

Abstract

PURPOSE: To continue executing, on another computer, the jobs of a computer that has stopped, and to enable checkpoint restart in a distributed system. CONSTITUTION: When a job managing program agent 6000 stops, its job managing information 4600 and job execution information 4700 are transmitted to a job managing program master 5000. Also, when a job ends abnormally, the checkpoint information of the job managing program agent 6000 is transmitted to the job managing program master 5000. The job managing program master 5000 selects another computer in the same group and notifies the job managing program agent on that computer.

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to a job management method in a computer system, and more particularly to a job management method in a distributed computer system in which a plurality of computers are connected to each other via a network.

[0002]

[Prior Art] Two kinds of computer systems are conventionally known: centralized systems, in which a single computer performs all processing, and distributed systems, in which many computers are connected by a network and share the processing. In a centralized system, one computer holds all the management information needed to manage the system. That is, the management information on every resource in the system is managed by the operating system (OS) on that computer. In a distributed system, by contrast, each of the many computers in the system has its own resources, and the resources owned by each computer are managed by the OS on that computer.

[0003] Centralized systems have long provided a warm start function and a checkpoint restart function. The warm start function, when the system is started (or restarted), takes over the state of the system at the previous shutdown, for example the input jobs in the input queue and the output jobs in the output queue, and begins processing from that state. The checkpoint restart function prepares for re-execution in case a job ends abnormally while one or more jobs are being executed: it records information on the execution status of the job at a certain point (called a checkpoint), and if the job ends abnormally, it re-executes the processing from the state at the checkpoint. An interrupted job therefore need not be re-executed from the beginning, and computer resources can be used effectively.
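The checkpoint restart idea can be illustrated with a minimal Python sketch. The job itself (summing a list), the checkpoint interval, the `fail_at` knob that simulates an abnormal end, and the names `run_job` and `checkpoint` are all illustrative assumptions and do not come from the patent.

```python
def run_job(items, checkpoint=None, fail_at=None, interval=2):
    """Sum `items`, saving a checkpoint every `interval` items.

    `checkpoint` is a dict {"index": int, "total": int} from an earlier run.
    `fail_at` simulates an abnormal end at that index (hypothetical knob).
    Returns (result, last_checkpoint); result is None on abnormal end.
    """
    state = dict(checkpoint) if checkpoint else {"index": 0, "total": 0}
    saved = dict(state)
    for i in range(state["index"], len(items)):
        if fail_at is not None and i == fail_at:
            return None, saved          # abnormal end: only the checkpoint survives
        state["total"] += items[i]
        state["index"] = i + 1
        if state["index"] % interval == 0:
            saved = dict(state)         # record the execution status at this point
    return state["total"], saved


# The first run ends abnormally at index 3; the restart resumes from the checkpoint.
result, ckpt = run_job([1, 2, 3, 4, 5], fail_at=3)
resumed, _ = run_job([1, 2, 3, 4, 5], checkpoint=ckpt)
```

The second call does not redo the work before the checkpoint, which is the saving in computer resources that the paragraph describes.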

[0004] In a distributed system, as in a centralized system, giving each computer a warm start function or a checkpoint restart function allows each computer to start (or restart) the system, or to continue processing a job that ended abnormally, in the same way as a centralized system.

[0005]

[Problems to Be Solved by the Invention] In a distributed system, however, it is desirable to exploit the presence of multiple computers so that, after one computer ends its service, another running computer can take over the work of the stopped computer. Likewise, when a job running on one computer ends abnormally, it is desirable to be able to re-execute that job on a different computer.

[0006] To meet these demands, when a computer stops because its service ends or because of a failure, the jobs that were running on it must be able to continue on another running computer in the system (this capability is hereinafter called the job continuation execution function). In a distributed system, however, each computer independently holds its own management information, for example the management information on files, as noted above. Consequently, when a job that was running on one computer is to be executed on another, the computer taking over the job cannot recognize the files that the job used on the previous computer. It has therefore been difficult to realize a job continuation execution function between computers in conventional distributed computer systems.

[0007] Accordingly, an object of the present invention is to provide means for enabling a job continuation execution function and a checkpoint restart function in a distributed system.

[0008]

[Means for Solving the Problems] To achieve the above object, the job management method according to the present invention applies to a computer system in which a plurality of computers are connected to each other via a network, a job management program that manages the jobs to be executed by each computer runs on each computer, and a part of these job management programs (hereinafter called the job management program master) manages the management information of the job management programs on the computers under it (hereinafter called job management program agents). In this method, the job management program master groups the computers that execute jobs and conveys the grouping information to each job management program agent. When a job management program agent is about to stop, it notifies the job management program master that it is stopping and sends the master the execution information of the jobs running on its computer together with the job management information it holds. The job management program master selects, from the other computers belonging to the same group as the computer on which the stopping agent was running, a computer to execute the jobs that were running there, and sends the job execution information and job management information received from the stopping agent to the job management program agent on the selected computer. The job management program agent on the selected computer then executes the jobs of the computer on which the stopped agent was running.

[0009] In this job management method, before a job that was running on the computer whose job management program agent stopped is executed on the selected computer, the job management program agent on the selected computer creates, on the selected computer, a replica of the files used by that job.

[0010] Alternatively, before a job that was running on the computer whose job management program agent stopped is executed on the selected computer, the job management program agent on the selected computer makes the files on the stopped agent's computer recognizable from the selected computer.

[0011] According to another aspect of the present invention, in a job management method for a computer system of the same kind (a plurality of computers connected to each other via a network, a job management program running on each computer to manage the jobs that computer executes, and a job management program master that manages the management information of the job management program agents on the computers under it), the job management program master groups the computers that execute jobs and sends the grouping information to each job management program agent. At a checkpoint, a job management program agent sends the job management program master the information on the state of the resources related to the job together with the job management information it holds (hereinafter called checkpoint information). The job management program master stores the checkpoint information sent from the agent. When a job ends abnormally, the job management program agent running on the computer where the abnormal end occurred notifies the job management program master of the abnormal end. The job management program master selects a computer to re-execute the job from the other computers belonging to the same group as the computer where the abnormal end occurred, and sends the checkpoint information of the computer where the abnormal end occurred to the job management program agent on the selected computer. The job management program agent on the selected computer then re-executes the job that ended abnormally.

[0012] In this job management method, before the abnormally ended job is re-executed on the selected computer, the job management program agent on the selected computer creates, on the selected computer, a replica of the files the job was using. Alternatively, before the job is re-executed on the selected computer, the job management program agent on the selected computer makes the files that the job was using on the computer where it had been running recognizable from the selected computer as well.

[0013]

[Operation] In the present invention, by the means described above, the computers are grouped on the computer where the job management program master runs, and the job management program agent on each computer can learn the grouping information held by the master. Each job management program agent can therefore schedule jobs with the computer groups in mind. When a job management program agent stops, it notifies the job management program master of this together with its job management information and the information on the state of the resources of the running jobs, so the master can learn the state of the stopping computer just before the stop. The job management program master then uses its own grouping information to select another computer in the same group as the computer whose agent stopped, and sends the selected computer the job management information of the agent just before the stop and the resource state information of the jobs that were running on the stopping computer. As a result, the jobs that were running on the computer whose agent stopped can be executed on another computer in the same group.

[0014] Furthermore, before a job that was running on the computer whose job management program agent stopped is executed on the selected computer, the job management program agent on the selected computer creates there a replica of the files that the job used. The files needed by the job being taken over therefore become usable from the selected computer, and the job can be executed.

[0015] Alternatively, before a job that was running on the computer whose job management program agent stopped is executed on the selected computer, the job management program agent on the selected computer makes the files on the stopped agent's computer recognizable from the selected computer. The files needed by the job can then be recognized and used from the selected computer as well, and the job can be executed.

[0016]

[Embodiments] Embodiments of the present invention are described in detail below with reference to the drawings.

[0017] FIG. 1 is a block diagram showing the hardware configuration of the computer system in this embodiment. Reference numeral 1000 denotes a computer; a plurality of computers 1000 are connected to one another by a local area network (LAN) 9000 to form a network. FIG. 1 shows only four computers 1000, but the number of computers 1000 is arbitrary. The LAN 9000 is connected to a wide area network (WAN) 9600 via a gateway 9700.

[0018] FIG. 2 is a block diagram showing the detailed configuration of the computer 1000. As shown in FIG. 2, the computer 1000 comprises a central processing unit (CPU) 1100, an external storage device 2000, an I/O adapter 2500, a memory 3000, a bus 8000, and a LAN adapter 9500. The CPU 1100, the I/O adapter 2500, the memory 3000, and the LAN adapter 9500 are each connected to the bus 8000 and exchange data through it. The external storage device 2000 is connected to the I/O adapter 2500; a disk device is commonly used as the external storage device 2000. The external storage device 2000 stores control software 4000 for controlling the computer system, application software (AP) 3500, and the like. The LAN adapter 9500 is connected to the LAN 9000 and controls data transmission to and from the other computers 1000.

[0019] FIG. 3 shows the configuration of the control software 4000. The control software 4000 is stored in the external storage device 2000 of every computer and is loaded into the memory 3000 as needed. The CPU 1100 performs various kinds of control according to the control software 4000 loaded into the memory 3000. The control software 4000 consists of an operating system 4100, a distributed file management program 4200, and a job management program 4500. The operating system 4100 manages the resources of the computer 1000 (the memory 3000, the external storage device 2000, and so on). The distributed file management program 4200 manages files uniformly across computers in an environment, such as this computer system, where a plurality of computers are connected by a network. One example of such a distributed file management program 4200 is the Distributed File Service (DFS) of the Distributed Computing Environment (DCE). The job management program 4500 manages the work (jobs) that computer users have the computer 1000 execute; it performs job scheduling, queuing, monitoring of job execution states, and so on. There are two kinds of job management program 4500: the job management program master 5000 and the job management program agent 6000. The job management program on one computer (the job management program master 5000) manages the job management programs on the computers under it (the job management program agents 6000).

[0020] FIG. 4 shows the main information held by the job management program master 5000 and the job management program agent 6000. The job management program agent 6000 holds job management information A4600 and job execution information A4700. The job management information A4600 describes the jobs being executed (or scheduled to be executed) on the computer; it includes queuing information A4610 for the jobs to be executed on the computer, monitoring information A4620 giving the execution state of each job on the computer (waiting for execution, executing, execution finished, and so on), and the like. The job execution information A4700 stores execution information, collected from the OS 4100 and other sources, on the jobs being executed on the computer. The job management program master 5000 holds job management information M4601, queuing information M4611, monitoring information M4621, and job execution information M4701 for use when the master itself operates as a job management program agent. In addition, it has an agent job management information storage unit 4800, an agent job execution information storage unit 4830, and an agent checkpoint information storage unit 4840 for storing the job management information A4600, the job execution information A4700, and the checkpoint information reported by the job management program agents 6000. One set of these storage units (4800, 4830, and 4840) exists for each job management program agent 6000.
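The information layout of FIG. 4 can be pictured with a few Python dataclasses. This is only a sketch: the class and field names are hypothetical, the reference numerals in the comments map each field back to the text, and the master's own M46xx/M4701 information is omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class JobManagementInfo:          # corresponds to A4600
    queuing: list = field(default_factory=list)      # A4610: jobs queued on this computer
    monitoring: dict = field(default_factory=dict)   # A4620: job name -> execution state


@dataclass
class AgentState:                 # information held by one agent 6000
    host: str
    management: JobManagementInfo = field(default_factory=JobManagementInfo)
    execution: dict = field(default_factory=dict)    # A4700: job name -> execution info


@dataclass
class MasterState:                # per-agent storage units held by the master 5000
    agent_management: dict = field(default_factory=dict)   # 4800, keyed by agent host
    agent_execution: dict = field(default_factory=dict)    # 4830, keyed by agent host
    agent_checkpoints: dict = field(default_factory=dict)  # 4840, keyed by agent host


master = MasterState()
agent = AgentState(host="HOST1")
agent.management.queuing.append("job-a")
agent.management.monitoring["job-a"] = "waiting"
master.agent_management[agent.host] = agent.management
```

Keying the master's dictionaries by host name reflects the statement that one set of storage units exists per agent.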

[0021] FIG. 5 is a flowchart outlining the processing performed when the job management program master 5000 starts. The job management program master 5000 is started when its start command is entered (step 5100). It then reads the job execution computer group definition file 7010 (step 5110).

[0022] FIG. 6 shows the contents of the job execution computer group definition file 7010. The job execution computer group of each computer is defined by a pair consisting of the computer's host name and the name of the job execution computer group to which it belongs. In FIG. 6, the computers with host names HOST1 through HOST4 belong to the job execution computer group GROUP1, and the computers with host names HOST5 and HOST6 belong to the job execution computer group GROUP2.
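As a rough illustration, the host/group pairs could be parsed as follows. The patent does not give the file syntax, so the whitespace-separated `HOST GROUP` line format and the function names `parse_group_definitions` and `hosts_in_same_group` are assumptions made for this sketch.

```python
def parse_group_definitions(text):
    """Return {host: group} from lines of the assumed form 'HOST GROUP'."""
    groups = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        host, group = line.split()
        groups[host] = group
    return groups


def hosts_in_same_group(groups, host):
    """List the other hosts that share `host`'s job execution computer group."""
    return [h for h, g in groups.items() if g == groups[host] and h != host]


# The six hosts and two groups described for FIG. 6.
DEFINITIONS = """\
HOST1 GROUP1
HOST2 GROUP1
HOST3 GROUP1
HOST4 GROUP1
HOST5 GROUP2
HOST6 GROUP2
"""
groups = parse_group_definitions(DEFINITIONS)
```

With this data, `hosts_in_same_group(groups, "HOST5")` yields only `HOST6`, the single other member of GROUP2.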

[0023] FIG. 7 is a flowchart outlining the processing performed when the job management program agent 6000 starts. The job management program agent 6000 is started when its start command is entered (step 6100). It then requests the job management program master 5000 to send the job execution computer group definition file 7010 (step 6110). The job management program master 5000 creates the job management information storage unit 4800 and the job execution information storage unit 4830 (step 6115) and sends the job execution computer group definition file 7010 to the job management program agent 6000 (step 6120). The job management program agent 6000 then stores the received job execution computer group definition file 7010 in its own memory 3000 (or external storage device 2000) (step 6130).
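The startup exchange of FIG. 7 might be sketched as below. A direct method call stands in for the network messages, and the `Master` and `Agent` classes and their method names are hypothetical, not taken from the patent.

```python
class Master:
    def __init__(self, group_definitions):
        self.group_definitions = dict(group_definitions)  # read at startup (step 5110)
        self.agent_management = {}   # storage units 4800, one per agent
        self.agent_execution = {}    # storage units 4830, one per agent

    def handle_definition_request(self, host):
        # Step 6115: create the per-agent storage units.
        self.agent_management[host] = {}
        self.agent_execution[host] = {}
        # Step 6120: send the group definitions back to the agent.
        return dict(self.group_definitions)


class Agent:
    def __init__(self, host):
        self.host = host
        self.group_definitions = None

    def start(self, master):
        # Steps 6110 and 6130: request the definitions, then keep a local copy.
        self.group_definitions = master.handle_definition_request(self.host)


master = Master({"HOST1": "GROUP1", "HOST2": "GROUP1"})
agent = Agent("HOST2")
agent.start(master)
```

Because the agent's first action is a request to the master, the sketch also shows why the master has to be running before any agent starts.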

[0024] Through the processing shown in FIGS. 5 and 7, each job management program agent 6000 can recognize the defined job execution computer groups. Note, however, that the job management program master 5000 must be started before the job management program agents 6000.

[0025] FIG. 8 is a flowchart outlining the processing performed when the job management program agent 6000 stops. The stop processing of the job management program agent 6000 begins when its stop command is entered (step 6200). If a job is running on the computer, its execution is interrupted and its execution information is stored in the job execution information A4700 (step 6205). The job management program agent 6000 then sends the job management information A4600 and the job execution information A4700 to the job management program master 5000, thereby notifying the master that the agent is stopping (step 6210). The agent also reports the host name of its computer to the job management program master 5000 at this time.
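Steps 6205 and 6210 could look roughly like this from the agent's side. The message layout, the `"suspended"` state label, and the `collect_execution_info` callback are hypothetical choices for this sketch; the patent only states which information is sent.

```python
def stop_agent(host, monitoring, execution_info, collect_execution_info):
    """Build the stop notification an agent sends to the master.

    `monitoring` maps job name -> state; `collect_execution_info` is a
    hypothetical callback returning the execution info of a running job.
    """
    for job, state in monitoring.items():
        if state == "running":
            # Step 6205: interrupt the job and record its execution info.
            monitoring[job] = "suspended"
            execution_info[job] = collect_execution_info(job)
    # Step 6210: the notification carries both kinds of information and the host name.
    return {
        "type": "agent_stop",
        "host": host,
        "job_management_info": dict(monitoring),
        "job_execution_info": dict(execution_info),
    }


monitoring = {"job-a": "running", "job-b": "waiting"}
message = stop_agent("HOST1", monitoring, {}, lambda job: {"files": ["/data/" + job]})
```

Only the running job gets execution information here; the waiting job travels in the job management information alone.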

[0026] The job management program master 5000 stores the received job management information A4600 and job execution information A4700 in the agent job management information storage unit 4800 and the agent job execution information storage unit 4830, respectively (step 6235). Next, based on the information in the job execution computer group definition file 7010 stored in step 6130 of FIG. 7 described above, it selects one computer belonging to the same job execution computer group as the computer whose job management program agent 6000 stopped (step 6240). The job management program master 5000 then sends the job management information in the agent job management information storage unit 4800 and the job execution information in the agent job execution information storage unit 4830 to the job management program agent 6000 on the computer selected in step 6240 (step 6260).
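The master's side of the handover (steps 6235 to 6260) can be sketched as follows. The patent does not say how the computer is chosen within the group, so picking the first candidate is an assumption, as are the `send` transport callback, the message keys, and the `None` return when no other group member exists.

```python
def handle_agent_stop(message, group_definitions, storage, send):
    """Store the stopped agent's info and forward it to a computer in the same group.

    `storage` is a dict standing in for storage units 4800 and 4830;
    `send(host, payload)` is a hypothetical transport callback.
    """
    stopped = message["host"]
    # Step 6235: store the received information per agent.
    storage[stopped] = {
        "management": message["job_management_info"],
        "execution": message["job_execution_info"],
    }
    # Step 6240: pick one computer from the same group (here simply the first).
    candidates = [h for h, g in group_definitions.items()
                  if g == group_definitions[stopped] and h != stopped]
    if not candidates:
        return None
    target = candidates[0]
    # Step 6260: forward the stored information to the selected agent.
    send(target, storage[stopped])
    return target


sent = []
groups = {"HOST1": "GROUP1", "HOST2": "GROUP1", "HOST5": "GROUP2"}
message = {"host": "HOST1", "job_management_info": {"job-a": "suspended"},
           "job_execution_info": {"job-a": {}}}
target = handle_agent_stop(message, groups, {}, lambda h, p: sent.append((h, p)))
```

HOST5 is never considered because it belongs to a different job execution computer group.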

[0027] The job management program agent 6000 on the computer selected in step 6240 merges the received job management information and job execution information into its own job management information A4600 and job execution information A4700 (step 6270). It then makes the files of the jobs on the computer whose job management program agent 6000 stopped recognizable from the computer selected in step 6240 (step 6280). Two methods, for example, can be used for the processing in step 6280.
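The merge in step 6270 might be as simple as the sketch below. Representing both kinds of information as dictionaries keyed by job name, and rejecting a job name that already exists locally, are assumptions of this example; the patent does not describe how conflicts are handled.

```python
def merge_received_info(own_management, own_execution, received):
    """Step 6270: fold the stopped agent's jobs into the local information."""
    for job, state in received["management"].items():
        if job in own_management:
            # Conflict handling is not covered by the text; fail loudly here.
            raise ValueError(f"job name collision: {job}")
        own_management[job] = state
    own_execution.update(received["execution"])


own_management = {"job-x": "running"}
own_execution = {"job-x": {"files": ["/data/x"]}}
received = {"management": {"job-a": "suspended"},
            "execution": {"job-a": {"files": ["/data/a"]}}}
merge_received_info(own_management, own_execution, received)
```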

[0028] In the first method, the replica creation function provided by the distributed file management program 4200 is used, based on the job management information 4600 and the job execution information 4700, to create on the selected computer a replica of the files of the jobs on the computer whose job management program agent 6000 stopped. In the second method, the job management program agent 6000 on the computer selected in step 6240 refers to the job management information 4600 and the job execution information 4700 and uses the remote mount function of a network file system to mount, on the selected computer, the files of the jobs on the computer whose job management program agent 6000 stopped.
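The first method can be imitated locally with a plain file copy. This is only a stand-in: in the patent the replica is made by the distributed file management program 4200, whereas the sketch uses `shutil.copy2` between two temporary directories that play the stopped and the selected computer. The function name `replicate_job_files` and the file names are hypothetical. The second (remote mount) method is not sketched.

```python
import os
import shutil
import tempfile


def replicate_job_files(file_paths, replica_dir):
    """Copy each file a job used into `replica_dir` and return old -> new paths."""
    os.makedirs(replica_dir, exist_ok=True)
    mapping = {}
    for path in file_paths:
        target = os.path.join(replica_dir, os.path.basename(path))
        shutil.copy2(path, target)      # stand-in for the DFS replica facility
        mapping[path] = target
    return mapping


# Demo with temporary directories playing the stopped and the selected computer.
stopped_dir = tempfile.mkdtemp()
selected_dir = tempfile.mkdtemp()
source = os.path.join(stopped_dir, "input.dat")
with open(source, "w") as f:
    f.write("job data")
mapping = replicate_job_files([source], os.path.join(selected_dir, "replicas"))
```

The returned mapping lets the taking-over agent rewrite the file paths recorded in the job execution information so that the job opens the replicas.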

[0029] In this way, a job that was running on a computer whose job management program agent 6000 is being stopped can continue to be executed on a computer belonging to the same job execution computer group.

[0030] FIG. 9 is a flowchart outlining the processing performed when execution of a job reaches a checkpoint. The job management program agent 6000 uses its job execution state monitoring function to recognize that a job has reached a checkpoint (step 7100). On recognizing this, it sends the job management information A4600 and the job execution information A4700 to the job management program master 5000 as checkpoint information, notifying the master that the checkpoint has been reached (step 7110). The job management program master 5000 stores the checkpoint information sent from the job management program agent 6000 in the agent checkpoint information storage unit 4840 (step 7120).
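A minimal sketch of the master-side checkpoint storage follows. Keeping only the most recent checkpoint per agent is an assumption of this example (the text says only that the information is stored), and `CheckpointStore` with its methods is a hypothetical name.

```python
class CheckpointStore:
    """Stand-in for the agent checkpoint information storage units 4840."""

    def __init__(self):
        self._by_host = {}

    def record(self, host, management_info, execution_info):
        # Step 7120: keep the latest checkpoint per agent (an assumption).
        self._by_host[host] = {
            "management": dict(management_info),
            "execution": dict(execution_info),
        }

    def latest(self, host):
        """Return the stored checkpoint for `host`, or None if there is none."""
        return self._by_host.get(host)


store = CheckpointStore()
store.record("HOST1", {"job-a": "running"}, {"job-a": {"step": 1}})
store.record("HOST1", {"job-a": "running"}, {"job-a": {"step": 2}})
```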

[0031] FIG. 10 is a flowchart outlining the processing performed when a job terminates abnormally. The job management program agent 6000 recognizes from the monitor information A 4620 in the job management information A 4600 that a job has terminated abnormally (step 7200). It then notifies the job management program master 5000 of the abnormal termination, together with the host name of its computer (step 7210). Based on the information in the job execution computer group definition file 7010 acquired in step 6130 of FIG. 7 described above, the job management program master 5000 selects one computer belonging to the same job execution computer group as the computer on which the abnormal termination occurred (step 7230). The job management program master 5000 then sends the information stored in the agent checkpoint information storage unit 4840 to the job management program agent 6000 on the computer selected in step 7230 (step 7240).
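The master-side handling of steps 7230 and 7240 can be sketched as below. The specification says only that one computer in the same group is selected, so picking the first other member is an assumption of this sketch. The function names, the dictionary form of the group definition, and the `send` callback are likewise hypothetical.

```python
def select_takeover_host(groups, failed_host):
    """Step 7230: pick one other computer from the failed computer's group.

    groups maps a group name to the list of host names in that group,
    standing in for the job execution computer group definition file 7010.
    """
    for members in groups.values():
        if failed_host in members:
            candidates = [h for h in members if h != failed_host]
            if not candidates:
                raise LookupError(f"no other computer in the group of {failed_host}")
            return candidates[0]  # assumption: the first listed member is chosen
    raise ValueError(f"{failed_host} is not in any job execution computer group")


def handle_abnormal_end(groups, checkpoints, failed_host, send):
    """Steps 7230 to 7240, run by the master after the notification of step 7210.

    checkpoints maps a host name to its stored checkpoint information;
    send(target_host, info) stands in for the transfer to the selected agent.
    """
    target = select_takeover_host(groups, failed_host)
    send(target, checkpoints[failed_host])
    return target
```

A real selection policy might also consider load or availability; the sketch deliberately leaves that out because the text does not describe it.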

[0032] The job management program agent 6000 on the computer selected in step 7230 merges the received agent checkpoint information into its own job management information 4600 and job execution information 4700 (step 7250). It then makes the files of the job on the computer where the abnormal termination occurred recognizable (step 7260). In step 7260, the same method as in step 6280, described earlier for the processing performed when the job management program agent stops, is used so that the computer taking over the job can recognize the job's files on the computer where the abnormal termination occurred.
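The merge of step 7250 can be sketched as follows, assuming that both kinds of information are tables keyed by a job identifier. That representation, the function name `merge_checkpoint`, and the choice to reject colliding identifiers are all assumptions; the specification does not state how the merge is carried out.

```python
def merge_checkpoint(own_mgmt, own_exec, cp_mgmt, cp_exec):
    """Step 7250: fold received checkpoint entries into the agent's own tables.

    own_mgmt / own_exec stand in for this agent's job management information
    4600 and job execution information 4700; cp_mgmt / cp_exec are the
    corresponding parts of the received agent checkpoint information.
    Returns the identifiers of the jobs taken over.
    """
    taken_over = []
    for job_id, entry in cp_mgmt.items():
        if job_id in own_mgmt:
            # Assumption: identifiers are unique within a group, so a
            # collision is treated as an error rather than overwritten.
            raise ValueError(f"job id {job_id!r} already exists on this computer")
        own_mgmt[job_id] = entry
        own_exec[job_id] = cp_exec.get(job_id, {})
        taken_over.append(job_id)
    return taken_over
```

After the merge, the returned identifiers tell the agent which jobs need their files made recognizable (step 7260) before re-execution.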

[0033] Through the above processing, a job on a computer where an abnormal termination occurred can be re-executed on a computer belonging to the same job execution computer group as that computer.

[0034] According to the embodiment described above, a job that was being executed on a computer that has stopped within the distributed computer system, whether because its service ended or because a failure occurred, can be taken over and executed by another computer in the system. In addition, because the computers in the system are divided into groups and the jobs of a stopped computer are taken over by a computer in the same group, processing can be taken over without a major change to the execution environment.

[0035]

[Effects of the Invention] As described above, according to the present invention, a job on a stopped computer in a distributed system can be taken over and executed on another computer in the system, which enables both continued execution of jobs across computers and checkpoint restart.

[Brief description of drawings]

FIG. 1 is a hardware configuration diagram of a computer system that implements the job management method for a computer system according to the present invention.

FIG. 2 is a block diagram showing the detailed configuration of the computer 1000.

FIG. 3 is a configuration diagram showing the logical configuration of the control software.

FIG. 4 is a configuration diagram showing the main information held by the job management program master 5000 and the job management program agent 6000.

FIG. 5 is a flowchart outlining the processing performed when the job management program master 5000 starts up.

FIG. 6 is a file configuration diagram showing the contents of the job execution computer group definition file 7010.

FIG. 7 is a flowchart outlining the processing performed when the job management program agent 6000 starts up.

FIG. 8 is a flowchart outlining the processing performed when the job management program agent 6000 stops.

FIG. 9 is a flowchart outlining the processing performed when job execution reaches a checkpoint.

FIG. 10 is a flowchart outlining the processing performed when a job terminates abnormally.

[Explanation of symbols]

1000: computer, 1100: central processing unit (CPU), 2000: external storage device, 2500: I/O adapter, 3000: memory, 3500: application program (AP), 4000: control software, 4100: operating system (OS), 4200: distributed file management program, 4500: job management program, 4600: job management information A, 4700: job execution information A, 4601: job management information M, 4701: job execution information M, 4800: agent job management information storage unit, 4830: agent job execution information storage unit, 4840: agent checkpoint information storage unit, 5000: job management program master, 6000: job management program agent, 7010: job execution computer group definition file, 8000: bus, 9000: local area network (LAN), 9500: LAN adapter, 9600: wide area network (WAN), 9700: gateway.

Continuation of front page: (72) Inventor: Masaaki Oya, c/o Software Development Division, Hitachi, Ltd., 5030 Totsuka-cho, Totsuka-ku, Yokohama, Kanagawa Prefecture

Claims

[Claims]

1. A job management method for a computer system in which a plurality of computers are connected to each other via a network, a job management program that manages the jobs to be executed by each computer runs on each computer, and some of the job management programs (hereinafter referred to as job management program masters) manage the management information held by the job management programs (hereinafter referred to as job management program agents) on the computers under their control, the method comprising: a step in which the job management program master groups the computers that execute jobs and transmits information about the grouping to each of the job management program agents; a step in which, when a job management program agent is to stop, the stopping job management program agent notifies the job management program master that it will stop and sends to the job management program master the execution information of the jobs being executed on the computer on which it runs and the job management information it holds; a step in which the job management program master selects, from the other computers belonging to the same group as the computer on which the stopping job management program agent was running, a computer to execute the jobs that were being executed on that computer, and transmits the job execution information and job management information received from the stopping job management program agent to the job management program agent on the selected computer; and a step in which the job management program agent on the selected computer executes the jobs that were being executed on the computer on which the stopped job management program agent was running.

2. The job management method for a computer system according to claim 1, further comprising a step in which, before a job that was being executed on the computer on which the stopped job management program agent was running is executed on the selected computer belonging to the same group, the job management program agent on the selected computer creates, on the computer on which it runs, a replica of the files used by the job that was running on the computer where the job management program agent stopped.

3. The job management method for a computer system according to claim 1, further comprising a step in which, before a job that was being executed on the computer where the job management program agent stopped is executed on the selected computer, the job management program agent on the selected computer makes the files on the computer where the job management program agent stopped recognizable from the computer on which it runs.

4. A job management method for a computer system in which a plurality of computers are connected to each other via a network, a job management program that manages the jobs to be executed by each computer runs on each computer, and some of the job management programs (hereinafter referred to as job management program masters) manage the management information held by the job management programs (hereinafter referred to as job management program agents) on the computers under their control, the method comprising: a step in which the job management program master groups the computers that execute jobs and transmits information about the grouping to each of the job management program agents; a step in which the job management program agent, at a checkpoint, transmits to the job management program master information on the status of the resources related to its jobs and the job management information it holds (hereinafter referred to as checkpoint information); a step in which the job management program master stores the checkpoint information sent from the job management program agent; a step in which, when a job terminates abnormally, the job management program agent running on the computer where the abnormal termination occurred notifies the job management program master of the abnormal termination; a step in which the job management program master selects, from the other computers belonging to the same group as the computer where the abnormal termination occurred, a computer to re-execute the job, and transmits the checkpoint information of the computer where the abnormal termination occurred to the job management program agent on the selected computer; and a step in which the job management program agent on the selected computer re-executes the job of the computer where the abnormal termination occurred.

5. The job management method for a computer system according to claim 4, further comprising a step in which, before the abnormally terminated job is re-executed on the selected computer, the job management program agent running on the selected computer creates, on the selected computer, a replica of the files used by the job that was running on the computer where the abnormal termination occurred.

6. The job management method for a computer system according to claim 4, further comprising a step in which, before the abnormally terminated job is re-executed on the selected computer, the job management program agent running on the selected computer makes the files that the abnormally terminated job used on the computer where the abnormal termination occurred recognizable from the selected computer.