JP3139536B2

JP3139536B2 - Distributed batch job processing system and automatic job restart method in the event of failure

Info

Publication number: JP3139536B2
Application number: JP09135039A
Authority: JP
Inventors: 公士田淵
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1997-05-26
Filing date: 1997-05-26
Publication date: 2001-03-05
Anticipated expiration: 2017-05-26
Also published as: JPH10326201A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Field of the Invention] The present invention relates to a distributed batch job processing system composed of a plurality of computers, each having a batch job processing function, and more particularly to a method for automatically restarting jobs when a failure occurs.

[0002]

[Description of the Related Art] Conventionally, a distributed batch job processing system of this type re-executes jobs automatically as follows. The system comprises a plurality of computers, each of which executes batch jobs, and a re-submission computer that schedules and submits the batch jobs. When a failure occurs in one of the computers executing batch jobs, the re-submission computer resubmits the batch job that was being processed on the failed computer to another execution computer in which no failure has occurred, thereby realizing automatic restart.

[0003] One example of such a conventional distributed batch job processing system is the "job re-execution control method for a loosely coupled multiplex system" disclosed in Japanese Unexamined Patent Publication No. 7-175766 (hereinafter referred to as prior art 1). In prior art 1, when a failure occurs in an n-th host computer, a designated m-th host computer promptly re-executes the jobs that were being executed. Specifically, in a first host computer, job control language translation means translates the job control language and registers, in job management information holding means, the designation of the m-th host computer that is to re-execute the jobs of the n-th host computer. Host failure recognition means recognizes a failure notification from a host monitoring device. Job re-execution preparation means updates the job control information in the job management information holding means and requests re-execution of the jobs that were being executed on the failed n-th host computer. Job scheduling means reschedules those jobs and requests the m-th host computer to re-execute them. Job activation means starts the execution program of each job whose execution has been requested.

[0004] Japanese Unexamined Patent Publication No. 8-227368 (hereinafter referred to as prior art 2) discloses a "job re-execution method" that makes it easy to detect the processing unit in which a failure occurred and enables efficient re-execution of a job. Prior art 2 comprises: a log execution control unit and a log file for monitoring, individually at the processing-unit level, the configuration and execution state of a batch-type job running on a computer system; a state display control unit and a display device for displaying the job and the execution state of each processing unit on a display screen; and a re-execution control unit, a process execution control unit, and a job configuration management table for re-executing the job from the designated process onward when a command that designates the job and the process and instructs re-execution is entered from the display screen of the display device.

[0005] Further, Japanese Unexamined Patent Publication No. 2-253441 (hereinafter referred to as prior art 3) discloses a "device switching method for a computer system" that can automatically switch devices and re-execute a job when a device failure occurs. In prior art 3, when a device failure occurs during job execution, device failure receiving means receives notification of the failure, and device recoverability determination means determines whether or not the device can be recovered. When it is determined that the device can be recovered, device recovery instruction output means instructs recovery of the device, the recovery work is performed in response to this instruction, and job re-execution means automatically re-executes the job. When it is determined that the device cannot be recovered, automatic device switching means automatically switches the assignment of the running job to another device, medium mount message output means outputs a message requesting that the medium be set in the switched-to device so that this work is carried out, and the job re-execution means automatically re-executes the job that was being executed. As a result, even if a device failure occurs, the job does not end in error: the device is switched automatically and the job is re-executed, and where recovery is possible the job is re-executed automatically after recovery.

[0006]

[Problems to Be Solved by the Invention] The conventional automatic job re-execution method described above (prior art 1) has the following problems. First, in addition to the computers that execute jobs, a re-submission computer that performs the re-submission processing is required. Second, if a failure occurs in the re-submission computer itself, no restart processing is performed. Third, it is technically difficult to detect a computer abnormality correctly, and if an abnormality is detected erroneously there is a risk that a job will be processed twice.

[0007] Accordingly, an object of the present invention is to provide a distributed batch job processing system of simple configuration, composed of a plurality of computers that perform batch job processing, in which, when a failure occurs in one of the computers, batch jobs can be re-executed automatically without being started twice.

[0008] Note that neither prior art 2 nor prior art 3 is a distributed batch job processing system composed of a plurality of computers.

[0009]

[Means for Solving the Problems] According to a first aspect of the present invention, there is provided a distributed batch job processing system comprising: at least one active computer that performs batch job processing in normal operation; at least one alternative computer that takes over processing when a failure occurs in the active computer; failure detection means for detecting a failure of the active computer; and a shared disk device having connection switching means for changing the connection from the active computer to the alternative computer when a failure occurs in the active computer, wherein the alternative computer comprises a job re-submission means group for retrieving information from the shared disk device and resubmitting jobs to the alternative computer when a failure occurs in the active computer, and wherein the job re-submission means group comprises: job information reading means for retrieving the job information stored in the shared disk device; job re-submission means for resubmitting, to the alternative computer, the job indicated by the job information retrieved by the job information reading means; and job deletion means for deleting the information of the submitted job from the shared disk device.

[0010] According to a second aspect of the present invention, there is provided a distributed batch job processing system having a plurality of computers that perform batch processing, the system including failure detection means for detecting a failure of any of the computers, wherein each of the plurality of computers is connected to a shared disk device that stores the job information of that computer, and each shared disk device has connection switching means for changing, when a failure occurs, the connection from the computer that normally performs the batch processing to the computer that takes over its jobs, wherein each of the plurality of computers comprises a job re-submission means group for retrieving information from the shared disk device and resubmitting jobs to the alternative computer when a failure occurs in the computer that normally performs the batch processing, and wherein the job re-submission means group comprises: job information reading means for retrieving the job information stored in the shared disk device connected to another computer; job re-submission means for resubmitting, to the alternative computer, the job indicated by the job information retrieved by the job information reading means; and job deletion means for deleting the information of the submitted job from the shared disk device.

[0011]

[Operation] The shared disk device is connected to only one of the computers at any time. The connection is switched to the other, predefined computer when the failure detection means detects a failure in the computer to which the device is currently connected. Job information is recorded on the shared disk device. In normal operation, processing proceeds as usual, and the job information is deleted from the shared disk device when execution of the job ends. When a failure occurs in one of the job-executing computers, the computer that is operating normally retrieves the job information from the shared disk device, resubmits the job to itself, and at the same time deletes the job information left on the shared disk device.
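
The operation summarized in paragraph [0011] can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions, not the patented implementation: the class `SharedDisk`, the function `take_over`, and the computer names are hypothetical, and "re-submitting" a job is reduced to returning it in a list.

```python
class SharedDisk:
    """Toy model of a shared disk attached to exactly one computer at a time."""

    def __init__(self, owner):
        self.owner = owner      # name of the computer currently connected
        self.job_info = {}      # job id -> job description

    def _check(self, computer):
        # Only the currently connected computer may touch the disk.
        if computer != self.owner:
            raise PermissionError(f"{computer} is not connected to the shared disk")

    def record(self, computer, job_id, info):
        self._check(computer)
        self.job_info[job_id] = info

    def read_all(self, computer):
        self._check(computer)
        return dict(self.job_info)

    def delete(self, computer, job_id):
        self._check(computer)
        del self.job_info[job_id]

    def switch_to(self, computer):
        # Stand-in for the connection switching means.
        self.owner = computer


def take_over(disk, standby):
    """Switch the disk to the standby computer, then resubmit and delete each job."""
    disk.switch_to(standby)
    resubmitted = []
    for job_id, info in disk.read_all(standby).items():
        resubmitted.append((job_id, info))   # placeholder for real re-submission
        disk.delete(standby, job_id)
    return resubmitted
```

In this model a computer that is not the current owner cannot read or write job information, which is the property the description relies on to keep a job from being started twice.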

[0012]

[Embodiments of the Invention] Embodiments of the present invention will be described below in detail with reference to the drawings.

[0013] A distributed batch job processing system according to a first embodiment of the present invention will be described with reference to FIG. 1. The programs executed by the computers that make up the distributed batch job processing system may be recorded on a recording medium (not shown). Here, "recording medium" means a computer-readable recording medium on which a program is recorded; specifically, it includes a CD-ROM, a magnetic disk such as a flexible disk, a semiconductor memory, and the like. The recording medium may even be paper on which a program is recorded. In that case, the computer need only have a reading device such as an OCR (optical character reader) and a compiler that translates the characters (code) read by the reading device into a machine language that the computer can recognize. In any case, installing the program recorded on the recording medium on a computer allows the computer to perform the prescribed processing.

[0014] The illustrated distributed batch job processing system comprises first and second computers 1 and 2, which perform batch job processing at all times, an external storage device 3 connected to the first computer 1, and a shared disk device 4 connected to the first and second computers 1 and 2. The shared disk device 4 is a kind of external storage device.

[0015] The first computer 1 has a first job processing means group 10-1 and a job re-submission means group 20. The first job processing means group 10-1 performs ordinary batch job processing. The job re-submission means group 20 retrieves information from the shared disk device 4 and resubmits jobs to the first job processing means group 10-1 when a failure occurs in the second computer 2.

[0016] The second computer 2 has a second job processing means group 10-2 similar to the first job processing means group 10-1 implemented in the first computer 1.

[0017] The external storage device 3 stores first job information 31-1 for jobs submitted to the first computer 1 and a first job execution result 32-1 that a job generates when it is executed by the first computer 1.

[0018] In much the same way as the external storage device 3, the shared disk device 4 stores second job information 31-2 for jobs submitted to the second computer 2 and a second job execution result 32-2 that a job generates when it is executed by the second computer 2. The shared disk device 4 further has failure detection means 41 and connection switching means 42. The failure detection means 41 monitors the operation of the first and second computers 1 and 2 and, when an abnormality occurs, instructs the connection switching means 42 to switch the connection and instructs the job re-submission means group 20 to carry out re-submission. The connection switching means 42 switches the connection between the first computer 1 and the second computer 2 so that only one of them can be connected at any one time.
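
The source does not say how the failure detection means 41 decides that a computer has failed. As one possible illustration, the sketch below assumes a heartbeat with a timeout; the class `FailureDetector`, its parameters, and the `on_failure` callback (standing in for the instructions sent to the connection switching means 42 and the job re-submission means group 20) are all hypothetical.

```python
class FailureDetector:
    """Hypothetical heartbeat-based stand-in for failure detection means 41."""

    def __init__(self, timeout, on_failure):
        self.timeout = timeout          # maximum allowed silence, in time units
        self.on_failure = on_failure    # called once for each failed computer
        self.last_seen = {}             # computer name -> time of last heartbeat
        self.failed = set()

    def heartbeat(self, computer, now):
        # A monitored computer reports that it is alive at time `now`.
        self.last_seen[computer] = now

    def check(self, now):
        # Report every computer that has been silent for longer than the timeout.
        for computer, seen in self.last_seen.items():
            if computer not in self.failed and now - seen > self.timeout:
                self.failed.add(computer)
                self.on_failure(computer)
```

Passing the time explicitly keeps the sketch deterministic. A real detector would also have to cope with false positives, which is why the description pairs detection with switching of the disk connection.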

[0019] In the first embodiment, the second computer 2 is called the active computer and the first computer 1 is called the alternative computer.

[0020] The first job processing means group 10-1 has first job input means 101-1, first job receiving means 102-1, first job information recording means 103-1, first job execution means 104-1, first job end processing means 105-1, and first operator notification means 106-1.

[0021] The first job input means 101-1 accepts job input from an operator or the like. The first job receiving means 102-1 decides whether to actually accept a job entered through the first job input means 101-1. The first job information recording means 103-1 stores a job accepted by the first job receiving means 102-1 in the external storage device 3 as the first job information 31-1. The first job execution means 104-1 retrieves the first job information 31-1 recorded in the external storage device 3 and executes the job. The first job end processing means 105-1 waits for execution of the job to end and then erases the first job information 31-1 stored in the external storage device 3. The first operator notification means 106-1 waits for the first job end processing means 105-1 to complete and then notifies the operator that the job has ended. The first operator notification means 106-1 also notifies the operator when the first job receiving means 102-1 has refused to accept a job.

[0022] Similarly, the second job processing means group 10-2 has second job input means 101-2, second job receiving means 102-2, second job information recording means 103-2, second job execution means 104-2, second job end processing means 105-2, and second operator notification means 106-2.

[0023] The second job input means 101-2 accepts job input from an operator or the like. The second job receiving means 102-2 decides whether to actually accept a job entered through the second job input means 101-2. The second job information recording means 103-2 stores a job accepted by the second job receiving means 102-2 in the shared disk device 4 as the second job information 31-2. The second job execution means 104-2 retrieves the second job information 31-2 recorded in the shared disk device 4 and executes the job. The second job end processing means 105-2 waits for execution of the job to end and then erases the second job information 31-2 stored in the shared disk device 4. The second operator notification means 106-2 waits for the second job end processing means 105-2 to complete and then notifies the operator that the job has ended. The second operator notification means 106-2 also notifies the operator when the second job receiving means 102-2 has refused to accept a job.

[0024] The job re-submission means group 20 has job information reading means 201, job re-submission means 202, and job deletion means 203. The job information reading means 201 reads the second job information 31-2 from the shared disk device 4. The job re-submission means 202 corrects the second job information 31-2 read by the job information reading means 201 into appropriate information and attempts to resubmit the job through the first job receiving means 102-1. The job deletion means 203 deletes from the shared disk device 4 the second job information 31-2 that has become unnecessary as a result of the re-submission.
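
The three members of the job re-submission means group 20 map naturally onto three small functions. The sketch below is a hedged illustration only: the shared disk is modelled as a plain dictionary, `accept` is a hypothetical callback standing in for the first job receiving means 102-1, and the "correction" applied to the job information (rewriting a `host` field) is an assumption, since the source does not say what is corrected.

```python
def read_job_info(shared_disk):
    """Job information reading means 201: take a snapshot of the stored job info."""
    return dict(shared_disk)


def resubmit(job_id, info, accept, new_host):
    """Job re-submission means 202: adjust the info, then offer it to the receiver."""
    corrected = {**info, "host": new_host}   # assumed correction: retarget the host
    return accept(job_id, corrected)


def delete_job_info(shared_disk, job_id):
    """Job deletion means 203: remove the entry that is no longer needed."""
    shared_disk.pop(job_id, None)


def resubmit_all(shared_disk, accept, new_host):
    """Read every job, resubmit it, and delete it from the shared disk."""
    results = {}
    for job_id, info in read_job_info(shared_disk).items():
        results[job_id] = resubmit(job_id, info, accept, new_host)
        delete_job_info(shared_disk, job_id)
    return results
```

Iterating over a snapshot rather than the live dictionary lets entries be deleted inside the loop without invalidating the iteration.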

[0025] Next, the distributed batch job processing system according to the first embodiment will be described with reference to FIGS. 1 and 2.

[0026] In normal operation, the first computer 1 and the second computer 2 carry out the operation of flow F3 using the first and second job processing means groups 10-1 and 10-2, respectively. In the initial state, the shared disk device 4 is connected to the second computer 2.

[0027] First, a computer operator enters a job using the first and second job input means 101-1 and 101-2 (step S31). The entered job is checked for abnormalities in its job information, such as its attributes and the operator's authority (step S32).

[0028] If submission of the job is permitted, the first computer 1 records the first job information 31-1 in the external storage device 3 by means of the first job information recording means 103-1, and the second computer 2 records the second job information 31-2 in the shared disk device 4 by means of the second job information recording means 103-2 (step S33). If submission of the job is refused, the operator is notified of the refusal, in the first computer 1 through the first operator notification means 106-1 and in the second computer 2 through the second operator notification means 106-2 (step S36).

[0029] After the job information has been recorded, the first job execution means 104-1 in the first computer 1 executes the job on the basis of the recorded first job information 31-1, and the second job execution means 104-2 in the second computer 2 executes the job on the basis of the recorded second job information 31-2 (step S34). The job writes its execution result to the external storage device 3 in the case of the first computer 1 and to the shared disk device 4 in the case of the second computer 2.

[0030] Once execution of the job has ended, the first job end processing means 105-1 in the first computer 1 deletes the first job information 31-1 from the external storage device 3, and the second job end processing means 105-2 in the second computer 2 deletes the second job information 31-2 from the shared disk device 4 (step S35).

[0031] After the job information has been deleted, the first computer 1 uses the first operator notification means 106-1, and the second computer 2 uses the second operator notification means 106-2, to report that execution of the job completed normally, and processing ends (step S36).
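
Steps S31 to S36 of flow F3 amount to a short pipeline: validate, record, execute, delete, notify. The following sketch illustrates that ordering under simplifying assumptions: `process_job` and its callback parameters are hypothetical names, storage is a dictionary, and a job is a dictionary with an `id` key.

```python
def process_job(job, storage, validate, run, notify):
    """Run one job through steps S31 to S36 of flow F3 (simplified)."""
    # S31/S32: the job has been entered; check its attributes and permissions.
    if not validate(job):
        notify(f"job {job['id']} rejected")      # S36, rejection path
        return None
    storage[job["id"]] = job                     # S33: record the job information
    result = run(job)                            # S34: execute the job
    del storage[job["id"]]                       # S35: delete the job information
    notify(f"job {job['id']} completed")         # S36, normal completion
    return result
```

Because the job information is recorded before execution and deleted only afterwards, any entry still present in storage after a crash identifies a job that did not finish, which is what the takeover logic depends on.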

[0032] Next, the operation when a failure occurs in the second computer 2 will be described. First, the shared disk device 4 detects the failure by means of the failure detection means 41 (step S11). The failure detection means 41 then instructs the connection switching means 42 to switch the connection from the second computer 2 to the first computer 1 (step S12).

[0033] The failure detection means 41 further instructs the job information reading means 201 on the normally operating first computer 1 to resubmit the second job information 31-2 held on the shared disk device 4 (step S13).

[0034] On receiving the instruction, the job information reading means 201 reads the second job information 31-2 from the shared disk device 4 (step S21). The job is then resubmitted by the job re-submission means 202 using the second job information 31-2 that was read (step S22). After re-submission, the job is processed by the first job processing means group 10-1 in the same way as in normal operation (point 2 of flow F3).

[0035] Separately from the flow of job processing, in the following step S23 the job deletion means 203 deletes the second job information 31-2 that the second computer 2 created on the shared disk device 4.

[0036] Next, the operation of the distributed batch job processing system shown in FIG. 1 will be described in detail.

[0037] In the initial state, the line of the shared disk device 4 is connected to the second computer 2. First, an operator submits a job A to the second computer 2. This submission is handled by the second job input means 101-2 on the second computer 2. The second job receiving means 102-2 checks the job's attributes and authority; assume that, as a result, the submission is accepted normally. Next, the second job information recording means 103-2 records job A on the shared disk device 4 in the form of the second job information 31-2. At this point, submission of job A is complete. The second job execution means 104-2 then retrieves the information of job A and executes job A on the second computer 2. Job A creates its output file as the second job execution result 32-2 on the shared disk device 4, which also serves as the external storage device of the second computer 2.

[0038] Suppose that a failure occurs in the second computer 2 at this point. The failure detection means 41 on the shared disk device 4 detects the event and issues a switching instruction to the connection switching means 42. In response, the connection switching means 42 stops the connection with the second computer 2 and starts the connection with the first computer 1. Because the connection with the second computer 2 has been cut, job A, which was running on the second computer 2, can no longer update its execution result and is in effect unable to continue executing. In addition, since the second job information 31-2 can no longer be referenced from the second computer 2, no new jobs are submitted there either.
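
Paragraph [0038] explains why cutting the disk connection stops the old copy of the job even if the second computer is in fact still running. The sketch below simulates that effect; it is an illustration under assumptions, with `Disk`, `run_job`, and the `switch_after` parameter invented for the example, and a job modelled as a loop that writes one result line per step.

```python
class Disk:
    """Shared disk that accepts result writes only from the connected computer."""

    def __init__(self, owner):
        self.owner = owner
        self.results = []

    def append_result(self, computer, line):
        if computer != self.owner:
            raise ConnectionError(f"{computer} has been disconnected")
        self.results.append(line)


def run_job(disk, computer, steps, switch_after=None, new_owner=None):
    """Write one result line per step; optionally simulate a switch mid-run."""
    done = 0
    for i in range(steps):
        if switch_after is not None and i == switch_after:
            disk.owner = new_owner        # the connection switching means acts here
        try:
            disk.append_result(computer, f"{computer}: step {i}")
        except ConnectionError:
            break                         # the job cannot make further progress
        done += 1
    return done
```

With `switch_after=2`, a job started on the original owner completes only two steps before its writes are refused, while a job started afterwards on the new owner runs to completion.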

[0039] The failure detection means 41 also instructs the job information reading means 201 on the normally operating first computer 1 to begin operating.

[0040] In the first computer 1, the job information reading means 201 collects the second job information 31-2 of job A from the shared disk device 4. Job A is then submitted to the first job receiving means 102-1 of the first computer 1. Assume that the acceptance processing completes normally. The information of job A is now recorded as the first job information 31-1 in the external storage device 3 of the first computer 1. Next, the first job execution means 104-1 of the first computer 1 retrieves the information of job A and executes the job. At this time, job A can refer to the partial execution result of the previous run that remains on the shared disk device 4, and depending on how the job is implemented it is not impossible for the job to continue from where it left off. If the job does not refer to that result, job A is executed entirely from the beginning.
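
Whether a resubmitted job resumes or starts over depends on the job itself, as paragraph [0040] notes. One way a job might resume is sketched below, assuming (hypothetically) that the job processes a list of items in order and writes exactly one output record per finished item, so that the number of records left by the previous run tells it where to continue. The function name `run_with_resume` and the squaring workload are placeholders.

```python
def run_with_resume(items, previous_results=None):
    """Process items in order, skipping those already covered by earlier output."""
    results = list(previous_results or [])   # copy, so the caller's list is untouched
    start = len(results)                     # assumed: one record per finished item
    for item in items[start:]:
        results.append(item * item)          # placeholder for the real work
    return results, start
```

Calling the function without `previous_results` corresponds to the case in which the job ignores the earlier output and runs from the beginning.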

[0041] When execution of job A is complete, the first job end processing means 105-1 deletes the first job information 31-1 of job A from the external storage device 3. Since processing of job A is now complete, the operator is notified of the completion of job execution through the first operator notification means 106-1.

【００４２】もし、ジョブＡを第１のコンピュータ１の
第１のジョブ受理手段１０２−１に対して投入したとき
に、第２のジョブ情報３１−２のジョブＡの情報が不完
全であった場合などには、ジョブＡの処理が不可能なこ
とがある。この場合、第１のジョブ受理手段１０２−１
がその異常を検出し、ジョブの受理を拒否し、操作員に
第１の操作員通知手段１０６−１を用いてその旨を通知
する。When job A is submitted to the first job receiving means 102-1 of the first computer 1, job A may be impossible to process, for example because the information of job A in the second job information 31-2 is incomplete. In this case, the first job receiving means 102-1 detects the abnormality, refuses to accept the job, and notifies the operator to that effect using the first operator notification means 106-1.
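The acceptance, execution, completion, and rejection steps described above can be sketched as follows. This is only an illustrative sketch: the class name `Computer`, the field names `name` and `command`, and the operator log are assumptions introduced here, since the specification does not define what job information contains or how the operator is notified.

```python
from dataclasses import dataclass, field

# Hypothetical required fields; the specification does not list them.
REQUIRED_FIELDS = ("name", "command")


@dataclass
class Computer:
    local_jobs: dict = field(default_factory=dict)    # stands in for job information 31-1
    operator_log: list = field(default_factory=list)  # stands in for operator notification

    def accept(self, job_info: dict) -> bool:
        """Job receiving step: refuse incomplete job information and notify the operator."""
        missing = [k for k in REQUIRED_FIELDS if not job_info.get(k)]
        if missing:
            self.operator_log.append(f"rejected: missing {missing}")
            return False
        self.local_jobs[job_info["name"]] = job_info  # record the accepted job
        return True

    def run(self, name: str) -> None:
        """Execute the job, delete its job information, then notify the operator."""
        job = self.local_jobs.pop(name)  # job end processing removes the entry
        # A real system would launch job["command"] here; this sketch only simulates it.
        self.operator_log.append(f"completed: {job['name']}")
```

In this sketch a job whose information is incomplete is never recorded locally, so it cannot be executed by mistake after the rejection.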

【００４３】図３を参照して、本発明の第２の実施の形
態に係る分散バッチジョブ処理システムについて説明す
る。この第２の実施の形態に係る分散バッチジョブ処理
システムでは、上記第１の実施の形態における第１のコ
ンピュータ１用の外部記憶装置３に置き換えて、第２の
コンピュータ２用の同様の共有ディスク装置を使用して
いることである。ここでは、第１のコンピュータ１用の
共有ディスク装置に参照符号４−１を付して第１の共有
ディスク装置と呼び、第２のコンピュータ２用の共有デ
ィスク装置に参照符号４−２を付して第２の共有ディス
ク装置と呼ぶことにする。そして、第２のコンピュータ
２は、第１の実施の形態における第１のコンピュータ１
と同様にジョブ再投入手段群を有する。ここでは、第１
のコンピュータ１のジョブ再投入手段群に参照符号２０
−１を付して第１のジョブ再投入手段群と呼び、第２の
コンピュータ２のジョブ再投入手段群に参照符号２０−
２を付して第２のジョブ再投入手段群と呼ぶことにす
る。つまり、第１のコンピュータ１と第２のコンピュー
タ２の構成は全く同一となる。A distributed batch job processing system according to a second embodiment of the present invention will be described with reference to FIG. 3. The distributed batch job processing system according to the second embodiment differs in that the external storage device 3 for the first computer 1 in the first embodiment is replaced with a shared disk device of the same kind as that used for the second computer 2. Here, the shared disk device for the first computer 1 is given reference numeral 4-1 and called the first shared disk device, and the shared disk device for the second computer 2 is given reference numeral 4-2 and called the second shared disk device. In addition, the second computer 2 has a job re-submitting means group, as does the first computer 1 in the first embodiment. Here, the job re-submitting means group of the first computer 1 is given reference numeral 20-1 and called the first job re-submitting means group, and that of the second computer 2 is given reference numeral 20-2 and called the second job re-submitting means group. In other words, the first computer 1 and the second computer 2 have exactly the same configuration.

【００４４】このとき、２つの共有ディスク装置４−１
および４−２の初期接続は、第１の共有ディスク装置４
−１は第１のコンピュータ１に、第２の共有ディスク装
置４−２は第２のコンピュータ２に接続されているとす
る。また、２つの共有ディスク装置４−１および４−２
の障害検出手段４１−１および４１−２は、障害に関す
る情報を共有し、同期して動作する。つまり、第１の共
有ディスク装置４−１の第１の障害検出手段４１−１が
障害を検出すると、その障害を検出した旨が同時に第２
の共有ディスク装置４−２の第２の障害検出手段４１−
２にも通知される。また、その逆も行われる。At this time, as for the initial connections of the two shared disk devices 4-1 and 4-2, the first shared disk device 4-1 is assumed to be connected to the first computer 1 and the second shared disk device 4-2 to the second computer 2. The failure detection means 41-1 and 41-2 of the two shared disk devices 4-1 and 4-2 share failure information and operate in synchronization. That is, when the first failure detection means 41-1 of the first shared disk device 4-1 detects a failure, the detection is simultaneously reported to the second failure detection means 41-2 of the second shared disk device 4-2. The reverse also holds.

【００４５】次に、第２のコンピュータ２で障害が発生
したときの動作について説明する。第１の実施の形態と
異なることは、第２の共有ディスク装置４−２ばかりで
なく、第１の共有ディスク装置４−１も障害を検出する
ことである。ただし、第１の共有ディスク装置４−１は
初期接続で第１のコンピュータ１に接続されているの
で、接続変更は行われない。よって、このときの動作は
前述した第１の実施の形態と同様になる。Next, the operation when a failure occurs in the second computer 2 will be described. The difference from the first embodiment is that not only the second shared disk device 4-2 but also the first shared disk device 4-1 detects a failure. However, since the first shared disk device 4-1 is connected to the first computer 1 by the initial connection, the connection is not changed. Therefore, the operation at this time is similar to that of the first embodiment.

【００４６】第１のコンピュータ１で障害が発生したと
きは、逆に第１の共有ディスク装置４−１の接続切替え
が行われ、第２の共有ディスク装置４−２の接続変更は
行われない。このあと、第１の共有ディスク装置４−１
の第１の障害検出手段４１−１によって第２のコンピュ
ータ２上の第２のジョブ再投入手段群２０−２が起動さ
れ再投入処理が行われる。Conversely, when a failure occurs in the first computer 1, the connection of the first shared disk device 4-1 is switched, and the connection of the second shared disk device 4-2 is not changed. Thereafter, the first failure detection means 41-1 of the first shared disk device 4-1 activates the second job re-submitting means group 20-2 on the second computer 2, and the re-submission processing is performed.

【００４７】これにより、第１のコンピュータ１、第２
のコンピュータ２のいずれで障害が発生した場合でも、
相互にジョブの自動的な再実行を実現することができ
る。Thus, whichever of the first computer 1 and the second computer 2 fails, the two computers can automatically re-execute each other's jobs.
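The mutual takeover of the second embodiment can be illustrated with a short sketch. The names `SharedDisk` and `on_failure` are hypothetical, and the rule for picking the surviving computer (the first one that has not failed) is an assumption made for the sketch; the specification only describes the two-computer case in detail.

```python
class SharedDisk:
    """Stand-in for a shared disk device with connection switching."""

    def __init__(self, name, owner):
        self.name = name
        self.attached_to = owner  # initial connection
        self.job_info = {}        # job information stored on the disk


def on_failure(failed, computers, disks):
    """Switch the disks attached to the failed computer to a surviving one.

    Every disk is told about the failure, but only a disk that is currently
    attached to the failed computer changes its connection.
    Returns the survivor and the names of the disks that were switched.
    """
    survivor = next(c for c in computers if c != failed)  # assumed selection rule
    switched = []
    for disk in disks:
        if disk.attached_to == failed:
            disk.attached_to = survivor
            switched.append(disk.name)
    return survivor, switched
```

After the call, the survivor would start its own job re-submitting means group against the switched disks; that step is left out here to keep the sketch short.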

【００４８】本発明は上述した実施形態に限定せず、本
発明の趣旨を逸脱しない範囲内で種々の変更・変形が可
能である。例えば、上述した実施の形態では、コンピュ
ータが２台の場合について述べているが、３台以上ある
場合にも同様に適用できる。また、上述した実施の形態
では、共有ディスク装置が障害検出手段を備えている
が、共有ディスク装置とは別に障害検出手段を設けても
良い。The present invention is not limited to the above-described embodiment, and various changes and modifications can be made without departing from the spirit of the present invention. For example, in the above-described embodiment, the case where there are two computers is described. However, the present invention can be similarly applied to a case where there are three or more computers. Further, in the above-described embodiment, the shared disk device includes the failure detection unit. However, a failure detection unit may be provided separately from the shared disk device.

【００４９】[0049]

【発明の効果】以上説明したように、本発明では、次に
述べるような効果を奏する。As described above, the present invention has the following effects.

【００５０】第１の効果は、障害検出時に、ジョブの再
投入を行うときにジョブの二重起動の危険性を回避する
ことができることである。その理由は、障害検出時に、
障害の発生したコンピュータで使用していたジョブ情報
とジョブ実行結果を記録した共有ディスク装置の接続を
強制的に切替えることにより、障害の発生したコンピュ
ータで実行されているジョブは共有ディスク装置上のジ
ョブ実行結果を更新することができなくなるからであ
る。よって実質的に障害の発生したコンピュータによる
ジョブの実行は停止することになる。さらに、障害の発
生したコンピュータは共有ディスク装置上のジョブ情報
を参照できなくなるため、新規のジョブも実行できなく
なる。The first effect is that the risk of starting a job twice can be avoided when the job is re-submitted upon failure detection. The reason is that, when a failure is detected, the connection of the shared disk device holding the job information and job execution results used by the failed computer is forcibly switched, so that jobs still running on the failed computer can no longer update the job execution results on the shared disk device. Execution of jobs by the failed computer is therefore effectively stopped. Furthermore, the failed computer can no longer refer to the job information on the shared disk device, so it cannot start new jobs either.
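The first effect amounts to cutting the failed computer off from its disk. A minimal sketch of that idea follows, assuming a hypothetical `FencedDisk` class in which access is checked against the currently connected computer; in the specification the cut-off is achieved by physically switching the connection, not by a software check.

```python
class FencedDisk:
    """Shared disk that only serves the computer it is currently connected to."""

    def __init__(self, owner):
        self.owner = owner
        self.job_info = {}  # job information
        self.results = {}   # job execution results

    def switch_to(self, new_owner):
        """Connection switching: the previous owner loses all access."""
        self.owner = new_owner

    def _check(self, computer):
        if computer != self.owner:
            raise PermissionError(f"{computer} is not connected to this disk")

    def write_result(self, computer, job, result):
        self._check(computer)
        self.results[job] = result

    def read_job_info(self, computer):
        self._check(computer)
        return dict(self.job_info)
```

Once `switch_to` has been called, a job still running on the old owner can neither update its results nor fetch further job information, which is what makes re-submission on the other computer safe in this model.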

【００５１】第２の効果は、障害の発生したコンピュー
タが復旧したとき、または共有ディスク装置が再接続さ
れたときに、そのコンピュータは共有ディスク装置から
ジョブ情報を取り出すことができないため、ジョブの二
重起動を防止することができることである。その理由
は、ジョブの再投入の処理が完了したとき、障害の発生
したコンピュータで使用していた共有ディスク装置か
ら、取り出したジョブに関するジョブ情報を削除してい
るからである。The second effect is that, when the failed computer recovers or the shared disk device is reconnected, that computer cannot retrieve the job information from the shared disk device, so double starting of jobs is prevented. The reason is that, when re-submission of a job is completed, the job information for the retrieved job is deleted from the shared disk device that the failed computer had been using.
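The read, re-submit, and delete sequence behind the second effect can be sketched as below. The function name `resubmit_all` and the use of a plain dictionary for the job information on the disk are assumptions for illustration only.

```python
def resubmit_all(disk_jobs: dict, submit) -> list:
    """Re-submit every job found on the disk, deleting each entry after submission.

    `disk_jobs` maps job names to job information; `submit` is a callable
    that hands one job's information to the substitute computer.
    Returns the names of the jobs that were re-submitted.
    """
    done = []
    for name in list(disk_jobs):  # copy the keys, since entries are deleted below
        submit(disk_jobs[name])   # re-submit to the substitute computer
        del disk_jobs[name]       # delete so the job cannot be picked up again
        done.append(name)
    return done
```

Because each entry is removed after it has been handed over, a second pass over the same disk, for example by the recovered computer, finds nothing left to start.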

【００５２】第３の効果は、ジョブの再投入を行うため
の再投入用コンピュータを余分に用意する必要がないこ
とである。その理由は、ジョブの再投入処理をジョブの
実行を行うコンピュータで行っているからである。The third effect is that it is not necessary to prepare an extra computer for re-submitting a job. The reason is that the job re-input processing is performed by the computer that executes the job.

[Brief description of the drawings]

【図１】本発明の第１の実施の形態に係る分散バッチジ
ョブ処理システムを示すブロック図である。FIG. 1 is a block diagram showing a distributed batch job processing system according to a first embodiment of the present invention.

【図２】図１に示した分散バッチジョブ処理システムの
動作を説明するためのフロー図である。FIG. 2 is a flowchart for explaining the operation of the distributed batch job processing system shown in FIG. 1.

【図３】本発明の第２の実施の形態に係る分散バッチジ
ョブ処理システムを示すブロック図である。FIG. 3 is a block diagram showing a distributed batch job processing system according to a second embodiment of the present invention.

[Explanation of symbols]

１，２コンピュータ３外部記憶装置４，４−１，４−２共有ディスク装置１０−１，１０−２ジョブ処理手段群２０−１，２０−２ジョブ再投入手段群３１−１，３１−２ジョブ情報３２−１，３２−２ジョブ実行結果４１，４１−１，４１−２障害検出手段４２，４２−１，４２−２接続切替手段１０１−１，１０１−２ジョブ入力手段１０２−１，１０２−２ジョブ受理手段１０３−１，１０３−２ジョブ情報記録手段１０４−１，１０４−２ジョブ実行手段１０５−１，１０５−２ジョブ終了処理手段１０６−１，１０６−２操作員通知手段２０１，２０１−１，２０１−２ジョブ情報読出し
手段２０２，２０２−１，２０２−２ジョブ再投入手段２０３，２０３−１，２０３−２ジョブ削除手段
1, 2: Computer
3: External storage device
4, 4-1, 4-2: Shared disk device
10-1, 10-2: Job processing means group
20-1, 20-2: Job re-submitting means group
31-1, 31-2: Job information
32-1, 32-2: Job execution result
41, 41-1, 41-2: Failure detection means
42, 42-1, 42-2: Connection switching means
101-1, 101-2: Job input means
102-1, 102-2: Job receiving means
103-1, 103-2: Job information recording means
104-1, 104-2: Job executing means
105-1, 105-2: Job end processing means
106-1, 106-2: Operator notification means
201, 201-1, 201-2: Job information reading means
202, 202-1, 202-2: Job re-submitting means
203, 203-1, 203-2: Job deleting means

Claims

(57) [Claims]

1. A distributed batch job processing system comprising: at least one active computer (2) that performs batch job processing during normal operation; at least one alternative computer (1) that takes over processing when a failure occurs in the active computer; a shared disk device (4) having failure detection means (41) for detecting a failure of the active computer and connection switching means (42) for changing the connection from the active computer to the alternative computer when a failure occurs in the active computer; and a job re-submitting means group (20) for retrieving information from the shared disk device and re-submitting jobs to the alternative computer when a failure occurs in the active computer, wherein the job re-submitting means group (20) comprises job information reading means (201) for retrieving job information stored in the shared disk device, job re-submitting means (202) for re-submitting the job indicated by the job information retrieved by the job information reading means to the alternative computer, and job deleting means (203) for deleting the submitted job information from the shared disk device.

2. A method of automatically restarting a job when a failure occurs in a distributed batch job processing system comprising at least one active computer (2) that performs batch job processing during normal operation, at least one alternative computer (1) that takes over processing when a failure occurs in the active computer, and a shared disk device (4) that stores job information of the active computer, the system including failure detection means (41) for detecting a failure of the active computer and connection switching means (42) for changing the connection from the active computer to the alternative computer when a failure occurs in the active computer, the method comprising: a step of retrieving, on the alternative computer, the job information stored in the shared disk device; a step of re-submitting, on the alternative computer, the job indicated by the retrieved job information to the alternative computer; and a step of deleting, on the alternative computer, the submitted job information from the shared disk device.

3. A recording medium storing a program for causing the alternative computer to execute: a process of retrieving job information stored in the shared disk device; a process of re-submitting the job indicated by the retrieved job information to an alternative computer; and a process of deleting the job information indicating the re-submitted job from the shared disk device.

4. A distributed batch job processing system having a plurality of computers (1, 2) that perform batch processing, wherein a shared disk device (4-1, 4-2), which has failure detecting means (42-1 and 42-2) for detecting a failure of the computer and stores job information of the computer, is connected to each of the plurality of computers; each of the shared disk devices has connection switching means (42-1 and 42-2) for changing the connection from the computer that normally performs batch processing to a computer that takes over its jobs; each of the plurality of computers comprises a job re-submitting means group (20-1, 20-2) for retrieving information from the shared disk device and re-submitting jobs to the alternative computer when a failure occurs in the computer that normally performs batch processing; and the job re-submitting means group (20-1, 20-2) comprises job information reading means (201-1, 201-2) for retrieving job information stored in a shared disk device connected to another computer, job re-submitting means (202-1 and 202-2) for re-submitting the job indicated by the job information retrieved by the job information reading means to the alternative computer, and job deleting means (203-1, 203-2) for deleting the submitted job information from the shared disk device.

5. A method of automatically restarting a job when a failure occurs in a distributed batch job processing system having a plurality of computers (1, 2) that perform batch processing, in which a shared disk device (4-1, 4-2), which has failure detecting means (42-1 and 42-2) for detecting a failure of the computer and stores job information of the computer, is connected to each of the plurality of computers, and each of the shared disk devices has connection switching means (42-1 and 42-2) for changing the connection from the computer that normally performs batch processing to a computer that takes over its jobs, the method comprising: a step in which each computer retrieves job information stored in a shared disk device connected to another computer; a step in which each computer re-submits the job indicated by the retrieved job information to the alternative computer; and a step in which each computer deletes the submitted job information from the shared disk device.