JPH02118736A

JPH02118736A - System for restoring and controlling fault of batch job

Info

Publication number: JPH02118736A
Application number: JP63272253A
Authority: JP
Inventors: Koichi Nakanishi; 弘一中西
Original assignee: KANSAI NIPPON DENKI SOFTWARE KK
Current assignee: KANSAI NIPPON DENKI SOFTWARE KK
Priority date: 1988-10-27
Filing date: 1988-10-27
Publication date: 1990-05-07

Abstract

PURPOSE: To re-execute a batch job without altering the execution environment by managing job re-execution information and having the program retrieve that management information at re-execution. CONSTITUTION: A management file 5 consisting of execution management information and history management information for managing jobs is provided. For the execution management information, a record is registered prior to job execution with the job name as the key; as the job runs, processing information on the input file 1 of the batch process is registered and serves as control information for re-execution after a failure. Consequently, by applying a failure recovery management module 8 to a batch program 2, file updates free of inconsistency are achieved simply by re-executing the job after a program abort. Even for an abort attributable to the program, the management record can be retrieved and the invalid record corrected. Thus the batch job can be re-executed without altering the execution environment.

Description

[Detailed Description of the Invention]

[Industrial Application Field]

The present invention relates to a batch job failure recovery management method, and in particular to one that supports investigation of the cause of a batch job failure and, once the cause has been removed, guarantees the contents of the updated data and enables effective management of job operation simply by re-executing the job as before, without changing the execution environment.

[Conventional technology]

In conventional batch job management, data updates fall into two kinds: those whose contents may safely be updated twice when the job is re-executed after a failure, and those, such as additions, that must not be updated twice. To control this, each record carried an item indicating its update status, and the check of that item to decide whether to update the record was built into the program's algorithm.

Alternatively, when the failure was a program abort caused by invalid data, the invalid data was displayed on an output device and corrected, and then either the program was modified so that its starting position for re-execution was set to that data, or the already processed input data was deleted from the input file, so as to guarantee the logical consistency of the update.

When data had to be carried over from the failed job to the re-executed job, either intermediate files were provided for in the job design, or a separate failure-only program had to be written so that the information processed up to the failure (in particular accumulated values, processed record counts, and the like) could be obtained at re-execution.

[Problem to be solved by the invention]

Adding an update-judgment item to the items of a file in itself lengthens the record. Where jobs that update the same file run concurrently, as many update-judgment items are needed as there are concurrent jobs, so the approach lacks flexibility and extensibility from the standpoint of both file design and job design.

Basically, this approach decides whether a record has been updated from the contents of the update-judgment item and thereby avoids logical inconsistency in the record, but it requires a file access even for data that has already been processed, which wastes time. Re-execution by modifying the program or deleting input data requires work such as compiling, creating new JCL, and correcting data, and carries the risk of a second failure caused by a mistake, which is a safety problem.

Furthermore, with the volume of data to be processed tending to increase, some batch processes now take tens of hours and operation schedules have become tight. Responding to a failure by changing the execution environment, for example by modifying programs, lacks immediacy and cannot meet present-day operational needs; moreover, because the recovery method differs from job to job, a unified recovery procedure cannot be adopted.

[Means to solve the problem]

An embodiment of the present invention is a batch job failure recovery management method that creates an environment for prompt re-execution, free of logical inconsistency in updated data, when a batch job is re-executed after failure recovery. Using a management file that controls job execution, it comprises: means for storing and deleting records with the job identification name in the job control statement as the search key; means for obtaining, through those records, the update information of the batch job's input file, determining at re-execution whether processing has already been done, and positioning directly at the input file record at the time of the failure; means by which the re-executed job can take over the processing data of the failed job, and means for searching that information; and means for starting the failed jobs collectively at re-execution after recovery from a failure of the whole system that is not attributable to a program.

[Example]

Next, the present invention will be described with reference to the drawings.

FIG. 1 shows an embodiment of the present invention. In FIG. 1, the embodiment is a failure recovery management method in which job re-execution information is managed and the program, when re-executed, looks up that management information, so that a batch job can be re-executed without changing the execution environment (job control statements, programs, files, and so on). It has a job management file 5 for managing jobs, and this job management file 5 consists of execution management information and history management information. For the execution management information, a record is registered prior to job execution with the job name in the job control statement as key 5-1, and the record is deleted on normal termination. As the job runs, processing information on the input file of the batch process is registered at a fixed interval specified by the program, and this serves as control information for re-execution after a failure. When invalid data must be corrected, this information is also used to find the record concerned, and it is displayed on output device 9 by a search function or the like.
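As a minimal sketch of the execution management part of job management file 5, the lifecycle described above (register before execution, record progress while running, delete on normal termination) might look as follows. The class and field names are hypothetical, and an in-memory dictionary stands in for the actual file, whose layout the source does not specify.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional


@dataclass
class ExecutionRecord:
    """Hypothetical layout of one execution management record."""
    job_name: str                          # key 5-1: job name from the job control statement
    processed_count: int = 0               # number of input records processed so far
    input_address: Optional[int] = None    # position in the input file, if recorded
    updated_at: Optional[datetime] = None  # time of the latest progress entry


class JobManagementFile:
    """In-memory stand-in for the execution management part of file 5."""

    def __init__(self) -> None:
        self._records: Dict[str, ExecutionRecord] = {}

    def register(self, job_name: str) -> ExecutionRecord:
        # Called before the job runs. An existing record is kept, so a
        # re-executed job still sees the progress of the failed run.
        return self._records.setdefault(job_name, ExecutionRecord(job_name))

    def checkpoint(self, job_name: str, processed_count: int,
                   input_address: Optional[int]) -> None:
        # Called at the interval chosen by the batch program.
        rec = self._records[job_name]
        rec.processed_count = processed_count
        rec.input_address = input_address
        rec.updated_at = datetime.now()

    def delete(self, job_name: str) -> None:
        # Called on normal termination only.
        self._records.pop(job_name, None)

    def find(self, job_name: str) -> Optional[ExecutionRecord]:
        return self._records.get(job_name)
```

Because the record is removed only on normal termination, its mere presence is what later identifies a job as interrupted.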

The batch program 2 to which this embodiment is applied reads input file 1 in the order of arrow a, randomly accesses update file 3 using item 1-1 of the input file, and adds the value of item 1-2 to item 3-2. In normal processing, at the very start it stores a job management information record in job management file 5 with the job name in the job control statement as the key. The program is built on the logic of updating the update file once for each input record; calling this unit a transaction, for each transaction the program stores in the management file, as mandatory data, the number of input records processed 5-2, the address of the input file at the time of reading 5-3, and the processing time 5-4. Where the processing result of one transaction is carried over to the next transaction as an accumulated value, the accumulated value is also stored in the management file for each transaction. When input file 1 has been processed to the end, the batch program, as its termination processing, deletes from management file 5 the management record keyed by the job name of that batch program.

Next, consider re-execution after abnormal termination. When a batch job has aborted for a reason not attributable to the individual program, such as a system failure, the job is simply re-executed. In the initial processing of the program, failure recovery module 8 searches management file 5 with the job name as the key, and the presence of a management record reveals that this program is an aborted job.
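The normal-processing flow can be sketched roughly as below. This is an illustration under stated assumptions, not the patented implementation: the function names, the dictionary standing in for management file 5, and the `fail_at` parameter used to simulate an abort are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Mgmt = Dict[str, dict]  # hypothetical stand-in for management file 5


def is_aborted_job(mgmt: Mgmt, job_name: str) -> bool:
    # A surviving management record means the previous run never reached
    # its termination processing.
    return job_name in mgmt


def run_normal(job_name: str, inputs: List[Tuple[str, int]],
               master: Dict[str, int], mgmt: Mgmt,
               fail_at: Optional[int] = None) -> int:
    """Process every input record, checkpointing after each transaction."""
    # Start of processing: store the management record keyed by job name.
    mgmt[job_name] = {"count": 0, "address": None, "total": 0}
    total = 0
    for address, (key, amount) in enumerate(inputs):
        if fail_at is not None and address == fail_at:
            raise RuntimeError("simulated abort")  # illustration only
        # One transaction: add item 1-2 of the input to item 3-2 of the update file.
        master[key] = master.get(key, 0) + amount
        total += amount
        # Record count (5-2), address (5-3) and the accumulated value.
        mgmt[job_name] = {"count": address + 1, "address": address, "total": total}
    # Termination processing: delete the management record.
    del mgmt[job_name]
    return total
```

The sketch treats the file update and the progress entry as a single step. The source does not say how the two writes are kept consistent with each other, so a real system would have to address that separately.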

Failure recovery module 8 compares the processed count 5-2 in management file 5 with the number of input file records read, and instructs the program to skip reading input file 1 until the number read exceeds the processed count 5-2. Alternatively, using the address 5-3 of input file 1 stored in management file 5 as a key, it positions directly at the input file position at the time of the abort. The choice between the two is made by whether the management record contains an address or key 5-3. When carry-over information 5-5 exists, the re-executed job resumes using the carry-over information 5-5 supplied by failure recovery module 8. Thereafter, as in normal processing, management information is stored in management file 5 via failure recovery module 8 for each transaction, and the management record is deleted when the last transaction has been processed.
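The two repositioning strategies can be illustrated with a short sketch. The record layout and function names are hypothetical, and the sketch assumes that the stored address identifies the last input record that was fully processed, which the source does not state explicitly.

```python
from typing import List, Tuple


def resume_position(record: dict) -> Tuple[str, int]:
    """Choose how to reposition the input from the management record."""
    if record.get("address") is not None:
        # Address or key 5-3 is present: position directly.
        # Assumption: the address is that of the last processed record.
        return ("direct", record["address"] + 1)
    # Otherwise read and discard until the processed count 5-2 is passed.
    return ("skip", record["count"])


def remaining_inputs(inputs: List[str], record: dict) -> List[str]:
    """Return the input records the re-executed job still has to process."""
    mode, start = resume_position(record)
    if mode == "direct":
        return inputs[start:]
    remaining = []
    read = 0
    for item in inputs:
        read += 1
        if read > record["count"]:  # the number read now exceeds count 5-2
            remaining.append(item)
    return remaining


def initial_total(record: dict) -> int:
    # Carry-over information 5-5, if present, seeds the accumulated value.
    return record.get("carry_over", 0)
```

Both strategies yield the same remaining records; direct positioning simply avoids re-reading the part of the input that was already processed.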

In the case of an abort attributable to the program, the management record is searched, and the invalid input record can be identified from the job name 5-1, the number of input records processed 5-2, and the address or key value 5-3. A file maintenance function is linked to this so that the record is called up on the screen and the invalid data is corrected. After the correction the job is simply re-executed, with no need to worry about logical inconsistency or inconsistent accumulated values arising from double updates. When a job is re-executed after abnormal termination, failure recovery module 8 stores a failure history record in job management file 5, with the system date and an automatic serial number as the key items, consisting of the job management information record with the re-execution time added.
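To illustrate how the management record narrows down the invalid input record, the following sketch assumes that progress is recorded after every transaction, so that the first record not yet counted as processed is the one being handled when the program aborted. The function name and record layout are hypothetical.

```python
from typing import List, Optional, Tuple


def suspect_record(inputs: List[str], record: dict) -> Tuple[int, Optional[str]]:
    """Return the position and contents of the first unprocessed input record."""
    # Assumption: 'count' (5-2) was updated after every completed transaction,
    # so the record at this zero-based position was in progress at the abort.
    index = record["count"]
    contents = inputs[index] if index < len(inputs) else None
    return index, contents
```

An operator could then call up that record through the file maintenance function, correct it, and re-execute the job unchanged.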

Thus, by accessing job management file 5 with the date as the key, whether any failure occurred on that day and, if so, the failure history information (job name, time of failure, time of re-execution, information at the time of failure, and so on) can be output to output device 9, such as a screen or a printed report.
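The history management part can be sketched as a store keyed by system date and serial number. The names below are hypothetical, and the sketch assumes that the serial number restarts at 1 each day, a detail the source leaves open.

```python
from datetime import date, datetime
from typing import Dict, List, Tuple

HistoryKey = Tuple[date, int]  # (system date, automatic serial number)


class FailureHistory:
    """In-memory stand-in for the history management part of file 5."""

    def __init__(self) -> None:
        self._records: Dict[HistoryKey, dict] = {}
        self._last_serial: Dict[date, int] = {}

    def add(self, execution_record: dict, rerun_at: datetime) -> HistoryKey:
        # Copy the execution management record and add the re-execution time.
        day = rerun_at.date()
        serial = self._last_serial.get(day, 0) + 1  # assumed to restart daily
        self._last_serial[day] = serial
        key = (day, serial)
        self._records[key] = {**execution_record, "rerun_at": rerun_at}
        return key

    def on(self, day: date) -> List[dict]:
        # All failure history records for one date, in serial-number order.
        matches = [(k, rec) for k, rec in self._records.items() if k[0] == day]
        return [rec for _, rec in sorted(matches, key=lambda pair: pair[0])]
```

An empty result from `on` corresponds to a day on which no failure was recorded.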

In particular, after a system failure such as a power loss, when the jobs that were running at the time are to be re-executed unconditionally, the names of those jobs can be obtained by searching the management file and used to start the corresponding jobs in the job control file.
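Collective restart after a system failure can be sketched as a scan of the management file followed by a lookup in the job control file. The function and parameter names are hypothetical, and the `start` callable stands in for whatever job submission facility the host system provides.

```python
from typing import Callable, Dict, List


def restart_interrupted_jobs(mgmt: Dict[str, dict],
                             job_control: Dict[str, str],
                             start: Callable[[str, str], None]) -> List[str]:
    """Start every job that still has a record in the management file."""
    started = []
    for job_name in sorted(mgmt):
        control = job_control.get(job_name)
        if control is None:
            continue  # no job control entry found; left for the operator
        start(job_name, control)  # same job control statements as the original run
        started.append(job_name)
    return started
```

No job has to be selected by hand: the surviving management records themselves define the set of jobs to restart.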

As described above, an embodiment of the present invention provides a failure management method in which job re-execution information is managed and the re-executed program searches that management information, so that a batch job can be re-executed without changing the execution environment (job control statements, programs, files, and so on). To that end, as the means of managing job failure recovery, it has a management file consisting of execution management information and history management information for managing jobs.

For the execution management information, a record is registered prior to job execution with the job name in the job control statement as the key and is deleted on normal termination. As the job runs, processing information on the input file of the batch process is registered at a fixed interval specified by the program and serves as control information for re-execution after a failure; when invalid data must be corrected, it is the information used to find the record concerned and can be displayed on an output device by a search function or the like. At re-execution, the execution information is registered and managed as failure history information under a key numbered by the system. The abnormally terminated jobs registered in the management file can be started collectively once the system has determined their termination status.

[Effect of the invention]

As described above, by applying the failure recovery management module to a batch program, the present invention makes it possible to achieve consistent file updates merely by re-executing the job after a program abort caused by a system failure or the like, and to start the aborted jobs automatically and collectively without having to select them. Even for an abort attributable to the program, searching the management record and using the maintenance function allows the invalid record to be corrected quickly and without error.

[Brief explanation of drawings]

FIG. 1 is a configuration diagram showing the relationship between the batch program of the present invention and the failure recovery management system.

2: batch program, 3: update master file, 4: failure recovery management system, 5: job management file, 6: collective job start, 7: search, 8: failure recovery module, 9: output device.

Claims

[Claims]

A batch job failure recovery management method that creates an environment for prompt re-execution, free of logical inconsistency in updated data, when a batch job is re-executed after failure recovery, characterized by comprising, through use of a management file that controls job execution: means for storing and deleting records with the job identification name in the job control statement as the search key; means for obtaining thereby the update information of the batch job's input file, determining at re-execution whether processing has already been done, and positioning directly at the input file record at the time of the failure; means by which a re-executed job can take over the processing data of the failed job, and means for searching that information; and means for starting the failed jobs collectively at re-execution after recovery from a failure of the whole system not attributable to a program; whereby a batch job is promptly re-executed after a failure occurs.