JPS6125250A - Fault recovery method of information processor

Info

Publication number: JPS6125250A
Application number: JP14708384A
Authority: JP
Inventors: Noritaka Umeno; 典隆梅野
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1984-07-16
Filing date: 1984-07-16
Publication date: 1986-02-04

Abstract

PURPOSE: To secure a correct restart regardless of the type of fault by performing a normality check before the information needed for restart is saved. CONSTITUTION: An information processor contains a CPU 101, a memory device 102, a file device 103 and a printer 104. The normality of the information processor is confirmed at a prescribed checkpoint. If the system is normal, the information needed for restart is then stored in the device 103. For restart, the state of the device 102 is reset to that of the checkpoint, based on the latest contents saved to the device 103.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]

The present invention relates to recovery processing in the event of a failure in an information processing device, and in particular to a checkpoint restart method applied to long-running processing.

[Prior Art]

Various failure recovery methods have been considered for this type of information processing device, and the checkpoint restart method is one of those commonly applied today. The conventional checkpoint restart method divides a series of long-running processing into several short processes and, each time a process ends, saves the information necessary and sufficient to restart from that point before proceeding to the next process. If a failure occurs during a process, processing can be resumed, after the failure is removed (repaired), from the point at which the immediately preceding process finished.
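The conventional method described above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the names `StepFailure`, `run_conventional` and the checkpoint dictionary layout are assumptions introduced here, and the failure is modeled as an exception that is detected the moment it occurs.

```python
class StepFailure(Exception):
    """Hypothetical stand-in for a failure that is detected as it occurs."""


def run_conventional(steps, state, checkpoint=None):
    """Run `steps` in order, saving a checkpoint after every step.

    Note that the checkpoint is saved unconditionally: nothing verifies
    that the step which just finished actually produced a correct state.
    """
    start = 0
    if checkpoint is not None:
        # Resume from the point at which the previous step finished.
        start, state = checkpoint["next_step"], dict(checkpoint["state"])
    for i in range(start, len(steps)):
        try:
            state = steps[i](dict(state))
        except StepFailure:
            # Stop here; the caller repairs the fault and calls again.
            return None, checkpoint
        checkpoint = {"next_step": i + 1, "state": dict(state)}
    return state, checkpoint
```

After a repair, the caller passes the returned checkpoint back in, so only the interrupted step and those after it are executed again.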

However, failures fall into two classes according to how they are discovered. Some are recognized as failures the moment they occur (for example, parity check errors and duplexed-comparison errors, that is, failures detected by hardware error detection circuits). Others are not recognized as failures when they occur, so processing continues (for example, failures not directly detected by the hardware error detection circuits), and they come to light only much later through some inconsistency (for example, data used for control having been corrupted).

[Problem that the invention seeks to solve]

In the conventional checkpoint restart method described above, a failure during a process causes a return to the point at which the immediately preceding process finished. However, that preceding process may itself not have been executed normally, in which case resuming processing produces incorrect results. This is a drawback of the conventional method.

An object of the present invention is to solve this drawback and to provide a failure recovery method for an information processing device that recovers correctly from any type of failure.

[Means for solving problems]

The present invention is characterized in that, after each divided process ends and before the information necessary for restart is saved, it is confirmed whether the process was performed normally, and restart is performed by returning to this confirmed state and resuming from there.

That is, the present invention is a method for an information processing device comprising a central processing unit (CPU), a memory device, a file device and the like. It comprises a save process that confirms the normality of the information processing device at predetermined checkpoints and, if the system is normal, stores the information necessary for restart in the file device, and a process that returns the state of the memory device to that of the checkpoint using the latest contents saved in the file device. The method is characterized by preventing stoppages due to unforeseeable causes and enabling a correct restart.

[Operation]

When each divided process ends, the present invention checks whether the process was performed normally and, only if it was, saves the information necessary for restart to the file device. Restart is performed from this saved information, which eliminates troubles such as stoppages due to unforeseeable causes.

[Example]

Next, the present invention will be described in detail with reference to the accompanying drawings. FIG. 1 shows the equipment configuration of an information processing system to which the present invention is applied. A memory device 102 is connected via a central processing unit (CPU) 101 to a file device 103 and a printing device 104. The file device 103 stores the data and programs to be processed, and is also used as a work area for data being processed by a program and as a save area. The programs and data stored in the file device 103 are loaded into the memory device 102 and processed by the CPU 101, and the results are output to the printing device 104. The flowcharts of FIGS. 2 and 3 show a processing flow designed so that processing resumes correctly even if a failure occurs partway through, and are described below.

First, the processing program is divided into process 1 through process n. After each divided process, a step is inserted that checks whether the process was performed correctly and, if it is judged to have been, saves the information necessary for restart. The description below follows the flows of FIGS. 2 and 3.

When program execution starts (200), decision box 201 in FIG. 2 determines whether the program has already been executed partway (P ≠ 0) or this is the first execution (P = 0). If it is the first execution, execution begins from process 1 in box 205. If the program has already been executed partway (P ≠ 0), the information necessary for restart is restored (box 202) and control branches to the restart point Pi (box 203). From box 205 onward, each time one of the divided processes i (i = 1 to n) has been executed, the checkpoint setting macro (CHPT) is called.

As the flow of FIG. 3 shows, the CHPT macro first performs a normality check to determine whether the processing up to that point was correct (box 301). If it was normal, the macro registers the failure recovery start address A (in this embodiment, the same as the program start), registers the restart address Pi (also called the checkpoint), and saves the other information needed for restart, α(i), β(i), γ(i) (such as information in memory to be carried over to process i+1 and later) (box 303). If decision box 302 finds an abnormality, failure handling such as notifying the operator of the failure is performed (box 310).
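The flows of FIGS. 2 and 3 can be sketched roughly as below. This is an illustrative reading of the description, not code from the patent: the function names `chpt` and `run`, the dictionary `store` standing in for areas P and SF, the caller-supplied `is_normal` predicate, and the use of an exception for the failure handling of box 310 are all assumptions made for this sketch. Following the text, no checkpoint is set after the last process n.

```python
def chpt(store, i, info, is_normal):
    """Checkpoint setting macro (FIG. 3): check first, save only if normal."""
    if not is_normal(info):                      # boxes 301 and 302
        # Box 310: failure handling; here we simply raise to alert the operator.
        raise RuntimeError(f"abnormality detected after process {i}")
    store["A"] = 0                               # recovery start address = program start
    store["P"] = i                               # restart address P_i
    store["SF"] = dict(info)                     # alpha(i), beta(i), gamma(i) (box 303)


def run(processes, store, is_normal):
    """Main flow (FIG. 2). `store` stands in for areas P and SF."""
    n = len(processes)
    p = store.get("P", 0)                        # box 201: P == 0 means first run
    info = dict(store["SF"]) if p != 0 else {}   # box 202: restore saved information
    for i in range(p + 1, n + 1):                # box 203: branch to restart point
        info = processes[i - 1](dict(info))
        if i < n:
            chpt(store, i, info, is_normal)
    store["P"] = 0                               # box 221: clear P (0 -> P)
    return info
```

Because `chpt` raises before touching `store`, an abnormal result never overwrites the last checkpoint that passed the check, which is the property the description relies on.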

Therefore, from process 2 (box 207) onward, the restart address and the other information needed for restart are saved in areas P and SF, respectively. Areas P and SF are provided in the file device 103 of FIG. 1, so the information survives even if the power is turned off for repair or the like. When the last process n has been executed (box 220), all processing is complete, and any further run would start from the beginning. Accordingly, P is cleared by the operation 0→P (box 221) and everything ends.
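As a rough illustration of keeping areas P and SF on a file device so that they survive a power cycle, the sketch below stores both in a single JSON file. The file layout, the function names `save_areas`, `load_areas` and `clear_areas`, and the write-to-temporary-then-rename step are assumptions of this sketch; the patent only states that the areas reside in the file device 103.

```python
import json
import os


def save_areas(path, p, sf):
    """Write areas P and SF to `path`, replacing any earlier contents."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump({"P": p, "SF": sf}, f)
        f.flush()
        os.fsync(f.fileno())        # push the data to the storage device
    os.replace(tmp, path)           # swap in the new file in one step


def load_areas(path):
    """Return (P, SF); P == 0 means there is no checkpoint to resume from."""
    if not os.path.exists(path):
        return 0, {}
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return data["P"], data["SF"]


def clear_areas(path):
    """Equivalent of the 0 -> P operation after the last process."""
    save_areas(path, 0, {})
```

Writing to a temporary file and renaming it is a design choice added here so that a failure during the save itself leaves the previous checkpoint readable.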

Next, suppose for example that a failure occurs during execution of process 2 (box 207). The power is turned off, the repair is made, the power is turned on again, the system is started up, and control returns to the program start (terminal 200). At the program start, after P ≠ 0 is determined as described above, the state of α, β, γ immediately after execution of process 1 is restored from area SF and control branches to P1, so processing continues as though no failure had occurred during process 2. Even if a failure occurs during process 2 and goes unnoticed so that processing continues, the CHPT macro is called in box 208 and the normality check (box 301) detects the failure, so the restart information for checkpoint P2 is never saved by mistake. Hence in this case too, processing restarts correctly from checkpoint P1 after repair.
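The undetected-failure scenario above can be illustrated with a small simulation. Everything in it is hypothetical: the two processes, the `total` field, the corrupted value and the `is_normal` predicate are invented for illustration, and `run_once` is a simplified driver that runs with or without the check so the two behaviors can be compared.

```python
def run_once(processes, checkpoint, is_normal=None):
    """Run from `checkpoint`; return (last saved checkpoint, failure_detected).

    A checkpoint is a pair (number of processes completed, saved state).
    With `is_normal=None` this behaves like the conventional method.
    """
    done, state = checkpoint
    for i in range(done, len(processes)):
        state = processes[i](dict(state))
        if is_normal is not None and not is_normal(state):
            return checkpoint, True          # keep the last verified checkpoint
        checkpoint = (i + 1, dict(state))
    return checkpoint, False


def make_processes(faulty):
    """Two processes; the second silently corrupts data when `faulty`."""
    def p1(s):
        s["total"] = 10
        return s

    def p2(s):
        s["total"] = -999 if faulty else s["total"] + 5
        return s

    return [p1, p2]


def is_normal(state):
    """Hypothetical normality check: totals are never negative."""
    return state["total"] >= 0
```

Running `run_once(make_processes(True), (0, {}))` without the check saves the corrupted state as a checkpoint, whereas passing `is_normal` leaves the checkpoint taken after process 1 in place, and a later call with `make_processes(False)` resumes from it.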

Note that the result of the normality check is kept stored at all times in the memory device 102 by a hardware test and diagnosis program.

[Effect of the invention]

As described above, the present invention is configured to perform a normality check before saving the information necessary for restart, and thus has the effect that a correct restart is possible for any type of failure.

[Brief explanation of drawings]

FIG. 1 is a block diagram showing an apparatus for carrying out the method of an embodiment of the present invention. FIG. 2 is a flowchart showing the processing flow of the embodiment. FIG. 3 is a flowchart showing another processing flow of the embodiment.

101: central processing unit (CPU); 102: memory device; 103: file device; 104: printing device; A: failure recovery start address; Pi: restart address (checkpoint); α(i), β(i), γ(i): information needed for restart; P, SF: areas; CHPT: checkpoint setting macro; R3: restart.

Claims

(1) A failure recovery method for an information processing device in which a series of program processing is divided into a plurality of short processes and is executed such that, after each short process ends, the information necessary for restarting from the state at which it ended is saved and accumulated before proceeding to the next short process, and in which, if a failure occurs during execution of the program processing, execution is restarted, after the failure is removed, from the state at which the short process executed before the failure had finished, using the saved and accumulated necessary information, the method being characterized in that, when execution of each short process ends, it is confirmed that the process was executed normally, and restart is performed by returning to the short process for which this confirmation was made and resuming from the state at which its execution finished.