JPS63136142A

JPS63136142A - Error recovery system for logical unit

Info

Publication number: JPS63136142A
Application number: JP61282578A
Authority: JP
Inventors: Koemon Nigo; 仁後　公衛門
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1986-11-27
Filing date: 1986-11-27
Publication date: 1988-06-08

Abstract

PURPOSE: To reduce the risk of a system breakdown, job abort, or similar outcome by having another normal logical device retry the instruction in error when either an error that is retryable but could become non-retryable, or a fixed error that is retryable, occurs. CONSTITUTION: When a logical unit 11 fails, a monitor means 14 interrupts the processing of the unit 11 and stores its internal state. A diagnosis processor 2 reads out that internal state and judges the detected error through a retry possibility judging means 22. If the error is judged to be retryable but liable to become non-retryable, or to be a retryable fixed error, a takeover information compiling/producing means 23 compiles takeover information from the read-out internal state, and a takeover information setting means 24 sets it in a normal logical unit 13. A takeover processing re-execution/continuation instructing means 25 then instructs the unit 13 to take over the processing, and the unit 13 re-executes and continues it.

Description

DETAILED DESCRIPTION OF THE INVENTION

Technical Field

The present invention relates to an error recovery method for logical devices, and in particular to an error recovery method for logical devices in an information processing system in which a plurality of logical devices share a main storage device.

Prior Art

A known error recovery method of this kind is the one disclosed in, for example, Japanese Patent Application Laid-Open No. 57-064849 (Japanese Patent Application No. 55-141323).

In this conventional method, when an error occurs in a logical device (a central processing unit) and the instruction that caused the error is judged to be retryable on the basis of conditions such as whether memory has been rewritten, the logical device in which the error occurred is first made to retry a predetermined number of times. If the error is still not recovered, the instruction is retried on another normal logical device.
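The conventional scheme described above can be sketched as follows. This is a minimal illustration in Python, not code from the cited publication: the function name `conventional_recovery`, the callables that stand in for one retry attempt on each device, the return strings, and the default of three local retries are all assumptions (the text says only "a predetermined number of times").

```python
from typing import Callable


def conventional_recovery(
    retry_on_failed_device: Callable[[], bool],
    retry_on_other_device: Callable[[], bool],
    retryable: bool,
    max_local_retries: int = 3,  # assumed value; the source gives no number
) -> str:
    """Sketch of the prior-art order: local retries first, other device last."""
    if not retryable:
        # The instruction cannot be retried at all.
        return "unrecovered"
    # First, retry on the device that produced the error.
    for _ in range(max_local_retries):
        if retry_on_failed_device():
            return "recovered_on_failed_device"
    # Only after the local retries fail is another normal device used.
    if retry_on_other_device():
        return "recovered_on_other_device"
    return "unrecovered"
```

Each call to `retry_on_failed_device` is a further execution on hardware that has already shown a fault, so in this ordering the local loop is where a second intermittent error could arise.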

With this error recovery method, if the error is caused by an intermittent failure, is retryable, and the retry on the logical device in which the error occurred succeeds, a system down or the like can be avoided. Even if the error is caused by a fixed failure, as long as it is retryable, a system down or the like can be avoided when the retry succeeds on another normal logical device.

However, an intermittent failure is not always retryable; depending on the timing of memory rewriting during instruction execution, retry may be impossible. Moreover, a logical device that has suffered an intermittent failure once is considered likely to suffer another. Because this method performs the first retries on the logical device in which the error occurred, an error due to an intermittent failure may occur again during the retry, and this time the instruction in error may be non-retryable. In that case, because retry is impossible, the result is a system down, job abort, or the like.

As an improvement on this error recovery method, a method has been considered in which, when an error occurs in a logical device and the instruction in error is judged to be retryable, another normal logical device is made to retry from the instruction in error, without any retry on the logical device that generated the error. In this method, for safety, the faulty device is disconnected as soon as an intermittent failure occurs even once. Consequently, even for a failure that can be made 100% retryable (one that can never become non-retryable), such as a failure in advance control, the device is disconnected without any retry on the logical device in which the error occurred. This has the drawback of needlessly degrading system performance.

Object of the Invention

The present invention solves these problems. Its object is to provide an error recovery method that recovers from logical-device errors with little risk of causing a system down, job abort, or the like, while keeping performance degradation to a minimum.

Constitution of the Invention

According to the present invention, there is provided an error recovery method for logical devices in an information processing system in which a plurality of logical devices share a main storage device, characterized in that, when an error occurs in a logical device and either the instruction in error is judged to be retryable while the error is judged to be one that could become non-retryable, or the error is judged to be retryable but to have failed several retries, another normal logical device is made to continue the retry from the instruction in error, without any retry on the logical device that generated the error.
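The rule stated above reduces to a small decision function. The sketch below is illustrative only: the names `ErrorJudgment`, `Action`, and `choose_action`, and the representation of the judgment as three booleans, are assumptions introduced here and do not appear in the patent.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RETRY_ON_OTHER_DEVICE = "retry on another normal logical device"
    FAILURE_PROCESSING = "ordinary failure processing for the failed device"


@dataclass(frozen=True)
class ErrorJudgment:
    retryable: bool               # the instruction in error can be retried
    may_become_unretryable: bool  # the same error could be non-retryable at another timing
    fixed: bool                   # retries have already failed several times


def choose_action(judgment: ErrorJudgment) -> Action:
    """Hand the retry to another device only for the two risky retryable cases."""
    if judgment.retryable and (judgment.may_become_unretryable or judgment.fixed):
        return Action.RETRY_ON_OTHER_DEVICE
    # Errors that are always retryable, and errors that are not retryable at all,
    # both fall through to ordinary failure processing.
    return Action.FAILURE_PROCESSING
```

For example, `choose_action(ErrorJudgment(True, False, False))` returns `Action.FAILURE_PROCESSING`, which corresponds to the case where the failing device may safely retry by itself.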

Embodiment

An embodiment of the present invention is described in detail below with reference to the drawings.

FIG. 1 is a block diagram of an embodiment of the present invention.

In the figure, 1 is an information processing device, 2 is a diagnostic device, and 3 is a diagnostic interface that connects the information processing device 1 and the diagnostic device 2.

The information processing device 1 includes a plurality of logical devices 11 and 13, which are, for example, central processing units; a main storage device 12 that is connected to the logical devices 11 and 13 and is accessible from both of them; and a monitoring means 14 connected to the logical devices 11 and 13 and the main storage device 12. The information processing device 1 is controlled by a single operating system. The monitoring means 14 has a function of detecting a failure (error) in the logical devices 11 and 13, and a function of, upon detecting a failure, saving the internal state of the failed logical device and notifying the diagnostic processing device 2 of the failure via the diagnostic interface 3.

Although FIG. 1 shows the case of two logical devices, the present invention is also applicable to an information processing system equipped with three or more logical devices.

The diagnostic processing device 2 includes an internal state reading means 21, a retry possibility judging means 22, a takeover information compiling/creating means 23, a takeover information setting means 24, and a takeover processing re-execution/continuation instructing means 25. Of these, the internal state reading means 21 reads out the internal state of the failed logical device saved by the monitoring means 14 when the above notification is received from the monitoring means 14. The retry possibility judging means 22 judges, on the basis of the read-out internal state, whether the instruction that caused the error can be retried, whether the error could become non-retryable, and whether the error is a fixed error (an error for which retries have failed several times).

The takeover information compiling/creating means 23 compiles, from the internal state, takeover information for the processing that was running on the failed logical device, and does so when the retry possibility judging means 22 judges the error to be retryable but liable to become non-retryable, or to be a retryable fixed error. The takeover information setting means 24 sets the takeover information created by the means 23 in a normal logical device other than the one in which the failure was detected. The takeover processing re-execution/continuation instructing means 25 instructs that normal logical device to take over and re-execute, on the basis of the information set by the setting means 24, the processing that was being performed on the logical device in which the failure was detected.
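The division of labor among means 23, 24, and 25 can be pictured with a small model. The patent does not say what the takeover information contains, so the fields below (`instruction_address`, `registers`) are purely hypothetical, as are the class and function names; the sketch shows only the compile, set, and instruct sequence.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class TakeoverInfo:
    instruction_address: int   # hypothetical: where the instruction in error starts
    registers: Dict[str, int]  # hypothetical: register contents to carry over


@dataclass
class LogicalDevice:
    name: str
    pending: Optional[TakeoverInfo] = None  # information set by means 24
    resumed_at: Optional[int] = None        # address at which re-execution began


def compile_takeover_info(internal_state: dict) -> TakeoverInfo:
    """Role of means 23: build takeover information from the saved internal state."""
    return TakeoverInfo(
        instruction_address=internal_state["instruction_address"],
        registers=dict(internal_state["registers"]),  # copy, so the saved state is untouched
    )


def set_takeover_info(device: LogicalDevice, info: TakeoverInfo) -> None:
    """Role of means 24: place the information in a normal logical device."""
    device.pending = info


def instruct_takeover(device: LogicalDevice) -> None:
    """Role of means 25: tell the device to re-execute from the instruction in error."""
    if device.pending is None:
        raise RuntimeError("takeover information has not been set")
    device.resumed_at = device.pending.instruction_address
```

Keeping the three steps separate mirrors the text: the information is derived from the failed device's saved state, installed in the healthy device, and only then is the healthy device told to proceed.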

Next, the operation of this embodiment is described with reference to the flowchart of FIG. 2, taking as an example the case where a failure occurs in the logical device 11 of FIG. 1.

When a failure occurs in the logical device 11, the monitoring means 14 detects it. The monitoring means 14 then interrupts the processing of the logical device 11, saves its internal state, and notifies the diagnostic processing device 2 via the diagnostic interface 3 that the logical device 11 has failed.

In response, the diagnostic processing device 2 activates the internal state reading means 21, which reads out the internal state of the logical device 11 saved by the monitoring means 14, as shown in process 51 of FIG. 2. Next, the diagnostic processing device 2 uses the retry possibility judging means 22 to judge, from the read-out internal state, whether the detected error is retryable but could become non-retryable, or is a retryable fixed error (process 52). If neither condition is satisfied, failure processing is performed for the logical device 11 in which the error occurred. In this failure processing step, if the error is retryable, the retry is performed on the failed logical device 11 itself, and if the retry succeeds, operation simply continues. Otherwise, the logical device 11 in which the error occurred is disconnected from the system.

On the other hand, when process 52 judges the error to be retryable but liable to become non-retryable, or to be a retryable fixed error (an error for which retries have failed several times), the takeover information compiling/creating means 23 compiles takeover information from the read-out internal state (process 53), and the setting means 24 sets this takeover information in the normal logical device 13 (process 54). In the judgment of process 52, an "error that is retryable but could become non-retryable" means an error for which retry happens to be possible at the time it occurred but could have been impossible had it occurred at another time, because the state of instruction execution and the like differs with the timing of the error. A "retryable fixed error" means an error that is retryable but for which retry has already failed several times. Whether such an error can be retried is judged from the contents of the error flip-flops that indicate the error state, and known techniques may be used for this.

Next, the takeover processing re-execution/continuation instructing means 25 instructs the normal logical device 13 to take over the processing (process 55), causing the logical device 13 to re-execute and continue the processing it has taken over. As a result, the processing that was running when the error occurred is re-executed not on the logical device in which the retryable failure was detected but on another normal logical device; the error that occurred in the logical device 11 is recovered, and subsequent processing is likewise taken over by the normal logical device rather than the failed one. When a normal logical device takes over the processing of a failed logical device, the failed logical device is normally logically disconnected from the system.
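The flow of FIG. 2 as described in the text (processes 51 to 55 plus the ordinary failure-processing branch) can be traced with a short sketch. It is a hedged illustration, not the patented implementation: the function name `handle_failure`, the dictionary keys, the step strings, and the `local_retry_succeeds` flag that stands in for the outcome of a retry on the failed device are all assumptions.

```python
from typing import Dict, List


def handle_failure(saved_state: Dict[str, bool], local_retry_succeeds: bool = True) -> List[str]:
    """Return the ordered steps taken by the diagnostic processing device."""
    steps = ["51: read internal state saved by the monitoring means"]

    # Process 52: judge retry possibility from the read-out state.
    steps.append("52: judge retry possibility")
    retryable = saved_state["retryable"]
    risky = saved_state["may_become_unretryable"]
    fixed = saved_state["fixed"]

    if retryable and (risky or fixed):
        # Takeover branch: the failed device performs no retry.
        steps.append("53: compile takeover information")
        steps.append("54: set takeover information in the normal device")
        steps.append("55: instruct the normal device to re-execute and continue")
        steps.append("disconnect failed device")
    elif retryable and local_retry_succeeds:
        # Ordinary failure processing: retry on the failed device itself.
        steps.append("retry on failed device and continue operation")
    else:
        steps.append("disconnect failed device")
    return steps
```

Calling `handle_failure({"retryable": True, "may_become_unretryable": False, "fixed": False})` follows the branch in which the failing device retries by itself and stays in the system, which is the case the text singles out for avoiding unnecessary performance loss.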

The above description covers the case where the logical device 11 fails. When the logical device 13 fails, the processing is likewise handed over, under the control of the diagnostic processing device 2 as described above, to the normal logical device 11, which continues the processing that was being performed on the logical device 13.

Effects of the Invention

As described above, according to the present invention, when an error that is retryable but could become non-retryable, or a retryable fixed error, occurs, the instruction in error is retried on another normal logical device rather than on the logical device in which the error occurred. This reduces the risk that an error caused by an intermittent failure during the retry leaves the instruction non-retryable and leads to a system down, job abort, or the like. Moreover, for an error that is 100% retryable, the logical device in which the error occurred is not disconnected but continues operation by retrying on its own, so performance degradation is kept to a minimum.

[Brief explanation of the drawing]

FIG. 1 is a block diagram of an embodiment of the present invention, and FIG. 2 is a flowchart showing an example of the processing performed by the diagnostic processing device in the block diagram of FIG. 1.

Explanation of reference numerals for main parts

11, 13: logical devices
12: main storage device

Claims

An error recovery method for logical devices in an information processing system in which a plurality of logical devices share a main storage device, characterized in that, when an error occurs in a logical device and either the instruction in error is judged to be retryable while the error is judged to be one that could become non-retryable, or the error is judged to be a retryable error for which several retries have failed, another normal logical device is made to continue the retry from the instruction in error, without any retry on the logical device that generated the error.