JPS63136142A - Error recovery system for logical unit - Google Patents

Error recovery system for logical unit

Info

Publication number
JPS63136142A
Authority
JP
Japan
Prior art keywords
error
retried
logical device
processing
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
JP61282578A
Other languages
Japanese (ja)
Inventor
Koemon Nigo
仁後 公衛門
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Corp
Original Assignee
NEC Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Corp
Priority to JP61282578A
Publication of JPS63136142A
Legal status: Pending

Abstract

PURPOSE: To reduce the risk of a system breakdown, job abort, or similar failure by having another normal logical unit retry the instruction that caused an error, either when the error is retriable but may become non-retriable, or when it is a retriable fixed error. CONSTITUTION: When logical unit 11 fails, monitoring means 14 interrupts the processing of unit 11 and saves its internal state. Diagnostic processor 2 reads out that internal state and classifies the detected error with retry possibility judging means 22. If the error is judged to be retriable but liable to become non-retriable, or to be a retriable fixed error, handover information compiling means 23 compiles handover information from the read-out internal state, and handover information setting means 24 sets it in normal logical unit 13. Handover re-execution/continuation instructing means 25 then instructs unit 13 to take over the processing, and unit 13 re-executes and continues it.

Description

DETAILED DESCRIPTION OF THE INVENTION

Technical Field
The present invention relates to an error recovery method for logical units, and in particular to an error recovery method for logical units in an information processing system in which a plurality of logical units share a main storage device.

Prior Art
A known error recovery method of this kind is the one disclosed in Japanese Unexamined Patent Application Publication No. 57-064849 (Japanese Patent Application No. 55-141323).

In that conventional method, when an error occurs in a logical unit (a central processing unit) and the instruction that caused the error is judged to be retriable, based on conditions such as whether memory has been rewritten, the unit that raised the error first retries the instruction a predetermined number of times. Only if the error is still not recovered is the instruction retried on another normal logical unit.
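
The conventional policy described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the cited publication's implementation: the `run_on` callback, the unit numbers, and the retry limit of 3 are all hypothetical.

```python
from typing import Callable, List


def conventional_recovery(
    run_on: Callable[[int], bool],  # hypothetical: re-executes the instruction on a unit, True on success
    failed_unit: int,
    other_unit: int,
    max_local_retries: int = 3,     # the "predetermined number"; the value 3 is an assumption
) -> str:
    """Retry on the failing unit first; use another unit only afterwards."""
    for _ in range(max_local_retries):
        if run_on(failed_unit):
            return "recovered on failing unit"
    if run_on(other_unit):
        return "recovered on other unit"
    return "not recovered"


# Simulate a fixed fault on unit 11: every local retry fails, unit 13 succeeds.
attempts: List[int] = []


def run_on(unit: int) -> bool:
    attempts.append(unit)
    return unit != 11


result = conventional_recovery(run_on, failed_unit=11, other_unit=13)
print(result, attempts)
```

The recorded attempts show that every local retry is spent on the unit that is already suspect before the normal unit is tried.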

With this method, if the cause of the error is an intermittent fault, the instruction is retriable, and the retry succeeds on the unit where the error occurred, a system failure or the like is avoided. Even if the cause is a fixed fault, as long as the instruction is retriable, a successful retry on another normal logical unit likewise avoids a system failure.

However, an intermittent fault is not always retriable: depending on the timing of memory rewrites during instruction execution, a retry may be impossible. Moreover, a logical unit that has suffered one intermittent fault is considered likely to suffer another. Because this method performs the first retries on the unit where the error occurred, another intermittent-fault error may occur during those retries, and this time the failing instruction may be non-retriable. In that case, because no retry is possible, the result is a system failure, a job abort, or the like.

An improvement on this method has been considered in which, when an error occurs in a logical unit and the failing instruction is judged to be retriable, the retry from that instruction is performed by another normal logical unit, with no retry on the unit that raised the error. Because this method, for safety, disconnects the faulty unit as soon as an intermittent fault occurs even once, it also disconnects the unit, without retrying on it, for faults that can be made 100% retriable (faults that can never become non-retriable), such as a fault in advance control. It therefore has the drawback of degrading system performance needlessly.

Object of the Invention
The present invention solves these problems. Its object is to provide an error recovery method that can recover from logical unit errors with little risk of a system failure, job abort, or the like, while keeping performance degradation to a minimum.

Constitution of the Invention
According to the present invention, there is provided an error recovery method for logical units in an information processing system in which a plurality of logical units share a main storage device, characterized in that, when an error occurs in a logical unit and either the failing instruction is judged to be retriable but the error is one that may become non-retriable, or the error is judged to be retriable but to have failed several retries, the retry from the failing instruction is continued on another normal logical unit, without any retry on the unit that raised the error.
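
To make the difference between the three schemes concrete, the sketch below tabulates where the first retry of a retriable error runs under each one. The scheme names, the `ErrorClass` enum, and the `retry_target` function are labels invented for this illustration; only the mapping itself follows the text.

```python
from enum import Enum, auto


class ErrorClass(Enum):
    ALWAYS_RETRIABLE = auto()        # can never become non-retriable
    MAY_BECOME_UNRETRIABLE = auto()  # retriable now, but a later occurrence might not be
    FIXED = auto()                   # retriable, but several retries have already failed


def retry_target(scheme: str, err: ErrorClass) -> str:
    """Return 'same' (unit that raised the error) or 'other' (another normal unit)."""
    if scheme == "local-first":      # first conventional method
        return "other" if err is ErrorClass.FIXED else "same"
    if scheme == "always-switch":    # improved conventional method
        return "other"
    if scheme == "proposed":         # method of this invention
        return "same" if err is ErrorClass.ALWAYS_RETRIABLE else "other"
    raise ValueError(f"unknown scheme: {scheme}")


schemes = ("local-first", "always-switch", "proposed")
for err in ErrorClass:
    row = [retry_target(s, err) for s in schemes]
    print(f"{err.name:24}", *row)
```

Only the proposed scheme both keeps the always-retriable case on the original unit and moves the risky cases to a normal unit.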

Embodiment
The present invention is described in detail below with reference to the drawings.

FIG. 1 is a block diagram of an embodiment of the present invention.

In the figure, 1 is an information processing device, 2 is a diagnostic processor, and 3 is a diagnostic interface that connects the information processing device 1 and the diagnostic processor 2.

The information processing device 1 includes a plurality of logical units 11 and 13, for example central processing units; a main storage device 12 connected to the logical units 11 and 13 and accessible from both; and monitoring means 14 connected to the logical units 11 and 13 and to the main storage device 12. It is controlled by a single operating system. The monitoring means 14 has a function of detecting a failure (error) in logical unit 11 or 13, and a function of, on detecting a failure, saving the internal state of the failed logical unit and notifying the diagnostic processor 2 of the failure via the diagnostic interface 3.

Although FIG. 1 shows the case of two logical units, the present invention is also applicable to information processing systems with three or more logical units.
The present invention is also applicable to information processing systems in which three or more logical devices work together.

The diagnostic processor 2 includes internal state reading means 21, retry possibility judging means 22, handover information compiling means 23, handover information setting means 24, and handover re-execution/continuation instructing means 25. Of these, the internal state reading means 21 reads out the internal state of the failed logical unit, as saved by the monitoring means 14, when the above notification arrives from the monitoring means 14. The retry possibility judging means 22 uses that internal state to judge whether the instruction that caused the error can be retried, whether the error may become non-retriable, and whether the error is a fixed error (an error for which several retries have failed).

The handover information compiling means 23 compiles, from the internal state, handover information for the processing that was running on the failed logical unit, when the retry possibility judging means 22 judges the error to be either retriable but liable to become non-retriable, or retriable and fixed. The handover information setting means 24 sets the handover information compiled by means 23 in a normal logical unit other than the one in which the failure was detected. The handover re-execution/continuation instructing means 25 instructs that normal logical unit, on the basis of the information set by means 24, to take over and re-execute the processing that was running on the logical unit in which the failure was detected.
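
The roles of means 23, 24, and 25 might be pictured as below. The patent does not say what the handover information contains, so the `HandoverInfo` fields, the state keys (`failing_pc`, `r0`, `error_flags`), and the `NormalUnit` class are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class HandoverInfo:
    resume_address: int        # assumed: address of the instruction that raised the error
    registers: Dict[str, int]  # assumed: register contents needed to re-execute it


def compile_handover(internal_state: Dict[str, int]) -> HandoverInfo:
    """Means 23 (sketch): keep only what another unit needs in order to resume."""
    regs = {k: v for k, v in internal_state.items() if k.startswith("r")}
    return HandoverInfo(resume_address=internal_state["failing_pc"], registers=regs)


class NormalUnit:
    """Stand-in for a normal logical unit such as unit 13."""

    def __init__(self) -> None:
        self.pc = 0
        self.registers: Dict[str, int] = {}
        self.resumed = False

    def load(self, info: HandoverInfo) -> None:  # means 24: set the handover information
        self.pc = info.resume_address
        self.registers = dict(info.registers)

    def resume(self) -> None:                    # means 25: instruct re-execution
        self.resumed = True


# Saved state of the failed unit (all values hypothetical).
state = {"failing_pc": 0x1F40, "r0": 3, "r1": 9, "error_flags": 0b0101}
unit13 = NormalUnit()
unit13.load(compile_handover(state))
unit13.resume()
print(hex(unit13.pc), unit13.registers, unit13.resumed)
```

Keeping compilation, setting, and the resume instruction as separate steps mirrors the three separate means in the text.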

Next, the operation of this embodiment is described with reference to the flowchart of FIG. 2, taking as an example the case where logical unit 11 in FIG. 1 fails.

When logical unit 11 fails, the monitoring means 14 detects the failure. It then interrupts the processing of logical unit 11, saves its internal state, and notifies the diagnostic processor 2 via the diagnostic interface 3 that logical unit 11 has failed.
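
The three actions of the monitoring means (interrupt, save, notify) can be sketched as follows. The `LogicalUnit` and `Monitor` classes, the register names, and the use of a plain callback in place of the diagnostic interface 3 are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class LogicalUnit:
    unit_id: int
    state: Dict[str, int] = field(default_factory=dict)  # hypothetical internal state
    running: bool = True


@dataclass
class Monitor:
    notify: Callable[[int], None]  # stands in for the diagnostic interface 3
    saved: Dict[int, Dict[str, int]] = field(default_factory=dict)

    def on_fault(self, unit: LogicalUnit) -> None:
        unit.running = False                         # interrupt the unit's processing
        self.saved[unit.unit_id] = dict(unit.state)  # save a copy of its internal state
        self.notify(unit.unit_id)                    # report the failure to the diagnostic processor


notified: List[int] = []
monitor = Monitor(notify=notified.append)
unit11 = LogicalUnit(11, {"pc": 0x400, "r0": 7})
monitor.on_fault(unit11)
print(unit11.running, monitor.saved, notified)
```

Saving a copy rather than a reference means that the snapshot read later by the diagnostic processor cannot be altered by the failed unit.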

In response, the diagnostic processor 2 starts the internal state reading means 21 and uses it to read out the internal state of logical unit 11 saved by the monitoring means 14, as shown in process 51 of FIG. 2. The diagnostic processor 2 then uses the retry possibility judging means 22 to judge whether the error detected from that internal state is either retriable but liable to become non-retriable, or retriable and fixed (process 52). If neither condition is met, failure handling is performed for logical unit 11, where the error occurred. In this failure handling step, if the error is retriable, logical unit 11 retries the instruction itself, and if the retry succeeds, operation simply continues. Otherwise, logical unit 11 is disconnected from the system.

If, on the other hand, process 52 judges the error to be retriable but liable to become non-retriable, or retriable and fixed (several retries have already failed), the handover information compiling means 23 compiles handover information from the read-out internal state (process 53), and the setting means 24 sets this information in the normal logical unit 13 (process 54). In process 52, an "error that is retriable but may become non-retriable" is one for which, because the state of the instruction differs with the moment at which the error occurs, an occurrence at one moment is retriable while an occurrence at another moment may not be. A "retriable and fixed error" is an error that is retriable but whose retries have already failed several times. The retry possibility of such errors is judged from the contents of the error flip-flops and similar indicators that record the error state, for which known techniques may be used.

Next, the handover re-execution/continuation instructing means 25 instructs the normal logical unit 13 to take over the processing (process 55), and logical unit 13 re-executes and continues it. As a result, the processing that was running when the retriable failure was detected is re-executed not on the logical unit in which the failure was detected but on another normal logical unit. The error that occurred in logical unit 11 is recovered, and subsequent processing is likewise taken over by the normal logical unit rather than the failed one. When a normal logical unit takes over the processing of a failed logical unit, the failed unit is normally disconnected logically from the system.
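
Processes 51 to 55 can be put together in one short sketch. The callbacks and the `Judgement` record are hypothetical stand-ins for means 21 to 25. Treating a failed local retry the same as a non-retriable error (disconnection) is one reading of the text, and is labeled as an assumption in the code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Judgement:  # outcome of process 52
    retriable: bool
    may_become_unretriable: bool
    fixed: bool   # several retries have already failed


def diagnose(
    read_state: Callable[[], Dict[str, int]],      # process 51
    judge: Callable[[Dict[str, int]], Judgement],  # process 52
    take_over: Callable[[Dict[str, int]], None],   # processes 53 to 55 on the normal unit
    retry_locally: Callable[[], bool],             # failure handling on the faulty unit
) -> str:
    state = read_state()
    j = judge(state)
    if j.retriable and (j.may_become_unretriable or j.fixed):
        take_over(state)
        return "taken over by normal unit"
    if j.retriable and retry_locally():
        return "recovered on faulty unit"
    # Assumption: a failed local retry is handled like a non-retriable error.
    return "faulty unit disconnected"


log: List[str] = []
outcome = diagnose(
    read_state=lambda: {"failing_pc": 0x1F40},
    judge=lambda s: Judgement(retriable=True, may_become_unretriable=True, fixed=False),
    take_over=lambda s: log.append(f"unit 13 resumes at {s['failing_pc']:#x}"),
    retry_locally=lambda: True,
)
print(outcome, log)
```

In this run the local retry callback is never invoked, which is the point of the method: a risky error is not retried on the unit that raised it.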

The above describes the case where logical unit 11 fails. If logical unit 13 fails instead, the normal logical unit 11 takes over the processing under the control of the diagnostic processor 2 in the same way, so the processing that was running on logical unit 13 can continue.

Effect of the Invention
As described above, according to the present invention, when an error occurs that is retriable but may become non-retriable, or that is retriable and fixed, the failing instruction is retried on another normal logical unit rather than on the unit where the error occurred. This reduces the risk that an intermittent-fault error during the retry makes the failing instruction non-retriable and leads to a system failure, job abort, or the like. In addition, for errors that are 100% retriable, the unit where the error occurred is not disconnected but continues operating by retrying the instruction itself, which keeps performance degradation to a minimum.

Brief Description of the Drawings

FIG. 1 is a block diagram of an embodiment of the present invention, and FIG. 2 is a flowchart showing an example of the processing of the diagnostic processor in the block diagram of FIG. 1.

Explanation of reference numerals for main parts
11, 13: logical unit
12: main storage device

Claims (1)

An error recovery method for logical units in an information processing system in which a plurality of logical units share a main storage device, characterized in that, when an error occurs in a logical unit and either the failing instruction is judged to be retriable but the error is one that may become non-retriable, or the error is judged to be retriable but to have failed several retries, the retry from the failing instruction is continued on another normal logical unit, without any retry on the logical unit that raised the error.
JP61282578A 1986-11-27 1986-11-27 Error recovery system for logical unit Pending JPS63136142A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
JP61282578A JPS63136142A (en) 1986-11-27 1986-11-27 Error recovery system for logical unit

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP61282578A JPS63136142A (en) 1986-11-27 1986-11-27 Error recovery system for logical unit

Publications (1)

Publication Number Publication Date
JPS63136142A true JPS63136142A (en) 1988-06-08

Family

ID=17654316

Family Applications (1)

Application Number Title Priority Date Filing Date
JP61282578A Pending JPS63136142A (en) 1986-11-27 1986-11-27 Error recovery system for logical unit

Country Status (1)

Country Link
JP (1) JPS63136142A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08286779A (en) * 1995-04-18 1996-11-01 Fuji Xerox Co Ltd Application automatic restarting device

Similar Documents

Publication Publication Date Title
EP0505706B1 (en) Alternate processor continuation of the task of a failed processor
JPH0950424A (en) Dump sampling device and dump sampling method
JPH02294739A (en) Fault detecting system
US8677179B2 (en) Information processing apparatus for performing error process when controllers in synchronization operation detect error simultaneously
JPS63136142A (en) Error recovery system for logical unit
JPS6341943A (en) Error restoring system for logic unit
JPH11120154A (en) Device and method for access control in computer system
JPS62172436A (en) Error recovery system for logical unit
JPS62113241A (en) Fault recovery device
JPH0430224A (en) Execution continuing system for processing
JPS60195649A (en) Error reporting system of microprogram-controlled type data processor
JPS635779B2 (en)
JPS5938852A (en) Fault processing system
JPS6143739B2 (en)
JPS597982B2 (en) Restart method in case of system failure of computer system
JPS63140341A (en) Error recovery system for logical unit
JPS6258344A (en) Fault recovering device
JPS61101845A (en) Test system of information processor
JPS6130297B2 (en)
JPS6156537B2 (en)
JPH04365145A (en) Memory fault processing method
JPH01133171A (en) Fault recovery system for multiprocessor system
JPS6130296B2 (en)
JPH056285A (en) Fault processing method for information processor
JPH04211841A (en) Duplex processor