JP2922981B2

JP2922981B2 - Task execution continuation method

Info

Publication number: JP2922981B2
Application number: JP2135200A
Authority: JP
Inventors: 正壱郎吉岡; 尚文山田
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1990-05-28
Filing date: 1990-05-28
Publication date: 1999-07-26
Anticipated expiration: 2014-07-26
Also published as: JPH0430224A

Description

[Detailed description of the invention]

[Industrial application field]

The present invention relates to a method for continuing the execution of a process (task) interrupted by a failure. In particular, it relates to a method that, in a multiprocessor system in which a plurality of instruction processors share a main storage device, allows the process that was running on one instruction processor to be continued normally when a failure in that processor prevents it from continuing execution.

[Conventional technology]

Known recovery methods for an intermittent failure in an instruction processor include retrying the process that was running when the failure occurred, and returning to a fixed checkpoint, such as an instruction boundary, and re-executing from there.

On the other hand, when a fixed (permanent) failure occurs in an instruction processor and recovery fails, the processor generally raises a recovery-failure interrupt on itself. For example, the occurrence of an unrecoverable failure is reported by a machine check interrupt or the like.

A machine check interrupt for an unrecoverable failure falls into one of the following two cases.

(1) An unrecoverable failure has occurred, but at the time of the interrupt the state of the checkpoint preceding the failure is guaranteed.

(2) An unrecoverable failure has occurred, and at the time of the interrupt the state of the checkpoint preceding the failure is not guaranteed.

A machine check interrupt in state (1) is called PD·B (processor damage, backup), and a machine check in state (2) is called PD (processor damage). The difference between the two is explained with reference to FIG. 2. In FIG. 2, 21 to 24 are instructions. Suppose a failure occurs while the instruction processor is executing instruction C. Depending on the nature of the failure, the instruction processor raises either a PD·B or a PD machine check interrupt. When the state of the instruction processor before execution of instruction C (the state of the checkpoint preceding the failure) is guaranteed, the processor raises a PD·B machine check interrupt. Consequently, once the interrupt has been accepted, the program can generally continue normally by re-executing from instruction C. On the other hand, when the checkpoint state before instruction C cannot be guaranteed, for example because instruction C has already changed the internal state of the instruction processor, the processor raises a PD machine check interrupt. In this case, re-executing from instruction C after the interrupt generally cannot guarantee correct processing.
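
The distinction between the two machine check types can be sketched in a few lines of Python. This is only an illustration of the rule stated above, not part of the patent: the names `MachineCheck`, `classify`, and `resume_point` are hypothetical, and a real processor makes this decision in hardware.

```python
from enum import Enum


class MachineCheck(Enum):
    PD_B = "processor damage, backup"  # checkpoint before the failure is guaranteed
    PD = "processor damage"            # checkpoint before the failure is not guaranteed


def classify(checkpoint_guaranteed: bool) -> MachineCheck:
    """Pick the machine check type from the checkpoint condition (hypothetical helper)."""
    return MachineCheck.PD_B if checkpoint_guaranteed else MachineCheck.PD


def resume_point(failed_index: int, kind: MachineCheck):
    """Return the instruction index to re-execute from, or None if no safe point exists."""
    # With PD-B the failed instruction itself (instruction C in FIG. 2) can be re-executed.
    return failed_index if kind is MachineCheck.PD_B else None
```

With instruction C at index 2, `resume_point(2, classify(True))` gives back index 2, while the PD case yields `None`, reflecting that re-execution cannot be assumed safe.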

Furthermore, depending on how the failure occurred, subsequent re-execution and failure recovery may be entirely impossible. Failure recovery is likewise impossible when a further failure occurs during recovery processing such as checkpoint guarantee. In such cases the instruction processor stops abnormally. This behavior is called a check stop. When one instruction processor in a multiprocessor system check-stops, an interrupt called a malfunction alarm is reported to the other running instruction processors.

The detection and reporting of failures in an instruction processor, and the operation of check stop and the malfunction alarm interrupt, are explained with reference to FIG. 3. In FIG. 3, 311a and 311b are instruction processors. This configuration example shows two instruction processors, but the same applies when there are two or more. Each instruction processor has an instruction execution unit 312a, b, a failure detection unit 313a, b, an interrupt control unit 314a, b, and a check stop latch 315a, b.

When the failure detection unit 313a of instruction processor 311a detects a failure that is not a check stop cause, it reports the fact to the instruction execution unit 312a via signal line 319a. On receiving the report, the instruction execution unit 312a attempts an instruction retry and recovery of the checkpoint state. If the instruction retry succeeds, processing simply continues; if it fails, the instruction execution unit raises a machine check interrupt on its own processor. As described above, a PD·B machine check interrupt is raised when recovery of the checkpoint state succeeded, and a PD machine check interrupt when it failed.

When the failure detection unit 313a detects a failure that is a check stop cause, it instructs the instruction execution unit 312a via signal line 319a to stop executing instructions, and the instruction execution unit 312a stops. The failure detection unit 313a also sets the check stop latch 315a via signal line 317a. Once the check stop latch 315a is set, its state is passed via signal line 318a to the interrupt control unit 314b of the other instruction processor 311b. The interrupt control unit 314b combines it with the interrupt mask, which controls whether the interrupt is enabled, to produce an interrupt signal. The interrupt signal is passed to the instruction execution unit 312b via signal line 320b, and a malfunction alarm interrupt occurs in the instruction execution unit 312b. In this case the internal state of the check-stopped instruction processor 311a is generally undefined; even if that state could be read out, it could not be used to retry or to re-execute from a checkpoint.
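
The check stop and malfunction alarm path described above can be modeled roughly as follows. This is a minimal sketch under stated assumptions: the text only says the latch signal is combined with the interrupt mask, so the logical AND used here is an assumption, and the `Processor` class, its field names, and `check_stop` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Processor:
    name: str
    running: bool = True
    check_stop_latch: bool = False
    alert_mask_enabled: bool = True  # hypothetical mask bit for the malfunction alarm
    alert_pending: bool = False


def check_stop(failed: Processor, others: list) -> None:
    """Stop the failed processor and raise a malfunction alarm on its running peers."""
    failed.running = False
    failed.check_stop_latch = True
    for peer in others:
        # Assumption: the interrupt control unit ANDs the latch signal with the mask.
        if peer.running and failed.check_stop_latch and peer.alert_mask_enabled:
            peer.alert_pending = True
```

A peer whose mask bit is off never sees the alarm in this model, which is why the mask state matters to any scheme that relies on the malfunction alarm for takeover.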

When a failure occurs and hardware re-execution fails, it is the OS running on the failed instruction processor that receives the report, such as a machine check interrupt, and either continues processing or performs abnormal termination. That instruction processor is likely to fail again. If it does, the failure occurs in a core part of the OS, such as its failure-handling code, and is likely to bring the system down.

To prevent a fixed failure from bringing the system down, multiprocessor systems have conventionally had another, normal instruction processor take over and execute the processing that was running on the failed instruction processor. Several techniques of this kind have been proposed (Japanese Examined Patent Publication Nos. Sho 47-36181 and Sho 61-56537, and Japanese Unexamined Patent Publication Nos. Sho 57-85151, Sho 57-137949 and Hei 1-133171, among others). In these schemes, the state of the failed instruction processor immediately before the failure is stored in main storage or the like, or transferred directly to a normal instruction processor, through the operation of the failed instruction processor, a normal instruction processor, or a recovery control device. The normal instruction processor then uses that information to re-execute the processing that was interrupted on the failed instruction processor.

[Problems to be solved by the invention]

In the prior art, having a normal instruction processor take over the processing of a failed instruction processor requires either a dedicated recovery control device (Japanese Examined Patent Publication Nos. Sho 47-36181, Sho 61-56537, etc.) or a circuit inside the normal instruction processor for retrying the failed processor's instructions (Japanese Unexamined Patent Publication No. Sho 57-137949, etc.). This demands a large amount of additional hardware and complicates the mechanism. Alternatively, the failed instruction processor itself creates the information needed to continue the processing (Japanese Examined Patent Publication No. Sho 57-85151, etc.), which increases the likelihood that it will fail again. In addition, even in a multiprocessor, when for some reason the failed processor is the only one running, performing the operation for handing processing over to a normal instruction processor means abandoning re-execution and recovery on the failed processor. This raises the probability that the system stops and lowers reliability.

An object of the present invention is to overcome these problems and provide a scheme in which the processing of a failed instruction processor is taken over and executed by another, normal instruction processor, with little additional hardware, with a reduced risk of new failures arising while the failed processor creates the information needed to continue the processing, and without increasing the probability of a system stop even when only one instruction processor is running.

[Means for solving the problem]

To achieve this object, the present invention provides an indicator showing whether recovery of the checkpoint state succeeded in the failure recovery processing performed during instruction execution, and uses the following: a scan-out mechanism that stores the internal state of hardware such as an instruction processor in main storage; a mechanism by which instruction processors report their operating state to one another; a mechanism for reporting a malfunction alarm interrupt to the other instruction processors when an instruction processor check-stops; and an instruction that reads the contents of main storage.

[Action]

When a fixed failure occurs in an instruction processor, the failed processor guarantees the checkpoint and attempts an instruction retry. If the retry fails and the checkpoint state before the instruction can be guaranteed, the processor sets the checkpoint guarantee indicator and stores the checkpoint state in main storage by scan-out. Examples of the concrete operation of scan-in and scan-out are described in, for example, Japanese Unexamined Patent Publication No. Sho 59-161744, "Scan method for information instruction processing device", and No. Sho 61-123939, "Scan method for information instruction processing device". The instruction processor then check-stops, and a malfunction alarm interrupt is reported to the other instruction processors. In the instruction processor that receives it, the OS reads the scan-out information stored in main storage, confirms from the checkpoint indicator that the checkpoint information was stored correctly, then edits the stored internal information of the failed processor and creates the control table needed to continue the processing. In this way, processing can be continued on another, normal instruction processor even after a fixed failure, without large-scale additional circuits or devices and without relying on the failed instruction processor to manipulate the information.

[Example]

An embodiment of the present invention is described below with reference to the drawings.

FIG. 1 is a block diagram of a multiprocessor system to which an embodiment of the present invention is applied. In the figure, 111a, b are instruction processors (hereinafter IP), 121 is a system controller (hereinafter SC), and 131 is a main storage device (hereinafter MS). Instruction processors 111a and 111b have the same configuration; each has an instruction execution unit 112a, b, a failure detection unit 113a, b, an interrupt control unit 114a, b, a checkpoint guarantee latch 115a, b, and a check stop latch 116a, b. Each instruction execution unit 112a, b contains an execution control unit 117a, b and a checkpoint guarantee unit 118a, b. The system controller 121 contains a scan control unit 122. FIG. 1 shows a configuration with two instruction processors, but this embodiment applies unchanged to configurations with two or more. Because the notification for continuing processing reuses the conventional malfunction alarm interrupt, the checkpoint guarantee latch is essential: it lets a normal IP that receives a malfunction alarm interrupt distinguish a malfunction alarm caused by a check stop in the conventional sense from one raised to continue processing as described in this invention.

Next, the operation when a failure occurs in IP 111a is explained with reference to FIG. 1. The instruction execution unit 112a of IP 111a accesses MS 131 via signal line 152a and executes instructions. The fact that the instruction execution unit 112a is running is reported to the other instruction processor via signal line 147. Likewise, whether the other instruction processor is running is reported to the execution control unit 117a via signal line 148. These signals are conventionally used for purposes such as consistency control of buffer storage and of the address translation buffer, neither of which is shown in this embodiment. The failure detection unit 113a detects a failure in its own IP and reports it to the execution control unit 117a of that IP via signal line 143a. When a fixed failure occurs in the IP, the execution control unit 117a instructs the checkpoint guarantee unit 118a via signal line 146a to guarantee the checkpoint, and when this succeeds it sets the checkpoint guarantee latch 115a via signal line 141a. It then instructs the scan control unit 122 in SC 121, via signal line 149a, to scan out the internal state of instruction processor 111a. The scan control unit 122 reads the internal information of IP 111a via signal line 150a and writes it to MS 131 via signal line 151. Next, when the execution control unit 117a confirms from the operating state reported on signal line 148 that another instruction processor is running, it sets the check stop latch 116a, b via signal line 142a, b and stops executing instructions on its own IP. Setting the check stop latch 116a, b sends an interrupt signal via signal line 145a, b to the interrupt control unit 114b, a of the other, normal IP. In response, the interrupt control unit 114b, a instructs the execution control unit 117b, a of the normal IP, via signal line 144b, a, to raise an interrupt.

The instruction execution unit 112a, b of each IP has a checkpoint guarantee unit 118a, b, which on a failure returns the state of the IP to its state at some point (checkpoint) before the failure, and an execution control unit 117a, b for instructions. FIG. 4 shows an example configuration of the checkpoint guarantee unit. The execution control unit is a control circuit whose operation is shown in FIG. 5. FIG. 4 shows the checkpoint guarantee unit 118a of IP 111a; that of IP 111b is the same.

In FIG. 4, the execution control unit 117a controls each element of the checkpoint guarantee unit through signal lines such as 146a-1 and 146a-2. Only the two control signal lines needed to describe the operation, 146a-1 and 146a-2, are shown. In the figure, 411 is a register into which data from MS 131 is set via signal line 152a-1, 412 is a register that holds data before it is operated on by the arithmetic unit (ALU) 413, 413 is the arithmetic unit, and 414 is a register that holds the result. 415 is the group of general-purpose registers that instructions can reference, 416 is a selector that chooses the input to the general-purpose register group 415, and 417 is a group of registers that holds pre-operation data. 418 is a group of control information indicating whether the data saved in 417 is data from which a retry is possible or data from which a checkpoint can be guaranteed; it is controlled in step with register group 417. Data from register group 415 is set in register 412 via signal line 425. After the ALU 413 operates on the contents of registers 411 and 412, the result is written back to register group 415 via signal line 424 or written to MS 131 via signal line 152a-2. Register group 417 saves the contents of register 412 on every instruction execution, so that the contents of register group 415 before each write (before the operation) are saved in order. The control information group 418 indicates, for the data saved in register group 417 on each instruction execution, whether it is retryable data or checkpoint-guaranteeable data; the execution control unit saves it in order via control signal line 146a-1, in step with the saves to register group 417.
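
The interplay between the general-purpose registers (415), the save registers (417), and the control information (418) can be approximated in software as an undo log. The following is a rough model only, assuming that saved values are undone newest first and that each entry carries two flags; the class and field names are hypothetical and the hardware in FIG. 4 is not organized as Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class SaveEntry:
    reg: int             # index of the general-purpose register about to be overwritten
    old_value: int       # its contents before the operation (what 417 holds)
    retryable: bool      # hypothetical flag from 418: a retry is possible from here
    checkpoint_ok: bool  # hypothetical flag from 418: the prior state is a valid checkpoint


@dataclass
class CheckpointUnit:
    gprs: list = field(default_factory=lambda: [0] * 4)  # stands in for register group 415
    saved: list = field(default_factory=list)            # stands in for 417 plus 418

    def execute_write(self, reg, value, retryable=True, checkpoint_ok=True):
        """Save the old register value and its flags, then perform the write."""
        self.saved.append(SaveEntry(reg, self.gprs[reg], retryable, checkpoint_ok))
        self.gprs[reg] = value

    def restore_checkpoint(self) -> bool:
        """Undo writes, newest first, until an entry marked checkpoint_ok has been undone."""
        while self.saved:
            entry = self.saved.pop()
            self.gprs[entry.reg] = entry.old_value
            if entry.checkpoint_ok:
                return True
        return False  # no guaranteed checkpoint could be reached
```

A `False` result from `restore_checkpoint` corresponds to the case in which the checkpoint cannot be guaranteed and only a PD machine check is possible.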

Next, the operation of the instruction execution unit when a failure occurs in IP 111a is explained with reference to FIGS. 4 and 5. The operation is the same when a failure occurs in IP 111b.

When a failure occurs in IP 111a, the execution control unit 117a receives a report from the failure detection unit 113a via signal line 143a. It first reads, via signal line 146a-2, the state of the information saved immediately before the failure and determines whether a retry is possible from it (step 501). If a retry is not possible, it raises a retry-failure machine check interrupt (PD). If a retry is possible, it performs retry restore processing, which restores the retry data to register group 415 via signal line 423 (step 502), and retries the processing in which the failure occurred (step 503). In this embodiment, the success or failure of this retry determines whether the failure is a fixed failure or an intermittent one (step 504). Alternatively, a failure may be judged to be fixed when the same failure recurs after several retries of the instruction, or by testing the failed part. If the failure is not a fixed failure, the retry succeeds and normal instruction execution continues.

If the retry fails, or if the failure can be identified as a fixed failure when it occurs, processing proceeds as follows. First, the state of the information saved before the failure is read via signal line 146a-2 to check whether the internal state can be returned to some checkpoint before the failure (step 505). The checkpoint cannot be guaranteed when, for example, the contents of MS 131 have already been rewritten, or register group 415 cannot be recovered from the contents of save registers 417. In that case a machine check interrupt indicating PD is raised. If the checkpoint can be guaranteed, the contents of save registers 417 in FIG. 4 are written to register group 415 via signal line 424 and selector 416, back to the point at which the checkpoint can be guaranteed (step 506). When checkpoint guarantee completes, the checkpoint guarantee latch is set via signal line 141a (step 507). Next, a scan-out instruction is issued via signal line 149a to the scan-out control unit 122 in SC 121 (step 508). On receiving it, the scan-out control unit 122, as in an ordinary scan operation, reads the internal state of the instruction processor, including the contents of register group 415 in FIG. 4 and the state of the checkpoint guarantee latch, via signal line 150a-1 and others, and stores it in MS 131 via signal line 151. When reading and storing are complete, it reports the end of scan-out to the execution control unit 117a via signal line 149a. The execution control unit 117a then checks the operating state of the other IPs on signal line 148 (step 509). If another IP is running, it sets the check stop latch via signal line 142a (step 510), stops executing instructions, and check-stops. Setting the check stop latch sends a malfunction alarm interrupt request via signal line 145a to the interrupt control unit 114b of IP 111b, the normal IP that is running. If no other IP is running, it raises a PD·B machine check on itself.
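
Steps 501 to 510 form a decision sequence that can be summarized as the sketch below. It follows the branch order given in the text, but the function `handle_failure`, the `Outcome` names, and the boolean inputs are hypothetical stand-ins for hardware conditions, and `scan_out` is an assumed callback representing the scan-out request to the SC.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    CONTINUE = "retry succeeded, execution continues"
    MACHINE_CHECK_PD = "PD machine check"
    MACHINE_CHECK_PD_B = "PD-B machine check on the failed IP itself"
    CHECK_STOP = "check stop, malfunction alarm sent to the other IP"


def handle_failure(retryable: bool, retry_succeeds: bool,
                   checkpoint_restorable: bool, other_ip_running: bool,
                   scan_out: Callable[[bool], None]) -> Outcome:
    if not retryable:                 # step 501: saved data does not allow a retry
        return Outcome.MACHINE_CHECK_PD
    # steps 502-503: restore the retry data and retry the failed processing
    if retry_succeeds:                # step 504: intermittent failure
        return Outcome.CONTINUE
    if not checkpoint_restorable:     # step 505: checkpoint cannot be guaranteed
        return Outcome.MACHINE_CHECK_PD
    # steps 506-507: restore registers, set the checkpoint guarantee latch
    scan_out(True)                    # step 508: store the state, latch included
    if other_ip_running:              # step 509
        return Outcome.CHECK_STOP     # step 510
    return Outcome.MACHINE_CHECK_PD_B
```

Note that the scan-out happens before the check for another running IP, so the saved state exists in main storage in both the takeover case and the single-processor PD·B case.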

FIG. 6 shows an example of the IP internal information stored in MS 131. The internal information 61 consists of checkpoint guarantee latch information 611, which indicates whether checkpoint guarantee succeeded, information 612 on the program-readable registers that make up the internal state of the IP, information 613 on timers, and so on. The area in which this information is stored may be a fixed area assigned in advance to each instruction processor, or may be specified to the instruction processor or the scan-out control unit 122 before a failure occurs. Alternatively, an area used exclusively for scan-out and the like may be set up and the information stored there; in that case, however, a means for software to read the stored information is required.
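
To make the stored record concrete, here is one possible byte layout for the information in FIG. 6. The patent does not specify field sizes or ordering, so everything in this layout is an assumption: one flag byte for 611, three padding bytes, four 32-bit registers for 612, and one 64-bit timer for 613. The names `RECORD`, `pack_record`, and `unpack_record` are hypothetical.

```python
import struct

# Hypothetical layout: flag byte (611), 3 pad bytes, 4 x uint32 registers (612), uint64 timer (613).
RECORD = struct.Struct("<B3x4IQ")


def pack_record(checkpoint_guaranteed: bool, registers: list, timer: int) -> bytes:
    """Serialize the scanned-out state as it might be written to main storage."""
    return RECORD.pack(int(checkpoint_guaranteed), *registers, timer)


def unpack_record(raw: bytes) -> dict:
    """Parse the record the way the OS on the surviving IP might read it."""
    flag, *rest = RECORD.unpack(raw)
    return {
        "checkpoint_guaranteed": bool(flag),
        "registers": rest[:4],
        "timer": rest[4],
    }
```

Placing the checkpoint flag first lets a reader test it before interpreting the remaining fields, which matches the order of steps the OS follows on takeover.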

Next, an example of the operation performed when the execution control unit 117b of IP 111b accepts a malfunction alarm interrupt is described with reference to FIG. 7. On accepting the malfunction alarm interrupt, the OS reads the IP internal state stored in MS 131 (step 701) and tests whether the checkpoint is guaranteed (step 702). If the checkpoint is guaranteed and the IP internal information from before the failure has been stored, the OS reads the stored internal information of IP 111a, edits it into the form of task control information so that the processing that was in progress can be continued (step 703), and registers it in the appropriate ready task queue so that the task continues to execute (step 704). Once registered in the queue, the task is dispatched and executed in turn, just like a task that has become ready under normal time-slice control or an I/O-waiting task whose I/O has completed. If the checkpoint is not guaranteed, the internal information of IP 111a has not been saved and the processing cannot be continued; therefore, as in the conventional handling of a malfunction alarm interrupt, the processing that IP 111a was executing when the failure occurred is abnormally terminated (step 705).
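The flow of FIG. 7 (steps 701 to 705) can be sketched as follows. This is a minimal illustration in Python, not the OS code of the embodiment: `SavedState`, `Task`, `handle_malfunction_alarm`, the `program_counter` field, and the use of a deque as the ready task queue are hypothetical names and simplifications chosen for the example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Optional


@dataclass
class SavedState:
    """Hypothetical image of the failed IP's internal state in main storage."""
    checkpoint_guaranteed: bool
    registers: List[int]
    program_counter: int  # assumed to be part of the saved register information


@dataclass
class Task:
    """Hypothetical task control information."""
    task_id: int
    registers: List[int]
    program_counter: int
    abnormally_terminated: bool = False


def handle_malfunction_alarm(saved: SavedState, interrupted: Task,
                             ready_queue: Deque[Task]) -> Optional[Task]:
    """Runs on the healthy IP; returns the re-queued task, or None if it was terminated."""
    # Step 701 is modeled by receiving `saved`, the state read from main storage.
    if saved.checkpoint_guaranteed:                      # step 702: test the indicator
        resumed = Task(interrupted.task_id,              # step 703: edit into task control info
                       list(saved.registers), saved.program_counter)
        ready_queue.append(resumed)                      # step 704: register in the ready queue
        return resumed
    interrupted.abnormally_terminated = True             # step 705: abnormal termination
    return None


queue: Deque[Task] = deque()
task = Task(task_id=7, registers=[0] * 4, program_counter=0)
ok = handle_malfunction_alarm(SavedState(True, [1, 2, 3, 4], 0x100), task, queue)

bad_task = Task(task_id=8, registers=[0] * 4, program_counter=0)
bad = handle_malfunction_alarm(SavedState(False, [], 0), bad_task, queue)
```

Once appended, the resumed task simply waits in the queue like any other ready task; the dispatcher itself is left out of the sketch.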

Because the software processing described above is performed on IP 111b, a state in which processing can be continued is guaranteed, and the failure does not recur either when the processing is continued or when continuation is impossible and abnormal termination processing is performed.

In this embodiment, a hardware latch is provided as the checkpoint assurance indicator, and its state is stored in the MS. Instead of a hardware latch, other methods may be used, such as writing the indicator directly to the MS or having the IP that continues the processing read the latch state directly, provided that the value of the checkpoint assurance indicator can be tested from the IP that continues the processing. Also, in this embodiment the instruction processing device that continues the processing creates task control information, but any other control method that allows the processing to be continued may be used.
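The point that only the testability of the indicator matters can be illustrated by hiding its source behind a function. In the hypothetical Python sketch below, `read_from_main_storage` and `read_latch_directly` stand in for two of the alternatives mentioned above; the dictionaries that model main storage and the latches are assumptions for illustration only.

```python
from typing import Callable, Dict

# Assumed stand-ins for the two places where the indicator might live.
main_storage: Dict[str, bool] = {"IP111a.checkpoint_guaranteed": True}
hardware_latches: Dict[str, bool] = {"IP111a": True}


def read_from_main_storage(ip_name: str) -> bool:
    """Indicator value was stored in main storage, as in the embodiment."""
    return main_storage[f"{ip_name}.checkpoint_guaranteed"]


def read_latch_directly(ip_name: str) -> bool:
    """Alternative: the continuing IP reads the failed IP's latch itself."""
    return hardware_latches[ip_name]


def can_continue(ip_name: str, test_indicator: Callable[[str], bool]) -> bool:
    """The continuation logic depends only on being able to test the indicator."""
    return test_indicator(ip_name)


via_storage = can_continue("IP111a", read_from_main_storage)
via_latch = can_continue("IP111a", read_latch_directly)
```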

[Effect of the invention]

As is clear from the above description, according to the present invention, in a multiprocessor system sharing a main storage device, when a fixed failure occurs in an instruction processing device, the operation for continuing processing (guaranteeing a checkpoint, reporting an interrupt to another normal instruction processing device, and having that normal instruction processing device continue the interrupted processing) can be realized with a minimal increase in hardware, easily and inexpensively by using conventional mechanisms, and without lowering reliability even when only one instruction processing device of the multiprocessor is operating.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a multiprocessor system to which an embodiment of the present invention is applied; FIG. 2 is an explanatory diagram of checkpoints; FIG. 3 is a block diagram showing the configuration of an instruction processing device in a conventional system; FIG. 4 is a diagram showing a configuration example of the checkpoint assurance unit in the instruction execution unit; FIG. 5 is a flowchart of the processing performed by the instruction execution control unit during failure handling; FIG. 6 is an explanatory diagram showing an example of IP internal information; and FIG. 7 is a flowchart of the processing performed by a normal instruction processing device that has received an interrupt.

13: main storage device; 21 to 24: instructions; 61: example of IP internal information; 111a, b: instruction processing devices; 112a, b: instruction execution units; 113a, b: failure detection units; 114a, b: interrupt control units; 115a, b: checkpoint assurance latches; 116a, b: check-stop latches; 117a, b: execution control units; 118a, b: checkpoint assurance units; 121: system control device; 122: scan-out control unit; 411, 412, 414: registers; 413: arithmetic unit; 415: general-purpose register group; 417: save register group; 418: control information group.

Continuation of the front page
(56) References: JP-A-58-35645 (JP, A); JP-A-57-85151 (JP, A); JP-A-57-137949 (JP, A); JP-A-1-133171 (JP, A); JP-B-61-56537 (JP, B2); JP-B-47-36181 (JP, B1)
(58) Fields searched (Int. Cl.⁶, DB name): G06F 11/14; G06F 11/20; G06F 15/16

Claims

(57) [Claims]

1. A task execution continuation method in a multiprocessor system that comprises a plurality of instruction processing devices and a main storage device shared by the plurality of instruction processing devices, each of the instruction processing devices having means for detecting its own failure and, when a failure occurs, returning the internal state of the instruction processing device to the state at a point before the failure occurred, means for saving the internal state of the instruction processing device to the main storage device for each instruction processing device, means for notifying at least one other instruction processing device, and means for reading the internal information of the failed instruction processing device saved in the main storage device and creating task control information, the multiprocessor system executing tasks on the basis of task control information registered in a ready task queue provided in the main storage device, wherein each of the instruction processing devices, when a fixed failure occurs, guarantees the state before the occurrence of the failure and, if the state can be guaranteed, saves that state to the area of the main storage device corresponding to that instruction processing device, then checks whether another operating instruction processing device exists and, if one exists, notifies that instruction processing device; and the notified instruction processing device reads the internal information of the failed instruction processing device saved in the main storage device, determines the validity of the internal information, edits the task control information of the interrupted task, and registers the task control information in the ready task queue, whereby an instruction processing device of the multiprocessor system in which no failure has occurred continues execution of the interrupted task.