JPS6375843A

JPS6375843A - Abnormality monitor system

Info

Publication number: JPS6375843A
Application number: JP61220405A
Authority: JP
Inventors: Akio Murata; 明男村田
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1986-09-18
Filing date: 1986-09-18
Publication date: 1988-04-06

Abstract

PURPOSE: To prevent degradation in throughput by having two CPUs monitor each other so that, when one CPU is faulty, the other, normal CPU restarts and restores it. CONSTITUTION: If no response to monitor data arrives within a prescribed time, a break signal 52 is sent from a break signal sending means 35. In the CPU that receives this signal, a first counting means 36 generates an interrupt signal 53 on counting a prescribed bit length, and a second counting means 37 outputs a reset signal 54 on counting a bit length longer than that prescribed bit length. On receiving the interrupt signal 53, a restoration processing part 38 inhibits the output of the reset signal 54 and performs restoration processing such as a retry; when the CPU is restored, the processing part 38 responds with a restoration message 55. The break signal 52 is then stopped and processing is restarted.

Description

[Detailed Description of the Invention]

[Summary]

In a processing device equipped with two processor units (CPUs) that monitor each other for abnormal conditions by exchanging monitoring data via a serial interface, a CPU that receives no response within a predetermined time after sending monitoring data sends a break signal. The CPU that receives this break signal counts the bit length of the break signal and, on reaching predetermined lengths, generates an interrupt signal and a reset signal.

On the interrupt signal, recovery processing is performed with the current state preserved. When recovery processing is impossible, or when the interrupt cannot be accepted, the reset signal is generated to reset each part, and the CPU is restored if it proves to be normal.

In this way, an abnormality monitoring system is provided in which disconnection is performed only after recovery has been attempted.

[Industrial application field]

The present invention relates to an abnormality monitoring system in which two CPUs monitor each other and the CPU that detects an abnormality causes the other to recover automatically.

In a processing device that requires high reliability, two CPUs are provided, for example, to share functions, and the device is configured so that when one CPU goes down the other performs all processing. However, once the downed CPU is disconnected, processing capacity falls.

There is therefore a need for a processing device that restarts and restores a downed CPU and so prevents the loss of processing capacity.

[Conventional technology]

FIG. 3(a) is a block diagram of an I/O controller that has two CPUs and controls a plurality of I/O devices, and FIG. 3(b) is an explanatory diagram of the conventional abnormality monitoring system.

In FIG. 3(a), 1 and 2 are processor units (CPUs; hereinafter one of the two is called CPU1 and the other CPU2), which are connected to a host (not shown) and control I/O devices 14 and 15, respectively. 13 is a switching unit, which together with CPU1 and CPU2 constitutes the I/O controller.

In this I/O controller, CPU1 controls I/O device 14 and CPU2 controls I/O device 15, and CPU1 and CPU2 monitor each other. When one of them (for example, CPU2) goes down, the other (CPU1) detects this and notifies the switching unit 13, which connects I/O device 15 to CPU1.

As shown in FIG. 3(b), CPU1 and CPU2 monitor each other by exchanging monitoring data 50 and 51 at fixed intervals through a serial interface (serial IF control units 6 and 11). When no response arrives from the other side within a predetermined time (t0), the CPU judges that an abnormality has occurred and switches the I/O device.
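For comparison with the invention, the conventional decision can be sketched in a few lines of Python. This is only an illustrative model: the function name, the use of plain numbers for time, and the returned strings are assumptions and do not appear in the publication.

```python
def conventional_monitor(last_response_time, now, timeout):
    """Model of the conventional scheme: one missed deadline means disconnection.

    All names are hypothetical; times are in arbitrary but consistent units.
    """
    if now - last_response_time > timeout:
        # No retry is attempted: the peer's I/O device is switched over at once.
        return "switch_io"
    return "ok"
```

The point of the sketch is that there is no intermediate state between "ok" and disconnection, which is the behaviour the invention sets out to change.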

[Problem that the invention seeks to solve]

In the conventional abnormality monitoring system described above, the normal CPU disconnected the other CPU even when the abnormal state was a temporary, recoverable fault.

As a result, the processing load fell on the normal CPU, and the processing capacity of the I/O controller was reduced.

In view of this problem, an object of the present invention is to provide a simple abnormality monitoring system in which the normal CPU causes recovery processing to be performed.

[Means for solving problems]

To achieve this object, the abnormality monitoring system of the present invention, as shown in the principle diagram of FIG. 1, provides each processor unit (CPU1, CPU2) with the following: a break signal sending means (35) that sends a break signal (52) when monitoring data is not received within a predetermined time, and stops sending the break signal when a recovery message (55) is received; a first counting means (36) that counts the bit length of the received break signal (52) and, when the count reaches a first set value, generates an interrupt signal (53) that starts a recovery processing unit (38); a second counting means (37) that counts the same bit length and, when the count reaches a second set value that is at least as large as the first set value, outputs a reset signal (54) that resets each part; an inhibiting means (39) that inhibits output of the reset signal (54); the recovery processing unit (38), which instructs the inhibiting means (39) to inhibit output of the reset signal (54), executes recovery processing, responds with the recovery message (55) once recovery is complete, and releases the inhibiting means (39) when recovery is impossible; and a reset processing unit (40) that verifies the normality of each part after the reset signal (54) has been output and sends the recovery message (55) when it judges them to be normal.

[Effect]

When no response to the monitoring data arrives within the predetermined time, a break signal 52 consisting of continuous spaces is sent, and the CPU that receives this break signal 52 counts its bit length.

The first counting means 36 generates the interrupt signal 53 when it has counted a predetermined bit length (the first set value), and the second counting means 37 outputs the reset signal 54 when it has counted a longer bit length (the second set value).

On receiving the interrupt signal 53, the processor saves its current processing state and hands recovery processing over to the recovery processing unit 38.

The recovery processing unit 38 inhibits output of the reset signal 54 and performs recovery processing such as a retry. If recovery succeeds, it responds with the recovery message 55 and returns control to the processor.

The break signal 52 then stops and processing resumes.

If this recovery processing cannot restore the CPU, the inhibition on the reset signal 54 is released so that the reset signal 54 is output.

The reset signal 54 is likewise output when the interrupt signal 53 is not accepted.

After the reset, each part is tested, and if no abnormality is found the recovery message 55 is returned and the CPU is restored.

If the recovery message 55 has not been returned by a predetermined time, the normal CPU performs the disconnection.

In this way, the CPU that detects an abnormality sends the break signal 52 before disconnecting and has the abnormal CPU perform recovery processing.
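The sequence described in this section can be summarised as a small decision model. The Python sketch below is an illustration under simplifying assumptions, not the circuit of the publication: recovery is treated as finishing instantly at the moment of the interrupt, lengths are counted in whole bits, and the function name, parameter names, and event strings are all hypothetical.

```python
def receiver_outcome(break_bits, t2, t3, interrupt_accepted=True,
                     recovery_succeeds=True, self_test_ok=True):
    """Return the events seen on the CPU that receives a break of break_bits bits.

    t2 and t3 stand for the first and second set values (t3 > t2).
    """
    if t3 <= t2:
        raise ValueError("t3 must be greater than t2")
    events = []
    inhibit = False
    for count in range(1, break_bits + 1):
        if count == t2:
            events.append("interrupt")
            if interrupt_accepted:
                inhibit = True            # recovery unit inhibits the reset output
                if recovery_succeeds:
                    events.append("recovery_message")
                    return events         # sender stops the break on this message
                inhibit = False           # recovery impossible: release the inhibit
        if count == t3 and not inhibit:
            events.append("reset")
            events.append("recovery_message" if self_test_ok else "no_response")
            return events
    return events
```

For example, `receiver_outcome(100, 10, 20, recovery_succeeds=False)` walks the fallback path from interrupt to reset, while a break shorter than `t2` produces no events at all.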

[Example]

An embodiment of the present invention will be described with reference to FIGS. 1, 2, and 3.

FIG. 2 is a block diagram of the CPU of the embodiment.

CPU1 and CPU2 have the same configuration, and FIG. 2 shows the main part of one processor unit (CPU).

In the figure, 20 is a serial interface (IF) control unit and 21 is a processor. 31 is a monitoring data/break signal sending unit, which corresponds to the break signal sending means 35: it periodically exchanges monitoring data 50 and 51 via the serial interface control unit 20, sends the break signal 52 when a timer (not shown) indicates that no response has arrived within a predetermined time (t1), and stops sending the break signal 52 when the recovery message 55 is received. 24 is a counter (the first counting means) that outputs the interrupt signal 53 to the processor 21 through latch circuit 26 when it has counted the bit length of the received break signal 52 up to a predetermined length (time t2). 25 is a counter (the second counting means) that counts the received break signal 52 up to a predetermined length (time t3, where t3 > t2) and outputs the reset signal 54 via AND gate 28. 27 is an inhibit register (the inhibiting means 39) into which data that inhibits output of the reset signal 54 is written. 29 is a latch circuit in which the output state of the reset signal 54 is set. 40 is a reset processing unit, which is started after a reset, tests each part when the reset output is set in latch circuit 29, and responds with the recovery message 55 if they are normal. 100 is a bus made up of an address bus, a data bus, and so on.

In a CPU configured in this way, the monitoring data/break signal generating unit 31 operates during normal operation, and the break signal processing means 22, shown by the dotted line in FIG. 2, operates when a break signal is received.

In the following description of operation, CPU1 is the normal side and CPU2 the abnormal side, and the reference numerals of FIG. 2 refer to CPU2. Where a numeral refers to the CPU1 side, it is marked (CPU1).

For components other than the CPUs, the reference numerals of FIG. 3 are used.

The recovery operation is described below with reference to FIG. 1(b).

CPU1 and CPU2 each periodically send monitoring data, 50 (CPU1) and 51 (CPU2), to the other from the monitoring data/break signal sending unit 32 via the serial interface control units 20 and 20 (CPU1).

When CPU1 does not receive the monitoring data 51 from CPU2 within the predetermined time (t1), the monitoring data/break signal sending unit 32 (CPU1) sends the break signal 52 to CPU2.

As described above, the break signal 52 is a continuous run of spaces ("0" data). In CPU2, which receives the break signal 52, counters 24 and 25 use the break signal 52 as an enable signal and count the internal clock 56 of the serial interface control unit 20. When counter 24 has counted the predetermined value (t2), its carry output C is latched in latch circuit 26 and the interrupt signal 53 is output to the processor 21.
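The counting arrangement can be modelled at the bit level. The following Python class is a rough software analogue of counters 24 and 25, latch circuit 26, inhibit register 27, and AND gate 28, written under several assumptions: one call to `clock` stands for one period of internal clock 56, the counters are cleared whenever the line is at "1", and nothing in this model clears the interrupt latch. The class and attribute names are hypothetical.

```python
class BreakDetector:
    """Bit-level model of counters 24/25, latch 26, inhibit register 27, AND gate 28."""

    def __init__(self, t2, t3):
        if t3 <= t2:
            raise ValueError("t3 must be greater than t2")
        self.t2 = t2
        self.t3 = t3
        self.counter24 = 0            # first counting means
        self.counter25 = 0            # second counting means
        self.interrupt_latch = False  # latch 26: interrupt request to the processor
        self.inhibit = False          # inhibit register 27

    def clock(self, line_bit):
        """Advance one clock period; return True while the reset signal is asserted."""
        if line_bit == 1:
            # A mark on the line clears both counters.
            self.counter24 = 0
            self.counter25 = 0
            return False
        # Space on the line: the break acts as the count enable.
        self.counter24 += 1
        self.counter25 += 1
        if self.counter24 >= self.t2:
            self.interrupt_latch = True
        # AND gate: reset passes only when it is not inhibited.
        return self.counter25 >= self.t3 and not self.inhibit
```

Because ordinary serial data contains "1" bits, the counters keep being cleared during normal traffic; only a sustained break lets them reach `t2` and then `t3`.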

The processor 21 then saves its current processing state and starts the recovery processing unit 31.

The recovery processing unit 31 sets the inhibit register 27 to inhibit the output of counter 25, that is, the output of the reset signal 54, and executes recovery processing.

When recovery processing is complete, the recovery message 55 is sent to CPU1 and control is returned to the processor 21. The break signal then stops, and operation resumes from the point at which the abnormality occurred, based on the saved data.

If recovery is impossible, the inhibit register 27 is reset to permit the reset output. Counter 25 then outputs the reset signal 54 after the predetermined time (t3), and CPU2 returns to its initial state.

In the reset CPU2, the reset processing unit 40 is started and reads latch circuit 29. If the reset output is set, it recognizes that an abnormality was detected and tests each part; if they are normal, it sends the recovery message 55 to CPU1.
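The behaviour of the reset processing unit 40 amounts to a short branch on the state of latch circuit 29. The sketch below is a hedged illustration in Python: the function name, the callable `self_test`, and the returned strings are assumptions, and what the unit does when the latch is not set is not stated in the text, so that branch is labelled as a guess.

```python
def reset_processing(reset_latch_set, self_test):
    """Decide what the reset processing unit does after the CPU restarts.

    reset_latch_set: True if latch circuit 29 records a break-induced reset.
    self_test: hypothetical callable returning True when every part tests normal.
    """
    if not reset_latch_set:
        # Assumption: an ordinary start-up, with no recovery message owed.
        return "normal_start"
    if self_test():
        return "send_recovery_message"
    # Faulty hardware: stay silent so that the peer eventually disconnects.
    return "no_response"
```

Staying silent on a failed self-test is what lets the normal CPU fall back to disconnection.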

As described above, recovery is attempted by means of the interrupt signal 53 and the reset signal 54; when recovery is impossible, CPU1 performs the disconnection.

That is, when CPU1 has been sending the break signal 52 continuously for a certain time, it commands the switching unit 13 to switch I/O device 15 to the CPU1 side.
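The monitoring side can be pictured as a two-state machine: watching for monitoring data, or sending a break and waiting for the recovery message. The Python sketch below is one possible reading, with time divided into ticks; the event strings, the function name, and the parameter `t_cut` (the "certain time" after which disconnection occurs, for which the text gives no symbol) are assumptions.

```python
def monitor_side(events, t1, t_cut):
    """Return the actions taken by the normal CPU for a sequence of per-tick events.

    events: iterable of "data" (monitoring data received), "recovered"
            (recovery message received), or "" (nothing received).
    t1: ticks of silence before the break signal is started.
    t_cut: ticks of continuous break before the I/O device is switched.
    """
    actions = []
    silent = 0        # ticks since the last monitoring data
    breaking = 0      # ticks spent sending the break signal
    sending = False
    for ev in events:
        if not sending:
            if ev == "data":
                silent = 0
            else:
                silent += 1
                if silent >= t1:
                    sending = True
                    breaking = 0
                    actions.append("start_break")
        elif ev == "recovered":
            sending = False
            silent = 0
            actions.append("stop_break")
        else:
            breaking += 1
            if breaking >= t_cut:
                actions.append("switch_io")   # command the switching unit
                return actions
    return actions
```

With `t1=2` and `t_cut=3`, ten silent ticks yield `start_break` followed by `switch_io`, whereas a "recovered" event during the break yields `stop_break` and monitoring continues.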

Although not shown in the figure, counters 24 and 25 are reset while the serial data is "1".

As described above, the break signal 52 is sent before disconnection and the abnormal side performs recovery processing on the basis of this break signal, so that temporary faults such as software runaway can be recovered.

[Effects of the invention]

As explained above, the present invention provides a system in which the CPU that detects an abnormality sends a break signal to bring about automatic recovery, and it is therefore highly effective in preventing performance degradation in a data processing device that has two processors.

[Brief explanation of the drawing]

FIG. 1(a) is a diagram explaining the principle of the present invention (part 1), FIG. 1(b) is a diagram explaining the principle of the present invention (part 2), FIG. 2 is a block diagram of the CPU of the embodiment, FIG. 3(a) is a block diagram of a conventional I/O controller, and FIG. 3(b) is an explanatory diagram of the conventional abnormality monitoring system.

In the figures, 1 and 2 are processor units (CPUs), 13 is a switching unit, 6 and 11 are serial interface (IF) control units, 14 and 15 are I/O devices, 20 is a serial interface (IF) control unit, 21 is a processor, 24 and 25 are counters, 26 is a latch circuit, 27 is an inhibit register, 28 is an AND gate, 29 is a latch circuit, 31 is a monitoring data/break signal sending unit, 35 is a break signal sending means, 36 is a first counting means, 37 is a second counting means, 38 is a recovery processing unit, 39 is an inhibiting means, 40 is a reset processing unit, 50 and 51 are monitoring data, 52 is a break signal, 53 is an interrupt signal, 54 is a reset signal, 55 is a recovery message, and 56 is an internal clock.

Claims

An abnormality monitoring system for a processing device equipped with two processor units (CPU1, CPU2) that exchange monitoring data (50, 51), monitor each other, and each perform disconnection on detecting an abnormality, characterized in that each processor unit (CPU1, CPU2) is provided with: a break signal sending means (35) that sends a break signal (52) when the monitoring data is not received within a predetermined time and stops sending the break signal when a recovery message (55) is received; a first counting means (36) that counts the bit length of the received break signal (52) and, when the count reaches a first set value, generates an interrupt signal (53) that starts a recovery processing unit (38); a second counting means (37) that counts the bit length and, when the count reaches a second set value that is at least as large as the first set value, outputs a reset signal (54) that resets each part; an inhibiting means (39) that inhibits output of the reset signal (54); the recovery processing unit (38), which instructs the inhibiting means (39) to inhibit output of the reset signal (54), executes recovery processing, responds with the recovery message (55) once recovery is complete, and releases the inhibiting means (39) when recovery is impossible; and a reset processing unit (40) that verifies the normality of each part after the reset signal (54) has been output and sends the recovery message (55) when it judges them to be normal; wherein the processor unit that detects an abnormal state sends the break signal (52) so as to cause recovery processing to be performed.