JP2578985B2

JP2578985B2 - Redundant controller

Info

Publication number: JP2578985B2
Application number: JP1177028A
Authority: JP
Inventors: 圭一大山; 聡生首藤
Original assignee: NIPPON DENKI SOFUTOEA KK; Nippon Electric Co Ltd
Current assignee: NIPPON DENKI SOFUTOEA KK; NEC Corp
Priority date: 1989-07-11
Filing date: 1989-07-11
Publication date: 1997-02-05
Anticipated expiration: 2012-02-05
Also published as: JPH0342943A

Description

[Detailed Description of the Invention] [Industrial Field of Application] The present invention relates to a fault monitoring scheme for communication control devices operating in a duplex configuration. In particular, it relates to an improvement of a redundant control device that, by the nature of the system, is required to detect faults and recover reliably even when the processing load varies widely.

[Prior Art] Conventionally, fault monitoring in a communication control device of this type of duplex configuration has been performed mainly by a watchdog timer, which notifies the local system of a partner-system fault when the check signal from the partner system is interrupted for a fixed period of time.

[Problems to Be Solved by the Invention] Fault monitoring by the watchdog timer or the like described above leaves no room to confirm a fault: when a partner-system fault is notified from the partner system, it has to be accepted as a real fault. Consequently, if a false notification occurs because of a hardware fault in the watchdog timer or the like, or because the partner system's check signal is delayed while its processing load is at a peak, the system shifts to fault handling even though it is operating normally.

In addition, because there is only one method of fault detection, the main unit may become inoperable as a system without the sub unit being notified (for example, because of a hardware fault in the watchdog timer or the like, or a dynamic loop formation state). In that case the system as a whole falls into a state in which it cannot perform its function.

[Means for Solving the Problems] The redundant control device of the present invention has watchdog timer means for notifying the partner system of a fault in the local system, health check means for notifying the partner system that the local system is operating, loop monitoring means by which the local system monitors its own faults, inter-system communication state monitoring means for monitoring the state of the communication function between the local system and the partner system, and line switching device monitoring means for monitoring the line switching device that switches lines between the local system and the partner system. It is characterized in that it judges a fault by combining the abnormality notifications issued by these means with the current state of the local system (whether it is currently the main unit, and whether it has already received notification of another abnormality), and promptly performs recovery processing.

[Embodiment] Next, the present invention will be described with reference to the drawings. FIG. 1 is a configuration diagram of the software and hardware of one embodiment of the present invention. Of the communication control devices 10 and 20 in the duplex configuration, one operates as the main unit and the other as the sub unit.

Reference numeral 1 denotes a watchdog timer, 2 denotes a task that sends a signal to the watchdog timer, and 3 denotes hardware (PIO) that receives interrupts from the partner system's watchdog timer.

If the signal from the local system's task 2, which sends a signal to the watchdog timer, is interrupted for 1 second or more, watchdog timer 1 sends an interrupt notification to the partner system's PIO 3. PIO 3 receives not only interrupts from the partner system's watchdog timer 1 but also abnormality interrupts from the line switching device 30. These interrupt notifications are passed to the fault monitoring main task 5 through task 4, which receives interrupt notifications. The fault monitoring main task 5 includes health check means that sends health check data to the partner system every 6 seconds; health check monitoring means that raises a health check timeout when no health check data has arrived from the partner system for 8 seconds; inter-system communication state monitoring means that monitors the state of communication with the other system; and guard timer means that waits for a watchdog timer interrupt from the partner system after a health check timeout or an inter-system communication abnormality has been detected. When any of these events occurs, the fault is judged according to the matrix in FIG. 2. Blank cells in FIG. 2 mean that no processing is performed.
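
The timing rules above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the patent's implementation: the class `TimeoutMonitor` and its methods are hypothetical names, and time is passed in explicitly (in seconds) so that the example is deterministic. Only the 1-second watchdog limit, the 6-second send interval, and the 8-second health check limit come from the text.

```python
class TimeoutMonitor:
    """Reports a timeout once no signal has arrived for `limit` seconds (hypothetical helper)."""

    def __init__(self, limit, now=0.0):
        self.limit = limit
        self.last_signal = now

    def signal(self, now):
        # Record the arrival time of a watchdog kick or a health check message.
        self.last_signal = now

    def expired(self, now):
        # True once the silence has lasted at least `limit` seconds.
        return now - self.last_signal >= self.limit


# Limits taken from the description: 1 s watchdog, 8 s health check timeout.
watchdog = TimeoutMonitor(limit=1.0)
health_check = TimeoutMonitor(limit=8.0)

# The partner sends health check data every 6 s, so the 8 s limit is not reached.
health_check.signal(now=6.0)
on_schedule = health_check.expired(now=12.0)   # 6 s of silence
# If the message due at t = 12 s never arrives, the timeout is raised at t = 14 s.
missed_one = health_check.expired(now=14.0)    # 8 s of silence

# Task 2 kicks the watchdog; a stall of 1 s or more would interrupt the partner's PIO 3.
watchdog.signal(now=0.5)
short_gap = watchdog.expired(now=1.2)          # about 0.7 s of silence
stalled = watchdog.expired(now=1.5)            # 1.0 s of silence
```

Because the send interval (6 s) is shorter than the timeout (8 s), a single on-time message always arrives before the limit, while one missed message is enough to raise the timeout.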

Reference numeral 6 denotes a first loop monitoring task, which runs at the highest priority level of all application tasks in the system. Reference numeral 7 denotes a second loop monitoring task, which runs at the lowest priority level of all application tasks. Reference numeral 8 denotes a loop monitoring flag. The second loop monitoring task 7 sets the loop monitoring flag 8 every 0.5 seconds, and the first loop monitoring task 6 resets the loop monitoring flag 8 every 8 seconds. If the loop monitoring flag 8 is found to be already reset at that point, the system judges that it has fallen into a dynamic loop formation state and stops its own operation (self-termination).
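
The flag handshake between the two loop monitoring tasks can be sketched as a step-by-step simulation. This is an illustrative sketch only: `run_loop_monitor` and its `starved_from_s` parameter are hypothetical, real task scheduling is replaced by fixed 0.5-second steps, and the highest-priority task is assumed to run first within a step.

```python
def run_loop_monitor(duration_s, starved_from_s=None):
    """Simulate the two loop monitoring tasks in 0.5 s steps.

    Returns the time in seconds at which a dynamic loop is detected,
    or None if nothing is detected within duration_s.
    starved_from_s is a hypothetical parameter: from that time on,
    the lowest-priority task 7 never gets to run.
    """
    flag = False                          # loop monitoring flag 8
    steps = int(duration_s * 2)           # one step = 0.5 s
    for step in range(1, steps + 1):
        now = step * 0.5
        if step % 16 == 0:                # task 6 (highest priority) runs every 8 s
            if not flag:
                return now                # flag already reset: stop the system
            flag = False                  # otherwise reset it for the next period
        starved = starved_from_s is not None and now >= starved_from_s
        if not starved:
            flag = True                   # task 7 (lowest priority) sets the flag
    return None


healthy = run_loop_monitor(40)                        # task 7 always runs
looping = run_loop_monitor(40, starved_from_s=10)     # task 7 starved from t = 10 s
```

In this sketch, starvation that begins just after a check is only caught at the second following check, so detection can take up to about 16 seconds.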

Next, the operation is described with reference to FIG. 2, which is the state transition diagram of the device of the present invention.

When the local system is the main unit, it responds to an event, regardless of its state, by choosing either sub-unit failure or local-system stop. A blank cell means that nothing is done, and a hatched cell means that the event in question cannot occur. That is, the guard timer does not operate in the main unit.

When the local system is the sub unit, on the other hand, the cells have the following meanings.

"Move to watchdog timer interrupt": sets the state to "watchdog timer interrupt".

"Guard timer start": sets the guard timer and sets the state to "guard timer running".

"Release watchdog timer interrupt": returns the state to "normal".

"Guard timer cancel": cancels the guard timer and returns the state to "normal".

The matrix in FIG. 2 uses different partner-failure judgment conditions for the main unit and the sub unit, so that a fault in a shared part of the duplex system, such as the inter-system communication path, does not cause both systems to detect a partner-system failure. Specifically, the condition under which the sub unit judges a main-unit failure is made stricter: two abnormalities must have been detected.

Consider, for example, the case where the state is "sub unit" and the event is "guard timer timeout". As described above, a health check timeout or an inter-system communication abnormality alone is not enough to judge a main-unit failure, so the sub unit sets the guard timer and waits for a watchdog timer interrupt. If no interrupt arrives, the sub unit stops itself, because a sub unit that cannot communicate with its partner cannot serve as the sub unit of a duplex system.
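
The sub-unit behaviour described above can be sketched as a lookup table. FIG. 2 itself is not reproduced in this text, so the cell placement below is an assumption reconstructed from the prose, and the event names, state constants, and `run_sub_unit` function are all hypothetical. The sketch only aims to show the principle that a single abnormality changes state while a second one leads to a decision.

```python
NORMAL = "normal"
WDT_INTERRUPT = "watchdog timer interrupt"
GUARD_RUNNING = "guard timer running"

# (state, event) -> (next state, decision). Missing pairs are blank cells: no action.
# Cell placement is an assumption reconstructed from the text, not a copy of FIG. 2.
SUB_UNIT_MATRIX = {
    (NORMAL, "wdt_interrupt"): (WDT_INTERRUPT, None),
    (NORMAL, "health_check_timeout"): (GUARD_RUNNING, None),
    (NORMAL, "comm_abnormal"): (GUARD_RUNNING, None),
    (WDT_INTERRUPT, "health_check_received"): (NORMAL, None),
    (WDT_INTERRUPT, "health_check_timeout"): (WDT_INTERRUPT, "main_unit_failure"),
    (WDT_INTERRUPT, "comm_abnormal"): (WDT_INTERRUPT, "main_unit_failure"),
    (GUARD_RUNNING, "wdt_interrupt"): (GUARD_RUNNING, "main_unit_failure"),
    (GUARD_RUNNING, "health_check_received"): (NORMAL, None),
    (GUARD_RUNNING, "guard_timeout"): (GUARD_RUNNING, "self_stop"),
}


def run_sub_unit(events):
    """Feed events to the sub unit; return (final state, decision or None)."""
    state, decision = NORMAL, None
    for event in events:
        if decision is not None:
            break                         # a decision ends normal monitoring
        state, decision = SUB_UNIT_MATRIX.get((state, event), (state, None))
    return state, decision


# A false watchdog notification is cleared by the next health check data.
false_alarm = run_sub_unit(["wdt_interrupt", "health_check_received"])
# Two abnormalities (timeout, then watchdog interrupt) mean a main-unit failure.
real_failure = run_sub_unit(["health_check_timeout", "wdt_interrupt"])
# A communication abnormality with no watchdog interrupt ends in self-stop.
isolated = run_sub_unit(["comm_abnormal", "guard_timeout"])
```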

[Effects of the Invention] As described above, the present invention achieves more accurate and faster fault monitoring by combining five fault monitoring means: the watchdog timer, the health check means, the loop monitoring means, the inter-system communication state monitoring means, and the line switching device monitoring means.

Some examples in which the present invention makes fault handling more reliable are given below.

1. When the watchdog timer falsely notifies the sub unit, for example at a peak in the main unit's processing load or because the watchdog timer malfunctions, the next health check data sent to the sub unit releases the "watchdog timer interrupt" state, so the main unit and sub unit are not switched needlessly.

2. When the main unit or the sub unit falls into a loop formation state, the loop monitoring task detects it and the unit stops. The fault is therefore reliably notified to the partner system, and a device that can no longer operate as part of the system is promptly removed.

[Brief Description of the Drawings] FIG. 1 is a configuration diagram of the software and hardware of the present invention, and FIG. 2 is a matrix showing the processing performed when an event occurs in the main unit and the sub unit. 1: watchdog timer, 2: task, 3: PIO, 4: task, 5: fault monitoring main task, 6: first loop monitoring task, 7: second loop monitoring task, 8: loop monitoring flag.

Claims

(57) [Claims]

In communication control devices operating in a duplex configuration, a redundant control device characterized in that each communication control device has watchdog timer means for notifying the partner system of a fault in the local system, health check means for notifying the partner system that the local system is operating, loop monitoring means by which the local system monitors its own faults, inter-system communication state monitoring means for monitoring the state of the communication function between the local system and the partner system, and line switching device monitoring means for monitoring the line switching device that switches lines between the local system and the partner system, and in that it judges a fault by combining the abnormality notifications issued by these means with the current state of the local system, and performs recovery processing.