JPH02279040A - Fault detection system for multi-processor system

Info

Publication number: JPH02279040A
Application number: JP1098957A
Authority: JP
Inventors: Kazuo Nishidai; 西大　和男
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1989-04-20
Filing date: 1989-04-20
Publication date: 1990-11-15
Anticipated expiration: 2014-07-12
Also published as: JP2917291B2

Abstract

PURPOSE: To detect faults in a system without providing a master processor, by issuing a processor fault detection notice when it is detected that the reception section has still not received an operation monitoring signal after a prescribed time has elapsed, and by inputting a new operation monitoring signal to the transmission section. CONSTITUTION: When a fault occurs in processor 2, processor 3 cannot receive the operation monitoring signal S1 from processor 2. The monitoring section 7 of processor 3 reset its timer 8 when it transmitted the preceding operation monitoring signal S1 (point A) and measures the reception time from the clock signal C of timer 8, so it recognizes that the next-signal reception timing time T1 has elapsed and determines that a fault has occurred in processor 2. Based on this determination, the monitoring section 7 issues a fault detection notice for processor 2, resets timer 8, and sends the operation monitoring signal S1 to processor 4. Thus the occurrence of a fault can be determined by the individual processors 1-4.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]

The present invention relates to a fault detection method for a multiprocessor system that automatically detects a fault occurring in any individual processor among a plurality of processors coupled to a bus.

[Prior Art]

A conventional fault detection method for a multiprocessor system provides a master processor within the system to manage the operating state of each processor. The master processor sends an operation monitoring signal to every other processor in turn and watches for the response signal; if no response arrives within a predetermined time, it judges that processor to be faulty. In this way, faults in all the processors in the system are detected.

This technique is explained concretely with reference to FIG. 3.

The multiprocessor system consists of one master processor 31 and three processors 32 to 34 connected by a ring-type bus 35. The master processor 31 manages the operating state of the other processors 32 to 34. It transmits an operation monitoring signal S11 to the processor 32 and monitors whether the response signal S12 is received within a predetermined time, then records the result as the operating state of the processor 32. The master processor 31 then repeats the same procedure for the processor 33 and the processor 34 in turn, thereby performing fault detection for each of the processors 32 to 34.
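The conventional polling procedure described above can be sketched roughly as follows. This is only an illustration: `poll_processors`, `probe`, and the status strings are hypothetical names, and `probe` stands in for "send S11 and wait for S12", which the source does not describe at the code level.

```python
from typing import Callable, Dict, Iterable


def poll_processors(
    processor_ids: Iterable[int],
    probe: Callable[[int, float], bool],
    timeout: float,
) -> Dict[int, str]:
    """Poll each processor in turn, as the master processor 31 does.

    `probe(pid, timeout)` is an assumed helper that sends the operation
    monitoring signal and returns True only if the response signal
    arrived within `timeout`.
    """
    status: Dict[int, str] = {}
    for pid in processor_ids:
        # No response within the predetermined time means "fault".
        status[pid] = "normal" if probe(pid, timeout) else "fault"
    return status
```

All of the monitoring work sits in this single loop, so its cost grows with the number of monitored processors, and nothing is checked at all if the processor running the loop stops.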

[Problem to be solved by the invention]

In the conventional fault detection method described above, a special processor 31, the master processor, is provided in the system; it transmits the operation monitoring signal S11 to all the other processors 32 to 34 and detects processor faults by monitoring the response signal S12. This approach has the following drawbacks.

(a) Because fault monitoring is performed by the master processor 31 alone, the load on the master processor 31 grows as the number of monitored processors 32 to 34 increases.

(b) If a fault occurs in the master processor 31 itself, the fault detection function of the entire system stops.

To solve these problems, an object of the present invention is to provide an economical and highly reliable fault detection method for a multiprocessor system in which the fault detection operation is performed by a plurality of processors.

[Means for Solving the Problem]

The present invention is a fault detection method for a multiprocessor system that detects processor faults by circulating an operation monitoring signal through a plurality of processors coupled via a bus. Each processor comprises: a receiving section that receives the operation monitoring signal from the preceding processor; a transmitting section that transmits the operation monitoring signal to the next processor; and a monitoring section that monitors the reception state of the receiving section. When the receiving section receives the operation monitoring signal, the monitoring section inputs that signal to the transmitting section; when it detects that the receiving section has still not received the operation monitoring signal after a predetermined time has elapsed, it issues a processor fault detection notification and inputs a new operation monitoring signal to the transmitting section.

[Embodiment]

An embodiment of the invention is described with reference to the drawings.

FIG. 1 is a block diagram showing a fault detection method for a multiprocessor system according to one embodiment of the present invention.

In this fault detection method, processors 1 to 4 are connected in a ring by a bus 5, and processor faults are detected by circulating an operation monitoring signal S1 through the processors 1 to 4 in order.

Each of the processors 1 to 4 includes a receiving section 6, a monitoring section 7, a timer 8, and a transmitting section 9.

The receiving section 6 receives the operation monitoring signal S1 from the adjacent processor and passes it to the monitoring section 7.

The monitoring section 7 detects whether the adjacent processor has failed by monitoring the operation monitoring signal S1 from the receiving section 6 and the clock signal C from the timer 8.

The functions of the monitoring section 7 are described concretely below, from the viewpoint of the processor 1, whose preceding processor is the processor 4. The monitoring section 7 passes the operation monitoring signal S1 received from the receiving section 6 to the transmitting section 9, resets the timer 8, and monitors the clock signal C from the timer 8. Using this clock signal C, it measures the time t from when it passed the operation monitoring signal S1 to the transmitting section 9 until the next signal is input from the receiving section 6. If the time t approximately matches the preset next-signal reception timing time T1, the monitoring section 7 judges that the processor 4 is operating normally, resets the timer 8, and passes the operation monitoring signal S1 to the transmitting section 9. The signal is passed to the transmitting section 9 when the signal transmission timing-out time T2 has elapsed since the operation monitoring signal S1 was received. The time T2 is measured by resetting the timer 8 when the monitoring section 7 receives the operation monitoring signal S1 and counting the clock signal C input from the timer 8. If, on the other hand, the monitoring section 7 has not received the operation monitoring signal S1 even after the next-signal reception timing time T1 has elapsed, it judges that a fault has occurred in the processor 4 and sends a notification to that effect to a supervisory control device or the like (not shown). In parallel with this notification, the monitoring section 7 resets the timer 8 and passes a new operation monitoring signal S1 to the transmitting section 9.
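The behavior of the monitoring section 7 can be modeled as a small state machine. The sketch below is one possible reading, not the circuit disclosed in the patent: the class name `Monitor`, its fields, and the use of two stored timestamps in place of the single timer 8 are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Monitor:
    """Hypothetical model of monitoring section 7 in one processor."""

    t1: float  # next-signal reception timing time T1
    t2: float  # signal transmission timing-out time T2
    last_sent: float = 0.0                  # when S1 was last passed to the transmitter
    pending_since: Optional[float] = None   # when S1 was received, if not yet forwarded
    notifications: List[str] = field(default_factory=list)

    def on_receive(self, now: float) -> None:
        """S1 arrived from the receiving section: the preceding processor is alive."""
        self.pending_since = now  # plays the role of resetting timer 8 on reception

    def tick(self, now: float) -> bool:
        """Evaluate the timers at `now`; return True if S1 must be transmitted now."""
        if self.pending_since is not None:
            # Normal path: forward S1 once T2 has elapsed since reception.
            if now - self.pending_since >= self.t2:
                self.pending_since = None
                self.last_sent = now
                return True
            return False
        if now - self.last_sent > self.t1:
            # Fault path: nothing received within T1 of the last transmission.
            self.notifications.append("preceding processor fault")
            self.last_sent = now  # reset the timer and inject a new S1
            return True
        return False
```

For example, with `Monitor(t1=10.0, t2=1.0)`, a signal received at time 0 is forwarded at time 1; if nothing further arrives, the fault path fires once more than 10 time units have passed since that transmission.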

The next-signal reception timing time T1 is set to at least the time taken for the operation monitoring signal S1 to be sent from the processor 1, circulate through the processors 2, 3, and 4, and return to the processor 1.
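As a rough worked example of this lower bound, suppose every hop on the bus takes the same time and each intermediate processor holds the signal for T2 before forwarding it. The per-hop delay is an assumption introduced here; the source gives no figures, and `min_t1` is a hypothetical helper name.

```python
def min_t1(num_processors: int, t2: float, hop_delay: float) -> float:
    """Smallest T1 that still covers one full circulation of S1.

    Measured from the moment a processor transmits S1 until S1 returns:
    the signal crosses `num_processors` hops and is held for T2 by each
    of the other `num_processors - 1` processors.
    """
    return num_processors * hop_delay + (num_processors - 1) * t2
```

With four processors, T2 = 1.0 and an assumed hop delay of 0.5, the bound is 4 × 0.5 + 3 × 1.0 = 5.0 time units; a practical setting would add some margin on top of this.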

Next, the fault detection operation of this embodiment is described with reference to FIGS. 1 and 2.

FIG. 2 is a sequence diagram of the fault detection operation performed by the fault detection method of this embodiment.

The processor 1 resets its timer 8 and transmits the operation monitoring signal S1 from its transmitting section 9 to the processor 2.

As shown in FIG. 2, each of the processors 2 to 4 receives the operation monitoring signal S1 from the preceding processor and, when the signal transmission timing-out time T2 has elapsed, sends the operation monitoring signal S1 on to the next processor.

When all of the processors 1 to 4 are operating normally, the operation monitoring signal S1 circulates through the processors 1 to 4 and returns to the processor 1.

This operation monitoring signal S1 is received by the receiving section 6 of the processor 1. The monitoring section 7, on receiving the operation monitoring signal S1 from the receiving section 6, recognizes from the clock signal C of the timer 8 that the time of reception approximately matches the next-signal reception timing time T1.

The monitoring section 7 therefore judges that the processor 4 is operating normally. On receiving the operation monitoring signal S1, the monitoring section 7 also resets the timer 8, counts the clock signal C from the timer 8, and passes the operation monitoring signal S1 to the transmitting section 9 when it judges that the signal transmission timing-out time T2 has been reached. The transmitting section 9 transmits this operation monitoring signal S1 to the next processor, the processor 2 (FIG. 2).

If a fault has occurred in the processor 2 at this point, the processor 2 cannot receive the operation monitoring signal S1 from the processor 1 and cannot transmit the operation monitoring signal S1 to the processor 3 (FIG. 2).

The processor 3 therefore cannot receive the operation monitoring signal S1 from the processor 2. The monitoring section 7 of the processor 3 reset its timer 8 when it last transmitted the operation monitoring signal S1 (point A in FIG. 2) and has been measuring the reception time from the clock signal C of the timer 8, so it recognizes that the next-signal reception timing time T1 has elapsed and judges that a fault has occurred in the processor 2. Based on this judgment, the monitoring section 7 issues a fault detection notification for the processor 2 and, to prevent next-signal reception timing-outs from occurring in the other processors, resets the timer 8 and transmits the operation monitoring signal S1 to the processor 4 via the transmitting section 9 (FIG. 2).

In this way, the occurrence of a fault can be determined by the individual processors 1 to 4. Moreover, a fault does not require the processors other than the faulty processor 2 to stop operating.
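The sequence described for FIG. 2 can be reproduced with a small discrete-event simulation. This is a sketch under stated assumptions rather than the patented implementation: the function `simulate_ring`, the fixed per-hop delay `hop`, the start-up rule that every timer is reset at time 0, and all numeric values are hypothetical.

```python
import heapq
from itertools import count
from typing import List, Optional, Tuple


def simulate_ring(n: int = 4, failed: Optional[int] = 2, fail_at: float = 5.5,
                  t1: float = 10.0, t2: float = 1.0, hop: float = 0.5,
                  until: float = 30.0) -> List[Tuple[float, int, int]]:
    """Circulate S1 around a ring of n processors numbered 1..n.

    Returns fault notifications as (time, detecting_processor, suspect).
    """
    seq = count()                              # tie-breaker for simultaneous events
    epoch = {p: 0 for p in range(1, n + 1)}    # bumped on every timer reset
    events: list = []                          # heap of (time, seq, kind, proc, epoch)
    notices: List[Tuple[float, int, int]] = []

    def schedule(time: float, kind: str, p: int) -> None:
        heapq.heappush(events, (time, next(seq), kind, p, epoch[p]))

    def send(now: float, p: int) -> None:
        epoch[p] += 1                          # reset the timer on transmission
        schedule(now + t1, "timeout", p)       # expect S1 back within T1
        schedule(now + hop, "recv", p % n + 1)  # deliver to the next processor

    for p in range(2, n + 1):
        schedule(t1, "timeout", p)             # assumed start-up: timers reset at 0
    send(0.0, 1)                               # processor 1 starts the circulation

    while events:
        now, _, kind, p, ev_epoch = heapq.heappop(events)
        if now > until:
            break
        if p == failed and now >= fail_at:
            continue                           # a failed processor does nothing
        if kind == "recv":
            epoch[p] += 1                      # reset the timer on reception
            schedule(now + t2, "send", p)      # forward after holding for T2
        elif kind == "send":
            send(now, p)
        elif kind == "timeout" and ev_epoch == epoch[p]:
            suspect = (p - 2) % n + 1          # preceding processor on the ring
            notices.append((now, p, suspect))
            send(now, p)                       # inject a new S1 for the others
    return notices
```

With the default parameters, the processor 2 stops at time 5.5; only the processor 3 raises notifications, and the processors 4 and 1 keep receiving the re-injected signal before their own T1 expires. In this sketch the processor 3 repeats its notification every T1 while the fault persists, which is a property of the model rather than something the source states.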

[Effects of the Invention]

Because the fault detection method of the present invention is configured as described above, it has the following effects.

(a) Fault detection for the system can be achieved without installing a special master processor in the system, which makes it possible to build a more economical system.

(b) The fault detection operation does not stop no matter which processor in the system fails, which improves the reliability of the system.

[Brief explanation of drawings]

FIG. 1 is a block diagram showing a fault detection method for a multiprocessor system according to one embodiment of the present invention; FIG. 2 is a sequence diagram of the fault detection operation performed by the fault detection method of FIG. 1; and FIG. 3 is a block diagram showing a conventional fault detection method for a multiprocessor system.

1 to 4: processors
5: bus
6: receiving section
7: monitoring section
8: timer
9: transmitting section

Claims

(1) A fault detection method for a multiprocessor system that detects processor faults by circulating an operation monitoring signal through a plurality of processors coupled via a bus, characterized in that each processor comprises: a receiving section that receives the operation monitoring signal from the preceding processor; a transmitting section that transmits the operation monitoring signal to the next processor; and a monitoring section that monitors the reception state of the receiving section and that either inputs the operation monitoring signal to the transmitting section when the receiving section receives it, or, upon detecting that the receiving section has still not received the operation monitoring signal after a predetermined time has elapsed, issues a processor fault detection notification and inputs a new operation monitoring signal to the transmitting section.