JPH02277336A

JPH02277336A - Faulty processor detection system

Info

Publication number: JPH02277336A
Application number: JP1099501A
Authority: JP
Inventors: Masato Konuki; 小貫　理人
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1989-04-18
Filing date: 1989-04-18
Publication date: 1990-11-13

Abstract

PURPOSE: To prevent a transmission destination processor from being regarded as faulty when it is merely congested, by providing the sender processor with a second counting means that counts the pause state for each destination processor, monitoring the pause state at a prescribed period, and regarding the destination as faulty when the count exceeds a prescribed second threshold. CONSTITUTION: When processor 21 sends data to processor 22 but processor 22 cannot receive it for some reason, the transmission control processing section 53 of processor 21 registers the data in the transmission queue 72 corresponding to processor 22, regards processor 22 as being in the pause state, stops sending data for a prescribed time, and then restarts transmission processing. A pause state check, which regards the destination as faulty when the number of pause states within a prescribed time exceeds a prescribed number, is used together with the conventional check, which regards the destination as faulty when the transmission queue length counter exceeds a prescribed value. This reduces the possibility of misjudging a merely congested destination processor as faulty.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]

The present invention relates to a faulty processor detection method used between the processors of a multiprocessor system configured by connecting a plurality of processors via a bus.

[Conventional technology]

Conventionally, in faulty processor detection for communication among multiple processors, the source processor that sends data is provided with a transmission waiting queue, in which data waiting to be sent is registered for each destination processor, and a transmission waiting queue length counter, which counts the length of the waiting data. When the destination processor that should receive the data does not accept it, the data is registered in the transmission waiting queue and the transmission waiting queue length counter is incremented by one. The length of the transmission waiting queue is monitored periodically, and a queue length check process regards any destination processor whose queue length exceeds a predetermined threshold, for example 20, as faulty.
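The conventional check described above can be illustrated with a short Python sketch. The function name `conventional_check` and the dictionary of per-destination queue lengths are assumptions made for illustration; only the example threshold of 20 comes from the text.

```python
# Hypothetical sketch of the conventional queue-length check.
QUEUE_LENGTH_THRESHOLD = 20  # example threshold given in the text


def conventional_check(queue_lengths):
    """Return the destinations whose waiting-queue length exceeds the threshold.

    queue_lengths maps a destination processor id to the current value of
    its transmission waiting queue length counter.
    """
    return [dest for dest, length in queue_lengths.items()
            if length > QUEUE_LENGTH_THRESHOLD]


# A destination with 25 queued items is flagged, whatever the cause of the backlog.
flagged = conventional_check({22: 25, 23: 3})
```

Because the check looks only at queue length, it cannot tell a backlog caused by temporary congestion from one caused by a real failure.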

[Problem to be solved by the invention]

As described above, the conventional faulty processor detection method immediately regards a destination processor as faulty once the length of its transmission waiting queue exceeds the predetermined threshold. It therefore has the drawback that a destination processor is regarded as faulty even when it is merely congested, for example because it is receiving data from another processor, and data has accumulated in the source processor's transmission waiting queue only because the destination could not receive it at that moment.

An object of the present invention is to provide a faulty processor detection method in which a destination processor is not regarded as faulty when it is merely congested.

[Means to solve the problem]

The faulty processor detection method of the present invention builds on a method in which, in a multiprocessor system configured by connecting a plurality of processors via a bus, a source processor that sends data is provided with storage means for registering data waiting to be sent for each destination processor and first counting means for counting the length of the data waiting to be sent; when the destination processor that should receive the data does not accept it, the data is registered in the storage means and the first counting means is incremented; and a queue length check process periodically monitors the length of the data waiting to be sent and regards a processor exceeding a predetermined first threshold as faulty.

In the present invention, the source processor is further provided with second counting means for counting the pause state for each destination processor. When the destination processor does not accept the data, it is regarded as being in the pause state, transmission of data to it is suspended, and the second counting means is incremented. A pause check process monitors the pause state at a fixed period; if the value of the second counting means is smaller than a predetermined second threshold, the value is reset to 0, and if it is larger, the processor is regarded as faulty. A faulty processor is detected by using this pause check process together with the queue length check process.

[Example]

Next, an embodiment of the present invention will be described with reference to the drawing.

FIG. 1 is a block diagram showing one embodiment of the present invention. FIG. 1 shows a multiprocessor system in which n processors 21 to 2n are connected via a bus 10 and carry out processing while communicating with one another. The processor 21 comprises a central processing unit 50, a receiving section 30 that takes in data from the bus 10 and passes it to the central processing unit 50, a transmitting section 40 that sends data from the central processing unit 50 onto the bus 10, and a storage device 60 connected to the central processing unit 50. The other processors 22 to 2n are configured in the same way. The central processing unit 50 has a queue length check processing section 51, a pause check processing section 52, and a transmission control processing section 53. The storage device 60 holds transmission waiting queues 72 to 7n, transmission waiting queue length counters 82 to 8n, and pause counters 92 to 9n, each corresponding to a destination processor other than the processor itself.
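The per-destination contents of the storage device 60 might be modelled as follows. This is only a sketch: the class `DestinationState`, the helper `make_storage`, and the integer processor ids are assumptions introduced for illustration, not part of the patent.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class DestinationState:
    """State the source processor keeps for one destination processor."""
    waiting_queue: deque = field(default_factory=deque)  # cf. queues 72 to 7n
    queue_length: int = 0                                # cf. counters 82 to 8n
    pause_count: int = 0                                 # cf. counters 92 to 9n


def make_storage(own_id, all_ids):
    """Build one DestinationState per processor other than the source itself."""
    return {pid: DestinationState() for pid in all_ids if pid != own_id}


# Processor 21 keeps state for every other processor on the bus.
storage = make_storage(own_id=21, all_ids=[21, 22, 23, 24])
```

Keeping a separate record per destination means a backlog toward one processor does not affect the counters for the others.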

The operation of this embodiment is described below, taking the processor 21 as the source processor that sends data and the processor 22 as the destination processor that receives it.

When the processor 21 sends data to the processor 22 but the processor 22 cannot receive it for some reason, the transmission control processing section 53 of the processor 21 registers the data in the transmission waiting queue 72 corresponding to the processor 22 and increments the transmission waiting queue length counter 82 and the pause counter 92 by one each. It then regards the processor 22 as being in the pause state, stops sending data to it for a fixed time, and resumes transmission processing after the predetermined time has elapsed.
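The failure handling performed by the transmission control processing section 53 can be sketched as below. The `Destination` class, the `try_deliver` callback, and the `PAUSE_INTERVAL` value are hypothetical; the text says only that transmission stops for "a fixed time" and gives no figure.

```python
from collections import deque

PAUSE_INTERVAL = 0.01  # hypothetical back-off in seconds; the text gives no value


class Destination:
    """Per-destination state held by the source processor (illustrative)."""

    def __init__(self):
        self.waiting_queue = deque()
        self.queue_length = 0
        self.pause_count = 0
        self.paused_until = 0.0


def send(dest, data, try_deliver, now):
    """Try to deliver data; on refusal apply the failure handling from the text."""
    if try_deliver(data):
        return True
    dest.waiting_queue.append(data)            # register in the waiting queue
    dest.queue_length += 1                     # step the queue length counter
    dest.pause_count += 1                      # step the pause counter
    dest.paused_until = now + PAUSE_INTERVAL   # hold off before retrying
    return False


d = Destination()
send(d, "packet-1", try_deliver=lambda _data: False, now=0.0)
```

After the refused send, `d` holds one queued item, both counters read 1, and the destination is treated as paused until `paused_until`.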

Meanwhile, the queue length check processing section 51 compares, at fixed intervals, the values of the transmission waiting queue length counters 82 to 8n corresponding to the respective processors with a predetermined threshold and searches for any that exceed it. Because this threshold allows for congestion at the destination processor, it is set larger than the conventional value, for example 200 where 20 was used conventionally. If the transmission waiting queue length counter 82 corresponding to the processor 22 exceeds the threshold, the processor 22 is judged to have failed.

The pause check processing section 52 compares, at a fixed period, the values of the pause counters 92 to 9n corresponding to the respective processors with a predetermined threshold and searches for any that exceed it. For example, if a counter is 2 or less within 50 ms, the destination is regarded as within the range of normal congestion and operating normally, and the corresponding pause counter is reset to 0. If the value of the pause counter 92 corresponding to the processor 22 is 3, the processor 22 is judged to have failed.
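The pause check can be sketched in Python as follows, using the example figures from the text (a count of 2 or less per 50 ms period is normal, 3 indicates a fault). The function name `pause_check` and the dictionary representation of the pause counters are assumptions for illustration.

```python
PAUSE_THRESHOLD = 2  # example from the text: up to 2 pauses per period is normal


def pause_check(pause_counts):
    """Run once per monitoring period (50 ms in the example).

    Counters at or below the threshold are reset to 0; destinations whose
    counter exceeds the threshold are returned as faulty.
    """
    faulty = []
    for dest, count in pause_counts.items():
        if count > PAUSE_THRESHOLD:
            faulty.append(dest)
        else:
            pause_counts[dest] = 0
    return faulty


counts = {22: 3, 23: 2, 24: 0}
faulty = pause_check(counts)
```

Resetting the counter each period means that only refusals clustered within a single period count toward a fault, which is what separates a burst of congestion from a persistent failure.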

[Effect of the invention]

As described above, in the present invention the source processor is provided with a new pause counter in addition to the conventional transmission waiting queue length counter for each destination processor. Each time the destination processor fails to accept data, the event is counted as a pause state and transmission of data is stopped. A pause check process, which regards the destination processor as faulty when the number of pause states within a fixed time exceeds a predetermined number, is used together with the conventional queue length check process, which regards the destination processor as faulty when the transmission waiting queue length counter exceeds a predetermined threshold. This has the effect of reducing the possibility that a destination processor is mistakenly judged faulty when it is merely congested and temporarily unable to receive.
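The combined effect of the two checks can be illustrated with the sketch below. It uses the example thresholds from the embodiment (200 and 2) and, as a simplification, runs both checks in a single pass, whereas the embodiment runs them as separate periodic processes. The function `detect_faults` and the state layout are hypothetical.

```python
QUEUE_LENGTH_THRESHOLD = 200  # raised from 20 to tolerate congestion (example in the text)
PAUSE_THRESHOLD = 2           # example in the text


def detect_faults(state):
    """Apply both checks to every destination.

    state maps a destination id to {"queue_length": int, "pause_count": int}.
    Returns the set of destinations judged faulty by either check.
    """
    faulty = set()
    for dest, s in state.items():
        if s["queue_length"] > QUEUE_LENGTH_THRESHOLD:
            faulty.add(dest)              # queue length check
        if s["pause_count"] > PAUSE_THRESHOLD:
            faulty.add(dest)              # pause check
        else:
            s["pause_count"] = 0          # normal congestion: reset the counter
    return faulty


state = {
    22: {"queue_length": 25, "pause_count": 2},   # congested but healthy
    23: {"queue_length": 5, "pause_count": 3},    # refuses data repeatedly
    24: {"queue_length": 201, "pause_count": 0},  # long-standing backlog
}
faulty = detect_faults(state)
```

In this example the processor 22, whose backlog of 25 would have tripped the conventional threshold of 20, is not flagged, while the processors 23 and 24 are.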

[Brief explanation of drawings]

FIG. 1 is a block diagram showing one embodiment of the present invention.

10: bus, 21 to 2n: processors, 30: receiving section, 40: transmitting section, 50: central processing unit, 51: queue length check processing section, 52: pause check processing section, 53: transmission control processing section, 60: storage device, 72 to 7n: transmission waiting queues, 82 to 8n: transmission waiting queue length counters, 92 to 9n: pause counters.

Claims

[Claims]

A faulty processor detection method for a multiprocessor system configured by connecting a plurality of processors via a bus, in which a source processor that sends data is provided with storage means for registering data waiting to be sent for each destination processor and first counting means for counting the length of the data waiting to be sent, the data is registered in the storage means and the first counting means is incremented when the destination processor that should receive the data does not accept it, and a faulty processor is detected by a queue length check process that periodically monitors the length of the data waiting to be sent and regards a processor exceeding a predetermined first threshold as faulty, characterized in that: the source processor is provided with second counting means for counting the pause state for each destination processor; when the destination processor does not accept the data, the destination processor is regarded as being in the pause state, transmission of data is suspended, and the second counting means is incremented; and a faulty processor is detected by using, together with the queue length check process, a pause check process that monitors the pause state at a fixed period, resets the value of the second counting means to 0 if it is smaller than a predetermined second threshold, and regards the processor as faulty if it is larger.