JPH0293856A

JPH0293856A - Trouble processing system in multiprocessor system

Info

Publication number: JPH0293856A
Application number: JP63246525A
Authority: JP
Inventors: Akiko Hiraishi; 平石　亜紀子
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1988-09-30
Filing date: 1988-09-30
Publication date: 1990-04-04

Abstract

PURPOSE: To decrease the load on the respective processors and to execute processing accurately and at high speed when the occurrence of a trouble is detected, by providing a dedicated maintenance processor that executes trouble detection and recovery processing. CONSTITUTION: A main memory 20 has a monitor information storing area 60 that holds the monitor information of the respective processors 1-N, and monitor information updating means 70 in the processors 1-N update that information. A processor timer monitor means 80 of a maintenance processor 10 reads the monitor information at constant time intervals, the occurrence of a trouble in the processors 1-N is discriminated, and a recovery means 105 of the maintenance processor 10 executes trouble recovery processing for the processor in which the trouble occurred, on the basis of trouble information collected by an information collecting means 100. Thus, the load on the multiprocessor system as a whole is decreased, and when a trouble is detected, detailed trouble information can be collected and trouble recovery processing can be executed.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of Industrial Application]

The present invention relates to a failure handling method in a multiprocessor system, and more particularly to a method of handling processor failures by means of a maintenance processor dedicated to failure handling.

[Conventional technology]

Conventionally, a failure handling method in this type of multiprocessor system has provided, as external devices, as many failure monitoring timer devices as the multiprocessor system has processors. Each processor is paired one-to-one with a failure monitoring timer device, and the signal that each processor outputs to its timer device at fixed time intervals serves as a health signal indicating that the processor is operating normally.

When the health signal from its processor stops, the failure monitoring timer device treats this as an abnormality in the processor, that is, as the occurrence of a failure. It then raises an interrupt on that processor to collect failure information and performs failure recovery processing.

[Problem to be solved by the invention]

Because the conventional failure handling method described above provides a failure monitoring timer device for each processor, it has the drawback of requiring as many failure monitoring timer devices as there are processors.

In addition, because every processor outputs a health signal to its own failure monitoring timer device, the overhead of failure detection processing is large for the multiprocessor system as a whole.

Furthermore, when a failure monitoring timer device detects a failure in its processor, it collects the failure information by raising an interrupt on the very processor that has failed, so there is no guarantee that correct failure information will be collected.

In view of the above, an object of the present invention is to provide a failure handling method in a multiprocessor system that, by providing a dedicated maintenance processor for failure detection and recovery processing, reduces the load on the multiprocessor system as a whole and makes it possible to collect detailed failure information and perform failure recovery processing when a failure is detected.

[Means to solve the problem]

The failure handling method in a multiprocessor system of the present invention comprises: a main memory having a monitoring information storage area that stores the monitoring information of each processor; a plurality of processors, each including monitoring information updating means for updating the monitoring information in the monitoring information storage area on the main memory; and a maintenance processor including processor timer monitoring means for reading the monitoring information from the monitoring information storage area on the main memory at fixed time intervals, failure detection means for determining, based on the monitoring information read by the processor timer monitoring means, whether a failure has occurred in any of the processors, information collecting means for collecting failure information from the monitoring information of a processor in which the failure detection means has detected a failure, and recovery means for performing failure recovery processing for the failed processor based on the failure information collected by the information collecting means.

[Effect]

In the failure handling method in a multiprocessor system of the present invention, the main memory has a monitoring information storage area that stores the monitoring information of each processor, and the monitoring information updating means of each of the plurality of processors updates the monitoring information in that area. The processor timer monitoring means of the maintenance processor reads the monitoring information from the monitoring information storage area on the main memory at fixed time intervals, and the failure detection means of the maintenance processor determines, based on the monitoring information thus read, whether a failure has occurred in a processor. The information collecting means of the maintenance processor collects failure information from the monitoring information of the processor in which a failure was detected, and the recovery means of the maintenance processor performs failure recovery processing for the failed processor based on the collected failure information.

[Example]

Next, the present invention will be described in detail with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of a failure handling method in a multiprocessor system according to an embodiment of the present invention. The main parts of this embodiment are: a plurality of processors 1, 2, ..., N (N is a positive integer), each including monitoring information updating means 70; a maintenance processor 10 including processor timer monitoring means 80, failure detection means 90, information collecting means 100, and recovery means 105; a main memory 20 including a monitoring information storage area 60; a communication control device 30; an input/output control device 40; and an external storage device 50.

The monitoring information storage area 60 stores the monitoring information of each of processors 1, 2, ..., N.

The monitoring information updating means 70 of each of processors 1, 2, ..., N updates that processor's monitoring information in the monitoring information storage area 60 on the main memory 20.
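The relationship between the storage area and the updating means can be illustrated with a minimal sketch. The patent does not specify what the monitoring information contains, so the per-processor integer progress counter used here is an assumption, as are the names `MonitorArea` and `update_monitor_info`.

```python
class MonitorArea:
    """Stand-in for monitoring information storage area 60 in main memory 20.

    Assumption: the monitoring information is one integer counter per processor.
    """

    def __init__(self, n_processors):
        self.counters = [0] * n_processors


def update_monitor_info(area, processor_id):
    """Hypothetical updating means 70: a processor touches only its own slot."""
    area.counters[processor_id - 1] += 1  # processors are numbered 1..N


area = MonitorArea(n_processors=3)
update_monitor_info(area, 1)  # processor 1 completes a unit of work
update_monitor_info(area, 1)  # ... and another
update_monitor_info(area, 3)  # processor 3 completes one
print(area.counters)  # [2, 0, 1]
```

Because each processor writes only to its own slot, no dedicated timer device per processor is needed; a single reader can inspect all slots.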

The processor timer monitoring means 80 reads the monitoring information of processors 1, 2, ..., N from the monitoring information storage area 60 on the main memory 20 at fixed time intervals.

The failure detection means 90 determines, from the monitoring information that the processor timer monitoring means 80 has read from the monitoring information storage area 60 on the main memory 20, whether a failure has occurred in any of processors 1, 2, ..., N.
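The patent leaves the detection criterion open. One plausible rule, shown here purely as an assumption, is to compare two successive reads of the monitoring information and to judge a processor failed if its value did not change during the interval. The function name `detect_failures` is hypothetical.

```python
def detect_failures(previous, current):
    """Return the 1-based numbers of processors whose monitoring value
    did not change between two successive reads (assumed failure rule)."""
    return [
        k
        for k, (old, new) in enumerate(zip(previous, current), start=1)
        if new == old
    ]


previous_read = [4, 7, 2]  # values read at the previous interval
current_read = [5, 7, 3]   # values read now; processor 2 made no progress
failed = detect_failures(previous_read, current_read)
print(failed)  # [2]
```

Under this rule the read interval has to be longer than the longest time a healthy processor may take between updates, otherwise a slow but working processor would be reported as failed.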

When the failure detection means 90 detects a failure in processor K (K is a positive integer with 1 ≤ K ≤ N), the information collecting means 100 collects failure information from the monitoring information of processor K and outputs it to the external storage device 50 via the input/output control device 40.

The recovery means 105 performs failure recovery processing for processor K, in which the failure was detected, based on the failure information collected by the information collecting means 100.

Referring to FIG. 2, the processing in processors 1, 2, ..., N consists of a processing execution step 110 and a monitoring information update step 120.

Referring to FIG. 3, the processing in the maintenance processor 10 consists of a processor timer monitoring step 130, a monitoring information reading step 140, a failure detection determination step 150, a failure information collection step 160, and a failure recovery processing step 170.

Next, the operation of the failure handling method in the multiprocessor system of this embodiment, configured as described above, will be explained.

Each of processors 1, 2, ..., N, upon executing the processing it is to perform (step 110), updates its monitoring information in the monitoring information storage area 60 on the main memory 20 by means of the monitoring information updating means 70 (step 120).

Meanwhile, in the maintenance processor 10, the processor timer monitoring means 80 monitors the timers for processors 1, 2, ..., N (timers implemented in software on the maintenance processor 10) (step 130) and reads the monitoring information of each of processors 1, 2, ..., N from the monitoring information storage area 60 on the main memory 20 at fixed time intervals (step 140). Based on the monitoring information read from the monitoring information storage area 60, the failure detection means 90 determines whether a failure has occurred in any of processors 1, 2, ..., N (step 150).

If the failure detection means 90 detects a failure in processor K, the information collecting means 100 collects failure information from the monitoring information of processor K and outputs it to the external storage device 50 via the input/output control device 40 (step 160), and the recovery means 105 performs failure recovery processing for processor K based on the failure information collected by the information collecting means 100 (step 170). If the failure detection means 90 detects no failure, the processing ends as is.
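One pass through steps 140 to 170 of the maintenance processor can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the monitoring information is taken to be one counter per processor, a failure is assumed to mean "no change since the last read", a plain list stands in for external storage device 50, and `maintenance_cycle` and the `recover` callback are hypothetical names.

```python
def maintenance_cycle(monitor_area, last_snapshot, external_storage, recover):
    """Run steps 140-170 once. The caller plays the role of step 130 by
    invoking this function each time the software timer expires."""
    # Step 140: read the monitoring information of every processor.
    snapshot = list(monitor_area)

    # Step 150: assumed rule, a processor whose value is unchanged has failed.
    failed = [
        k
        for k, (old, new) in enumerate(zip(last_snapshot, snapshot), start=1)
        if old == new
    ]

    for k in failed:
        # Step 160: collect failure information and write it to external storage.
        info = {"processor": k, "monitor_info": snapshot[k - 1]}
        external_storage.append(info)
        # Step 170: recovery processing based on the collected information.
        recover(k, info)

    return snapshot, failed


storage = []    # stand-in for external storage device 50
restarted = []  # records which processors the recovery callback handled
snapshot, failed = maintenance_cycle(
    monitor_area=[3, 5, 9],
    last_snapshot=[2, 5, 8],
    external_storage=storage,
    recover=lambda k, info: restarted.append(k),
)
print(failed)   # [2]
print(storage)  # [{'processor': 2, 'monitor_info': 5}]
```

The returned snapshot would be passed back in as `last_snapshot` on the next timer expiry. Collection and recovery run on the maintenance processor, so they do not depend on the failed processor being able to service an interrupt.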

[Effect of the invention]

As described above, by providing a dedicated maintenance processor that performs failure detection and recovery processing, the present invention reduces the load on each processor and makes it possible to collect detailed information and perform failure recovery processing when a processor failure is detected.

[Brief explanation of the drawing]

FIG. 1 is a block diagram showing the configuration of a failure handling method in a multiprocessor system according to an embodiment of the present invention, FIG. 2 is a flowchart showing the processing in the processors in FIG. 1, and FIG. 3 is a flowchart showing the processing in the maintenance processor in FIG. 1.

In the figures:
1, 2, ..., N: processors
10: maintenance processor
20: main memory
30: communication control device
40: input/output control device
50: external storage device
60: monitoring information storage area
70: monitoring information updating means
80: processor timer monitoring means
90: failure detection means
100: information collecting means
105: recovery means

Claims

[Scope of Claims] A failure handling method in a multiprocessor system, comprising: a main memory having a monitoring information storage area for storing monitoring information of each processor; a plurality of processors, each including monitoring information updating means for updating the monitoring information in the monitoring information storage area on the main memory; and a maintenance processor including processor timer monitoring means for reading the monitoring information from the monitoring information storage area on the main memory at fixed time intervals, failure detection means for determining the occurrence of a failure in the processors based on the monitoring information read by the processor timer monitoring means, information collecting means for collecting failure information from the monitoring information of a processor in which the occurrence of a failure has been detected by the failure detection means, and recovery means for performing failure recovery processing for the processor in which the failure has occurred, based on the failure information collected by the information collecting means.