JPS6224330A

JPS6224330A - Fault detecting system for multi-processor

Info

Publication number: JPS6224330A
Application number: JP60161832A
Authority: JP
Inventors: Noboru Mizuhara (水原　登); Tadashi Koshiba (小柴　忠司); Toru Hoshi (徹星); Kenji Kawakita (謙二川北)
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1985-07-24
Filing date: 1985-07-24
Publication date: 1987-02-02

Abstract

PURPOSE: To detect a fault of the partner processor by providing an operation display area for the processors in a common memory and having each processor refer to this area and reset the pattern displayed there by the other. CONSTITUTION: A working state display area ST for the processors and a fault code display area FC are provided at specific addresses of a common memory CM set between processors 1 and 2. Both processors 1 and 2 periodically refer to the area ST and write their own numbers, which informs the partner processor that the writing processor is working normally. If a processor has a fault, it writes a fault code to the area FC, which conveys the contents of the fault to the partner processor. A processor decides that the partner processor is not working if the area ST has not been rewritten by the partner for two periods.

Description

[Detailed description of the invention]

[Field of application of the invention]

The present invention relates to a tightly coupled multiprocessor system having a common memory, and more particularly to a mutual processor monitoring method suitable for detecting processor failures.

[Background of the invention]

Known means of detecting processor failures include program runaway detection using a watchdog timer and processor halt detection based on detecting loss of the clock. These methods are also used in multiprocessor systems, but implementing them requires dedicated circuitry. (For details, see Iso et al., "A Study on Microprocessor System Redundancy Configuration", Institute of Electronics and Communication Engineers Technical Research Report, Vol. 84, No. 212, 5E84-78, November 1984.)

[Object of the invention]

An object of the present invention is to provide, in a multiprocessor system having a common memory as described above, a function by which the processors can detect abnormalities in each other without any additional mechanism other than the common memory.

[Summary of the invention]

A feature of the present invention is that, in the multiprocessor system described above, a processor operation display area is provided at a specific address in the common memory. Each processor periodically refers to this area, and the processors reset the patterns that the partner processor has displayed there, which makes it possible to detect an abnormality in the partner processor.

[Embodiments of the invention]

An embodiment of the present invention is described below with reference to FIGS. 1 and 2.

FIG. 1 shows the configuration of a multiprocessor system that implements the present processor failure detection method. In FIG. 1, P1 and P2 denote processors, CM denotes the common memory between P1 and P2, ST denotes the processor operating status display area in CM, and FC denotes the fault code display area. LM1 and LM2 denote the local memories of P1 and P2, respectively, and CNT1 and CNT2 denote the partner processor failure detection counters of P1 and P2, respectively.

FIG. 2 shows the operation flow by which processor P1 of FIG. 1 displays its own operation and detects a failure of the partner processor. The processing shown in FIG. 2 is activated periodically in each processor. First, ST determination step 21 examines the processor operating status display; if ST shows the partner processor number, the partner processor is regarded as operating. In the subsequent FC determination step 22, if FC indicates that a failure has occurred, the flow moves to failure processing step 29; if FC indicates normal operation, it moves to operating state determination step 23. Step 23 determines the operating state of the own processor. If no fault has occurred, the flow moves immediately to own processor operating display step 25. If a fault has occurred, fault code setting step 24 sets in FC the fault code corresponding to the contents of the fault, and the flow then moves to step 25. In step 25, the own processor number is written to ST to notify the partner processor that the own processor is operating. In the subsequent CNTi clear step 26, the failure detection counter is cleared to 0, and the processing ends.

On the other hand, if ST still shows the own processor number in ST determination step 21, the subsequent CNTi determination step 27 examines the value of CNTi. If CNTi is 0, the partner processor is regarded as possibly having stopped, and CNTi increment step 28 adds 1 to CNTi in preparation for the next activation. If CNTi is 1 in step 27, the partner processor has not operated for two cycles. In this case the partner processor is regarded as having stopped, and failure processing is carried out in failure processing step 29: for example, in a single system, failure information is collected and the initial settings for a restart are made; in a duplex system, failure information is collected and the system is switched over.

The partner processor is regarded as stopped only when CNTi = 1 in step 27, that is, only when it has failed to operate for two consecutive cycles. This prevents excessive failure detection caused by differences between the processors in the execution period of this program.
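The flow of FIG. 2 can be sketched as a small simulation. This is an illustrative sketch only, not the patented implementation: the names `CommonMemory`, `Processor`, `periodic_check`, and `handle_failure` are hypothetical, processors are numbered 1 and 2, FC equal to 0 is assumed to mean "operating normally", and ST is assumed to start at 0 (neither processor number), which the routine treats like the partner's number.

```python
from dataclasses import dataclass

NO_FAULT = 0  # assumption: FC == 0 means "operating normally"


@dataclass
class CommonMemory:
    st: int = 0          # ST: number of the processor that wrote last
    fc: int = NO_FAULT   # FC: fault code display area


class Processor:
    def __init__(self, number, partner, cm):
        self.number = number      # own processor number
        self.partner = partner    # partner processor number
        self.cm = cm              # shared CommonMemory instance
        self.cnt = 0              # CNTi, held in local memory
        self.own_fault = NO_FAULT # set by the caller to simulate a fault

    def periodic_check(self):
        """One periodic activation of the FIG. 2 flow; returns a status string."""
        cm = self.cm
        if cm.st == self.number:          # step 21: partner has not rewritten ST
            if self.cnt == 0:             # step 27: first missed cycle
                self.cnt += 1             # step 28: wait one more cycle
                return "suspect"
            return self.handle_failure("partner stopped")   # step 29
        if cm.fc != NO_FAULT:             # step 22: partner reported a fault
            return self.handle_failure(f"partner fault code {cm.fc}")  # step 29
        if self.own_fault != NO_FAULT:    # step 23: own fault present?
            cm.fc = self.own_fault        # step 24: publish the fault code
        cm.st = self.number               # step 25: show "I am operating"
        self.cnt = 0                      # step 26: clear the counter
        return "ok"

    def handle_failure(self, reason):
        # Placeholder for step 29 (collect information, restart or switch over).
        return "failure: " + reason
```

Calling `periodic_check()` alternately on two `Processor` objects that share one `CommonMemory` keeps returning `"ok"`. If one of them stops being called, the other returns `"suspect"` on its next activation and a failure result on the one after that, which mirrors the two-cycle rule described above.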

Furthermore, if the fault codes can be assigned so that they are exclusive between the processors, the fault code can also be displayed in ST, and FC can be eliminated.
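One way to picture an ST that carries both the processor identity and a fault code is to give each processor its own block of values. The encoding below is purely an assumption for illustration (the source does not specify one): `CODES_PER_PROCESSOR`, `encode`, and `decode` are hypothetical names, and fault 0 is assumed to mean "operating normally".

```python
CODES_PER_PROCESSOR = 16  # assumption: each processor owns a block of 16 ST values


def encode(processor, fault=0):
    """ST value for `processor`; fault 0 is assumed to mean normal operation."""
    if not 0 <= fault < CODES_PER_PROCESSOR:
        raise ValueError("fault code outside the processor's block")
    return processor * CODES_PER_PROCESSOR + fault


def decode(st):
    """Recover (processor number, fault code) from an ST value."""
    return divmod(st, CODES_PER_PROCESSOR)
```

Because the blocks do not overlap, a reader of ST can tell which processor wrote last and whether it reported a fault, without a separate FC area.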

The embodiment above describes the case of two processors. The same scheme can be realized with three or more processors by providing nC2 STs and FCs in the common memory CM (one per pair of processors, where n is the number of processors) and n-1 processor failure detection counters in the local memory LMi of each processor Pi, so that each processor Pi monitors the operating status of the other n-1 processors.
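For the case of three or more processors, one ST/FC slot per pair of processors can be addressed with a simple index calculation. The sketch below is an assumption about layout, not something the source specifies: processors are numbered from 0, slots are ordered (0,1), (0,2), ..., (0,n-1), (1,2), and so on, and `pair_index` and `slots_for` are hypothetical helper names.

```python
import math
from itertools import combinations


def pair_index(i, j, n):
    """Index of the ST/FC slot shared by processors i and j (0-based, i != j)."""
    if i == j or not (0 <= i < n and 0 <= j < n):
        raise ValueError("need two distinct processor numbers below n")
    a, b = min(i, j), max(i, j)
    # Slots before row a: (n-1) + (n-2) + ... + (n-a) = a*n - a*(a+1)/2
    return a * n - a * (a + 1) // 2 + (b - a - 1)


def slots_for(i, n):
    """The n-1 slots processor i must check, one per partner processor."""
    return [pair_index(i, j, n) for j in range(n) if j != i]


n = 4
all_slots = [pair_index(i, j, n) for i, j in combinations(range(n), 2)]
total_slots = math.comb(n, 2)  # number of ST (and FC) areas needed in CM
```

Each processor would run the two-processor check once per slot returned by `slots_for`, using a separate counter for each partner.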

However, as the number of processors increases, the number of STs and FCs also grows too large to monitor. In that case, another possible method is to provide one ST per processor and have each processor periodically count up the value of its own ST (returning it to 0 on overflow), thereby displaying its own operation to the other processors.
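The per-processor counter variant can be sketched as follows. The source only states that each processor counts up its own ST with wraparound; how the other processors judge a stop is not spelled out, so the rule used here (a value unchanged for two consecutive scans) is an assumption carried over from the two-processor embodiment. `ST_MODULUS`, `HeartbeatTable`, `beat`, and `Watcher` are hypothetical names.

```python
ST_MODULUS = 256  # assumption: each ST is an 8-bit counter


class HeartbeatTable:
    """Stands in for the common memory: one ST counter per processor."""

    def __init__(self, n):
        self.st = [0] * n


def beat(table, me):
    """Count up the own ST, returning to 0 on overflow."""
    table.st[me] = (table.st[me] + 1) % ST_MODULUS


class Watcher:
    """Local-memory state one processor keeps about all the others."""

    def __init__(self, me, n, limit=2):
        self.me = me
        self.last_seen = [None] * n  # ST value observed at the previous scan
        self.stale = [0] * n         # consecutive scans without a change
        self.limit = limit           # assumed threshold: two cycles

    def scan(self, table):
        """Return the processors whose ST has not changed for `limit` scans."""
        stopped = []
        for p, value in enumerate(table.st):
            if p == self.me:
                continue
            if value != self.last_seen[p]:
                self.last_seen[p] = value
                self.stale[p] = 0
            else:
                self.stale[p] += 1
                if self.stale[p] >= self.limit:
                    stopped.append(p)
        return stopped
```

With this layout the common memory holds n counters instead of one area per pair, at the cost of each processor remembering the last value it saw for every other processor.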

[Effect of the invention]

According to the present invention, in a multiprocessor system having a common memory, the processors can detect each other's faults merely by each periodically accessing the operating status display area provided in the common memory. This has the effect of reducing the additional circuitry needed for fault detection.

[Brief explanation of drawings]

FIG. 1 is a block diagram of a tightly coupled multiprocessor system that implements the present multiprocessor failure detection method, and FIG. 2 is a flowchart of the failure detection processing procedure.

Agent: Patent attorney Katsuo Ogawa

Claims

[Claims]

A fault detection method for a multiprocessor, in a multiprocessor system having a common memory, characterized in that: an area for displaying the operating status of the processors is provided in the common memory, and a unique identification code is assigned to each processor; each processor refers to said area at the same cycle and writes its own identification code, thereby notifying the other processor that it is operating normally, and writes a code assigned to a failure type, thereby notifying the other processor of the contents of the failure; and when said area has not been rewritten by the other processor for a certain period, the other processor can be judged to have stopped operating.