JP2008146222A - Computer failure detection system and computer failure detection method

Info

Publication number: JP2008146222A
Application number: JP2006330676A
Authority: JP
Inventors: Takayuki Uchida; 貴之内田
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2006-12-07
Filing date: 2006-12-07
Publication date: 2008-06-26

Abstract

PROBLEM TO BE SOLVED: In a conventional computer failure detection system, the data transfer capacity of the common bus is limited, so using the common bus to notify other processors of self-diagnosis information reduces the capacity available on the bus. Moreover, when the common bus is in use by another processor, permission to use it cannot be obtained, the output of self-diagnosis information is delayed, and failures cannot be detected correctly.

SOLUTION: The computer failure detection system includes a plurality of self-diagnosis output computers, a failure detection computer, and self-diagnosis information lines connecting the self-diagnosis output computers to the failure detection computer. Each self-diagnosis output computer creates its own self-diagnosis information, each self-diagnosis information line transmits the self-diagnosis information output from its self-diagnosis output computer, and the failure detection computer detects a failure of the self-diagnosis output computer that sent the information, based on the self-diagnosis information received over each line.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a system and method for detecting computer failures in a multiprocessor parallel processing system without degrading the performance of the common bus that connects the processor-equipped computers.

In conventional multiprocessor parallel processing systems, one known method of detecting a failure in a computer works as follows: each processor-equipped computer periodically notifies the other processors of its self-diagnosis information via a common bus, and each processor detects a failed computer by checking the self-diagnosis information it periodically receives from the other processors (see, for example, Patent Document 1).

Patent Document 1: JP-A-7-219910 (pages 2-4, FIG. 3)

However, in a multiprocessor parallel processing system, the common bus used for data communication between processors has limited data transfer capacity. Consequently, when a processor uses the common bus to periodically notify other processors of its self-diagnosis information, the transfer capacity available for inter-processor data communication decreases.

In addition, a processor that needs to send self-diagnosis information may be unable to obtain permission to use the common bus from the computer that manages data communication on it, for example because another processor is using the bus. The processor then cannot output its self-diagnosis information, so the output may be delayed.

Thus, notifying self-diagnosis information over the common bus reduces the data transfer capacity of the bus. Moreover, when the self-diagnosis information held by the receiving computer has not been updated, that computer cannot determine whether the cause is an output delay or a failure of the sending processor that prevents transmission, so failures cannot be detected correctly.

The present invention has been made to solve these problems. Its object is to correctly detect computer failures based on the self-diagnosis information of each computer without reducing the data transfer capacity of the common bus.

The computer failure detection system according to the present invention comprises:
a plurality of self-diagnosis output computers, a failure detection computer, and self-diagnosis information lines connecting the plurality of self-diagnosis output computers to the failure detection computer, wherein
each self-diagnosis output computer creates and outputs self-diagnosis information indicating its own self-diagnosis result,
each self-diagnosis information line is connected to a respective self-diagnosis output computer and transmits the self-diagnosis information output from that computer, and
the failure detection computer is connected to all of the self-diagnosis information lines, receives the self-diagnosis information transmitted over them, and, based on the received self-diagnosis information, detects a failure of the self-diagnosis output computer that transmitted it.

Because the present invention transmits self-diagnosis information over the self-diagnosis information lines rather than the common bus, the data transfer capacity of the common bus is not reduced. Furthermore, the self-diagnosis information lines do not incur the output delays that occur on the common bus, so failures can be detected reliably.

Embodiment 1.
FIG. 1 shows a configuration example of a multiprocessor parallel processing system that implements the computer failure detection method according to Embodiment 1.

The system in FIG. 1 consists of two self-diagnosis output computers and one failure detection computer. Needless to say, the number of self-diagnosis output computers in the multiprocessor parallel processing system is not limited to two as in FIG. 1; the system may be configured with as many as it requires.

Each self-diagnosis output computer 11 includes a common bus I/F (interface) unit 12, a self-diagnosis output computer CPU (Central Processing Unit) 13, and a self-diagnosis information transmission unit 14, which are interconnected by an internal bus 15.

The failure detection computer 16 includes a common bus I/F unit 17, a failure detection computer CPU 18, a self-diagnosis information reception unit 19, and a reception memory unit 20, which are connected to one another by an internal bus 21. The failure detection computer 16 also includes a reception time measurement unit 22. The reception time data 23 output by the reception time measurement unit 22 is input to the self-diagnosis information reception unit 19.

The self-diagnosis output computers 11 and the failure detection computer 16 in FIG. 1 are connected by a common bus 24 through the common bus I/F unit 12 of each self-diagnosis output computer 11 and the common bus I/F unit 17 of the failure detection computer 16. The computers transfer data to one another over the common bus 24.

In addition, the self-diagnosis information transmission unit 14 in each self-diagnosis output computer 11 is connected to the self-diagnosis information reception unit 19 in the failure detection computer 16 by its own self-diagnosis information line 25, as shown in FIG. 1. Each self-diagnosis output computer 11 sends its self-diagnosis information to the failure detection computer 16 over the self-diagnosis information line 25 connected to it. The self-diagnosis information line 25 may use either serial or parallel transmission.

FIG. 2 shows an example of a table of the self-diagnosis information output by the self-diagnosis output computer 11 and an example of a self-diagnosis information data frame in Embodiment 1.

Each self-diagnosis output computer 11 diagnoses whether the functions processed inside it, and the programs, devices, and the like that implement those functions, are operating normally or abnormally. As an example, self-diagnosis in the self-diagnosis output computer 11a is described here.

The self-diagnosis output computer CPU 13a mounted in the self-diagnosis output computer 11a periodically runs a self-diagnosis program on the internally processed functions and on the programs and devices that implement them. These include, for example, memories such as ROM (Read Only Memory) and RAM (Random Access Memory); signal processing devices that perform information processing, such as a PLD (Programmable Logic Device), FPGA (Field Programmable Gate Array), or DSP (Digital Signal Processor); and boards and equipment built from them. The self-diagnosis program checks whether each function of the self-diagnosis output computer 11a is operating normally or abnormally and, if necessary, runs a function-check program, for example, to detect abnormal operation.

For example, if the self-diagnosis finds that all functions in the self-diagnosis output computer 11a are operating normally, the self-diagnosis output computer CPU 13a obtains self-diagnosis information 00, as shown in FIG. 2(a). Likewise, if the self-diagnosis finds that failure 2 has occurred in function 1 of the self-diagnosis output computer 11a, the self-diagnosis output computer CPU 13a obtains self-diagnosis information 02.

In FIG. 2(a), the self-diagnosis information is expressed in hexadecimal (shown as "hex" in the figure) and is 8 bits wide. Needless to say, the width is not limited to 8 bits; any number of bits sufficient to represent all anticipated failure types may be used.

Based on the self-diagnosis information table in FIG. 2(a), the self-diagnosis output computer CPU 13a creates a self-diagnosis information data frame such as the one shown in FIG. 2(b) and outputs it to the self-diagnosis information line 25a. The transmitted data frame consists of a self-diagnosis output computer ID (identifier), which indicates which self-diagnosis output computer the information belongs to, and the self-diagnosis information of that computer.
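The frame layout described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the text gives the 8-bit width of the self-diagnosis information, but the 1-byte ID width, the example ID value, and the names `DIAG_TABLE`, `build_frame`, and `parse_frame` are hypothetical. The table holds only the two codes mentioned in the text, not the full FIG. 2(a).

```python
import struct

# Hypothetical excerpt of the FIG. 2(a) table: only the two codes named in the text.
DIAG_TABLE = {
    0x00: "all functions normal",
    0x02: "function 1, failure 2",
}

def build_frame(computer_id, diag_code):
    """Pack a computer ID and an 8-bit self-diagnosis code into a 2-byte frame.

    The 1-byte ID width is an assumption; the source does not state it.
    """
    return struct.pack("BB", computer_id, diag_code)

def parse_frame(frame):
    """Unpack a 2-byte frame into (computer_id, diag_code)."""
    computer_id, diag_code = struct.unpack("BB", frame)
    return computer_id, diag_code

# Example: a computer with the made-up ID 0x0A reports failure 2 in function 1.
frame = build_frame(0x0A, 0x02)
computer_id, diag_code = parse_frame(frame)
meaning = DIAG_TABLE[diag_code]
```

Keeping the ID inside the frame lets the receiver attribute each report to its sender regardless of how the lines are wired.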

The data frame may be created by the self-diagnosis information transmission unit 14a instead of the self-diagnosis output computer CPU 13a, or by a dedicated functional block or circuit provided for creating data frames.

After the data frame is created, the self-diagnosis information of the self-diagnosis output computer 11a is output from the self-diagnosis information transmission unit 14a and input to the self-diagnosis information reception unit 19 of the failure detection computer 16 via the self-diagnosis information line 25a.

The self-diagnosis output computer 11b performs the same processing, as do any additional self-diagnosis output computers 11 when there are three or more.

FIG. 3 illustrates how the reception time data 23 is used in the data processing performed by the self-diagnosis information reception unit 19 of the failure detection computer 16 in Embodiment 1.

FIG. 3 shows a case in which a failure occurs in the self-diagnosis output computer 11a while the self-diagnosis output computer 11b operates normally. Parts (1) and (2) of FIG. 3 show failure detection without the reception time data 23; parts (3) and (4) show failure detection with it.

First, the case in which the reception time data is not used together with the self-diagnosis information is described with reference to (1) and (2) of FIG. 3. As shown in FIG. 2(b), the self-diagnosis output computer ID and the self-diagnosis information of that computer are sent to the failure detection computer 16.

A failure of the self-diagnosis output computer 11a may be a failure of its self-diagnosis output computer CPU 13a or a failure of something other than the CPU 13a. The latter case is described first: as in (1) of FIG. 3, suppose for example that failure 2 occurs in function 1.

Assume that the self-diagnosis output computers 11a and 11b operate normally, without failure, from startup until reception time t1 at the failure detection computer 16. Following the table in FIG. 2(a), both computers then send the self-diagnosis information "00", indicating normal operation, to the self-diagnosis information reception unit 19 of the failure detection computer 16. As shown in (1) of FIG. 3, the self-diagnosis information received by the self-diagnosis information reception unit 19 at reception time t1 is therefore "00" for both 11a and 11b.

The failure detection computer CPU 18 of the failure detection computer 16 refers to this self-diagnosis information and determines that the self-diagnosis output computers 11a and 11b are both operating normally.

Next, suppose that failure 2 occurs in function 1 of the self-diagnosis output computer 11a between reception time t1 and reception time t2, the time at which the next self-diagnosis information is received. The self-diagnosis output computer 11b is still operating normally, so it continues to output "00". The self-diagnosis output computer 11a, however, outputs "02" in accordance with the table in FIG. 2(a), because failure 2 has occurred in function 1. As shown in (1) of FIG. 3, the self-diagnosis information received by the self-diagnosis information reception unit 19 at reception time t2 is therefore "02" for 11a and "00" for 11b.

The failure detection computer CPU 18 of the failure detection computer 16 refers to this self-diagnosis information and determines that failure 2 has occurred in function 1 of the self-diagnosis output computer 11a and that the self-diagnosis output computer 11b is operating normally.

With this method, however, a failure cannot be detected correctly when the CPU itself fails. To illustrate, the case in which the self-diagnosis output computer CPU 13a of the self-diagnosis output computer 11a fails, as in (2) of FIG. 3, is described next.

Assume that the self-diagnosis output computers 11a and 11b operate normally, without failure, from startup until reception time t1 at the failure detection computer 16. Following the table in FIG. 2(a), both computers then send the self-diagnosis information "00", indicating normal operation, to the self-diagnosis information reception unit 19 of the failure detection computer 16. As shown in (2) of FIG. 3, the self-diagnosis information received by the self-diagnosis information reception unit 19 at reception time t1 is therefore "00" for both 11a and 11b.

Next, suppose that the self-diagnosis output computer CPU 13a of the self-diagnosis output computer 11a fails between reception time t1 and reception time t2, the time at which the next self-diagnosis information is received. The self-diagnosis output computer 11b is still operating normally, so it continues to output "00". The self-diagnosis output computer 11a, however, cannot output self-diagnosis information because its CPU 13a has failed. As shown in (2) of FIG. 3, the self-diagnosis information of 11b received by the self-diagnosis information reception unit 19 at reception time t2 is "00", while the self-diagnosis information of 11a is not updated.

The failure detection computer CPU 18 of the failure detection computer 16 refers to this self-diagnosis information and correctly determines that the self-diagnosis output computer 11b is operating normally. However, because the self-diagnosis output computer 11a has not updated its self-diagnosis information, when the failure detection computer CPU 18 refers to the information for 11a it reads the stale value "00", as shown in (2) of FIG. 3, and wrongly determines that 11a is operating normally.

This shows that when only the self-diagnosis information sent from each self-diagnosis output computer 11 is used, a failure can be detected while the self-diagnosis output computer CPU 13 is running, but not when, for example, the CPU 13 has stopped because of a failure.
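The misjudgment in (2) of FIG. 3 can be reproduced with a small Python sketch. It assumes, purely for illustration, that the detector keeps only the last code received per computer; the class `NaiveDetector` and the string IDs "11a" and "11b" are hypothetical names, not part of the source.

```python
NORMAL = 0x00  # self-diagnosis code for normal operation, per the text

class NaiveDetector:
    """Keeps only the most recent self-diagnosis code per computer (no timestamps)."""

    def __init__(self):
        self.latest = {}

    def receive(self, computer_id, diag_code):
        self.latest[computer_id] = diag_code

    def judge(self, computer_id):
        # Looks at the stored code only; cannot tell whether it is stale.
        return "normal" if self.latest[computer_id] == NORMAL else "fault"

detector = NaiveDetector()

# Reception time t1: both computers report normal operation.
detector.receive("11a", 0x00)
detector.receive("11b", 0x00)

# Reception time t2: the CPU of 11a has halted, so only 11b reports.
detector.receive("11b", 0x00)

verdict_11a = detector.judge("11a")  # stale "00" is read as normal operation
verdict_11b = detector.judge("11b")
```

The stored value for 11a is indistinguishable from a fresh report, which is exactly the gap the reception time data is meant to close.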

Therefore, as shown in (3) and (4) of FIG. 3, the reception time data 23 output by the reception time measurement unit 22 is input to the self-diagnosis information reception unit 19, and failure detection uses the self-diagnosis information sent from each self-diagnosis output computer 11 together with the reception time data. The labels t1 and t2 in (3) and (4) of FIG. 3 denote data representing reception times t1 and t2.

The reception time measurement unit 22 in the failure detection computer 16 measures time using a timer such as a counter and continuously outputs the time to the self-diagnosis information reception unit 19. If the timers in the self-diagnosis output computers 11 and the failure detection computer 16 are synchronized, and the timing at which self-diagnosis information from each self-diagnosis output computer 11 arrives at the failure detection computer 16 is known in advance, the reception time data 23 need not be output continuously; it may be output only at the reception timing.

When the self-diagnosis information reception unit 19 receives self-diagnosis information from a self-diagnosis output computer 11, it appends the reception time data 23 output by the reception time measurement unit 22 to that information.
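A minimal Python sketch of this stamping step follows. It assumes a free-running counter as the timer, which the text mentions only as one example; `ReceptionTimer`, `stamp`, and the record field names are hypothetical.

```python
import itertools

class ReceptionTimer:
    """Free-running counter standing in for the reception time measurement unit."""

    def __init__(self):
        self._ticks = itertools.count(1)
        self.now = 0

    def tick(self):
        # Advance the counter by one period and return the new time value.
        self.now = next(self._ticks)
        return self.now

def stamp(frame, rx_time):
    """Attach the reception time to a received (computer_id, diag_code) frame."""
    computer_id, diag_code = frame
    return {"id": computer_id, "diag": diag_code, "rx_time": rx_time}

timer = ReceptionTimer()
timer.tick()                               # counter reaches 1 (plays the role of t1)
record_t1 = stamp((0x0A, 0x00), timer.now)
timer.tick()                               # counter reaches 2 (plays the role of t2)
record_t2 = stamp((0x0A, 0x02), timer.now)
```

Because the time comes from the receiver's own timer, a sender that has stopped cannot refresh it.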

When failure 2 occurs in function 1 of the self-diagnosis output computer 11a, as in (3) of FIG. 3, the combined self-diagnosis information and reception time data 23 is "00t1" at reception time t1 and "02t2" at reception time t2. As in (1) of FIG. 3, the failure detection computer CPU 18 can therefore correctly determine that failure 2 has occurred in function 1 of 11a simply by comparing the self-diagnosis information.

Next, the case in which the self-diagnosis output computer CPU 13a of the self-diagnosis output computer 11a fails is described with reference to (4) of FIG. 3, which corresponds to (2) of FIG. 3.

Assume that the self-diagnosis output computers 11a and 11b operate normally, without failure, from startup until reception time t1 at the failure detection computer 16. Following the table in FIG. 2(a), both computers then send the self-diagnosis information "00", indicating normal operation, to the self-diagnosis information reception unit 19 of the failure detection computer 16. As shown in (4) of FIG. 3, the combined self-diagnosis information and reception time data 23 at reception time t1 is therefore "00t1" for both 11a and 11b.

The failure detection computer CPU 18 of the failure detection computer 16 refers to this data, compares it with the combined self-diagnosis information and reception time data 23 from one reception time earlier (not shown in FIG. 3), and determines that the self-diagnosis output computers 11a and 11b are operating normally.

Then suppose that the self-diagnosis output computer CPU 13a of the self-diagnosis output computer 11a fails between reception time t1 and reception time t2, the time at which the next self-diagnosis information is received. The self-diagnosis output computer 11b is still operating normally, so it continues to output "00". The self-diagnosis output computer 11a, however, cannot output self-diagnosis information because its CPU 13a has failed.

Accordingly, at reception time t2 the combined self-diagnosis information and reception time data 23 for the self-diagnosis output computer 11b is "00t2". The self-diagnosis information of the self-diagnosis output computer 11a, however, is not received by the self-diagnosis information reception unit 19, so its combined data is not updated and remains "00t1".

As described later, the failure detection computer CPU 18 of the failure detection computer 16 compares the combined self-diagnosis information and reception time data at reception time t2 with the data from one reception time earlier, t1. For the self-diagnosis output computer 11b, comparing the data "00t1" at reception time t1 with the data "00t2" at reception time t2 leads to the correct determination that it is operating normally.

For the self-diagnosis output computer 11a, on the other hand, the combined self-diagnosis information and reception time data has not been updated, so the failure detection computer CPU 18 compares "00t1" at reception time t1 with "00t1" at reception time t2 and detects that the reception time data has not been updated.

Because each self-diagnosis information line 25 connects the failure detection computer 16 to an individual self-diagnosis output computer 11, the output of self-diagnosis information is not delayed by data exchanged between other self-diagnosis output computers 11 and the failure detection computer 16, as it would be on the common bus 24. Consequently, data updates are not delayed by output delays either.

It follows that the failure of the combined data for the self-diagnosis output computer 11a to update between reception times t1 and t2 is not caused by an output delay; it can be attributed to the self-diagnosis output computer 11a having stopped operating because its CPU 13a failed. The failure detection computer CPU 18 can therefore correctly detect the failure of the self-diagnosis output computer 11a.
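The comparison of the previous and current records can be sketched in Python as below. This is one possible reading of the decision logic, not the patented implementation: the function name `judge`, the tuple layout `(diag_code, rx_time)`, and the returned strings are assumptions, and reception times are modeled as integers 1 and 2 standing for t1 and t2.

```python
NORMAL = 0x00  # self-diagnosis code for normal operation, per the text

def judge(old_record, new_record):
    """Compare the records held at the previous and current check times.

    Each record is a (diag_code, rx_time) tuple, e.g. (0x00, 1) for "00t1".
    """
    old_code, old_time = old_record
    new_code, new_time = new_record
    if new_time == old_time:
        # The reception time did not advance: nothing arrived on the dedicated
        # line, so the sender is taken to have stopped.
        return "stopped"
    if new_code != NORMAL:
        return "fault 0x%02X" % new_code
    return "normal"

# FIG. 3 (3): computer 11a reports "00t1" and then "02t2".
case_function_fault = judge((0x00, 1), (0x02, 2))
# FIG. 3 (4): computer 11a stays at "00t1" and computer 11b advances to "00t2".
case_cpu_halt = judge((0x00, 1), (0x00, 1))
case_healthy = judge((0x00, 1), (0x00, 2))
```

Treating an unchanged reception time as a stop is justified only because the dedicated lines rule out bus contention as a cause of delay, as the text argues.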

FIG. 4 shows an example of the data output from the self-diagnosis information reception unit 19 to the reception memory unit 20 in Embodiment 1.

FIG. 4(a) is an example of data on the self-diagnosis information of the self-diagnosis output computers and its reception times. For example, No. 1 indicates that self-diagnosis information "00" (normal) from self-diagnosis output computer a was received at 00:01:00 (t1). No. 2 indicates that self-diagnosis information "02" (failure 2 in function 1) from self-diagnosis output computer a was received at 00:01:30 (t2).

FIG. 4(b) shows, based on the data of FIG. 4(a), an example of the data format that the self-diagnosis information receiving unit 19 of the failure detection computer 16 outputs to the reception memory unit 20 when it receives self-diagnosis information. The self-diagnosis information receiving unit 19 takes as input the self-diagnosis output computer ID and the self-diagnosis information transmitted from the self-diagnosis output computer, for example as shown in FIG. 2(b), together with the reception time data 23 output by the reception time measuring unit 22. From these inputs it creates data as in the example self-diagnosis information receiving unit data format shown in (1) of FIG. 4(b).

Next, for temporary storage in the reception memory unit 20, the data is divided into a reception memory address and reception memory data. As shown in (2) of FIG. 4(b), the reception memory address consists of a self-diagnosis output computer ID part and a TIME part, a bit that indicates the temporal order of the reception times.

As explained with reference to FIG. 3, failure detection compares the self-diagnosis information at reception time t2 with the self-diagnosis information at the preceding reception time t1. The TIME part is the bit that uses the reception time to distinguish between these entries for the same self-diagnosis output computer. For convenience of explanation, the address based on reception time t2 is called the NEW address, and the address based on the preceding reception time t1 is called the OLD address.

As shown in (3) of FIG. 4(b), the reception memory data consists of the self-diagnosis information and the reception time. Thus, for example, the data is “00t1” when self-diagnosis information indicating normal operation is received at reception time t1, and “02t2” when self-diagnosis information indicating that failure 2 has occurred in function 1 is received at reception time t2. For convenience of explanation, the reception memory data at the NEW address is called the NEW data, and the data at the OLD address is called the OLD data. The reception memory unit 20 stores the NEW data output from the self-diagnosis information receiving unit 19.
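The split into reception memory address and reception memory data described above can be illustrated with a minimal sketch. The names `ReceptionRecord`, `memory_address`, and `memory_data`, the TIME-bit values 0 and 1, and the use of strings such as "t1" for the reception time are assumptions made for illustration; the source specifies only that the address has an ID part and a TIME part and that the data is the self-diagnosis information followed by the reception time.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed encoding of the TIME part: the source says only that it is a bit
# distinguishing the earlier entry from the later one.
TIME_OLD = 0
TIME_NEW = 1


@dataclass(frozen=True)
class ReceptionRecord:
    """One received frame plus its reception time (hypothetical type)."""
    computer_id: str     # self-diagnosis output computer ID, e.g. "a"
    diagnosis: str       # self-diagnosis information, e.g. "00" or "02"
    reception_time: str  # reception time data, e.g. "t1"


def memory_address(computer_id: str, time_bit: int) -> Tuple[str, int]:
    """Reception memory address: ID part followed by TIME part."""
    return (computer_id, time_bit)


def memory_data(record: ReceptionRecord) -> str:
    """Reception memory data: self-diagnosis information, then reception time."""
    return record.diagnosis + record.reception_time
```

Under these assumptions, a normal report from computer a at t1 yields the data "00t1", matching the example in the text, and its NEW address is the pair ("a", 1).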

FIG. 5 is a diagram showing failure detection by comparison of the data for self-diagnosis output computer a stored in the reception memory unit 20. FIG. 6 is an example of a flowchart showing failure detection using the data of self-diagnosis output computer a temporarily stored in the reception memory unit 20. The processing is described below with reference to FIGS. 5 and 6. The description covers only self-diagnosis output computer a, but the same applies to self-diagnosis output computer b.

First, failure detection is started when the multiprocessor parallel processing system starts up (step ST101). At this point the data in the reception memory unit 20 is set to a suitable initial value, such as all bits 0 or all bits 1.

Next, as shown in FIG. 5, the NEW data from the preceding reception time in the reception memory is written over the data at the OLD address (the OLD data) and temporarily stored (step ST102 in FIG. 6).

Next, the received self-diagnosis information of self-diagnosis output computer a, combined with the data indicating its reception time, is written over the NEW data in the reception memory unit 20 and temporarily stored (step ST103 in FIG. 6).

Next, the temporarily stored NEW data and the OLD data, which is the NEW data from the preceding reception time, are read from the reception memory unit 20 (step ST104 in FIG. 6).

Next, the OLD data and the NEW data read from the reception memory unit 20 are compared (step ST105 in FIG. 6).
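The buffering in steps ST101 to ST104 can be sketched roughly as follows. This is an illustrative model only: the class name `ReceptionMemory`, the dictionary standing in for the reception memory unit 20, the "OLD"/"NEW" slot labels used in place of the TIME bit, and the initial value are all assumptions, not details given in the source.

```python
class ReceptionMemory:
    """Toy model of the reception memory: one OLD and one NEW cell per computer."""

    def __init__(self, computer_ids, initial="0000"):
        # ST101: every cell starts from a fixed initial value (assumed "0000").
        self.cells = {
            (cid, slot): initial
            for cid in computer_ids
            for slot in ("OLD", "NEW")
        }

    def shift(self, cid):
        # ST102: the previous NEW data overwrites the OLD data.
        self.cells[(cid, "OLD")] = self.cells[(cid, "NEW")]

    def store(self, cid, data):
        # ST103: newly received data overwrites the NEW data.
        # If nothing is received, this is simply not called.
        self.cells[(cid, "NEW")] = data

    def read_pair(self, cid):
        # ST104: read OLD and NEW for the comparison in ST105.
        return self.cells[(cid, "OLD")], self.cells[(cid, "NEW")]
```

In this model, a cycle in which no frame arrives performs only `shift`, so OLD and NEW end up identical; this is the condition that the comparison in step ST105 looks for.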

First, the case in which self-diagnosis output computer a is operating normally, shown in (1) of FIG. 5, is described. When self-diagnosis output computer a is operating normally, the OLD data and the NEW data differ (for example “00t1” and “00t2”) as in (1) of FIG. 5, so the result of step ST105 in FIG. 6 is NO.

In that case, it is determined whether the NEW data indicates an abnormality (step ST106 in FIG. 6). As shown in FIG. 4, whether an abnormality exists can be read from the self-diagnosis information, which is one of the elements making up the NEW data.

When self-diagnosis output computer a is operating normally, the NEW data indicates normal operation (for example, the “00” in “00t2”), so the result of step ST106 in FIG. 6 is NO. In that case, processing returns to step ST102 in FIG. 6, and the same processing loop is repeated thereafter.

Next, the case in which failure 2 occurs in function 1 of self-diagnosis output computer a, shown in (2) of FIG. 5, is described. In this case the OLD data is, for example, “00t1” and the NEW data is “02t2”, so processing is as described above up to the point where NO is selected in step ST105 in FIG. 6.

(2) of FIG. 5 shows the case in which failure 2 occurs in function 1 between reception times t1 and t2. The NEW data at reception time t2 therefore contains information indicating that failure 2 has occurred in function 1 (for example, the “02” in “02t2”), so the result of step ST106 in FIG. 6 is YES.

Which function has developed which abnormality is read from the self-diagnosis information in the NEW data (step ST107 in FIG. 6), and an indication that an abnormality has occurred in self-diagnosis output computer a is displayed (step ST108 in FIG. 6). For greater precision, the failed function of self-diagnosis output computer a and the nature of the failure may also be displayed.

Once the abnormality in self-diagnosis output computer a has been displayed, the failure detection processing is ended, for example by stopping all computers (step ST109 in FIG. 6).

Finally, the case in which the CPU of self-diagnosis output computer a fails between reception time t1 and reception time t2, shown in (3) of FIG. 5, is described. In this case, as shown in (3) of FIG. 5, the data is not updated at reception time t2, so the OLD data and the NEW data match (for example “00t1” and “00t1”), and YES is selected in step ST105 in FIG. 6.

In this case, it is judged that the self-diagnosis information has not been updated because the CPU of self-diagnosis output computer a has failed, for example by stopping (step ST110 in FIG. 6).

When the CPU of self-diagnosis output computer a is judged to have failed, an indication that an abnormality has occurred in self-diagnosis output computer a is displayed (step ST108 in FIG. 6). For greater precision, a CPU failure of self-diagnosis output computer a may be displayed. Once the abnormality in self-diagnosis output computer a has been displayed, the failure detection processing is ended, for example by stopping all computers (step ST109 in FIG. 6).
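The three cases of FIG. 5 and the decisions in steps ST105 to ST110 can be summarized in a short sketch. The function name `classify`, the returned labels, the two-character width of the self-diagnosis information, and the use of "00" as the normal code are assumptions drawn from the examples in the text rather than a definitive implementation.

```python
def classify(old_data: str, new_data: str,
             normal_code: str = "00", code_len: int = 2) -> str:
    """Classify one OLD/NEW pair along the lines of steps ST105 to ST110."""
    # ST105: identical OLD and NEW data means no update arrived.
    if old_data == new_data:
        # ST110: judged to be a CPU failure (e.g. the CPU has stopped).
        return "cpu_failure"
    # ST106: inspect the self-diagnosis information inside the NEW data.
    diagnosis = new_data[:code_len]
    if diagnosis != normal_code:
        # ST107: the code identifies the failed function and failure type.
        return "function_failure:" + diagnosis
    # Normal operation: the caller loops back to ST102.
    return "normal"
```

With the example values from FIG. 5, the pair ("00t1", "00t2") is classified as normal, ("00t1", "02t2") as a function failure with code "02", and ("00t1", "00t1") as a CPU failure.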

In this way, referencing and comparing the self-diagnosis information together with the reception time makes it possible to detect failures of the self-diagnosis output computers.

After a failure is detected, measures can be taken such as stopping processing on all computers and replacing the failed computer, or excluding the failed computer from the multiprocessor parallel processing system and continuing processing with the remaining computers.

Thus, transmitting the self-diagnosis information over the self-diagnosis information line 25 rather than the common bus 24 does not reduce the data transfer capacity of the common bus 24. In addition, because the self-diagnosis information line 25 does not incur the output delays that occur on the common bus 24, failures can be detected reliably.

Furthermore, providing the reception time measuring unit 22 inside the failure detection computer 16, appending the reception time data 23 to the received self-diagnosis information, and having the failure detection computer CPU 18 reference and compare the data makes it possible to detect not only functional failures of a self-diagnosis output computer but also failures in which the self-diagnosis information is not updated because the processor has failed.

FIG. 1 is a diagram showing a configuration example of a multiprocessor parallel processing system that implements the computer failure detection method according to Embodiment 1.
FIG. 2 is a diagram showing an example of a table of the self-diagnosis information output by a self-diagnosis output computer in Embodiment 1 and of a self-diagnosis information data frame.
FIG. 3 is a diagram explaining the use of reception time data for data processing in the self-diagnosis information receiving unit of the failure detection computer in Embodiment 1.
FIG. 4 is a diagram showing an example of the data output from the self-diagnosis information receiving unit to the reception memory unit in Embodiment 1.
FIG. 5 is a diagram showing failure detection by comparison of the data for a self-diagnosis output computer stored in the reception memory in Embodiment 1.
FIG. 6 is an example of a flowchart showing failure detection using the data of a self-diagnosis output computer temporarily stored in the reception memory unit in Embodiment 1.

Explanation of symbols

11. Self-diagnosis output computer
12. Common bus I/F unit
13. Self-diagnosis output computer CPU
14. Self-diagnosis information transmitting unit
15. Internal bus
16. Failure detection computer
17. Common bus I/F unit
18. Failure detection computer CPU
19. Self-diagnosis information receiving unit
20. Reception memory unit
21. Internal bus
22. Reception time measuring unit
23. Reception time data
24. Common bus
25. Self-diagnosis information line

Claims

1. A computer failure detection system comprising a plurality of self-diagnosis output computers, a failure detection computer, and a self-diagnosis information line connecting the plurality of self-diagnosis output computers to the failure detection computer, wherein
each self-diagnosis output computer creates and outputs self-diagnosis information indicating its own self-diagnosis result,
the self-diagnosis information line is connected to the self-diagnosis output computers and transmits the self-diagnosis information output from the self-diagnosis output computers, and
the failure detection computer is connected to the self-diagnosis information line, receives the self-diagnosis information transmitted over the self-diagnosis information line, and, based on the received self-diagnosis information, detects a failure of the self-diagnosis output computer that transmitted the self-diagnosis information.

2. The computer failure detection system according to claim 1, wherein
the failure detection computer includes a memory unit and a CPU,
the memory unit stores, when the self-diagnosis information output from a self-diagnosis output computer is received, data associating the reception time of the self-diagnosis information with the self-diagnosis information, and
the CPU detects a failure of the self-diagnosis output computer that transmitted the self-diagnosis information by referring to the data stored in the memory unit.

3. A computer failure detection method comprising:
a step of creating and outputting self-diagnosis information in each computer; and
a step of detecting, based on the output self-diagnosis information, a failure of the computer that transmitted the self-diagnosis information by referencing and comparing data associating the reception time at which the self-diagnosis information was received with the self-diagnosis information.

4. The computer failure detection method according to claim 3, wherein the data that is referenced and compared comprises data associating the reception time of the self-diagnosis information with the self-diagnosis information, and data associating the reception time at which self-diagnosis information was received one time step earlier with that earlier self-diagnosis information.