JPH03273344A

JPH03273344A - Fault tolerant system

Info

Publication number: JPH03273344A
Application number: JP2075011A
Authority: JP
Inventors: Yasue Takeshima; 竹島　康江
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1990-03-22
Filing date: 1990-03-22
Publication date: 1991-12-04

Abstract

PURPOSE: To improve reliability by diagnosing the cause of an error, using the program in the second program memory part, when majority voting detects an error, and then performing recovery suited to the type of error. CONSTITUTION: A processing block 3 consists of a first program memory part 11, a second program memory part 12 that stores the same program as the memory part 11, and a processing part 2 that processes input data according to the programs stored in these memory parts. The processed results of the blocks 3 are judged by majority vote. When an error is detected in a result produced by the program of the memory part 11, the block 3 executes the processing again based on the program in the memory part 12 and, under the control of a recovery control part 6, the cause of the error is diagnosed from that result. Reliability is thus improved by performing recovery suited to the type of error.

Description

DETAILED DESCRIPTION OF THE INVENTION

[Field of industrial application]

The present invention relates to a fault-tolerant system that performs error detection by N-version programming.

[Conventional technology]

N-version programming is a method in which different designers write N functionally equivalent programs from the same specification, and bugs are detected or masked by comparing the execution results or taking a majority vote over them (see J. Gray et al., "Fault Tolerant Systems", Japanese edition (October 15, 1986), McGraw-Hill Book, p. 377).

Conventionally, a fault-tolerant system that performs error detection by N-version programming has N processing blocks, one for each of the N programs. Each processing block has a program storage section that stores its program and a processing section that processes input data using the program held in the program storage section. An error detection unit checks the processing results of the processing blocks for errors. That is, when all the processing results match, each result is taken to be correct. When they do not match, an error is taken to have been detected, a majority vote among three or more voters is held, and the processing result with the largest number of matches is taken to be correct.
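The comparison and majority vote described above can be sketched as follows. This is a minimal illustration, not the patent's circuit: the function name `majority_vote` is hypothetical, results are assumed to be hashable values, and raising an exception when no single result has the most votes is an assumption, since the text does not say how ties are handled.

```python
from collections import Counter


def majority_vote(results):
    """Return (accepted_result, error_detected) for N processing results.

    All results equal      -> accepted as correct, no error detected.
    Results differ         -> error detected; the most frequent result wins.
    No unique most-frequent result (assumed case) -> ValueError.
    """
    ranked = Counter(results).most_common()
    value, top = ranked[0]
    # Tie handling is an assumption; the source text does not cover it.
    if len(ranked) > 1 and ranked[1][1] == top:
        raise ValueError("no unique majority among the results")
    error_detected = len(ranked) > 1
    return value, error_detected


# Three voters, one of which disagrees.
accepted, error = majority_vote([7, 9, 7])
```

With three voters, a single disagreeing block is outvoted and flagged, which is the error masking the conventional system already provides.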

[Problem to be solved by the invention]

The conventional fault-tolerant system described above only detects an error and locates it by a majority vote among three or more voters. Because it has no means of diagnosing the cause of the error, it has the drawback that recovery suited to the type of the detected error is not necessarily performed.

An object of the present invention is to provide a fault-tolerant system capable of diagnosing the cause of an error and performing recovery appropriate to the type of error.

[Means for solving the problems]

The fault-tolerant system of the present invention has N processing blocks, each having a first program storage section and a second program storage section that stores the same program as the one stored in the first program storage section, and a recovery control unit. When processing results have been obtained in each processing block by the program in its first program storage section and an error is detected in the processing result of one of the processing blocks, the recovery control unit causes that processing block to perform processing based on the program in its second program storage section and diagnoses the cause of the error from the result of that processing.

[Operation]

When an error is detected in the processing result of the program in the first program storage section, the recovery control unit diagnoses the cause of the error based on the program stored in the second program storage section.

[Example]

Next, an embodiment of the present invention will be described with reference to the drawings.

FIG. 1 is a block diagram showing an embodiment of the fault-tolerant system of the present invention, and FIG. 2 is a flowchart showing the processing of the recovery control unit 6 in the embodiment of FIG. 1.

This embodiment is a fault-tolerant system using 3-version programming. Each of the three processing blocks 3 consists of a first program storage section 11, a second program storage section 12 that stores the same program as the first program storage section 11, and a processing section 2 that processes input data using the programs in the program storage sections 11 and 12. The error detection unit 5 judges the processing results of the three processing blocks 3 by majority vote and, when an error is detected, sends an error detection signal to the recovery control unit 6. On receiving this signal, the recovery control unit 6 performs the diagnosis and recovery operations described later. The input data storage unit 4 holds the input data as data for diagnosing errors. The output data storage unit 7 holds the output data as data for diagnosing errors. When it is determined that the program in the program storage sections 11 and 12 of one of the processing blocks 3 contains a bug, that is, a latent error, the external monitoring device interface section 8 outputs a latent error notification signal to an external monitoring device (not shown), and writes the corrected program it receives into the program storage sections 11 and 12 of the processing block 3 in which the error was detected. The program is corrected on the external monitoring device, which consists of a host computer or the like.
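As a rough software model of the structure just described, each processing block can be pictured as a pair of program stores plus a selector. This is only a sketch under stated assumptions: the class `ProcessingBlock`, its fields, and the `patch` method are hypothetical names, programs are modeled as plain Python callables, and the three "versions" of a doubling function are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable

Program = Callable[[int], int]  # assumed model of a stored program


@dataclass
class ProcessingBlock:
    working: Program         # stands in for first program storage section 11
    spare: Program           # stands in for second program storage section 12
    use_spare: bool = False  # which store the processing section currently runs
    online: bool = True      # connected to the operation line or not

    def run(self, data: int) -> int:
        """Process input data with the currently selected program store."""
        program = self.spare if self.use_spare else self.working
        return program(data)

    def patch(self, fixed: Program) -> None:
        """Write a corrected program into both stores (external interface)."""
        self.working = fixed
        self.spare = fixed


# Three independently designed versions of the same specification (doubling).
versions = (lambda x: x * 2, lambda x: x + x, lambda x: x << 1)
blocks = [ProcessingBlock(working=v, spare=v) for v in versions]
outputs = [b.run(21) for b in blocks]
```

Each block holds the same program twice, so the spare adds no design diversity; its role is to separate storage faults from program bugs during diagnosis.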

Because the program can thus be rewritten from outside the fault-tolerant system, a bug that was missed when the program was written and tested can be corrected easily even if it is discovered during operation.

Next, the operation of the embodiment shown in FIG. 1 will be explained. It is assumed here that the first program storage section 11 is the one in service in each processing block 3.

The input data is held in the input data storage unit 4 and is also input to each processing block 3, where it is processed by three programs of the same specification but different design. The first processing results of the processing blocks 3 are compared by the error detection unit 5. If all the processing results match, they are taken to be correct (free of errors) and the matching result is output as the output data. If the processing results do not match, one of them is taken to be in error, and the error detection unit 5 takes a majority vote over the processing results.

The output data obtained by the majority vote is defined as the correct output data and is held in the output data storage unit 7. On receiving the error detection signal from the error detection unit 5, the recovery control unit 6 immediately disconnects from the operation line the processing block 3 that output a processing result different from the correct output data (that is, the block that malfunctioned), and starts diagnosis and recovery processing for the disconnected processing block 3. Meanwhile, the other processing blocks 3 continue normal operation.
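The isolation step can be illustrated with a small sketch. The function name `detect_and_isolate`, the use of integer block identifiers, and the set of in-service blocks are all assumptions made for this example, and a unique majority is assumed to exist.

```python
from collections import Counter


def detect_and_isolate(outputs, in_service):
    """Pick the majority output and split blocks into healthy and isolated.

    outputs:    dict mapping block id -> processing result
    in_service: set of block ids currently on the operation line
    Returns (correct_output, isolated_ids, still_in_service).
    Assumes one result holds a strict plurality of the votes.
    """
    # The majority result is defined as the correct output data.
    correct_output, _ = Counter(outputs.values()).most_common(1)[0]
    # Blocks that disagree are cut off for diagnosis and recovery.
    isolated = sorted(b for b, r in outputs.items() if r != correct_output)
    # The remaining blocks continue normal operation.
    still_in_service = in_service - set(isolated)
    return correct_output, isolated, still_in_service


correct, isolated, remaining = detect_and_isolate({0: 42, 1: 42, 2: 41}, {0, 1, 2})
```

Keeping the correct output alongside the saved input is what later lets the isolated block be re-run and checked without disturbing the blocks that remain in service.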

This diagnosis and recovery processing is performed according to the steps shown in FIG. 2.

First, the first program storage section 11 is initialized, for example by restarting its power. Next, the program in the second program storage section 12 (hereinafter, the spare) of the disconnected processing block 3 (hereinafter, the malfunctioning block) is copied into its first program storage section 11 (hereinafter, the working section). The working section is then activated and the same input data as at the time of the error is input to the malfunctioning block to obtain a second processing result (step 11). It is determined whether the second processing result equals the correct output data, that is, whether it is correct (step 12). If the second processing result is correct, the detected error is judged to be an accidental error caused by latch-up, a single event error, or the like (step 13); the working section is restored and the malfunctioning block is returned to the operation line (step 14). If the second processing result is not correct, the spare is activated and the same input data as at the time of the error is input to the malfunctioning block to obtain a third processing result (step 15). It is determined whether the third processing result equals the correct output data, that is, whether it is correct (step 16).

If the third processing result is correct, the fault is judged to be a hardware failure of the working section (step 17); the spare is put into service for subsequent processing and the malfunctioning block is returned to the operation line (step 18). If the third processing result is not correct, the program is judged to contain a bug, that is, a latent error (step 19), and a latent error notification signal is output to the external monitoring device via the external monitoring device interface section 8 (step 20). The corrected program is then written into both the working section and the spare via the external monitoring device interface section 8 (step 21), the working section is restored, and the malfunctioning block is returned to the operation line (step 22).
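The decision logic of steps 11 to 22 can be condensed into a short sketch. It is an illustration under assumptions rather than the recovery control unit itself: `diagnose`, `Diagnosis`, and the two callables are hypothetical names, `run_working` stands for a re-run on the working section after it has been initialized and reloaded from the spare, and `run_spare` stands for a run on the spare.

```python
from enum import Enum


class Diagnosis(Enum):
    ACCIDENTAL = "accidental error (e.g. latch-up, single event error)"
    HARDWARE = "hardware failure of the working section"
    LATENT = "latent error (program bug)"


def diagnose(run_working, run_spare, saved_input, correct_output):
    """Classify the cause of an error in an isolated block.

    run_working: callable re-running the reinitialized, reloaded working section
    run_spare:   callable running the spare section
    """
    # Steps 11-12: second processing result from the working section.
    if run_working(saved_input) == correct_output:
        return Diagnosis.ACCIDENTAL  # steps 13-14: resume on the working section
    # Steps 15-16: third processing result from the spare.
    if run_spare(saved_input) == correct_output:
        return Diagnosis.HARDWARE    # steps 17-18: continue on the spare
    # Steps 19-22: notify the external monitor and await a corrected program.
    return Diagnosis.LATENT


# Example: the working section stays wrong but the spare is right.
result = diagnose(lambda x: x * 2 + 1, lambda x: x * 2, 21, 42)
```

Because the spare holds the same program as the working section, a failure on both with the same input points at the program rather than at the hardware.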

Table 1 summarizes the classification of errors (error cases) made in steps 11 to 22 above.

Table 1

[Effects of the invention]

As explained above, when an error is detected by majority vote, the present invention diagnoses the cause of the error based on the program in the second program storage section. Recovery suited to the type of error can therefore be performed, which improves reliability.

[Brief description of the drawings]

FIG. 1 is a block diagram showing an embodiment of the fault-tolerant system of the present invention, and FIG. 2 is a flowchart showing the processing of the recovery control unit 6 in the embodiment of FIG. 1.

Claims

[Claims] 1. A fault-tolerant system that performs error detection by N-version programming, characterized by comprising: N processing blocks, each having a first program storage section and a second program storage section that stores the same program as the one stored in the first program storage section; and a recovery control unit that, when a processing result has been obtained in each processing block by the program in its first program storage section and an error is detected in the processing result of one of the processing blocks, causes that processing block to perform processing based on the program in its second program storage section and diagnoses the cause of the error from the result of that processing.