JPH0973401A - Fault-tolerant system

Info

Publication number: JPH0973401A
Application number: JP7230251A
Authority: JP
Inventors: Noriaki Uchino; 則彰内野; Shigetaka Okina; 茂孝翁; Tatsuya Morikawa; 達也森川; Atsushi Funayama; 敦舩山
Original assignee: Seiko Instruments Inc
Current assignee: Seiko Instruments Inc
Priority date: 1995-09-07
Filing date: 1995-09-07
Publication date: 1997-03-18

Abstract

PROBLEM TO BE SOLVED: To enable another computer unit to take over processing when one of the computer units in a system fails. SOLUTION: In a data management system having a plurality of computer units 1-4, each of the computer units 1-4 has two or more communication ports A and B, each of which can be connected to another computer unit; adjacent computer units are electrically connected to each other through their communication ports so that all the computer units 1-4 are connected into one system.

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a multiprocessor-type fault-tolerant computer.

[0002]

[Description of the Related Art] FIG. 2 is an example of a conceptual diagram of a conventional multiprocessor-type fault-tolerant computer. In this multiprocessor, a plurality of computer units of exactly the same configuration are provided independently and are connected to one another via the common area of each unit.

[0003] The units run asynchronously on separate clocks, each executing a different task. In this multiprocessor, fault tolerance is achieved in three steps: mutual monitoring, isolation of the faulty part, and backup.

[0004] In the mutual monitoring step, each unit, while executing its own task, periodically monitors the operating state of the other units and outputs the monitoring result to a backup processing circuit; the units monitor one another in this way. In the next step, the backup processing circuit determines which units are normal and which are abnormal based on the monitoring results from each unit. It outputs a stop signal to each abnormal unit to disconnect it from the system, and outputs to each normal unit operation information indicating which units are currently running.

[0005] Each normal unit then determines and executes the task it should perform based on the operation information received from the backup processing circuit. Tasks are assigned so that no task is duplicated or omitted among the units. The task of a unit that has been found abnormal and disconnected from the system is backed up by a normal unit. A normal unit may therefore execute several tasks, depending on the situation.

[0006] In addition, as shown in FIG. 3, each computer unit is electrically connected to the device that it controls.

[0007]

[Problems to Be Solved by the Invention] As described above, in the prior art each computer unit monitors the others. As the number of computer units increases, the mutual monitoring logic therefore becomes more complex, the CPU load spent on mutual monitoring grows, and less time remains for the processing that the computer system is actually meant to perform.

[0008] Furthermore, as shown in FIG. 3, when the computer unit to which a device is connected fails, control of the device is lost and the device itself becomes inoperable.

[0009]

[Means for Solving the Problems] To solve the above problems, the system of the present invention is composed, as shown in FIG. 4, of a plurality of computer units each having at least two communication ports, namely communication port A44 and communication port B45. As shown in FIG. 5, each communication port of a computer unit is connected to a different computer unit, so that all the computer units are electrically connected to one another through their communication ports.

[0010] Every computer unit in the system has at least one other computer unit that can perform the same processing on its behalf. If a failure occurs in any computer unit in the system, another computer unit in the system performs the processing that the failed unit should have performed until the failed unit has been recovered.

[0011] Furthermore, this configuration includes a reference computer unit that detects and manages failure information for the entire system, and a sub computer unit that supports the reference computer unit. The reference computer unit or the sub computer unit detects failures in the system using a failure detection command which, as shown in FIG. 6, contains header information 61 including the management number of the failure detection command and a computer unit address, failure information 62 for each computer unit, and failure information 63 for the entire system.

[0012] Meanwhile, as shown in FIG. 1, a device connected to the system is connected in the same way to at least two computer units that can each take over control of the device. When the computer unit operating the device fails and can no longer control it, another computer unit capable of substituting operates the device.

[0013]

[Embodiments of the Invention] Preferred embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 is a schematic explanatory diagram of a fault-tolerant system according to the present invention. In this drawing, the reference computer unit 1 is composed of, for example, a CPU 41, a ROM 42, a RAM 43, a communication port A44, and a communication port B45. The two communication ports of the computer unit 1 are connected to the sub computer unit 2 and the device control computer unit A3, respectively. The sub computer unit 2 has the same configuration as the reference computer unit 1, and its two communication ports are connected to the reference computer unit 1 and the device control computer unit B4, respectively.

[0014] The device control computer unit A3, which is connected to a communication port of the reference computer unit 1, is connected to the device control computer unit B4, which is connected to the sub computer unit 2. The reference computer unit 1 and the sub computer unit 2 mainly perform data processing, and each can take over the functions of the other.

[0015] That is, when a failure occurs in the reference computer unit 1, the sub computer unit 2 performs the processing of the entire system in place of the reference computer unit 1 until the failure is recovered. The device control computer units A3 and B4 mainly control the device 5. As shown in FIG. 1, the device 5 is connected to both of these computer units simultaneously, so that it can be controlled from either of them.

[0016] Likewise, when a failure occurs in the device control computer unit A3, the device control computer unit B4 takes over the processing that unit A3 should perform until the failure is recovered. In addition to its normal system processing, the reference computer unit 1 performs failure detection within the system.
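
The takeover rule for the dually connected device 5 can be sketched as a simple selection among the units wired to it. This is only an illustration: the function name `controlling_unit` and the idea of an ordered candidate list with a set of failed units are assumptions, not something the description specifies.

```python
def controlling_unit(candidates, failed):
    """Return the first unit in `candidates` that is not in `failed`, or None.

    candidates -- units wired to the device, in order of preference (assumed)
    failed     -- set of units currently known to have failed
    """
    for unit in candidates:
        if unit not in failed:
            return unit
    return None


# Device 5 is wired to both device control units of FIG. 1.
DEVICE_5_UNITS = ["A3", "B4"]
normal_case = controlling_unit(DEVICE_5_UNITS, failed=set())
after_a3_fails = controlling_unit(DEVICE_5_UNITS, failed={"A3"})
```

With both units healthy the sketch selects A3; once A3 is marked as failed, it selects B4, which matches the substitution described above.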

[0017] As shown in FIG. 6, the system's failure detection command consists of a header portion 61, failure information 62 managed by each computer unit, and failure information 63 managed by the reference computer unit. The header portion 61 contains the address of the computer unit that issued the failure detection command, the frame management number of the command, and so on.
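
The command layout of FIG. 6 can be pictured as a small record. The following is a minimal sketch, not the patent's implementation: the class and field names (`Header`, `FailureDetectionCommand`, `issuer_address`, `frame_number`) and the use of dictionaries keyed by unit address are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Header:
    """Header portion 61: who issued the command and which frame it is."""
    issuer_address: int   # address of the issuing computer unit
    frame_number: int     # frame management number of this command


@dataclass
class FailureDetectionCommand:
    header: Header
    # Portion 62: failure information written by each unit, keyed by address.
    unit_failures: dict = field(default_factory=dict)
    # Portion 63: failure information managed by the reference unit.
    system_failures: dict = field(default_factory=dict)

    def has_failure(self) -> bool:
        """True if any unit wrote a non-empty failure report into portion 62."""
        return any(self.unit_failures.values())


cmd = FailureDetectionCommand(Header(issuer_address=1, frame_number=7))
cmd.unit_failures[2] = []                # unit 2 reports nothing
cmd.unit_failures[3] = ["send failed"]   # unit 3 reports a failure
```

Keeping portions 62 and 63 as separate fields mirrors the distinction the description draws between what each unit reports and what the reference computer unit manages.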

[0018] Next, the processing performed when a computer unit other than the reference computer unit receives a failure detection command is described with reference to FIG. 7. When such a unit receives a failure detection command in step 71, it writes the failure information that it manages into the received command in step 72. In step 73 it writes the updated command to the communication port other than the one on which the command was received, thereby sending it to another computer unit. In step 74 it checks whether the transmission succeeded; if so, it waits for the next failure detection command.

[0019] In this way, once the command has passed through every computer unit in the computer system and each has been confirmed free of failure, the failure detection command issued by the reference computer unit returns to the reference computer unit. If, however, a computer unit other than the reference computer unit fails to transmit the failure detection command, that unit writes transmission failure information into the command in step 75 and, in step 76, sends the command carrying that information out through the communication port on which it originally received the command.
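
Steps 71 to 76 of FIG. 7 can be sketched as one function. This is a hedged illustration under several assumptions: the command is modelled as a plain dictionary, ports are labelled "A" and "B", and `send` is a hypothetical callable standing in for the real port driver, returning True on success.

```python
def relay_command(command, own_failures, unit_address, received_port, send):
    """Handle one received failure detection command (steps 71 to 76).

    command        -- dict with a "unit_failures" mapping (portion 62)
    own_failures   -- failure information this unit manages
    received_port  -- "A" or "B", the port the command arrived on
    send           -- callable(port, command) -> bool, True on success
    Returns the port the command finally left on.
    """
    # Step 72: write this unit's own failure information into the command.
    command["unit_failures"][unit_address] = list(own_failures)

    # Step 73: forward on the port the command did NOT arrive on.
    forward_port = "B" if received_port == "A" else "A"
    if send(forward_port, command):      # Step 74: did the send succeed?
        return forward_port

    # Step 75: record the transmission failure in the command.
    command["unit_failures"][unit_address].append(
        "send failed on port " + forward_port)
    # Step 76: return the command through the port it arrived on.
    send(received_port, command)
    return received_port


# Example: port B is broken, so the command is turned back through port A.
sent = []

def broken_port_b(port, command):
    sent.append(port)
    return port != "B"

cmd = {"unit_failures": {}}
out = relay_command(cmd, [], 3, "A", broken_port_b)
```

In the example the unit first tries port B, records the failed send, and then returns the command through port A, so the reference computer unit still receives the failure information.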

[0020] As a result, the failure detection command is fed back to the reference computer unit carrying the failure information of each computer unit. Next, the operation of the reference computer unit 1 with respect to the failure detection command is described based on the flowchart of FIG. 8.

[0021] First, in step 81 the reference computer unit writes the information needed for failure detection into the failure detection command. In step 82 it sends the command to its own communication port A3 so that the command is passed in turn through each computer unit in the system. In step 83 it receives back the failure detection command that it issued, after the command has made one circuit of the system. If step 84 finds no failure information in the received command, the unit again waits to receive a failure detection command.

[0022] If, on the other hand, the received command contains failure information, step 85 checks whether the failure location can be identified. If it can, the process proceeds to step 90, where the failure location is identified, and failure processing is performed in step 91.

[0023] If the failure location cannot be identified from the received failure information, step 86 prepares a new failure detection command with a different management number. In step 87 this command is sent to the communication port other than the one used for the previous transmission, and after it has made one circuit of the system it is received again in step 88.

[0024] In step 89, if the failure detection command contains no failure information, the failure is judged to have already cleared, and the reference computer unit returns to periodically preparing the original failure detection command and sending it to its communication port A. If, on the other hand, the command received in step 88 again contains failure information, the failure location is identified in step 90 and failure processing is performed in step 91.
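
The flow of FIG. 8 (steps 81 to 91) can be sketched as a single round of detection. This is an illustrative sketch only: `circulate`, `locate_fault`, and `handle_fault` are hypothetical callables standing in for the real transmission, diagnosis, and recovery logic, and the frame numbers 1 and 2 are arbitrary placeholders for two different management numbers.

```python
def run_detection_round(circulate, locate_fault, handle_fault):
    """One round of the reference unit's flow (steps 81 to 91).

    circulate(port, frame_number) -> dict of failure information carried
        back by the command after one circuit (empty if none)
    locate_fault(info) -> faulty location, or None if it cannot be identified
    handle_fault(location) -> performs the failure processing
    Returns the handled location, or None if no failure was found.
    """
    # Steps 81-83: prepare the command, send it from port A, receive it back.
    info = circulate("A", 1)
    if not info:                        # Step 84: no failure information
        return None

    location = locate_fault(info)       # Step 85: can the fault be located?
    if location is None:
        # Steps 86-88: new management number, send from the other port.
        info = circulate("B", 2)
        if not info:                    # Step 89: failure already cleared
            return None
        location = locate_fault(info)   # Step 90: identify the location

    handle_fault(location)              # Step 91: failure processing
    return location
```

Sending the second command from the opposite port lets the reference unit probe the system from the other direction, which is how the description narrows down a fault that the first pass could not locate.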

[0025] Next, the operation of the sub computer unit 2 with respect to the failure detection command is described based on the flowchart of FIG. 9. In step 92 the sub computer unit 2 constantly monitors for the periodic reception of the failure detection command. If it has received no failure detection command for a prescribed time or longer, it sends, in step 94, a confirmation command to the reference computer unit 1 to check whether the reference computer unit 1 is operating normally.

[0026] If no response to the confirmation command is received from the reference computer unit 1 within a prescribed time, the sub computer unit judges that the reference computer unit 1 has failed, and in step 98 the sub computer unit 2 performs the processing for a failure of the reference computer unit 1.

[0027] If, on the other hand, the sub computer unit receives a response to the confirmation command, it analyzes the received data in step 96 and, in step 97, judges from that data whether the reference computer unit 1 is normal. If it is, the sub computer unit again waits to receive a failure detection command.

[0028] If in step 97 the received data indicates a failure of the reference computer unit 1, the sub computer unit 2 performs, in step 98, the processing for a failure of the reference computer unit 1, just as when no response arrives from the reference computer unit within the prescribed time.
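
The sub computer unit's decision in FIG. 9 (steps 92 to 98) amounts to a watchdog with a confirmation query. The sketch below is an assumption-laden illustration: the timing values, the `send_confirmation` callable, and the `"normal"` key in the reply are all hypothetical.

```python
def check_reference_unit(seconds_since_last_command, timeout, send_confirmation):
    """Decide whether the sub computer unit must take over (steps 92 to 98).

    send_confirmation() -> the reply from the reference unit as a dict,
        or None if no reply arrives within the prescribed time
    Returns "wait" to keep listening, or "take_over" for step 98.
    """
    # Step 92: commands are still arriving on schedule, nothing to do.
    if seconds_since_last_command < timeout:
        return "wait"

    # Step 94: ask the reference unit whether it is operating normally.
    reply = send_confirmation()
    if reply is None:
        return "take_over"      # no reply in time: treat as failed

    # Steps 96-97: analyse the reply and judge the reference unit's state.
    if reply.get("normal", False):
        return "wait"
    return "take_over"          # the reply itself reports a failure
```

Both failure paths (no reply, or a reply that reports a failure) lead to the same step 98 outcome, as the description states.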

[0029] Before sending the failure detection command, the reference computer unit 1 writes the computer system failure information that it manages into the "failure information managed by the reference computer unit" area 63 of the command.

[0030] A computer unit other than the reference computer unit can at any time compare the "failure information managed by the reference computer unit" 63 in the failure detection command with the information that it manages itself. If the two differ, the computer unit can perform failure processing or recovery processing.
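
The comparison described here can be sketched as a difference between two views of the system. The description does not say how the information is encoded, so the dictionaries keyed by unit name and the function name `mismatched_entries` are purely assumptions for illustration.

```python
def mismatched_entries(reference_view, own_view):
    """Return the sorted keys whose values differ between the two views.

    reference_view -- portion 63 as carried in the failure detection command
    own_view       -- the information this unit manages itself
    """
    keys = set(reference_view) | set(own_view)
    return sorted(k for k in keys if reference_view.get(k) != own_view.get(k))


reference_view = {"unit 3": "failed", "unit 4": "ok"}
own_view = {"unit 3": "ok", "unit 4": "ok"}
differences = mismatched_entries(reference_view, own_view)
```

A non-empty result would be the trigger for the failure processing or recovery processing mentioned above; an empty list means the two views agree.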

[0031] As described above, the computer system can automatically detect failures within itself. When a failure is detected, another computer unit capable of substituting for the failed unit performs the processing that the failure has made impossible until the failure is recovered, which makes it possible to realize a fault-tolerant system.

[0032] Next, a second embodiment of the present invention is described in detail with reference to FIG. 10. In FIG. 10, every computer unit in the computer system comprises at least a CPU 41, a ROM 42, a RAM 43, a communication port A44, and a communication port B45, as shown in FIG. 4. The two communication ports of the reference computer unit 101 are connected to the sub computer unit 102 and the computer unit A103, respectively; similarly, the two communication ports of the sub computer unit 102 are connected to the reference computer unit 101 and the computer unit a104.

[0033] The computer unit A103 connected to the reference computer unit 101 is further connected to the computer unit B105, which is connected to the computer unit C107, which is connected to the computer unit c108, which is connected to the computer unit b106, which in turn is connected to the computer unit a104.

[0034] The reference computer unit 101 and the sub computer unit 102 can each take over the other's processing. The same holds for the computer units A103 and a104, for B105 and b106, and for C107 and c108.
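
The wiring and substitution pairs of the second embodiment can be written down as data. The ring order and the pairs below follow the description of FIG. 10; the list and dictionary representation and the helper names `neighbours` and `substitute_for` are assumptions made for this sketch.

```python
# Ring order taken from the description of FIG. 10.
RING = [101, 103, 105, 107, 108, 106, 104, 102]

# Pairs of units that can take over each other's processing.
SUBSTITUTES = {101: 102, 103: 104, 105: 106, 107: 108}
SUBSTITUTES.update({b: a for a, b in list(SUBSTITUTES.items())})


def neighbours(unit):
    """Return the two units wired to `unit`'s communication ports."""
    i = RING.index(unit)
    return RING[i - 1], RING[(i + 1) % len(RING)]


def substitute_for(unit, failed):
    """Return the partner that takes over for `unit`, or None if it failed too."""
    partner = SUBSTITUTES[unit]
    return None if partner in failed else partner
```

For example, `neighbours(101)` yields units 102 and 103, matching the two ports of the reference computer unit, and `substitute_for(105, {105})` yields 106.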

[0035] In addition to the system's normal processing, the reference computer unit 101 performs failure detection within the system. As shown in FIG. 6, the failure detection command used in the system consists of a header portion 61, failure information 62 managed by each computer unit, and failure information 63 managed by the reference computer unit. The header portion 61 contains the address of the computer unit that issued the failure detection command, the frame management number of the command, and so on.

[0036] As shown in the flowchart of FIG. 7, when a computer unit other than the reference computer unit 101 receives a failure detection command from the reference computer unit in step 71, it writes the failure information that it manages into the received command in step 72. In step 73 it writes the updated command to the communication port other than the one on which the command was received and sends it to another computer unit.

[0037] If this transmission succeeds, the unit returns to waiting for a failure detection command from the reference computer unit. When every computer unit has likewise been confirmed free of failure, the failure detection command issued by the reference computer unit passes through all the computer units in the computer system and is then fed back to the reference computer unit.

[0038] If, however, a computer unit other than the reference computer unit 101 fails to transmit the failure detection command, it writes the transmission failure information into that command in step 75 and, in step 76, sends the command to the communication port on which it was most recently received.

[0039] The failure detection command thus returns to the reference computer unit carrying the failure information of each computer unit. FIG. 8 is a flowchart of the operation of the reference computer unit 101 with respect to the failure detection command.

[0040] In this flowchart, step 81 writes the information needed for failure detection into the failure detection command. Step 82 then sends the failure detection command to the communication port A3.

[0041] In step 83, the transmitted failure detection command is received again by the reference computer unit, carrying the failure information from within the system. If the received command contains no failure information, the unit again waits to receive a failure detection command. If, on the other hand, step 84 finds failure information in the received command, step 85 checks whether the failure location can be identified.

[0042] If the failure location can be identified, it is identified in step 90 and failure processing is performed in step 91. If it cannot be identified from the information available so far, step 86 prepares a new failure detection command with a different management number, step 87 sends this command to the communication port other than the one used for the previous transmission, and the command is received again in step 89.

[0043] If this failure detection command contains no failure information, the failure is judged to have already cleared, and the failure detection command is again prepared periodically and sent to the communication port A. If the received command again contains failure information, the failure location is identified in step 90 and failure processing is performed in step 91.

[0044] Meanwhile, as shown in the flowchart of FIG. 9, in step 92 the sub computer unit 102 constantly monitors for the periodic reception of the failure detection command. In step 94, if the sub computer unit 102 has received no failure detection command for a prescribed time or longer, it sends a command to the reference computer unit 101 to confirm that the reference computer unit 101 is operating normally.

[0045] If no response to the confirmation command is received within the prescribed time, the sub computer unit 102 judges that the reference computer unit 101 has failed and, in step 98, performs the processing for a failure of the reference computer unit 101. If a response to the confirmation command is received, the received data is analyzed in step 96, and step 97 judges from that data whether the reference computer unit 101 is normal. If the reference computer unit 101 is normal, the sub computer unit again waits to receive a failure detection command.

[0046] If the received data indicates a failure of the reference computer unit 101, the sub computer unit 102 performs, in step 98, the processing for a failure of the reference computer unit 101. Before sending the failure detection command, the reference computer unit 101 writes the computer system failure information that it manages into the "failure information managed by the reference computer unit" area of the command.

[0047] A computer unit other than the reference computer unit can at any time compare the "failure information managed by the reference computer unit" in the failure detection command with the information that it manages itself. If the two differ, the computer unit can perform failure processing or recovery processing.

[0048] As shown above, this computer system can automatically detect a failure in any of its computer units. When a failure is detected in the computer system, another computer unit capable of substituting for the failed unit performs the processing that the failed unit can no longer execute until the failure is recovered, thereby realizing a fault-tolerant system.

[0049] FIG. 11 is a schematic explanatory diagram of a third embodiment of a fault-tolerant system to which the present invention is applied. This embodiment is intended for in-store systems in the restaurant industry and similar settings. Every computer unit in the system has at least the configuration shown in FIG. 4, and the two communication ports of the reference computer unit 111 are connected to the two communication ports of the sub computer unit 112, respectively.

[0050] The device 113 connected to the computer system is connected within the connection line between the computer units, so that it can be controlled from either the reference computer unit 111 or the sub computer unit 112. Furthermore, the reference computer unit 111 and the sub computer unit 112 can each take over the other's processing, so that even if the reference computer unit 111 fails, the sub computer unit can perform its processing.

【００５１】そして、前記基準コンピュータユニット１
１１は、システムの本来の処理と共にシステム内の故障
検出を行っている。本実施例においては、システムを構
成するコンピュータユニットが基準コンピュータユニッ
トとサブコンピュータユニットの２つであるため、図１
２に示すように故障検出コマンドは、少なくとも該コマ
ンドが故障検出コマンドであることを示す情報が含まれ
るヘッダー部と、前記ヘッダー部に含まれない故障情報
などを含んだ部分によって構成されている。The reference computer unit 111 performs failure detection within the system alongside the system's normal processing. In this embodiment the system consists of only two computer units, the reference computer unit and the sub computer unit, so the failure detection command, as shown in FIG. 12, consists of a header part containing at least information indicating that the command is a failure detection command, and a part containing failure information and other data not included in the header part.
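
As a rough illustration of the two-part format of FIG. 12, the following Python sketch models a failure detection command as a header part 121 plus a part holding the other information 122. The class and field names (`Header`, `is_failure_detection`, `failure_info`) and the use of a list of strings are assumptions made for illustration; the specification does not define a concrete data layout.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Header:
    # Header part 121: holds at least the fact that this command is a
    # failure detection command (the field name is an assumption).
    is_failure_detection: bool = True


@dataclass
class FailureDetectionCommand:
    header: Header = field(default_factory=Header)
    # Other information 122: failure information not held in the header.
    # A list of strings is an assumed encoding, not the patent's.
    failure_info: List[str] = field(default_factory=list)

    def reports_failure(self) -> bool:
        """True when the command carries any failure information."""
        return bool(self.failure_info)


# A fresh command from the reference unit, and a reply carrying a fault.
cmd = FailureDetectionCommand()
reply = FailureDetectionCommand(failure_info=["port B send failed"])
```

Because only two units exist in this embodiment, the sketch omits any unit address from the header; a larger ring would presumably need one.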

【００５２】前記サブコンピュータユニット１１２は、
前記基準コンピュータユニット１１１から故障検出コマ
ンドを受信した際、通信ポートＡ４４、または通信ポー
トＢ４５を介して前記基準コンピュータユニット１１１
に応答信号を送信する。そして、どちらかの送信に失敗
した場合には、２つの通信ポートのうちの別の通信ポー
トを介して前記送信失敗情報を基準コンピュータユニッ
ト１１２に送信する。When the sub computer unit 112 receives a failure detection command from the reference computer unit 111, it transmits a response signal to the reference computer unit 111 via communication port A 44 or communication port B 45. If transmission on either port fails, it transmits the transmission failure information to the reference computer unit 111 via the other of the two communication ports.
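
The port fallback described in paragraph [0052] might be sketched as follows. This is a minimal sketch under assumptions: the function name `respond`, the `TX_FAIL:` message prefix, and the choice to try port A first are all hypothetical, and each port is modeled as a callable that returns True when the transmission succeeded.

```python
from typing import Callable, Dict, Optional

# A port is modeled as a function that returns True on successful send.
Send = Callable[[bytes], bool]


def respond(ports: Dict[str, Send], response: bytes) -> Optional[str]:
    """Send the response on one port; on failure, report via the other.

    Returns the name of the port that delivered a message, or None if
    both ports failed.
    """
    first, second = "A", "B"  # trying A first is an assumption
    if ports[first](response):
        return first
    # Transmission failed: send transmission failure information
    # through the other communication port.
    notice = b"TX_FAIL:" + first.encode()
    if ports[second](notice):
        return second
    return None
```

When both ports fail, the sub unit has no path left to the reference unit; in that case the reference unit would presumably notice through its own response timeout.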

【００５３】基準コンピュータユニット１１１が前記サ
ブコンピュータユニット１１２からの送信失敗情報を受
信した時、あるいは前記サブコンピュータユニット１１
２からの応答の中に故障情報が含まれていた時、あるい
はこれらの応答が規定時間内に返って来なかった時、ま
たは別に故障情報を受信した時には故障処理を行う。The reference computer unit 111 performs failure processing when it receives transmission failure information from the sub computer unit 112, when a response from the sub computer unit 112 contains failure information, when a response does not return within the specified time, or when it receives failure information by some other route.
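
The four conditions of paragraph [0053] can be expressed as a single predicate. The sketch below is illustrative only: the function name, the dictionary keys `tx_failed` and `failure_info`, and the use of seconds for the time limit are assumptions, not part of the specification.

```python
from typing import Optional


def needs_failure_processing(response: Optional[dict],
                             elapsed_s: float,
                             timeout_s: float,
                             separate_failure_info: bool = False) -> bool:
    """Decide whether the reference unit should start failure processing."""
    # Failure information received by some other route.
    if separate_failure_info:
        return True
    # No response came back within the specified time.
    if elapsed_s > timeout_s:
        return True
    # Still inside the time limit and nothing received yet: keep waiting.
    if response is None:
        return False
    # The sub unit reported a transmission failure.
    if response.get("tx_failed"):
        return True
    # The response itself contains failure information.
    return bool(response.get("failure_info"))
```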

【００５４】同時に、前記サブコンピュータユニット１
１２は基準コンピュータユニット１１１からの故障検出
コマンドの受信を定期的に監視しており、ある一定の規
定時間以上、前記サブコンピュータユニット１１２で故
障検出コマンドを受信できなかった場合には、前記サブ
コンピュータユニット１１２は基準コンピュータユニッ
ト１１１に向けて基準コンピュータユニット１１１が正
常であるか否かを確認するためのコマンドを送信する。At the same time, the sub computer unit 112 periodically monitors reception of the failure detection command from the reference computer unit 111. If the sub computer unit 112 receives no failure detection command for a specified time or longer, it sends the reference computer unit 111 a command to confirm whether the reference computer unit 111 is operating normally.

【００５５】そして、サブコンピュータユニットが送信
した確認コマンドに対する基準コンピュータユニットか
らの応答が規定時間以内に受信できなかった場合には、
サブコンピュータユニットは前記基準コンピュータユニ
ット１１１が故障していると判断し、サブコンピュータ
ユニット１１２は基準コンピュータユニット１１１故障
時の処理を行う。If no response from the reference computer unit to the confirmation command sent by the sub computer unit is received within the specified time, the sub computer unit determines that the reference computer unit 111 has failed, and the sub computer unit 112 performs the processing for a failure of the reference computer unit 111.

【００５６】一方、確認コマンドを受信した場合には、
図４に示すフローチャート内のステップ９６において受
信したデータを解析し、さらにステップ９７で該受信デ
ータの情報から前記基準コンピュータユニット１１１が
正常かどうかを判断し、基準コンピュータユニット１１
１が正常であると判断されれば再び前記故障検出コマン
ドの受信を待つこととなる。On the other hand, when a response to the confirmation command is received, the received data is analyzed in step 96 of the flowchart shown in FIG. 4, and in step 97 it is determined from the received data whether the reference computer unit 111 is normal. If the reference computer unit 111 is determined to be normal, the sub computer unit again waits to receive the failure detection command.

【００５７】また、受信データの解析内容が前記基準コ
ンピュータユニット１１１の故障を示していれば、サブ
コンピュータユニット１１２は前記基準コンピュータユ
ニット１１１故障時の処理を行う。以上のように、本発
明によれば、コンピュータシステム内の故障を自動的に
検出する事が可能であり、もしシステム内に故障を検出
した場合には故障が回復するまでの間、故障したコンピ
ュータユニットの実行不能となった機能を他のコンピュ
ータユニットが代行処理することにより、フォールトト
レラントシステムを実現した。If the analysis of the received data indicates a failure of the reference computer unit 111, the sub computer unit 112 performs the processing for a failure of the reference computer unit 111. As described above, according to the present invention, failures in the computer system can be detected automatically. When a failure is detected in the system, another computer unit performs, until the failure is repaired, the functions that the failed computer unit can no longer execute, and a fault-tolerant system is thereby realized.
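
The monitoring behavior of the sub computer unit in paragraphs [0054] to [0057] can be summarized as a small decision function. This is a sketch under stated assumptions: the names `Action` and `sub_unit_decide`, the single shared time limit, and the boolean `reference_ok` standing in for the analyzed reply are all hypothetical simplifications.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    WAIT = "keep waiting"
    SEND_CONFIRM = "send confirmation command to the reference unit"
    TAKE_OVER = "perform reference-unit-failure processing"


def sub_unit_decide(confirm_sent: bool, waited_s: float, limit_s: float,
                    reference_ok: Optional[bool] = None) -> Action:
    """One decision step of the sub unit's monitoring of the reference unit."""
    if not confirm_sent:
        # Phase 1: watch for the periodic failure detection command.
        return Action.SEND_CONFIRM if waited_s >= limit_s else Action.WAIT
    if reference_ok is None:
        # Phase 2: confirmation sent, no reply yet.
        return Action.TAKE_OVER if waited_s >= limit_s else Action.WAIT
    # Phase 2: reply received and analyzed.
    return Action.WAIT if reference_ok else Action.TAKE_OVER
```

After a reply showing the reference unit to be normal, the caller would reset `confirm_sent` and the timer and return to watching for the failure detection command.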

【００５８】[0058]

【発明の効果】以上説明したように、本発明によればシ
ステムの故障検出に必要なＣＰＵの負荷がコンピュータ
システムを構成するコンピュータユニットの数に依存し
ないために、コンピュータユニットの数が増加してもそ
の相互監視のために本来のコンピュータシステムの処理
能力が低下することなくフォールトトレラントシステム
を実現できる。As described above, according to the present invention, the CPU load required for detecting system failures does not depend on the number of computer units constituting the computer system. A fault-tolerant system can therefore be realized in which, even as the number of computer units increases, mutual monitoring does not reduce the processing capacity available for the computer system's normal work.

【００５９】さらにシステム内のデバイスを制御するコ
ンピュータユニットが故障しても、その処理の代行が可
能な他のコンピュータユニットが存在するため、コンピ
ュータユニットの故障によるシステム機能の停止をなく
すことができる等の効果を有する。Furthermore, even if a computer unit that controls a device in the system fails, another computer unit capable of taking over its processing exists, so the invention has the further effect that system functions are not halted by the failure of a computer unit.

[Brief description of drawings]

【図１】フォールトトレラントシステム構成１を示す図
である。FIG. 1 is a diagram showing a fault tolerant system configuration 1.

【図２】従来のフォールトトレラントコンピュータの概
念図である。FIG. 2 is a conceptual diagram of a conventional fault tolerant computer.

【図３】従来のデバイスの接続図である。FIG. 3 is a connection diagram of a conventional device.

【図４】コンピュータユニットの機能ブロック図であ
る。FIG. 4 is a functional block diagram of a computer unit.

【図５】システムブロック図である。FIG. 5 is a system block diagram.

【図６】故障検出コマンドのフォーマット１を示す図で
図である。FIG. 6 is a diagram showing a format 1 of a failure detection command.

【図７】基準コンピュータユニット以外のコンピュータ
ユニットにおける故障検出コマンドの処理を示す図でで
ある。FIG. 7 is a diagram showing processing of a failure detection command in a computer unit other than the reference computer unit.

【図８】基準コンピュータユニットにおける故障検出方
法のフローチャートである。FIG. 8 is a flow chart of a failure detection method in a reference computer unit.

【図９】サブコンピュータユニットにおける基準コンピ
ュータユニットの故障検出のフローチャートである。FIG. 9 is a flowchart of failure detection of a reference computer unit in a sub computer unit.

【図１０】フォールトトレラントシステム構成２を示す
図である。FIG. 10 is a diagram showing a fault tolerant system configuration 2.

【図１１】フォールトトレラントシステム構成３を示す
図である。FIG. 11 is a diagram showing a fault tolerant system configuration 3.

【図１２】故障検出コマンドのフォーマット２を示す図
である。FIG. 12 is a diagram showing a format 2 of a failure detection command.

[Explanation of symbols]

１基準コンピュータユニット２サブコンピュータユニット３デバイス制御用コンピュータユニットＡ４デバイス制御用コンピュータユニットＢ５デバイス２１コンピュータＡ２２コンピュータＢ２３コンピュータＣ２４コンピュータＤ２５バックアップ処理回路３１基準コンピュータユニット３２デバイス制御用コンピュータユニット３３デバイス４１ＣＰＵ４２ＲＯＭ４３ＲＡＭ４４通信ポートＡ４５通信ポートＢ６１ヘッダー部６２各々のコンピュータユニットが管理する故障情報６３基準コンピュータユニットが管理する故障情報１０１基準コンピュータユニット１０２サブコンピュータユニット１０３コンピュータユニットＡ１０４コンピュータユニットａ１０５コンピュータユニットＢ１０６コンピュータユニットｂ１０７コンピュータユニットＣ１０８コンピュータユニットｃ１１１基準コンピュータユニット１１２サブコンピュータユニット１１３デバイス１２１ヘッダー部１２２その他の情報
1 Reference computer unit
2 Sub computer unit
3 Device control computer unit A
4 Device control computer unit B
5 Device
21 Computer A
22 Computer B
23 Computer C
24 Computer D
25 Backup processing circuit
31 Reference computer unit
32 Device control computer unit
33 Device
41 CPU
42 ROM
43 RAM
44 Communication port A
45 Communication port B
61 Header part
62 Failure information managed by each computer unit
63 Failure information managed by the reference computer unit
101 Reference computer unit
102 Sub computer unit
103 Computer unit A
104 Computer unit a
105 Computer unit B
106 Computer unit b
107 Computer unit C
108 Computer unit c
111 Reference computer unit
112 Sub computer unit
113 Device
121 Header part
122 Other information

フロントページの続き (72)発明者舩山敦千葉県千葉市美浜区中瀬１丁目８番地セイコー電子工業株式会社内 Continuation of front page: (72) Inventor Atsushi Funayama, c/o Seiko Electronic Industry Co., Ltd., 1-8 Nakase, Mihama-ku, Chiba-shi, Chiba

Claims

[Claims]

1. A fault-tolerant system in a data management system having a plurality of computer units, wherein each computer unit has two or more communication ports connectable to other computer units, and adjacent computer units in the system formed by connecting all the computer units are electrically connected to each other through their respective communication ports.

2. The fault-tolerant system according to claim 1, wherein the function of at least one computer unit in the computer system can be performed in its place by one or more other computer units, so that when one computer unit fails, another computer unit takes over the same processing function as the failed computer unit and the processing of the entire system is not stopped.

3. The fault-tolerant system according to claim 1, wherein the computer system comprises a reference computer unit that detects and manages failure information of the entire system, and a sub computer unit that supports the reference computer unit.

4. The fault-tolerant system according to claim 1, wherein a failure of the system is detected by using a failure detection command that has a management number of the failure detection command managed by the reference or sub computer unit, header information including a computer unit address, failure information of each computer unit, and failure information of the entire system.

5. The fault-tolerant system according to claim 1, wherein each computer unit other than the reference computer unit, on receiving the failure detection command transmitted by the reference computer unit, writes the failure information it manages into the failure detection command and transmits the command from a communication port different from the one on which it was received, so that the failure detection command transmitted by the reference computer unit returns to the reference computer unit via all the computer units and the reference computer unit can obtain the failure information of all the computer units.

6. The fault-tolerant system according to claim 1, wherein, when a computer unit other than the reference computer unit receives a failure detection command and fails to transmit it to another computer unit, that computer unit writes transmission failure information into the failure detection command and sends the command back to the computer unit from which it was just received, and the reference computer unit that receives this failure detection command detects failures of the communication lines and computer units in the system and identifies the failure location from the transmission failure information in the failure detection commands received on each communication port.

7. The fault-tolerant system according to claim 1, wherein the sub computer unit constantly monitors whether a failure detection command is received from the reference computer unit within a predetermined time; when no failure detection command is received within the predetermined time, the sub computer unit transmits to the reference computer unit a failure detection command for confirming that the reference computer unit is operating normally; the reference computer unit that receives this command checks itself for failure; the sub computer unit again monitors whether a failure detection command is sent from the reference computer unit within the predetermined time and, on receiving one, checks the failure information in it; and when the sub computer unit thereby detects a failure of the reference computer unit, it sends failure information on the reference computer unit into the system.

8. The fault-tolerant system according to claim 1, wherein a device connected to the computer system is connected in the same manner to at least two computer units having the same function, so that even if one of those computer units fails, the device can be operated by another of them.

9. The fault-tolerant system according to claim 1, wherein, when a system failure occurs, a failure detection command is transmitted within the system and each computer unit monitors the information of each computer unit and of the entire system, whereby system recovery information can be obtained.