JPH0895931A

JPH0895931A - Fault detecting method for distributed computer system

Info

Publication number: JPH0895931A
Application number: JP6230022A
Authority: JP
Inventors: Kenji Ueda; 健治植田
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 1994-09-26
Filing date: 1994-09-26
Publication date: 1996-04-12

Abstract

PURPOSE: To provide a fault detection method that improves fault tolerance and minimizes the load on the network and the computers in normal operation, by having each computer check whether it periodically receives the survival signal sent from the adjacent computer on a virtual ring and, when no signal is received, judge that an abnormality has occurred on the communication path used to send the survival signal. CONSTITUTION: A plurality of computers 101 to 104 and so on are arranged on a virtual ring 10, and each computer periodically transmits a survival signal, indicating that it is alive, to the adjacent computer in a specified direction on the virtual ring 10. Each computer also checks whether it periodically receives the survival signal sent from the adjacent computer on the virtual ring 10. When no signal is received, it judges that an abnormality has occurred on the communication path used to send the survival signal and identifies the location of the fault. Each computer then reports fault information on the discovered fault to all communication equipment with which it can communicate.

Description

Detailed Description of the Invention

[0001]

[Field of Industrial Application] The present invention relates to a fault detection method for a distributed computer system comprising a plurality of computers connected via a network, which detects faults reliably while reducing the load on the network and the computers.

[0002]

[Prior Art] FIG. 59 is a block diagram illustrating a conventional method of detecting computer and network faults in a distributed computer system. In the figure, 101 to 105 are computers. In the distributed computer system, the computers 101 to 105 are, as is well known, connected via a local area network (hereinafter, LAN) so that they can communicate with one another.

[0003] Next, the operation is described. One of these computers, for example computer 103, is assigned the role of managing operating information. Computer 103 periodically sends a fault detection signal, that is, a survival signal transmission request, to every other computer and checks whether a response to it, that is, a survival signal, arrives within a fixed time. If the managing computer 103 receives the survival signal, it judges that the computer that sent it and the communication path to that computer are normal. If it does not receive the survival signal, it judges that the computer to which the request was sent, or the communication path to that computer, is in some fault state. The managing computer 103 then notifies all computers of the detected fault information by some means.

[0004] In such a fault detection method, however, management of operating information is concentrated in a single computer, so the fault detection function is lost if that computer fails. To avoid this drawback, another approach assigns the same function to every computer.

[0005]

[Problems to Be Solved by the Invention] Because conventional fault detection methods for distributed computer systems are configured as described above, the former method in particular concentrates the fault detection function in a specific computer, and the function is lost when that computer itself fails.

[0006] In the latter method, as the number of computers grows, the volume of signals exchanged for fault detection grows roughly in proportion to the square of the number of computers, so the load on the LAN (the number of signals sent on the LAN per unit time) becomes large. Specifically, with N computers, every computer sends a survival signal transmission request to all computers other than itself and receives a survival signal in reply, so 2N(N-1) signals are sent on the LAN. The number of signals each computer sends and receives is also proportional to N, so the load on each computer is large as well.
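The signal counts discussed here can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the centralized figure assumes one request and one reply per monitored computer, as in the single-manager method described earlier.

```python
def centralized_signals(n: int) -> int:
    """Signals per polling round when one manager polls the other n - 1 computers."""
    # One request plus one survival signal per monitored computer.
    return 2 * (n - 1)


def all_to_all_signals(n: int) -> int:
    """Signals per polling round when every computer polls every other computer."""
    # Each of the n computers sends n - 1 requests and receives n - 1 replies.
    return 2 * n * (n - 1)


for n in (5, 10, 50):
    print(n, centralized_signals(n), all_to_all_signals(n))
```

The second function grows quadratically in N, which is the LAN load problem this paragraph describes.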

[0007] Furthermore, in a distributed computer system connected to two or more LANs, a computer that loses its own connection to one of the LANs cannot learn the state of the other computers' connections to that LAN. As shown in FIG. 60, when a crossed failure occurs, in which the two computers 102 and 103 each lose their connection to a different LAN, computers 102 and 103 each judge that communication with the other is impossible. Because computer 102 has lost its connection to LAN 401, it cannot determine whether computer 103 has failed or whether computer 103 has merely lost its connection to LAN 402 (that is, computer 103 is normal) and could still be reached via LAN 401. Likewise, because computer 103 has lost its connection to LAN 402, it cannot determine whether computer 102 has failed or whether computer 102 has merely lost its connection to LAN 401 (that is, computer 102 is normal) and could still be reached via LAN 402. As a result, computers 102 and 103 cannot discover the route through computer 101 even though such a route exists.

[0008] The invention of claim 1 was made to solve the problems described above. Its object is to obtain a fault detection method for a distributed computer system that does not concentrate the fault detection function in a specific computer, has excellent fault tolerance, and minimizes the load on the network and the computers in normal operation.

[0009] The object of the invention of claim 2 is, in addition to that of claim 1, to obtain a fault detection method for a distributed computer system that, by periodically switching the destination of the survival signal from the right neighbor to the left neighbor or vice versa, can find faults with minimal communication even when a fault occurs and minimizes the number of survival signals exchanged both in normal operation and when an abnormality occurs.

[0010] The object of the invention of claim 3 is to obtain a fault detection method for a distributed computer system in which each computer writes into the next survival signal it sends whether it has itself received a survival signal, so that faults are found by combining whether the destination computer received the survival signal with whether the source computer of that signal received one, thereby minimizing the number of survival signals exchanged in normal operation and when an abnormality occurs.

[0011] The object of the invention of claim 4 is to obtain a fault detection method for a distributed computer system that, by arranging the computers on a virtual tree whose nodes are the computers, can find faults with minimal communication even when a fault occurs and minimizes the number of survival signals exchanged in normal operation and when an abnormality occurs.

[0012] The object of the invention of claim 5 is to obtain a fault detection method for a distributed computer system that, by dividing the computers into a plurality of groups and arranging the representative computer of each group on a virtual ring, can find faults with minimal communication even when a fault occurs and minimizes the number of survival signals exchanged in normal operation and when an abnormality occurs.

[0013] The object of the invention of claim 6 is to obtain a fault detection method for a distributed computer system that uses a duplicated LAN and checks at the destination computer whether the survival signal was received, thereby minimizing the number of survival signals exchanged for fault detection in normal operation.

[0014] The object of the inventions of claims 7 to 10 is to obtain a fault detection method for a distributed computer system that uses a duplicated LAN and finds faults by combining the reception state of the survival signal itself with the reception state of survival signals at the source computer. This minimizes the number of survival signals exchanged in normal operation and when an abnormality occurs, allows the failure of one computer to be found by several computers near the failed computer, and allows a fault to be found with a shorter delay after it occurs.

[0015] The object of the inventions of claims 11 to 14 is to obtain a fault detection method for a distributed computer system that changes the destination of each computer when a fault occurs, when a failed computer is restored, or when a new computer is added, so that the same fault detection capability as before is maintained even when the system configuration changes.

[0016] The object of the inventions of claims 15 and 16 is to obtain a fault detection method for a distributed computer system that uses the survival signals sent by each computer to carry fault information, so that no extra signal has to be sent for notification and the load on the LAN is reduced.

[0017] The object of the inventions of claims 17 and 19 is to obtain a fault detection method for a distributed computer system having an arbitrary number of LANs, by grouping three or more LANs into pairs and applying the method of claims 6 to 10, claim 12, claim 14, or claim 16 to each pair.

[0018] The object of the invention of claim 18 is to obtain a fault detection method for a distributed computer system having an arbitrary number of LANs, by grouping three or more LANs into pairs and applying to each group either the method of claims 1 to 5, claim 11, claim 13, or claim 15, or the method of claims 6 to 10, claim 12, claim 14, or claim 16.

[0019] The object of the invention of claim 20 is to obtain a fault detection method for a distributed computer system that, in the fault detection method of claim 19, reduces the number of survival signals exchanged by combining into one the survival signals used by each pair on a LAN shared by two LAN pairs.

[0020] The object of the invention of claim 21 is to obtain a fault detection method for a distributed computer system that reduces the effect of a fault on the system's primary work.

[0021] The object of the inventions of claims 22 and 23 is to obtain a fault detection method for a distributed computer system that reliably detects the failure of computers for which fault detection is most needed.

[0022] The object of the invention of claim 24 is to obtain a fault detection method for a distributed computer system in which the relationship between the transmission and reception times of survival signals can be set freely to suit the required fault detection characteristics.

[0023]

[Means for Solving the Problems] The fault detection method for a distributed computer system according to the invention of claim 1 executes: a virtual arrangement step of arranging a plurality of computers on a virtual ring; a fault detection step in which each computer checks whether it has periodically received the survival signal sent from the adjacent computer on the virtual ring and, if not, judges that an abnormality has occurred on the communication path used to send the survival signal and identifies the fault location; and a fault notification step in which each computer notifies all computers with which it can communicate of fault information on the fault it found.
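The ring-based detection of claim 1 can be pictured with a small simulation. This is a minimal sketch under stated assumptions, not the patented procedure: the function name `ring_round` is hypothetical, failures are modeled as a simple set of failed computers, every computer sends to the next entry in the list, and a missing signal is reported as a suspect (sender or path) without further localization.

```python
def ring_round(ring: list[int], failed: set[int]) -> list[tuple[int, int]]:
    """Simulate one survival-signal period on a virtual ring.

    Returns (receiver, silent_sender) pairs: each live receiver that did not
    get the expected survival signal from its upstream neighbor.
    """
    n = len(ring)
    reports = []
    for i, sender in enumerate(ring):
        receiver = ring[(i + 1) % n]  # neighbor in the assumed sending direction
        if receiver in failed:
            continue  # a failed computer cannot observe anything
        if sender in failed:
            # No survival signal arrived: suspect the sender or the path to it.
            reports.append((receiver, sender))
    return reports


print(ring_round([101, 102, 103, 104], failed={102}))
```

In normal operation each computer sends one signal per period, so the ring exchanges N signals per period.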

[0024] In the fault detection method for a distributed computer system according to the invention of claim 2, in the survival signal transmission step in which each computer periodically sends a survival signal, the computer alternates at each periodic timing between sending the survival signal to its right neighbor and to its left neighbor on the virtual ring.
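The alternating destination of claim 2 amounts to flipping the sending direction each period. A minimal sketch follows; the name `survival_target` and the choice that even-numbered periods go to the right neighbor are assumptions made for illustration.

```python
def survival_target(ring: list[int], index: int, period: int) -> int:
    """Return the computer that ring[index] sends its survival signal to.

    Assumption: even periods target the right neighbor (next entry in the
    list) and odd periods target the left neighbor (previous entry).
    """
    step = 1 if period % 2 == 0 else -1
    return ring[(index + step) % len(ring)]


ring = [101, 102, 103, 104]
for period in range(4):
    print(period, survival_target(ring, 0, period))
```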

[0025] In the fault detection method for a distributed computer system according to the invention of claim 3, in the survival signal transmission step, each computer writes into the survival signal it sends whether it received the survival signal it was due to receive within a predetermined time, and then sends it.
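One way to picture the survival signal of claim 3 is as a small record carrying a reception flag. The sketch below is hypothetical: the field names, the `build_signal` helper, and the use of numeric timestamps are assumptions, since the source does not specify a message format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SurvivalSignal:
    sender: int               # identifier of the sending computer
    received_expected: bool   # did the sender get its own expected signal in time?


def build_signal(sender: int, last_received_at: Optional[float],
                 now: float, timeout: float) -> SurvivalSignal:
    """Build the next survival signal, recording whether the expected
    incoming signal arrived within `timeout` (a hypothetical parameter)."""
    on_time = last_received_at is not None and (now - last_received_at) <= timeout
    return SurvivalSignal(sender, on_time)


print(build_signal(102, last_received_at=10.0, now=12.0, timeout=5.0))
```

A receiver can then combine two facts, whether this signal arrived and what the flag inside it says, which is the combination the claim relies on.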

[0026] The fault detection method for a distributed computer system according to the invention of claim 4 executes: a virtual arrangement step of arranging the computers on a virtual tree in which each computer is a node and each node has two or more child nodes; a survival signal transmission step in which each computer periodically sends a survival signal to the parent computer located at its parent node on the virtual tree; a fault detection step in which each computer checks whether it has received survival signals from the child computers located at its child nodes on the virtual tree and identifies the fault location by combining the results; and a fault notification step in which each computer notifies all computers with which it can communicate of information on the fault it found.
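The tree arrangement of claim 4 can be sketched with an array-indexed binary tree. This layout (node i has children 2i+1 and 2i+2) is an assumption chosen for brevity; the claim only requires two or more children per node. The function names are hypothetical, and failures are again modeled as a set.

```python
def children(i: int, n: int) -> list[int]:
    """Child indices of node i in an array-indexed binary tree of n nodes."""
    return [c for c in (2 * i + 1, 2 * i + 2) if c < n]


def silent_children(n: int, failed: set[int]) -> dict[int, list[int]]:
    """For one period, map each live parent to the children whose
    survival signals did not arrive."""
    result = {}
    for parent in range(n):
        if parent in failed:
            continue  # a failed parent observes nothing
        silent = [c for c in children(parent, n) if c in failed]
        if silent:
            result[parent] = silent
    return result


print(silent_children(7, failed={4}))
```

A parent that sees only one silent child can suspect that child, while several silent children at once may point to a shared path; the source leaves the exact combination rule to the detailed embodiments.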

[0027] The fault detection method for a distributed computer system according to the invention of claim 5 executes: a virtual arrangement step of dividing the computers into M groups, designating one computer in each group as a representative computer, and arranging the M representative computers on a virtual ring; a first survival signal transmission step in which each computer other than a representative computer periodically sends a survival signal to the representative computer of the group to which it belongs; a second survival signal transmission step in which each representative computer periodically sends a survival signal to the computer adjacent to it in a specific direction on the virtual ring; a fault detection step in which each representative computer checks whether it has received the survival signals addressed to it and identifies the fault location by combining the results; and a fault notification step in which each representative computer notifies all computers with which it can communicate of information on the fault it found.
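The two transmission steps of claim 5 can be summarized as a routing table from each sender to its destination. In this sketch the first member of each group is taken as the representative and the ring direction follows list order; both choices, and the name `survival_targets`, are assumptions for illustration.

```python
def survival_targets(groups: list[list[int]]) -> dict[int, int]:
    """Map every computer to the computer it sends its survival signal to.

    Assumption: groups[k][0] is the representative of group k, and the
    representatives form a ring in list order.
    """
    reps = [group[0] for group in groups]
    targets = {}
    for k, group in enumerate(groups):
        rep = group[0]
        for member in group[1:]:
            targets[member] = rep                 # first transmission step
        targets[rep] = reps[(k + 1) % len(reps)]  # second transmission step
    return targets


print(survival_targets([[101, 102], [103, 104, 105]]))
```

Every computer still sends exactly one signal per period, so the total stays at N signals while the ring itself shrinks to M members.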

[0028] The fault detection method for a distributed computer system according to the invention of claim 6 executes: a virtual arrangement step of arranging a plurality of computers on a virtual ring; a survival signal transmission step in which, with the computers classed as odd-numbered or even-numbered according to their order in a specific direction from a specific computer on the virtual ring, each odd-numbered computer periodically sends a survival signal via a first LAN to an adjacent computer on the virtual ring and each even-numbered computer periodically sends a survival signal via a second LAN to an adjacent computer on the virtual ring; a fault detection step in which each computer checks whether it has received the survival signals addressed to it and identifies the fault location by combining the results; and a fault notification step in which each computer notifies all computers with which it can communicate of information on the fault it found.
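The odd/even LAN assignment of claim 6 can be written as a per-period schedule. The sketch assumes positions are counted from 1 starting at the first list entry and that each computer sends to the next computer in counting order; the claim itself says only "an adjacent computer". The name `survival_schedule` is hypothetical.

```python
def survival_schedule(ring: list[int]) -> list[tuple[int, int, int]]:
    """Return (sender, receiver, lan) triples for one period.

    Odd-numbered positions (1st, 3rd, ...) use LAN 1, even-numbered use LAN 2.
    """
    n = len(ring)
    schedule = []
    for i, sender in enumerate(ring):
        position = i + 1                      # 1-based order on the ring
        lan = 1 if position % 2 == 1 else 2
        schedule.append((sender, ring[(i + 1) % n], lan))
    return schedule


for entry in survival_schedule([101, 102, 103, 104]):
    print(entry)
```

With an even number of computers, each computer sends on one LAN and receives on the other, so both of its LAN connections are exercised every period.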

[0029] The fault detection method for a distributed computer system according to the invention of claim 7 executes: a virtual arrangement step of arranging a plurality of computers on a virtual ring; a survival signal transmission step in which each computer periodically sends a survival signal via a first LAN to the computer adjacent to it in a specific direction on the virtual ring and periodically sends a survival signal via a second LAN to the computer adjacent to it in the opposite direction; a survival signal response step in which each computer checks whether it has received the survival signal sent from an adjacent computer and writes the result, as a response to that survival signal, into the survival signal it sends to that adjacent computer; a fault detection step in which each computer identifies the fault location by combining the presence or absence of survival signals from both neighbors on the virtual ring with the contents of their responses; and a fault notification step in which each computer notifies all computers with which it can communicate of information on the fault it found.

[0030] In the fault detection method for a distributed computer system according to the invention of claim 8, in the survival signal transmission step in which each computer periodically sends a survival signal to an adjacent computer, the computer writes into the signal, together with its response to the survival signal sent from that adjacent computer, a copy of the response received from its other adjacent computer.

[0031] The fault detection method for a distributed computer system according to the invention of claim 9 executes: a virtual arrangement step of arranging a plurality of computers on a virtual ring; a survival signal transmission step in which, with the computers classed as odd-numbered or even-numbered according to their order in a specific direction from a specific computer on the virtual ring, each odd-numbered computer periodically sends a survival signal via a first LAN to both of its neighbors on the virtual ring and each even-numbered computer periodically sends a survival signal via a second LAN to both of its neighbors on the virtual ring; a fault detection step in which each computer checks whether it has received, via the first or second LAN, the survival signals sent from both neighbors and identifies the fault location by combining the results; and a fault notification step in which each computer notifies all computers with which it can communicate of information on the fault it found.

[0032] The fault detection method for a distributed computer system according to the invention of claim 10 executes: a virtual arrangement step of arranging a plurality of computers on a virtual ring; a survival signal transmission step in which the computers are divided into groups of three and, in each group, a first computer periodically sends a survival signal to a second computer via a first LAN and periodically sends a survival signal to a third computer via a second LAN; a first survival signal response step in which, in each group, the second computer checks whether it has received the survival signal from the first computer and writes the result into the survival signal it periodically sends to the third computer via the second LAN; a second survival signal response step in which the third computer checks whether it has received the survival signal from the first computer and writes the result into the survival signal it periodically sends to the second computer via the first LAN; a first fault detection step in which the second computer checks the presence and contents of the survival signals sent from the first and third computers and identifies the fault location by combining the results; a second fault detection step in which the third computer checks the presence and contents of the survival signals sent from the first and second computers and identifies the fault location by combining the results; and a fault notification step in which each computer notifies all computers with which it can communicate of information on the fault it found.

[0033] The fault detection method for a distributed computer system according to the invention of claim 11 further executes a rearrangement step of newly resetting the virtual arrangement of the computers when a fault occurs.

[0034] The fault detection method for a distributed computer system according to the invention of claim 12 further executes a rearrangement step of newly resetting the virtual arrangement of the computers when a fault occurs.

[0035] The fault detection method for a distributed computer system according to the invention of claim 13 further executes a rearrangement step of newly resetting the virtual arrangement of the computers when a fault occurs.

[0036] The fault detection method for a distributed computer system according to the invention of claim 14 further executes a rearrangement step of newly resetting the virtual arrangement of the computers when a fault occurs.

[0037] In the fault detection method for a distributed computer system according to the invention of claim 15, in the fault notification step of notifying an adjacent computer of detected fault information, the fault is notified by appending the fault information to the survival signal and sending that survival signal.

[0038] In the fault detection method for a distributed computer system according to the invention of claim 16, in the fault notification step of notifying an adjacent computer of detected fault information, the fault is notified by appending the fault information to the survival signal and sending that survival signal.

[0039] In the fault detection method for a distributed computer system according to the invention of claim 17, in a distributed system consisting of a plurality of computers connected by 2N LANs, the LANs are grouped into pairs, and one of the fault detection methods of claims 6 to 10, claim 12, claim 14, and claim 16 is used for each pair.

[0040] In the fault detection method for a distributed computer system according to the invention of claim 18, in a distributed system consisting of a plurality of computers connected by (2N+1) LANs, the LANs are grouped into pairs, one of the fault detection methods of claims 6 to 10, claim 12, claim 14, and claim 16 is used for each pair, and one of the fault detection methods of claims 1 to 5, claim 11, claim 13, and claim 15 is used for the one remaining LAN.

[0041] In the fault detection method for a distributed computer system according to the invention of claim 19, in a distributed system consisting of a plurality of computers connected by (2N+1) LANs, the LANs are grouped into pairs, one further pair is formed from the (2N+1)th LAN and any one of the other LANs, and one of the fault detection methods of claims 6 to 10, claim 12, claim 14, and claim 16 is used for each pair.

【００４２】請求項２０の発明に係る分散計算機システ
ムの故障検出方法は、２つのペアで共有されているＬＡ
Ｎにおいて、それぞれのペアにおいて送信される生存信
号を１つにまとめるものである。According to a twentieth aspect of the present invention, there is provided a failure detection method for a distributed computer system in which, on a LAN shared by two pairs, the survival signals transmitted for the respective pairs are combined into one.

【００４３】請求項２１の発明に係る分散計算機システ
ムの故障検出方法は、仮想配置ステップにおいて、相互
に通信する頻度の高い計算機を、仮想的な配置において
近接するように配置するものである。In the failure detecting method for a distributed computer system according to the twenty-first aspect of the present invention, in the virtual arranging step, computers that frequently communicate with each other are arranged so as to be close to each other in the virtual arrangement.

【００４４】請求項２２の発明に係る分散計算機システ
ムの故障検出方法は、仮想配置ステップにおいて、信頼
性の高い計算機と信頼性の低い計算機を、仮想的な配置
において交互に並べるものである。According to a twenty-second aspect of the present invention, there is provided a distributed computer system failure detection method in which, in the virtual placement step, highly reliable computers and low reliability computers are alternately arranged in a virtual placement.

【００４５】請求項２３の発明に係る分散計算機システ
ムの故障検出方法は、仮想配置ステップにおいて、信頼
性の高い計算機と機能的に重要な計算機を、仮想的な配
置において交互に並べるものである。According to a twenty-third aspect of the present invention, there is provided a method for detecting a failure in a distributed computer system, wherein in a virtual arrangement step, highly reliable computers and functionally important computers are arranged alternately in a virtual arrangement.

【００４６】請求項２４の発明に係る分散計算機システ
ムの故障検出方法は、一部または全ての生存信号につい
て、その送信時刻または受信期限を、各計算機が特定の
生存信号を受信した時刻を基準にして設定するものであ
る。According to a twenty-fourth aspect of the present invention, there is provided a failure detection method for a distributed computer system in which the transmission time or reception deadline of some or all of the survival signals is set relative to the time at which each computer receives a specific survival signal.

【００４７】[0047]

【作用】請求項１の発明における分散計算機システムの
故障検出方法は、複数の計算機が仮想的な仮想リング上
に配置され、各計算機は、仮想リング上の特定の方向に
隣接する計算機に対して、自分自身の生存を示す生存信
号を定期的に送信する。また、各計算機は、仮想リング
上の隣接する計算機から送信された生存信号を定期的に
受信したか否かを調べ、受信しない場合、生存信号の送
信に使用される通信路に異常が発生したと判断し、故障
箇所を特定し、各計算機は、発見した故障に関する故障
情報を、通信し得る全ての計算機に通知する。即ち、各
計算機は、仮想リング上の隣接する決められた送信相手
に定期的に信号を送信し、定期的に生存信号を送ること
により、送信先計算機で該生存信号が受信できるかどう
かをチェックする。各計算機ごとに決まった計算機に生
存信号を送信し、送信先計算機で、該生存信号が受信で
きるかをチェックすることにより、各計算機が限定され
た範囲の故障検出を行う。各計算機が、決められた相手
にだけ生存信号を送信するため、全ての計算機が送受信
する信号の量は、計算機の台数に比例し、計算機１台あ
たり送受信する生存信号の量は、計算機の台数に関係な
くほぼ一定となる。これにより、平常時に故障発見のた
めに交換される生存信号の数を最小にできる。また、各
計算機が、自分自身の担当範囲内で発見された故障の情
報を、他の計算機に通知することにより、たすきがけ故
障が発生しても、各計算機がシステム全体の稼働情報を
得ることができる。In the failure detection method for a distributed computer system according to the first aspect of the invention, a plurality of computers are arranged on a virtual ring, and each computer periodically transmits a survival signal indicating its own survival to the computer adjacent to it in a specific direction on the virtual ring. Each computer also checks whether it periodically receives the survival signal transmitted from its adjacent computer on the virtual ring; if it does not, it judges that an abnormality has occurred on the communication path used to transmit the survival signal and identifies the failure location. Each computer then notifies all computers with which it can communicate of the failure information on the discovered failure. That is, each computer periodically transmits a survival signal to its predetermined adjacent destination on the virtual ring, and the destination computer checks whether that survival signal can be received. Because each computer transmits the survival signal to a fixed computer and the destination computer checks whether the signal arrives, each computer performs failure detection over a limited range. Since each computer transmits the survival signal only to its designated partner, the total amount of signals sent and received by all computers is proportional to the number of computers, and the amount of survival signals sent and received per computer is almost constant regardless of the number of computers. This minimizes the number of survival signals exchanged for failure detection during normal operation. In addition, because each computer notifies the other computers of failures found within its own range of responsibility, every computer can obtain the operating information of the entire system even if a crossover ("tasukigake") failure occurs.
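The ring arrangement described above can be illustrated with a minimal sketch. The function names, the `last_seen` table, and the timeout value are assumptions made for illustration and do not appear in the patent text.

```python
# Hypothetical sketch of the virtual-ring survival-signal check.
# All names here are illustrative assumptions, not taken from the patent.

def ring_successor(ring, node):
    """Neighbour of `node` in the fixed sending direction on the virtual ring."""
    i = ring.index(node)
    return ring[(i + 1) % len(ring)]

def ring_predecessor(ring, node):
    """Neighbour from which `node` expects to receive its survival signal."""
    i = ring.index(node)
    return ring[(i - 1) % len(ring)]

def signals_per_period(ring):
    """Each computer sends exactly one signal per period, so the total is N."""
    return len(ring)

def suspect_path(ring, receiver, last_seen, now, timeout):
    """Return the (sender, receiver) path to suspect, or None if the signal is on time."""
    sender = ring_predecessor(ring, receiver)
    if now - last_seen.get(sender, float("-inf")) > timeout:
        return (sender, receiver)
    return None

ring = [101, 102, 103, 104]
```

For example, `suspect_path(ring, 102, {101: 0.0}, now=5.0, timeout=3.0)` flags the path from computer 101 to computer 102, while the total traffic stays at one signal per computer per period.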

【００４８】請求項２の発明における分散計算機システ
ムの故障検出方法は、計算機が定期的に生存信号を送信
する生存信号送信ステップにおいて、定期的なタイミン
グ毎に仮想リング上で交互に切り替えて右隣または左隣
の計算機へと生存信号を送信する。このように、定期的
な生存信号の送信に加えて、送信先計算機の組み合わせ
を工夫することにより、２つ以上の計算機からの生存信
号を受信する計算機をつくる。従って、故障発生時にも
最小限の通信量で故障を発見でき、平常時及び異常発生
時に交換される生存信号の数を最小にできる。In the failure detection method for a distributed computer system according to the second aspect of the invention, in the survival signal transmitting step in which each computer periodically transmits a survival signal, the computer alternates, at each periodic timing, between its right-hand and left-hand neighbor on the virtual ring as the destination of the survival signal. Thus, in addition to the periodic transmission of survival signals, the combination of destination computers is arranged so that some computers receive survival signals from two or more computers. A failure can therefore be found with a minimum amount of communication even when one occurs, and the number of survival signals exchanged both in normal operation and when an abnormality occurs can be minimized.
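As a rough illustration of the alternating destination, the sketch below picks the right-hand neighbor on even ticks and the left-hand neighbor on odd ticks. The tick counter and the choice of which parity maps to which side are assumptions, not details given in the text.

```python
# Hypothetical sketch: alternate the destination between the two ring neighbours.
# The mapping "even tick -> right, odd tick -> left" is an assumption.

def destination(ring, node, tick):
    """Return the neighbour that `node` sends its survival signal to at `tick`."""
    i = ring.index(node)
    step = 1 if tick % 2 == 0 else -1
    return ring[(i + step) % len(ring)]

ring = [101, 102, 103, 104]
```

Over two consecutive ticks every computer receives signals from both of its neighbors, which is what lets one computer observe two senders.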

【００４９】請求項３の発明における分散計算機システ
ムの故障検出方法は、生存信号送信ステップにおいて、
計算機が受信予定の生存信号を所定の時間内に受信した
か否かを、送信する生存信号に書き込み送信する。即
ち、定期的な生存信号の送信と、送信先計算機との組み
合わせの工夫に加えて、自分自身が生存信号を受信した
か否かを次に送信する生存信号に書き込むことにより、
送信先計算機が生存信号そのものを受信したか否かと、
生存信号の送信元計算機が生存信号を受信したか否かと
を組み合わせて故障を発見する。これにより、平常時及
び異常発生時に交換される生存信号の数を最小にでき
る。In the failure detection method for a distributed computer system according to the third aspect of the invention, in the survival signal transmitting step, each computer writes into the survival signal it transmits whether or not it received the survival signal it was expecting within a predetermined time. That is, in addition to the periodic transmission of survival signals and the arrangement of destination computers, each computer records in the next survival signal it sends whether it has itself received a survival signal. Failures are then found by combining whether the destination computer received the survival signal itself with whether the source computer of that signal received its own expected survival signal. This minimizes the number of survival signals exchanged both in normal operation and when an abnormality occurs.
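The combination of "did the signal arrive" and "what did the sender report" can be sketched as follows. The `SurvivalSignal` type, its field names, and the wording of the diagnoses are hypothetical; the patent text here does not specify a concrete decision table.

```python
# Hypothetical sketch of a survival signal carrying the sender's own reception status.
from dataclasses import dataclass
from typing import Optional

@dataclass
class SurvivalSignal:
    sender: int
    received_expected: bool  # did the sender get its own expected signal in time?

def diagnose(expected_sender: int, signal: Optional[SurvivalSignal], upstream: int) -> str:
    """Combine the local observation with the status reported by the sender.

    `upstream` is the computer the sender itself expects a signal from.
    """
    if signal is None:
        # Nothing arrived: the sender or the path from it is suspect.
        return f"suspect computer {expected_sender} or its outgoing path"
    if not signal.received_expected:
        # The sender is alive but reports that its own upstream signal is missing.
        return f"suspect computer {upstream} or the path {upstream}->{signal.sender}"
    return "no fault observed"
```

The second branch is the case that a plain heartbeat cannot cover: the receiver learns about a fault one hop further away without any extra message.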

【００５０】請求項４の発明における分散計算機システ
ムの故障検出方法は、各計算機を節点とし各節点が２つ
以上の子節点を有する仮想的な仮想ツリー上に配置し、
各計算機は、仮想ツリー上で親節点に位置する親計算機
に対して、生存信号を定期的に送信する生存信号送信ス
テップと、各計算機が、仮想ツリー上で子節点に位置す
る子計算機からの生存信号を受信したか否かを調べ、そ
の結果を組み合わせて故障箇所を特定する。さらに、各
計算機は、発見した故障に関する情報を、通信し得る全
ての計算機に通知する。即ち、請求項１の定期的な生存
信号の送信に加えて、送信先計算機の組み合わせを工夫
することにより、２つ以上の計算機からの生存信号を受
信する計算機をつくる。従って、故障発生時にも最小限
の通信量で故障を発見でき、平常時及び異常発生時に交
換される生存信号の数を最小にできる。In the failure detection method for a distributed computer system according to the fourth aspect of the invention, the computers are arranged on a virtual tree in which each computer is a node and each node has two or more child nodes. Each computer periodically transmits a survival signal to its parent computer, located at its parent node on the virtual tree. Each computer also checks whether it has received the survival signals from its child computers, located at its child nodes on the virtual tree, and identifies the failure location by combining the results. Furthermore, each computer notifies all computers with which it can communicate of information on the discovered failure. That is, in addition to the periodic transmission of survival signals of claim 1, the combination of destination computers is arranged so that some computers receive survival signals from two or more computers. A failure can therefore be found with a minimum amount of communication even when one occurs, and the number of survival signals exchanged both in normal operation and when an abnormality occurs can be minimized.
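One way to picture the virtual tree is an array-indexed k-ary layout, sketched below. The indexing scheme, the default `k=2`, and the function names are assumptions chosen for illustration; the text only requires that each node have two or more children.

```python
# Hypothetical sketch: computers indexed 0..n-1 and laid out as a k-ary tree.
# The array-based layout is an assumption, not something the patent prescribes.

def parent_index(i, k=2):
    """Index of the parent a computer sends its survival signal to (None for the root)."""
    return None if i == 0 else (i - 1) // k

def children_indices(i, n, k=2):
    """Indices of the children whose survival signals computer `i` expects."""
    first = k * i + 1
    return [c for c in range(first, first + k) if c < n]

def silent_children(i, n, heard, k=2):
    """Children of `i` whose survival signal did not arrive in this period."""
    return [c for c in children_indices(i, n, k) if c not in heard]
```

With seven computers, `silent_children(0, 7, {1})` reports child 2, so the root can tell which branch went quiet by comparing the two results.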

【００５１】請求項５の発明における分散計算機システ
ムの故障検出方法は、計算機をＭ個のグループに分割
し、各グループごとに１台の計算機を代表計算機とし、
Ｍ個の代表計算機を、仮想的な仮想リング上に配置す
る。代表計算機以外の計算機は、該計算機の属するグル
ープの代表計算機に生存信号を定期的に送信し、各代表
計算機は、仮想リング上で特定の方向に隣接する計算機
に生存信号を定期的に送信する。また、各代表計算機
は、該計算機に送信される生存信号を受信したか否かを
調べ、その結果を組み合わせて故障箇所を特定し、各代
表計算機は、発見した故障に関する情報を、通信し得る
全ての計算機に通知する。即ち、請求項１の定期的な生
存信号の送信に加えて、送信先計算機の組み合わせを工
夫することにより、２つ以上の計算機からの生存信号を
受信する計算機をつくる。従って、故障発生時にも最小
限の通信量で故障を発見でき、平常時及び異常発生時に
交換される生存信号の数を最小にできる。In the failure detection method for a distributed computer system according to the fifth aspect of the invention, the computers are divided into M groups, one computer in each group is designated as the representative computer, and the M representative computers are arranged on a virtual ring. Each computer other than a representative periodically transmits a survival signal to the representative computer of the group to which it belongs, and each representative computer periodically transmits a survival signal to the computer adjacent to it in a specific direction on the virtual ring. Each representative computer also checks whether it has received the survival signals addressed to it and identifies the failure location by combining the results, and each representative computer notifies all computers with which it can communicate of information on the discovered failure. That is, in addition to the periodic transmission of survival signals of claim 1, the combination of destination computers is arranged so that some computers receive survival signals from two or more computers. A failure can therefore be found with a minimum amount of communication even when one occurs, and the number of survival signals exchanged both in normal operation and when an abnormality occurs can be minimized.
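The grouping can be sketched as a destination table. Treating the first member of each group as its representative is an assumption made only to keep the example short.

```python
# Hypothetical sketch: build the survival-signal destination of every computer
# when computers are split into groups with one representative each.
# Assumption: the first member of each group list is the representative.

def destinations(groups):
    """Map each computer to the computer it sends its survival signal to."""
    reps = [group[0] for group in groups]
    dest = {}
    for gi, group in enumerate(groups):
        # Representatives form the virtual ring among themselves.
        dest[group[0]] = reps[(gi + 1) % len(reps)]
        # Ordinary members report to their own representative.
        for member in group[1:]:
            dest[member] = group[0]
    return dest
```

For `[[1, 2, 3], [4, 5], [6]]` the representatives 1, 4, and 6 form the ring, and computers 2, 3, and 5 each report to their own representative.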

【００５２】請求項６の発明における分散計算機システ
ムの故障検出方法は、複数の計算機を仮想的な仮想リン
グ上に配置する。各計算機を仮想リング上での特定の計
算機から特定の方向における順番によって、偶数番目、
奇数番目に分け、奇数番目の計算機は、第１のＬＡＮを
介して仮想リング上の隣接する計算機に定期的に生存信
号を送信し、偶数番目の計算機は、第２のＬＡＮを介し
て仮想リング上の隣接する計算機に定期的に生存信号を
送信する。また、各計算機は、該計算機に送信される生
存信号を受信したか否かを調べ、その結果を組み合わせ
て故障箇所を特定し、各計算機は、発見した故障に関す
る情報を、通信し得る全ての計算機に通知する。即ち、
各計算機が、決められた送信相手に定期的に信号を送信
し、定期的に生存信号を送る方法を二重化ＬＡＮに適用
することにより、送信先計算機で該生存信号が受信した
か否かをチェックする。これにより、平常時に故障発見
のために交換される生存信号の数を最小にできる。In the failure detection method for a distributed computer system according to the sixth aspect of the invention, a plurality of computers are arranged on a virtual ring. The computers are divided into even-numbered and odd-numbered computers according to their order in a specific direction from a specific computer on the virtual ring. The odd-numbered computers periodically transmit survival signals to their adjacent computers on the virtual ring via the first LAN, and the even-numbered computers periodically transmit survival signals to their adjacent computers on the virtual ring via the second LAN. Each computer also checks whether it has received the survival signals addressed to it and identifies the failure location by combining the results, and each computer notifies all computers with which it can communicate of information on the discovered failure. That is, by applying to a duplicated LAN the method in which each computer periodically transmits a survival signal to its designated destination, the destination computer checks whether the survival signal has been received. This minimizes the number of survival signals exchanged for failure detection during normal operation.
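The odd/even split over two LANs might look like the following sketch. Counting positions from 1 at the first list element and sending in list order are assumptions; the text leaves the starting computer and direction open.

```python
# Hypothetical sketch: choose the LAN by ring position (odd -> LAN 1, even -> LAN 2).
# Positions are counted from 1 starting at the first element of `ring` (assumption).

def lan_for_position(position):
    """LAN number used by the computer at the given 1-based ring position."""
    return 1 if position % 2 == 1 else 2

def send_plan(ring):
    """List of (sender, destination, lan) tuples for one period."""
    plan = []
    for idx, node in enumerate(ring):
        dest = ring[(idx + 1) % len(ring)]
        plan.append((node, dest, lan_for_position(idx + 1)))
    return plan
```

For the four computers 101 to 104 this yields two signals per period on each LAN, so both networks are exercised without increasing the total number of signals.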

【００５３】請求項７の発明における分散計算機システ
ムの故障検出方法は、複数の計算機を仮想的な仮想リン
グ上に配置する。各計算機は、第１のＬＡＮを介して仮
想リング上の特定の方向に隣接した計算機に定期的に生
存信号を送信するとともに、第２のＬＡＮを介して、仮
想リング上の特定の方向とは逆の方向に隣接した計算機
に定期的に生存信号を送信する。また、各計算機は、隣
接計算機から送信される生存信号を受信したか否かを調
べ、その結果を、隣接計算機に送信する生存信号に隣接
計算機から送信された生存信号への応答として書き込
み、各計算機は、仮想リング上での両隣の計算機からの
生存信号の有無と応答の内容とを組み合わせることによ
り、故障箇所を特定する。さらに、各計算機は、発見し
た故障に関する情報を、通信し得る全ての計算機に通知
する。即ち、定期的な生存信号の送信と、送信先計算機
の組み合わせの工夫、及び請求項３のような生存信号の
内容の工夫を、二重化ＬＡＮに適用することにより、生
存信号そのものの受信状態と、送信元の計算機での生存
信号の受信状態とを組み合わせて故障発見を行う。これ
により、平常時及び異常発生時に交換される生存信号の
数を最小にできるとともに、１つの計算機の故障を、該
故障計算機の近傍の複数の計算機により発見が可能とな
り、故障発生からより短い遅れ時間で故障を発見できる
可能性が高くなる。In the failure detection method for a distributed computer system according to the seventh aspect of the invention, a plurality of computers are arranged on a virtual ring. Each computer periodically transmits a survival signal via the first LAN to the computer adjacent to it in a specific direction on the virtual ring and, at the same time, periodically transmits a survival signal via the second LAN to the computer adjacent to it in the opposite direction. Each computer also checks whether it has received the survival signal transmitted from an adjacent computer and writes the result, as a response to that signal, into the survival signal it transmits to the adjacent computer. Each computer identifies the failure location by combining the presence or absence of survival signals from both of its neighbors on the virtual ring with the contents of the responses. Furthermore, each computer notifies all computers with which it can communicate of information on the discovered failure. That is, periodic transmission of survival signals, the arrangement of destination computers, and the arrangement of survival signal contents as in claim 3 are applied to a duplicated LAN, so that failures are detected by combining the reception state of the survival signal itself with the reception state of survival signals at the source computer. This minimizes the number of survival signals exchanged both in normal operation and when an abnormality occurs, and because the failure of one computer can be found by several computers near it, the failure is more likely to be found with a shorter delay after it occurs.

【００５４】請求項８の発明における分散計算機システ
ムの故障検出方法は、各計算機は、隣接する計算機に定
期的な生存信号を送信する生存信号送信ステップにおい
て、隣接する計算機から送信された生存信号に対する応
答とともに、隣接する計算機とは異なるもう一方の隣接
する計算機からの応答をコピーしたものも書き込む。こ
のように、定期的な生存信号の送信と、送信先計算機の
組み合わせの工夫、及び請求項３のような生存信号の内
容の工夫を、二重化ＬＡＮに適用することにより、生存
信号そのものの受信状態と、送信元の計算機での生存信
号の受信状態とを組み合わせて故障発見を行う。これに
より、平常時及び異常発生時に交換される生存信号の数
を最小にできるとともに、１つの計算機の故障を、該故
障計算機の近傍の複数の計算機により発見が可能とな
り、故障発生からより短い遅れ時間で故障を発見できる
可能性が高くなる。In the failure detection method for a distributed computer system according to the eighth aspect of the invention, in the survival signal transmitting step in which each computer periodically transmits a survival signal to an adjacent computer, the computer writes into the signal, together with its response to the survival signal received from that adjacent computer, a copy of the response received from its other adjacent computer. In this way, periodic transmission of survival signals, the arrangement of destination computers, and the arrangement of survival signal contents as in claim 3 are applied to a duplicated LAN, so that failures are detected by combining the reception state of the survival signal itself with the reception state of survival signals at the source computer. This minimizes the number of survival signals exchanged both in normal operation and when an abnormality occurs, and because the failure of one computer can be found by several computers near it, the failure is more likely to be found with a shorter delay after it occurs.

【００５５】請求項９の発明における分散計算機システ
ムの故障検出方法は、複数の計算機を仮想的な仮想リン
グ上に配置する。各計算機を、仮想リング上での特定の
計算機から特定の方向における順番によって、偶数番
目、奇数番目に分け、奇数番目の計算機は、第１のＬＡ
Ｎを介して仮想リング上の両隣の計算機に定期的に生存
信号を送信し、偶数番目の計算機は、第２のＬＡＮを介
して仮想リング上の両隣の計算機に定期的に生存信号を
送信する。各計算機は、第１または第２のＬＡＮを介し
て、両隣から送信される生存信号を受信したか否かを調
べ、その結果を組み合わせることにより故障箇所を特定
し、各計算機は、発見した故障に関する情報を、通信し
得る全ての計算機に通知する。このように、定期的な生
存信号の送信と、送信先計算機の組み合わせの工夫、及
び請求項３のような生存信号の内容の工夫を、二重化Ｌ
ＡＮに適用することにより、生存信号そのものの受信状
態と、送信元の計算機での生存信号の受信状態とを組み
合わせて故障発見を行う。これにより、平常時及び異常
発生時に交換される生存信号の数を最小にできるととも
に、１つの計算機の故障を、該故障計算機の近傍の複数
の計算機により発見が可能となり、故障発生からより短
い遅れ時間で故障を発見できる可能性が高くなる。In the failure detection method for a distributed computer system according to the ninth aspect of the invention, a plurality of computers are arranged on a virtual ring. The computers are divided into even-numbered and odd-numbered computers according to their order in a specific direction from a specific computer on the virtual ring. The odd-numbered computers periodically transmit survival signals via the first LAN to the computers on both sides of them on the virtual ring, and the even-numbered computers periodically transmit survival signals via the second LAN to the computers on both sides of them on the virtual ring. Each computer checks whether it has received, via the first or second LAN, the survival signals transmitted from both of its neighbors and identifies the failure location by combining the results, and each computer notifies all computers with which it can communicate of information on the discovered failure. In this way, periodic transmission of survival signals, the arrangement of destination computers, and the arrangement of survival signal contents as in claim 3 are applied to a duplicated LAN, so that failures are detected by combining the reception state of the survival signal itself with the reception state of survival signals at the source computer. This minimizes the number of survival signals exchanged both in normal operation and when an abnormality occurs, and because the failure of one computer can be found by several computers near it, the failure is more likely to be found with a shorter delay after it occurs.

【００５６】請求項１０の発明における分散計算機シス
テムの故障検出方法は、各計算機を、３台ずつの複数の
グループに分割し、各グループにおいて、第１の計算機
が、第２の計算機に第１のＬＡＮを介して定期的に生存
信号を送信するとともに、第３の計算機に第２のＬＡＮ
を介して定期的に生存信号を送信する。また、各グルー
プにおいて、第２の計算機は、第１の計算機からの生存
信号を受信したか否かを調べ、その結果を、第３の計算
機に第２のＬＡＮを介して定期的に送信する生存信号に
書き込む。第３の計算機は、第１の計算機からの生存信
号を受信したか否かを調べ、その結果を、第２の計算機
に第１のＬＡＮを介して定期的に送信する生存信号に書
き込む。第２の計算機は、第１及び第３の計算機から送
信される生存信号の有無と内容を調べ、それらの結果を
組み合わせることにより、故障箇所を特定し、第３の計
算機は、第１及び第２の計算機から送信される生存信号
の有無と内容を調べ、それらの結果を組み合わせること
により、故障箇所を特定する。そして、各計算機は、発
見した故障に関する情報を、通信し得る全ての計算機に
通知する。このように、定期的な生存信号の送信と、送
信先計算機の組み合わせの工夫、及び請求項３のような
生存信号の内容の工夫を、二重化ＬＡＮに適用すること
により、生存信号そのものの受信状態と、送信元の計算
機での生存信号の受信状態とを組み合わせて故障発見を
行う。これにより、平常時及び異常発生時に交換される
生存信号の数を最小にできるとともに、１つの計算機の
故障を、該故障計算機の近傍の複数の計算機により発見
が可能となり、故障発生からより短い遅れ時間で故障を
発見できる可能性が高くなる。In the failure detection method for a distributed computer system according to the tenth aspect of the invention, the computers are divided into a plurality of groups of three. In each group, the first computer periodically transmits a survival signal to the second computer via the first LAN and periodically transmits a survival signal to the third computer via the second LAN. In each group, the second computer checks whether it has received the survival signal from the first computer and writes the result into the survival signal that it periodically transmits to the third computer via the second LAN. The third computer checks whether it has received the survival signal from the first computer and writes the result into the survival signal that it periodically transmits to the second computer via the first LAN. The second computer examines the presence or absence and the contents of the survival signals transmitted from the first and third computers and identifies the failure location by combining the results; the third computer does the same with the survival signals transmitted from the first and second computers. Each computer then notifies all computers with which it can communicate of information on the discovered failure. In this way, periodic transmission of survival signals, the arrangement of destination computers, and the arrangement of survival signal contents as in claim 3 are applied to a duplicated LAN, so that failures are detected by combining the reception state of the survival signal itself with the reception state of survival signals at the source computer. This minimizes the number of survival signals exchanged both in normal operation and when an abnormality occurs, and because the failure of one computer can be found by several computers near it, the failure is more likely to be found with a shorter delay after it occurs.

【００５７】請求項１１の発明における分散計算機シス
テムの故障検出方法は、故障発生時に、各計算機の仮想
的な配置を新たに設定し直す再配置ステップをさらに実
行する。各計算機の現在の稼働状況に合わせて、それぞ
れの計算機の送受信先を設定する方法を、各計算機が備
えることにより、故障発生時、故障計算機の復旧時、ま
たは新しい計算機の増設時に、各計算機の送信先を変化
させ、システムの構成変化が生じてもそれ以前と同様な
故障検出能力を維持する。In the fault detecting method for the distributed computer system according to the eleventh aspect of the present invention, when a fault occurs, a relocation step of newly setting the virtual placement of each computer is further executed. Each computer is equipped with a method of setting the transmission and reception destination of each computer according to the current operating status of each computer, so that when a failure occurs, when a failed computer is restored, or when a new computer is added, each computer Even if the destination is changed and the system configuration changes, the same failure detection capability as before is maintained.
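The relocation step can be pictured as rebuilding the ring from the computers currently in service and recomputing every destination. The sketch below assumes a simple ring and hypothetical `failed` and `added` arguments; how the new arrangement is agreed among the computers is not covered here.

```python
# Hypothetical sketch of the relocation step on a virtual ring.
# `failed` and `added` are illustrative parameters, not names from the patent.

def reconfigure(ring, failed=(), added=()):
    """Return a new ring without failed computers and with new or restored ones appended."""
    failed = set(failed)
    survivors = [n for n in ring if n not in failed]
    return survivors + [n for n in added if n not in survivors]

def destinations(ring):
    """Map each computer to its survival-signal destination on the ring."""
    return {n: ring[(i + 1) % len(ring)] for i, n in enumerate(ring)}
```

After computer 102 fails, `destinations(reconfigure([101, 102, 103, 104], failed=[102]))` sends computer 101's signal to 103, so the remaining three computers still form a closed ring.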

【００５８】請求項１２の発明における分散計算機シス
テムの故障検出方法は、故障発生時に、各計算機の仮想
的な配置を新たに設定し直す再配置ステップをさらに実
行する。各計算機の現在の稼働状況に合わせて、それぞ
れの計算機の送受信先を設定する方法を、各計算機が備
えることにより、故障発生時、故障計算機の復旧時、ま
たは新しい計算機の増設時に、各計算機の送信先を変化
させ、システムの構成変化が生じてもそれ以前と同様な
故障検出能力を維持する。According to the twelfth aspect of the present invention, in the method of detecting a failure in a distributed computer system, when a failure occurs, a relocation step of newly setting the virtual layout of each computer is further executed. Each computer is equipped with a method of setting the transmission and reception destination of each computer according to the current operating status of each computer, so that when a failure occurs, when a failed computer is restored, or when a new computer is added, each computer Even if the destination is changed and the system configuration changes, the same failure detection capability as before is maintained.

【００５９】請求項１３の発明における分散計算機シス
テムの故障検出方法は、故障発生時に、各計算機の仮想
的な配置を新たに設定し直す再配置ステップをさらに実
行する。各計算機の現在の稼働状況に合わせて、それぞ
れの計算機の送受信先を設定する方法を、各計算機が備
えることにより、故障発生時、故障計算機の復旧時、ま
たは新しい計算機の増設時に、各計算機の送信先を変化
させ、システムの構成変化が生じてもそれ以前と同様な
故障検出能力を維持する。According to a thirteenth aspect of the present invention, there is provided a distributed computer system failure detection method, further comprising a relocation step of re-setting a virtual layout of each computer when a failure occurs. Each computer is equipped with a method of setting the transmission and reception destination of each computer according to the current operating status of each computer, so that when a failure occurs, when a failed computer is restored, or when a new computer is added, each computer Even if the destination is changed and the system configuration changes, the same failure detection capability as before is maintained.

【００６０】請求項１４の発明における分散計算機シス
テムの故障検出方法は、故障発生時に、各計算機の仮想
的な配置を新たに設定し直す再配置ステップをさらに実
行する。各計算機の現在の稼働状況に合わせて、それぞ
れの計算機の送受信先を設定する方法を、各計算機が備
えることにより、故障発生時、故障計算機の復旧時、ま
たは新しい計算機の増設時に、各計算機の送信先を変化
させ、システムの構成変化が生じてもそれ以前と同様な
故障検出能力を維持する。In a distributed computer system failure detection method according to a fourteenth aspect of the present invention, when a failure occurs, a relocation step of newly setting the virtual layout of each computer is further executed. Each computer is equipped with a method of setting the transmission and reception destination of each computer according to the current operating status of each computer, so that when a failure occurs, when a failed computer is restored, or when a new computer is added, each computer Even if the destination is changed and the system configuration changes, the same failure detection capability as before is maintained.

【００６１】請求項１５の発明における分散計算機シス
テムの故障検出方法は、検出された故障情報を隣接計算
機に通知する故障通知ステップにおいて、生存信号を利
用する。故障情報の通知に、各計算機が送信する生存信
号を利用するため、通知のために余分な信号を送信する
必要がなく、ＬＡＮにかかる負荷を小さくすることがで
きる。According to a fifteenth aspect of the present invention, a distributed computer system fault detecting method uses a survival signal in a fault notifying step of notifying an adjacent computer of the detected fault information. Since the survival signal transmitted by each computer is used for notification of failure information, it is not necessary to transmit an extra signal for notification, and the load on the LAN can be reduced.

【００６２】請求項１６の発明における分散計算機シス
テムの故障検出方法は、検出された故障情報を隣接計算
機に通知する故障通知ステップにおいて、生存信号を利
用する。故障情報の通知に、各計算機が送信する生存信
号を利用するため、通知のために余分な信号を送信する
必要がなく、ＬＡＮにかかる負荷を小さくすることがで
きる。According to a sixteenth aspect of the present invention, there is provided a method for detecting a failure in a distributed computer system, wherein a live signal is used in a failure notification step of notifying the adjacent computer of the detected failure information. Since the survival signal transmitted by each computer is used for notification of failure information, it is not necessary to transmit an extra signal for notification, and the load on the LAN can be reduced.

【００６３】請求項１７の発明における分散計算機シス
テムの故障検出方法は、ＬＡＮを２本ずつペアにし、各
ペアごとに請求項６から請求項１０、請求項１２、請求
項１４、請求項１６の故障検出方法のうちのいずれかを
用いる。３本以上のＬＡＮを２本ずつの組にし、各組に
対して請求項６から請求項１０、請求項１２、請求項１
４、請求項１６の方法を適用することにより、任意の本
数のＬＡＮを持つシステムに、上記発明を適用可能とす
る。In the failure detection method for a distributed computer system according to the seventeenth aspect of the invention, the LANs are grouped into pairs of two, and any one of the failure detection methods of claims 6 to 10, claim 12, claim 14, and claim 16 is used for each pair. By grouping three or more LANs into sets of two and applying the methods of claims 6 to 10, claim 12, claim 14, and claim 16 to each set, the above inventions become applicable to a system having an arbitrary number of LANs.

【００６４】請求項１８の発明における分散計算機シス
テムの故障検出方法は、ＬＡＮを２本ずつペアにし、各
ペアごとに請求項６から請求項１０、請求項１２、請求
項１４、請求項１６の故障検出方法のうちのいずれかを
用い、余った１本については、請求項１から請求項５、
請求項１１、請求項１３、請求項１５の故障検出方法の
うちのいずれかを用いる。３本以上のＬＡＮを２本ずつ
の組にし、各組に対して請求項１から請求項５、請求項
１１、請求項１３、請求項１５の方法、または、請求項
６から請求項１０、請求項１２、請求項１４、請求項１
６の方法を適用することにより、任意の本数のＬＡＮを
持つシステムに上記発明を適用可能とする。In the failure detection method for a distributed computer system according to the eighteenth aspect of the invention, the LANs are grouped into pairs of two, any one of the failure detection methods of claims 6 to 10, claim 12, claim 14, and claim 16 is used for each pair, and any one of the failure detection methods of claims 1 to 5, claim 11, claim 13, and claim 15 is used for the one remaining LAN. By grouping three or more LANs into sets of two and applying to each set the methods of claims 1 to 5, claim 11, claim 13, and claim 15, or the methods of claims 6 to 10, claim 12, claim 14, and claim 16, the above inventions become applicable to a system having an arbitrary number of LANs.

【００６５】請求項１９の発明における分散計算機シス
テムの故障検出方法は、ＬＡＮを２本ずつペアにし、
（２Ｎ＋１）本目のＬＡＮといずれかのＬＡＮによりさ
らに１つのペアを作り、各ペアごとに請求項６から請求
項１０、請求項１２、請求項１４、請求項１６の故障検
出方法のうちのいずれかを用いる。３本以上のＬＡＮを
２本ずつの組にし、各組に対して請求項６から請求項１
０、請求項１２、請求項１４、請求項１６の方法を適用
することにより、任意の本数のＬＡＮを持つシステム
に、上記発明を適用可能とする。In the failure detection method for a distributed computer system according to the nineteenth aspect of the invention, the LANs are grouped into pairs of two, one further pair is formed from the (2N + 1)th LAN and any one of the other LANs, and any one of the failure detection methods of claims 6 to 10, claim 12, claim 14, and claim 16 is used for each pair. By grouping three or more LANs into sets of two and applying the methods of claims 6 to 10, claim 12, claim 14, and claim 16 to each set, the above inventions become applicable to a system having an arbitrary number of LANs.
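The pairing rules of the seventeenth to nineteenth aspects can be summarized in a short sketch. The `odd_strategy` parameter and the choice of the first LAN as the shared partner are assumptions for illustration; the text allows any LAN to be shared.

```python
# Hypothetical sketch: group LANs into pairs, handling an odd count in two ways.
# "single": the leftover LAN is handled on its own (eighteenth aspect).
# "share":  the leftover LAN is paired with an already-paired LAN (nineteenth aspect).

def pair_lans(lans, odd_strategy="single"):
    """Return (pairs, leftover); leftover is None unless one LAN stays unpaired."""
    pairs = [(lans[i], lans[i + 1]) for i in range(0, len(lans) - 1, 2)]
    leftover = lans[-1] if len(lans) % 2 else None
    if leftover is None:
        return pairs, None
    if odd_strategy == "share":
        # Assumption: share the first LAN; this presumes at least three LANs.
        return pairs + [(leftover, lans[0])], None
    return pairs, leftover
```

With three LANs A, B, and C, the "single" strategy gives the pair (A, B) plus C on its own, while "share" gives the pairs (A, B) and (C, A), in which LAN A carries signals for both pairs.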

【００６６】請求項２０の発明における分散計算機シス
テムの故障検出方法は、２つのペアで共有されているＬ
ＡＮにおいて、それぞれのペアにおいて送信される生存
信号を１つにまとめる。請求項１９の故障検出方法にお
いて、２つのＬＡＮの組で共有されているＬＡＮにおい
て、それぞれの組で用いられる生存信号を１つにまとめ
ることにより、交換される生存信号の数を少なくする。In the failure detection method for a distributed computer system according to the twentieth aspect of the invention, on a LAN shared by two pairs, the survival signals transmitted for the respective pairs are combined into one. In the failure detection method of claim 19, combining into one the survival signals used by each pair on a LAN shared by two LAN pairs reduces the number of survival signals exchanged.

【００６７】請求項２１の発明における分散計算機シス
テムの故障検出方法は、仮想配置ステップにおいて、相
互に通信する頻度の高い計算機を、仮想的な配置におい
て近接するように配置する。本来の業務において相互に
通信する可能性の高い計算機を、論理的に近い位置に配
置することにより、ある計算機の故障情報が、このよう
な計算機に早く伝えられる。これにより、故障発生が本
来の業務に及ぼす影響を少なくする。In the failure detecting method for the distributed computer system according to the twenty-first aspect of the invention, in the virtual arranging step, the computers that frequently communicate with each other are arranged so as to be close to each other in the virtual arrangement. By arranging computers that are highly likely to communicate with each other in their original business at positions that are logically close to each other, the failure information of a certain computer can be quickly transmitted to such a computer. As a result, the influence of the failure occurrence on the original work is reduced.

【００６８】請求項２２の発明における分散計算機シス
テムの故障検出方法は、仮想配置ステップにおいて、信
頼性の高い計算機と信頼性の低い計算機を、仮想的な配
置において交互に並べる。故障検出の必要性の高い計算
機を、信頼性の高い計算機に隣接させることにより、後
者が前者の生存信号をチェックするよう配置する。これ
により、故障検出の必要性の高い計算機の故障を確実に
検出することができる。In the failure detection method for a distributed computer system according to the twenty-second aspect of the invention, in the virtual arrangement step, highly reliable computers and less reliable computers are arranged alternately in the virtual arrangement. By placing a computer for which failure detection is highly necessary next to a highly reliable computer, the arrangement ensures that the highly reliable computer checks the survival signal of the other. As a result, a failure of a computer for which failure detection is highly necessary can be detected reliably.

【００６９】請求項２３の発明における分散計算機シス
テムの故障検出方法は、仮想配置ステップにおいて、信
頼性の高い計算機と機能的に重要な計算機を、仮想的な
配置において交互に並べる。故障検出の必要性の高い計
算機を、信頼性の高い計算機に隣接させることにより、
後者が前者の生存信号をチェックするよう配置する。こ
れにより、故障検出の必要性の高い計算機の故障を確実
に検出することができる。In the failure detection method for a distributed computer system according to the twenty-third aspect of the invention, in the virtual arrangement step, highly reliable computers and functionally important computers are arranged alternately in the virtual arrangement. By placing a computer for which failure detection is highly necessary next to a highly reliable computer, the arrangement ensures that the highly reliable computer checks the survival signal of the other. As a result, a failure of a computer for which failure detection is highly necessary can be detected reliably.

【００７０】請求項２４の発明における分散計算機シス
テムの故障検出方法は、一部または全ての生存信号につ
いて、その送信時刻または受信期限を、各計算機が特定
の生存信号を受信した時刻を基準にして設定する。生存
信号の送受信時刻を、自分自身が生存信号を受信した時
刻を基準に設定することにより、同期的に生存信号を交
換する。これにより、生存信号の送信と受信時刻の関係
を、要求される故障発見の特性に合わせて、自由に設定
することができる。In the failure detecting method for the distributed computer system according to the twenty-fourth aspect of the invention, the transmission time or reception deadline of some or all of the survival signals is set relative to the time at which each computer receives a specific survival signal. Because each computer sets its transmission and reception times relative to the time at which it itself received a survival signal, the survival signals are exchanged synchronously. As a result, the relationship between the transmission and reception times of the survival signals can be set freely to suit the required failure-detection characteristics.

【００７１】[0071]

【Example】

実施例１．以下、この発明の一実施例を図について説明
する。図１は、この発明の一実施例による分散計算機シ
ステムの物理的な構成を示すブロック図であり、図にお
いて、１０１〜１０４は、計算機である。各計算機は、
それぞれ通信インターフェース２１１〜２１４と、ケー
ブル３１１〜３１４とを介して、ＬＡＮ（ローカルエリ
アネットワーク）４０１に接続されている。Example 1. An embodiment of the present invention will be described below with reference to the drawings. FIG. 1 is a block diagram showing the physical configuration of a distributed computer system according to an embodiment of the present invention. In the figure, 101 to 104 are computers. Each computer is connected to a LAN (local area network) 401 via the corresponding communication interface 211 to 214 and cable 311 to 314.

【００７２】図２は、この実施例による分散計算機シス
テムの仮想的な仮想リングを示す図であり、１０は、仮
想リングである。分散計算機システムは、故障検出を行
うため、計算機１０１〜１０４が、図２に示すような仮
想的な仮想リング１０上に配置される。各計算機は、仮
想リング上で右回りに１０１、１０２、１０３、１０４
の順に並んで配置されているが、この順序は物理的な位
置関係とは無関係に設定され得る。仮想リング上での計
算機の配置方法として、互いの通信の頻度の高い計算機
同士を近接させる方法、信頼性の高い計算機と低い計算
機を交互に配置する方法、並びに、信頼性の高い計算機
と重要な機能を担当する計算機を交互に配置する方法が
考えられる。FIG. 2 is a diagram showing the virtual ring of the distributed computer system according to this embodiment, in which 10 is the virtual ring. For failure detection, the computers 101 to 104 are arranged on the virtual ring 10 as shown in FIG. 2. The computers are arranged clockwise on the virtual ring in the order 101, 102, 103, 104, but this order can be set independently of their physical positions. Possible methods of arranging the computers on the virtual ring include placing computers that communicate with each other frequently close together, alternating highly reliable computers with less reliable ones, and alternating highly reliable computers with computers in charge of important functions.
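
The virtual ring can be pictured as an ordered list with wrap-around. The following is a minimal sketch, not the patent's implementation: the names RING, right_neighbor and left_neighbor are hypothetical, and it assumes (as the operation description later does) that the "right" neighbor is the next computer clockwise.

```python
# Hypothetical helper names; the patent does not prescribe an implementation.
RING = [101, 102, 103, 104]  # clockwise order on the virtual ring (FIG. 2)


def right_neighbor(ring, computer):
    """Return the next computer clockwise (wraps around at the end of the list)."""
    i = ring.index(computer)
    return ring[(i + 1) % len(ring)]


def left_neighbor(ring, computer):
    """Return the next computer counterclockwise (wraps around at the start)."""
    i = ring.index(computer)
    return ring[(i - 1) % len(ring)]
```

Because the ring is purely logical, changing the arrangement only means reordering the list; the physical LAN wiring is unaffected.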

【００７３】ところで、動作の説明において後で詳細に
述べるように、この実施例のみならず、この発明による
分散計算機システムの故障検出方法は以下のような基本
的な特徴を備えるように構成される。As will be described in detail later in the explanation of the operation, the failure detecting method for the distributed computer system according to the present invention, in this embodiment and in general, is configured to have the following basic features.

【００７４】（１）ある計算機の故障は、仮想的な配置
関係において隣接する計算機が発見する。（２）生存信号を用いて故障情報を伝える。従って、仮
想的な配置関係において故障計算機を発見した計算機に
近くにある計算機ほど、故障情報が早く伝わる。(1) A failure of a computer is discovered by the computer adjacent to it in the virtual arrangement. (2) Failure information is conveyed using the survival signal. Consequently, the closer a computer is, in the virtual arrangement, to the computer that discovered the failure, the sooner the failure information reaches it.

【００７５】前記した互いの通信の頻度の高い計算機同
士を近接させる方法は、この２つの基本的な特徴を利用
するものであり、これにより、ある計算機が故障した際
に、本来の業務を行うためにその計算機と通信する頻度
の高い計算機は、早期に故障情報を受け取ることがで
き、故障を知らずに通信を継続して実施して本来の業務
を停止してしまうことを防止することができる。従っ
て、故障により計算機本来の業務に与える影響を小さく
することができる。The above-mentioned method of placing computers that communicate with each other frequently close together exploits these two basic features. When a computer fails, the computers that communicate with it frequently in the course of their normal work receive the failure information early, which prevents them from continuing to communicate without knowing of the failure and thereby stalling their normal work. The impact of a failure on the computers' normal work can therefore be reduced.

【００７６】また、信頼性の高い計算機と低い計算機を
交互に配置する方法、並びに、信頼性の高い計算機と重
要な機能を担当する計算機を交互に配置する方法は、上
記（１）の特徴を利用しており、それ故、信頼性の高い
計算機は、隣接する信頼性の低い計算機、または、重要
な機能を担当する計算機の故障を確実に発見することが
期待される。従って、このような仮想的な配置関係を構
築することにより、隣接する信頼性の低い計算機、また
は、重要な機能を担当する計算機のような故障検出の必
要性の高い計算機の故障を確実に検出できる。The methods of alternating highly reliable computers with less reliable ones, and of alternating highly reliable computers with computers in charge of important functions, exploit feature (1) above: a highly reliable computer can be expected to reliably discover a failure of the adjacent less reliable computer or of the adjacent computer in charge of an important function. By constructing such a virtual arrangement, failures of computers with a high need for failure detection, such as less reliable computers or computers in charge of important functions, can be detected reliably.
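
The alternating arrangement can be illustrated with a short sketch. The function name alternate_ring and the split of the computers into two lists are assumptions made for illustration; the patent only states that the two kinds of computer are placed alternately.

```python
from itertools import zip_longest


def alternate_ring(reliable, needs_watching):
    """Interleave reliable computers with computers that most need failure detection.

    The result is a clockwise ring order: reliable, watched, reliable, watched, ...
    """
    ring = []
    for r, w in zip_longest(reliable, needs_watching):
        if r is not None:
            ring.append(r)
        if w is not None:
            ring.append(w)
    return ring


# Hypothetical classification: 101 and 103 are reliable, 102 and 104 need watching.
ring = alternate_ring([101, 103], [102, 104])
```

With equal numbers of both kinds, every computer that needs watching has reliable computers on both sides, so whichever neighbor checks its survival signal is a reliable one.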

【００７７】次に動作について説明する。図３は、分散
計算機システムの動作を示すフローチャートであり、図
４は、図２の如く仮想リング上に配置された計算機の生
存信号の送受信の様子を説明するためのブロック図であ
り、図５は、計算機１０２のケーブル３１２が故障した
場合の分散計算機システムの動作を示す図であり、図６
は、故障後の再構成された分散計算機システムを示す図
である。以下、これらの図を参照しながら、また、図３
に示すフローチャートの各ステップと対応させながら、
分散計算機システムの計算機の動作を説明する。Next, the operation will be described. FIG. 3 is a flowchart showing the operation of the distributed computer system. FIG. 4 is a block diagram for explaining how the computers arranged on the virtual ring as shown in FIG. 2 transmit and receive survival signals. FIG. 5 is a diagram showing the operation of the distributed computer system when the cable 312 of the computer 102 fails, and FIG. 6 is a diagram showing the reconfigured distributed computer system after the failure. The operation of the computers of the distributed computer system will be described below with reference to these figures and in correspondence with the steps of the flowchart shown in FIG. 3.

【００７８】図４に示すように、計算機１０１〜１０４
は、それぞれＬＡＮ４０１を介して、仮想リング上で右
隣に位置する計算機に対して定期的に生存信号を送信す
る。具体的には、生存信号を定期的に送るための送信タ
イマが０か否かをチェックして（ステップＳＴ１）、送
信タイマが０ならば、即ち予め定められたタイムアウト
時間を経過しているならば、生存信号を送信して、タイ
ムアウト時間を設定して生存信号の送信タイマをセット
する（ステップＳＴ２）。As shown in FIG. 4, each of the computers 101 to 104 periodically transmits a survival signal via the LAN 401 to the computer located to its right on the virtual ring. Specifically, each computer checks whether the transmission timer for periodically sending the survival signal is 0 (step ST1). If the transmission timer is 0, that is, if the predetermined timeout time has elapsed, the computer transmits the survival signal and sets the survival signal transmission timer to the timeout time again (step ST2).
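
Steps ST1 and ST2 amount to a countdown timer that triggers a send when it reaches 0. The sketch below is one way to model this under stated assumptions: the class name HeartbeatSender, the tick-based time unit and the send callback are all hypothetical and are not taken from the patent.

```python
class HeartbeatSender:
    """Countdown timer for periodic survival signals (steps ST1 and ST2)."""

    def __init__(self, period, send):
        self.period = period  # timeout time in ticks (assumed unit, must be >= 1)
        self.timer = 0        # 0 means "time to send"
        self.send = send      # callback that transmits one survival signal

    def tick(self):
        """Advance one tick; send and re-arm the timer when it has reached 0."""
        if self.timer == 0:           # ST1: has the transmission timer expired?
            self.send()               # ST2: transmit the survival signal
            self.timer = self.period  # ST2: set the timer to the timeout time
        self.timer -= 1


sent = []
sender = HeartbeatSender(period=3, send=lambda: sent.append("alive"))
for _ in range(7):
    sender.tick()
```

In this run the signal goes out on ticks 1, 4 and 7, that is, once per period.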

【００７９】通常、計算機は平常モードで動作しており
（ステップＳＴ３）、各計算機は、左隣の計算機からの
生存信号が、一定時間ごとに受信されるかを調べ（ステ
ップＳＴ４）、受信した場合に予め定められたタイムア
ウト時間を設定して生存信号の受信タイマをセットない
しリセットする（ステップＳＴ５）。さらに、生存信号
に故障情報が付加されているかチェックする（ステップ
ＳＴ６）。Normally, the computers operate in the normal mode (step ST3). Each computer checks whether the survival signal from the computer on its left is received at regular intervals (step ST4) and, when it is received, sets or resets the survival signal reception timer to the predetermined timeout time (step ST5). It then checks whether failure information is attached to the survival signal (step ST6).

【００８０】ステップＳＴ４において、計算機が左隣の
計算機から生存信号を受信しなかった場合、以下のよう
な故障が発生している可能性がある。In step ST4, when the computer does not receive the survival signal from the computer on the left, there is a possibility that the following failure has occurred.

【００８１】
１）左隣の計算機の故障
２）左隣の計算機の通信インターフェースの故障
３）左隣の計算機とＬＡＮを接続するケーブルの故障
４）自分自身とＬＡＮを接続するケーブルの故障
５）自分自身の通信インターフェースの故障
1) Failure of the computer on the left
2) Failure of the communication interface of the computer on the left
3) Failure of the cable connecting the computer on the left to the LAN
4) Failure of the cable connecting the computer itself to the LAN
5) Failure of the computer's own communication interface

【００８２】以下、図５に示すように、計算機１０２の
通信インターフェース２１２とＬＡＮ４０１とを接続す
るケーブル３１２が故障しているものとして、以下の分
散計算機システムの動作を説明する。The operation of the distributed computer system is described below for the case, shown in FIG. 5, in which the cable 312 connecting the communication interface 212 of the computer 102 to the LAN 401 has failed.

【００８３】計算機１０２、１０３は、それぞれ左隣の
計算機１０１、１０２からの生存信号を受信できなくな
る。計算機１０２、１０３は、生存信号の受信に失敗す
ると、まず、生存信号の受信タイマが０か否か、即ち所
定の時間、生存信号を受信しなかったか否かを判断し
（ステップＳＴ１０）、受信タイマが０ならば上記１）
〜５）のいずれかの故障が発生したものと判断し、故障
検出モードに移行して、計算機１０２、１０３は、左隣
の計算機以外の計算機を選び、該計算機に対して、生存
信号の送信を要求する信号を送るとともに、予め定めら
れたタイムアウト時間を設定して生存信号受信タイマを
リセットする（ステップＳＴ１１）。尚、受信タイマが
０である場合は、受信タイマがタイムアウトしている場
合に対応する。ステップＳＴ１１の後、ステップＳＴ１
に戻り、生存信号の送信タイマは０ではなく故障検出モ
ードであるので、ステップＳＴ３からステップＳＴ１４
に移行する。The computers 102 and 103 become unable to receive the survival signals from the computers on their left, 101 and 102 respectively. When reception of a survival signal fails, the computers 102 and 103 first determine whether the survival signal reception timer is 0, that is, whether no survival signal has been received for the predetermined time (step ST10). If the reception timer is 0, each of them judges that one of the failures 1) to 5) above has occurred and shifts to the failure detection mode: it selects a computer other than its left neighbor, sends that computer a signal requesting transmission of a survival signal, and resets the survival signal reception timer to the predetermined timeout time (step ST11). A reception timer value of 0 corresponds to the reception timer having timed out. After step ST11, processing returns to step ST1; since the survival signal transmission timer is not 0 and the computer is in the failure detection mode, processing moves from step ST3 to step ST14.

【００８４】計算機１０２は、ケーブル３１２の故障の
ため、この要求に対する生存信号を受信できない。従っ
て、計算機１０２は、上記した４）または５）の故障が
発生したと判断し、ステップＳＴ１９において生存信号
の受信タイマが０か否かを判断した後、自分自身を再起
動するなどの処置を行う（ステップＳＴ２０）。Because of the failure of the cable 312, the computer 102 cannot receive a survival signal in response to this request. The computer 102 therefore judges that failure 4) or 5) above has occurred and, after determining in step ST19 whether the survival signal reception timer is 0, takes a measure such as restarting itself (step ST20).

【００８５】一方、図５に示す計算機１０３は、生存信
号の送信要求を送ることにより、ステップＳＴ１１で選
択した、左隣の計算機以外の計算機から応答を受信する
ことができるため、１）〜３）の故障が発生したと判断
する（ステップＳＴ１４）。次に、計算機１０３は、計
算機１０２を取り除いた新たな仮想リングを作成する。
新しい仮想リングでは、図６に示すように、計算機は右
回りに１０１、１０３、１０４の順で並んでいる。この
ように、故障や復旧によって計算機の構成が変化した場
合に、計算機の仮想的な配置を変更する処理を、再構成
という。再構成を行うことにより、分散計算機システム
は故障発生前と同様の故障検出能力を発揮できる。ま
た、この結果、計算機１０１は生存信号の送信先計算機
が、計算機１０２から計算機１０３に変わるため、計算
機１０３は計算機１０１に生存信号の送信先の変更要求
を送信する（ステップＳＴ１５）。The computer 103 shown in FIG. 5, on the other hand, receives a response to its survival signal transmission request from the computer selected in step ST11 (a computer other than its left neighbor), and therefore judges that one of the failures 1) to 3) has occurred (step ST14). Next, the computer 103 creates a new virtual ring from which the computer 102 has been removed. In the new virtual ring, as shown in FIG. 6, the computers are arranged clockwise in the order 101, 103, 104. The process of changing the virtual arrangement of the computers when the configuration changes because of a failure or a recovery is called reconfiguration. By performing reconfiguration, the distributed computer system retains the same failure detection capability as before the failure. As a result of the reconfiguration, the destination of the survival signal sent by the computer 101 changes from the computer 102 to the computer 103, so the computer 103 sends the computer 101 a request to change its survival signal destination (step ST15).
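
The decision made in the failure detection mode, and the reconfiguration that follows, can be summarized in a few lines. This is a sketch under stated assumptions: the function names diagnose and reconfigure and the returned labels are hypothetical, and the two boolean inputs stand for the outcomes of the timer checks described in the text.

```python
def diagnose(heard_from_left, heard_from_other):
    """Classify a missed survival signal (failure causes 1) to 5) in the text).

    heard_from_left:  the left neighbor's survival signal arrived in time
    heard_from_other: another computer answered the survival signal request
    """
    if heard_from_left:
        return "ok"
    if heard_from_other:
        # Our own LAN connection works, so the fault is on the left neighbor's side.
        return "left-neighbor-side failure"  # causes 1) to 3)
    # Nobody can be reached, so the fault is on our own side.
    return "own-side failure"  # causes 4) or 5)


def reconfigure(ring, failed):
    """Return a new clockwise ring with the failed computer removed."""
    return [c for c in ring if c != failed]


# Scenario of FIG. 5: the cable 312 of the computer 102 has failed.
verdict_102 = diagnose(heard_from_left=False, heard_from_other=False)
verdict_103 = diagnose(heard_from_left=False, heard_from_other=True)
new_ring = reconfigure([101, 102, 103, 104], failed=102)
```

Here the computer 102 concludes that the fault is on its own side, the computer 103 concludes that the fault is on the side of its left neighbor, and the new ring is 101, 103, 104 as in FIG. 6.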

【００８６】さらに、計算機１０３は、他の通信し得る
計算機全てに、計算機１０２の故障情報を通知する必要
がある。通知には生存信号を利用する。各計算機が、自
分自身が送信する生存信号に故障情報を書き込むことに
より、故障情報がリング上の隣接計算機に順次転送さ
れ、全ての計算機に通知される。故障情報の通知のため
の動作を以下に説明する。計算機１０３は、次に送信す
る生存信号に、計算機１０２の故障発生を示す故障情報
を書き込むとともに（ステップＳＴ１６）、再送に備え
てメモリに故障情報を保存して故障情報タイマをセット
し（ステップＳＴ１７）、平常モードに戻る（ステップ
ＳＴ１８）。Further, the computer 103 needs to notify all the other computers with which it can communicate of the failure of the computer 102. The survival signal is used for this notification: each computer writes the failure information into the survival signal that it sends, so that the failure information is passed in turn to the adjacent computer on the ring and reaches all the computers. The notification operation is as follows. The computer 103 writes failure information indicating the failure of the computer 102 into the next survival signal it sends (step ST16), stores the failure information in memory in preparation for retransmission and sets the failure information timer (step ST17), and returns to the normal mode (step ST18).

【００８７】一方、ステップＳＴ６において、各計算機
は、受信した生存信号中に故障発生を示す故障情報を発
見した場合、その故障情報の発信源が自分でないならば
（ステップＳＴ７）、計算機１０３と同様に再構成を行
うとともに、次に送信する生存信号に同様の故障情報を
書き込む（ステップＳＴ９）。If, in step ST6, a computer finds failure information in the received survival signal and it did not itself originate that failure information (step ST7), it performs reconfiguration in the same way as the computer 103 and writes the same failure information into the next survival signal it sends (step ST9).

【００８８】ステップＳＴ７において、生存信号中の故
障情報が自分自身が出したもの、即ち、故障情報に関す
るメッセージが、所定の時間以内にリングを一周して自
分自身に到達したか否かを調べ、もし受信したのであれ
ば、メモリ中から該メッセージを削除して、故障情報タ
イマを削除する（ステップＳＴ８）。受信できなければ
もう一度故障情報を生存信号に書き込み、再送を試み
る。即ち、故障を発見した計算機１０３は、上記したよ
うに、ステップＳＴ１６に従って仮想リング上の右隣の
計算機１０４に対して、生存信号とともに故障情報を伝
える。計算機１０４は、さらに右隣の計算機１０１に同
様にして故障情報を伝える。これを繰り返して、仮想リ
ング上を故障情報が一巡すれば全計算機に故障情報が通
知されたことを確認して、上記ステップＳＴ８に示した
ようにメモリ中から故障情報に関するメッセージを削除
して、故障情報タイマを削除する。しかしながら、故障
情報が一巡している際に途中の計算機が故障したりする
と、故障情報が失われてしまう恐れがある。これを防ぐ
ために、ステップＳＴ４において生存信号を受信せず、
ステップＳＴ１０において受信タイマが０でないなら
ば、故障情報を発信してから、ステップＳＴ１７におい
てセットしたタイムアウト時間内に故障情報が仮想リン
グのループを一巡して自分自身に戻ってきたか否かをチ
ェックして（ステップＳＴ１２）、タイムアウト時間を
超過しているならばもう一度隣接計算機に対して生存信
号に故障情報を付加して送信し、故障情報タイマをリセ
ットする（ステップＳＴ１３）。In step ST7, the computer checks whether the failure information in the survival signal is information that it issued itself, that is, whether the failure information message has travelled around the ring and returned to it within the predetermined time. If so, it deletes the message from memory and cancels the failure information timer (step ST8). If the message does not return, the computer writes the failure information into a survival signal again and retransmits it. In other words, the computer 103, which discovered the failure, passes the failure information together with the survival signal to the computer 104 on its right on the virtual ring in accordance with step ST16, as described above. The computer 104 passes the failure information on to the computer 101 on its right in the same way. This is repeated, and once the failure information has made one circuit of the virtual ring, the originating computer confirms that all computers have been notified, deletes the failure information message from memory, and cancels the failure information timer, as shown in step ST8. However, if a computer along the way fails while the failure information is circulating, the failure information may be lost. To guard against this, when no survival signal is received in step ST4 and the reception timer is not 0 in step ST10, the computer checks whether the failure information it issued has made one circuit of the virtual ring and returned within the timeout time set in step ST17 (step ST12). If the timeout time has been exceeded, it again attaches the failure information to a survival signal sent to the adjacent computer and resets the failure information timer (step ST13).
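
The circulation of failure information around the ring can be traced with a small simulation. It is only a sketch: the function name circulate and the message layout (a dictionary with "origin" and "failed" keys) are assumptions made for illustration, and retransmission on timeout (steps ST12 and ST13) is not modeled.

```python
def circulate(ring, message):
    """Pass failure information clockwise until it returns to its originator.

    Returns the computers that learned of the failure, in order of notification.
    """
    origin = message["origin"]
    notified = []
    i = (ring.index(origin) + 1) % len(ring)
    while ring[i] != origin:      # ST7: is the receiver the originator?
        notified.append(ring[i])  # ST9: reconfigure and forward the information
        i = (i + 1) % len(ring)
    # Back at the originator: ST8 deletes the stored copy and cancels the timer.
    return notified


# After reconfiguration the ring is 101, 103, 104 and the computer 103 is the originator.
order = circulate([101, 103, 104], {"origin": 103, "failed": 102})
```

The trace follows the text: the computer 103 informs 104, which informs 101, after which the information arrives back at 103.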

【００８９】尚、ステップＳＴ１、ＳＴ２は、生存信号
送信ステップ、ステップＳＴ３、ＳＴ４、ＳＴ１０、Ｓ
Ｔ１１、ＳＴ１４は故障検出ステップ、ステップＳＴ５
〜ＳＴ９、ＳＴ１６〜ＳＴ１８は故障通知ステップ、ス
テップＳＴ１５、ＳＴ９は再構成ステップに対応してい
る。Steps ST1 and ST2 correspond to the survival signal transmitting step; steps ST3, ST4, ST10, ST11 and ST14 to the failure detecting step; steps ST5 to ST9 and ST16 to ST18 to the failure notifying step; and steps ST15 and ST9 to the reconfiguring step.

【００９０】この実施例による分散計算機システムは、
各計算機に故障検出機能を分散しているため、特定の計
算機の故障により、故障検出機能が失われることがな
い。また、この分散計算機システムは、各計算機が、自
分自身の生存を知らせるために、毎周期１つの生存信号
を送信するのみである。このため、平常時に各計算機が
送受信する生存信号の数を最小にでき、計算機への負荷
が小さくなる。また、ＬＡＮ上に送出される生存信号の
総数は、計算機の台数に比例した数であるので、ＬＡＮ
への負荷、即ち単位時間あたりにＬＡＮ上に送信される
信号の個数を小さくすることができる。さらに、この分
散計算機システムでは、故障情報を通知するために生存
信号を利用しているので、通知のための余分な信号を送
信する必要がなく、ＬＡＮの負荷をさらに小さくするこ
とができる。In the distributed computer system according to this embodiment, the failure detection function is distributed over the computers, so the failure of any particular computer does not cause the failure detection function to be lost. In addition, each computer transmits only one survival signal per cycle to announce that it is alive. The number of survival signals that each computer sends and receives in normal operation is therefore minimized, and the load on the computers is small. Moreover, the total number of survival signals sent over the LAN is proportional to the number of computers, so the load on the LAN, that is, the number of signals sent over the LAN per unit time, can be kept small. Furthermore, because the survival signal itself is used to convey failure information, no extra signal needs to be sent for notification, and the load on the LAN can be reduced further.
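
The load claim can be made concrete with a simple count. In the sketch below, the ring figure follows from the text (one survival signal per computer per cycle). The all-pairs figure is a hypothetical baseline introduced here only for comparison, in which every computer would send a signal to every other computer; it is not described in this section.

```python
def ring_signals_per_cycle(n):
    """Survival signals on the LAN per cycle when each computer sends exactly one."""
    return n


def all_pairs_signals_per_cycle(n):
    """Hypothetical baseline: every computer signals every other computer each cycle."""
    return n * (n - 1)


for n in (4, 16, 64):
    print(n, ring_signals_per_cycle(n), all_pairs_signals_per_cycle(n))
```

The ring count grows linearly with the number of computers, while the baseline grows quadratically.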

【００９１】実施例２．図７はこの発明の他の実施例に
よる分散計算機システムの故障検出方法の動作を示すフ
ローチャートである。この実施例による分散計算機シス
テムは、実施例１と同様に、図１のような物理的構成を
もつ。また、各計算機は、故障検出のため、図２に示す
仮想的な仮想リング１０上に並べられる。このとき、実
施例１に述べたように、計算機のいくつかの属性に注目
した配列方法が考えられる。Example 2. FIG. 7 is a flowchart showing the operation of a failure detecting method for a distributed computer system according to another embodiment of the present invention. As in the first embodiment, the distributed computer system according to this embodiment has the physical configuration shown in FIG. 1, and the computers are arranged on the virtual ring 10 shown in FIG. 2 for failure detection. As described in the first embodiment, arrangement methods based on various attributes of the computers are possible.

【００９２】次に動作について説明する。以下、図７に
示すフローチャートの各ステップと対応させながら、各
計算機の動作を説明する。Next, the operation will be described. The operation of each computer will be described below in correspondence with each step of the flowchart shown in FIG.

【００９３】各計算機は、隣接する計算機Ｘへの存在信
号の送信タイマが０か否かをチェックして（ステップＳ
Ｔ２１）、定期的に生存信号を送信するとともに、予め
定められたタイムアウト時間を設定して生存信号送信タ
イマをセットして、送信先を右隣から左隣、または左隣
から右隣へと変更する（ステップＳＴ２２）。即ち、送
信先は、１周期ごとに、右隣、左隣、右隣、左隣、…と
いうように、奇数周期目には右隣の計算機、偶数周期目
には左隣の計算機とする。次に、各計算機は、定められ
た時刻に、左右の隣接計算機から生存信号を受信したか
否かを調べ（ステップＳＴ２３）、受信していないなら
ば、さらに生存信号の受信が所定の時間内になされたか
否か（即ちタイムアウトしているか否か）を調べ（ステ
ップＳＴ３４）、タイムアウトしていないならば受信タ
イマをリセットして（ステップＳＴ３７）、ステップＳ
Ｔ２１へ戻る。この際の各計算機の生存信号の送受信の
様子を図８に示す。Each computer checks whether the transmission timer for the survival signal to an adjacent computer X is 0 (step ST21). When it is, the computer transmits the survival signal, sets the survival signal transmission timer to the predetermined timeout time, and switches the destination from the right neighbor to the left neighbor or from the left neighbor to the right neighbor (step ST22). That is, the destination alternates every cycle (right, left, right, left, and so on): the right neighbor in odd-numbered cycles and the left neighbor in even-numbered cycles. Next, each computer checks whether it has received the survival signals from its left and right neighbors at the prescribed times (step ST23). If not, it checks whether the survival signal has failed to arrive within the predetermined time, that is, whether a timeout has occurred (step ST34); if no timeout has occurred, it resets the reception timer (step ST37) and returns to step ST21. FIG. 8 shows how the computers transmit and receive survival signals in this case.
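
The alternating choice of destination in step ST22 can be written as a one-line rule. The sketch assumes that cycles are numbered from 1 and that the right neighbor is the next computer clockwise; the function name destination is hypothetical.

```python
def destination(ring, computer, cycle):
    """Destination of the survival signal in a given cycle (cycles count from 1).

    Odd-numbered cycles go to the right neighbor, even-numbered cycles to the left.
    """
    i = ring.index(computer)
    step = 1 if cycle % 2 == 1 else -1
    return ring[(i + step) % len(ring)]


ring = [101, 102, 103, 104]
schedule = [destination(ring, 101, cycle) for cycle in (1, 2, 3, 4)]
```

For the computer 101 this yields the schedule 102, 104, 102, 104, so each neighbor hears from it every second cycle.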

【００９４】以下、図９に示すように、計算機１０２の
通信インターフェース２１２とＬＡＮ４０１とを接続す
るケーブル３１２が故障しているものとして、故障検
出、リングの再構成、故障情報の通知の順に分散計算機
システムの動作を説明する。The operation of the distributed computer system is described below, in the order of failure detection, ring reconfiguration, and failure information notification, for the case shown in FIG. 9 in which the cable 312 connecting the communication interface 212 of the computer 102 to the LAN 401 has failed.

【００９５】故障の発生により、計算機１０１は右隣、
計算機１０２は両隣、計算機１０３は左隣の計算機から
の信号を受信することができなくなる（ステップＳＴ３
４）。各計算機は、生存信号の受信の最初の失敗を検出
すると、さらに、もう１つの隣接計算機からの生存信号
が受信できるかを調べる（ステップＳＴ３５）。計算機
１０１、１０３は、計算機１０４からの生存信号の受信
が可能なため、計算機１０２が故障したと判断する。即
ち、図７において、ステップＳＴ３５からステップＳＴ
３７を経てステップＳＴ２１へ戻り、ステップＳＴ２２
によって送信先を計算機１０２から計算機１０４へと変
更して、計算機１０４からは生存信号を受信することが
可能であるのが（ステップＳＴ２３）、計算機１０２か
らの生存信号は受信することなくタイムアウトするので
（ステップＳＴ２４）、計算機１０２は故障と判断す
る。When the failure occurs, the computer 101 can no longer receive signals from its right neighbor, the computer 102 from either neighbor, and the computer 103 from its left neighbor (step ST34). On detecting the first failure to receive a survival signal, each computer checks whether it can receive a survival signal from its other neighbor (step ST35). The computers 101 and 103 can still receive survival signals from the computer 104, and therefore judge that the computer 102 has failed. That is, in FIG. 7, processing returns from step ST35 through step ST37 to step ST21, the destination is changed from the computer 102 to the computer 104 in step ST22, and a survival signal can be received from the computer 104 (step ST23), whereas the survival signal from the computer 102 times out without being received (step ST24); the computer 102 is therefore judged to have failed.

【００９６】一方、ステップＳＴ３４において一方の隣
接計算機からの生存信号受信に失敗した計算機１０２自
身は、もう一方の隣接計算機からの生存信号の受信にも
失敗するため（ステップＳＴ３５）、自分自身とＬＡＮ
４０１との接続が切断されたと判断し、再起動などの処
置を行う（ステップＳＴ３６）。以上のような判断は、
２つの計算機が同時に故障する確率が非常に低いという
仮定に基づいている。The computer 102 itself, having failed to receive a survival signal from one neighbor in step ST34, also fails to receive a survival signal from its other neighbor (step ST35). It therefore judges that its own connection to the LAN 401 has been cut and takes a measure such as restarting (step ST36). These judgments rest on the assumption that the probability of two computers failing at the same time is very low.
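
The two-sided judgment can be sketched as follows. The function names and returned labels are hypothetical, and the scenario helper assumes a single failure that cuts one computer off from the LAN, in line with the assumption stated above that two simultaneous failures are very unlikely.

```python
def diagnose_two_sided(heard_left, heard_right):
    """Judge a computer's situation from which neighbors' survival signals arrived."""
    if heard_left and heard_right:
        return "ok"
    if not heard_left and not heard_right:
        return "self isolated"  # own LAN connection is cut: restart or similar (ST36)
    return "left neighbor failed" if not heard_left else "right neighbor failed"


def diagnose_all(ring, cut_off):
    """Diagnose every computer when `cut_off` has lost its LAN connection."""
    n = len(ring)
    result = {}
    for i, c in enumerate(ring):
        left, right = ring[(i - 1) % n], ring[(i + 1) % n]
        # A signal arrives only if neither endpoint is cut off from the LAN.
        heard_left = c != cut_off and left != cut_off
        heard_right = c != cut_off and right != cut_off
        result[c] = diagnose_two_sided(heard_left, heard_right)
    return result


verdicts = diagnose_all([101, 102, 103, 104], cut_off=102)
```

In the scenario of FIG. 9 the computers 101 and 103 each blame the neighbor on the side of 102, the computer 102 concludes that it is itself isolated, and the computer 104 notices nothing.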

【００９７】仮に、計算機１０３が、隣接計算機１０２
の故障を最初に発見したとする。計算機１０３は、リン
グ上から計算機１０２を削除し、仮想リング上の新たな
配置を設定する。このような処理を、リングの再構成と
呼ぶ。この結果、計算機１０３の隣接計算機は、計算機
１０１、１０４に変わる（ステップＳＴ２５）。図１０
は、再構成後の分散計算機システムの仮想リングを示す
図であり、再構成により、故障が発生しても、故障発生
以前と同程度の故障検出能力を維持することができる。Suppose that the computer 103 is the first to discover the failure of its neighbor, the computer 102. The computer 103 removes the computer 102 from the ring and sets up a new arrangement on the virtual ring. This process is called ring reconfiguration. As a result, the neighbors of the computer 103 become the computers 101 and 104 (step ST25). FIG. 10 is a diagram showing the virtual ring of the distributed computer system after reconfiguration. Thanks to reconfiguration, the system maintains about the same failure detection capability after a failure as before it.

【００９８】さらに、計算機１０３は、通信し得る計算
機全てに、計算機１０２の故障情報を通知する必要があ
る。故障情報の通知のための動作を以下に説明する。Further, the computer 103 needs to notify all computers with which it can communicate of failure information of the computer 102. The operation for notifying the failure information will be described below.

【００９９】最初に、計算機１０３は、ステップＳＴ２
５においてこれ以後に両隣の計算機に送信する生存信号
に故障情報を書き込むようにする。計算機１０１、１０
４は、該生存信号を受信し、そのなかに故障情報が書か
れていることを発見すると（ステップＳＴ２６）、それ
に応じてリングを再構成し（ステップＳＴ２７）、その
後、もう一方の隣接計算機から故障情報を既に受信して
いるか確認して（ステップＳＴ２８）、まだ受信してい
ないならば一方の隣接計算機への生存信号に、故障情報
を書き込むようにする（ステップＳＴ２９）。また、計
算機１０３へと送る生存信号に、故障情報を受信したこ
とを送信元に知らせる受信確認を付与する（ステップＳ
Ｔ３１）。First, in step ST25, the computer 103 arranges for failure information to be written into the survival signals that it subsequently sends to the computers on both sides. When the computers 101 and 104 receive such a survival signal and find failure information in it (step ST26), they reconfigure the ring accordingly (step ST27) and then check whether they have already received the failure information from their other neighbor (step ST28). If not, they arrange for the failure information to be written into the survival signals sent to that other neighbor (step ST29). They also attach, to the survival signal sent to the computer 103, an acknowledgment informing the sender that the failure information has been received (step ST31).

【０１００】一方、計算機１０３では、生存信号に受信
確認を発見すると（ステップＳＴ３２）、以後、該生存
信号の送信元計算機への生存信号に、故障情報を付与し
ないようにする（ステップＳＴ３３）。When the computer 103 finds the acknowledgment in a survival signal (step ST32), it thereafter stops attaching the failure information to the survival signals it sends to the computer that sent that survival signal (step ST33).

【０１０１】他の計算機でも、上記の処理と同様のこと
を繰り返す。これにより、故障情報は、計算機１０３を
起点として、リング上を右回り、左回りに転送される。
右回り、左回りに転送される故障情報は、ある計算機に
おいて、同時に受信される。このとき、当該計算機はス
テップＳＴ２８において、既に故障情報を受信している
と判断するので、これ以上の故障情報の転送を行わず、
受信確認のみを行う（ステップＳＴ３０）。The other computers repeat the same processing. The failure information is thus forwarded around the ring both clockwise and counterclockwise, starting from the computer 103. The failure information travelling clockwise and the failure information travelling counterclockwise are received together at some computer. That computer then determines in step ST28 that it has already received the failure information, so it does not forward the failure information any further and only sends the acknowledgment (step ST30).
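
The two-way propagation can be simulated in rounds, one hop per round in each direction. This is a simplified sketch: the function name propagate_both_ways and the round-based timing are assumptions, acknowledgments are not modeled, and a computer that receives both copies in the same round is treated as the meeting point and does not forward.

```python
def propagate_both_ways(ring, origin):
    """Return {computer: round in which it first received the failure information}."""
    n = len(ring)
    start = ring.index(origin)
    received = {origin: 0}
    # Each frontier entry is (position, direction): +1 clockwise, -1 counterclockwise.
    frontier = [(start, +1), (start, -1)]
    rnd = 0
    while frontier:
        rnd += 1
        arrivals = [((pos + d) % n, d) for pos, d in frontier]
        new = {pos for pos, _ in arrivals if ring[pos] not in received}
        for pos in new:
            received[ring[pos]] = rnd
        copies = {}
        for pos, _ in arrivals:
            copies[pos] = copies.get(pos, 0) + 1
        # ST28: forward only a first copy that arrived alone; otherwise stop here.
        frontier = [(pos, d) for pos, d in arrivals if pos in new and copies[pos] == 1]
    return received


rounds = propagate_both_ways([101, 103, 104], origin=103)
```

On the reconfigured ring 101, 103, 104 both neighbors of the computer 103 are informed in the first round, and their forwarded copies are then recognized as duplicates.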

【０１０２】尚、ステップＳＴ２１、ＳＴ２２、ＳＴ３
７は、生存信号送信ステップ、ステップＳＴ２３〜ＳＴ
２５、ＳＴ３４〜ＳＴ３６は故障検出ステップ、ステッ
プＳＴ２５、ＳＴ２６、ＳＴ２８〜ＳＴ３３は故障通知
ステップ、ステップＳＴ２５、ＳＴ２７は再構成ステッ
プに対応している。Steps ST21, ST22 and ST37 correspond to the survival signal transmitting step; steps ST23 to ST25 and ST34 to ST36 to the failure detecting step; steps ST25, ST26 and ST28 to ST33 to the failure notifying step; and steps ST25 and ST27 to the reconfiguring step.

【０１０３】この実施例による分散計算機システムは、
各計算機に故障検出機能を分散しているため、特定の計
算機の故障により、故障検出機能が失われることがな
い。また、分散計算機システムは、平常時の故障発見の
ための通信量を最小にできる。また、実施例１のよう
に、故障箇所を特定するために、余分に信号を送受信す
る必要がないため、故障発生時の通信量が少ない。さら
に、分散計算機システムでは、故障情報を通知するため
に、生存信号を利用するため、通知のための余分な信号
を送信する必要がなく、ＬＡＮの負荷を小さくすること
ができる。In the distributed computer system according to this embodiment, the failure detection function is distributed over the computers, so the failure of any particular computer does not cause the failure detection function to be lost. The communication needed for failure detection in normal operation is also minimized. Further, unlike the first embodiment, no additional signals need to be exchanged to locate the failure, so little communication is needed when a failure occurs. Finally, because the survival signal is used to convey failure information, no extra signal needs to be sent for notification, and the load on the LAN can be reduced.

【０１０４】実施例３．図１１は、この発明の他の実施
例による分散計算機システムの故障検出方法の動作を示
すフローチャートである。この実施例の分散計算機シス
テムは、図１に示す物理的構成を備えており、各計算機
は、仮想的に図２に示すような仮想リング１０上に配置
されている。実施例１と同様に、仮想リング１０上の計
算機の配置には、計算機の様々な属性に注目したいくつ
かの方法が考えられる。Example 3. FIG. 11 is a flow chart showing the operation of a failure detecting method for a distributed computer system according to another embodiment of the present invention. The distributed computer system of this embodiment has the physical configuration shown in FIG. 1, and each computer is virtually arranged on a virtual ring 10 as shown in FIG. Similar to the first embodiment, for the arrangement of the computers on the virtual ring 10, several methods that consider various attributes of the computers can be considered.

【０１０５】次に動作について説明する。以下、図１１
に示すフローチャートの各ステップと対応させながら、
各計算機の動作を説明する。Next, the operation will be described. The operation of each computer is described below in correspondence with the steps of the flowchart shown in FIG. 11.

【０１０６】各計算機は、定期的に右隣の計算機に生存
信号を送信するべく、生存信号の送信タイマが０である
か否かをチェックして（ステップＳＴ４１）、生存信号
を送信して、予め定められたタイムアウト時間を設定し
て生存信号送信タイマをセットする（ステップＳＴ４
２）。次に、左隣の計算機からの生存信号を、定期的に
受信したか否かを調べ（ステップＳＴ４３）、受信して
いないならば、さらに生存信号に付加された故障情報の
受信が所定の時間内になされたか否か（即ちタイムアウ
トしているか否か）を調べ（ステップＳＴ５７）、タイ
ムアウトしていないならばステップＳＴ４１へ戻る。So that it periodically sends a survival signal to the computer on its right, each computer checks whether the survival signal transmission timer is 0 (step ST41), transmits the survival signal, and sets the survival signal transmission timer to the predetermined timeout time (step ST42). Next, it checks whether the survival signal from the computer on its left has been received as scheduled (step ST43). If not, it checks whether the failure information attached to a survival signal has failed to arrive within the predetermined time, that is, whether a timeout has occurred (step ST57); if no timeout has occurred, processing returns to step ST41.

【０１０７】ステップＳＴ４３において、各計算機は、
定められた時間内に生存信号を受信したならば、予め定
められたタイムアウト時間を設定して生存信号の受信タ
イマをセットないしリセットし、右隣に送信する生存信
号にＡＣＫを書き込む（ステップＳＴ４４）。さもなけ
れば、ステップＳＴ５３に移行し、生存信号の受信タイ
マが０であるならば生存信号にＮＡＫを書き込む（ステ
ップＳＴ５４）。図１２は、各計算機がＡＣＫまたはＮ
ＡＫを生存信号に書き込む様子を示した分散計算機シス
テムのブロック図である。If a computer receives the survival signal within the prescribed time in step ST43, it sets or resets the survival signal reception timer to the predetermined timeout time and writes ACK into the survival signal that it sends to its right neighbor (step ST44). Otherwise, processing moves to step ST53, and if the survival signal reception timer is 0, the computer writes NAK into the survival signal (step ST54). FIG. 12 is a block diagram of the distributed computer system showing how each computer writes ACK or NAK into the survival signal.

【０１０８】図１３に示すように、実施例１と同様に、
計算機１０２の通信インターフェースとＬＡＮ４０１と
を接続するケーブル３１２に故障が発生したとして、以
下の故障検出方法の動作について説明する。The operation of the failure detecting method is described below for the case, shown in FIG. 13, in which a failure has occurred in the cable 312 connecting the communication interface of the computer 102 to the LAN 401, as in the first embodiment.

【０１０９】計算機１０２、１０３は、それぞれ計算機
１０１、１０２からの生存信号を受信できない。このた
め、上記したように、計算機１０２は計算機１０３に対
して、計算機１０３は計算機１０４に対して、ステップ
ＳＴ５３及びＳＴ５４に従ってそれぞれＮＡＫを含む生
存信号を送信する。従って、各計算機が送信する生存信
号の内容は、図１３に示すようになる。ＮＡＫを含む２
つの生存信号のうち、計算機１０２の送信した生存信号
は、ケーブル３１２が故障しているために、計算機１０
３には到達しない。従って、計算機１０４だけがＮＡＫ
を含む故障情報が付加された生存信号を受信する（ステ
ップＳＴ４５）。従って、計算機１０４は、計算機１０
２に故障が発生したものと判断し、実施例１と同様にし
て、計算機１０１の生存信号の送信先を変更させる（ス
テップＳＴ５０）。また、計算機１０４は、次に、故障
情報を転送すべく送信する生存信号に故障情報を書き込
むとともに（ステップＳＴ５１）、再送に備えてメモリ
に故障情報を保存して、予め定められたタイムアウト時
間を設定して故障情報タイマをセットする（ステップＳ
Ｔ５２）。The computers 102 and 103 cannot receive the survival signals from the computers 101 and 102, respectively. Therefore, as described above, the computer 102 transmits a survival signal containing a NAK to the computer 103, and the computer 103 transmits one to the computer 104, in accordance with steps ST53 and ST54. The contents of the survival signals transmitted by the computers are therefore as shown in FIG. 13. Of the two survival signals containing a NAK, the one transmitted by the computer 102 does not reach the computer 103 because the cable 312 has failed. Consequently, only the computer 104 receives a survival signal to which failure information containing a NAK has been added (step ST45). The computer 104 therefore determines that a failure has occurred at the computer 102 and, as in the first embodiment, has the computer 101 change the destination of its survival signal (step ST50). Next, the computer 104 writes the failure information into the survival signal it is about to transmit so that the information is forwarded (step ST51), stores the failure information in memory in preparation for retransmission, and sets the failure information timer with a predetermined timeout period (step ST52).
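
The NAK-based localization described above can be illustrated with a minimal sketch. It assumes a single computer is cut off from the LAN, that a computer sends a NAK whenever it heard nothing from its left neighbor in the previous period, and that a receiver of a NAK blames the left neighbor of the NAK's sender. The function name `diagnose_ring` and the data layout are hypothetical and do not appear in the patent.

```python
def diagnose_ring(ring, isolated):
    """Return {receiver: suspected_computer} for one round of survival signals.

    ring     -- computers in virtual-ring order (each sends to the next one)
    isolated -- the one computer whose LAN connection has failed (or None)
    """
    n = len(ring)

    def delivered(src, dst):
        # A signal arrives only if neither end is cut off from the LAN.
        return src != isolated and dst != isolated

    # Previous period: did each computer hear from its left neighbor?
    heard = {node: delivered(ring[i - 1], node) for i, node in enumerate(ring)}

    verdicts = {}
    for i, sender in enumerate(ring):
        receiver = ring[(i + 1) % n]
        flag = "ACK" if heard[sender] else "NAK"   # cf. steps ST53/ST54
        if flag == "NAK" and delivered(sender, receiver):
            # cf. step ST45: the sender heard nothing from ITS left neighbor,
            # so that left neighbor is the suspect.
            verdicts[receiver] = ring[i - 1]
    return verdicts


print(diagnose_ring([101, 102, 103, 104], isolated=102))
```

With the ring 101, 102, 103, 104 and computer 102 cut off, only computer 104 receives a NAK, and it blames computer 102, which matches the example in the text.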

【０１１０】一方、ステップＳＴ４５において生存信号
がＡＣＫを含むことが判明したならば、各計算機は、受
信した生存信号中に故障発生を示す故障情報が付加され
ているか否かをチェックして（ステップＳＴ４６）、故
障情報を発見した場合、その故障情報の発信源が自分で
ないならば（ステップＳＴ４７）、次に送信する生存信
号に同様の故障情報を書き込む（ステップＳＴ４９）。On the other hand, if it is found in step ST45 that the survival signal contains an ACK, each computer checks whether failure information indicating the occurrence of a failure has been added to the received survival signal (step ST46). If failure information is found and the computer itself is not its originator (step ST47), the computer writes the same failure information into the next survival signal it transmits (step ST49).

【０１１１】ステップＳＴ４７において、生存信号中の
故障情報が自分自身が出したもの、即ち、故障情報に関
するメッセージが、所定の時間以内にリングを一周して
自分自身に到達したか否かを調べ、もし受信したのであ
れば、メモリ中から該メッセージを削除して、故障情報
タイマを削除する（ステップＳＴ４８）。他方、生存信
号を受信できず故障情報の受信がステップＳＴ５２で設
定したタイムアウト時間を超過したならば（ステップＳ
Ｔ５７）、実施例１と同様にもう一度故障情報を次の生
存信号に書き込み、故障情報タイマをリセットして再送
を試みる（ステップＳＴ５８）。In step ST47, each computer checks whether the failure information in the survival signal is information that it issued itself, that is, whether its failure message has travelled once around the ring and returned to it within the predetermined time. If so, the computer deletes the message from memory and cancels the failure information timer (step ST48). If, on the other hand, no survival signal carrying the failure information arrives before the timeout period set in step ST52 expires (step ST57), the computer writes the failure information into the next survival signal again, resets the failure information timer, and attempts retransmission, as in the first embodiment (step ST58).
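
The forwarding and retransmission rules of steps ST46 to ST52, ST57, and ST58 can be sketched as follows. This is an illustration under assumptions, not the patent's implementation: the class name `FaultNotifier`, the `(origin, fault)` tuple format, and the use of plain numbers for time are all hypothetical.

```python
class FaultNotifier:
    """Piggy-backs failure information on survival signals (illustrative only)."""

    def __init__(self, node_id, timeout):
        self.node_id = node_id
        self.timeout = timeout
        self.pending = {}          # fault -> deadline for its return around the ring

    def originate(self, fault, now):
        # cf. ST51/ST52: attach the info, keep a copy, start the timer.
        self.pending[fault] = now + self.timeout
        return (self.node_id, fault)

    def on_receive(self, info, now):
        # cf. ST46-ST49: returns the info to attach to the next survival signal.
        if info is None:
            return None
        origin, fault = info
        if origin == self.node_id:
            self.pending.pop(fault, None)   # cf. ST48: it went around the ring
            return None
        return info                          # cf. ST49: forward unchanged

    def on_tick(self, now):
        # cf. ST57/ST58: resend anything that has not come back in time.
        resend = []
        for fault, deadline in list(self.pending.items()):
            if now >= deadline:
                self.pending[fault] = now + self.timeout
                resend.append((self.node_id, fault))
        return resend


origin = FaultNotifier(104, timeout=5)
relay = FaultNotifier(101, timeout=5)
msg = origin.originate("computer 102 failed", now=0)
forwarded = relay.on_receive(msg, now=1)      # relay passes it on
retries = origin.on_tick(now=5)               # not back yet, so resend
origin.on_receive(forwarded, now=6)           # back at the originator
```

Keeping the stored copy until the message returns is what makes the retransmission in step ST58 possible if a relay fails mid-circulation.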

【０１１２】尚、ステップＳＴ４１、ＳＴ４２は、生存
信号送信ステップ、ステップＳＴ４３、ＳＴ４４、ＳＴ
５３〜ＳＴ５６は故障検出ステップ、ステップＳＴ４５
〜ＳＴ５２、ＳＴ５７、ＳＴ５８は故障通知ステップに
対応している。Steps ST41 and ST42 correspond to the survival signal transmission step; steps ST43, ST44, and ST53 to ST56 correspond to the failure detection step; and steps ST45 to ST52, ST57, and ST58 correspond to the failure notification step.

【０１１３】分散計算機システムは、各計算機に故障検
出機能を分散しているため、特定の計算機の故障によ
り、故障検出機能が失われることがない。また、分散計
算機システムは、平常時に各計算機が送受信する生存信
号の数を最小にでき、計算機への負荷が小さくなる。ま
た、ＬＡＮ上に送出される生存信号の総数は、計算機の
台数に比例した数であるので、ＬＡＮへの負荷、即ち単
位時間あたりにＬＡＮ上に送信される信号の個数を小さ
くすることができる。さらに、分散計算機システムで
は、故障情報を通知するために生存信号を利用している
ので、通知のための余分な信号を送信する必要がなく、
ＬＡＮの負荷を小さくすることができる。Because the distributed computer system distributes the failure detection function across all computers, the failure of any particular computer does not disable failure detection. In addition, the number of survival signals that each computer sends and receives during normal operation is kept to a minimum, which reduces the load on the computers. The total number of survival signals sent onto the LAN is proportional to the number of computers, so the load on the LAN, that is, the number of signals transmitted on the LAN per unit time, can be kept small. Furthermore, because the survival signals themselves carry the failure information, no extra notification signals need to be transmitted, which further reduces the load on the LAN.

【０１１４】実施例４．図１４は、この発明の他の実施
例による分散計算機システムの故障検出方法の動作を示
すフローチャートであり、図１５は、この実施例による
分散計算機システムの仮想的な配置を示すブロック図で
ある。また、この実施例の分散計算機システムは、図１
に示す物理的構成を備えている。Embodiment 4. FIG. 14 is a flowchart showing the operation of a failure detection method for a distributed computer system according to another embodiment of the present invention, and FIG. 15 is a block diagram showing the virtual arrangement of the distributed computer system according to this embodiment. The distributed computer system of this embodiment has the physical configuration shown in FIG. 1.

【０１１５】この実施例による分散計算機システムで
は、図１５に示すように、計算機１０１〜１０７が仮想
的なツリーの節点上に配置され、ツリーの最下層の計算
機１０３、１０４、１０６、１０７を除き、各計算機が
複数の子計算機を有するようにツリーを構成する。以
下、ツリー上のある計算機からみて、親節点の位置にあ
る計算機を親計算機、子節点の位置にある計算機を子計
算機、同じ階層にある計算機を兄弟計算機と呼ぶ。実施
例１と同様に、ツリー上の計算機の配置には、計算機の
様々な属性に注目していくつかの配置方法が考えられ
る。In the distributed computer system according to this embodiment, as shown in FIG. 15, the computers 101 to 107 are arranged on the nodes of a virtual tree, and the tree is configured so that every computer except the computers 103, 104, 106, and 107 in the lowest layer has a plurality of child computers. Hereinafter, as seen from a given computer on the tree, the computer at its parent node is called its parent computer, a computer at one of its child nodes is called a child computer, and computers in the same layer are called sibling computers. As in the first embodiment, several methods of arranging the computers on the tree are conceivable, each focusing on different attributes of the computers.

【０１１６】次に動作について説明する。以下、図１４
に示すフローチャートの各ステップと対応させながら、
各計算機の動作を説明する。Next, the operation is described. The operation of each computer is explained below with reference to the steps of the flowchart shown in FIG. 14.

【０１１７】各計算機は親計算機に対して定期的に生存
信号を送信すべく、生存信号の送信タイマが０か否かを
チェックして（ステップＳＴ６１）、送信タイマが０で
あるならば生存信号を送信するとともに、生存信号送信
タイマをセットする（ステップＳＴ６２）。各計算機は
複数の子計算機からの生存信号が受信したか否かを調べ
（ステップＳＴ６３）、さらに生存信号を受信しなかっ
た場合は生存信号の受信タイマが０であるか否かをチェ
ックして（ステップＳＴ６６）、これらの結果を組み合
わせて故障箇所を判断する。To transmit a survival signal to its parent computer periodically, each computer checks whether the survival signal transmission timer has reached 0 (step ST61); if it has, the computer transmits the survival signal and sets the survival signal transmission timer (step ST62). Each computer also checks whether survival signals have been received from its child computers (step ST63) and, when a survival signal has not been received, checks whether the corresponding survival signal reception timer has reached 0 (step ST66). It combines these results to determine the failure location.

【０１１８】ツリーのルートに当たる計算機は、ツリー
の最下層の計算機に生存信号を送信することにより、ル
ート計算機の故障を検出する。図１６は、図１５のよう
な仮想的なツリー配置を有する分散計算機システムの生
存信号の送受信の様子を示す図である。The computer at the root of the tree transmits its survival signal to a computer in the lowest layer of the tree, which allows a failure of the root computer to be detected. FIG. 16 is a diagram showing how survival signals are transmitted and received in a distributed computer system having the virtual tree arrangement of FIG. 15.

【０１１９】次に、図１７に示すように、計算機１０５
の通信インターフェース２１５とＬＡＮ４０１とを接続
するケーブル３１５が切断した場合における、この実施
例による故障検出方法の動作について説明する。Next, the operation of the failure detection method according to this embodiment is described for the case shown in FIG. 17, in which the cable 315 connecting the communication interface 215 of the computer 105 to the LAN 401 has been disconnected.

【０１２０】この故障により、計算機１０５は全ての子
計算機からの生存信号を受信できなくなるので、ステッ
プＳＴ６６及びＳＴ６７において全ての子計算機の生存
信号がタイムアウトしたと判断され、自分自身とＬＡＮ
４０１の間に故障が生じたものとして、自らを再スター
トするなどの処置を行う（ステップＳＴ６８）。Because of this failure, the computer 105 can no longer receive survival signals from any of its child computers. In steps ST66 and ST67 it therefore determines that the survival signals of all its child computers have timed out, concludes that a failure has occurred between itself and the LAN 401, and takes measures such as restarting itself (step ST68).

【０１２１】一方、計算機１０１は、計算機１０５から
の生存信号が、定められた時間内に受信できないことを
検出する（ステップＳＴ６６）。計算機１０１はしばら
く後、計算機１０２からの生存信号を検出した時点で
（ステップＳＴ６３）、子計算機１０５の生存信号がタ
イムアウトしたことをもって、計算機１０５が故障した
かあるいは、計算機１０５とＬＡＮ４０１との間に故障
が発生したものと判断する（ステップＳＴ６４）。以上
のような判断は、２つの計算機が同時に故障する確率が
非常に低いという仮定に基づいている。計算機１０１
は、子計算機１０５の子計算機（孫計算機）を、自分自
身の子計算機とすることにより、図１８のようにツリー
を再構成する。これにより、故障が発生しても、故障発
生以前と同程度の故障検出能力を維持することができ
る。計算機１０１は故障情報と新しい構成情報を、新た
な子計算機に対して通知する。これらの故障情報及び構
成情報を受信した計算機は、さらに、子計算機に対して
これを通知する（ステップＳＴ６５）。Meanwhile, the computer 101 detects that the survival signal from the computer 105 has not been received within the predetermined time (step ST66). When, a little later, the computer 101 detects the survival signal from the computer 102 (step ST63), it concludes from the timeout of the child computer 105 that either the computer 105 itself has failed or a failure has occurred between the computer 105 and the LAN 401 (step ST64). This judgment rests on the assumption that the probability of two computers failing at the same time is very low. The computer 101 then reconfigures the tree as shown in FIG. 18 by making the child computers of the child computer 105 (its grandchild computers) its own child computers. As a result, even after a failure, the system retains roughly the same failure detection capability as before. The computer 101 notifies its new child computers of the failure information and the new configuration information, and each computer that receives this information passes it on to its own child computers (step ST65).
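
The parent-side reasoning of this embodiment (all children silent means "my own link is down"; some children silent means "those children failed") and the adoption of grandchildren can be sketched as below. The tree layout, in which 102 is the parent of 103 and 104 and 105 is the parent of 106 and 107, is an assumption about FIG. 15, and the function names are hypothetical.

```python
def diagnose_children(children, timed_out):
    """Classify a parent's view of its children (cf. steps ST63-ST68)."""
    silent = set(timed_out) & set(children)
    if children and silent == set(children):
        return ("self", None)              # cf. ST67/ST68: own LAN link suspected
    if silent:
        return ("child", sorted(silent))   # cf. ST64: those children suspected
    return ("ok", None)


def adopt_grandchildren(tree, parent, failed):
    """Return a new tree in which `parent` takes over the children of `failed`."""
    new_tree = {k: list(v) for k, v in tree.items() if k != failed}
    idx = new_tree[parent].index(failed)
    new_tree[parent][idx:idx + 1] = tree.get(failed, [])
    return new_tree


# Assumed layout of the virtual tree (parent -> children).
tree = {101: [102, 105], 102: [103, 104], 105: [106, 107]}

view_105 = diagnose_children(tree[105], timed_out=[106, 107])
view_101 = diagnose_children(tree[101], timed_out=[105])
rebuilt = adopt_grandchildren(tree, parent=101, failed=105)
```

Here computer 105 concludes that its own connection has failed, computer 101 suspects computer 105, and the rebuilt tree places 106 and 107 directly under 101.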

【０１２２】尚、ステップＳＴ６１、ＳＴ６２は、生存
信号送信ステップ、ステップＳＴ６３、ＳＴ６４、ＳＴ
６６〜ＳＴ６８は故障検出ステップ、ステップＳＴ６５
は故障通知ステップ及び再構成ステップに対応してい
る。Steps ST61 and ST62 correspond to the survival signal transmission step; steps ST63, ST64, and ST66 to ST68 correspond to the failure detection step; and step ST65 corresponds to the failure notification step and the reconstruction step.

【０１２３】分散計算機システムは、各計算機に故障検
出機能を分散しているため、特定の計算機の故障によ
り、故障検出機能が失われることがない。さらに、分散
計算機システムは、各計算機が自分自身の生存を知らせ
るために、毎周期に１つの生存信号を送信するのみであ
るので、平常時に各計算機が送受信する生存信号の数を
最小にでき、計算機への負荷が小さくなる。また、ＬＡ
Ｎ上に送出される生存信号の総数は、計算機の台数に比
例した数であるので、ＬＡＮへの負荷、即ち単位時間あ
たりにＬＡＮ上に送信される信号の個数を小さくするこ
とができる。Because the distributed computer system distributes the failure detection function across all computers, the failure of any particular computer does not disable failure detection. Furthermore, because each computer sends only one survival signal per period to announce its own survival, the number of survival signals that each computer sends and receives during normal operation is kept to a minimum, which reduces the load on the computers. The total number of survival signals sent onto the LAN is proportional to the number of computers, so the load on the LAN, that is, the number of signals transmitted on the LAN per unit time, can be kept small.

【０１２４】実施例５．図１９は、この発明の他の実施
例による分散計算機システムの故障検出方法の動作を示
すフローチャートであり、図２０は、この実施例による
分散計算機システムの仮想的な構成を示すブロック図で
ある（以下、このような構成をチェーンと呼ぶ）。実施
例１と同様に、分散計算機システムの計算機の仮想的配
置には、計算機の様々な属性に注目したいくつかの方法
が考えられる。Embodiment 5. FIG. 19 is a flowchart showing the operation of a failure detection method for a distributed computer system according to another embodiment of the present invention, and FIG. 20 is a block diagram showing the virtual configuration of the distributed computer system according to this embodiment (hereinafter, such a configuration is called a chain). As in the first embodiment, several methods of virtually arranging the computers of the distributed computer system are conceivable, each focusing on different attributes of the computers.

【０１２５】図２０に示すように、この実施例による分
散計算機システムでは、まず、計算機１０１と計算機１
０５とをグループ１００１、計算機１０２と計算機１０
３と計算機１０４とをグループ１００２とする。また、
計算機１０１、１０４は、それぞれグループ１００１、
１００２を代表する代表計算機であり、これらの代表計
算機１０１、１０４は仮想的な仮想リング上に配置され
ている。代表計算機以外の計算機１０２、１０３、１０
５は、それぞれ自らが属するグループの代表計算機に生
存信号を定期的に送信すべく構成されている。図２１
は、このようなチェーン構成を有する分散計算機システ
ムの生存信号の送受信を示すブロック図である。As shown in FIG. 20, in the distributed computer system according to this embodiment, the computers 101 and 105 form a group 1001, and the computers 102, 103, and 104 form a group 1002. The computers 101 and 104 are the representative computers of the groups 1001 and 1002, respectively, and these representative computers are arranged on a virtual ring. The computers other than the representative computers, namely 102, 103, and 105, are each configured to transmit a survival signal periodically to the representative computer of the group to which they belong. FIG. 21 is a block diagram showing the transmission and reception of survival signals in a distributed computer system having such a chain configuration.

【０１２６】次に動作について説明する。以下、図１９
のフローチャートと対応させながら、代表計算機の動作
を説明する。Next, the operation is described. The operation of the representative computers is explained below with reference to the flowchart of FIG. 19.

【０１２７】仮想的なリング上に並べられた代表計算機
１０１、１０４は、それぞれ右隣の代表計算機に対し
て、また、代表計算機以外の計算機は代表計算機に対し
て、定期的に生存信号を送信すべく、生存信号の送信タ
イマが０であるか否かをチェックし（ステップＳＴ７
１）、生存信号の送信を開始して、予め定められたタイ
ムアウト時間を設定して生存信号送信タイマをセットす
る（ステップＳＴ７２）。従って、各代表計算機は、同
一のグループ内の他の計算機からと、左隣の代表計算機
からの生存信号を受信する。例えば、代表計算機１０４
はグループ１００２の計算機１０２、１０３からの生存
信号、及び、左隣の代表計算機１０１からの生存信号を
受信する。The representative computers 101 and 104 arranged on the virtual ring each transmit a survival signal periodically to the representative computer on their right, and each computer other than a representative computer transmits one periodically to its representative computer. To do so, each computer checks whether the survival signal transmission timer has reached 0 (step ST71); if it has, the computer starts transmitting the survival signal and sets the survival signal transmission timer with a predetermined timeout period (step ST72). Each representative computer therefore receives survival signals from the other computers in its own group and from the representative computer on its left. For example, the representative computer 104 receives survival signals from the computers 102 and 103 of the group 1002 and from the representative computer 101 on its left.
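
The destination rule of the chain configuration can be made concrete with a short sketch: representatives send to the next representative on the ring, and ordinary members send to their own representative. The function names and the `(representative, members)` list format are hypothetical choices for illustration.

```python
def survival_destinations(groups):
    """Map each computer to the computer it sends its survival signal to.

    groups -- list of (representative, [other members]) in virtual-ring order
    """
    reps = [rep for rep, _ in groups]
    dest = {}
    for i, (rep, members) in enumerate(groups):
        dest[rep] = reps[(i + 1) % len(reps)]   # right-hand representative
        for member in members:
            dest[member] = rep                   # own group's representative
    return dest


def expected_senders(dest, node):
    """Computers whose survival signals `node` should be receiving."""
    return sorted(src for src, dst in dest.items() if dst == node)


# Groups 1001 and 1002 as described in the text.
dest = survival_destinations([(101, [105]), (104, [102, 103])])
print(dest)
print(expected_senders(dest, 104))
```

For the groups in the text, computer 104 expects signals from 101, 102, and 103, which agrees with the example given for the representative computer 104.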

【０１２８】各代表計算機は、左隣の計算機からの生存
信号が、一定時間ごとに受信されるかを調べ（ステップ
ＳＴ７３）、受信した場合に予め定められたタイムアウ
ト時間を設定して生存信号の受信タイマをセットないし
リセットする（ステップＳＴ７４）。また、各代表計算
機は、生存信号の受信に失敗すると、まず、生存信号の
受信タイマが０か否か、即ち所定の時間、生存信号を受
信しなかったか否かを判断し（ステップＳＴ８３）、故
障を検出する。即ち、各代表計算機は、自らに送信され
る信号の受信状況を組み合わせて故障箇所を推定する。Each representative computer checks whether the survival signal from the computer on its left is received at regular intervals (step ST73) and, when it is received, sets or resets the survival signal reception timer with a predetermined timeout period (step ST74). When a representative computer fails to receive a survival signal, it first determines whether the survival signal reception timer has reached 0, that is, whether no survival signal has been received for the predetermined time (step ST83), and thereby detects a failure. In other words, each representative computer estimates the failure location by combining the reception status of the signals addressed to it.
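
The reception timer handling of steps ST73, ST74, and ST83 amounts to a per-sender watchdog. The sketch below is one possible reading, assuming a separate deadline per sender and explicit timestamps passed in by the caller; the class name `ReceiveWatchdog` is hypothetical.

```python
class ReceiveWatchdog:
    """Tracks, per sender, when its next survival signal is overdue."""

    def __init__(self, senders, timeout, now=0.0):
        self.timeout = timeout
        self.deadline = {s: now + timeout for s in senders}

    def on_signal(self, sender, now):
        # cf. ST74: set or reset the reception timer for this sender.
        self.deadline[sender] = now + self.timeout

    def expired(self, now):
        # cf. ST83: senders whose reception timer has run down to 0.
        return sorted(s for s, d in self.deadline.items() if now >= d)


watchdog = ReceiveWatchdog([101, 102, 103], timeout=3.0)
watchdog.on_signal(101, now=2.0)
watchdog.on_signal(103, now=2.5)
overdue = watchdog.expired(now=3.0)   # only 102 has stayed silent
```

Combining the set of overdue senders with the set of expected senders is what lets a representative distinguish a single failed peer from the loss of its own LAN connection.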

【０１２９】以下、例として、図２２に示すような計算
機１０２が故障した場合における、上記に継続する、故
障検出方法の動作について説明する。As an example, the following describes how the failure detection method proceeds from this point when the computer 102 fails, as shown in FIG. 22.

【０１３０】代表計算機１０４は、計算機１０２の生存
信号を受信できないが（ステップＳＴ８３）、代表計算
機１０１及び計算機１０３からの生存信号は受信できる
ため、ステップＳＴ７３からステップＳＴ７４へ移り、
受信した生存信号には故障情報は付加されていないの
で、ステップＳＴ７９に移行し、送信計算機１０２の生
存信号受信がタイムアウトしたか否かを判断し、計算機
１０２に故障が発生したと判断する。このような判断
は、２つの計算機が同時に故障する確率が非常に低いと
いう仮定に基づいている。The representative computer 104 cannot receive the survival signal of the computer 102 (step ST83), but it can receive the survival signals from the representative computer 101 and the computer 103, so it proceeds from step ST73 to step ST74. Because no failure information is attached to the received survival signals, it moves on to step ST79, determines whether reception of the survival signal from the sending computer 102 has timed out, and concludes that a failure has occurred in the computer 102. This judgment rests on the assumption that the probability of two computers failing at the same time is very low.

【０１３１】計算機１０４は、計算機１０２をグループ
から取り除き、図２３に示すような新たなチェーンを構
成し、グループ１００２の他の計算機１０３に対して、
生存信号とは別の信号により故障情報を通知する（ステ
ップＳＴ８０）。これにより、故障が発生しても、故障
発生以前と同程度の故障検出能力を維持することができ
る。The computer 104 removes the computer 102 from the group to form a new chain as shown in FIG. 23, and notifies the other computer 103 of the group 1002 of the failure information by a signal separate from the survival signal (step ST80). As a result, even after a failure, the system retains roughly the same failure detection capability as before.

【０１３２】次に、計算機１０４は、検出された故障情
報を、生存信号に付加することにより、他の代表計算機
に対して、故障情報の転送を行う（ステップＳＴ８
１）。生存信号を利用した故障情報の転送は、実施例１
と同様に行い、再送に備えてメモリに故障情報を保存し
て故障情報タイマをセットする（ステップＳＴ８２）。
そして、ステップＳＴ７５において、代表計算機は、受
信した生存信号中に故障発生を示す故障情報を発見した
場合、その故障情報の発信源が自分でないならば（ステ
ップＳＴ７６）、故障情報を次の生存信号に付加すると
ともに、グループの他の計算機に対して、生存信号とは
別の信号により通知する（ステップＳＴ７８）。また、
ステップＳＴ７６において、生存信号中の故障情報が自
分自身が出したもの、即ち、故障情報に関するメッセー
ジが、所定の時間以内にリングを一周して自分自身に到
達したか否かを調べ、もし受信したのであれば、メモリ
中から該メッセージを削除して、故障情報タイマを削除
する（ステップＳＴ７７）。Next, the computer 104 transfers the detected failure information to the other representative computers by adding it to the survival signal (step ST81). The transfer of failure information by means of the survival signal is performed as in the first embodiment: the failure information is stored in memory in preparation for retransmission and the failure information timer is set (step ST82). In step ST75, when a representative computer finds failure information indicating the occurrence of a failure in a received survival signal and is not itself the originator of that information (step ST76), it adds the failure information to its next survival signal and also notifies the other computers in its group by a signal separate from the survival signal (step ST78). In step ST76, the representative computer also checks whether the failure information in the survival signal is information that it issued itself, that is, whether its failure message has travelled once around the ring and returned to it within the predetermined time; if so, it deletes the message from memory and cancels the failure information timer (step ST77).

【０１３３】また、実施例１と同様に、故障情報が仮想
リング上を一巡している際に途中の計算機が故障したり
すると、故障情報が失われてしまう恐れがある。これを
防ぐために、ステップＳＴ７３において生存信号を受信
せず、ステップＳＴ８３において受信タイマが０でない
ならば、故障情報を発信してから、ステップＳＴ８２に
おいてセットしたタイムアウト時間内に故障情報が仮想
リングのループを一巡して自分自身に戻ってきたか否か
をチェックして（ステップＳＴ８６）、タイムアウト時
間を超過しているならばもう一度隣接計算機に対して生
存信号に故障情報を付加して送信し、故障情報タイマを
リセットする（ステップＳＴ８７）。As in the first embodiment, failure information may be lost if a computer along the way fails while the information is circulating around the virtual ring. To prevent this, when no survival signal is received in step ST73 and the reception timer has not reached 0 in step ST83, the computer checks whether the failure information it sent has travelled around the virtual ring and returned to it within the timeout period set in step ST82 (step ST86). If the timeout period has been exceeded, it again adds the failure information to a survival signal, transmits it to the adjacent computer, and resets the failure information timer (step ST87).

【０１３４】次に、別の例として、図２４に示すように
代表計算機１０４の通信インターフェース３１４が故障
した場合のこの実施例による故障検出方法の動作を示
す。Next, as another example, the operation of the failure detection method according to this embodiment is described for the case shown in FIG. 24, in which the communication interface 314 of the representative computer 104 has failed.

【０１３５】代表計算機１０４は、他の代表計算機１０
１、及び同一グループ１００２の計算機１０２、１０３
のいずれの生存信号も受信することができないので、ス
テップＳＴ８３を経てステップＳＴ８４に至り、全ての
送信計算機からの生存信号の受信がタイムアウト時間を
超過する。このため、計算機１０４は、自らとＬＡＮと
の間の接続が切断されたと考え、自分自身を再起動する
などの処置を行う（ステップＳＴ８５）。The representative computer 104 cannot receive a survival signal from the other representative computer 101 or from the computers 102 and 103 of its own group 1002. It therefore proceeds through step ST83 to step ST84, where it finds that reception of the survival signals from all sending computers has exceeded the timeout period. The computer 104 consequently assumes that the connection between itself and the LAN has been cut and takes measures such as restarting itself (step ST85).

【０１３６】また、代表計算機１０１は、代表計算機１
０４の生存信号が受信できず、計算機１０５の生存信号
が受信できることから、計算機１０４の故障を発見する
ことができる（ステップＳＴ７９）。次に、計算機１０
１は、図２５に示すような新たなチェーンを構成し、前
の例と同様にして他の計算機への故障情報の通知を行う
（ステップＳＴ８０〜ＳＴ８２、ＳＴ７６、ＳＴ７７、
ＳＴ８６、ＳＴ８７）。The representative computer 101, for its part, cannot receive the survival signal of the representative computer 104 but can receive the survival signal of the computer 105, and can therefore detect the failure of the computer 104 (step ST79). The computer 101 then forms a new chain as shown in FIG. 25 and notifies the other computers of the failure information in the same manner as in the previous example (steps ST80 to ST82, ST76, ST77, ST86, and ST87).

【０１３７】尚、ステップＳＴ７１、ＳＴ７２は、生存
信号送信ステップ、ステップＳＴ７３、ＳＴ７４、ＳＴ
８３〜ＳＴ８５は故障検出ステップ、ステップＳＴ７５
〜ＳＴ８２、ＳＴ８６〜ＳＴ８７は故障通知ステップ、
ステップＳＴ８０は再構成ステップに対応している。Steps ST71 and ST72 correspond to the survival signal transmission step; steps ST73, ST74, and ST83 to ST85 correspond to the failure detection step; steps ST75 to ST82, ST86, and ST87 correspond to the failure notification step; and step ST80 corresponds to the reconstruction step.

【０１３８】分散計算機システムは、各計算機に故障検
出機能を分散しているため、特定の計算機の故障によ
り、故障検出機能が失われることがない。また、分散計
算機システムは、各計算機が自分自身の生存を知らせる
ために、毎周期に１つの生存信号を送信するのみである
ので、平常時の故障検出及び、故障発生時の故障箇所の
特定のための通信量を最小にできる。さらに、分散計算
機システムでは、代表計算機間で故障情報を通知するた
めに生存信号を利用するので、ＬＡＮの負荷を小さくす
ることができる。Because the distributed computer system distributes the failure detection function across all computers, the failure of any particular computer does not disable failure detection. In addition, because each computer sends only one survival signal per period to announce its own survival, the amount of communication needed for failure detection during normal operation and for locating the failure when one occurs can be minimized. Furthermore, because the survival signals are used to pass failure information between the representative computers, the load on the LAN can be reduced.

【０１３９】実施例６．図２６は、この発明の他の実施
例による分散計算機システムの故障検出方法の動作を示
すフローチャート、図２７は、この実施例による分散計
算機システムの物理的な構成図であり、図において、１
０１〜１０４は計算機、４０１、４０２はＬＡＮであ
る。計算機１０１〜１０４は、各々通信インターフェー
ス２１１〜２１４と、ケーブル３１１〜３１４とによ
り、ＬＡＮ４０１に接続されている。また、計算機１０
１〜１０４は、各々通信インターフェース２２１〜２２
４と、ケーブル３２１〜３２４とにより、ＬＡＮ４０２
に接続されている。Embodiment 6. FIG. 26 is a flowchart showing the operation of a failure detection method for a distributed computer system according to another embodiment of the present invention, and FIG. 27 is a diagram of the physical configuration of the distributed computer system according to this embodiment. In the figure, 101 to 104 are computers, and 401 and 402 are LANs. The computers 101 to 104 are connected to the LAN 401 through communication interfaces 211 to 214 and cables 311 to 314, respectively, and to the LAN 402 through communication interfaces 221 to 224 and cables 321 to 324, respectively.

【０１４０】この実施例による分散計算機システムでは
故障検出のため、実施例１と同様に、計算機１０１〜１
０４を図４のような仮想的な仮想リング状に配置する。
実施例１と同様に、計算機の仮想的配置には、計算機の
様々な属性に注目したいくつかの方法が考えられる。In the distributed computer system according to this embodiment, the computers 101 to 104 are arranged on a virtual ring as shown in FIG. 4 for failure detection, as in the first embodiment. As in the first embodiment, several methods of virtually arranging the computers are conceivable, each focusing on different attributes of the computers.

【０１４１】次に動作について説明する。以下、図２６
のフローチャートと対応させながら、各計算機の動作を
説明する。Next, the operation is described. The operation of each computer is explained below with reference to the flowchart of FIG. 26.

【０１４２】各計算機は、仮想的配置において右隣に位
置する計算機に対して、生存信号を送信する。説明の便
宜上、各計算機に仮想リング上での配列順序に従って番
号を割り当てるとき、奇数番目の計算機はＬＡＮ４０１
を用いて、右隣の計算機に定期的に生存信号を送信し、
また、偶数番目の計算機はＬＡＮ４０２を用いて、右隣
の計算機に定期的に生存信号を送信すべく、生存信号を
定期的に送るための送信タイマが０か否かをチェックし
て（ステップＳＴ９１）、送信タイマが０ならば、生存
信号の送信を開始して、予め定められたタイムアウト時
間を設定して生存信号の送信タイマをセットする（ステ
ップＳＴ９２）。また、左隣の計算機からの生存信号
は、奇数番目の計算機ではＬＡＮ４０２を通じて受信さ
れ、偶数番目の計算機ではＬＡＮ４０１を通じて受信さ
れる。図２８は、各計算機の生存信号の送受信を示すブ
ロック図である。Each computer transmits a survival signal to the computer located on its right in the virtual arrangement. For convenience of explanation, suppose that the computers are numbered according to their order on the virtual ring. The odd-numbered computers use the LAN 401, and the even-numbered computers use the LAN 402, to transmit a survival signal periodically to the computer on their right. To do so, each computer checks whether the transmission timer for periodic survival signals has reached 0 (step ST91); if it has, the computer starts transmitting the survival signal and sets the survival signal transmission timer with a predetermined timeout period (step ST92). Consequently, the survival signal from the computer on the left is received through the LAN 402 at odd-numbered computers and through the LAN 401 at even-numbered computers. FIG. 28 is a block diagram showing the transmission and reception of survival signals by the computers.
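
The alternating use of the two LANs can be tabulated with a few lines of code. This sketch assumes 1-based positions on the ring and an even number of computers (so that odd and even positions alternate cleanly around the ring); `lan_plan` is a hypothetical helper name.

```python
def lan_plan(ring, lans=(401, 402)):
    """Return (sender, receiver, lan) triples for one period of survival signals."""
    plan = []
    for pos, node in enumerate(ring, start=1):
        right = ring[pos % len(ring)]                 # right-hand neighbor
        lan = lans[0] if pos % 2 == 1 else lans[1]    # odd -> 401, even -> 402
        plan.append((node, right, lan))
    return plan


plan = lan_plan([101, 102, 103, 104])
for sender, receiver, lan in plan:
    print(f"{sender} -> {receiver} via LAN {lan}")
```

In the resulting plan, the odd-positioned computers 101 and 103 receive over LAN 402 and the even-positioned computers 102 and 104 receive over LAN 401, as stated in the text.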

【０１４３】通常、計算機は平常モードで動作しており
（ステップＳＴ９３）、各計算機は、左隣の計算機から
の生存信号が、一定時間ごとに受信されるかを調べ（ス
テップＳＴ９４）、受信した場合に予め定められたタイ
ムアウト時間を設定して生存信号の受信タイマをセット
ないしリセットする（ステップＳＴ９５）。さらに、生
存信号に故障情報が付加されているかチェックする（ス
テップＳＴ９６）。A computer normally operates in normal mode (step ST93). Each computer checks whether the survival signal from the computer on its left is received at regular intervals (step ST94) and, when it is received, sets or resets the survival signal reception timer with a predetermined timeout period (step ST95). It then checks whether failure information has been added to the survival signal (step ST96).

【０１４４】以下、故障例１として、図２９に示すよう
な計算機１０２とＬＡＮ４０２とを接続するケーブル３
２２が切断した場合における、この実施例による分散計
算機システムの故障検出方法の動作を説明する。The following describes, as failure example 1, the operation of the failure detection method for the distributed computer system according to this embodiment for the case shown in FIG. 29, in which the cable 322 connecting the computer 102 to the LAN 402 has been disconnected.

【０１４５】計算機１０３は、計算機１０２からの生存
信号が、予め定められた時間内に受信できないことを検
出すると（ステップＳＴ１００）、故障検出モードに移
行し、予め定められた計算機に対して、自分自身に生存
信号を送信するように要求する信号を、ＬＡＮ２を用い
て送る（ステップＳＴ１０１）。When the computer 103 detects that the survival signal from the computer 102 has not been received within the predetermined time (step ST100), it switches to failure detection mode and, using the LAN 402, sends a predetermined computer a signal requesting that computer to transmit a survival signal to the computer 103 (step ST101).

【０１４６】この後、計算機１０３では、ステップＳＴ
９１に戻り、ステップＳＴ９３に至り、故障検出モード
であるのでステップＳＴ１０４に分岐する。一方、この
要求信号が計算機１０４に送られたとすると、計算機１
０４は該要求信号に応じて生存信号を計算機１０３に送
信する。計算機１０３は、ステップＳＴ１０４におい
て、該要求信号に対する応答が得られたことから、計算
機１０２とＬＡＮ４０２の間の接続に故障が発生したと
判断する。さらに、計算機１０３は、図３０に示すよう
に、再構成して各計算機の生存信号の送信先を設定し直
す（ステップＳＴ１０５）。このような再構成を行うこ
とにより、故障が発生しても、故障発生以前と同程度の
故障検出能力を維持することができる。この際に、送信
先が変更される計算機に対して、計算機１０３は直接故
障情報を通知した後（ステップＳＴ１０６）、再送に備
えてメモリに故障情報を保存して故障情報タイマをセッ
トし（ステップＳＴ１０７）、平常モードに戻る（ステ
ップＳＴ１０８）。また、それ以外の計算機に対して
は、生存信号に故障情報を付加し、順次隣接計算機に転
送することにより、故障情報を通知する（ステップＳＴ
１０６、ＳＴ１０７、ＳＴ９７、ＳＴ９９）。実施例１
と同様に、計算機１０３は、該故障情報が一定時間以内
に自分自身に転送されて戻ってくるかを調べ（ステップ
ＳＴ９６、ＳＴ９７）、もし受信したのであれば、メモ
リ中から該メッセージを削除して、故障情報タイマを削
除する（ステップＳＴ９８）。これに対して、一巡中に
故障情報が失われ、戻ってこないような場合に再転送を
行う（ステップＳＴ１０２、ＳＴ１０３）。After that, the computer 103 returns to step ST91 and reaches step ST93, where, being in failure detection mode, it branches to step ST104. Suppose that the request signal was sent to the computer 104; the computer 104 then transmits a survival signal to the computer 103 in response. Because a response to the request signal has been obtained in step ST104, the computer 103 determines that a failure has occurred in the connection between the computer 102 and the LAN 402. The computer 103 then reconfigures the system, resetting the destination of each computer's survival signal as shown in FIG. 30 (step ST105). With this reconfiguration, even after a failure, the system retains roughly the same failure detection capability as before. At this point, the computer 103 directly notifies the computers whose destinations are changed of the failure information (step ST106), stores the failure information in memory in preparation for retransmission, sets the failure information timer (step ST107), and returns to normal mode (step ST108). The remaining computers are notified by adding the failure information to the survival signal and forwarding it from one adjacent computer to the next (steps ST106, ST107, ST97, and ST99). As in the first embodiment, the computer 103 checks whether the failure information is forwarded back to it within a fixed time (steps ST96 and ST97); if it is received, the computer deletes the message from memory and cancels the failure information timer (step ST98). If, on the other hand, the failure information is lost on its way around the ring and does not return, the computer retransmits it (steps ST102 and ST103).

【０１４７】次に、別な故障例として、図３１に示すよ
うな計算機１０３の通信インターフェースとＬＡＮ４０
２とを接続するケーブル３２３が故障した場合におけ
る、この実施例による故障検出方法の動作について説明
する。Next, as another failure example, the operation of the failure detection method according to this embodiment is described for the case shown in FIG. 31, in which the cable 323 connecting the communication interface of the computer 103 to the LAN 402 has failed.

【０１４８】計算機１０３は、計算機１０２からの生存
信号が、定められた時間内に受信できないことを検出す
ると（ステップＳＴ１００）、故障検出モードに移行
し、適当な計算機に対して、自分自身に生存信号を送信
するように要求する信号を、ＬＡＮ２を用いて送る（ス
テップＳＴ１０１）。この後、計算機１０３では、ステ
ップＳＴ９１に戻り、故障検出モードであるのでステッ
プＳＴ９３を経てステップＳＴ１０４に分岐する。要求
信号は、送信先の計算機に届かないため、計算機１０３
は該要求信号に対する応答を受信することができない。
このため、計算機１０３は、ステップＳＴ１０４からス
テップＳＴ１０９に移行して、定められた時間内に応答
を受信できないために自分自身とＬＡＮ４０２との間に
故障が発生したと判断する（ステップＳＴ１０５）。計
算機１０３は、図３２に示すように各計算機の生存信号
の送信先を設定し直し、故障例１と同様にして各計算機
に故障情報を通知し、故障情報をメモリに保存して故障
情報タイマをセットするとともに、平常モードに戻る
（ステップＳＴ１０５〜１０８）。また、故障例１と同
様に、計算機１０３は、該故障情報が一定時間以内に自
分自身に転送されて戻ってくるかを調べ（ステップＳＴ
９６、ＳＴ９７）、受信したのであれば、メモリ中から
該メッセージを削除して、故障情報タイマを削除する
（ステップＳＴ９８）。これに対して、一巡中に故障情
報が失われ、戻ってこないような場合に再転送を行う
（ステップＳＴ１０２、ＳＴ１０３）。When the computer 103 detects that the survival signal from the computer 102 has not been received within the predetermined time (step ST100), it switches to failure detection mode and, using the LAN 402, sends a suitable computer a signal requesting that computer to transmit a survival signal to the computer 103 (step ST101). After that, the computer 103 returns to step ST91 and, being in failure detection mode, branches via step ST93 to step ST104. Because the request signal does not reach the destination computer, the computer 103 cannot receive a response to it. The computer 103 therefore moves from step ST104 to step ST109 and, since no response has arrived within the predetermined time, determines that a failure has occurred between itself and the LAN 402 (step ST105). The computer 103 resets the destination of each computer's survival signal as shown in FIG. 32, notifies each computer of the failure information in the same manner as in failure example 1, stores the failure information in memory, sets the failure information timer, and returns to normal mode (steps ST105 to ST108). As in failure example 1, the computer 103 checks whether the failure information is forwarded back to it within a fixed time (steps ST96 and ST97); if it is received, the computer deletes the message from memory and cancels the failure information timer (step ST98). If, on the other hand, the failure information is lost on its way around the ring and does not return, the computer retransmits it (steps ST102 and ST103).
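
The two failure examples above differ only in whether the probe request is answered. A minimal sketch of that decision follows; it assumes that the request and its reply both travel over the same LAN and that a link is identified by a `(computer, lan)` pair. The function names and the `broken_links` set are hypothetical.

```python
def can_talk(a, b, lan, broken_links):
    """True if both computers still have a working connection to `lan`."""
    return (a, lan) not in broken_links and (b, lan) not in broken_links


def locate_fault(observer, silent_left, helper, lan, broken_links):
    """Decide which LAN connection failed after `silent_left` went quiet.

    The observer asks `helper` to send it a survival signal over `lan`
    (cf. step ST101) and waits for the reply.
    """
    answered = can_talk(observer, helper, lan, broken_links)
    if answered:
        # cf. ST104: our own link works, so the silent neighbor's link is suspect.
        return (silent_left, lan)
    # cf. ST109: nothing came back, so our own link is suspect.
    return (observer, lan)


# Failure example 1: cable 322 (computer 102 to LAN 402) is cut.
case1 = locate_fault(103, 102, helper=104, lan=402, broken_links={(102, 402)})
# Failure example 2: cable 323 (computer 103 to LAN 402) is cut.
case2 = locate_fault(103, 102, helper=104, lan=402, broken_links={(103, 402)})
```

Like the patent's own reasoning, this sketch relies on the assumption that only one fault is present at a time; a simultaneous failure of the helper's link would be misread as a failure of the observer's link.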

【０１４９】尚、ステップＳＴ９１、ＳＴ９２は、生存
信号送信ステップ、ステップＳＴ９３、ＳＴ９４、ＳＴ
１００、ＳＴ１０１、ＳＴ１０４、ＳＴ１０９は故障検
出ステップ、ステップＳＴ９５〜ＳＴ９９、ＳＴ１０６
〜ＳＴ１０８は故障通知ステップ、ステップＳＴ１０
５、ＳＴ９９は再構成ステップに対応している。Steps ST91 and ST92 correspond to the survival signal transmission step; steps ST93, ST94, ST100, ST101, ST104, and ST109 correspond to the failure detection step; steps ST95 to ST99 and ST106 to ST108 correspond to the failure notification step; and steps ST105 and ST99 correspond to the reconstruction step.

【０１５０】分散計算機システムは、各計算機に故障検
出機能を分散しているため、特定の計算機の故障によ
り、故障検出機能が失われることがない。また、分散計
算機システムは、平常時の故障検出及び、故障発生時の
故障箇所の特定のための通信量を最小にできる。さら
に、分散計算機システムでは、故障情報を通知するため
に、生存信号を利用するため、通知のための余分な信号
を送信する必要がなく、ＬＡＮの負荷を小さくすること
ができる。Since the distributed computer system distributes the failure detection function to each computer, the failure detection function is not lost due to the failure of a specific computer. In addition, the distributed computer system can minimize the amount of communication for detecting a failure during normal operation and for specifying a failure location when a failure occurs. Furthermore, in the distributed computer system, since the survival signal is used to notify the failure information, it is not necessary to transmit an extra signal for notification, and the load on the LAN can be reduced.

[0151] Embodiment 7. FIG. 33 is a flowchart showing the operation of a failure detection method for a distributed computer system according to another embodiment of the present invention. The distributed computer system according to this embodiment has the same physical configuration as Embodiment 6, and, for failure detection, the computers 101 to 104 are arranged in the virtual ring shown in FIG. 2. As in Embodiment 1, several methods of virtually arranging the computers are conceivable, each focusing on different attributes of the computers.

[0152] Next, the operation will be described. The operation of each computer is described below with reference to the flowchart of FIG. 33.

[0153] To send a survival signal to the computers adjacent to it in the virtual arrangement, each computer checks whether the survival-signal transmission timer for the right or left neighbor x has reached 0 (step ST111). If the timer is 0, that is, if no survival signal has been sent within the predetermined time, the computer sends the survival signal and switches the destination from the right neighbor to the left neighbor, or from the left neighbor to the right neighbor (step ST112). It then returns to step ST111. Here, the right neighbor is the destination on the LAN 401, and the left neighbor is the destination on the LAN 402. As a result, two adjacent computers send survival signals to each other over different LANs. Such a path over which adjacent computers exchange survival signals is called a loop. FIG. 34 is a block diagram showing the transmission and reception of survival signals via loops in the failure detection method according to this embodiment.

[0154] By using the loop, each computer can tell an adjacent computer whether it received that computer's survival signal, using the survival signal it sends back to that adjacent computer. If the survival-signal transmission timer for the adjacent computer x is not 0 and the survival signal from the adjacent computer x was received within the predetermined time (step ST113), the computer writes ACK into the survival signal as its response to the adjacent computer x (step ST114). If the survival signal could not be received within the predetermined time (step ST122), the computer writes NAK into the survival signal as its response to the adjacent computer (step ST123). If a computer that failed to receive the survival signal from one adjacent computer in step ST122 also fails to receive the survival signal from the other adjacent computer (step ST124), it judges that it is isolated and takes a measure such as restarting (step ST125).
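The response rule of steps ST113 to ST125 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the names `Response`, `response_for`, and `is_isolated` are hypothetical, and timer handling is reduced to a boolean saying whether the neighbor's survival signal arrived in time.

```python
from enum import Enum


class Response(Enum):
    """Response carried inside a survival signal (names are illustrative)."""
    ACK = "ACK"
    NAK = "NAK"


def response_for(received_from_x: bool) -> Response:
    """Response written into the survival signal sent to neighbor x.

    ACK if x's survival signal arrived within the predetermined time
    (ST113, ST114), otherwise NAK (ST122, ST123).
    """
    return Response.ACK if received_from_x else Response.NAK


def is_isolated(received_from_left: bool, received_from_right: bool) -> bool:
    """ST124: a computer that hears neither neighbor judges itself isolated."""
    return not received_from_left and not received_from_right
```

A computer for which `is_isolated` returns `True` would then take the measure of step ST125, such as restarting.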

[0155] As a result, for each adjacent computer, a computer either receives a survival signal whose response is ACK, receives one whose response is NAK, or cannot receive the survival signal at all. Hereinafter, the response in the case where no survival signal can be received is written "NoMsg.".

[0156] FIG. 35 shows the range of possible failures when the hatched computer obtains each of the responses ACK, NAK, and NoMsg. from its left neighbor. In the case of ACK, both computers and the paths up to both LANs are normal. In the case of NAK, the left computer could not receive the survival signal from the hatched computer on its right within the predetermined time, so the path used to send that survival signal to the left neighbor may have failed. In the case of NoMsg., the left neighbor itself, or the path of the survival signal from the left neighbor, may have failed.

[0157] Because each computer can receive responses from the computers on both sides, it can narrow down the range in which the failure lies by combining the two responses. FIG. 36 shows the failure range identified from each combination of survival signals received by the hatched computer. Of the nine cases shown in FIG. 36, the failure location can be pinpointed in cases 2002, 2004, and 2005 (steps ST116, ST118). In cases 2003 and 2007 the failure range spans several locations; in these cases the failure is judged to have occurred at the location marked with a cross in the figure (step ST117).

[0158] In the distributed computer system, when adjacent computers exchange survival signals over a loop, each computer sends its survival signal to the partner computer after a suitable waiting time has elapsed since it received the partner's survival signal. By repeating this in turn, the two computers exchange survival signals at a constant period in a synchronous manner (step ST126).

[0159] One example of how to set the waiting time is to use a waiting time of 0 when the survival signal of the left neighbor is received and a waiting time of T when the survival signal of the right neighbor is received. With this method, the response to a survival signal sent to the right neighbor is obtained without delay, so the survival signal from the right neighbor always accurately reflects the state of the distributed computer system.

[0160] Next, as a failure example, the operation of each computer is described for the case where the cable 323 connecting the communication interface of the computer 103 to the LAN 402 is disconnected, as shown in FIG. 37.

[0161] Because the cable 323 is disconnected, the survival signal that the computer 103 sends to the computer 102 over the LAN 402 and the survival signal that the computer 104 sends to the computer 103 over the LAN 402 do not reach their destinations (step ST122). The computer 102 therefore sends a survival signal containing NAK to the computer 103, and the computer 103 sends one to the computer 104, each over the LAN 401 (step ST123). As a result, the responses that the computer 102 receives from its two neighbors are ACK and NoMsg. (step ST117). The responses that the computer 103 receives are NAK and NoMsg., and the responses that the computer 104 receives are NAK and ACK (step ST116). Accordingly, as shown in FIG. 37, the computers 102 and 104 both judge that a failure has occurred in either the cable 323 or the communication interface 223 (step ST119). From its combination of responses (case 2008), the computer 103 can detect that a failure has occurred in its own vicinity but cannot pinpoint the location. It therefore waits for the adjacent computer 102 or 104 to identify the failure and report its details (step ST120).
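The responses listed in this failure example can be reproduced with a small steady-state model. The sketch below is only an illustration under stated assumptions: the ring order 101, 102, 103, 104 with "right" meaning the next computer in that list, the representation of a fault as a cut `(computer, LAN)` attachment, and all function names are hypothetical.

```python
RING = [101, 102, 103, 104]            # assumed ring order; "right" = next entry
LAN_TO_RIGHT, LAN_TO_LEFT = 401, 402   # LAN used toward each neighbor


def right_of(c):
    return RING[(RING.index(c) + 1) % len(RING)]


def left_of(c):
    return RING[(RING.index(c) - 1) % len(RING)]


def delivered(src, dst, broken):
    """True if a survival signal from src reaches its neighbor dst.

    broken is a set of (computer, lan) attachments that are cut.
    """
    lan = LAN_TO_RIGHT if dst == right_of(src) else LAN_TO_LEFT
    return (src, lan) not in broken and (dst, lan) not in broken


def observed_responses(broken):
    """For every computer, the response it sees from each neighbor."""
    result = {}
    for c in RING:
        seen = {}
        for n in (left_of(c), right_of(c)):
            if not delivered(n, c, broken):
                seen[n] = "NoMsg."          # nothing arrived from n
            elif delivered(c, n, broken):
                seen[n] = "ACK"             # n heard c and says so
            else:
                seen[n] = "NAK"             # n did not hear c
        result[c] = seen
    return result


# Cable 323: the attachment of computer 103 to LAN 402 is cut.
responses = observed_responses({(103, 402)})
```

Under these assumptions, `responses[102]` holds ACK from 101 and NoMsg. from 103, `responses[103]` holds NAK from 102 and NoMsg. from 104, and `responses[104]` holds NAK from 103 and ACK from 101, which agrees with the combinations stated in the text.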

[0162] The computer 102 or 104 reconfigures the ring as shown in FIG. 38. In the new configuration, the computers 102 and 104 send responses to each other via the LANs 401 and 402. Because the computer 103 can still send and receive signals over the LAN 401, it periodically receives a survival signal from the computer 102 in order to confirm its own connection to the LAN 401, and it sends a survival signal to the computer 104 in order to announce that it is alive. In this way, even when a failure occurs, the system can maintain roughly the same failure detection capability as before the failure.

[0163] To notify all computers of such failure information, the same method as in Embodiment 2 is used: the failure information is forwarded to the next computer in turn, with adjacent computers confirming its delivery at each hop (step ST121).

[0164] Note that steps ST111, ST112, and ST126 correspond to the survival signal transmission step; steps ST113, ST114, ST122, and ST123 to the survival signal response step; steps ST115 to ST120, ST124, and ST125 to the failure detection step; step ST121 to the failure notification step; and steps ST119 and ST120 to the reconfiguration step.

[0165] Because the distributed computer system distributes the failure detection function across the computers, the failure of any particular computer does not cause the failure detection function to be lost. The amount of communication needed for failure detection during normal operation is twice that of Embodiment 6, but it is of the same order, so the communication volume grows only modestly as computers are added. When an abnormality occurs, no extra signals are needed to locate the failure, so the communication volume stays small. Furthermore, because the survival signal is used to convey failure information, no extra signal needs to be sent for notification, and because the total number of survival signals placed on the LAN is proportional to the number of computers, the load on the LAN can be kept small.

[0166] Embodiment 8. FIG. 39 is a block diagram showing the operation of a failure detection method for a distributed computer system according to another embodiment of the present invention. The distributed computer system of this embodiment has the same physical and virtual configuration as Embodiment 7. As in Embodiment 7, adjacent computers exchange survival signals using loops.

[0167] In this distributed computer system, the survival signal contains two fields. The first field, as in Embodiment 7, is the response that tells the adjacent computer whether its survival signal was received. As shown in FIG. 39, the second field is a copy of the first field (the response) of the survival signal that the computer received from its other neighbor, that is, the neighbor that is not the destination of the survival signal being sent.

[0168] As in Embodiment 7, each computer identifies the failure location by combining the responses contained in the survival signals from both neighbors. In addition, the failure location can be identified by combining the contents of the two fields contained in a single survival signal. FIG. 40 shows the failure locations that can be identified from the contents of the survival signal received by the hatched computer.

[0169] Next, the operation will be described. As a failure example, the operation of each computer is described for the case where the cable 323 connecting the communication interface 223 of the computer 103 to the LAN 402 is disconnected, as shown in FIG. 41.

[0170] Because of the cable failure, the survival signals that the computers 103 and 104 send to their left neighbors over the LAN 402 are not received by the destination computers. The responses of the computers 102 and 103 to their right neighbors are therefore NAK. In other words, the first field of the survival signals that the computers 102 and 103 send to their right neighbors over the LAN 401 is NAK. Consequently, the second field of the survival signals that the computers 103 and 104 send to their right neighbors over the LAN 401 is NAK. The second field of the survival signal that the computer 102 sends to its left neighbor over the LAN 402 is NoMsg. The second field of the other survival signals is ACK. The computer 101 can therefore detect the failure either by applying case 3003 to the survival signal received from the computer 104 or by applying case 3005 to the survival signal received from the computer 102. The computers 102 and 104 can detect the failure by combining the two survival signals received from their neighbors. The computer 103 can detect the failure by applying case 3002 to the survival signal received from the computer 102.
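The two-field survival signal of this embodiment can be modeled in the same steady-state style. The following sketch is illustrative only: the ring order, the `(computer, LAN)` fault representation, and the function names are assumptions, and the mapping from field values to the cases of FIG. 40 is deliberately left out because the figure is not reproduced here.

```python
RING = [101, 102, 103, 104]   # assumed ring order; "right" = next entry


def neighbors(c):
    """Return (left, right) neighbors of computer c."""
    i = RING.index(c)
    return RING[i - 1], RING[(i + 1) % len(RING)]


def delivered(src, dst, broken):
    """LAN 401 carries signals to the right neighbor, LAN 402 to the left."""
    lan = 401 if dst == neighbors(src)[1] else 402
    return (src, lan) not in broken and (dst, lan) not in broken


def first_field(src, dst, broken):
    """Response of src to dst: did src receive dst's survival signal?"""
    return "ACK" if delivered(dst, src, broken) else "NAK"


def second_field(src, dst, broken):
    """Copy of the first field that src received from its other neighbor."""
    left, right = neighbors(src)
    other = left if dst == right else right
    if not delivered(other, src, broken):
        return "NoMsg."
    return first_field(other, src, broken)


broken = {(103, 402)}   # cable 323 is cut
fields_103_to_104 = (first_field(103, 104, broken), second_field(103, 104, broken))
fields_102_to_101 = (first_field(102, 101, broken), second_field(102, 101, broken))
```

In this model the signal from 103 to 104 carries NAK in both fields, and the signal from 102 to 101 carries ACK in the first field and NoMsg. in the second, matching the values given in the text.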

[0171] Reconfiguration after failure detection and notification of failure information are the same as in Embodiment 7. Through reconfiguration, even when a failure occurs, the system can maintain roughly the same failure detection capability as before the failure.

[0172] Because the distributed computer system according to this embodiment distributes the failure detection function across the computers, the failure of any particular computer does not cause the failure detection function to be lost. In addition, each computer sends only one survival signal per period to announce that it is alive, so the amount of communication needed to detect failures during normal operation and to locate a failure when one occurs can be kept to the minimum order. Furthermore, because the survival signal is used to convey failure information, no extra signal needs to be sent for notification, and because the total number of survival signals placed on the LAN is proportional to the number of computers, the load on the LAN can be kept small.

[0173] Embodiment 9. FIG. 42 is a flowchart showing the operation of a failure detection method for a distributed computer system according to another embodiment of the present invention. The distributed computer system of this embodiment has the same physical configuration as Embodiment 6, shown in FIG. 27. For failure detection, the computers are arranged in a virtual ring as shown in FIG. 2. As in Embodiment 1, several methods of virtually arranging the computers are conceivable, each focusing on different attributes of the computers.

[0174] Next, the operation will be described. The operation of each computer is described below with reference to the flowchart of FIG. 42.

[0175] Among the computers forming the virtual ring, each even-numbered computer periodically sends survival signals to both of its neighbors over the LAN 401. To do so, it checks whether the survival-signal transmission timer for the adjacent computer x has reached 0 (step ST131), sends the survival signal, sets the survival-signal transmission timer to a predetermined timeout, and switches the destination from one adjacent computer x to the other adjacent computer x' (step ST132). Likewise, each odd-numbered computer periodically sends survival signals to both of its neighbors over the LAN 402. Each computer checks whether it has received a survival signal (step ST133). If it has not, it further checks whether reception of the survival signal from the adjacent computer has exceeded the timeout (step ST144); if the timeout has not expired, it resets the reception timer (step ST147) and returns to step ST131. FIG. 43 is a block diagram showing how the computers in the distributed computer system according to this embodiment send and receive survival signals. As shown in FIG. 43, two adjacent computers send survival signals to each other over different LANs. Such a path for exchanging survival signals is called a loop.

[0176] Below, as failure example 1, the operation of each computer is described for the case where the cable 313 connecting the communication interface 213 of the computer 103 to the LAN 401 fails, as shown in FIG. 44.

[0177] Because of the failure, the survival signals sent by the computer 103 are not received by their destinations, the computers 102 and 104 (step ST144). Each of these computers then checks whether reception of the survival signal from its other neighbor, the one that is not the computer 103, has also exceeded the timeout (step ST145). Because the computers 102 and 104 do receive a survival signal from that other neighbor, the computer 101, they pass through steps ST147, ST131, and ST132 and reach step ST133. The computers 102 and 104 therefore judge that one of the computer 103, the communication interface 213, and the cable 313 has failed (steps ST133, ST134, ST135). This judgment rests on the assumption that the probability of two computers failing at the same time is very low.

[0178] As failure example 2, the operation of each computer is described for the case where the cable 323 connecting the communication interface 223 of the computer 103 to the LAN 402 fails, as shown in FIG. 45.

[0179] Because of the failure, the computer 103 cannot receive the survival signals sent by the computers 102 and 104 (step ST144). Once the computer 103 finds that it can receive the survival signal of neither neighbor (step ST145), it judges that the connection between itself and the LAN 402 has been cut (step ST146).
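The two failure examples of this embodiment follow one decision rule, sketched below. This is a simplified reading of steps ST133 to ST146 under the single-failure assumption stated in the text; the function name and the returned strings are illustrative and are not taken from the flowchart.

```python
def diagnose(heard_left: bool, heard_right: bool) -> str:
    """Classify a failure from which neighbors' survival signals timed out.

    Assumes at most one failure at a time, as the text does.
    """
    if heard_left and heard_right:
        return "no failure detected"
    if not heard_left and not heard_right:
        # Failure example 2: both neighbors are silent (ST145, ST146).
        return "own connection to the neighbors' LAN is cut"
    silent = "left" if not heard_left else "right"
    # Failure example 1: exactly one neighbor is silent (ST133 to ST135).
    return f"{silent} neighbor, its interface, or its cable has failed"
```

In failure example 1 the computers 102 and 104 each miss only the computer 103, so they take the single-silent-neighbor branch; in failure example 2 the computer 103 misses both neighbors and takes the own-connection branch.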

[0180] In failure example 1, the ring is reconfigured as shown in FIG. 46. In failure example 2, the ring is reconfigured as shown in FIG. 47. In this way, even when a failure occurs, the system can maintain roughly the same failure detection capability as before the failure. To notify all computers of such failure information, the same method as in Embodiment 2 is used: the failure information is forwarded to the next computer in turn, with adjacent computers confirming its delivery at each hop (steps ST136 to ST143).

[0181] Note that steps ST131, ST132, and ST147 correspond to the survival signal transmission step; steps ST133, ST134, ST135, and ST144 to ST146 to the failure detection step; steps ST136, ST138 to ST143, and ST146 to the failure notification step; and steps ST135 and ST137 to the reconfiguration step.

[0182] Because the distributed computer system distributes the failure detection function across the computers, the failure of any particular computer does not cause the failure detection function to be lost. In addition, each computer sends only one survival signal per period to announce that it is alive, so the amount of communication needed to detect failures during normal operation and to locate a failure when one occurs can be kept to the minimum order. Furthermore, because the survival signal is used to convey failure information, no extra signal needs to be sent for notification, and because the total number of survival signals placed on the LAN is proportional to the number of computers, the load on the LAN can be kept small.

[0183] Embodiment 10. FIG. 48 is a flowchart showing the operation of a failure detection method for a distributed computer system according to another embodiment of the present invention. The distributed computer system of this embodiment has the same physical configuration as Embodiment 6, shown in FIG. 27.

[0184] In the distributed computer system according to this embodiment, the computers are organized into groups of three for failure detection, as shown in FIG. 49. As in Embodiment 1, several methods of forming such groups are conceivable, each focusing on different attributes of the computers.

[0185] Next, the operation will be described. The operation of the computers 101, 103, 104, and 106 is described below with reference to the flowchart of FIG. 48.

[0186] The computers in a group are called A, A', and B according to the role each plays in failure detection. In FIG. 49, the computers 101 and 104 are A, the computers 102 and 105 are B, and the computers 103 and 106 are A'. B periodically sends a survival signal to A over the LAN 401 and to A' over the LAN 402. To send a survival signal periodically to A' over the LAN 402, A checks whether the survival-signal transmission timer has reached 0 (step ST151); if it has, A sends the survival signal and resets the survival-signal transmission timer (step ST152). At this point, if A has not received a survival signal from A' (step ST153) and can receive the survival signal from B (step ST161), it writes ACK into the survival signal (step ST162). If reception of the survival signal from A' has not exceeded the predetermined timeout (step ST168) and the survival signal from B cannot be received, it writes NAK (steps ST172, ST173). Like A, A' writes into its survival signal whether the survival signal from B was received, and periodically sends that survival signal to A.

[0187] A judges the failure location from the presence and contents of the survival signals from A' and B, and A' judges the failure location from the presence and contents of the survival signals from A and B. FIG. 50 shows the range of failure locations inferred from each combination of survival signals received by A. In some cases, one or more failures may have occurred among several candidate locations. The distributed computer system searches for candidate failure locations on the assumption that only one location fails at a time. Among the candidates, the one closest to the computer making the judgment is regarded as the location where the failure occurred. The crosses in FIG. 51 indicate the failure locations determined by this criterion.

[0188] Next, as failure example 1, the operation of the distributed computer system according to this embodiment is described for the case where the cable 312 connecting the communication interface 212 of the computer 102 to the LAN 401 is disconnected, as shown in FIG. 51.

[0189] Because the cable 312 is disconnected, the survival signal sent by the computer 102 (B) is not received by the computer 101 (A) (step ST172). The computer 101 therefore sends NAK to the computer 103 (A') (step ST173). When the computer 101 (A) receives the survival signal ACK from the computer 103 (A'), it applies case 5004 shown in FIG. 50 and judges that the cable 312 has failed (steps ST153 to ST155, ST157, ST159, ST177). This judgment rests on the assumption that the probability of two computers failing at the same time is very low. The computer 103 (A') receives the survival signal from the computer 102 (B) and receives the survival signal NAK from the computer 101 (A). This corresponds to case 5002 in FIG. 50, in which the failure location cannot be identified (steps ST154 to ST156, ST160, ST177).

[0190] Failure example 2 is the case where the cable 311 connecting the communication interface 211 of the computer 101 to the LAN 401 is disconnected, as shown in FIG. 52.

[0191] Because the cable 311 is disconnected, the computer 101 (A) cannot receive any survival signal. Once the computer 101 (A) finds that it can receive neither survival signal, it applies case 5006 in FIG. 50 and judges that the cable 311 has failed (steps ST168 to ST171 and ST177, or steps ST172 to ST177). The computer 103 (A') receives the survival signal from the computer 102 (B) and receives the survival signal NAK from the computer 101 (A). This corresponds to case 5002 in FIG. 50, in which the failure location cannot be identified (steps ST154 to ST156, ST160, ST177).

【０１９２】As Failure Example 3, consider the case shown in FIG. 53, in which the cable 321 connecting the communication interface 221 of the computer 101 to the LAN 402 is disconnected.

【０１９３】When the cable 321 is disconnected, the survival signal transmitted by the computer 101 (A) is not received by the computer 103 (A') (step ST168). On receiving the survival signal ACK from the computer 102 (B), the computer 103 (A') applies case 5003 in FIG. 50 and determines that the cable 321 has failed (steps ST161 to ST166, ST177). The computer 101 (A) cannot detect the failure, because every survival signal it receives is normal (steps ST153, ST154, ST177, or steps ST161 to ST163, ST177).

【０１９４】As Failure Example 4, consider the case shown in FIG. 54, in which the computer 102 (B) fails.

【０１９５】Because the computer 102 (B) has failed, neither the computer 101 (A) nor the computer 103 (A') can receive the survival signal from the computer 102 (B) (step ST172). The computers 101 (A) and 103 (A') each transmit a survival signal carrying a NAK to the other, that is, to the computers 103 (A') and 101 (A) respectively (step ST173). On receiving the NAK survival signal, the computers 101 (A) and 103 (A') apply case 5005 in FIG. 50 and determine that the computer 102 (B) has failed (steps ST153 to ST159, ST177).
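The four failure examples above can be summarized as a small decision table. The following Python sketch is reconstructed only from those examples, not from FIG. 50 itself, so it should be read as an illustration under assumptions: the names `Peer` and `diagnose` are hypothetical, the "normal" outcome for the combination "heard B, peer reports ACK" is an assumption (the text does not name that case), and each "cable" verdict simply repeats the attribution made in the examples.

```python
from enum import Enum


class Peer(Enum):
    """What the observer learned from the other monitoring computer this period."""
    ACK = "ACK"          # peer's signal arrived and says it heard B
    NAK = "NAK"          # peer's signal arrived and says it did not hear B
    MISSING = "MISSING"  # peer's signal did not arrive in time


def diagnose(heard_b: bool, peer: Peer) -> str:
    """Return the observer's conclusion for one observation period."""
    if heard_b:
        if peer is Peer.ACK:
            return "normal"  # assumed; this combination is not named in the text
        if peer is Peer.NAK:
            return "undetermined (case 5002)"
        return "peer's cable failed (case 5003)"
    if peer is Peer.ACK:
        return "B's cable failed (case 5004)"
    if peer is Peer.NAK:
        return "B failed (case 5005)"
    return "observer's own cable failed (case 5006)"
```

For Failure Example 1, the computer 101 (A) would call `diagnose(False, Peer.ACK)` and the computer 103 (A') would call `diagnose(True, Peer.NAK)`, which reproduces the asymmetry described there: one observer localizes the fault while the other cannot.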

【０１９６】Steps ST151 and ST152 correspond to the survival signal transmission step; steps ST161, ST162, ST173, and ST174 to the first and second survival signal response steps; steps ST154 to ST158, ST160, ST169, and ST170 to the first failure detection step; steps ST163 to ST165, ST167, ST174, and ST175 to the second failure detection step; steps ST159, ST166, ST171, and ST176 to the failure notification step; and steps ST159, ST160, ST166, ST167, ST171, and ST176 to the reconfiguration step. The computer A corresponds to the second computer, the computer B to the first computer, and the computer A' to the third computer.

【０１９７】When a computer discovers a failure of a computer, communication interface, or cable, it performs reconfiguration by replacing the failed computer with a computer that has the same role in another group. The substitute computer then belongs to two groups at the same time. As an example, FIG. 55 shows the computer 105 taking the place of the computer 102 after the computer 102 has failed. In this way, failure detection capability comparable to that before the failure is maintained even after a failure occurs. The result of the reconfiguration is reported to all computers by a suitable broadcast communication. Because the distributed computer system of this embodiment distributes the failure detection function across the computers, the failure of any particular computer does not cause the failure detection function to be lost. Likewise, the traffic needed to detect failures in normal operation and to identify the failure location when a failure occurs can be kept to the minimum order.

【０１９８】Embodiment 11. The distributed computer system according to this embodiment has the same physical configuration and virtual configuration as Embodiment 7. Its failure detection operation is also almost the same as that of Embodiment 7.

【０１９９】Next, the operation is described, covering only the parts that differ from Embodiment 7. In this distributed computer system, each computer transmits a survival signal to its right-hand neighbor over the LAN 401 periodically, driven by a timer that the computer itself manages. When a computer receives that survival signal from its left-hand neighbor, it immediately transmits a survival signal in which an ACK is written. If it does not receive the survival signal from its left-hand neighbor within the prescribed time, it transmits a survival signal in which a NAK is written.
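The ACK/NAK response timing described above can be sketched as a small event-driven object. This is only an illustration under assumptions: the class name `Responder`, its method names, and the choice to restart the waiting period after each ACK or NAK are hypothetical and are not taken from the specification.

```python
class Responder:
    """Decides what to write into the reply survival signal."""

    def __init__(self, timeout, start=0):
        self.timeout = timeout            # prescribed waiting time
        self.deadline = start + timeout   # latest time the next signal may arrive

    def on_receive(self, now):
        """Left neighbor's signal arrived: reply immediately with ACK."""
        self.deadline = now + self.timeout
        return "ACK"

    def on_tick(self, now):
        """Called by the local timer: reply with NAK once the deadline has passed."""
        if now >= self.deadline:
            self.deadline = now + self.timeout  # assumed: start a new waiting period
            return "NAK"
        return None  # still within the prescribed time, nothing to send
```

The reply is driven by reception rather than by the local period, which is the difference from Embodiment 7 that the paragraph above points out.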

【０２００】The failure detection method of this embodiment has the same advantages as the method of setting survival signal transmission and reception timing described in Embodiment 7. This embodiment corresponds to the inventions of claims 7, 12, 16, and 21 to 24.

【０２０１】Embodiment 12. FIG. 56 is a block diagram showing the configuration of a distributed computer system according to another embodiment of the present invention. In the figure, 101 and 102 are computers; 211, 212, 221, 222, 231, 232, 241, and 242 are communication interfaces; 311, 312, 321, 322, 331, 332, 341, and 342 are cables; and 401 to 404 are LANs.

【０２０２】In the distributed computer system according to this embodiment, each computer is connected to four LANs 401 to 404. The four LANs are grouped into two pairs, the LAN 401 with the LAN 402 and the LAN 403 with the LAN 404, and failure detection is performed on each pair by the same method as in Embodiment 6.

【０２０３】This allows the inventions of claims 1 to 16 to be applied to distributed computer systems in general that have an even number of LANs.

【０２０４】Embodiment 13. FIG. 57 is a block diagram showing the configuration of a distributed computer system according to another embodiment of the present invention. In the figure, the same reference numerals as in FIG. 56 denote the same or corresponding components.

【０２０５】In the distributed computer system according to this embodiment, each computer is connected to three LANs 401 to 403. Of the three LANs, the LANs 401 and 402 are used for failure detection by the same method as in Embodiment 6, and the remaining LAN 403 is used for failure detection by the same method as in Embodiment 1.

【０２０６】This allows the inventions of claims 1 to 16 to be applied to distributed computer systems in general that have an odd number of LANs.
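The LAN grouping used in Embodiments 12 and 13 can be sketched as follows. The function name `plan_lans` and its return shape are hypothetical, and pairing the LANs in list order is an assumption made for illustration; the text only fixes the pairs 401/402 and 403/404 and the leftover LAN 403 of the three-LAN case.

```python
def plan_lans(lans):
    """Pair LANs in order; an odd LAN left over is monitored on its own.

    Returns (pairs, single): each pair would run the two-LAN method
    (as in Embodiment 6), and `single`, if not None, would run the
    one-LAN method (as in Embodiment 1).
    """
    pairs = [(lans[i], lans[i + 1]) for i in range(0, len(lans) - 1, 2)]
    single = lans[-1] if len(lans) % 2 == 1 else None
    return pairs, single
```

With four LANs this yields the two pairs of Embodiment 12 and no leftover; with three LANs it yields one pair plus the LAN 403 handled separately, as in Embodiment 13.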

【０２０７】Embodiment 14. FIG. 58 is a block diagram showing the configuration of a distributed computer system according to another embodiment of the present invention. In the figure, the same reference numerals as in FIG. 56 denote the same or corresponding components.

【０２０８】In the distributed computer system according to this embodiment, each computer is connected to three LANs 401 to 403. Of the three LANs, the LAN 401 is paired with the LAN 402, and the LAN 403 is also paired with the LAN 402. Failure detection is performed on each pair by the same method as in Embodiment 6. On the LAN 402, however, the two failure detection survival signals used by the two pairs are merged into one, which reduces how often survival signals are transmitted.

【０２０９】This allows the inventions of claims 1 to 16 to be applied to distributed computer systems in general that have an odd number of LANs.
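The shared-LAN arrangement of Embodiment 14 can be sketched in the same way. Both function names are hypothetical, and choosing the next-to-last LAN as the shared partner is an assumption that happens to reproduce the pairing 401/402 and 403/402 given above; the sketch only shows which pairs a single merged survival signal on each LAN would have to serve.

```python
from collections import defaultdict


def plan_shared_pairs(lans):
    """Pair LANs in order; with an odd count, pair the last LAN with its predecessor."""
    pairs = [(lans[i], lans[i + 1]) for i in range(0, len(lans) - 1, 2)]
    if len(lans) % 2 == 1 and len(lans) >= 3:
        pairs.append((lans[-1], lans[-2]))  # lans[-2] now belongs to two pairs
    return pairs


def pairs_served_by_lan(pairs):
    """Map each LAN to the pairs whose status one merged signal on it would carry."""
    served = defaultdict(list)
    for pair in pairs:
        for lan in pair:
            served[lan].append(pair)
    return dict(served)
```

For the three LANs 401 to 403, the LAN 402 is the only one serving two pairs, so it is the only LAN on which merging saves a transmission.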

【０２１０】[0210]

[Effects of the Invention] As described above, the invention of claim 1 executes: a virtual arrangement step of arranging a plurality of computers on a virtual ring; a survival signal transmission step in which each computer periodically transmits a survival signal indicating its own survival to the computer adjacent to it in a specific direction on the virtual ring; a failure detection step in which each computer checks whether it has periodically received the survival signal transmitted from the adjacent computer on the virtual ring and, if it has not, determines that an abnormality has occurred on the communication path used to transmit the survival signal and identifies the failure location; and a failure notification step in which each computer reports failure information on the failure it has found to every computer with which it can communicate. This minimizes the number of survival signals exchanged in normal operation to find failures of the computers, communication interfaces, or cables that make up the distributed computer system. In addition, because each computer notifies the other computers of failures found within its own area of responsibility, every computer can obtain operating information for the whole system even when a criss-cross (tasukigake) failure occurs.
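The ring scheme summarized for claim 1 can be illustrated with a short simulation. This is a minimal sketch under assumptions: the function names are hypothetical, the ring is represented as a Python list whose order gives the transmission direction, and only the "predecessor was silent" check is modeled, not the localization of the failure on the communication path.

```python
def ring_successor(ring, node):
    """Computer that `node` sends its survival signal to."""
    return ring[(ring.index(node) + 1) % len(ring)]


def silent_predecessors(ring, received):
    """Return {observer: predecessor} for every observer that heard nothing.

    `received` maps each computer to the set of senders whose survival
    signal reached it during the current period.
    """
    suspects = {}
    for i, node in enumerate(ring):
        predecessor = ring[i - 1]  # index -1 wraps around to the last computer
        if predecessor not in received.get(node, set()):
            suspects[node] = predecessor
    return suspects
```

Each computer sends one signal per period, so a ring of N computers exchanges N survival signals per period in normal operation.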

【０２１１】According to the invention of claim 2, in the survival signal transmission step in which a computer periodically transmits the survival signal, the destination alternates at each periodic timing between the right-hand and the left-hand neighbor on the virtual ring. Failures of the computers, communication interfaces, or cables that make up the distributed computer system can therefore be found with minimal traffic even when a failure occurs, and the number of survival signals exchanged both in normal operation and when an abnormality occurs can be minimized.
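The alternating destination of claim 2 can be sketched as a one-line rule. The function name `destination` is hypothetical, and the convention that even-numbered timings go to the right-hand neighbor and odd-numbered timings to the left-hand neighbor is an assumption; the text only states that the two directions alternate.

```python
def destination(ring, node, tick):
    """Neighbor that `node` sends to at periodic timing number `tick`."""
    i = ring.index(node)
    step = 1 if tick % 2 == 0 else -1  # assumed: even ticks right, odd ticks left
    return ring[(i + step) % len(ring)]
```

Each computer still sends only one signal per timing, yet over two consecutive timings it is observed by both of its neighbors.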

【０２１２】According to the invention of claim 3, in the survival signal transmission step, each computer writes into the survival signal it transmits whether it received the survival signal it was due to receive within the predetermined time. Failures of the computers, communication interfaces, or cables that make up the distributed computer system can therefore be detected while exchanging only a minimal number of survival signals in normal operation, and the number of survival signals exchanged both in normal operation and when an abnormality occurs can be minimized.

【０２１３】The invention of claim 4 executes: a virtual arrangement step of arranging the computers on a virtual tree in which each computer is a node and each node has two or more child nodes; a survival signal transmission step in which each computer periodically transmits a survival signal to the parent computer located at its parent node on the virtual tree; a failure detection step in which each computer checks whether it has received the survival signals from the child computers located at its child nodes on the virtual tree and identifies the failure location by combining the results; and a failure notification step in which each computer reports information on the failure it has found to every computer with which it can communicate. Failures of the computers, communication interfaces, or cables that make up the distributed computer system can therefore be detected with minimal traffic even when a failure occurs, and the number of survival signals exchanged both in normal operation and when an abnormality occurs can be minimized.
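The tree scheme summarized for claim 4 can be sketched with an array-backed tree. This is an illustration under assumptions: the function names are hypothetical, computers are identified by their index in a complete tree with a fixed fan-out, and only the collection of silent children per parent is shown, not the rule for combining those results into a failure location.

```python
def parent_index(i, fanout=2):
    """Index of the parent of node i in an array-backed tree (root is 0)."""
    return None if i == 0 else (i - 1) // fanout


def silent_children(n, heard, fanout=2):
    """Return {parent: [children that were not heard]} for a tree of n nodes.

    `heard` is the set of node indices whose survival signal reached
    their parent during the current period.
    """
    missing = {}
    for child in range(1, n):
        if child not in heard:
            missing.setdefault(parent_index(child, fanout), []).append(child)
    return missing
```

Every computer except the root sends one signal per period, so a tree of n computers exchanges n - 1 survival signals per period in normal operation.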

【０２１４】The invention of claim 5 executes: a virtual arrangement step of dividing the computers into M groups, designating one computer in each group as its representative computer, and arranging the M representative computers on a virtual ring; a first survival signal transmission step in which each computer other than a representative computer periodically transmits a survival signal to the representative computer of the group to which it belongs; a second survival signal transmission step in which each representative computer periodically transmits a survival signal to the computer adjacent to it in a specific direction on the virtual ring; a failure detection step in which each representative computer checks whether it has received the survival signals addressed to it and identifies the failure location by combining the results; and a failure notification step in which each representative computer reports information on the failure it has found to every computer with which it can communicate. Failures of the computers, communication interfaces, or cables that make up the distributed computer system can therefore be detected with minimal traffic even when a failure occurs, and the number of survival signals exchanged both in normal operation and when an abnormality occurs can be minimized.
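The grouped arrangement summarized for claim 5 can be sketched as a routing table. The function name `build_destinations` is hypothetical, and treating the first member of each group as its representative is an assumption made only to keep the example short.

```python
def build_destinations(groups):
    """Return {sender: destination} for one survival-signal period.

    `groups` is a list of lists of computer identifiers; the first member
    of each group is assumed to be that group's representative.
    """
    representatives = [group[0] for group in groups]
    destinations = {}
    for gi, group in enumerate(groups):
        for member in group[1:]:
            destinations[member] = group[0]  # member reports to its representative
        # the representative reports to the next representative on the ring
        destinations[group[0]] = representatives[(gi + 1) % len(groups)]
    return destinations
```

The table has exactly one entry per computer, which reflects the claim that each computer sends a single survival signal per period.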

【０２１５】The invention of claim 6 executes: a virtual arrangement step of arranging a plurality of computers on a virtual ring; a survival signal transmission step in which, with the computers classified as odd-numbered or even-numbered according to their order in a specific direction from a specific computer on the virtual ring, each odd-numbered computer periodically transmits a survival signal over the first LAN to the adjacent computer on the virtual ring and each even-numbered computer periodically transmits a survival signal over the second LAN to the adjacent computer on the virtual ring; a failure detection step in which each computer checks whether it has received the survival signal addressed to it and identifies the failure location by combining the results; and a failure notification step in which each computer reports information on the failure it has found to every computer with which it can communicate. Failures of the computers, communication interfaces, or cables that make up the distributed computer system can therefore be detected while exchanging only a minimal number of survival signals in normal operation, and the number of survival signals exchanged to find failures can be minimized.
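The odd/even LAN assignment summarized for claim 6 can be sketched as follows. The function names are hypothetical, the default LAN numbers 401 and 402 are borrowed from the embodiments only as illustrative labels, and counting positions from 1 at the first element of the list is an assumption about where the "specific computer" sits.

```python
def lan_for(ring, node, first_lan=401, second_lan=402):
    """LAN on which `node` sends: odd positions use the first LAN, even the second."""
    position = ring.index(node) + 1  # 1-based order from the reference computer
    return first_lan if position % 2 == 1 else second_lan


def send_plan(ring):
    """List of (sender, receiver, lan) tuples for one period."""
    return [
        (node, ring[(i + 1) % len(ring)], lan_for(ring, node))
        for i, node in enumerate(ring)
    ]
```

Adjacent computers on the ring send on different LANs, so both LANs are exercised in every period without any computer sending more than one signal.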

【０２１６】The invention of claim 7 executes: a virtual arrangement step of arranging a plurality of computers on a virtual ring; a survival signal transmission step in which each computer periodically transmits a survival signal over the first LAN to the computer adjacent to it in a specific direction on the virtual ring, and also periodically transmits a survival signal over the second LAN to the computer adjacent to it in the opposite direction; a survival signal response step in which each computer checks whether it has received the survival signal transmitted from an adjacent computer and writes the result, as a response to that survival signal, into the survival signal it transmits to that adjacent computer; a failure detection step in which each computer identifies the failure location by combining the presence or absence of survival signals from its two neighbors on the virtual ring with the contents of their responses; and a failure notification step in which each computer reports information on the failure it has found to every computer with which it can communicate. The number of survival signals exchanged in normal operation and when an abnormality occurs can therefore be minimized, and the failure of one computer can be found by a plurality of computers near the failed computer, so that the failure is found with a shorter delay after it occurs.

【０２１７】According to the invention of claim 8, in the survival signal transmission step in which each computer transmits a periodic survival signal to an adjacent computer, the computer writes into that signal both its response to the survival signal transmitted from that adjacent computer and a copy of the response received from its other adjacent computer. The number of survival signals exchanged in normal operation and when an abnormality occurs can therefore be minimized, and the failure of one computer can be found by a plurality of computers near the failed computer, so that the failure is found with a shorter delay after it occurs.

【０２１８】The invention of claim 9 executes: a virtual arrangement step of arranging a plurality of computers on a virtual ring; a survival signal transmission step in which, with the computers classified as odd-numbered or even-numbered according to their order in a specific direction from a specific computer on the virtual ring, each odd-numbered computer periodically transmits a survival signal over the first LAN to both of its neighbors on the virtual ring and each even-numbered computer periodically transmits a survival signal over the second LAN to both of its neighbors on the virtual ring; a failure detection step in which each computer checks whether it has received, over the first or second LAN, the survival signals transmitted from both neighbors and identifies the failure location by combining the results; and a failure notification step in which each computer reports information on the failure it has found to every computer with which it can communicate. The number of survival signals exchanged in normal operation and when an abnormality occurs can therefore be minimized, and the failure of one computer can be found by a plurality of computers near the failed computer, so that the failure is found with a shorter delay after it occurs.

【０２１９】The invention of claim 10 executes: a virtual arrangement step of arranging a plurality of computers on a virtual ring; a survival signal transmission step in which the computers are divided into a plurality of groups of three and, in each group, the first computer periodically transmits a survival signal to the second computer over the first LAN and to the third computer over the second LAN; a first survival signal response step in which, in each group, the second computer checks whether it has received the survival signal from the first computer and writes the result into the survival signal it periodically transmits to the third computer over the second LAN; a second survival signal response step in which the third computer checks whether it has received the survival signal from the first computer and writes the result into the survival signal it periodically transmits to the second computer over the first LAN; a first failure detection step in which the second computer examines the presence and contents of the survival signals transmitted from the first and third computers and identifies the failure location by combining the results; a second failure detection step in which the third computer examines the presence and contents of the survival signals transmitted from the first and second computers and identifies the failure location by combining the results; and a failure notification step in which each computer reports information on the failure it has found to every computer with which it can communicate. The number of survival signals exchanged in normal operation and when an abnormality occurs can therefore be minimized, and the failure of one computer can be found by a plurality of computers near the failed computer, so that the failure is found with a shorter delay after it occurs.

【０２２０】According to the invention of claim 11, a rearrangement step of setting the virtual arrangement of the computers anew when a failure occurs is additionally executed. The destination of each computer is therefore changed when a failure occurs, when a failed computer is restored, or when a new computer is added, so that the same failure detection capability as before is maintained even when the system configuration changes.

【０２２１】According to the invention of claim 12, a rearrangement step of setting the virtual arrangement of the computers anew when a failure occurs is additionally executed. The destination of each computer is therefore changed when a failure occurs, when a failed computer is restored, or when a new computer is added, so that the same failure detection capability as before is maintained even when the system configuration changes.

【０２２２】According to the invention of claim 13, a rearrangement step of setting the virtual arrangement of the computers anew when a failure occurs is additionally executed. The destination of each computer is therefore changed when a failure occurs, when a failed computer is restored, or when a new computer is added, so that the same failure detection capability as before is maintained even when the system configuration changes.

【０２２３】According to the invention of claim 14, a rearrangement step of setting the virtual arrangement of the computers anew when a failure occurs is additionally executed. The destination of each computer is therefore changed when a failure occurs, when a failed computer is restored, or when a new computer is added, so that the same failure detection capability as before is maintained even when the system configuration changes.

【０２２４】According to the invention of claim 15, in the failure notification step of notifying adjacent computers of detected failure information, the failure is reported by appending the failure information to the survival signal before transmitting it. No extra signal needs to be sent for notification, so the load on the LAN can be reduced.

【０２２５】According to the invention of claim 16, in the failure notification step of notifying adjacent computers of detected failure information, the failure is reported by appending the failure information to the survival signal before transmitting it. No extra signal needs to be sent for notification, so the load on the LAN can be reduced.
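The piggybacking described for claims 15 and 16 can be sketched as follows. The `SurvivalSignal` record and the `make_signal` helper are hypothetical; the specification does not define a message layout, so the field names here are assumptions chosen only to show that fault reports ride on the periodic signal instead of being sent separately.

```python
from dataclasses import dataclass, field


@dataclass
class SurvivalSignal:
    sender: int
    faults: list = field(default_factory=list)  # fault reports carried along


def make_signal(sender, pending_faults):
    """Build the next periodic signal and drain pending fault reports into it."""
    signal = SurvivalSignal(sender, list(pending_faults))
    pending_faults.clear()  # each report is sent once, with the next signal
    return signal
```

When nothing has been detected the signal is sent with an empty list, so the number of transmissions per period is the same with or without a fault to report.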

【０２２６】According to the invention of claim 17, in a distributed system made up of a plurality of computers connected by 2N LANs, the LANs are grouped into pairs and one of the failure detection methods of claims 6 to 10, 12, 14, and 16 is used for each pair. The above inventions can therefore be applied to a system with any number of LANs.

【０２２７】According to the invention of claim 18, in a distributed system made up of a plurality of computers connected by (2N+1) LANs, the LANs are grouped into pairs, one of the failure detection methods of claims 6 to 10, 12, 14, and 16 is used for each pair, and one of the failure detection methods of claims 1 to 5, 11, 13, and 15 is used for the one remaining LAN. The above inventions can therefore be applied to a system with any number of LANs.

【０２２８】According to the invention of claim 19, in a distributed system made up of a plurality of computers connected by (2N+1) LANs, the LANs are grouped into pairs, one further pair is formed from the (2N+1)th LAN and any one of the other LANs, and one of the failure detection methods of claims 6 to 10, 12, 14, and 16 is used for each pair. The above inventions can therefore be applied to a system with any number of LANs.

[0229] According to the invention of claim 20, on a LAN shared by two pairs, the survival signals transmitted in the respective pairs are combined into one, so that the number of survival signals exchanged can be reduced.
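One way to picture the combining of survival signals on a shared LAN is the sketch below. The tuple layout `(lan, sender, receiver, pair_id)` and the function name `merge_survival_signals` are hypothetical; the text does not specify how the combined signal is encoded.

```python
from collections import defaultdict


def merge_survival_signals(signals):
    """Combine signals that travel on the same LAN between the same computers.

    signals: iterable of (lan, sender, receiver, pair_id) tuples.
    Returns one (lan, sender, receiver, pair_ids) tuple per distinct route,
    where pair_ids lists every LAN pair the merged signal serves.
    """
    merged = defaultdict(list)
    for lan, sender, receiver, pair_id in signals:
        merged[(lan, sender, receiver)].append(pair_id)
    return [(lan, s, r, tuple(ids)) for (lan, s, r), ids in merged.items()]


signals = [
    (401, 101, 102, "A"),  # pair A uses LAN 401
    (401, 101, 102, "B"),  # pair B shares LAN 401
    (402, 102, 101, "A"),
]
print(merge_survival_signals(signals))
```

Here three signals are reduced to two, because the two signals on LAN 401 follow the same route.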

[0230] According to the invention of claim 21, in the virtual arrangement step, computers that communicate with each other frequently are placed close to each other in the virtual arrangement, so that the influence of a failure on normal operations can be reduced.

[0231] According to the invention of claim 22, in the virtual arrangement step, highly reliable computers and less reliable computers are arranged alternately in the virtual arrangement, so that failures of the computers for which failure detection is most needed can be detected reliably.
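The alternating arrangement can be sketched as follows. The function name `alternate_on_ring` and the handling of unequal group sizes (surplus computers simply end up next to each other) are assumptions made for illustration.

```python
from itertools import zip_longest


def alternate_on_ring(reliable, unreliable):
    """Return a ring order that alternates between the two groups.

    Assumes None is never used as a computer identifier. If one group is
    larger, its surplus members are placed consecutively at the end.
    """
    ring = []
    for r, u in zip_longest(reliable, unreliable):
        if r is not None:
            ring.append(r)
        if u is not None:
            ring.append(u)
    return ring


print(alternate_on_ring([101, 103], [102, 104]))
```

The returned list is read as a ring, so its last element is adjacent to its first.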

[0232] According to the invention of claim 23, in the virtual arrangement step, highly reliable computers and functionally important computers are arranged alternately in the virtual arrangement, so that failures of the computers for which failure detection is most needed can be detected reliably.

[0233] According to the invention of claim 24, for some or all of the survival signals, the transmission time or the reception deadline is set relative to the time at which each computer receives a specific survival signal, so that the relationship between the transmission and reception times of the survival signals can be set freely to match the required failure detection characteristics.
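As a rough illustration of setting times relative to a received signal, the sketch below derives both a transmission time and a reception deadline from the arrival time of a reference survival signal. The names `SignalSchedule`, `schedule_from_reference`, `send_delay`, and `receive_window` are hypothetical and do not appear in the text.

```python
from dataclasses import dataclass


@dataclass
class SignalSchedule:
    send_at: float      # when this computer sends its own survival signal
    receive_by: float   # deadline for the next expected survival signal


def schedule_from_reference(reference_received_at, send_delay, receive_window):
    """Set both times relative to the arrival of a reference signal."""
    return SignalSchedule(
        send_at=reference_received_at + send_delay,
        receive_by=reference_received_at + receive_window,
    )


s = schedule_from_reference(100.0, send_delay=0.5, receive_window=3.0)
print(s.send_at, s.receive_by)
```

Choosing a small `send_delay` makes signals propagate quickly around the arrangement, while `receive_window` sets how long a computer waits before suspecting a failure.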

[Brief description of drawings]

FIG. 1 is a block diagram showing the physical configuration of a distributed computer system according to an embodiment of the present invention.

FIG. 2 is a diagram showing the computers arranged on a virtual ring in the distributed computer system shown in FIG. 1.

FIG. 3 is a flowchart for explaining the operation of the failure detection method in the distributed computer system shown in FIG. 1.

FIG. 4 is a block diagram showing how the computers transmit and receive survival signals to carry out the failure detection method in the distributed computer system shown in FIG. 1.

FIG. 5 is a diagram showing a failure example for explaining the operation of the failure detection method in the distributed computer system shown in FIG. 1.

FIG. 6 is a diagram showing how the computers transmit and receive survival signals after reconfiguration has been performed to remove the influence of a failure, in the failure detection method for the distributed computer system shown in FIG. 1.

FIG. 7 is a flowchart for explaining the operation of a failure detection method for a distributed computer system according to another embodiment of the present invention.

FIG. 8 is a diagram showing how the computers transmit and receive survival signals to carry out the failure detection method for a distributed computer system shown in FIG. 7.

FIG. 9 is a diagram showing a failure example for explaining the operation of the failure detection method for a distributed computer system shown in FIG. 7.

FIG. 10 is a diagram showing the virtual ring after reconfiguration has been performed to remove the influence of a failure, in the failure detection method for a distributed computer system shown in FIG. 7.

FIG. 11 is a flowchart for explaining the operation of a failure detection method for a distributed computer system according to another embodiment of the present invention.

FIG. 12 is a diagram showing how the computers transmit and receive survival signals to carry out the failure detection method for a distributed computer system shown in FIG. 11.

FIG. 13 is a diagram showing a failure example for explaining the operation of the failure detection method for a distributed computer system shown in FIG. 11.

FIG. 14 is a flowchart for explaining the operation of a failure detection method for a distributed computer system according to another embodiment of the present invention.

FIG. 15 is a block diagram showing the computers virtually arranged on a tree to carry out the failure detection method for a distributed computer system shown in FIG. 14.

FIG. 16 is a diagram showing how the computers transmit and receive survival signals to carry out the failure detection method for a distributed computer system shown in FIG. 14.

FIG. 17 is a diagram showing a failure example for explaining the operation of the failure detection method for a distributed computer system shown in FIG. 14.

FIG. 18 is a diagram showing the configuration of the virtual tree after reconfiguration has been performed to remove the influence of a failure, in the failure detection method for a distributed computer system shown in FIG. 14.

FIG. 19 is a flowchart for explaining the operation of a failure detection method for a distributed computer system according to another embodiment of the present invention.

FIG. 20 is a block diagram showing the computers virtually arranged in a chain to carry out the failure detection method for a distributed computer system shown in FIG. 19.

FIG. 21 is a diagram showing how the computers transmit and receive survival signals to carry out the failure detection method for a distributed computer system shown in FIG. 19.

FIG. 22 is a diagram showing a failure example for explaining the operation of the failure detection method for a distributed computer system shown in FIG. 19.

FIG. 23 is a diagram showing the configuration of the virtual tree after reconfiguration has been performed to remove the influence of the failure shown in FIG. 22, in the failure detection method for a distributed computer system shown in FIG. 19.

FIG. 24 is a diagram showing a failure example for explaining the operation of the failure detection method for a distributed computer system shown in FIG. 19.

FIG. 25 is a diagram showing the configuration of the virtual tree after reconfiguration has been performed to remove the influence of the failure shown in FIG. 24, in the failure detection method for a distributed computer system shown in FIG. 19.

FIG. 26 is a flowchart for explaining the operation of a failure detection method for a distributed computer system according to another embodiment of the present invention.

FIG. 27 is a configuration diagram showing the physical configuration of a distributed computer system to which the failure detection method shown in FIG. 26 is applied.

FIG. 28 is a diagram showing how the computers transmit and receive survival signals to carry out the failure detection method for a distributed computer system shown in FIG. 26.

FIG. 29 is a block diagram showing a failure example for explaining the operation of the failure detection method for a distributed computer system shown in FIG. 26.

FIG. 30 is a diagram showing how the computers transmit and receive survival signals after reconfiguration has been performed to remove the influence of the failure shown in FIG. 29, in the failure detection method for a distributed computer system shown in FIG. 26.

FIG. 31 is a diagram showing a failure example for explaining the operation of the failure detection method for a distributed computer system shown in FIG. 26.

FIG. 32 is a diagram showing how the computers transmit and receive survival signals after reconfiguration has been performed to remove the influence of the failure shown in FIG. 31, in the failure detection method for a distributed computer system shown in FIG. 26.

FIG. 33 is a flowchart for explaining the operation of a failure detection method for a distributed computer system according to another embodiment of the present invention.

FIG. 34 is a diagram showing how the computers transmit and receive survival signals to carry out the failure detection method for a distributed computer system shown in FIG. 33.

FIG. 35 is a table showing the relationship between the contents of a survival signal received by a computer and the range in which a failure may exist, in the failure detection method for a distributed computer system shown in FIG. 33.

FIG. 36 is a table showing the relationship between the contents of the survival signals that a computer receives from the computers on both sides of it and the range in which a failure may exist, in the failure detection method for a distributed computer system shown in FIG. 33.

FIG. 37 is a diagram showing a failure example for explaining the operation of the failure detection method for a distributed computer system shown in FIG. 33.

FIG. 38 is a diagram showing how the computers transmit and receive survival signals after reconfiguration has been performed to remove the influence of the failure of FIG. 37, in the failure detection method for a distributed computer system shown in FIG. 33.

FIG. 39 is a block diagram for explaining the operation of a failure detection method for a distributed computer system according to another embodiment of the present invention.

FIG. 40 is a table showing the relationship between the contents of a survival signal that a computer receives from an adjacent computer and the range in which a failure may exist, in the failure detection method for a distributed computer system shown in FIG. 39.

FIG. 41 is a diagram showing a failure example for explaining the operation of the failure detection method for a distributed computer system shown in FIG. 39.

FIG. 42 is a flowchart for explaining the operation of a failure detection method for a distributed computer system according to another embodiment of the present invention.

FIG. 43 is a block diagram showing how the computers transmit and receive survival signals to carry out the failure detection method for a distributed computer system shown in FIG. 42.

FIG. 44 is a diagram showing a failure example for explaining the operation of the failure detection method for a distributed computer system shown in FIG. 42.

FIG. 45 is a diagram showing a failure example for explaining the operation of the failure detection method for a distributed computer system shown in FIG. 42.

FIG. 46 is a diagram showing how the computers transmit and receive survival signals after reconfiguration has been performed to remove the influence of the failure of FIG. 44, in the failure detection method for a distributed computer system shown in FIG. 42.

FIG. 47 is a diagram showing how the computers transmit and receive survival signals after reconfiguration has been performed to remove the influence of the failure of FIG. 45, in the failure detection method for a distributed computer system shown in FIG. 42.

FIG. 48 is a flowchart for explaining the operation of a failure detection method for a distributed computer system according to another embodiment of the present invention.

FIG. 49 is a diagram showing how the computers transmit and receive survival signals to carry out the failure detection method for a distributed computer system shown in FIG. 48.

FIG. 50 is a table showing the relationship between the combination of survival signals received by a computer and the range in which a failure may exist, in the failure detection method for a distributed computer system shown in FIG. 48.

FIG. 51 is a diagram showing a failure example for explaining the operation of the failure detection method for a distributed computer system shown in FIG. 48.

FIG. 52 is a diagram showing a failure example for explaining the operation of the failure detection method for a distributed computer system shown in FIG. 48.

FIG. 53 is a diagram showing a failure example for explaining the operation of the failure detection method for a distributed computer system shown in FIG. 48.

FIG. 54 is a diagram showing a failure example for explaining the operation of the failure detection method for a distributed computer system shown in FIG. 48.

FIG. 55 is a diagram showing how the computers transmit and receive survival signals after reconfiguration has been performed to remove the influence of the failure of FIG. 54, in the failure detection method for a distributed computer system shown in FIG. 48.

FIG. 56 is a diagram showing how the computers transmit and receive survival signals to carry out a failure detection method for a distributed computer system according to another embodiment of the present invention.

FIG. 57 is a block diagram showing how the computers transmit and receive survival signals to carry out a failure detection method for a distributed computer system according to another embodiment of the present invention.

FIG. 58 is a block diagram showing how the computers transmit and receive survival signals to carry out a failure detection method for a distributed computer system according to another embodiment of the present invention.

FIG. 59 is a block diagram showing an outline of a conventional failure detection method for a distributed computer system.

FIG. 60 is a diagram showing a crisscross failure in a conventional distributed computer system having multiplexed LANs.

[Explanation of symbols]

10: virtual ring; 101 to 104: computers; 401 to 404: LANs (local area networks); 1001, 1002: groups.

Claims

[Claims]

1. A failure detection method for a distributed computer system including a plurality of computers connected to each other via at least one local area network, the method comprising: a virtual arrangement step of arranging the plurality of computers on a virtual ring; a survival signal transmitting step in which each computer periodically transmits a survival signal, indicating that it is alive, to the computer adjacent to it in a specific direction on the virtual ring; a failure detection step in which each computer checks whether the survival signal transmitted from the adjacent computer on the virtual ring is received periodically and, if it is not, determines that an abnormality has occurred on the communication path used to transmit the survival signal and identifies the failure location; and a failure notification step in which each computer notifies all computers with which it can communicate of failure information on the discovered failure.

2. A failure detection method for a distributed computer system, wherein, in the survival signal transmitting step, the survival signal is transmitted to the computer adjacent on the right side or on the left side on the virtual ring, switching alternately between the two at fixed timings.

3. The failure detection method for a distributed computer system according to claim 2, wherein, in the survival signal transmitting step, each computer writes into the survival signal it transmits whether or not it received, within a predetermined time, the survival signal it was due to receive.

4. A failure detection method for a distributed computer system including a plurality of computers connected to each other via at least one local area network, the method comprising: a virtual arrangement step of arranging the computers, each as a node, on a virtual tree in which each node has two or more child nodes; a survival signal transmitting step in which each computer periodically transmits a survival signal to the parent computer located at its parent node on the virtual tree; a failure detection step in which each computer checks whether it has received the survival signals from the child computers located at its child nodes on the virtual tree and identifies the failure location by combining the results; and a failure notification step in which each computer notifies all computers with which it can communicate of information on the discovered failure.

5. A failure detection method for a distributed computer system including a plurality of computers connected to each other via at least one local area network, the method comprising: a step of dividing the computers into M groups, taking one computer in each group as a representative computer, and arranging the M representative computers on a virtual ring; a first survival signal transmitting step in which each computer periodically transmits a survival signal to the representative computer of the group to which it belongs; a second survival signal transmitting step in which each representative computer periodically transmits a survival signal to the computer adjacent to it in a specific direction on the virtual ring; a failure detection step in which each representative computer checks whether it has received the survival signals transmitted to it and identifies the failure location by combining the results; and a failure notification step in which each representative computer notifies all computers with which it can communicate of information on the discovered failure.

6. A failure detection method for a distributed computer system including a plurality of computers connected to each other via first and second local area networks, the method comprising: a virtual arrangement step of arranging the plurality of computers on a virtual ring; a survival signal transmitting step in which, with the computers classified as odd-numbered or even-numbered according to their order in a specific direction from a specific computer on the virtual ring, each odd-numbered computer periodically transmits a survival signal to an adjacent computer on the virtual ring via the first local area network and each even-numbered computer periodically transmits a survival signal to an adjacent computer on the virtual ring via the second local area network; a failure detection step in which each computer checks whether it has received the survival signal transmitted to it and identifies the failure location based on the result; and a failure notification step in which each computer notifies all computers with which it can communicate of information on the discovered failure.

7. A failure detection method for a distributed computer system including a plurality of computers connected to each other via first and second local area networks, the method comprising: a virtual arrangement step of arranging the plurality of computers on a virtual ring; a survival signal transmitting step in which each computer periodically transmits a survival signal via the first local area network to the computer adjacent to it in a specific direction on the virtual ring, and periodically transmits a survival signal via the second local area network to the computer adjacent to it in the direction opposite to the specific direction; a survival signal response step in which each computer checks whether it has received the survival signal transmitted from an adjacent computer and writes the result, as a response to that survival signal, into the survival signal it transmits to that adjacent computer; a failure detection step in which each computer identifies the failure location by combining the presence or absence of survival signals from both adjacent computers on the virtual ring with the contents of the responses; and a failure notification step in which each computer notifies all computers with which it can communicate of information on the discovered failure.

8. The failure detection method for a distributed computer system according to claim 7, wherein, in the survival signal transmitting step, each computer writes into the survival signal it periodically transmits to an adjacent computer, together with the response to the survival signal transmitted from that adjacent computer, a copy of the response from the other adjacent computer.

9. A failure detection method for a distributed computer system including a plurality of computers connected to each other via first and second local area networks, the method comprising: a virtual arrangement step of arranging the plurality of computers on a virtual ring; a survival signal transmitting step in which, with the computers classified as odd-numbered or even-numbered according to their order in a specific direction from a specific computer on the virtual ring, each odd-numbered computer periodically transmits a survival signal via the first local area network to the computers on both sides of it on the virtual ring and each even-numbered computer periodically transmits a survival signal via the second local area network to the computers on both sides of it on the virtual ring; a failure detection step in which each computer checks whether it has received the survival signals transmitted from the computers on both sides via the first or second local area network and identifies the failure location by combining the results; and a failure notification step in which each computer notifies all computers with which it can communicate of information on the discovered failure.

10. A failure detection method for a distributed computer system including a plurality of computers connected to each other via first and second local area networks, the method comprising: a virtual arrangement step of arranging the plurality of computers on a virtual ring; a survival signal transmitting step in which the computers are divided into a plurality of groups of three and, in each group, the first computer periodically transmits a survival signal to the second computer via the first local area network and periodically transmits a survival signal to the third computer via the second local area network; a first survival signal response step in which, in each group, the second computer checks whether it has received the survival signal from the first computer and writes the result into the survival signal that it periodically transmits to the third computer via the second local area network; a second survival signal response step in which the third computer checks whether it has received the survival signal from the first computer and writes the result into the survival signal that it periodically transmits to the second computer via the first local area network; a first failure detection step in which the second computer checks the presence and contents of the survival signals transmitted from the first and third computers and identifies the failure location by combining the results; a second failure detection step in which the third computer checks the presence and contents of the survival signals transmitted from the first and second computers and identifies the failure location by combining the results; and a failure notification step in which each computer notifies all computers with which it can communicate of information on the discovered failure.

11. The failure detection method for a distributed computer system according to any one of claims 1 to 3 and claim 5, further comprising a rearrangement step of newly setting the virtual arrangement of the computers when a failure occurs.

12. The failure detection method for a distributed computer system according to claim 6, further comprising a rearrangement step of newly setting the virtual arrangement of the computers when a failure occurs.

13. The method of detecting a failure in a distributed computer system according to claim 4, further comprising executing a rearrangement step of newly setting a virtual arrangement of each computer when a failure occurs.

14. The method of detecting a failure in a distributed computer system according to claim 10, further comprising executing a rearrangement step of newly setting a virtual arrangement of each computer when a failure occurs.

15. The failure notification step of notifying the adjacent computer of the detected failure information notifies the failure by adding the failure information to the survival signal and transmitting the survival signal. A failure detection method for the distributed computer system described.

16. The failure notification step of notifying the adjacent computer of the detected failure information notifies failure by adding failure information to the survival signal and transmitting the survival signal. A failure detection method for the distributed computer system described.

17. A failure detection method for a distributed computer system consisting of a plurality of computers connected by 2N local area networks, wherein the local area networks are paired two by two and any one of the failure detection methods according to claims 6 to 10, claim 12, claim 14, and claim 16 is used for each pair.

18. A failure detection method for a distributed computer system consisting of a plurality of computers connected by (2N + 1) local area networks, wherein the local area networks are paired two by two, any one of the failure detection methods according to claims 6 to 10, claim 12, claim 14, and claim 16 is used for each pair, and any one of the failure detection methods according to claims 1 to 5, claim 11, claim 13, and claim 15 is used for the one remaining local area network.

19. A failure detection method for a distributed computer system consisting of a plurality of computers connected by (2N + 1) local area networks, wherein the local area networks are paired two by two, one further pair is formed from the (2N + 1)-th local area network and any one of the other local area networks, and any one of the failure detection methods according to claims 6 to 10, claim 12, claim 14, and claim 16 is used for each pair.

20. The failure detection method for a distributed computer system according to claim 19, wherein, on a local area network shared by two pairs, the survival signals transmitted in the respective pairs are combined into one.

21. In the virtual arranging step, computers that frequently communicate with each other are arranged so as to be close to each other in the virtual arrangement. A method for detecting a failure in a distributed computer system according to the item.

22. A computer having high reliability and a computer having low reliability are alternately arranged in a virtual arrangement in the virtual arrangement step. A method for detecting a failure in a distributed computer system according to the item.

23. In the virtual arrangement step, highly reliable computers and functionally important computers are arranged alternately in the virtual arrangement. A method for detecting a failure in a distributed computer system according to one item.

24. The failure detection method for a distributed computer system according to any one of claims 1 to 23, wherein, for some or all of the survival signals, the transmission time or the reception deadline is set relative to the time at which each computer receives a specific survival signal.
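The ring-based check recited in claim 1 can be illustrated with a minimal sketch. The function name `find_silent_neighbours`, the `last_received` mapping, and the use of a single fixed `timeout` are assumptions made for illustration; the claims only require that each computer check whether the survival signal from its adjacent computer arrives periodically.

```python
def find_silent_neighbours(ring, last_received, now, timeout):
    """Return (receiver, sender) pairs whose survival signal is overdue.

    ring: computer ids in ring order; each computer sends to the next one,
          so ring[i] expects a signal from ring[i - 1].
    last_received: maps a receiver id to the arrival time of the latest
          survival signal it received from its ring neighbour.
    """
    overdue = []
    for i, receiver in enumerate(ring):
        sender = ring[i - 1]  # index -1 wraps around to the last computer
        seen = last_received.get(receiver)
        if seen is None or now - seen > timeout:
            # The path from sender to receiver is suspected to have failed.
            overdue.append((receiver, sender))
    return overdue


ring = [101, 102, 103, 104]
last_received = {101: 9.0, 102: 9.5, 103: 4.0, 104: 9.8}
print(find_silent_neighbours(ring, last_received, now=10.0, timeout=2.0))
```

In this run only computer 103 has gone too long without a signal from computer 102, so the path between those two computers is the one to investigate and report to the other computers.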