JP2016115239A - Fault tolerant system, fault tolerant method, and program

Info

Publication number: JP2016115239A
Application number: JP2014254967A
Authority: JP
Inventors: 康竹森; Yasushi Takemori
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2014-12-17
Filing date: 2014-12-17
Publication date: 2016-06-23

Abstract

PROBLEM TO BE SOLVED: To provide a fault tolerant system that increases the probability of detecting the true fault location, thereby reducing the probability that a system in which no fault has occurred is erroneously disconnected or that a server stops.

SOLUTION: In a fault tolerant system 10 that operates two servers 11A and 11B in parallel, when a server detects, in the communication between the two servers, an error whose location cannot be identified, the server disconnects communication with the other server. The system then operates each server independently for a prescribed time and identifies the location of the error.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a fault tolerant system, a fault tolerant method, and a program therefor.

Patent Document 1 discloses a technique that can significantly reduce the possibility of a system failure in a fault tolerant computer system. When an error is detected at startup in a module of one of the two computers that make up the fault tolerant computer system, the module in which the error occurred is disconnected from that computer, and the module of the other computer that is paired with it is disconnected as well. The system can then be started using only the normal modules, with both computers in a degraded hardware configuration and with duplication maintained.

Patent Document 2 discloses a technique for realizing a fault tolerant function in which, when the specific error location cannot be identified, service is continued on the CPU (Central Processing Unit) subsystem whose operation mode is "active" and the CPU subsystem whose operation mode is "standby" is disconnected.

Patent Document 1: Japanese Patent Application Laid-Open No. 2014-041503
Patent Document 2: Japanese Patent Application Laid-Open No. 2006-178616

A fault tolerant computer system (fault tolerant system) in which two systems are multiplexed and operated in parallel, for example by lockstep operation in which multiple multiplexed CPUs (processors) execute the same operation at the same timing while always staying synchronized, has the following problem.

When a failure occurs whose location is not immediately known and that requires a system to be disconnected, a fault tolerant system performs the disconnection by, for example, keeping the system that has been assigned a higher priority in advance.

In this case, however, it may turn out after a certain time that the failure had actually occurred in the system that was kept. In the worst case, the server stops.

Patent Document 1 merely disconnects, at startup, the module of one computer in which an error has clearly occurred and the module of the other computer that is paired with it, so it can be applied only to limited cases in which the error location is easily determined.

Patent Document 2 is a technique for operating a CPU subsystem that has been set to active mode in advance, that is, the high priority system described above.

An object of the present invention is therefore to provide a means of solving the problem described above: in a fault tolerant system, increasing the probability that the true fault location can be detected, and thereby reducing the probability that the system in which no failure has occurred is erroneously disconnected and, in the worst case, the server stops.

The fault tolerant system of the present invention is a fault tolerant system that operates two servers in parallel, in which each server includes: cross-link means for disconnecting the communication connection with the other server when an error whose location cannot be identified is detected in the communication between the two servers; and control means for operating each server independently for a predetermined time and identifying the location where the error occurred.

The fault tolerant method of the present invention is a method for a fault tolerant system that operates two servers in parallel, in which each server, when it detects in the communication between the two servers an error whose location cannot be identified, disconnects the communication connection with the other server, and each server is operated independently for a predetermined time to identify the location where the error occurred.

The computer program of the present invention causes a computer, in a fault tolerant system that operates two servers in parallel, to execute: a process in which each server disconnects the communication connection with the other server when it detects, in the communication between the two servers, an error whose location cannot be identified; and a process of operating each server independently for a predetermined time and identifying the location where the error occurred.

According to the present invention, in a fault tolerant system, increasing the probability that the true fault location can be detected reduces the probability that the system in which no failure has occurred is erroneously disconnected and, in the worst case, the server stops.

FIG. 1 is a block diagram showing an example of the configuration of the fault tolerant system.
FIG. 2 is a block diagram showing an example of a hardware circuit in which the fault tolerant system is implemented by a computer apparatus.
FIG. 3 is a flowchart showing the operation of the fault tolerant system.
FIG. 4 is a block diagram showing an example of the configuration of the fault tolerant system according to the second embodiment.

A first embodiment for carrying out the invention will be described in detail with reference to the drawings.

FIG. 1 is a block diagram showing an example of the configuration of the fault tolerant system 10.

The fault tolerant system 10 consists of two servers, a server 11A and a server 11B. Hereinafter, the server 11A is also referred to as the "A system" and the server 11B as the "B system".

The A-system server 11A includes a control unit 12A, an FT (fault tolerant) cross link unit 13A, and an IO (input/output) unit 14A.

Similarly, the B-system server 11B includes a control unit 12B, an FT cross link unit 13B, and an IO unit 14B.

The control units 12A and 12B operate the respective servers 11A and 11B independently for a predetermined time and identify the location where the error occurred.

When the FT cross link unit 13A or 13B of one server detects, in the communication between the two servers 11A and 11B, an error whose location cannot be identified, it disconnects the communication connection with the FT cross link unit 13B or 13A of the other server.

The IO unit 14A accesses the outside of the fault tolerant system 10.

The FT cross link unit 13A and the FT cross link unit 13B are connected via the cross link 15.
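The composition shown in FIG. 1 can be pictured as a small data model. The following Python sketch is illustrative only: the class names, the fields `link_connected` and `external_access`, and the helper functions are assumptions introduced here and do not appear in the patent.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ControlUnit:
    """Stands in for control unit 12A/12B (hypothetical model)."""
    name: str


@dataclass
class FTCrossLinkUnit:
    """Stands in for FT cross link unit 13A/13B (hypothetical model)."""
    name: str
    link_connected: bool = True  # this server's end of cross link 15

    def disconnect(self) -> None:
        # Cut this server's connection to the peer's FT cross link unit.
        self.link_connected = False


@dataclass
class IOUnit:
    """Stands in for IO unit 14A/14B (hypothetical model)."""
    name: str
    external_access: bool = True


@dataclass
class Server:
    system: str  # "A" or "B"
    control: ControlUnit
    cross_link: FTCrossLinkUnit
    io: IOUnit


def build_server(system: str) -> Server:
    # Reference signs follow FIG. 1: 12x, 13x, 14x for system x.
    return Server(
        system=system,
        control=ControlUnit(f"12{system}"),
        cross_link=FTCrossLinkUnit(f"13{system}"),
        io=IOUnit(f"14{system}"),
    )


def build_system() -> Dict[str, Server]:
    """Model fault tolerant system 10 as two servers keyed by system name."""
    return {name: build_server(name) for name in ("A", "B")}
```

Each server owns its own end of the cross link, so in this model one side can disconnect without touching the state of the other.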

This embodiment is characterized in that, for example in the fault tolerant system 10 in which two systems run in lockstep, when a failure occurs whose location is not immediately known and that requires the faulty system to be disconnected, the faulty system is disconnected accurately.

Assume that, in the fault tolerant system 10, a failure occurs whose location is not immediately known and that requires a system to be disconnected, for example an error detected on the cross link 15 that carries communication between the systems. In this case, each of the servers 11A and 11B temporarily stops access to the outside of the fault tolerant system 10 and enters a special mode, called split mode, in which each system operates independently.

The fault tolerant system 10 then runs for a while without deciding which system to disconnect. When the true fault location is subsequently identified, it disconnects the faulty system and resumes access to the outside with the system of the non-faulty server 11A or 11B as the valid system.

If the true fault location cannot be identified within the predetermined time, the fault tolerant system 10 performs the disconnection while keeping the system that has been assigned a higher priority in advance, and resumes access to the outside.
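The selection rule described above (keep the non-faulty system if the fault is located in time, otherwise keep the high priority system) can be written as a single function. This is a minimal sketch under stated assumptions: the function name, the parameters `diagnosed_at_s` and `timeout_s`, and the use of seconds are hypothetical and are not specified in the patent.

```python
from typing import Optional

SYSTEMS = ("A", "B")


def surviving_system(
    faulty: Optional[str],
    diagnosed_at_s: Optional[float],
    timeout_s: float,
    priority: str = "A",
) -> str:
    """Return the system that stays active after split mode ends.

    faulty          -- "A" or "B" if the fault was located, else None
    diagnosed_at_s  -- time in split mode at which it was located (assumed unit: seconds)
    timeout_s       -- the predetermined time (value is an assumption)
    priority        -- the system given the higher priority in advance
    """
    if priority not in SYSTEMS:
        raise ValueError("priority must be 'A' or 'B'")
    located_in_time = (
        faulty in SYSTEMS
        and diagnosed_at_s is not None
        and diagnosed_at_s < timeout_s
    )
    if located_in_time:
        # Disconnect the faulty system and keep the other one.
        return "A" if faulty == "B" else "B"
    # Fault not located within the predetermined time: fall back to priority.
    return priority
```

For example, `surviving_system("B", 1.5, 5.0)` keeps the A system, while `surviving_system("A", 7.0, 5.0)` falls back to the priority system A even though A is the faulty one, which is the residual risk the waiting period is meant to reduce.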

FIG. 2 is a block diagram showing an example of a hardware circuit in which the fault tolerant system 10 is implemented by a computer apparatus 20.

The A-system server 21A is composed of a CPU 22A, a memory 23A, an FT chipset 24A, and an IO device 25A.

Similarly, the B-system server 21B is composed of a CPU 22B, a memory 23B, an FT chipset 24B, and an IO device 25B.

The CPUs 22A and 22B run an operating system to control the servers 21A and 21B as a whole. The CPUs 22A and 22B also read programs and data into the memories 23A and 23B from, for example, a recording medium mounted in a drive device. The CPUs 22A and 22B function as part of the servers 11A and 11B of the first embodiment and execute various processes based on the programs. Each of the CPUs 22A and 22B may be composed of a plurality of CPUs.

The memories 23A and 23B are, for example, semiconductor memories.

The memories 23A and 23B and the CPUs 22A and 22B correspond to the control units 12A and 12B in FIG. 1.

The A-system FT chipset 24A is connected to the B-system FT chipset 24B via the cross link 26.

The FT chipsets 24A and 24B are, for example, electronic circuit chips. The FT chipsets 24A and 24B correspond to the FT cross link units 13A and 13B in FIG. 1.

The IO devices 25A and 25B send and receive data to and from the outside of the servers 21A and 21B. The IO devices 25A and 25B are, for example, electronic circuit chips. The IO devices 25A and 25B correspond to the IO units 14A and 14B in FIG. 1.

The cross link 26 transfers data between the server 21A and the server 21B. The cross link 26 is a signal transmission path that allows interconnection and is composed of, for example, a crossover cable.

FIG. 3 is a flowchart showing the operation of the fault tolerant system 10.

FIG. 3 is described using, as an example of a failure whose location is not immediately known, the operation when a failure of the cross link 15 caused by a fault in the B system is detected.

First, assume that the fault tolerant system 10 is in duplex operation (step S1). When an error is detected on the cross link 15 in this state (step S2), the FT cross link units 13A and 13B suspend access to the outside by suspending communication via the IO units 14A and 14B (step S3).

The fault tolerant system 10 then shifts to split mode (step S4).

Split mode is a mode in which communication over the cross link 15 between the FT cross link unit 13A and the FT cross link unit 13B is stopped and the A system and the B system each operate independently.
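The ordering of steps S3 and S4 (external access is suspended first, and only then is cross link communication stopped) can be sketched as follows. The `ServerState` class, its flags, and the log strings are hypothetical names chosen for illustration; the patent does not describe an implementation at this level.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServerState:
    """Per-server flags (hypothetical model of one system)."""
    system: str
    io_enabled: bool = True          # communication via the IO unit
    cross_link_enabled: bool = True  # communication over cross link 15


def enter_split_mode(servers: List[ServerState]) -> List[str]:
    """Apply step S3 to every server, then step S4, and return an event log."""
    log: List[str] = []
    # Step S3: suspend external access on both systems first, so that the
    # outside never observes two systems running independently.
    for server in servers:
        server.io_enabled = False
        log.append(f"{server.system}: IO paused")
    # Step S4: stop cross link communication; each system now runs alone.
    for server in servers:
        server.cross_link_enabled = False
        log.append(f"{server.system}: cross link stopped")
    return log
```

Calling `enter_split_mode([ServerState("A"), ServerState("B")])` leaves both servers with external access and the cross link disabled, with the IO events logged before the cross link events.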

In this mode, the servers 11A and 11B continue to operate, each system on its own (step S5).

Until the predetermined time has elapsed, the control units 12A and 12B operate the respective servers 11A and 11B independently and wait for the results of error detection (failure analysis) by the FT cross link units 13A and 13B (step S6).

While the fault location remains unknown within the predetermined time, the control units 12A and 12B repeat steps S5 to S7.

Here, assume that the control unit 12B determines that the B system has failed (Yes in step S7) before the predetermined time has elapsed (No in step S6).

In this case, in the fault tolerant system 10, the non-faulty system (the A system) operates as the active system (step S8).

The FT cross link unit 13A then resumes communication with the IO unit 14A (step S9). As a result, the fault tolerant system 10 resumes operation, in this example with the A system only (step S10).

On the other hand, if the fault location is still unknown after the predetermined time has elapsed (Yes in step S6), the fault tolerant system 10 selects the system that has been assigned a higher priority in advance as the active system (step S11).

Then, for example, if the system assigned the higher priority in advance is the A system, the FT cross link unit 13A resumes communication with the IO unit 14A (step S9). As a result, the fault tolerant system 10 resumes operation with one system (the A system) (step S10).
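The flow of FIG. 3 from error detection onward can be traced with a short loop. This is a sketch, not the patented implementation: the `diagnose` callback, the modelling of the predetermined time as a number of polling rounds (`max_polls`), and the returned trace of step labels are all assumptions made for illustration.

```python
from typing import Callable, List, Optional, Tuple


def run_fault_isolation(
    diagnose: Callable[[int], Optional[str]],
    max_polls: int,
    priority: str = "A",
) -> Tuple[str, List[str]]:
    """Walk steps S2 to S10 of FIG. 3 and return (active system, step trace).

    diagnose(poll) returns "A" or "B" once the fault is located in that
    system, or None while the location is still unknown (hypothetical hook).
    max_polls stands in for the predetermined time (assumption).
    """
    trace = ["S2", "S3", "S4"]  # error detected, IO paused, split mode entered
    poll = 0
    while True:
        trace.append("S5")  # both systems keep running independently
        trace.append("S6")  # has the predetermined time elapsed?
        if poll >= max_polls:
            trace.append("S11")  # timeout: choose the high priority system
            active = priority
            break
        trace.append("S7")  # has the fault location been identified?
        faulty = diagnose(poll)
        if faulty is not None:
            trace.append("S8")  # the non-faulty system becomes active
            active = "B" if faulty == "A" else "A"
            break
        poll += 1
    trace += ["S9", "S10"]  # IO resumed, operation continues on one system
    return active, trace
```

With `diagnose=lambda poll: "B" if poll == 2 else None` and `max_polls=5`, the loop passes through steps S5 to S7 three times and then ends with the A system active, matching the example in the text. With a callback that always returns `None`, the loop exits through step S11 and the priority system is kept.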

The fault tolerant system 10 according to this embodiment provides the following effect.

The effect is that, in the fault tolerant system 10, increasing the probability that the true fault location can be detected reduces the probability that the system in which no failure has occurred is erroneously disconnected and, in the worst case, the server stops.

The reason is that, in the fault tolerant system 10 composed of two systems, when a failure occurs and a system must be disconnected, each system is operated independently for a predetermined time without deciding which system to disconnect, which increases the probability that the true fault location can be detected.

<Second Embodiment>

Next, a second embodiment for carrying out the present invention will be described in detail with reference to the drawings.

FIG. 4 is a block diagram showing an example of the configuration of the fault tolerant system 30 according to the second embodiment.

The fault tolerant system 30 is composed of a server 31A and a server 31B.

The fault tolerant system 30 operates the two fault tolerant servers 31A and 31B in parallel. The servers 31A and 31B respectively include cross link units 33A and 33B that, when an error whose location cannot be identified is detected in the communication between the two servers 31A and 31B, disconnect the communication connection with the other server, and control units 32A and 32B that operate the respective servers 31A and 31B independently for a predetermined time and identify the location where the error occurred.

The fault tolerant system 30 according to this embodiment provides the following effect.

The effect is that, in the fault tolerant system 30, increasing the probability that the true fault location can be detected reduces the probability that the system in which no failure has occurred is erroneously disconnected and, in the worst case, the server stops.

The reason is that, in the fault tolerant system 30 composed of two systems, when a failure occurs and a system must be disconnected, each system is operated independently for a predetermined time without deciding which system to disconnect, which increases the probability that the true fault location can be detected.
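The reasoning above can be illustrated with a toy probability model. The patent gives no figures, so everything here is an assumption: `p_fault_in_priority` is the chance that the fault lies in the high priority system, `p_located_in_time` is the chance that split mode locates the fault within the predetermined time, and the two are treated as independent.

```python
def wrong_disconnect_probability(
    p_fault_in_priority: float,
    p_located_in_time: float,
) -> float:
    """Probability that the healthy system is the one disconnected.

    The healthy system is lost only when the fault is in the high priority
    system AND split mode fails to locate it in time, so that the fallback
    rule keeps the faulty priority system. Independence of the two events
    is an assumption of this toy model.
    """
    for p in (p_fault_in_priority, p_located_in_time):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_fault_in_priority * (1.0 - p_located_in_time)
```

Setting `p_located_in_time` to 0 reproduces the priority-only policy described as the existing problem, where the result equals `p_fault_in_priority`. Any positive chance of locating the fault during split mode lowers the result proportionally, which is the effect claimed for both embodiments.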

Embodiments of the present invention have been described above with reference to the drawings, but the present invention is not limited to these embodiments. Various changes that those skilled in the art can understand may be made to the configuration and details of the present invention within its scope.

DESCRIPTION OF SYMBOLS

10 Fault tolerant system
11A Server
11B Server
12A Control unit
12B Control unit
13A FT cross link unit
13B FT cross link unit
14A IO unit
14B IO unit
15 Cross link
20 Computer apparatus
21A Server
21B Server
22A CPU
22B CPU
23A Memory
23B Memory
24A FT chipset
24B FT chipset
25A IO device
25B IO device
26 Cross link
30 Fault tolerant system
31A Server
31B Server
32A Control unit
32B Control unit
33A Cross link unit
33B Cross link unit

Claims

1. A fault tolerant system that operates two servers in parallel, wherein each of the servers comprises:
cross-link means for disconnecting the communication connection with the other server when an error whose location cannot be identified is detected in the communication between the two servers; and
control means for operating each of the servers independently for a predetermined time and identifying the location where the error occurred.

2. The fault tolerant system according to claim 1, wherein each of the servers further comprises input/output means for accessing the outside of the fault tolerant system, and wherein the cross-link means, when it detects the error, disconnects the communication connection with the input/output means and then disconnects the communication connection with the other server.

3. The fault tolerant system according to claim 1, wherein, if the control means cannot identify the location where the error occurred within the predetermined time, the cross-link means of the server that has been assigned a higher priority in advance disconnects the communication connection with the other server.

4. The fault tolerant system according to any one of claims 1 to 3, wherein the two servers are operated in parallel by lockstep operation.

5. A fault tolerant method for a fault tolerant system that operates two servers in parallel, wherein:
each of the servers disconnects the communication connection with the other server when an error whose location cannot be identified is detected in the communication between the two servers; and
each of the servers is operated independently for a predetermined time to identify the location where the error occurred.

6. The fault tolerant method according to claim 5, wherein each of the servers, when it detects the error, disconnects the communication connection with the outside of the fault tolerant system and then disconnects the communication connection with the other server.

7. The fault tolerant method according to claim 5 or 6, wherein, if the location where the error occurred cannot be identified within the predetermined time, the server that has been assigned a higher priority in advance disconnects its communication connection with the other server.

8. A program for a fault tolerant system that operates two servers in parallel, the program causing a computer to execute:
a process in which each of the servers disconnects the communication connection with the other server when an error whose location cannot be identified is detected in the communication between the two servers; and
a process of operating each of the servers independently for a predetermined time and identifying the location where the error occurred.

9. The program according to claim 8, further causing the computer to execute a process in which each of the servers, when it detects the error, disconnects the communication connection with the outside of the fault tolerant system and then disconnects the communication connection with the other server.

10. The program according to claim 8 or 9, causing the computer to execute, if the location where the error occurred cannot be identified within the predetermined time, a process in which the server that has been assigned a higher priority in advance disconnects its communication connection with the other server.