JP2001290670A - Cluster system

Info

Publication number: JP2001290670A
Application number: JP2000108501A
Authority: JP
Inventors: Ryoichi Tanabe; 亮一田辺
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2000-04-10
Filing date: 2000-04-10
Publication date: 2001-10-19

Abstract

PROBLEM TO BE SOLVED: To eliminate the need, in a conventional cluster system, to switch processing to a reserve computer system manually each time a fault occurs. SOLUTION: The system is provided with a fault-detecting means 101 for detecting a fault of a magnetic storage device 130 in an active computer system 100, a data-transmitting means 103 for transmitting the data in the main storage device 110 of the active computer system to a reserve computer system 200 when the fault is detected, and a data-receiving means 201 for receiving the transmitted data and storing it in a main storage device 210 on the reserve computer system side. After the data has been completely transmitted and received, the active computer system is stopped and an application is started on the reserve computer system. By executing its processing while referring to the data stored in the main storage device 210, the application continues the processing of the active computer system.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to a cluster system comprising an active computer device and a spare computer device, and in particular to a takeover method for handing the data in the main storage device over to the spare computer device when a failure occurs.

[0002]

[Prior art] Conventionally, a cluster system of this kind consists of one or more computer devices and one spare computer device. When a failure occurs in any one of the computer devices, the spare computer device takes over the processing of the failed computer device so that the system keeps operating.

[0003]

[Problem to be solved by the invention] In the conventional cluster system described above, however, processing is switched to the spare computer device manually when a failure occurs in a computer device. The data in the main storage device is not sent to the spare computer device to automate the takeover. Consequently, the handover of processing to the spare computer device had to be performed manually every time a failure occurred.

[0004] The present invention has been made in view of the above problem of the prior art, and its object is to provide a cluster system that can send the data in the main storage device to the spare computer device when a failure occurs and take over processing automatically.

[0005]

[Means for solving the problem] To achieve the above object, the present invention is a cluster system comprising an active computer device and a spare computer device, characterized by comprising: means for detecting a failure of an auxiliary storage device of the active computer device; means for transmitting the data in the main storage device of the active computer device to the spare computer device when the failure detection means detects a failure; and means for receiving the transmitted data and storing it in the main storage device on the spare computer device side, wherein, after transmission and reception of the data are complete, the active computer device is stopped and an application is started on the spare computer device, and the application continues the processing of the active computer device by executing its processing with reference to the data stored in that main storage device.

[0006]

[Embodiments of the invention] Embodiments of the present invention are described in detail below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of one embodiment of the cluster system of the present invention. In FIG. 1, the cluster system consists of an active computer device 100 and a spare computer device 200. When a failure occurs in the active computer device 100, the spare computer device 200 takes over its processing. Although FIG. 1 shows a single active computer device 100, there may be a plurality of active computer devices 100.

[0007] The active computer device 100 comprises a magnetic storage device 130 serving as an auxiliary storage device, an input/output control unit 120 that controls data input and output for the magnetic storage device 130, a failure detection means 101 that detects failures of the magnetic storage device 130, an emergency stop means 102 that performs an emergency stop of the active computer device 100 when a failure occurs, a main storage device 110, and a data transmission means 103 that transmits the data in the main storage device 110 to the spare computer device 200 when a failure occurs. Reference numeral 300 denotes an application program (hereinafter, "application") running on the active computer device 100.

[0008] The spare computer device 200 comprises a data receiving means 201 that receives the data transmitted from the active computer device 100, a main storage device 210, an application starting means 202, a magnetic storage device 230, and an input/output control unit 220. Reference numeral 400 denotes a computer terminal connected to the cluster system, and 500 denotes an application on the computer terminal 400. In this embodiment, the failure detection means 101 detects failures of the magnetic storage device 130. A failure of the magnetic storage device 130 is assumed as the kind of failure that makes it impossible for application 300 to continue processing on the active computer device 100.

[0009] Next, the specific operation of this embodiment is described in detail with reference to the flowcharts of FIGS. 2 to 6. FIG. 2 is a flowchart showing the failure detection processing of the failure detection means 101. In FIG. 2, at system startup the failure detection means 101 obtains setting values such as the target device (here, the magnetic storage device 130) and the checking interval from a setting value file (not shown) (step A1). While the system is in operation, the failure detection means 101 performs failure detection by test I/O based on the acquired setting values.

[0010] That is, it issues a test I/O to the magnetic storage device 130 (step A2) and determines whether the device is normal based on the reply information returned for the test I/O (step A3). If the device is normal, the failure detection means 101 pauses for a fixed time (step A4), then returns to step A2, issues a test I/O again, and again determines whether the device is normal (step A3). Steps A2 to A4 are repeated in this way, so that test I/Os are issued periodically to monitor whether the magnetic storage device 130 is normal. If, on the other hand, a disk failure occurs in the magnetic storage device 130 while the system is running and the failure is detected in step A3, the failure detection means 101 notifies the emergency stop means 102 that a failure has occurred.
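The loop of steps A2 to A4 can be illustrated with a minimal Python sketch. It is not part of the specification: the names `monitor`, `test_io`, `notify_failure` and `interval_s` are hypothetical, the test I/O is abstracted as a callable that returns True for a normal reply, and the settings of step A1 are assumed to have been loaded already and are passed in as arguments.

```python
import time
from typing import Callable


def monitor(test_io: Callable[[], bool],
            notify_failure: Callable[[], None],
            interval_s: float,
            sleep: Callable[[float], None] = time.sleep) -> None:
    """Probe the device periodically until a probe fails, then notify.

    test_io        -- hypothetical probe; True means the reply was normal
    notify_failure -- hypothetical hook standing in for the notification
                      sent to the emergency stop means 102
    interval_s     -- checking interval taken from the settings (step A1)
    """
    while True:
        if not test_io():        # steps A2 and A3: issue test I/O, judge reply
            notify_failure()     # abnormal reply: report the failure and stop
            return
        sleep(interval_s)        # step A4: pause before the next probe
```

The `sleep` parameter exists only so that the loop can be exercised without real delays; in normal use the default `time.sleep` applies.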

[0011] FIG. 3 is a flowchart showing the processing flow of the emergency stop means 102 when a failure occurs. In FIG. 3, when notified of a failure, the emergency stop means 102 first performs blocking processing to put application 300 into the blocked state (step B1). The blocked state is a state in which transaction requests to application 300 are not accepted. The blocking processing changes a blocked-state table (not shown) to the blocked state and waits for running transactions to finish.

[0012] Application 500 on the computer terminal 400 refers to the blocked-state table before issuing a transaction request to application 300, checks whether application 300 is in the blocked state, and thereby confirms whether a new transaction request can be made. Accordingly, once the blocking processing has been performed, application 500 judges that application 300 is blocked and issues no new transaction requests.
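The blocking processing of step B1 and the client-side check can be sketched as follows. This is an assumption-laden illustration, not the mechanism of the specification: the class `BlockedStateTable` and its methods are hypothetical, the table is modelled as an in-process object protected by a condition variable, and the check and the start of a transaction are combined into one atomic call, which the text does not require.

```python
import threading
from typing import Optional


class BlockedStateTable:
    """Hypothetical in-process stand-in for the blocked-state table."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._blocked = False
        self._in_flight = 0   # number of transactions currently running

    def try_begin(self) -> bool:
        """Client side: consult the table before issuing a transaction."""
        with self._cond:
            if self._blocked:
                return False          # blocked: no new request is issued
            self._in_flight += 1
            return True

    def end(self) -> None:
        """Mark one running transaction as finished."""
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def block_and_drain(self, timeout: Optional[float] = None) -> bool:
        """Step B1: set the blocked state, then wait for running transactions.

        Returns True once no transaction is running, False on timeout.
        """
        with self._cond:
            self._blocked = True
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout)
```

Setting the flag before waiting means that no new transaction can start while the running ones drain, which is what allows the memory contents to be transferred in a consistent state.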

[0013] When the blocking processing ends, the data transmission means 103 transmits the data in the main storage device 110 to the spare computer device 200 (step B2). The processing of the data transmission means 103 is shown in the flowchart of FIG. 4. In FIG. 4, the data transmission means 103 first obtains setting values such as a memory identifier and the size of the data to be transmitted from a setting value file (not shown) (step C1).

[0014] Next, it attaches the memory (main storage device 110) using the obtained identifier (step C2), and creates a socket and establishes a connection in order to communicate with the data receiving means 201 on the spare computer device 200 using the TCP/IP protocol (step C3). The data transmission means 103 then reads the data in the main storage device 110 (step C4) and transmits it (step C5). Here, the data transmission means 103 reads from the main storage device 110, starting at the address obtained by the attach, the amount of data given by the size in the setting value file, and transmits it to the spare computer device 200.
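Steps C2 to C5 can be sketched in Python as below. The sketch makes several assumptions that are not in the specification: the "attach by memory identifier" is represented by Python's `multiprocessing.shared_memory` (named shared memory) rather than whatever attach mechanism the described system uses, and the names `send_region`, `transfer`, `name`, `host` and `port` are hypothetical. The settings of step C1 are taken as function arguments.

```python
import socket
from multiprocessing import shared_memory


def send_region(sock: socket.socket, region: memoryview, size: int) -> None:
    """Steps C4 and C5: read `size` bytes from the start of the attached
    region and transmit all of them over the connected socket."""
    sock.sendall(region[:size])


def transfer(name: str, size: int, host: str, port: int) -> None:
    """Attach the named region, connect to the receiver, and send the data.

    name, size -- memory identifier and data size from the settings (step C1)
    host, port -- hypothetical address of the receiving side
    """
    shm = shared_memory.SharedMemory(name=name)               # step C2: attach
    try:
        with socket.create_connection((host, port)) as sock:  # step C3
            send_region(sock, shm.buf, size)                  # steps C4, C5
    finally:
        shm.close()                                           # detach only
```

`sendall` is used instead of `send` because a single `send` call may transmit only part of a large buffer.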

[0015] Next, the processing of the data receiving means 201 is described with reference to the flowchart of FIG. 5. In FIG. 5, the data receiving means 201 is started when the operating system (OS) boots on the spare computer device 200, which is brought up in advance as a spare. It first obtains setting values such as the memory identifier and the data size from the setting value file (step D1) and, based on them, reserves memory for storing the data. It also attaches the memory (main storage device 210) (step D2). The memory identifier and data size used here are the same as the setting values used by the data transmission means 103 on the active computer device 100.

[0016] Next, as the TCP/IP communication procedure, the data receiving means 201 creates a socket, associates it with a port, sets up the queue, and enters the state of waiting for a connection request, thereby preparing for communication (step D3). It is then able to accept a connection request from the data transmission means 103 at any time (step D4). When a connection request arrives from the data transmission means 103 in this state, the data receiving means 201 establishes the connection, waits for data to be sent by the data transmission means 103, and receives the data (step D5). The received data is written to the main storage device 210 at the address obtained by the attach (step D6).
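The receiving side of steps D3 to D6 can be sketched as follows. As with any such illustration, the names are hypothetical (`recv_into_region`, `serve_once`, `host`, `port`, `backlog`), and the attached memory of step D2 is represented simply by a writable `memoryview` passed in by the caller.

```python
import socket


def recv_into_region(conn: socket.socket, region: memoryview, size: int) -> int:
    """Steps D5 and D6: receive up to `size` bytes and write them to the
    start of the region. Returns the number of bytes actually stored."""
    received = 0
    while received < size:
        n = conn.recv_into(region[received:size])
        if n == 0:               # the sender closed the connection early
            break
        received += n
    return received


def serve_once(host: str, port: int, region: memoryview, size: int,
               backlog: int = 1) -> int:
    """Step D3 (socket, port, queue), step D4 (wait for a connection
    request), then steps D5 and D6 for a single connection."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind((host, port))         # associate the socket with a port
        srv.listen(backlog)            # set up the connection queue
        conn, _addr = srv.accept()     # block until a request arrives
        with conn:
            return recv_into_region(conn, region, size)
```

The loop in `recv_into_region` is needed because TCP is a byte stream: one `recv_into` call may return fewer bytes than were sent.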

[0017] Returning to FIG. 3, when the data transmission and reception have been completed in this way, the emergency stop means 102 performs emergency stop processing of application 300 in step B3. It then performs an emergency stop of the operating system (OS) (step B4), which completes the stop processing of the active computer device 100.
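The ordering of steps B1 to B4 can be summarized in a short Python sketch. The four callables are hypothetical placeholders for the operations described in the text, and how the described system reacts when one step fails is not stated, so no error handling is shown.

```python
from typing import Callable

Step = Callable[[], None]


def emergency_stop(block_application: Step,
                   transmit_memory: Step,
                   stop_application: Step,
                   stop_os: Step) -> None:
    """Run the emergency stop sequence in the order given in FIG. 3."""
    block_application()   # step B1: blocking processing
    transmit_memory()     # step B2: send main storage data to the spare
    stop_application()    # step B3: emergency stop of the application
    stop_os()             # step B4: emergency stop of the OS
```

The order matters: the application is blocked before the transfer so that the memory contents no longer change, and the active side is stopped only after the transfer has finished.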

[0018] Meanwhile, application 300 is started on the spare computer device 200 and continues the processing of the active computer device 100. FIG. 6 shows the processing of the application starting means 202 at this time. In FIG. 6, the application starting means 202 obtains from the setting value file the same memory identifier as in the settings of the data receiving means 201 (step E1) and starts application 300 (step E2). Application 300 then uses the obtained memory identifier to refer to the data in the main storage device 210 while it runs, which allows it to continue the processing that was under way on the active computer device 100 before the switchover.

[0019] Next, another embodiment of the present invention is described. In this embodiment the failure detection method of the failure detection means 101 differs. The rest of the configuration is the same as in the embodiment of FIG. 1. FIG. 7 is a flowchart showing the processing of the failure detection means 101 of this embodiment. In FIG. 7, the failure detection means 101 first issues a test I/O to the magnetic storage device 130 (step F1) and detects a failure according to whether the result of the test I/O is normal (step F2). If it is normal, the means pauses for a fixed time (step F3), returns to step F1, issues a test I/O again, and determines whether the result is normal (step F2).

[0020] Test I/Os are issued periodically in this way to monitor whether the magnetic storage device 130 is normal. Here, the failure detection means 101 counts the number of times the result was judged abnormal in step F2 (step F4) and compares the count with a preset threshold (step F5). If intermittent failures occur because the disk of the magnetic storage device 130 is deteriorating, test I/Os will come back abnormal. While the count is at or below the threshold, the means returns to step F1 and treats the device as normal. When the count exceeds the threshold, it determines that a failure has occurred.

[0021] Thus, in this embodiment the number of abnormal results is counted and a failure is detected only when the count exceeds the threshold, which prevents unnecessary switchovers to the spare computer device 200 caused by intermittent failures accompanying disk deterioration. The time spent on switching computer devices is therefore reduced, and the processing efficiency of the system improves. In addition, by adjusting the threshold, the computer devices can be made to switch more readily, or the frequency of switching can be reduced.
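The threshold-based detection of steps F1 to F5 can be sketched in Python as follows. The names `monitor_with_threshold`, `test_io`, `notify_failure`, `interval_s` and `threshold` are hypothetical. Following the description, the count is cumulative (the text does not mention resetting it after a normal result) and an abnormal result at or below the threshold leads straight back to step F1.

```python
import time
from typing import Callable


def monitor_with_threshold(test_io: Callable[[], bool],
                           notify_failure: Callable[[], None],
                           interval_s: float,
                           threshold: int,
                           sleep: Callable[[float], None] = time.sleep) -> int:
    """Report a failure only once abnormal results exceed `threshold`.

    Returns the abnormal-result count at the moment the failure is reported.
    """
    abnormal = 0
    while True:
        if test_io():                 # steps F1 and F2: probe and judge
            sleep(interval_s)         # step F3: pause, then probe again
            continue
        abnormal += 1                 # step F4: count the abnormal result
        if abnormal > threshold:      # step F5: compare with the threshold
            notify_failure()
            return abnormal
        # count at or below the threshold: treated as normal, back to F1
```

With `threshold=0` this behaves like the first embodiment, where a single abnormal reply triggers the switchover. Larger values tolerate more intermittent errors before switching.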

[0022]

[Effects of the invention] As described above, the present invention has the following effects.
(1) When a failure occurs in the auxiliary storage device needed for application processing, the data in the main storage device of the active computer device is transmitted to the spare computer device and stored in its main storage device, so the spare computer device can carry on the processing of the active computer device.
(2) Because blocking processing, which puts the application into a state where it does not accept transaction requests, is performed before the data is transmitted and received, the data is handed over in a consistent state.
(3) Because data is transmitted and received only when the computer devices are switched, no extra communication is needed during normal system operation.
(4) Even if there are a plurality of active computer devices, the spare computer device needs only the memory capacity of a single active computer device, so its memory requirement does not grow.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of one embodiment of the cluster system of the present invention.

FIG. 2 is a flowchart showing the processing of the failure detection means of FIG. 1.

FIG. 3 is a flowchart showing the processing of the emergency stop means of FIG. 1.

FIG. 4 is a flowchart showing the processing of the data transmission means of FIG. 1.

FIG. 5 is a flowchart showing the processing of the data receiving means of FIG. 1.

FIG. 6 is a flowchart showing the processing of the application starting means of FIG. 1.

FIG. 7 is a flowchart showing the processing of the failure detection means of another embodiment of the present invention.

[Explanation of symbols]

100: active computer device
101: failure detection means
102: emergency stop means
103: data transmission means
110: main storage device
120: input/output control unit
130: magnetic storage device
200: spare computer device
201: data receiving means
202: application starting means
210: main storage device
220: input/output control unit
230: magnetic storage device
300: application
400: computer terminal
500: application

Claims

[Claims]

1. A cluster system comprising an active computer device and a spare computer device, characterized by comprising: means for detecting a failure of an auxiliary storage device of the active computer device; means for transmitting data in the main storage device of the active computer device to the spare computer device when the failure detection means detects a failure; and means for receiving the transmitted data and storing it in the main storage device on the spare computer device side, wherein, after transmission and reception of the data are complete, the active computer device is stopped and an application is started on the spare computer device, and the application continues the processing of the active computer device by executing its processing with reference to the data stored in that main storage device.

2. The cluster system according to claim 1, further comprising means for performing blocking processing that puts the application into a state in which transaction requests to it are not accepted when the failure detection means detects a failure of the auxiliary storage device.

3. The cluster system according to claim 1 or 2, wherein the failure detection means periodically issues a test I/O to the auxiliary storage device and detects a failure of the auxiliary storage device based on the reply to the test I/O.

4. The cluster system according to claim 3, wherein the failure detection means counts the number of abnormal test I/O results and detects a failure of the auxiliary storage device when the count exceeds a predetermined value.