JPH0934852A

JPH0934852A - Cluster system

Info

Publication number: JPH0934852A
Application number: JP7201406A
Authority: JP
Inventors: Noriaki Sakai; 則彰境
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1995-07-13
Filing date: 1995-07-13
Publication date: 1997-02-07

Abstract

PROBLEM TO BE SOLVED: To take over processing at high speed when a fault occurs in a computer that is a component of a cluster system. SOLUTION: Each of the computers 10-1 to 10-3 is provided with a service processor 44 that performs recovery processing and reconfiguration of a faulty part in response to the detection of a fault. The service processor 44 is provided with an interrupt control unit 54, which receives a fault signal from a component in the computer and generates an interrupt for a fault processing routine, and an auxiliary CPU 52, which executes the fault processing routine in response to the interrupt from the interrupt control unit 54. In the fault processing routine, the auxiliary CPU 52 analyzes the fault to determine whether it makes continued processing impossible and, if so, notifies the other computers that its own computer cannot continue processing.

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to a cluster system formed by connecting a plurality of computers to each other, and more particularly to a cluster system that achieves improved availability.

[0002]

[Prior Art] Conventionally, a cluster system is sometimes built by connecting a plurality of computers to each other in order to improve system availability and achieve performance scalability. In such a cluster system, when one computer can no longer continue processing because of some failure, another computer takes over the interrupted processing in its place, so that the system as a whole achieves high availability. In addition, by distributing the same type of processing across a plurality of computers, performance can be raised in proportion to the number of computers that make up the cluster system.

[0003] For example, D. P. Cox and F. Lawlor, "Overview of HACMP/6000 Version 2.1" (IBM AIXpert, 1994, No. 2) describes the high availability and performance scalability that a cluster configuration provides. In addition, Japanese Examined Patent Publication No. H1-217666 discloses a failure detection method for a multiprocessor system that, like the cluster system described above, connects a plurality of processors to perform distributed processing.

[0004] Further, an example of a computer equipped with a dedicated processor that, on detecting a failure, disconnects the failed part and performs reconfiguration and restart processing is shown in J. O. Nicholson, "The RISC System/6000 SMP System" (Compcon Proceedings, 1995). Another example that performs similar failure handling is shown in T. B. Alexander et al., "Corporate Business Servers: An Alternative to Mainframes for Business Computing" (HP Journal, June 1994).

[0005] Conventionally, an abnormality in a computer that belongs to a cluster system is generally detected as follows: the computers send each other an operation monitoring signal called a "heartbeat" at fixed time intervals, and a computer is judged to be abnormal when this signal is found to have stopped.
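
The conventional heartbeat check described above can be sketched roughly as follows. This is only an illustration of the general idea: the class name `HeartbeatMonitor`, the node labels, and the 3-second timeout are assumptions, not details taken from the cited prior art.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Receiver-side bookkeeping for heartbeat signals (illustrative names only)."""

    timeout_s: float  # assumed: silence longer than this marks a peer as abnormal
    last_seen: dict = field(default_factory=dict)  # peer name -> last arrival time

    def record(self, peer: str, now: float) -> None:
        """Note that a heartbeat from `peer` arrived at time `now`."""
        self.last_seen[peer] = now

    def suspected(self, now: float) -> list:
        """Return the peers whose heartbeat has been silent longer than the timeout."""
        return sorted(p for p, t in self.last_seen.items() if now - t > self.timeout_s)


monitor = HeartbeatMonitor(timeout_s=3.0)
monitor.record("node2", now=0.0)
monitor.record("node3", now=0.0)
monitor.record("node2", now=2.0)  # node2 keeps sending; node3 falls silent
suspects = monitor.suspected(now=4.0)
print(suspects)
```

Note that the receiver only observes silence, so it cannot tell from this check alone whether the peer or the communication path has failed.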

[0006]

[Problems to Be Solved by the Invention] As described above, the conventional approach has the computers send each other an operation monitoring signal at fixed time intervals and judges a computer to be abnormal when this signal stops. However, the signal also becomes unreceivable when the computer itself is normal but the communication path connecting the computers is abnormal, so a healthy computer may be wrongly judged to be abnormal. That is, when the communication path between computers is abnormal, the computer that sent the operation monitoring signal cannot determine that the path is abnormal and therefore simply continues its processing. Meanwhile, the computer on the receiving side, not having received the signal, judges that the sending computer has failed and takes over the processing that was running on it. In this case the two computers end up performing the same processing, which causes problems such as address conflicts due to duplicated LAN network addresses.

[0007] To solve the first problem described above, L. Leaby, "New Availability Features of Local Area VAXcluster Systems" (Digital Technical Journal, Vol. 3, No. 3, 1991), for example, discloses a technique that improves availability by providing a plurality of communication paths between the computers of a cluster system, that is, by making the communication paths redundant. In such a configuration, when the operation monitoring signal does not arrive within a predetermined time, the receiving computer uses another communication path to check whether the sending computer is normal. This check determines whether the sending computer is normal or abnormal. However, the receiving computer must first wait the predetermined time for the operation monitoring signal and only then check the state of the sending computer over another communication path, which takes time. Consequently, when the sending computer really has failed, switching the processing over to the other computer is slow.
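
The two-stage check on redundant paths, and the reason it is slow, can be illustrated with a small sketch. The function names, the returned labels, and the idea of passing the alternate-path probe as a callable are assumptions made for illustration only.

```python
from typing import Callable


def classify_peer(silence_s: float, timeout_s: float,
                  probe_alternate_path: Callable[[], bool]) -> str:
    """Judge a peer from heartbeat silence, then from a probe on another path."""
    if silence_s <= timeout_s:
        return "normal"  # heartbeat is still arriving in time
    if probe_alternate_path():
        return "normal (primary path faulty)"  # peer answered on the other path
    return "abnormal"  # silent on both paths


def worst_case_detection_delay(timeout_s: float, probe_s: float) -> float:
    """A real failure is confirmed only after the timeout plus the probe."""
    return timeout_s + probe_s


print(classify_peer(5.0, 3.0, lambda: True))
print(classify_peer(5.0, 3.0, lambda: False))
print(worst_case_detection_delay(timeout_s=3.0, probe_s=1.0))
```

Both the timeout and the probe sit on the critical path before takeover can begin, which is the delay the invention aims to remove.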

[0008] An object of the present invention is to solve the above conventional problems and to provide a cluster system that achieves improved availability and high reliability by having another computer promptly take over processing when a computer constituting a node becomes abnormal.

[0009]

[Means for Solving the Problems] To achieve the above object, the present invention provides a cluster system configured by connecting a plurality of computers, in which each computer is provided with a dedicated processor that performs recovery processing and reconfiguration processing of a failed part in response to detection of a failure. The dedicated processor detects and analyzes a failure of its own computer, determines whether the failure makes it impossible to continue processing, and, when it determines that processing cannot continue, notifies the other computers that its own computer cannot continue processing.

[0010] In a preferred mode, the dedicated processor performs recovery processing for the failure when it determines that processing can continue.

[0011] The present invention, which achieves the above object, also provides a cluster system configured by connecting a plurality of computers, in which each computer includes a dedicated processor that performs recovery processing and reconfiguration processing of a failed part in response to detection of a failure. The dedicated processor includes interrupt control means that receives a failure occurrence signal from a component in the computer and generates an interrupt for a failure processing routine, and an auxiliary CPU that executes the failure processing routine in response to the interrupt from the interrupt control means. In the failure processing routine, the auxiliary CPU analyzes the failure to determine whether it makes it impossible to continue processing and, when it determines that processing cannot continue, notifies the other computers via the communication path that its own computer cannot continue processing.

[0012] In a preferred mode, the auxiliary CPU of the dedicated processor performs recovery processing for the failure when it determines that processing can continue. In another preferred mode, the computers are connected to each other by two communication paths, and the auxiliary CPU of the dedicated processor notifies the other computers via these communication paths that its own computer cannot continue processing.

[0013]

[Operation] In the present invention, when a failure occurs in a computer, the failure information is reported to the interrupt control means of the dedicated processor. On receiving this report, the interrupt control means raises an interrupt to the auxiliary CPU and transfers processing to the failure processing routine. The auxiliary CPU analyzes the cause of the failure from the reported failure information and determines whether the failure makes it impossible to continue processing. If processing can continue, recovery processing for the failure is performed. If processing cannot continue, the dedicated processor sends, via the communication path, notification information stating that "a failure has occurred in this computer and processing can no longer continue" to the other computers of the cluster system. The other computers that receive the notification information take over the processing from the failed computer by a predetermined method.

[0014]

[Embodiment] Next, an embodiment of the present invention is described in detail with reference to the drawings. FIG. 1 is a block diagram showing the overall configuration of a cluster system to which the present invention is applied, and FIG. 2 is a block diagram showing the details of a computer constituting a node of the cluster system of FIG. 1.

[0015] In FIG. 1, the cluster system 10 consists of node #1 (computer 10-1), node #2 (computer 10-2), and node #3 (computer 10-3). These nodes are connected to each other by two LANs 20-1 and 20-2 via the installed LAN adapters 30-1 and 30-2, 30-3 and 30-4, and 30-5 and 30-6.

[0016] The two LANs 20-1 and 20-2 are used for exchanging data among nodes #1, #2, and #3 of the cluster system 10, exchanging configuration information, sending abnormality notifications when a failure occurs, and so on. The LANs 20-1 and 20-2 provide a redundant configuration, but they are not duplexed.

[0017] Nodes #1, #2, and #3 of the cluster system 10 are fitted with LAN adapters 30-1 and 30-2, 30-3 and 30-4, and 30-5 and 30-6, respectively, and the two adapters of each node are connected to the LANs 20-1 and 20-2, respectively. Which of LAN adapter (A) and LAN adapter (B) is used is decided by a predetermined method such as round robin.
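
As one possible reading of the round-robin choice between LAN adapter (A) and LAN adapter (B), a minimal sketch follows. The class name `AdapterSelector` and the use of the reference numerals as string labels are assumptions for illustration; the patent does not specify how the selection is implemented.

```python
from itertools import cycle


class AdapterSelector:
    """Hands out LAN adapters in strict rotation (hypothetical helper)."""

    def __init__(self, adapters):
        self._rotation = cycle(adapters)  # endless A, B, A, B, ... iterator

    def pick(self):
        """Return the adapter to use for the next transmission."""
        return next(self._rotation)


# Node #1 carries adapter (A) 30-1 and adapter (B) 30-2.
selector = AdapterSelector(["30-1", "30-2"])
order = [selector.pick() for _ in range(4)]
print(order)
```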

[0018] The configuration of the computers constituting nodes #1, #2, and #3 is described with reference to FIG. 2, which shows the computer 10-1 constituting node #1. The computers 10-2 and 10-3 constituting the other nodes #2 and #3 have the same configuration.

[0019] The computer 10-1 includes a CPU 41, a memory 42, a bus bridge 43, a service processor 44, a system bus 45, a LAN adapter (A) 30-1, a LAN adapter (B) 30-2, and an input/output bus 46.

[0020] The CPU 41 can access the memory 42 through the system bus 45 and, via the bus bridge 43 that is likewise connected to the system bus 45, can access the various input/output adapters connected to the input/output bus 46 (adapters other than the LAN adapters are not shown). Adapters for connecting disk devices, tape devices, and the like, communication control adapters, and so on are optionally installed on the input/output bus 46.

[0021] The service processor 44 consists of an auxiliary CPU 52, an auxiliary memory 53, an interrupt control unit 54, a bus bridge 55, and a bus 51 connecting them. The service processor 44 is a dedicated processor that provides the computer 10-1 with functions such as startup processing, configuration control processing, failure processing, and remote diagnosis support. For example, when a failure occurs, the interrupt control unit 54 raises an interrupt to the auxiliary CPU 52 on the basis of the failure occurrence signal reported via the signal lines (L1, L2, L3), and processing is transferred to the failure processing routine to start failure processing.
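
The hand-off from the interrupt control unit 54 to the auxiliary CPU 52 can be pictured with the following sketch. It models the interrupt as a plain callback; the class names, the `attach` method, and the recorded tuple format are illustrative assumptions rather than anything specified in the patent.

```python
class InterruptControlUnit:
    """Receives failure occurrence signals and interrupts the auxiliary CPU."""

    def __init__(self):
        self._handler = None

    def attach(self, handler):
        """Register the failure processing routine to run on an interrupt."""
        self._handler = handler

    def fault_signal(self, line, component):
        """Called when a component reports a failure on a signal line."""
        if self._handler is None:
            raise RuntimeError("no failure processing routine attached")
        self._handler(line, component)  # models raising the interrupt


class AuxiliaryCpu:
    """Stands in for the auxiliary CPU; it just records what it was asked to handle."""

    def __init__(self):
        self.handled = []

    def fault_routine(self, line, component):
        self.handled.append((line, component))


aux_cpu = AuxiliaryCpu()
icu = InterruptControlUnit()
icu.attach(aux_cpu.fault_routine)
icu.fault_signal("L3", "memory 42")
print(aux_cpu.handled)
```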

[0022] Next, the operation of this embodiment configured as above is described. While the cluster system 10 is operating normally, nodes #1 to #3 in FIG. 1 perform inter-node data communication using either of the two LANs 20-1 and 20-2. For example, in a cluster system that performs database processing, database record information, lock information for record updates, and the like are exchanged among the nodes of the cluster.

[0023] Next, the operation when a failure occurs in a node is described. Suppose, for example, that a failure occurs in the CPU 41, the memory 42, or the like of the computer 10-1 constituting node #1. In FIG. 2, failure information for the CPU 41 or the memory 42 is reported to the interrupt control unit 54 of the service processor 44 via the signal line L3. On receiving this report, the interrupt control unit 54 raises an interrupt to the auxiliary CPU 52 through the bus 51 and transfers processing to the failure processing routine.

[0024] The auxiliary CPU 52 executes failure processing according to the failure processing routine shown in the flowchart of FIG. 3. That is, the auxiliary CPU 52 analyzes the cause of the failure from the reported failure information (step 301) and determines whether the failure makes it impossible to continue processing (step 302). For example, when the failure is related to input/output processing and can be recovered by retrying the operation or the like, the auxiliary CPU judges that processing can continue; otherwise it judges that this node cannot continue processing. Possible failures in the memory 42 include ECC errors and parity errors, and possible failures related to input/output processing include CRC errors.
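
The decision in step 302 can be written as a small predicate. The sketch below follows the example rule given in the text (an input/output failure that a retry can clear allows continuation; anything else does not); the `Fault` record, its field names, and the `FaultSource` enumeration are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class FaultSource(Enum):
    CPU = "cpu"
    MEMORY = "memory"
    IO = "io"


@dataclass(frozen=True)
class Fault:
    source: FaultSource       # which component reported the failure
    kind: str                 # e.g. "ECC", "parity", "CRC"
    retry_recoverable: bool   # assumed result of the analysis in step 301


def can_continue(fault: Fault) -> bool:
    """Step 302: only a retry-recoverable input/output failure allows continuation."""
    return fault.source is FaultSource.IO and fault.retry_recoverable


print(can_continue(Fault(FaultSource.IO, "CRC", retry_recoverable=True)))
print(can_continue(Fault(FaultSource.MEMORY, "ECC", retry_recoverable=False)))
```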

[0025] If the failure allows processing to continue, recovery processing for the failure is performed (step 303). In this recovery processing, the failed part is disconnected and the operation is retried.

[0026] If the failure makes it impossible to continue processing, the service processor 44 sends, through the bus bridge 55, the LAN adapters 30-1 and 30-2, and the LANs 20-1 and 20-2, notification information stating that "a failure has occurred in node #1 and processing can no longer continue" to the other nodes #2 and #3 of the cluster system 10 (step 304). The service processor 44 then initializes the parts of node #1 other than the service processor (step 305).
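
Steps 301 to 305 of the failure processing routine can be put together as in the sketch below. It is a simplified model under stated assumptions: the fault is a plain dictionary, the recovery, notification, and initialization actions are injected as callables, and the function returns the list of step numbers it executed so the control flow is easy to inspect. None of these names come from the patent.

```python
def fault_routine(fault, *, recover, notify_peers, initialize_node):
    """Run the failure processing routine and return the step numbers executed."""
    steps = [301]  # step 301: analyze the cause of the failure
    io_related = fault["source"] == "io"
    recoverable = io_related and fault["retry_recoverable"]

    steps.append(302)  # step 302: can processing continue?
    if recoverable:
        recover(fault)  # step 303: disconnect the failed part and retry
        steps.append(303)
    else:
        notify_peers("node #1 has failed and cannot continue processing")
        steps.append(304)  # step 304: notify the other nodes
        initialize_node()  # step 305: initialize everything but the service processor
        steps.append(305)
    return steps


sent = []
fatal = fault_routine(
    {"source": "memory", "retry_recoverable": False},
    recover=lambda f: None,
    notify_peers=sent.append,
    initialize_node=lambda: None,
)
minor = fault_routine(
    {"source": "io", "retry_recoverable": True},
    recover=lambda f: None,
    notify_peers=sent.append,
    initialize_node=lambda: None,
)
print(fatal, minor, sent)
```

The notification is sent before the node is initialized, matching the order of steps 304 and 305 in the text.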

[0027] The other nodes #2 and #3 that receive the notification information decide, by a predetermined method, which node takes over the processing from the failed node #1, and that node takes over the processing.
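
The text leaves the "predetermined method" for choosing the takeover node open. One simple possibility, shown purely as an assumption, is for every surviving node to pick the lowest-numbered survivor, so that all nodes reach the same answer without further communication.

```python
def choose_successor(all_nodes, failed_nodes):
    """Pick the takeover node: here, the lowest-numbered surviving node (assumed rule)."""
    survivors = sorted(n for n in all_nodes if n not in failed_nodes)
    if not survivors:
        raise ValueError("no surviving node can take over")
    return survivors[0]


successor = choose_successor({1, 2, 3}, failed_nodes={1})
print(successor)
```

Other deterministic rules, such as a preconfigured priority list, would serve equally well for this purpose.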

[0028] The present invention has been described above with reference to a preferred embodiment, but it is not necessarily limited to that embodiment. For example, FIG. 1 describes a cluster system consisting of three nodes, but it goes without saying that the present invention can be applied in the same way to a cluster system consisting of two nodes, or of four or more nodes.

[0029]

[Effects of the Invention] As described above, according to the present invention, each computer constituting a node carries a dedicated processor that, in response to detection of a failure, determines whether the failure allows processing to continue. This lets another node quickly continue the interrupted processing, shortening the time from the occurrence of the failure to the takeover of processing. Moreover, instead of exchanging between nodes an operation confirmation signal that shows a node is operating normally, a node notifies the others that it has become abnormal and cannot continue processing. Therefore, even when a communication path is abnormal, the cluster system stays in a consistent state and reliability is increased. The availability and reliability of the cluster system are thus improved.

[Brief description of drawings]

FIG. 1 is a block diagram illustrating the overall configuration of a cluster system to which the present invention is applied.

FIG. 2 is a block diagram illustrating the internal configuration of a node of the cluster system of FIG. 1.

FIG. 3 is a flowchart illustrating the processing performed by the failure processing routine of the auxiliary CPU.

[Explanation of symbols]

10: cluster system
10-1, 10-2, 10-3: computers
20-1, 20-2: LANs
30-1 to 30-6: LAN adapters
41: CPU
42: memory
43: bus bridge
44: service processor
52: auxiliary CPU
53: auxiliary memory
54: interrupt control unit
55: bus bridge

Claims

[Claims]

1. A cluster system configured by connecting a plurality of computers, characterized in that each computer is provided with a dedicated processor that performs recovery processing and reconfiguration processing of a failed part in response to detection of a failure, and the dedicated processor detects and analyzes a failure of its own computer, determines whether the failure makes it impossible to continue processing, and, when it determines that processing cannot continue, notifies the other computers that its own computer cannot continue processing.

2. The cluster system according to claim 1, wherein the dedicated processor performs recovery processing for the failure when it is determined that the processing can be continued.

3. A cluster system configured by connecting a plurality of computers, characterized in that each computer includes a dedicated processor that performs recovery processing and reconfiguration processing of a failed part in response to detection of a failure; the dedicated processor includes interrupt control means that receives a failure occurrence signal from a component in the computer and generates an interrupt for a failure processing routine, and an auxiliary CPU that executes the failure processing routine in response to the interrupt from the interrupt control means; and, in the failure processing routine, the auxiliary CPU analyzes the failure to determine whether it makes it impossible to continue processing and, when it determines that processing cannot continue, notifies the other computers via the communication path that its own computer cannot continue processing.

4. The cluster system according to claim 3, wherein the auxiliary CPU of the dedicated processor performs recovery processing for the failure when it is determined that the processing can be continued.

5. The cluster system according to claim 3, wherein the computers are connected to each other via two communication paths, and the auxiliary CPU of the dedicated processor notifies the other computers via the communication paths that its own computer cannot continue processing.