JP3248485B2 - Cluster system, and monitoring scheme and method in a cluster system

Info

Publication number: JP3248485B2
Application number: JP14906798A
Authority: JP
Inventors: 光輝津端
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 1998-05-29
Filing date: 1998-05-29
Publication date: 2002-01-21
Anticipated expiration: 2018-05-29
Also published as: JPH11338725A

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a cluster system, and more particularly to a monitoring scheme and a monitoring method for a cluster system.

[0002]

[Prior Art] Referring to FIG. 5, in a conventional duplex system with a cluster configuration comprising an active device 1, a standby device 2, and a shared disk device 4, when a fault or failure occurs in the active device 1 while it is performing processing tasks and the client 3 determines that the active device 1 can no longer process, the processing tasks that were running on the active device 1 are taken over by the standby device 2 and continued there.

[0003] Japanese Unexamined Patent Application Publication No. H4-342333 proposes a diagnostic scheme in which a loop interface device and a loop monitoring unit are provided in a local area network having a loop-type transmission line, and the standby system is diagnosed when an operator enters a test command.

[0004] In such conventional systems, however, when processing tasks are handed over following a fault or failure in the active device 1, the tasks that were running on the active device 1 are sometimes not handed over to the standby device 2 successfully, which lowers the reliability of the system.

[0005] Handover fails, for example, when some fault or failure has occurred in the standby device 2 that should take over the tasks; when the client 3 cannot connect to the standby device 2; or when, after the handover, the standby device 2 cannot execute the tasks that were running on the active device 1 because of problems such as insufficient storage capacity in the local memory 27 or the local disk 28 of the standby device 2, or the inability of the standby device 2 to access the shared disk 4.

[0006]

[Problems to Be Solved by the Invention] In the conventional diagnostic scheme described above, the operator's test command merely causes the standby device in the system to be checked for faults using a bypass mode; no determination whatsoever is made, while the active device is operating, of whether its processing tasks could be taken over. Consequently, the conventional scheme gave no assurance that the standby device would have enough resources to execute the processing tasks at the time of switchover.

[0007] For example, suppose that the active device 1 is executing tasks using two central processing units (hereinafter "CPUs"), 30 MB of memory, and 4 GB of disk, while the standby device 2 has three CPUs, 100 MB of memory, and 10 GB of disk available. If one CPU and 10 MB of memory fail in the standby device 2, it still has two CPUs, 90 MB of memory, and 10 GB of disk available, so the processing tasks can be migrated from the active device 1. If, however, two CPUs and 8 GB of disk fail in the standby device 2, it has only one CPU, 100 MB of memory, and 2 GB of disk available, and the processing tasks cannot be migrated from the active device 1.
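The arithmetic of this example can be reproduced with a short sketch. The `Resources` type and the function names are illustrative assumptions, not part of the patent; the comparison treats an available amount equal to the usage as sufficient, which matches the first case above (two CPUs available against two in use).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Illustrative record of CPU count, memory (MB), and disk (GB)."""
    cpus: int
    memory_mb: int
    disk_gb: int


def after_failure(total: Resources, failed: Resources) -> Resources:
    """Subtract failed capacity from what the standby device had available."""
    return Resources(
        total.cpus - failed.cpus,
        total.memory_mb - failed.memory_mb,
        total.disk_gb - failed.disk_gb,
    )


def can_take_over(available: Resources, usage: Resources) -> bool:
    """True when no available amount is smaller than the matching usage."""
    return (
        available.cpus >= usage.cpus
        and available.memory_mb >= usage.memory_mb
        and available.disk_gb >= usage.disk_gb
    )


# Figures from paragraph [0007].
active_usage = Resources(cpus=2, memory_mb=30, disk_gb=4)
standby_total = Resources(cpus=3, memory_mb=100, disk_gb=10)

# Case A: one CPU and 10 MB of memory fail in the standby device.
case_a = after_failure(standby_total, Resources(cpus=1, memory_mb=10, disk_gb=0))
# Case B: two CPUs and 8 GB of disk fail in the standby device.
case_b = after_failure(standby_total, Resources(cpus=2, memory_mb=0, disk_gb=8))

print(can_take_over(case_a, active_usage))
print(can_take_over(case_b, active_usage))
```

Case A leaves enough of every resource, whereas case B falls short on both CPU count and disk, so migration is not possible.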

[0008] To solve this problem, an object of the present invention is to guarantee availability of the standby device matched to the active device, by providing a monitoring unit in each processing device so that the resource usage information of the active system and the resource status of the standby system are constantly monitored against each other.

[0009]

[Means for Solving the Problems] To solve the above problems, the monitoring scheme of the present invention for a cluster system in which an active device and a standby device are connected by a network is characterized in that: each of the active device and the standby device includes a monitoring unit that monitors faults and resources; the monitoring unit of the active system detects faults in the active device and the amount of resources in use, and notifies the monitoring unit of the standby system of the detected resource usage; and the monitoring unit of the standby system detects faults in the standby device and the amount of resources available, and determines that an abnormality has occurred when the detected available amount of resources of the standby device is smaller than the resource usage of the active device notified by the monitoring unit of the active system.

[0010]

[0011] Further, each of the active device and the standby device includes a monitoring line for monitoring, over which faults and resources are monitored and notifications are issued when an abnormality is detected.

[0012] The active device and the standby device are connected to each other by a communication line.

[0013] The monitoring unit of the active system notifies the monitoring unit of the standby system, via the communication line, of the detected resource usage in its own device. The monitoring unit of the standby system determines that a failure has occurred when it finds that the available amount of resources of the standby device is smaller than the resource usage of the active device notified via the communication line.

[0014] Further, the monitoring units of the active device and the standby device monitor each other via the communication line.

[0015] The resources are a CPU, a local memory, and a local disk.

[0016] When the monitoring unit of the active device or the standby device detects a failure in its own system's LAN control circuit, it reports the failure to the client through the other system.

[0017] Furthermore, in a cluster system in which an active device and a standby device, each including a monitoring unit that monitors faults and resources, are connected by a network, the following is performed while the active device is operating: faults and resource usage in the active device are detected, faults and the available amount of resources in the standby device are detected, and a failure of the standby device is determined to have occurred when the available amount of the standby device is found to be smaller than the resource usage of the active device.

[0018]

[Embodiments of the Invention] An embodiment of the present invention will now be described in detail with reference to the drawings.

[0019] Referring to FIG. 1, the embodiment of the cluster system of the present invention comprises an active device 1 that executes processing tasks; a standby device 2 that takes over those tasks if the active device 1 fails; a shared disk device 4 accessible from both the active device 1 and the standby device 2; a client 3 for managing the cluster system; a local area network (hereinafter "LAN") 5 connecting the active device 1, the standby device 2, and the client 3; and a communication path 6 connecting the active device 1 and the standby device 2 to each other.

[0020] The active device 1 has a first LAN control circuit 11 connected to the LAN 5, a path control circuit 12 connected to the communication path 6, and a disk control circuit 13 connected to the shared disk device 4 via a cable. The active device 1 further includes a monitoring unit 14 that periodically monitors the state of the LAN control circuit 11, the path control circuit 12, and the disk control circuit 13, as well as the usage and state of resources such as the CPU 16, the local memory 17, and the local disk 18. During normal operation the monitoring unit 14 sends the resource usage to the standby device 2; when it detects a failure in the active device 1, it notifies the client 3 or the standby device 2 accordingly. The monitoring unit 14 is connected to the LAN control circuit 11, the path control circuit 12, the disk control circuit 13, the CPU 16, the local memory 17, and the local disk 18 via a monitoring path 15.

[0021] The standby device 2 has the same configuration as the active device 1: it has a LAN control circuit 21, a path control circuit 22, and a disk control circuit 23, and includes a monitoring unit 24 and a monitoring path 25 for monitoring. The monitoring unit 24 periodically monitors the state of the LAN control circuit 21, the path control circuit 22, and the disk control circuit 23, as well as the available amount and state of resources such as the CPU 26, the local memory 27, and the local disk 28, and notifies the client 3 or the active device 1 when it detects a failure or a resource shortage.

[0022] In normal operation, the client 3 issues processing requests to the active device 1 via the LAN 5 and has the active device 1 execute processing tasks.

[0023] The active device 1 performs processing tasks in response to processing requests from the client 3. While it does so, the monitoring unit 14 monitors, via the monitoring path 15, the state of the LAN control circuit 11, the path control circuit 12, and the disk control circuit 13 and the usage of the CPU 16, the local memory 17, and the local disk 18, and notifies the monitoring unit 24 of the standby device 2, via the communication path 6, of the usage of the CPU 16, the local memory 17, and the local disk 18 of the active system. If the monitoring unit 14 has detected a failure, it notifies the client 3 accordingly.

[0024] The monitoring unit 24 of the standby device 2 monitors each part of the standby device 2 while tasks are being executed on the active device 1. Specifically, to determine whether the client 3 can connect to the standby device 2, whether processing tasks can be handed over from the active device 1 to the standby device 2, and whether the standby device 2 can access the shared disk 4, it monitors the LAN control circuit 21, the path control circuit 22, the disk control circuit 23, the CPU 26, the local memory 27, and the local disk 28 via the monitoring path 25. At the same time, the monitoring unit 24 compares the usage of the CPU 16, the memory 17, and the disk 18 of the active device 1, sent from the monitoring unit 14 of the active device 1 via the communication path 6, with the available amounts of the CPU 26, the memory 27, and the disk 28 of the standby device 2, and determines whether the standby device 2 has enough resources to execute the tasks currently running on the active device 1 if they are handed over. When it detects a failure or a resource shortage, the monitoring unit 24 notifies the client 3 accordingly. In this way, the standby device 2 is monitored in step with the operating status of the active device 1: when the available amount of resources in the standby device 2 is larger than the resource usage in the active device 1, it is determined that the standby device 2 can take over the tasks being processed on the active device 1.

[0025] Next, the operation of the embodiment of the present invention will be described.

[0026] Referring to FIGS. 1 and 2, the monitoring unit 14 in the active device 1 first monitors, through the monitoring path 15, the state and usage of resources such as the CPU 16, the local memory 17, and the local disk 18 in the active device 1 and the state of the LAN control circuit 11, the path control circuit 12, and the disk control circuit 13, and also monitors the monitoring unit 24 of the standby device 2 (step S1 in FIG. 2). If this monitoring detects an abnormality inside the active device 1 or an abnormality notification from the monitoring unit 24 of the standby device 2, the monitoring unit 14 notifies the client 3 of the abnormality (step S21). If no abnormality is detected, it sends the resource usage information of the active device 1 to the monitoring unit 24 of the standby device 2 through the communication path 6 (step S3).

[0027] Meanwhile, the monitoring unit 24 of the standby device 2 monitors, through the monitoring path 25 in the standby device 2, the state and available amount of resources such as the CPU 26, the local memory 27, and the local disk 28 in the standby device 2 and the state of the LAN control circuit 21, the path control circuit 22, and the disk control circuit 23, and also monitors the state of the monitoring unit 14 of the active device 1 (step S11). If this monitoring detects an abnormality inside the standby device 2 or an abnormality notification from the monitoring unit 14 of the active device 1, it notifies the client 3 (step S21). If no abnormality is detected, it compares the resource usage of the active device 1, sent via the communication path 6, with the available amount of resources of the standby device 2 (step S13). If even one of the resources, such as the CPU, the local memory, or the local disk, is judged to fall short of the usage of the corresponding resource in the active device 1, the client 3 is notified accordingly (step S14). If, on the other hand, every available amount of resources in the standby device 2 is larger than the corresponding resource usage of the active device 1, the process ends with no problem found. When the client 3 is notified by the active device 1 or the standby device 2 that a problem such as a failure has occurred, it takes action such as notifying the operator.
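The two monitoring flows of paragraphs [0026] and [0027] can be sketched as one cycle per side. This is a minimal illustration under stated assumptions: the function names, the dictionary keys, and the returned action tuples are hypothetical, and a resource counts as short only when the available amount is smaller than the usage, as in the claims.

```python
def active_cycle(local_fault, peer_alert, usage):
    """One pass of the active-side monitoring unit (steps S1 and S3)."""
    # S1: check local components, local resources, and the standby monitor.
    if local_fault or peer_alert:
        # Abnormality found: hand off to the notification steps (S21 onward).
        return ("notify_client", "abnormality")
    # S3: nothing wrong, so report current usage over the communication path.
    return ("send_usage", usage)


def standby_cycle(local_fault, peer_alert, available, usage):
    """One pass of the standby-side monitoring unit (steps S11, S13, S14)."""
    # S11: check local components, local resources, and the active monitor.
    if local_fault or peer_alert:
        return ("notify_client", "abnormality")
    # S13: compare each available amount with the usage reported by the active side.
    short = [name for name, used in usage.items() if available[name] < used]
    if short:
        # S14: at least one resource falls short, so report it to the client.
        return ("notify_client", "shortage: " + ", ".join(short))
    return ("ok", None)


usage = {"cpu": 2, "memory_mb": 30, "disk_gb": 4}
print(active_cycle(False, False, usage))
print(standby_cycle(False, False, {"cpu": 1, "memory_mb": 100, "disk_gb": 2}, usage))
```

Returning an action rather than performing it keeps the sketch independent of how notifications are actually transported, which the flow steps leave to the notification stage.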

[0028] Next, the handling when an abnormality is detected in the above operation is described in detail.

[0029] First, consider the case where the monitoring unit 24 of the standby device 2 detects an abnormality. The monitoring unit 24 first determines where the abnormality occurred (step S21). If the abnormal part is any of the path control circuit 22, the disk control circuit 23, the CPU 26, the local memory 27, the local disk 28, or the monitoring unit 14 of the active device 1, shown hatched in FIG. 3, or is the LAN 5, it notifies the client 3 of the abnormality via the LAN control circuit 21 and the LAN 5 (step S22). In contrast, as shown in FIG. 4, when it detects an abnormality in the LAN control circuit 21, the monitoring unit 24 notifies the monitoring unit 14 of the active device 1 of the abnormality in the LAN control circuit 21 via the path control circuit 22, the communication path 6, and the path control circuit 12 (step S23). On receiving this notification, the monitoring unit 14 notifies the client 3 of the abnormality in the LAN control circuit 21 of the standby device 2 via the LAN control circuit 11 and the LAN 5 (step S24).

[0030] Similarly, when the monitoring unit 14 of the active device 1 detects an abnormality anywhere other than the LAN control circuit 11, the client 3 is notified of the abnormality via the LAN control circuit 11 and the LAN 5. When an abnormality is detected in the LAN control circuit 11, on the other hand, the monitoring unit 24 of the standby device 2 is notified of the abnormality in the LAN control circuit 11 via the path control circuit 12, the communication path 6, and the path control circuit 22. On receiving this notification, the monitoring unit 24 of the standby device 2 notifies the client 3 of the abnormality in the LAN control circuit 11 via the LAN control circuit 21 and the LAN 5.
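The routing rule of paragraphs [0029] and [0030] can be written as a small lookup. The function name and the string labels are illustrative assumptions; the hop order follows the text: a device reports directly over its own LAN control circuit unless that circuit is the faulty part, in which case the report is relayed through the other device.

```python
from typing import List

# Component labels follow the reference numerals used in the description.
LAN_CONTROL = {"active": "LAN control circuit 11", "standby": "LAN control circuit 21"}
PATH_CONTROL = {"active": "path control circuit 12", "standby": "path control circuit 22"}
MONITOR = {"active": "monitoring unit 14", "standby": "monitoring unit 24"}


def notification_route(detector: str, faulty: str) -> List[str]:
    """Return the hops a fault report takes from the detecting monitor to the client."""
    if detector not in MONITOR:
        raise ValueError("detector must be 'active' or 'standby'")
    other = "standby" if detector == "active" else "active"
    if faulty != LAN_CONTROL[detector]:
        # Own LAN control circuit is usable: report directly (step S22).
        return [MONITOR[detector], LAN_CONTROL[detector], "LAN 5", "client 3"]
    # Own LAN control circuit is the faulty part: relay through the other
    # device over the communication path (steps S23 and S24).
    return [
        MONITOR[detector],
        PATH_CONTROL[detector],
        "communication path 6",
        PATH_CONTROL[other],
        MONITOR[other],
        LAN_CONTROL[other],
        "LAN 5",
        "client 3",
    ]


print(notification_route("standby", "CPU 26"))
print(notification_route("standby", "LAN control circuit 21"))
```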

[0031] The monitoring unit 14 of the active device 1 and the monitoring unit 24 of the standby device 2 monitor each other's state via the communication path 6. Each can therefore detect a failure of the other monitoring unit, which prevents situations in which a monitoring unit failure stops normal monitoring or keeps a failure notification from being issued.
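The description does not say how one monitoring unit judges that the other has failed. One common approach, shown here purely as an assumption, is a heartbeat with a timeout: each unit records when it last heard from its peer and treats prolonged silence as a peer failure. The class name, the timeout value, and the use of plain numeric timestamps are all hypothetical.

```python
class PeerWatch:
    """Tracks the last heartbeat from the peer monitoring unit (assumed mechanism)."""

    def __init__(self, timeout_s: float, started_at: float):
        self.timeout_s = timeout_s
        # Start the clock at creation so a peer that never reports is also caught.
        self.last_seen = started_at

    def record_heartbeat(self, now: float) -> None:
        """Call whenever a status message arrives over the communication path."""
        self.last_seen = now

    def peer_failed(self, now: float) -> bool:
        """True once the peer has been silent for longer than the timeout."""
        return now - self.last_seen > self.timeout_s


watch = PeerWatch(timeout_s=3.0, started_at=0.0)
watch.record_heartbeat(now=1.0)
print(watch.peer_failed(now=3.5))
print(watch.peer_failed(now=4.5))
```

Passing timestamps in explicitly, rather than reading a clock inside the class, keeps the sketch deterministic and easy to test.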

[0032]

[Effects of the Invention] As described above, in a cluster-configured system having an active device and a standby device, the present invention provides monitoring units that check whether the standby device has secured enough CPU, local memory, local disk capacity, and the like to execute the tasks currently running on the active device, and whether there is any abnormality in the control devices used to access the LAN and the shared disk device. This makes it possible to judge the availability of the standby system in step with the operating status of the active system, so that when an abnormality occurs in the active device the standby device can take over and execute the processing tasks without trouble. The invention thus has the effect of making the cluster system highly reliable.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of the embodiment of the present invention.

FIG. 2 is a flowchart showing the operation of the embodiment of the present invention.

FIG. 3 is a diagram showing a first state of the embodiment of the present invention.

FIG. 4 is a diagram showing a second state of the embodiment of the present invention.

FIG. 5 is a diagram showing the configuration of the prior art.

[Explanation of symbols]

1 Active device
2 Standby device
3 Client
4 Shared disk
5 LAN
6 Communication path
7, 8 Cable
11, 21 LAN control circuit
12, 22 Path control circuit
13, 23 Disk control circuit
14, 24 Monitoring unit
15, 25 Monitoring path
16, 26 CPU
17, 27 Local memory
18, 28 Local disk

Claims

(57) [Claims]

1. A monitoring scheme in a cluster system in which an active device and a standby device are connected by a network, wherein each of the active device and the standby device includes a monitoring unit that monitors faults and resources; the monitoring unit of the active system detects faults in the active device and the amount of resources in use, and notifies the monitoring unit of the standby system of the detected resource usage; and the monitoring unit of the standby system detects faults in the standby device and the amount of resources available, and determines that an abnormality has occurred when the detected available amount of resources of the standby device is smaller than the resource usage of the active device notified by the monitoring unit of the active system.

2. The monitoring scheme in a cluster system according to claim 1, wherein each of the active device and the standby device includes a monitoring line for monitoring, and performs monitoring of faults and resources and notification upon detection of an abnormality via the monitoring line.

3. The monitoring scheme in a cluster system according to claim 2, wherein the cluster system includes a communication line connecting the active device and the standby device, and the resource usage detected by the monitoring unit of the active system is notified to the monitoring unit of the standby system via the communication line.

4. The monitoring scheme in a cluster system according to claim 3, wherein the monitoring units of the active device and the standby device monitor each other via the communication line.

5. The monitoring scheme in a cluster system according to claim 1, 2, or 4, wherein the resources include a CPU, a local memory, and a local disk.

6. The monitoring scheme in a cluster system according to claim 1, wherein, when the monitoring unit of the active device or the standby device detects a failure in its own system's LAN control circuit, it reports the failure to the client through the other system.

7. A monitoring method in a cluster system in which an active device and a standby device each include a monitoring unit for monitoring faults and resources and are connected by a network, the method comprising: detecting faults and resource usage in the active device; detecting faults and the available amount of resources in the standby device; comparing the resource usage of the active device with the available amount of resources of the standby device; and determining that a failure has occurred in the standby device if the available amount of the standby device is smaller than the resource usage of the active device.

8. A cluster system comprising an active device and a standby device, wherein each of the active device and the standby device includes a monitoring unit that monitors faults and resources; the monitoring unit of the active device detects faults and resource usage in the active device; and the monitoring unit of the standby device detects faults in the standby device and the available amount of resources, and determines that an abnormality has occurred when the available amount of resources of the standby device is smaller than the resource usage of the active device.