JP2012014674A

JP2012014674A - Failure recovery method, server, and program in virtual environment

Info

Publication number: JP2012014674A
Application number: JP2010252891A
Authority: JP
Inventors: Takayuki Tanaka (田中崇幸); Mitsunori Imazaki (今崎充智); Takashi Saito (斉藤敬志)
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2010-06-04
Filing date: 2010-11-11
Publication date: 2012-01-19
Anticipated expiration: 2030-11-11
Also published as: JP5285045B2

Abstract

PROBLEM TO BE SOLVED: To provide a virtual environment in which the category of a failure can be determined and which can be operated without degrading service processing speed as a result of system switching at the time of failure.

SOLUTION: In a cluster system comprising an active machine and a spare machine in a virtual environment, the category of the failure location is determined from a log and, if the failure location is a resource, the cluster state of the active machine is transitioned to "a state that can transition to in-service." If the determined failure category is a continuity fault from a guest machine to a host machine, or a network fault, a continuity check is performed.

Description

The present invention relates to a failure recovery method and system in a virtual environment. In particular, it relates to a failure recovery method, a server, and a failure recovery program for a redundant system with virtualization, in which a virtual machine monitor that implements virtualization technology (for example, Xen or KVM) has been introduced, that identify why the active machine is not incorporated in the system and recover it.

As services grow in importance, demand for systems with little downtime is increasing. High-availability cluster software such as Heartbeat and Pacemaker has therefore been developed. It keeps a service running by building a redundant cluster of multiple servers and automatically switching servers when a failure occurs (see Non-Patent Document 1).

High-availability cluster software monitors the resources on the servers, the network, the shared disk, and so on. When it detects a failure on the server that is running the service, it switches to a server that has been standing by and continues the service.

FIG. 1 shows a schematic diagram of a cluster system using high-availability cluster software. The cluster system has a plurality of servers (an active machine and a spare machine) connected to a network, and a shared disk used in common by those servers.

The active machine and the spare machine each have an operating system (OS), high-availability cluster software, and resources, which are the components needed to provide a service. The high-availability cluster software detects a failure in the active machine and automatically switches to the spare machine when one occurs. The service operating state, resource operating state, and failure state of each server are stored in a state storage unit on the internal disk, and detailed information such as the failure location is stored in a log storage unit on the internal disk.

The active machine and the spare machine are connected to a network called the service LAN, over which they provide the service implemented by the resources to clients. They are also connected to a network called the interconnect LAN, over which they exchange information such as the service operating state, resource operating state, and failure state of each server. In addition, they are connected to a network called the management LAN, over which they can accept commands from a maintenance terminal.

The active machine and the spare machine can also be configured with a forced power-off function that forcibly cuts the power of the other server when a failure occurs. The forced power-off function cuts the other server's power by sending a power-off instruction to that server's hardware control board via the management LAN.

The shared disk is a storage device that holds the data used to provide the service, so that the service stays consistent. Because of the shared disk, the service can continue with the same data even after switching from the active machine to the spare machine.

Because the high-availability cluster software monitors resources for failures in this way, the service can be continued by switching to the spare machine when a resource failure occurs. After the switchover, the service continues on the spare machine (see, for example, Patent Document 1).
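The basic switchover behaviour described above can be sketched in a few lines of Python. This is only an illustration: the `Server` class and the `select_serving` function are hypothetical names, not part of Heartbeat or Pacemaker, and server health is reduced to a single flag.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """A cluster member; `healthy` stands in for the monitored failure state."""
    name: str
    healthy: bool = True


def select_serving(active: Server, spare: Server, serving: Server) -> Server:
    """Return the server that should provide the service next.

    The service stays where it is while that server is healthy, and moves
    to the other server only when the serving one has failed.
    """
    if serving.healthy:
        return serving
    candidate = spare if serving is active else active
    if candidate.healthy:
        return candidate
    raise RuntimeError("no healthy server is available")
```

Note that the sketch never moves the service back on its own: once the spare machine has taken over, it keeps serving even after the active machine becomes healthy again.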

Patent Document 1: Japanese Patent No. 4353005, "System Switching Method for a Cluster-Configured Computer System"

Non-Patent Document 1: Mitsui (三井一能) et al., "Development of the OSS Middleware Heartbeat for Improving Service Availability," NTT Technical Journal, March 2009, pp. 46-49

The above technology switches the system to the spare machine when a failure occurs in the active machine, but it does not take virtual environments into account. Moreover, it cannot identify the cause of a failure, such as a resource failure or a network failure, when one occurs. Depending on where the active machine failed, simply restarting it therefore does not guarantee that it will operate normally.

Furthermore, if the spare machine in a cluster system has lower performance than the active machine, processing speed drops while the spare machine is providing the service. Such a drop can occur not only when the spare machine is less capable than the active machine, but also when the cluster system consists of two or more active machines and a single spare machine.

The present invention has been made in view of the above points. Its object is to provide a failure recovery method and system for a virtual environment built on an open-source virtualization product such as Xen, as described in "Open Source Virtualization Technology," NTT Technical Journal, vol. 21, no. 8, August 2009, pp. 82-86, and elsewhere, that can determine the category of a failure and can be operated without the degradation in service processing speed caused by system switching at the time of failure.

To solve the above problems, the failure recovery method in a virtual environment of the present invention (Claim 1) is a method for identifying the cause of a failure in a cluster system composed of an active machine and a spare machine in a virtual environment in which a hypervisor has been introduced, where the active machine and the spare machine each include cluster state management means that manages a cluster state indicating the service operating state of the active machine and the spare machine, state management information storage means that stores the cluster state and the failure state, and log storage means that stores a failure log indicating the failure location, and where the spare machine is in service and the active machine is not incorporated in the cluster configuration. The method performs:
a failure location estimation step in which failure location estimation means of the active machine estimates the failure location of the active machine based on the result of searching the log storage means of the active machine;
a continuity check step in which, when the failure location of the active machine is estimated in the failure location estimation step to be a network failure, continuity check means of the active machine checks continuity to the router connected to the active machine; and
a cluster configuration activation step in which, when the continuity check to the router succeeds, cluster configuration activation means of the active machine transitions the cluster state of the active machine to "a state that can transition to in-service."
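The three steps of Claim 1 can be read as a short decision procedure. The Python sketch below is one possible reading, not the patented implementation: the function name, the `"network"` category label, and the two injected callables are assumptions made for illustration. `SBY[online]` and `NONE` are the labels that the glossary later in this description gives to the state that can transition to in-service and to the state of not being incorporated in the cluster.

```python
from typing import Callable

NETWORK_FAILURE = "network"   # hypothetical label for the network failure category
READY_STATE = "SBY[online]"   # state that can transition to in-service
NOT_IN_CLUSTER = "NONE"       # not incorporated in the cluster configuration


def recover_active_machine(
    estimate_failure_location: Callable[[], str],
    router_reachable: Callable[[], bool],
) -> str:
    """Return the cluster state the active machine should hold afterwards."""
    # Step 1: estimate the failure location from the active machine's log.
    location = estimate_failure_location()
    if location != NETWORK_FAILURE:
        # Other categories are outside this sketch of Claim 1.
        return NOT_IN_CLUSTER
    # Step 2: check continuity to the router connected to the active machine.
    if router_reachable():
        # Step 3: continuity confirmed, so rejoin as a promotable member.
        return READY_STATE
    return NOT_IN_CLUSTER
```

Passing the two steps in as callables keeps the decision logic separate from how the log is searched and how the router is probed.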

Further, in the present invention (Claim 2), in an environment in which a host machine and a guest machine are installed on the active machine,
when the fault is identified in the continuity check step as a communication failure between the host machine and the guest machine,
the continuity check step checks for a continuity fault from the guest machine to the host machine, and
the cluster configuration activation step transitions the cluster state of the active machine to "a state that can transition to in-service" when continuity from the guest machine to the host machine succeeds.

Further, in the present invention (Claim 3), when the failure location estimation step estimates that the category of the failure location is a resource, the cluster configuration activation step transitions the cluster state of the active machine to "a state that can transition to in-service."

Further, in the present invention (Claim 4), a forced power-off function that forcibly cuts the power of the other server at the time of failure is installed on the active machine and the spare machine.
In the failure location estimation step, when an error relating to the forced power-off function is found in the log storage means of the spare machine, it is determined that a serious error has been detected in the active machine.
In the continuity check step, continuity is checked from the spare machine to the power control means of the active machine.
In the cluster configuration activation step, when continuity to the power control means succeeded in the continuity check step, the cluster state of the active machine is transitioned to "a state that can transition to in-service."
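The forced power-off branch of Claim 4 can be sketched in the same style. Everything named here is hypothetical: the log line format, the `"forced power-off"` marker, and the function names are assumptions, since the description does not give the actual log messages.

```python
from typing import Callable, Iterable, Optional


def has_forced_power_off_error(
    spare_log: Iterable[str], marker: str = "forced power-off"
) -> bool:
    """True if the spare machine's log holds an error about forced power-off."""
    return any(
        "error" in line.lower() and marker in line.lower() for line in spare_log
    )


def next_state_after_power_check(
    spare_log: Iterable[str],
    power_controller_reachable: Callable[[], bool],
) -> Optional[str]:
    """Decide the active machine's cluster state for the forced power-off case.

    Returns None when no forced power-off error is logged, meaning this
    branch does not apply.
    """
    if not has_forced_power_off_error(spare_log):
        return None
    # Continuity is checked from the spare machine to the active machine's
    # power control means.
    if power_controller_reachable():
        return "SBY[online]"
    return "NONE"
```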

The server (active machine) of the present invention (Claim 5) is a server that operates as the active machine for identifying the cause of a failure in a cluster system composed of an active machine and a spare machine in a virtual environment in which a hypervisor has been introduced, where the active machine and the spare machine each include cluster state management means that manages a cluster state indicating the service operating state of the active machine and the spare machine, state management information storage means that stores the cluster state and the failure state, and log storage means that stores a failure log indicating the failure location, and where the spare machine is in service and the active machine is not incorporated in the cluster configuration. The server is characterized by having:
failure location estimation means that estimates the failure location of the active machine based on the result of searching the log storage means;
continuity check means that, when the failure location estimation means estimates that the failure location of the active machine is a network failure, checks continuity to the router connected to the active machine; and
cluster configuration activation means that, when the continuity check to the router succeeds, transitions the cluster state of the active machine to "a state that can transition to in-service."

Further, in the server of the present invention (Claim 6), in an environment in which a host machine and a guest machine are installed on the server, the continuity check means includes means for checking continuity from the guest machine to the host machine when the fault is identified as a communication failure between the host machine and the guest machine, and
the cluster configuration activation means includes means for transitioning the cluster state of the active machine to "a state that can transition to in-service" when the continuity check means finds that continuity from the guest machine to the host machine succeeded.

Further, the cluster configuration activation means of the server of the present invention (Claim 7) further has means for transitioning the cluster state of the server to "a state that can transition to in-service" when the failure location estimation means finds that the category of the failure location is a resource.

Further, in the server of the present invention (Claim 8), in an environment in which a forced power-off function that forcibly cuts the power of the other server at the time of failure is installed on the active machine and the spare machine, the cluster configuration activation means includes means that does not transition the cluster state of the server to "a state that can transition to in-service" when an error relating to the forced power-off function is detected on the spare machine and the continuity check means of the spare machine, checking continuity from the spare machine to the power control means of the active machine, finds a continuity fault.

The present invention (Claim 9) is a program for causing a computer to function as each of the means constituting the server according to any one of Claims 5 to 8.

As described above, according to the present invention, estimating why the active machine is not incorporated in the cluster configuration in a virtual environment makes recovery easier to carry out.

In addition, when the estimated category of the failure cause is a resource failure and the service is running on the spare machine, the service can be resumed on the active machine, which improves both fault tolerance and performance.

FIG. 1 is a schematic diagram of a cluster system using high-availability cluster software.
FIG. 2 is a system configuration diagram in an embodiment of the present invention.
FIG. 3 is a detailed configuration diagram of the main part in an embodiment of the present invention.
FIG. 4 shows the items stored in the state management information storage unit in an embodiment of the present invention.
FIG. 5 shows the initial state and the end state in an embodiment of the present invention.
FIG. 6 shows an outline of the operation in an embodiment of the present invention.
FIG. 7 is a flowchart (part 1) of a series of operations in an example of the present invention.
FIG. 8 is a flowchart (part 2) of a series of operations in an example of the present invention.
FIG. 9 shows the operation of step 101 of FIG. 7 in an example of the present invention.
FIG. 10 shows the operation of steps 101, 104, and 107 of FIG. 7 in an example of the present invention.
FIG. 11 shows the operation of step 103 of FIG. 7 in an example of the present invention.
FIG. 12 shows the operation of step 105 of FIG. 7 in an example of the present invention.
FIG. 13 shows the operation of step 108 of FIG. 7 in an example of the present invention.
FIG. 14 shows the operation of step 109 of FIG. 8 in an example of the present invention.
FIG. 15 shows the operation of step 110 of FIG. 8 in an example of the present invention.
FIG. 16 shows the operation of step 111 of FIG. 8 in an example of the present invention.
FIG. 17 shows the operation of step 112 of FIG. 8 in an example of the present invention.
FIG. 18 shows the operation of step 113 of FIG. 8 in an example of the present invention.
FIG. 19 shows the operation of step 114 of FIG. 8 in an example of the present invention.

Embodiments of the present invention are described below with reference to the drawings.

First, the terms used in this specification are defined.

・Cluster configuration:
A technique that interconnects multiple servers so that, to users and other servers, they behave as a single server; the servers can be managed as if they were one. Even if one server stops, the system as a whole does not stop, and repair or replacement is possible while processing continues.

・Resource:
A component needed to provide a service. In a cluster configuration, a resource is an application that the high-availability cluster software controls by starting, stopping, and monitoring it. Such applications include databases (DBs).

・ACT:
The state in which a server is running the service. In a cluster configuration, the state of a server on which a service-providing resource such as a DB is running is written "ACT".

・SBY[online]:
A state from which the server can transition to "ACT". In a cluster configuration, a server to which resources can be switched over from the "ACT" server when a system switchover occurs, for example because of a failure, is written "SBY[online]".

・SBY[standby]:
In a cluster configuration, a server that is held back so that it does not become "ACT" even when a system switchover occurs, for example because of a failure, is written "SBY[standby]".

・SBY[in transition]:
The state of a server that is about to transition to "ACT" (a switchover is in progress). In a cluster configuration, a server that is about to transition to "ACT" following a system switchover caused by a failure or the like, but is waiting for the resource stop processing on the active machine to finish normally, is written "SBY[in transition]".

・NONE:
The state in which a server is not incorporated in the cluster configuration. A server that is not incorporated in the cluster configuration because the server itself or its high-availability cluster software is stopped is written "NONE".
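The cluster states defined above map naturally onto an enumeration. The following Python sketch is an illustration only: the class and helper names are hypothetical, and the string value used for the in-transition state is an English rendering chosen here rather than a label taken from any cluster software.

```python
from enum import Enum


class ClusterState(Enum):
    """Cluster states from the glossary above."""
    ACT = "ACT"                                # running the service
    SBY_ONLINE = "SBY[online]"                 # may transition to ACT on switchover
    SBY_STANDBY = "SBY[standby]"               # held back from becoming ACT
    SBY_IN_TRANSITION = "SBY[in transition]"   # waiting for the peer to stop resources
    NONE = "NONE"                              # not incorporated in the cluster


def can_take_over(state: ClusterState) -> bool:
    """Only a SBY[online] server may be promoted to ACT by a switchover."""
    return state is ClusterState.SBY_ONLINE


def is_in_cluster(state: ClusterState) -> bool:
    """Every state except NONE means the server is part of the cluster."""
    return state is not ClusterState.NONE
```

In these terms, the recovery described in this specification aims to move the active machine from `NONE` back to `SBY_ONLINE`.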

FIG. 2 shows the system configuration in an embodiment of the present invention.

The cluster system shown in the figure includes a plurality of servers (referred to in this embodiment as the active machine 10 and the spare machine 20), each having a host machine, which is the base of the virtual servers in the virtual environment, a guest machine, and a hypervisor, which is the control program that controls the virtual machines. These servers are connected to the shared disk 30, the client device 1, the service LAN 2, the router 3, the maintenance terminal 4, and the management LAN 5.

The active machine 10 and the spare machine 20 each provide the service to the client device 1 via the router 3. The active machine 10 may have higher performance than the spare machine 20. The figure shows one active machine 10 and one spare machine 20, but there may be two or more active machines and one spare machine 20.

The active machine 10 has an internal disk 160, which includes a host machine allocation disk 161 allocated to the host machine and a guest machine allocation disk 162 allocated to the guest machine; a hardware control board 17; and interfaces for communication among these components. The guest machine allocation disk 162 of the internal disk 160 includes a log storage unit 163 and a state management information storage unit 164.

The host machine 150 of the active machine 10 has a device batch monitoring unit 151, which provides a disk monitoring function and a network monitoring function. Between the host machine 150 and the hypervisor 170 are a management LAN network virtual interface (I/F) 13 and a service LAN network virtual I/F 14.

The disk monitoring function of the device batch monitoring unit 151 in the host machine 150 of the active machine 10 monitors the internal disk 160 and the shared disk 30. The result is communicated between the host machine 150 and the guest machine 110: it is reported to the high-availability cluster software 140 of the guest machine via the host machine's management LAN network virtual I/F 13, the hypervisor 170, and the guest machine's management LAN network virtual I/F 15, and is also stored in the log storage unit 163. If the disk monitoring function outputs an error message, the likely cause is a service LAN communication failure between the active machine 10 and the router 3 or a path failure between the host machine and the guest machine.

The network monitoring function of the device batch monitoring unit 151 in the host machine 150 of the active machine 10 checks continuity to the router 3 via the service LAN network virtual I/F 14, the hypervisor 170, and the service LAN network I/F 11. The result is communicated between the host machine and the guest machine: it is reported to the high-availability cluster software 140 of the guest machine 110 via the service LAN network virtual I/F 14 of the host machine 150, the hypervisor 170, and the service LAN network virtual I/F 16 of the guest machine 110, and is also stored in the log storage unit 163. If the network monitoring function outputs an error message, the likely cause is a service LAN communication failure between the active machine 10 and the router 3 or a path failure between the host machine 150 and the guest machine 110.

The guest machine 110 of the active machine 10 has a resource 120, a control execution unit 130, and high-availability cluster software 140 for reporting its own operating state to the outside. Between the guest machine 110 and the hypervisor 170 are a management LAN network virtual I/F 15 and a service LAN network virtual I/F 16. The configuration of the guest machine is described later with reference to FIG. 3.

Like the active machine 10, the spare machine 20 has a host machine 250, a guest machine 210, and a hypervisor 270 installed. It also has an internal disk 260, which includes a host machine allocation disk 261 and a guest machine allocation disk 262, and a hardware control board 27. The guest machine allocation disk 262 includes a log storage unit 263 and a state management information storage unit 264.

The host machine 250 of the spare machine 20 has a device batch monitoring unit 251, which provides a disk monitoring function and a network monitoring function. Between the host machine 250 and the hypervisor 270 are a management LAN network virtual interface (I/F) 23 and a service LAN network virtual I/F 24.

The guest machine 210 of the spare machine 20 has a resource 220, a control execution unit 230, and high-availability cluster software 240. Between the guest machine 210 and the hypervisor 270 are a management LAN network virtual I/F 25 and a service LAN network virtual I/F 26.

The following describes the configuration centered on the guest machine of the active machine 10, which may have failed, is in the "NONE" state, and is not incorporated in the cluster configuration.

FIG. 3 shows the detailed configuration of the main part in an embodiment of the present invention.

The guest machine 110 of the active machine 10 has a resource 120, a control execution unit 130, and high-availability cluster software 140. The state management information storage unit 164 of the internal disk 160 is connected to the high-availability cluster software 140. Only the active machine 10 is described here, but the spare machine 20 has the same configuration.

The control execution unit 130 has a log search unit 131, a continuity check unit 132, a system switching unit 133, a state check execution unit 134, a start execution unit 135, a command execution unit 136, and a failure location estimation unit 137. The log storage unit 163 of the internal disk 160 is connected to the log search unit 131.

When a login is made from the maintenance terminal 4, the log search unit 131 searches the log storage unit 163 under the control of the failure location estimation unit 137, obtains the error messages, and passes them to the failure location estimation unit 137.
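A log search of this kind can be as simple as filtering lines. The sketch below assumes a plain-text log in which error entries contain the marker `ERROR`; both the marker and the function name are assumptions, since the description does not specify the log format.

```python
from typing import Iterable, List


def search_error_messages(log_lines: Iterable[str], marker: str = "ERROR") -> List[str]:
    """Return the log lines that contain the error marker, without newlines."""
    return [line.rstrip("\n") for line in log_lines if marker in line]
```

Because the function accepts any iterable of strings, an open file object can be passed directly, so the log does not have to be loaded into memory first.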

Under the control of the failure location estimation unit 137, the continuity check unit 132 executes a ping, checks continuity to the router 3, and passes the result to the failure location estimation unit 137.
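A ping-based continuity check might look like the following Python sketch. It assumes a Linux host with the iputils `ping` command, whose `-c` option sets the number of echo requests and `-W` the reply timeout in seconds; the function names and the injectable `run` parameter are hypothetical and exist only to keep the example testable without a network.

```python
import subprocess
from typing import Callable, List


def build_ping_command(host: str, count: int = 1, timeout_s: int = 1) -> List[str]:
    """Build an iputils-style ping command line (Linux syntax assumed)."""
    return ["ping", "-c", str(count), "-W", str(timeout_s), host]


def check_continuity(host: str, run: Callable = subprocess.run) -> bool:
    """Return True if the host answers a ping, False otherwise."""
    try:
        result = run(
            build_ping_command(host),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
    except OSError:
        # The ping executable is missing or could not be started.
        return False
    # ping exits with status 0 when at least one reply was received.
    return result.returncode == 0
```

For example, `check_continuity("192.0.2.1")` would probe a router at that documentation-range address; the address is a placeholder, not one taken from the source.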

系切り替え部１３３は、現用機１０及び予備機２０のクラスタ状態を変更する。 The system switching unit 133 changes the cluster state of the active machine 10 and the spare machine 20.

状態確認実行部１３４は、高可用性クラスタソフトの状態コマンドを実行することにより取得したクラスタ状態を確認する。 The state confirmation execution unit 134 confirms the cluster state acquired by executing the state command of the high availability cluster software.

起動実行部１３５は、高可用性クラスタソフト１４０を起動する。 The activation execution unit 135 activates the high availability cluster software 140.

コマンド実行部１３６は、高可用性クラスタソフト１４０にコマンドを実行させる。 The command execution unit 136 causes the high availability cluster software 140 to execute a command.

故障箇所推定部１３７は、ログ検索部１３１、導通確認部１３２から取得したエラーメッセージにより故障箇所を推定する。 The failure location estimation unit 137 estimates the failure location based on the error message acquired from the log search unit 131 and the continuity confirmation unit 132.

The high-availability cluster software 140 operates as the "cluster state management means" and includes a failure monitoring function that monitors the failure state of the server, a resource start/stop function that starts and stops resources based on the cluster state and the failure state, and a state management function that manages the cluster state in the state management information storage unit 164 based on the failure state. As its failure monitoring function, it has a network monitoring result acquisition unit 141 and a disk monitoring result acquisition unit 142. It is started by the start execution unit 135 of the control execution unit 130 and executes commands under the control of the command execution unit 136, thereby updating the contents of the state management information storage unit 164 and notifying the other server (the standby machine 20) of the cluster state and related information. The network monitoring result acquisition unit 141 acquires the continuity result for the router 3 from the network monitoring function of the device batch monitoring unit 151 of the guest machine via the service LAN network virtual I/F 14, the hypervisor 170, and the service LAN network virtual I/F 16. The disk monitoring result acquisition unit 142 acquires the disk monitoring result from the device batch monitoring unit 151 of the host machine via the management LAN network virtual I/F 13 of the host machine, the hypervisor 170, and the management LAN network virtual I/F 15 of the guest machine 110.

The log storage unit 163 stores failure logs for disks, networks, and the like acquired from the device batch monitoring unit 151 of the host machine. Likewise, the log storage unit 263 of the standby machine 20 stores failure logs acquired from the device batch monitoring unit 251 of its host machine.

As shown in FIG. 4, the state management information storage unit 164 stores flags and values for the cluster state, the failure count, the error status, and the resource state. These values are managed by the state management function of the high-availability cluster software 140.
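As a rough illustration, the record held in the state management information storage unit could be modeled as follows. This is only a sketch: the class names, field names, and Python types are assumptions made here, and the one-line descriptions of the two standby states are an interpretation; the description itself names only the four items and the state labels.

```python
from dataclasses import dataclass
from enum import Enum


class ClusterState(Enum):
    """Cluster state labels that appear in the description."""
    NONE = "NONE"                 # not incorporated in the cluster configuration
    ACT = "ACT"                   # service in operation
    SBY_ONLINE = "SBY[online]"    # standby that can transition to in-service (interpretation)
    SBY_STANDBY = "SBY[standby]"  # standby state used during system switching (interpretation)


@dataclass
class StateRecord:
    """Hypothetical model of one machine's entry in the state management information."""
    cluster_state: ClusterState = ClusterState.NONE
    failure_count: int = 0
    error_status: int = 0
    resource_state: int = 0  # the state listings show 1 as "Started"

    def resource_started(self) -> bool:
        return self.resource_state == 1


# The states listed at login time (step 101) would then look like this.
active = StateRecord()
standby = StateRecord(cluster_state=ClusterState.ACT, resource_state=1)
```

Keeping the labels in an enum makes a misspelled state such as "SBY[onlin]" fail at lookup rather than silently compare unequal.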

The following outlines the sequence of operations, from failure detection to recovery, in the above configuration.

First, FIG. 5 shows the initial state and the end state.

In the initial state, the state confirmation processing has been executed on the standby machine 20, the failure count in the state management information storage unit 264 is "0", the error status is "0", the cluster state of the standby machine 20 is "ACT", and the cluster state of the active machine 10 is "NONE"; in this situation, the power switch of the active machine 10 has been turned on and its OS started.

The end state is the state reached when failure estimation finds that the failure category is a "resource" failure and that resource is recovered: the cluster state of the active machine 10 is "ACT" and the cluster state of the standby machine 20 is "SBY[online]".

The series of processes in FIG. 5 is described below.

FIG. 6 is a sequence chart showing an outline of the operation in one embodiment of the present invention.

Step 1) With the active machine 10 in the "NONE" state, that is, not incorporated in the cluster configuration, the OS of the active machine 10 is started. At this time, the standby machine 20 is in the "ACT" state, with the service in operation.

The active machine 10 and the standby machine 20 accept a login from the maintenance terminal 4 via the management LAN 5, and the log search unit 131 of the control execution unit 130 searches the log storage unit 163 and acquires an error message. Here, it is assumed that an error message indicating a resource error has been acquired.

Step 2) The control execution unit 130 then outputs the error message to the maintenance terminal 4, thereby requesting a continuity check of the forced power-off function. In response, the maintenance terminal 4 logs in to the standby machine 20, and the continuity confirmation unit 232 of the control execution unit 230 of the standby machine 20 checks continuity to the hardware control board 17 of the active machine 10. Here, continuity is assumed to have succeeded. Even if the continuity check fails, continuation of the service is not greatly affected, so the active machine 10 can still be incorporated into the cluster configuration.

Step 3) The failure location estimation unit 137 determines from the results of steps 1 and 2 that the failure location is a "resource".

Step 4) The start execution unit 135 starts the high-availability cluster software 140.

Step 5) The high-availability cluster software 140 sets the cluster state of the active machine 10 in the state management information storage unit 164 to "SBY[online]".

Step 6) Next, the high-availability cluster software 140 changes the cluster state of the standby machine 20 from "ACT" to "SBY[standby]" and notifies the standby machine 20 of the updated state management information for the active machine 10 and the standby machine 20. The high-availability cluster software 240 of the standby machine 20 then updates the contents of the state management information storage unit 264.

Step 7) The state confirmation execution unit 134 confirms that the cluster state of the standby machine 20 is "SBY[standby]" and that the cluster state of the active machine 10 is "SBY[online]", and, to complete the system switching of the standby machine 20, returns the cluster state of the standby machine 20 to "SBY[online]".

Step 8) The state confirmation execution unit 134 executes the state confirmation processing. If the cluster state of the standby machine 20 is not "SBY[online]", it outputs an error and ends the processing.

The processing of the present invention is described in detail below with reference to the drawings.

FIG. 7 and FIG. 8 are flowcharts of a series of operations in one example of the present invention.

The "cluster state", "failure count", "error status", "resource state", and so on mentioned in the following steps are the values stored in the state management information storage units 164 and 264 of the active machine 10 and the standby machine 20.

Before the recovery method is executed, the state confirmation execution unit 134 may execute the state confirmation command, read the cluster state stored in the state management information storage unit 164 of the active machine, and confirm that the cluster state of the active machine 10 is "NONE" and that the cluster state of the standby machine 20 is "ACT". Because the high-availability cluster software 140 and 240 exchange the information in the state management information storage units 164 and 264 between the active machine 10 and the standby machine 20, the state confirmation command can read the states of both the active machine 10 and the standby machine 20.

Step 101) As shown in FIG. 9, the control unit on the management LAN 5 logs in from the maintenance terminal 4 to the control execution units 130 and 230 of the active machine 10 and the standby machine 20. If the login to the active machine 10 fails, an error is output to a display device or the like connected to the maintenance terminal 4, and the processing ends.

At the time of this login, the active machine 10 and the standby machine 20 are in the following states.

(State of the active machine)
Cluster state: "NONE"
Failure count: 0
Error status: 0
Resource state: 0
(State of the standby machine)
Cluster state: "ACT"
Failure count: 0
Error status: 0
Resource state: 1 (Started)

Step 102) Next, as shown in FIG. 10, the log search unit 131 of the control execution unit 130 of the guest machine 110 of the active machine 10 executes a search command provided by the OS and acquires the matching error messages. If an error message concerning the router is acquired, it is treated as an error between the active machine 10 and the router 3, and the process proceeds to step 103 for continuity confirmation.

If there is no matching message concerning the router, error messages concerning the disk are searched for. If a message concerning the disk is found, it can be inferred that a problem has occurred in access to the shared disk 30, so an error message is output to the maintenance terminal 4 and the processing ends.
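The branching in step 102 can be sketched as a small classifier over log lines. This is a minimal sketch under stated assumptions: the keywords "router" and "disk", the function name, and the returned action labels are all hypothetical, since the description does not give the actual log strings or search command.

```python
from typing import Iterable

# Hypothetical search keywords; the real log messages are not given in the description.
ROUTER_KEYWORD = "router"
DISK_KEYWORD = "disk"


def next_action(log_lines: Iterable[str]) -> str:
    """Decide what follows the log search of step 102."""
    lines = [line.lower() for line in log_lines]
    # Router messages are checked first, as in the description.
    if any(ROUTER_KEYWORD in line for line in lines):
        return "ping_router"        # proceed to the continuity check of step 103
    # Only when no router message exists are disk messages searched.
    if any(DISK_KEYWORD in line for line in lines):
        return "report_disk_error"  # shared disk access problem: report and stop
    return "check_resource"         # continue with the resource check of step 104
```

For example, a log containing both a router message and a disk message yields "ping_router", because the router search takes precedence.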

Step 103) If an error message concerning the router was acquired in step 102, then, as shown in FIG. 11, to improve the accuracy of failure estimation, the continuity confirmation unit 132 of the control execution unit 130 of the guest machine of the active machine 10 makes a definitive continuity diagnosis by sending PING, an OS-provided command for checking network reachability, to the IP address of the router 3. Because temporary failures caused by momentary network interruptions are also possible, PING may be sent a predetermined number of times at predetermined intervals. If continuity fails, a failure is likely to recur even if the active machine 10 is incorporated into the cluster configuration, so an error is output to the maintenance terminal 4 and the processing ends. If continuity succeeds, the failure of the router 3 is regarded as temporary, and the active machine 10 can be incorporated into the cluster configuration.
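The repeated PING of step 103 might look like the sketch below. The description does not say whether one success or all successes are required, so treating a single successful probe as "continuity succeeded" is an assumption of this sketch, as are the function names and default values. `ping_once` assumes a Unix-like `ping` that accepts `-c` for the packet count.

```python
import subprocess
import time
from typing import Callable


def ping_once(host: str) -> bool:
    """Send one echo request; assumes a Unix-like `ping` supporting `-c`."""
    result = subprocess.run(
        ["ping", "-c", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def confirm_continuity(probe: Callable[[], bool], attempts: int = 3,
                       interval_s: float = 1.0) -> bool:
    """Probe up to `attempts` times, `interval_s` apart.

    Returns True at the first successful probe (assumed success criterion),
    and False when every probe fails.
    """
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            time.sleep(interval_s)  # wait out a possible momentary interruption
    return False
```

A caller would pass something like `lambda: ping_once(router_ip)`, where `router_ip` is a placeholder for the address of the router 3; taking the probe as a parameter also lets the retry logic be exercised without a network.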

Step 104) Next, the failure location estimation unit 137 of the active machine 10 determines from the log search result whether the failure is a resource failure. As shown in FIG. 10, the log search unit 131 of the control execution unit 130 of the guest machine of the active machine 10 searches the log storage unit 163 for error messages. The failure location estimation unit 137 acquires the resource ID from the retrieved error message and identifies the failed resource from that resource ID. If the identified resource is a DB or the like, the process proceeds to step 106 in order to improve fault tolerance and performance, and failure estimation is performed for the forced power-off function used in system switching.
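Identifying the failed resource from a resource ID, as step 104 describes, could be sketched as follows. The message layout `resource_id=<ID>`, the ID "r01", and the lookup table are purely hypothetical; the description does not specify how the resource ID appears in an error message.

```python
import re
from typing import Optional

# Hypothetical message layout: "... resource_id=<ID> ..."
RESOURCE_ID_PATTERN = re.compile(r"resource_id=(\w+)")

# Hypothetical table mapping resource IDs to resource names.
RESOURCES = {"r01": "DB"}


def failed_resource(message: str) -> Optional[str]:
    """Return the name of the failed resource, or None if it cannot be identified."""
    match = RESOURCE_ID_PATTERN.search(message)
    if match is None:
        return None  # the message carries no resource ID
    return RESOURCES.get(match.group(1))  # None for an unknown ID
```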

After the failure location at the time the active machine stopped has been identified, the process proceeds to step 105 if the forced power-off function has been introduced, and to step 107 if it has not. To determine whether the forced power-off function is present, a flag indicating its presence may be set in the state management information storage unit 164 in advance and then referred to. The setting of the forced power-off function may be made for the system as a whole or for each server. Furthermore, the active machine 10 and the standby machine 20 may each be provided with a forced power-off function unit that, when a failure occurs in the network or the shared disk 30, for example, sends an instruction to the power control unit of the other server to forcibly cut its power. When the forced power-off function unit is provided, an error is output to the maintenance terminal 4.

Step 105) If the forced power-off function has been introduced, the cause of the failure of the forced power-off function is identified on the standby machine 20 in response to an instruction from the maintenance terminal 4, as shown in FIG. 12. The log search unit 231 of the control execution unit 230 of the guest machine of the standby machine 20 executes a search command provided by the OS and acquires the matching error message log from the log storage unit 263. The failure location estimation unit 237 then determines whether the forced power-off processing (reset) by the forced power-off function failed and notifies the maintenance terminal 4 of the result.

Specifically, a keyword is entered in the log search unit 231 of the control execution unit 230 of the standby machine 20, and the log storage unit 263 of the standby machine 20 is searched for matching error messages. If one or more matching error messages are output, it is determined that the forced power-off (reset) failed.
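The "one or more matching messages" rule of step 105 reduces to a keyword search and a count. In this sketch the function names and the keyword used in the example are hypothetical; the description does not state the actual keyword.

```python
from typing import Iterable, List


def matching_messages(log_lines: Iterable[str], keyword: str) -> List[str]:
    """Return every log line that contains the search keyword."""
    return [line for line in log_lines if keyword in line]


def forced_power_off_failed(log_lines: Iterable[str], keyword: str) -> bool:
    """The reset is judged to have failed when at least one message matches."""
    return len(matching_messages(log_lines, keyword)) >= 1
```

For instance, with the hypothetical keyword "reset failed", a log holding a single line that contains it is enough for the function to return True.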

Step 106) To improve the accuracy of failure estimation for the hardware control board 17, the continuity confirmation unit 232 of the control execution unit 230 of the standby machine 20, in response to an instruction from the maintenance terminal 4, performs PING processing addressed to the hardware control board 17 of the active machine 10 via the management LAN 5, confirms that continuity succeeds, and proceeds to step 107. If PING fails, an error is output. In that case the cause is presumed to be a failure of the forced power-off function of the standby machine 20, a failure of the hardware control board 17 of the active machine 10, or a network failure in the management LAN 5; however, because the effect on service provision is small, the active machine can still be incorporated into the cluster configuration (in this example, the process proceeds to step 107). If the PING processing succeeds, the failure is considered to have been temporary and caused by a momentary network interruption, so the active machine can be incorporated into the cluster configuration, and the process proceeds to step 107.

The states of the active machine 10 and the standby machine 20 at this point are as follows.

(State of the active machine)
Cluster state: "NONE"
Failure count: 0
Error status: 0
Resource state: 0
(State of the standby machine)
Cluster state: "ACT"
Failure count: 0
Error status: 0
Resource state: 1 (Started)

Step 107) If the failure location estimation unit 137 of the active machine 10 can infer from the log search result that a problem has occurred in communication between the host machine 150 and the guest machine 110, the process proceeds to step 108; if this cannot be determined, the process proceeds to step 109.

Step 108) In the active machine 10, as shown in FIG. 13, the continuity confirmation unit 132 of the control execution unit 130 of the guest machine 110 executes PING and checks continuity to the router 3 via the management LAN network virtual I/F 13 and the service LAN network I/F 11 of the host machine 150. If continuity succeeds, the process proceeds to step 109; if continuity is confirmed to be "not possible", an error is output to the maintenance terminal 4 and the processing ends.

Specifically, if the matching message indicates that continuity failed on the service LAN side, which the client device uses for access, the continuity confirmation unit 132 of the active machine 10 executes the PING command addressed to the IP address of the service LAN network I/F 11 of the host machine and confirms that continuity is not possible. If the attribute value of the matching error message indicates poor continuity from the guest machine to the host machine, the continuity confirmation unit 132 of the active machine 10 executes the PING command addressed to the IP address of the management LAN network interface 12 of the host machine. If continuity succeeds, the process proceeds to step 109; if continuity is confirmed to be "not possible", an error is output to the maintenance terminal 4 and the processing ends.

Step 109) The start execution unit 135 of the control execution unit 130 of the active machine 10 and the high-availability cluster software 140 perform the following recovery processing, as shown in FIG. 14, in either of these cases:
・when the failure location was determined to be a "resource" in step 104;
・when the forced power-off function (reset) failed in step 105 but continuity to the hardware control board 17 of the active machine 10 succeeded in step 106 (a failure is also tolerated), or when continuity to the router 3 succeeded in step 107 and continuity from the guest machine to the host machine succeeded.
The numbers in parentheses below correspond to the processes indicated by the numbers in parentheses in FIG. 14.

(1) The maintenance terminal 4 instructs the control execution unit 130 of the guest machine 110 of the active machine 10 to execute the start.

(2) The command execution unit 136 of the control execution unit 130 of the active machine 10 starts the high-availability cluster software 140 of the active machine 10 by means of the start execution unit 135.

(3) When the high-availability cluster software 140 of the active machine 10 is started, it updates the cluster state in the state management information storage unit 164 of the active machine 10 from "NONE" to "SBY[online]".

(4) The high-availability cluster software 140 notifies the high-availability cluster software 240 of the standby machine 20 of the updated cluster state.

(5) On receiving the notification from the high-availability cluster software 140 of the active machine 10, the high-availability cluster software 240 of the standby machine 20 changes the cluster state in the state management information storage unit 264 from "ACT" to "SBY[online]".
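The conditions under which step 109 starts the recovery processing can be written as a single predicate. This is a sketch only: the function and parameter names are hypothetical, and modeling "a failure is also tolerated" as a `tolerate_board_failure` switch that defaults to True is an assumption about how that remark would be applied.

```python
def may_start_recovery(resource_failure: bool,
                       reset_failed: bool = False,
                       board_reachable: bool = False,
                       tolerate_board_failure: bool = True,
                       router_reachable: bool = False,
                       host_reachable: bool = False) -> bool:
    """Return True when one of the step 109 conditions holds."""
    # Case 1: step 104 found the failure location to be a resource.
    if resource_failure:
        return True
    # Case 2a: the reset failed in step 105, and the hardware control board
    # answered in step 106 (or an unreachable board is tolerated).
    if reset_failed and (board_reachable or tolerate_board_failure):
        return True
    # Case 2b: the router and the guest-to-host path both answered.
    return router_reachable and host_reachable
```

With the default tolerance, a failed reset always permits recovery, which mirrors the remark that the active machine can be incorporated even when the board check fails.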

At this point, the active machine 10 and the standby machine 20 are in the following states.

(State of the active machine)
Cluster state: "SBY[online]"
Failure count: 0
Error status: 0
Resource state: 0
(State of the standby machine)
Cluster state: "ACT"
Failure count: 0
Error status: 0
Resource state: 1 (Started)

Step 110) As shown in FIG. 15, in the active machine 10, the command execution unit 136 of the control execution unit 130 of the guest machine 110 executes the state confirmation command of the high-availability cluster software 140, acquires the cluster state from the state management information storage unit 164, and confirms that it is "SBY[online]". If it is not, an error is output.

Step 111) To improve fault tolerance and performance, the active machine 10 performs the processing shown in FIG. 16. The numbers in parentheses below correspond to the processes indicated by the numbers in FIG. 16.

(1) The control unit of the maintenance terminal 4 causes the control execution unit 130 of the guest machine 110 of the active machine 10 to execute the system switching processing.

(2) The command execution unit 136 of the control execution unit 130 of the active machine 10, by means of the system switching unit 133, has the high-availability cluster software 140 execute a system switching command so that the cluster states become "ACT" for the active machine 10 and "SBY[standby]" for the standby machine.

(3) The cluster state of the standby machine stored in the state management information storage unit 164 of the active machine 10 is changed to "SBY[standby]", whereby the cluster state of the active machine 10 is changed to "ACT".

(4) The high-availability cluster software 140 of the active machine 10 notifies the high-availability cluster software 240 of the standby machine 20 of the cluster states updated in (3) above.

(5) On receiving the notification, the high-availability cluster software 240 of the standby machine 20 updates the cluster states of the active machine and the standby machine in the state management information storage unit 264 of the standby machine 20 in the same way as in (3) above.

(6) The high-availability cluster software 240 of the standby machine 20 stops the resource 220 of the standby machine 20.

(7) The high-availability cluster software 140 of the active machine 10 starts the resource 120 of the active machine 10.
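The ordering in (3) to (7) above, where the standby resource is stopped before the active resource is started, can be illustrated with a small in-memory model. The `Node` class, the function name, and the returned action log are assumptions of this sketch; the real state updates are performed by the high-availability cluster software and propagated by notification, which is collapsed here into direct assignments.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    """Hypothetical in-memory stand-in for one machine's state entry."""
    name: str
    cluster_state: str
    resource_state: int  # 1 = Started, 0 = not started


def switch_over(active: Node, standby: Node) -> List[str]:
    """Apply the step 111 transitions in order and return the actions taken."""
    if active.cluster_state != "SBY[online]" or standby.cluster_state != "ACT":
        raise ValueError("unexpected cluster states before system switching")
    actions = []
    # (3) to (5): the standby machine goes to SBY[standby], the active machine to ACT.
    standby.cluster_state = "SBY[standby]"
    active.cluster_state = "ACT"
    actions.append("cluster states updated")
    # (6): stop the resource on the standby machine first.
    standby.resource_state = 0
    actions.append("standby resource stopped")
    # (7): then start the resource on the active machine.
    active.resource_state = 1
    actions.append("active resource started")
    return actions
```

After the call, the two nodes hold the states that step 112 checks: "ACT" with the resource started on the active machine, and "SBY[standby]" with the resource not started on the standby machine.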

Step 112) As shown in FIG. 17, the control execution unit 130 of the guest machine 110 of the active machine 10 executes the state confirmation command of the high-availability cluster software 140 and confirms that, in the state management information storage unit 164, the cluster state is "ACT" and the resource state is "1", that is, "Started".

(State of the active machine)
Cluster state: "ACT"
Failure count: 0
Error status: 0
Resource state: 1 (Started)
(State of the standby machine)
Cluster state: "SBY[standby]"
Failure count: 0
Error status: 0
Resource state: 0

Step 113) The active machine 10 performs the termination processing shown in FIG. 18. The numbers in parentheses below correspond to the processes indicated by the numbers in parentheses in FIG. 18.

(1) The maintenance terminal 4 connected to the management LAN 5 sends a system switching processing request to the control execution unit 130 of the guest machine 110 of the active machine 10.

(2) The command execution unit 136 of the control execution unit 130 of the active machine 10, by means of the system switching unit 133, has the high-availability cluster software 140 execute a system switching command so that the cluster state of the standby machine 20 changes from "SBY[standby]" to "SBY[online]".

(3) In response to the system switching command, the high-availability cluster software 140 of the active machine 10 updates the cluster state of the standby machine 20 in the state management information storage unit 164 of the active machine 10.

(4) The high-availability cluster software 140 of the active machine 10 notifies the high-availability cluster software 240 of the standby machine 20 of the update from "SBY[standby]" to "SBY[online]".

(5) On receiving the notification, the high-availability cluster software 240 of the standby machine 20 updates the cluster state of the standby machine in the state management information storage unit 264 of the standby machine 20 to "SBY[online]".
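The termination processing of step 113, together with the kind of check that step 114 performs afterwards, can be sketched as two small functions. The dictionary keyed by "active" and "standby" and both function names are hypothetical conveniences of this sketch, not part of the described system.

```python
from typing import Dict


def finish_switching(states: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of the states with the standby machine back at SBY[online]."""
    if states.get("standby") != "SBY[standby]":
        raise ValueError("standby machine is not in SBY[standby]")
    updated = dict(states)  # leave the caller's mapping untouched
    updated["standby"] = "SBY[online]"
    return updated


def standby_is_online(states: Dict[str, str]) -> bool:
    """Check used after termination: an error would be reported when this is False."""
    return states.get("standby") == "SBY[online]"
```

Returning a new mapping instead of mutating the argument keeps the state before and after the transition available for comparison.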

Step 114) As shown in FIG. 19, in the active machine 10, the state confirmation execution unit 134 of the control execution unit 130 of the guest machine 110 executes the status command of the high-availability cluster software 140. If the cluster state of the standby machine 20 is not "SBY[online]", an error is output to the maintenance terminal 4 connected to the management LAN 5.

(State of the active machine)
Cluster state: "ACT"
Failure count: 0
Error status: 0
Resource state: 1 (Started)
(State of the standby machine)
Cluster state: "SBY[online]"
Failure count: 0
Error status: 0
Resource state: 0

Step 115) Log out of the active machine 10.

上記のように、本発明によれば、仮想環境における、現用機がクラスタ構成に組み込まれていない原因を、当該現用機のリソース、ゲストマシンからホストマシンへの導通不良、ディスク、ネットワークのいずれのカテゴリの故障であるのかを特定することができる。これにより、少なくとも、リソース故障である場合には、予備機でサービス稼働中である場合は、現用機でサービスを再開させることが可能になる。また、エラーのカテゴリが特定されることにより、エラーを保守者に提示することにより当該故障箇所を容易に同定することが可能となる。 As described above, according to the present invention, in the virtual environment, the cause that the active machine is not incorporated in the cluster configuration can be determined by any of the resources of the active machine, the poor continuity from the guest machine to the host machine, the disk, and the network. It is possible to specify whether it is a category failure. As a result, at least in the case of a resource failure, if the service is operating on the spare machine, the service can be restarted on the active machine. In addition, by specifying the error category, the failure location can be easily identified by presenting the error to the maintenance person.

また、故障箇所の推定後に導通確認を行うことにより、故障推定精度を向上させることができる。更に、ネットワークの瞬断による一時的な故障が原因で発生したリソース故障を自動的に回復することができる。 Moreover, the failure estimation accuracy can be improved by performing the conduction check after the failure location is estimated. Furthermore, it is possible to automatically recover a resource failure that has occurred due to a temporary failure due to an instantaneous network interruption.

In addition, when the performance of the standby machine is inferior to that of the active machine, or when the cluster system consists of two or more active machines and one standby machine, the drop in processing speed that results from providing the service on the standby machine can be avoided.

For convenience of explanation, the system according to the embodiment of the present invention has been described using functional block diagrams; however, the system of the present invention may be implemented in hardware, software, or a combination thereof. For example, each functional unit of the servers (the active machine and the standby machine) may be implemented in software and installed on an operating system. The functional units may also be used in combination as necessary.

Although embodiments and examples of the present invention have been described above, the present invention is not limited to them, and various modifications and applications are possible within the scope of the claims.

1 Client device
2 Service LAN
3 Router
4 Maintenance terminal
5 Management LAN
10 Server (active machine)
11, 21 Service LAN network interface
12, 22 Management LAN network interface
13, 23 Management LAN network virtual interface
14, 24 Service LAN network virtual interface
15, 25 Management LAN network virtual interface
16, 26 Service LAN network virtual interface
17, 27 Hardware control board
20 Server (standby machine)
30 Shared disk
110, 210 Guest machine
151, 251 Device batch monitoring unit
120, 220 Resource
130, 230 Control execution unit
131, 231 Log search unit
132, 232 Continuity check unit
133, 233 System switching unit
134, 234 Status check execution unit
135, 235 Start execution unit
136, 236 Command execution unit
137, 237 Failure location estimation unit
140, 240 High availability cluster software
141, 241 Network monitoring result acquisition unit
142, 242 Disk monitoring result acquisition unit
160, 260 Built-in disk
161, 261 Host machine allocation disk
162, 262 Guest machine allocation disk
163, 263 Log storage unit
164, 264 State management information storage unit
170, 270 Hypervisor

Claims

A failure recovery method in a virtual environment for identifying the cause of a failure when, in a cluster system composed of an active machine and a standby machine in a virtual environment in which a hypervisor is introduced, the active machine and the standby machine each having cluster state management means for managing a cluster state indicating the operating state of the active machine and the standby machine, state management information storage means for storing the cluster state and a failure state, and log storage means for storing a failure log indicating a failure location, the standby machine is in service and the active machine is not incorporated in the cluster configuration, the method comprising:
a failure location estimation step of estimating the failure location of the active machine based on a result that failure search means of the active machine retrieves from the log storage means of the active machine;
a continuity confirmation step in which, when the failure location estimation step estimates that the failure location of the active machine is a network failure, continuity confirmation means of the active machine confirms continuity to the router connected to the active machine; and
a cluster configuration starting step of transitioning the cluster state of the active machine to "a state allowing transition to an in-service state" when the continuity confirmation to the router succeeds.

The failure recovery method in a virtual environment according to claim 1, wherein,
in an environment in which a host machine and a guest machine are installed in the active machine,
when a communication failure between the host machine and the guest machine is identified,
in the continuity confirmation step, continuity from the guest machine to the host machine is checked, and
in the cluster configuration starting step, when continuity from the guest machine to the host machine is confirmed, the cluster state of the active machine is transitioned to "a state allowing transition to an in-service state".

The failure recovery method in a virtual environment according to claim 1 or 2, wherein,
when the failure location estimation step estimates that the category of the failure location is a resource,
the cluster configuration starting step transitions the cluster state of the active machine to "a state allowing transition to an in-service state".

The failure recovery method in a virtual environment according to claim 3, wherein,
in an environment in which a forced power-off function for forcibly turning off the power of the other server in the event of a failure is introduced to the active machine and the standby machine,
in the failure location estimation step, when an error concerning the forced power-off function is found in the log storage means of the standby machine, it is determined that a serious error has been detected in the active machine,
in the continuity confirmation step, continuity from the standby machine to power control means of the active machine is confirmed, and
in the cluster configuration starting step, when continuity to the power control means is confirmed in the continuity confirmation step, the cluster state of the active machine is transitioned to "a state allowing transition to an in-service state".

A server that operates as an active machine in a cluster system composed of an active machine and a standby machine in a virtual environment in which a hypervisor is introduced, the active machine and the standby machine each having cluster state management means for managing a cluster state indicating the operating state of the active machine and the standby machine, state management information storage means for storing the cluster state and a failure state, and log storage means for storing a failure log indicating a failure location, the server identifying the cause of a failure when the standby machine is in service and the active machine is not incorporated in the cluster configuration, the server comprising:
failure location estimation means for estimating the failure location of the active machine based on a result retrieved from the log storage means;
continuity confirmation means for confirming continuity to the router connected to the active machine when the failure location estimation means estimates that the failure location of the active machine is a network failure; and
cluster configuration starting means for transitioning the cluster state of the active machine to "a state allowing transition to an in-service state" when the continuity confirmation to the router succeeds.

The server according to claim 5, wherein
the continuity confirmation means includes means for confirming continuity from the guest machine to the host machine when, in an environment in which a host machine and a guest machine are installed on the server, a communication failure between the host machine and the guest machine is identified, and
the cluster configuration starting means includes means for transitioning the cluster state of the active machine to "a state allowing transition to an in-service state" when the continuity confirmation means confirms continuity from the guest machine to the host machine.

The server according to claim 5 or 6, wherein
the cluster configuration starting means includes means for transitioning the cluster state of the server to "a state allowing transition to an in-service state" when the category of the failure location is a resource.

The server according to claim 7, wherein,
in an environment in which a forced power-off function for forcibly turning off the power of the other server in the event of a failure is introduced to the active machine and the standby machine,
the cluster configuration starting means includes means for preventing the cluster state of the server from transitioning to "a state allowing transition to an in-service state" when an error concerning the forced power-off function is detected on the standby machine and the continuity confirmation means of the standby machine, on checking continuity from the standby machine to the power control means of the active machine, finds the continuity to be defective.

A program for causing a computer to function as each of the means constituting the server according to any one of claims 5 to 8.