JP2019008340A

JP2019008340A - Redundant system and hardware failure detection method

Info

Publication number: JP2019008340A
Application number: JP2017120409A
Authority: JP
Inventors: 博史野口; Hiroshi Noguchi; 信浜田; Makoto Hamada; 俊之森谷; Toshiyuki Moriya; 直人日高; Naoto Hidaka
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2017-06-20
Filing date: 2017-06-20
Publication date: 2019-01-17
Anticipated expiration: 2037-06-20
Also published as: JP6760888B2

Abstract

PROBLEM TO BE SOLVED: To provide, for a redundant system, a hardware failure detection method that can detect a failure rapidly with a small risk of false detection and a small monitoring load.
SOLUTION: A redundant system 1 includes host OSs 21 running on a plurality of computers 2, virtual machines 23, and duplexed applications executed by the virtual machines 23. The system is provided with: a host OS monitoring unit 31 that monitors whether each host OS 21 is alive; a representative application selecting unit 32 that, when it is detected that the response of a host OS 21 has been interrupted, selects a representative application executed by a virtual machine 23 on that host OS 21 and further selects the corresponding application paired with it; and an application state confirmation unit 33 that confirms whether or not the selected corresponding application is in single-system operation.
SELECTED DRAWING: Figure 1

Description

The present invention relates to a redundant system and a hardware failure detection method.

In recent cloud services, it is common to use virtualization technology to create multiple logical computers on the hardware of a physical computer and provide them to tenant users, in order to increase consolidation efficiency. A logical computer here generally means a VM (Virtual Machine).

Such a cloud service can be managed and operated by a virtual resource management function. In recent years, OpenStack (registered trademark; see Non-Patent Document 1) has attracted attention as software that provides the virtual resource management function.
The redundancy of a cloud service can be ensured by linking the virtual resource management function with a function that monitors the physical and logical computers constituting the cloud service. Hereinafter, a system that ensures the redundancy of a cloud service in this way is referred to as a redundant system.

FIG. 5 is a functional block diagram illustrating a configuration example of a redundant system according to a comparative example.
The redundant system 1A includes computers 2-1 to 2-3, a monitoring device 3, a virtual resource management function 4, and a computer IP address list 41.
Because the redundant system 1A of the comparative example requires high availability, each application is duplexed as an active application and a standby application. In the virtualized environment as well, the active application and the standby application are executed by different VMs, and the two VMs run on different computers, which provides redundancy against hardware failures.

The host OS 21-1 runs on the computer (physical computer) 2-1, and the hypervisor 22-1 runs on the host OS 21-1. The hypervisor 22-1 implements VMs (virtual machines) 23-1 and 23-2. The VM 23-1 executes the active application A, and the VM 23-2 executes the standby application b, which duplexes the active application B.
The host OS 21-2 runs on the computer (physical computer) 2-2, and the hypervisor 22-2 runs on the host OS 21-2. The hypervisor 22-2 implements VMs 23-3 and 23-4. The VM 23-3 executes the active application B, and the VM 23-4 executes the standby application c, which duplexes the active application C.
The host OS 21-3 runs on the computer (physical computer) 2-3, and the hypervisor 22-3 runs on the host OS 21-3. The hypervisor 22-3 implements VMs 23-5 and 23-6. The VM 23-5 executes the active application C, and the VM 23-6 executes the standby application a, which duplexes the active application A.

Hereinafter, the computers 2-1 to 2-3 may be referred to simply as the computer 2, the host OSs 21-1 to 21-3 as the host OS 21, the hypervisors 22-1 to 22-3 as the hypervisor 22, and the VMs 23-1 to 23-6 as the VM 23.

The monitoring device 3 monitors whether the computer 2, the VM 23, or both are alive.
A simple and widely used liveness monitoring method is to send a ping and check whether a response is returned. With this method, the monitoring device 3 can detect a hardware failure of the computer 2 by sending a ping to the physical NIC (Network Interface Card) of the computer 2. The monitoring device 3 can also detect a hardware failure of the computer 2 by sending a ping to the virtual NIC of the VM 23. The ping may be sent to both the computer 2 and the VM 23, or to only one of them.
The IP addresses assigned to the NICs that are ping destinations are stored in the computer IP address list 41. The monitoring device 3 refers to the computer IP address list 41 and sends a ping to each NIC to be checked.

The virtual resource management function 4 cooperates with the monitoring device 3 and manages the VMs 23 implemented on the computers 2.
With this configuration, when the monitoring device 3 detects a hardware failure of a computer 2, the virtual resource management function 4 can delete the VMs 23 from the host OS 21 on the failed hardware and regenerate them on a host OS 21 on different hardware. This automatic deletion and regeneration of VMs is hereinafter also referred to as recovery processing or auto-healing.

Non-Patent Document 1: OpenStack Project, "OpenStack Open Source Cloud Computing Software", [online], [retrieved June 16, 2017], Internet <URL: http://www.openstack.org/>

In a system composed only of physical computers, responses to monitoring are rarely delayed during normal operation. In a virtualized environment where multiple VMs share one physical computer, however, the response delay varies widely even when no hardware failure has occurred. Consequently, even if the physical computer and the VMs are operating normally, a response may not be returned within the predetermined time (timeout period). In that case, the monitoring device 3 may mistakenly conclude that the response has been lost and falsely detect a hardware failure.

When such a false detection occurs, the virtual resource management function 4 automatically deletes the VM 23 and regenerates it on another host OS 21. Depending on the application, this can interrupt service and add computer processing and communication load, so it should not be performed on a false detection.

To avoid auto-healing triggered by such false detections, the ping timeout period could be lengthened. However, this lengthens the time needed to detect a failure, so recovery cannot be started quickly.

Another way to detect hardware failures is to periodically check the status of, or collect logs from, each individual application. However, per-application checks impose a larger response processing load and data transfer load than a simple ping response check. If per-application checks are performed frequently in an environment where many VMs exist at the same time, they may themselves cause a service failure.

The present invention has been made in view of these points, and its object is to provide a redundant system and a hardware failure detection method that can detect failures quickly, with a low risk of false detection and a small monitoring load.

To solve the above problem, the invention according to claim 1 is a redundant system that includes host OSs running on a plurality of physical computers, virtual machines, and duplexed applications executed by the virtual machines, the redundant system comprising:
a host OS monitoring unit that monitors whether each of the host OSs is alive;
a representative application selection unit that, when the host OS monitoring unit detects that the response of any host OS has been interrupted, selects a representative application executed by a virtual machine on the detected host OS and further selects the corresponding application paired with the representative application; and
an application state confirmation unit that confirms whether or not the corresponding application selected by the representative application selection unit is in single-system operation.

In this way, the redundant system can reduce the risk of falsely detecting a failure while keeping the load during normal operation low. Furthermore, in this configuration, the state is queried from a virtual machine implemented on a normal computer, and not from a virtual machine implemented on the computer suspected of failure. This avoids placing additional processing and communication load on the computer suspected of failure.

To solve the above problem, the invention according to claim 2 is the redundant system according to claim 1, further comprising a recovery processing unit that, when any of the corresponding applications confirmed by the application state confirmation unit is in single-system operation, recovers the virtual machines on the host OS detected by the host OS monitoring unit onto another host OS.

In this way, the redundant system performs only the auto-healing that is actually necessary, based on highly accurate failure detection. This avoids service interruptions and increases in computer processing and communication load.

To solve the above problem, the invention according to claim 3 is the redundant system according to claim 1, further comprising a representative application candidate database that stores the correspondence among each physical computer, the virtual machines implemented on that physical computer, the representative application executed by each virtual machine, and the corresponding application paired with that representative application.

In this way, the redundant system identifies the corresponding application paired with a representative application on the computer suspected of failure and checks the state of that corresponding application. This allows the redundant system to detect failures quickly.

To solve the above problem, the invention according to claim 4 is the redundant system according to claim 1, further comprising state confirmation logic that stores a state confirmation command for confirming whether or not the representative application is in single-system operation, and the response to the state confirmation command.

In this way, the redundant system can check the state of a corresponding application using the logic that matches that application. The state can be checked even when each application has its own state confirmation logic.

To solve the above problem, the invention according to claim 5 is the redundant system according to claim 1, wherein each virtual machine is implemented by a hypervisor running on the host OS.

In this way, a Type 2 hypervisor can be used in the redundant system.

To solve the above problem, the invention according to claim 6 is the redundant system according to claim 1, wherein a hypervisor running on each physical computer implements the virtual machines that execute the applications and a virtual machine that executes the host OS.

In this way, a Type 1 hypervisor can be used in the redundant system.

To solve the above problem, the invention according to claim 7 is a hardware failure detection method for a redundant system that includes host OSs running on a plurality of physical computers, virtual machines, and duplexed applications executed by the virtual machines, wherein:
a host OS monitoring unit monitors whether each of the host OSs is alive;
when the host OS monitoring unit detects that the response of any host OS has been interrupted, a representative application selection unit selects a representative application executed by a virtual machine on the detected host OS, and selects the corresponding application paired with the representative application; and
an application state confirmation unit confirms whether or not the corresponding application is in single-system operation.

In this way, the redundant system can reduce the risk of falsely detecting a failure while keeping the load during normal operation low. Furthermore, in this configuration, the state is queried from a virtual machine implemented on a normal computer, and not from a virtual machine implemented on the computer suspected of failure. This avoids placing additional processing and communication load on the computer suspected of failure.

To solve the above problem, the invention according to claim 8 is the hardware failure detection method for a redundant system according to claim 7, wherein, when any of the representative applications confirmed by the application state confirmation unit is in single-system operation, a recovery processing unit recovers the virtual machines on the host OS whose response was detected to have been interrupted onto another host OS.

According to the present invention, it is possible to provide a redundant system and a hardware failure detection method that can detect failures quickly, with a low risk of false detection and a small monitoring load.

FIG. 1 is a functional block diagram showing a configuration example of the redundant system according to the present embodiment.
FIG. 2 is a flowchart of the hardware failure detection method according to the present embodiment.
FIG. 3 is an example of the state confirmation logic table according to the present embodiment.
FIG. 4 is a sequence diagram showing the flow from execution of host OS monitoring to recovery processing according to the present embodiment.
FIG. 5 is a functional block diagram showing a configuration example of the redundant system of the comparative example.

Next, the redundant system 1 and related elements in a mode for carrying out the present invention (hereinafter referred to as "the present embodiment") will be described.

Because the redundant system 1 of the present embodiment requires high availability, each application is duplexed as an active application and a standby application. In the virtualized environment as well, the active application and the standby application are executed by different VMs, and the two VMs run on different computers, which provides redundancy against hardware failures.
In a duplex configuration, when a failure occurs in one system, the operating state can be checked by querying the other, healthy system. The present embodiment uses this property to provide a two-stage monitoring function that relies on the duplexed applications in the virtualized environment.

FIG. 1 is a functional block diagram showing a configuration example of the redundant system 1 according to the present embodiment.
The redundant system 1 includes computers 2-1 to 2-3, a monitoring device 3, and a virtual resource management function 4.

The host OS 21-1 runs on the computer (physical computer) 2-1, and the hypervisor 22-1 runs on the host OS 21-1. The hypervisor 22-1 implements VMs (virtual machines) 23-1 and 23-2.
The host OS 21-2 runs on the computer (physical computer) 2-2, and the hypervisor 22-2 runs on the host OS 21-2. The hypervisor 22-2 implements VMs 23-3 and 23-4.
The host OS 21-3 runs on the computer (physical computer) 2-3, and the hypervisor 22-3 runs on the host OS 21-3. The hypervisor 22-3 implements VMs 23-5 and 23-6.

The virtual resource management function 4 includes the computer IP address list 41. Information on the VMs 23 is stored and managed in the virtual resource management function 4.

The monitoring device 3 includes a host OS monitoring unit 31, a representative application selection unit 32, a representative application candidate database 34, an application state confirmation unit 33, and state confirmation logic 35. The monitoring device 3 is a function separate from the virtual resource management function 4. It may be deployed on the same computer as the virtual resource management function 4 or on a different computer; the deployment method does not matter. In the drawings, a database may be denoted as "DB".

The active application A is executed by the VM 23-1 implemented on the computer 2-1. The standby application a, which is paired with the active application A, is executed by the VM 23-6 implemented on the computer 2-3.
The active application B is executed by the VM 23-3 implemented on the computer 2-2. The standby application b, which is paired with the active application B, is executed by the VM 23-2 implemented on the computer 2-1.
The active application C is executed by the VM 23-5 implemented on the computer 2-3. The standby application c, which is paired with the active application C, is executed by the VM 23-4 implemented on the computer 2-2.
In this way, these applications are duplexed.

The duplexed applications A and a, B and b, and C and c each periodically query (poll) their paired application and hold the state of that paired application.
The user stores information on the duplexed applications in the representative application candidate database 34 in advance. Specifically, the database stores, for example, the correspondence among the computer 2-1, the VM 23-1 implemented on it, the representative application A executed by the VM 23-1, and the corresponding application a paired with the representative application A. Any known method can be used for storing.
Representative application candidates can be stored, for example, when the information on the VM 23 is stored in the virtual resource management function 4. Additional candidates can also be stored whenever necessary, for example when deletions have reduced the number of candidates below the number the user initially stored.
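
The correspondence held in the representative application candidate database 34 can be pictured as a small table. The following Python sketch is an illustration only, under stated assumptions: the `Candidate` record, its field names, and the `candidates_on` helper are hypothetical and do not appear in the specification. The rows reproduce the placement of applications A/a, B/b, and C/c on the computers 2-1 to 2-3 described for FIG. 1.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    """One row of a hypothetical representative application candidate table."""
    computer: str       # physical computer hosting the VM
    vm: str             # VM that executes the representative application
    application: str    # representative application
    corresponding: str  # paired application running on another computer


# Rows follow the FIG. 1 placement; the table layout itself is an assumption.
CANDIDATE_DB: List[Candidate] = [
    Candidate("2-1", "23-1", "A", "a"),
    Candidate("2-1", "23-2", "b", "B"),
    Candidate("2-2", "23-3", "B", "b"),
    Candidate("2-2", "23-4", "c", "C"),
    Candidate("2-3", "23-5", "C", "c"),
    Candidate("2-3", "23-6", "a", "A"),
]


def candidates_on(computer: str) -> List[Candidate]:
    """Return every candidate whose VM is hosted on the given computer."""
    return [row for row in CANDIDATE_DB if row.computer == computer]


if __name__ == "__main__":
    for row in candidates_on("2-1"):
        print(row.application, "->", row.corresponding)
```

Keeping one row per VM means that a lookup by computer returns both the applications hosted there and the names of their partners, which is the information the later selection step needs.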

The host OS monitoring unit 31 performs liveness monitoring on the host OS 21, which is common to all the VMs 23 executing applications. Monitoring the host OS 21 imposes a lower load than monitoring each VM 23 individually. As in the existing technology, monitoring is performed, for example, by sending a ping and checking for a response. The host OS monitoring unit 31 obtains the IP address assigned to the physical NIC of each computer 2 from the computer IP address list 41, periodically sends a ping to that physical NIC, and checks the response.
Alternatively, liveness may be monitored by having the host OS 21 periodically send keep-alive messages to the host OS monitoring unit 31, or the host OS may be monitored by some other method.
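
The first monitoring stage described above can be sketched as follows. This is a minimal illustration under assumptions, not the implementation of the host OS monitoring unit 31: the names `COMPUTER_IPS`, `unresponsive_hosts`, and `fake_probe` are hypothetical, the addresses are taken from the 192.0.2.0/24 documentation range, and the actual probe (ping, keep-alive, or another method) is passed in as a function so that the sketch runs without network access.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the computer IP address list 41.
# The addresses come from the 192.0.2.0/24 documentation range.
COMPUTER_IPS: Dict[str, str] = {
    "2-1": "192.0.2.11",
    "2-2": "192.0.2.12",
    "2-3": "192.0.2.13",
}


def unresponsive_hosts(
    ip_list: Dict[str, str],
    probe: Callable[[str, float], bool],
    timeout_s: float = 1.0,
) -> List[str]:
    """Return the computers whose host OS did not answer within timeout_s.

    probe(ip, timeout_s) is supplied by the caller and returns True when a
    response arrived in time, for example a wrapper around ping.
    """
    return [name for name, ip in ip_list.items() if not probe(ip, timeout_s)]


def fake_probe(ip: str, timeout_s: float) -> bool:
    """Test double: pretend that only 192.0.2.11 fails to respond."""
    return ip != "192.0.2.11"


if __name__ == "__main__":
    print(unresponsive_hosts(COMPUTER_IPS, fake_probe))
```

A computer returned by this function is only suspected of failure. In the scheme of the present embodiment, the decision to recover is left to the second stage, which queries the corresponding applications.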

Consider the case where no response is obtained from the host OS 21-1 within a fixed time.
In this case, a hardware failure of the computer 2-1 is suspected. To confirm the failure, the representative application selection unit 32 selects one or more representative applications from the representative application candidate database 34. The selection may be random, in order of VM creation, or by any other method. In this example, the active application A executed by the VM 23-1 and the standby application b executed by the VM 23-2, both implemented on the computer 2-1, are selected as representative applications.

Further, the representative application selection unit 32 identifies, from the representative application candidate database 34, the application that is paired with each selected representative application and that runs on a different computer 2. Such an application is hereinafter referred to as a corresponding application.
In this example, the standby application a executed by the VM 23-6 implemented on the computer 2-3 is identified as the corresponding application of the representative application A. Likewise, the active application B executed by the VM 23-3 implemented on the computer 2-2 is identified as the corresponding application of the representative application b.
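
The selection of representative applications and their corresponding applications can be sketched as follows. The sketch assumes random selection, which the text names as one permitted option; the `CANDIDATES` table and the `select_pairs` function are hypothetical names introduced for illustration, with rows that follow the FIG. 1 placement.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical table: computer -> list of
# (representative application, corresponding application,
#  computer hosting the corresponding application).
CANDIDATES: Dict[str, List[Tuple[str, str, str]]] = {
    "2-1": [("A", "a", "2-3"), ("b", "B", "2-2")],
    "2-2": [("B", "b", "2-1"), ("c", "C", "2-3")],
    "2-3": [("C", "c", "2-2"), ("a", "A", "2-1")],
}


def select_pairs(
    suspect: str, count: int, rng: random.Random
) -> List[Tuple[str, str, str]]:
    """Pick up to `count` representative applications on the suspected
    computer, each with its corresponding application and that
    application's computer. Selection is random without replacement."""
    rows = CANDIDATES.get(suspect, [])
    return rng.sample(rows, min(count, len(rows)))


if __name__ == "__main__":
    for rep, corr, host in select_pairs("2-1", 2, random.Random()):
        print(f"representative {rep} -> corresponding {corr} on computer {host}")
```

Because every corresponding application runs on a computer other than the suspected one, the follow-up queries never touch the computer whose host OS stopped responding.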

The application state confirmation unit 33 checks the state of the corresponding application a and the corresponding application B and determines whether each is in single-system operation. As described above, a corresponding application checks the state of its paired representative application by polling and holds that state, so the state of the representative application can be learned by querying the corresponding application. If the corresponding application is in duplex operation, the representative application is running. If the corresponding application is in single-system operation, the representative application has stopped.
The state confirmation command and the logic for judging an abnormal state differ from application to application. The user therefore stores the application-specific interface in the state confirmation logic 35 in advance. Details of the state confirmation logic 35 will be described later.

The application state confirmation unit 33 refers to the state confirmation logic 35 and checks the state of each corresponding application using the logic that matches that application. If the check shows that, for any of the duplexed applications, only the corresponding application is running (single-system operation), the monitoring device 3 determines that a hardware failure has occurred in the computer 2-1.
Based on this determination, the virtual resource management function 4 deletes the VM 23-1 and the VM 23-2 implemented on the computer 2-1 and regenerates them on a host OS 21 other than that of the computer 2-1.
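
The second-stage decision can be sketched as follows. This is a simplified illustration under assumptions: the state labels `DUPLEX` and `SINGLE`, the function `hardware_failure_confirmed`, and the representation of the state confirmation logic 35 as a dictionary of per-application callables are all hypothetical. A real check would issue the application-specific state confirmation command and interpret its response.

```python
from typing import Callable, Dict, Iterable

# Possible answers from a corresponding application (labels are assumptions).
DUPLEX = "duplex"  # the paired application still answers polling
SINGLE = "single"  # the paired application is lost: single-system operation


def hardware_failure_confirmed(
    corresponding_apps: Iterable[str],
    state_logic: Dict[str, Callable[[], str]],
) -> bool:
    """Second monitoring stage: return True if any corresponding
    application reports single-system operation."""
    return any(state_logic[app]() == SINGLE for app in corresponding_apps)


if __name__ == "__main__":
    # Host OS 21-1 timed out; a and B are the corresponding applications.
    logic = {"a": lambda: SINGLE, "B": lambda: DUPLEX}
    if hardware_failure_confirmed(["a", "B"], logic):
        print("hardware failure confirmed: start recovery of VMs on computer 2-1")
    else:
        print("timeout treated as a false alarm: no recovery")
```

If every corresponding application still reports duplex operation, the timeout is treated as a delayed response and no VM is deleted or regenerated.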

As shown in FIG. 1, the VMs 23 of the present embodiment are implemented by the hypervisor 22 running on the host OS 21. Such a hypervisor is called a Type 2 hypervisor.
It is also possible to use a Type 1 hypervisor, which runs on the computer without going through a host OS, to implement both the VMs that execute the applications and a VM that executes the host OS. In this case as well, hardware failures can be detected in the same way as in the present embodiment.

FIG. 2 is a flowchart of the hardware failure detection method according to the present embodiment.
First, the host OS monitoring unit 31 performs alive monitoring of the host OS 21 (step S10). In practice, the host OS monitoring unit 31 monitors the host OSs 21-1 to 21-3, but for simplicity the description here focuses on a single host OS 21.

If a response arrives within a fixed time, the host OS monitoring unit 31 continues to perform monitoring periodically (step S10: response).
If no response arrives within the fixed time (step S10: timeout), the host OS monitoring unit 31 detects that the response of the host OS 21 has been interrupted. The representative application selection unit 32 then selects, based on the representative application candidate database 34, the representative applications executed by the VMs 23 on this host OS 21 (step S11). The representative application selection unit 32 selects one or more representative applications per computer.
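As a non-limiting illustration, the selection in step S11 might be sketched in Python as follows. The record fields, the function name `select_representatives`, and the placement of applications on the computers 2-2 and 2-3 are assumptions made for this sketch; the embodiment only states that the computer 2-1 runs the representative applications A and b.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    computer: str        # physical computer, e.g. "2-1"
    vm: str              # VM embodied on that computer, e.g. "23-1"
    representative: str  # application executed by that VM
    corresponding: str   # its duplex partner, running elsewhere


# Hypothetical contents of the representative application candidate database 34.
CANDIDATE_DB = [
    Candidate("2-1", "23-1", "A", "a"),
    Candidate("2-1", "23-2", "b", "B"),
    Candidate("2-2", "23-3", "a", "A"),
    Candidate("2-2", "23-4", "C", "c"),
    Candidate("2-3", "23-5", "B", "b"),
    Candidate("2-3", "23-6", "c", "C"),
]


def select_representatives(db, computer):
    """Step S11: return the candidates hosted on the unresponsive computer."""
    return [entry for entry in db if entry.computer == computer]


selected = select_representatives(CANDIDATE_DB, "2-1")
pairs = [(entry.representative, entry.corresponding) for entry in selected]
```

Here `pairs` holds each representative application on the computer 2-1 together with the corresponding application whose state is checked in the subsequent steps.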

The monitoring device 3 repeats the processing of steps S12 to S15 for each selected representative application.

The content of this loop is as follows.
The representative application selection unit 32 identifies, from the representative application candidate database 34, the corresponding application that is paired with the representative application (step S13). The application state confirmation unit 33 may perform this identification instead.
The application state confirmation unit 33 then checks the state of the identified corresponding application (step S14).

If the check of the corresponding application shows duplex operation (step S14: normal / duplex operation), the representative application can be judged to be normal. In this case, the application state confirmation unit 33 determines that the monitoring result (timeout) was a false detection or was caused by a failure confined to the communication path. The monitoring device 3 therefore does not move to the recovery process and repeats the loop for the next representative application (step S15).
If, in step S15, no abnormality has been detected for any of the representative applications, the host OS monitoring unit 31 returns to alive monitoring of the host OS 21 (step S10).
If the state of a representative application running on the computer 2 suspected of failure were queried directly, the representative application would be expected either not to respond or to respond slowly. In contrast, the corresponding application paired with the representative application can be assumed to be running on a computer 2 that is operating normally, so it responds quickly. This allows the redundant system 1 to detect a failure quickly.

If the check of the corresponding application shows single-system operation, in which only the corresponding application is running (step S14: abnormal / single-system operation), the representative application can be judged to have failed. In this case, the application state confirmation unit 33 determines that the monitoring result (timeout) was caused by a hardware failure of the computer 2 and proceeds to the recovery process of step S16.
In the recovery process, the virtual resource management function 4 deletes the VMs 23 deployed on the host OS 21 of the failed computer 2 and regenerates them on another host OS 21. The host OS monitoring unit 31 then returns to monitoring the host OS 21 (step S10).
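The judgment made in steps S12 to S15 can be illustrated with the following Python sketch. It is only one possible reading of the flowchart: the function name `diagnose_timeout`, the state strings "duplex" and "single", and the returned labels are assumptions introduced for illustration.

```python
def diagnose_timeout(pairs, query_state):
    """Decide what a host OS timeout means (steps S12 to S15).

    pairs: iterable of (representative, corresponding) application names
           selected for the unresponsive host OS.
    query_state: callable that asks a corresponding application for its
           state and returns "duplex" or "single" (assumed labels).
    """
    for representative, corresponding in pairs:
        # Step S14: ask the partner, which runs on a healthy computer.
        if query_state(corresponding) == "single":
            # Only the partner is running, so treat it as a hardware failure
            # and hand over to the recovery process (step S16).
            return ("hardware_failure", representative)
    # Every pair is still in duplex operation: false detection or a
    # failure confined to the communication path; resume monitoring (S10).
    return ("false_detection_or_path_failure", None)


# Example: partner states as they would be reported after a real failure.
reported = {"a": "single", "B": "duplex"}
result = diagnose_timeout([("A", "a"), ("b", "B")], reported.get)
```

Because only the corresponding applications are queried, the sketch never waits on the computer that is suspected of failure.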

FIG. 3 is an example of the table of the state confirmation logic 35 according to the present embodiment.
In this example, the state confirmation logic 35 stores a representative application (application type), a state confirmation command, and a normal confirmation logic in association with one another.
By referring to the state confirmation logic 35, the application state confirmation unit 33 executes the state confirmation command that matches the corresponding application to be checked and judges normality using the normal confirmation logic.
Specifically, the application state confirmation unit 33 executes the state confirmation command Xxxx for the application A and determines that the state of the application A is normal if the response to the command contains the character string AAA.
The application state confirmation unit 33 executes the state confirmation command Yyyy for the application B and determines that the state of the application B is normal if the response to the command contains the character string BBB.
The application state confirmation unit 33 executes the state confirmation command Zzzz for the application C and determines that the state of the application C is normal if the response to the command contains the character string CCC.
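A table such as that of FIG. 3 might be held and applied as in the following Python sketch. The dictionary layout, the function name `is_normal`, and the `run_command` callable that delivers a command and returns the response text are assumptions for illustration; Xxxx, Yyyy, and Zzzz are the placeholder command names used in the example above.

```python
# Application type -> (state confirmation command, string expected when normal)
STATE_CONFIRMATION_LOGIC = {
    "A": ("Xxxx", "AAA"),
    "B": ("Yyyy", "BBB"),
    "C": ("Zzzz", "CCC"),
}


def is_normal(app_type, run_command, logic=STATE_CONFIRMATION_LOGIC):
    """Run the command registered for app_type and apply its normal check.

    run_command: callable taking the command string and returning the
    response text; how the command reaches the application is left open.
    """
    command, expected = logic[app_type]
    response = run_command(command)
    # Normal confirmation logic: the expected string appears in the response.
    return expected in response


# Example with a canned response standing in for a real application.
check_a = is_normal("A", lambda command: "status: AAA")
```

Keeping the command and the check as data lets the user register a new application type without changing the application state confirmation unit itself.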

In addition to the normal confirmation logic of this example, an abnormality confirmation logic may be associated with the state confirmation logic 35. In this case, the application state confirmation unit 33 checks the corresponding application for abnormality. In doing so, the application state confirmation unit 33 may acquire not only the fact that an abnormality has occurred but also information such as logs of the corresponding application.

FIG. 4 is a sequence diagram showing the flow from execution of host OS monitoring to the recovery process according to the present embodiment. In this example, the host OS 21 is monitored by sending pings from the host OS monitoring unit 31 of the monitoring device 3.

First, the host OS monitoring unit 31 of the monitoring device 3 sends a ping to the physical NIC of the computer 2-1 (step S20). If the host OS 21-1 is normal, it returns a response within a fixed time (step S21).
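The response-within-a-fixed-time check can be sketched in Python as follows. This is a minimal sketch under stated assumptions: `probe` is a hypothetical callable that stands in for the ping (or any other monitoring command), and `host_responds` and `hung_host` are names introduced only for this illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as ProbeTimeout


def host_responds(probe, timeout_s):
    """Return True if probe() completes within timeout_s seconds."""
    executor = ThreadPoolExecutor(max_workers=1)
    try:
        executor.submit(probe).result(timeout=timeout_s)
        return True
    except ProbeTimeout:
        # No response within the fixed time: the monitor sees a timeout.
        return False
    except Exception:
        # A probe that fails outright is also treated as no response.
        return False
    finally:
        # Do not block the monitor on a probe that is still hanging.
        executor.shutdown(wait=False)


def hung_host():
    time.sleep(0.3)  # stand-in for a host that answers too late


alive = host_responds(lambda: "pong", timeout_s=1.0)
timed_out = not host_responds(hung_host, timeout_s=0.05)
```

A timeout detected in this way does not by itself distinguish a hardware failure from a communication failure, which is why the state of the corresponding application is checked afterwards.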

When a ping is sent (step S22) and no response is returned within the fixed time because of, for example, a communication failure (step S23), the monitoring device 3 detects a timeout (step S24).
The application state confirmation unit 33 of the monitoring device 3 then checks the state of the corresponding application a, which is paired with the representative application A executed by the VM 23-1 deployed on the host OS 21-1 (step S25). As stated above, this timeout was caused by a communication failure; no hardware failure has occurred in the computer 2-1, and the representative application A is operating normally. The corresponding application a, which confirms the normal operation of the representative application A by polling, responds that duplex operation is in progress, with both the representative application A and the corresponding application a running (step S26).

Next, the application state confirmation unit 33 of the monitoring device 3 checks the state of the corresponding application B, which is paired with the other representative application b (step S27). Because no hardware failure has occurred in the computer 2-1, the corresponding application B also responds that duplex operation is in progress (step S28).

When all representative applications on the computer 2-1 are operating normally in this way, the monitoring device 3 does not move to the recovery process and returns to periodic monitoring (step S29). If the host OS 21-1 is normal, it responds to the monitoring within the fixed time (step S30).

Now consider the case where a failure has occurred in the hardware of the computer 2-1. In this case, the ping sent from the host OS monitoring unit 31 of the monitoring device 3 (step S31) receives no response within the fixed time, and the monitoring device 3 detects a timeout (step S32).
The application state confirmation unit 33 of the monitoring device 3 then checks the state of the corresponding application a, which is paired with the representative application A executed by the VM 23-1 deployed on the host OS 21-1 (step S33). Because of the hardware failure in the computer 2-1, the representative application A is not operating normally. The corresponding application a, which has detected the abnormality of the representative application A by polling, responds that single-system operation is in progress, with only the corresponding application a running (step S34).

Having detected a hardware failure of the computer 2-1 from the response of the corresponding application a, the monitoring device 3 moves to the recovery process and causes the virtual resource management function 4 to execute automatic recovery (step S35).

The embodiment has been described in detail to make the present invention easy to understand, and the invention is not necessarily limited to a form that includes every configuration described. Part of the configuration of each embodiment may have other configurations added to it, be deleted, or be replaced.
The mechanisms and configurations described above are those considered necessary for the explanation, and they do not necessarily represent all the mechanisms and configurations of a product.

As described above, the present embodiment can provide a redundant system and a hardware failure detection method that have a small risk of false failure detection and a small steady-state load for computer resources shared by multiple services running in a virtual environment. This hardware failure detection method requires no special hardware and can be implemented in software alone.

(Modifications)
The present invention is not limited to the above embodiment and can be modified without departing from its spirit. Examples include the following (a) to (c).
(a) Alive monitoring is not limited to ping; an arbitrary command may be sent to the host OS and its response monitored.
(b) Other alive monitoring methods are not limited to keep-alive; monitoring may be performed by any heartbeat communication.
(c) In the above embodiment, the state of the corresponding application paired with any one of the representative applications is checked and, if it is in single-system operation, all virtual machines on the computer running that representative application are automatically recovered. However, the invention is not limited to this; only the virtual machine associated with the representative application for which the failure was detected may be automatically recovered.
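The difference between the recovery scope of the embodiment and that of modification (c) can be illustrated with the following Python sketch. The function name `vms_to_recover`, the scope labels "all" and "failed_only", and the state strings are assumptions introduced for this illustration.

```python
def vms_to_recover(vm_states, scope="all"):
    """Choose which VMs on the suspect computer to regenerate elsewhere.

    vm_states: mapping from VM name to the state reported by the
    corresponding application of the application on that VM
    ("duplex" or "single"; assumed labels).
    scope: "all" follows the embodiment, "failed_only" follows (c).
    """
    failed = [vm for vm, state in vm_states.items() if state == "single"]
    if not failed:
        return []  # no single-system operation, so nothing to recover
    if scope == "all":
        return list(vm_states)  # every VM on the failed computer
    if scope == "failed_only":
        return failed  # only VMs whose application was found to have failed
    raise ValueError(f"unknown scope: {scope}")


states = {"23-1": "single", "23-2": "duplex"}
embodiment_targets = vms_to_recover(states)
modified_targets = vms_to_recover(states, scope="failed_only")
```

With the states above, the embodiment recovers both VMs on the computer, whereas modification (c) recovers only the VM 23-1.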

1 Redundant system
2-1, 2-2, 2-3 Computer (physical computer)
21-1, 21-2, 21-3 Host OS
22-1, 22-2, 22-3 Hypervisor
23-1, 23-2, 23-3, 23-4, 23-5, 23-6 Virtual machine (VM) (logical computer)
3 Monitoring device
31 Host OS monitoring unit
32 Representative application selection unit
33 Application state confirmation unit
34 Representative application candidate database (representative application candidate DB)
35 State confirmation logic
4 Recovery processing unit (virtual resource management function)
41 Computer IP address list
A, B, C Application (active application) (representative application) (corresponding application)
a, b, c Application (standby application) (representative application) (corresponding application)

Claims

A redundant system including host OSs that run on a plurality of physical computers, virtual machines, and duplex-configured applications executed by the virtual machines, the redundant system comprising:
a host OS monitoring unit that performs alive monitoring of each of the host OSs;
a representative application selection unit that, when the host OS monitoring unit detects that the response of any of the host OSs has been interrupted, selects a representative application executed by a virtual machine on the detected host OS and further selects a corresponding application that is paired with the representative application; and
an application state confirmation unit that confirms whether the state of the corresponding application selected by the representative application selection unit is single-system operation.

The redundant system according to claim 1, further comprising a recovery processing unit that, when any one of the corresponding applications confirmed by the application state confirmation unit is in single-system operation, recovers, on another host OS, the virtual machines on the host OS detected by the host OS monitoring unit.

The redundant system according to claim 1, further comprising a representative application candidate database that stores the correspondence among each physical computer, the virtual machines embodied on the physical computer, the representative application executed by each virtual machine, and the corresponding application that is paired with the representative application.

The redundant system according to claim 1, further comprising a state confirmation logic that stores a state confirmation command for confirming whether the state of the representative application is single-system operation, and a response to the state confirmation command.

The redundant system according to claim 1, wherein each of the virtual machines is embodied by a hypervisor operating on the host OS.

The redundant system according to claim 1, wherein each virtual machine that executes the application and a virtual machine that executes the host OS are embodied by a hypervisor operating on each physical computer.

A hardware failure detection method for a redundant system including host OSs that run on a plurality of physical computers, virtual machines, and duplex-configured applications executed by the virtual machines, wherein:
a host OS monitoring unit performs alive monitoring of each of the host OSs;
when the host OS monitoring unit detects that the response of any of the host OSs has been interrupted, a representative application selection unit selects a representative application executed by a virtual machine on the detected host OS and selects a corresponding application that is paired with the representative application; and
an application state confirmation unit confirms whether the state of the corresponding application is single-system operation.

The hardware failure detection method for a redundant system according to claim 7, wherein, when any one of the representative applications confirmed by the application state confirmation unit is in single-system operation, a recovery processing unit recovers, on another host OS, the virtual machines on the host OS whose response was detected to have been interrupted.