JP6464704B2 - Fault tolerant system, active device, standby device, failover method, and failover program

Info

Publication number: JP6464704B2
Application number: JP2014243699A
Authority: JP
Inventors: 大介木本
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2014-12-02
Filing date: 2014-12-02
Publication date: 2019-02-06
Anticipated expiration: 2034-12-02
Also published as: JP2016110173A

Description

The present invention relates to a technique for realizing fault tolerance of virtual machines.

A server system that provides a service requires fault tolerance so that the service continues even if a server fails. Such a server system therefore adopts a redundant configuration consisting of an active server and a standby server. Even when the servers are virtual machines, a known technique realizes fault tolerance by running a standby virtual machine on a server different from the one on which the active virtual machine runs.

For example, Patent Document 1 describes a fault-tolerant system for virtual machines that does not require a standby server. In this related technique, a plurality of virtual machines run on each of a plurality of servers. Specifically, each server runs one or more primary virtual machines and one or more secondary virtual machines. The secondary virtual machine paired with a primary virtual machine running on one server is configured to run on a different server. The data of the primary virtual machine is periodically transmitted from the server on which the primary virtual machine runs to the server on which the paired secondary virtual machine runs. In this way, the related technique realizes fault tolerance for the virtual machines running on each server without requiring a standby server.

Patent Document 2 describes a technique for reducing the amount of data transferred between an active virtual machine and a standby virtual machine. This related technique assumes that storage is shared between the machine on which the active virtual machine runs and the machine on which the standby virtual machine runs. When a read command is issued to the shared storage, the active virtual machine records the load position, the read position, and the read size of the read command, and transfers these three values to the standby side. The standby virtual machine then loads the data on the shared storage into its virtual machine memory according to the received load position, read position, and read size. As a result, this related technique does not need to transfer the contents of the virtual machine memory itself between the active and standby virtual machines, which reduces the amount of data transferred.

Patent Document 2 also states that this related technique can be applied to realizing fault tolerance for a plurality of virtual machines. In that case, a virtual machine synchronized with a virtual machine running on one active machine and a virtual machine synchronized with a virtual machine running on another active machine both run on a single standby device.

Patent Document 1: JP 2012-190175 A
Patent Document 2: JP 2012-221064 A

However, the techniques described in Patent Document 1 and Patent Document 2 have the problem that the cost of running the standby virtual machines is high. Hereinafter, the cost of running the standby virtual machines is also referred to simply as the standby cost.

For example, the related technique described in Patent Document 1 requires as many standby virtual machines as there are active virtual machines, that is, N standby virtual machines for N active virtual machines. Furthermore, because this related technique distributes a total of 2N virtual machines across servers, it requires a plurality of machines that are each capable of running a plurality of virtual machines. A machine capable of running a plurality of virtual machines is generally expensive. Consequently, with this related technique the standby cost increases as the number of active virtual machines increases.

The related technique described in Patent Document 2 requires N standby machines for N active virtual machines. Consequently, the more active virtual machines there are, the more standby devices are needed, and the standby cost increases.

Alternatively, the related technique described in Patent Document 2 requires, for N active virtual machines, a standby machine capable of running N virtual machines. As noted above, a machine capable of running a plurality of virtual machines is generally expensive. In addition, the more active virtual machines there are, the more resources the standby machine needs in order to run the corresponding number of virtual machines. Consequently, with this related technique the standby cost increases as the number of active virtual machines increases.

Thus, in the related techniques described above, the machines or resources needed to run the standby virtual machines grow as the scale of the system grows. The resulting problem is that the standby cost (machine cost, maintenance cost, power cost, and so on) increases.

The present invention has been made to solve the above problem. That is, an object of the present invention is to provide a technique that realizes fault tolerance while suppressing the standby cost even when the scale of a server system composed of virtual machines increases.

The fault-tolerant system of the present invention includes: a shared storage shared by an active device and a standby device; a memory area data duplication unit that duplicates the contents of the virtual machine memory area (memory area data) of an active virtual machine running on the active device by transferring them from the active device to the shared storage; a failure detection unit that detects the occurrence of a failure in the active device; a memory area data acquisition unit that transfers, from the shared storage to the standby device, the memory area data of the active virtual machine running on the active device in which the failure was detected; and a switching unit that, when the failure detection unit detects the failure, runs a standby virtual machine on the standby device using the memory area data acquired by the memory area data acquisition unit and switches over so that the functions of the active virtual machine running on the active device in which the failure was detected are continued by the standby virtual machine.

The active device of the present invention includes the memory area data duplication unit of the fault-tolerant system described above.

The standby device of the present invention includes the memory area data acquisition unit and the switching unit of the fault-tolerant system described above.

The failover method of the present invention uses a shared storage shared by an active device and a standby device and includes: duplicating the contents of the virtual machine memory area (memory area data) of an active virtual machine running on the active device by transferring them from the active device to the shared storage; upon detecting the occurrence of a failure in the active device, transferring the memory area data of the active virtual machine running on the active device in which the failure was detected from the shared storage to the standby device; running a standby virtual machine on the standby device using the memory area data transferred from the shared storage; and switching over so that the functions of the active virtual machine running on the active device in which the failure was detected are continued by the standby virtual machine.

In another failover method of the present invention, a standby device uses a shared storage in which the contents of the virtual machine memory area (memory area data) of an active virtual machine running on an active device have been duplicated. When the occurrence of a failure in the active device is detected, the standby device acquires from the shared storage the memory area data of the active virtual machine running on the active device in which the failure was detected, runs a standby virtual machine using the acquired memory area data, and switches over so that the functions of the active virtual machine running on the active device in which the failure was detected are continued by the standby virtual machine.

The failover program of the present invention causes a standby device to execute: a memory area data acquisition step of, when the occurrence of a failure in an active device is detected, acquiring the memory area data of the active virtual machine running on the active device in which the failure was detected from a shared storage in which the contents of the virtual machine memory area (memory area data) of the active virtual machine have been duplicated; a standby virtual machine operation step of running a standby virtual machine using the memory area data acquired in the memory area data acquisition step; and a switching step of switching over so that the functions of the active virtual machine running on the active device in which the failure was detected are continued by the standby virtual machine.

In yet another failover method of the present invention, an active device duplicates the contents of the virtual machine memory area (memory area data) of an active virtual machine running on the active device itself by transferring them to a shared storage shared by the active device and a standby device.

Another failover program of the present invention causes an active device to execute a memory area data duplication step of duplicating the contents of the virtual machine memory area (memory area data) of an active virtual machine running on the active device itself by transferring them to a shared storage shared by the active device and a standby device.

The present invention can provide a technique that realizes fault tolerance while suppressing the standby cost even when the scale of a server system composed of virtual machines increases.

A block diagram showing the configuration of a fault-tolerant system according to a first embodiment of the present invention.
A diagram showing an example of the internal configuration of the fault-tolerant system according to the first embodiment of the present invention.
A diagram showing the functional block configuration of the fault-tolerant system according to the first embodiment of the present invention.
A flowchart illustrating the memory area data duplication operation of the fault-tolerant system according to the first embodiment of the present invention.
A flowchart illustrating the failure detection operation of the fault-tolerant system according to the first embodiment of the present invention.
A flowchart illustrating the failover operation of the fault-tolerant system according to the first embodiment of the present invention.
A diagram showing the functional block configuration of a fault-tolerant system according to a second embodiment of the present invention.
A flowchart illustrating the failure detection operation of the fault-tolerant system according to the second embodiment of the present invention.
A diagram showing the functional block configuration of a fault-tolerant system according to a third embodiment of the present invention.
A flowchart illustrating the failover operation of the fault-tolerant system according to the third embodiment of the present invention.
A diagram showing the functional block configuration of a fault-tolerant system according to a fourth embodiment of the present invention.
A flowchart illustrating the failure detection and failover operation of the fault-tolerant system according to the fourth embodiment of the present invention.
A diagram showing the functional block configuration of a fault-tolerant system according to a fifth embodiment of the present invention.
A flowchart illustrating the failure detection and failover operation of the fault-tolerant system according to the fifth embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

(First Embodiment)
FIG. 1 shows the configuration of a fault-tolerant system 1 according to a first embodiment of the present invention. In FIG. 1, the fault-tolerant system 1 includes one or more active devices 10, a standby device 20, a shared storage 30, and a failure detection device 40. Although FIG. 1 shows three active devices, one standby device, one shared storage, and one failure detection device, this does not limit the number of each device included in the fault-tolerant system of the present invention. The active devices 10 and the standby device 20 are each communicably connected to the shared storage 30 via a network. The active devices 10 and the standby device 20 are also each communicably connected to the failure detection device 40 via a network.

FIG. 2 shows an example of the hardware configuration of each device constituting the fault-tolerant system 1.

In FIG. 2, the active device 10 can be configured as a computer device including hardware elements such as a CPU (Central Processing Unit) 1001, a memory 1002, a local storage 1003, and a network interface 1004. The memory 1002 is composed of a RAM (Random Access Memory), a ROM (Read Only Memory), and the like. The local storage 1003 stores a virtual machine control program, an active-side fault-tolerant program, and the like. The network interface 1004 is an interface for connecting to the network.

The CPU 1001 reads the virtual machine control program from the local storage 1003 and executes it. As a result, a hypervisor 101 runs on the memory 1002. The hypervisor 101 builds and runs a virtual machine 100 on the memory 1002. Hereinafter, the virtual machine 100 running on the active device 10 is also referred to as the active virtual machine 100. The active virtual machine 100 operates using a virtual machine memory area 102 reserved in the memory 1002. The local data of the active virtual machine 100 is stored in the shared storage 30.

The CPU 1001 also reads the active-side fault-tolerant program from the local storage 1003 and executes it. In this way, the CPU 1001 realizes the functional blocks of the active device 10 described later while controlling the network interface 1004.

The standby device 20 can be configured as a computer device including a CPU 2001, a memory 2002, a local storage 2003, a network interface 2004, and the like. The memory 2002 is composed of a RAM, a ROM, and the like. The local storage 2003 stores a virtual machine control program, a standby-side fault-tolerant program, and the like. The network interface 2004 is an interface for connecting to the network.

The CPU 2001 reads the virtual machine control program from the memory 2002 and executes it. As a result, a hypervisor 201 runs on the memory 2002. The hypervisor 201 is able to build and run a virtual machine 200 on the memory 2002. However, during normal operation, when no failure has occurred in any active device 10, the virtual machine 200 of the standby device 20 is stopped. Hereinafter, the virtual machine 200 that can run on the standby device 20 is also referred to as the standby virtual machine 200. The standby virtual machine 200 operates using a virtual machine memory area 202 reserved in the memory 2002. The standby virtual machine 200 is configured so that it can operate using the local data of any of the active virtual machines 100 stored in the shared storage 30.

The CPU 2001 also reads the standby-side fault-tolerant program from the memory 2002 and executes it. In this way, the CPU 2001 realizes the functional blocks of the standby device 20 described later while controlling the network interface 2004.

The shared storage 30 is connected to the network and is shared by the active devices 10 and the standby device 20. The network interface that connects the shared storage 30 to the network, and hardware elements such as the processor and memory that control its operation, are omitted from the drawing. For example, the shared storage 30 may be configured as a SAN (Storage Area Network) system or a NAS (Network Attached Storage). The shared storage 30 may also be a device that functions as a RAID (Redundant Arrays of Inexpensive Disks).

The failure detection device 40 can be configured as a computer device including hardware elements such as a CPU 4001, a memory 4002, a local storage 4003, and a network interface 4004. The memory 4002 is composed of a RAM, a ROM, and the like. The local storage 4003 stores a program for operating the failure detection device 40 of the present embodiment. The network interface 4004 is an interface for connecting to the network. By reading the program from the local storage 4003 and executing it, the CPU 4001 realizes the functions of the failure detection device 40 described later while controlling the network interface 4004.

Note that the hardware configuration of each device constituting the fault-tolerant system 1, and of each of its functional blocks, is not limited to the configuration described above.

Next, FIG. 3 shows the functional block configuration of the fault-tolerant system 1. In FIG. 3, the active device 10 includes a memory area data duplication unit 11. The standby device 20 includes a memory area data acquisition unit 21 and a switching unit 22. The failure detection device 40 includes a failure detection unit 41.

The memory area data duplication unit 11 duplicates data by storing the contents (memory area data) of the virtual machine memory area 102 used by the active virtual machine 100 in the shared storage 30. The memory area data includes the data of the virtual memory (main storage) used by the active virtual machine 100, the CPU context, the contents of the network transmission buffer, and the like. The memory area data duplication unit 11 stores the memory area data in the shared storage 30 in such a way that it can be identified as information of the virtual machine on its own device. For example, the memory area data duplication unit 11 may store the memory area data in the shared storage 30 in association with the identification information of the active virtual machine 100. The memory area data duplication unit 11 may also, for example, lazily copy the memory area data to the shared storage 30 at predetermined timings, or lazily copy only the dirty pages of the memory area data. The memory area data duplication unit 11 can employ any of various known techniques for storing the memory area data in the shared storage 30 and thereby duplicating it.
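
As an illustration only, the dirty-page variant described above might be sketched as follows. This is a minimal in-memory model under stated assumptions, not the implementation of the embodiment: the names `VmMemoryArea`, `SharedStorage`, and `duplicate_dirty_pages` are hypothetical, pages are modeled as a dictionary, and the CPU context and network transmission buffer are omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class VmMemoryArea:
    """Hypothetical model of a virtual machine memory area (cf. area 102)."""
    vm_id: str
    pages: Dict[int, bytes] = field(default_factory=dict)
    dirty: Set[int] = field(default_factory=set)

    def write(self, page_no: int, data: bytes) -> None:
        # A write marks the page dirty until the next duplication pass.
        self.pages[page_no] = data
        self.dirty.add(page_no)


class SharedStorage:
    """Hypothetical stand-in for the shared storage 30: one page map per VM."""

    def __init__(self) -> None:
        self.areas: Dict[str, Dict[int, bytes]] = {}

    def store_pages(self, vm_id: str, pages: Dict[int, bytes]) -> None:
        # Data is keyed by VM identifier so it can later be looked up per VM.
        self.areas.setdefault(vm_id, {}).update(pages)


def duplicate_dirty_pages(area: VmMemoryArea, storage: SharedStorage) -> int:
    """Copy only the dirty pages to shared storage and clear the dirty set.

    Returns the number of pages copied in this pass.
    """
    changed = {n: area.pages[n] for n in sorted(area.dirty)}
    storage.store_pages(area.vm_id, changed)
    area.dirty.clear()
    return len(changed)
```

In this sketch a caller would invoke `duplicate_dirty_pages` at each predetermined timing; a pass with no intervening writes copies nothing.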

The shared storage 30 stores the memory area data of each active virtual machine 100 separately.

The failure detection unit 41 detects the occurrence of a failure in an active device 10. Any of various known methods for detecting a physical failure of a computer device can be employed for this purpose. The failure detection unit 41 also notifies the standby device 20 of detection information indicating that a failure has occurred in the active device 10. For example, the failure detection unit 41 may notify the standby device 20 of information identifying the failed active device 10 as the detection information.

The memory area data acquisition unit 21 acquires, from the shared storage 30, the memory area data of the active virtual machine 100 running on the active device 10 in which the failure was detected. For example, the memory area data acquisition unit 21 saves the acquired memory area data in the virtual machine memory area 202 in the memory 2002 of its own device. Hereinafter, the active virtual machine 100 running on the active device 10 in which the failure was detected is also referred to simply as the "failed active virtual machine 100".

When the failure detection unit 41 detects the occurrence of a failure, the switching unit 22 runs the standby virtual machine 200 using the memory area data acquired by the memory area data acquisition unit 21. For example, the switching unit 22 may start the standby virtual machine 200 so that it operates using the virtual machine memory area 202 of its own device, in which the memory area data of the failed active virtual machine 100 has been saved. The switching unit 22 may also start the standby virtual machine 200 so that it uses, as its local data, the information stored in the shared storage 30 as the local data of the failed active virtual machine 100.

The switching unit 22 then performs switching processing (failover) so that the functions of the failed active virtual machine 100 are continued by the standby virtual machine 200 started on its own device.

The operation of the fault-tolerant system 1 configured as described above will now be described with reference to the drawings.

First, FIG. 4 shows the memory area data duplication operation of the fault-tolerant system 1.

In FIG. 4, the memory area data duplication unit 11 of the active device 10 first stores in the shared storage 30 the memory area data in use by the active virtual machine 100 running on its own device (step S1).

As described above, the memory area data duplication unit 11 may store only the dirty pages of the memory area data in the shared storage 30. The memory area data duplication unit 11 stores the memory area data in the shared storage 30 in such a way that it can be identified as information of its own device.

The memory area data duplication unit 11 repeats the above operation at predetermined intervals. As a result, the memory area data of each active virtual machine 100 running on each active device 10 is duplicated in the shared storage 30.

Next, FIG. 5 shows the failure detection operation of the fault-tolerant system 1.

In FIG. 5, the failure detection unit 41 of the failure detection device 40 first checks whether a failure has occurred in each active device 10 (step S11).

If there is an active device 10 in which a failure has occurred, the failure detection unit 41 notifies the standby device 20 of information identifying that active device 10 (step S12).

以上の動作を、障害検出部４１は、所定間隔毎に繰り返す。 The failure detection unit 41 repeats the above operation at predetermined intervals.
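For illustration only, one pass of the failure detection operation (steps S11 and S12) may be sketched as follows. The function name `detect_failures` and the callbacks `is_healthy` and `notify_standby` are assumptions introduced here; the disclosure does not specify how the health of an active device is checked.

```python
def detect_failures(device_ids, is_healthy, notify_standby):
    """One pass of steps S11-S12 (hypothetical sketch).

    device_ids     -- identifiers of the active devices 10
    is_healthy     -- assumed callback: device ID -> True if no failure is observed
    notify_standby -- assumed callback that delivers a failed device ID to the standby device 20
    """
    # Step S11: check each active device for a failure.
    failed = [d for d in device_ids if not is_healthy(d)]
    # Step S12: notify the standby device of every failed active device.
    for d in failed:
        notify_standby(d)
    return failed
```

The failure detection unit 41 would call such a function repeatedly at the predetermined interval.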

次に、フォールトトレラントシステム１のフェイルオーバー動作を図６に示す。 Next, the failover operation of the fault tolerant system 1 is shown in FIG.

図６では、まず、待機系装置２０は、障害検出装置４０から障害発生を通知されると（ステップＳ２１でＹ）、以下の動作を開始する。 In FIG. 6, first, the standby device 20 starts the following operation when a failure occurrence is notified from the failure detection device 40 (Y in step S21).

ここでは、まず、メモリ領域データ取得部２１は、共有ストレージ３０から、障害が発生した稼働系仮想マシン１００について記憶されているメモリ領域データを取得する（ステップＳ２２）。そして、メモリ領域データ取得部２１は、取得したメモリ領域データを、自装置の仮想マシン用メモリ領域２０２に展開する。 Here, first, the memory area data acquisition unit 21 acquires memory area data stored for the active virtual machine 100 in which a failure has occurred from the shared storage 30 (step S22). Then, the memory area data acquisition unit 21 expands the acquired memory area data in the virtual machine memory area 202 of the own device.

次に、切替部２２は、自装置の仮想マシン用メモリ領域２０２を用いて動作するよう待機系仮想マシン２００を起動する（ステップＳ２３）。 Next, the switching unit 22 activates the standby virtual machine 200 to operate using the virtual machine memory area 202 of the own device (step S23).

次に、切替部２２は、稼働系仮想マシン１００の機能を待機系仮想マシン２００で継続するよう切り替えを行う（ステップＳ２４）。 Next, the switching unit 22 performs switching so that the function of the active virtual machine 100 is continued in the standby virtual machine 200 (step S24).

以上で、フェイルオーバー動作の説明を終了する。 This is the end of the description of the failover operation.
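The failover sequence of steps S22 to S24 may be sketched, under stated assumptions, as follows. The function `fail_over` and its callbacks `boot_vm` and `redirect_service` are hypothetical; the shared storage is modeled as a dictionary keyed by (device ID, page number), which is only one possible way of keeping each device's data identifiable.

```python
def fail_over(failed_device_id, shared_pages, boot_vm, redirect_service):
    """Hypothetical sketch of steps S22-S24 on the standby device 20.

    shared_pages     -- dict mapping (device ID, page number) -> page data
    boot_vm          -- assumed callback that starts the standby VM on the given memory image
    redirect_service -- assumed callback that switches the service over to the started VM
    """
    # Step S22: acquire the memory area data stored for the failed device
    # and expand it into the local virtual machine memory area.
    vm_memory = {
        page_no: data
        for (dev, page_no), data in shared_pages.items()
        if dev == failed_device_id
    }
    # Step S23: start the standby virtual machine on that memory area.
    vm = boot_vm(vm_memory)
    # Step S24: continue the function of the failed active VM on the standby VM.
    redirect_service(failed_device_id, vm)
    return vm_memory
```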

次に、本発明の第１の実施の形態の効果について述べる。 Next, effects of the first exemplary embodiment of the present invention will be described.

本発明の第１の実施の形態としてのフォールトトレラントシステムは、仮想マシンによって構成されるサーバシステムの規模が増大しても、待機系にかかるコストを抑えてフォールトトレランスを実現することができる。 The fault-tolerant system as the first embodiment of the present invention can realize fault tolerance while suppressing the cost of the standby system even if the scale of the server system configured by virtual machines increases.

その理由は、それぞれの稼働系装置のメモリ領域データ二重化部が、稼働系仮想マシンのメモリ領域データを、共有ストレージに記憶させることによって二重化するからである。そして、障害検出装置によって稼働系装置の障害が検出されると、待機系装置のメモリ領域データ取得部が、共有ストレージから、障害発生が検出された稼働系装置上の仮想マシンのメモリ領域データを取得する。そして、待機系装置の切替部が、取得したメモリ領域データを用いて待機系仮想マシンを起動し、障害が発生した稼働系仮想マシンの機能を待機系仮想マシンで継続するよう切り替えるからである。 This is because the memory area data duplication unit of each active device duplicates the memory area data of the active virtual machine by storing it in the shared storage. When the failure detection device detects a failure of an active device, the memory area data acquisition unit of the standby device acquires, from the shared storage, the memory area data of the virtual machine on the active device in which the failure was detected. The switching unit of the standby device then starts the standby virtual machine using the acquired memory area data and switches over so that the function of the failed active virtual machine continues on the standby virtual machine.

このように、本実施の形態は、共有ストレージに各稼働系仮想マシンのメモリ領域データを記憶することによって二重化し、障害発生時に、該当する稼働系仮想マシンのメモリ領域データを待機系装置に転送して稼働系仮想マシンを起動し切替を行う。そのため、本実施の形態は、稼働系仮想マシンと同数の待機系仮想マシンを待機させておく必要がなく、少なくとも１台の待機系装置があればよい。また、待機系装置は、少なくとも１つの待機系仮想マシンを動作可能な性能があればよく、大量のリソースを必要としない。例えば、待機系装置は、稼働系装置と略同性能であればよい。その結果本実施の形態は、稼働系仮想マシン数に関わらず、待機系にかかるコストを抑えてフォールトトレランスを実現することができる。例えば、本実施の形態は、障害の発生がそれほど頻繁でなく、稼働系と待機系とが１対１の場合では待機系が待機状態となっていることが多いシステムにおいて、特に効果を奏する。 As described above, this embodiment duplicates the memory area data of each active virtual machine by storing it in the shared storage and, when a failure occurs, transfers the memory area data of the affected active virtual machine to the standby device, starts the virtual machine there, and switches over. Therefore, this embodiment does not need to keep as many standby virtual machines waiting as there are active virtual machines; at least one standby device suffices. In addition, the standby device only needs enough performance to run at least one standby virtual machine and does not require a large amount of resources. For example, the standby device may have substantially the same performance as an active device. As a result, this embodiment can realize fault tolerance while suppressing the cost of the standby system regardless of the number of active virtual machines. For example, this embodiment is particularly effective in a system in which failures are not very frequent and in which, if active and standby systems were paired one-to-one, the standby system would spend most of its time idle.

（第２の実施の形態） (Second Embodiment)
次に、本発明の第２の実施の形態について図面を参照して詳細に説明する。本発明の第１の実施の形態では、本発明の障害検出部の一実施形態である障害検出装置が、稼働系装置の障害を検出して待機系装置に通知する構成について説明した。本実施の形態では、本発明の障害検出部の一実施形態を、稼働系装置および待機系装置に分散して配置する例について説明する。 Next, a second embodiment of the present invention will be described in detail with reference to the drawings. The first embodiment described a configuration in which the failure detection device, which is one embodiment of the failure detection unit of the present invention, detects a failure in an active device and notifies the standby device. This embodiment describes an example in which the failure detection unit is distributed across the active devices and the standby device.

なお、本実施の形態の説明において参照する各図面において、本発明の第１の実施の形態と同一の構成および同様に動作するステップには同一の符号を付して本実施の形態における詳細な説明を省略する。 Note that, in the drawings referred to in the description of this embodiment, components identical to those of the first embodiment and steps that operate in the same manner are given the same reference numerals, and their detailed description is omitted here.

まず、本発明の第２の実施の形態としてのフォールトトレラントシステム２の構成を図７に示す。図７において、フォールトトレラントシステム２は、本発明の第１の実施の形態としてのフォールトトレラントシステム１に対して、稼働系装置１０に替えて稼働系装置５０と、待機系装置２０に替えて待機系装置６０とを備え、障害検出装置４０を含まない点が異なる。なお、フォールトトレラントシステム２のハードウェア構成は、図２に示した本発明の第１の実施の形態のハードウェア要素のうち、障害検出装置４０を除いたハードウェア要素によって構成可能である。ただし、フォールトトレラントシステム２を構成する各装置およびその各機能ブロックのハードウェア構成は、上述の構成に限定されない。 First, FIG. 7 shows the configuration of a fault tolerant system 2 as a second embodiment of the present invention. In FIG. 7, the fault tolerant system 2 differs from the fault tolerant system 1 of the first embodiment in that it includes an active device 50 in place of the active device 10 and a standby device 60 in place of the standby device 20, and does not include the failure detection device 40. The hardware of the fault tolerant system 2 can be configured from the hardware elements of the first embodiment shown in FIG. 2, excluding the failure detection device 40. However, the hardware configuration of each device constituting the fault tolerant system 2 and of each of its functional blocks is not limited to the configuration described above.

稼働系装置５０は、本発明の第１の実施の形態としての稼働系装置１０と同一の構成に加えて、障害検出部５２を備える。 The active system device 50 includes a failure detection unit 52 in addition to the same configuration as that of the active system device 10 according to the first exemplary embodiment of the present invention.

障害検出部５２は、自装置が正常に動作していることを確認する処理を行う。また、障害検出部５２は、自装置が正常に動作していることを確認した場合、正常動作していることを表す情報（確認情報）を、待機系装置６０に送信する。例えば、障害検出部５２は、自装置が共有メモリを更新したか否かを所定間隔毎に確認してもよい。この場合、障害検出部５２は、共有メモリの更新を確認した場合、自装置が正常に動作しているとして、その旨を表す確認情報を、待機系装置６０に送信する。 The failure detection unit 52 performs processing for confirming that its own device is operating normally. Further, when the failure detection unit 52 confirms that the device itself is operating normally, the failure detection unit 52 transmits information (confirmation information) indicating that the device is operating normally to the standby system device 60. For example, the failure detection unit 52 may check whether or not the own device has updated the shared memory at predetermined intervals. In this case, when the failure detection unit 52 confirms the update of the shared memory, the failure detection unit 52 transmits confirmation information indicating that to the standby system device 60, assuming that the device itself is operating normally.

待機系装置６０は、本発明の第１の実施の形態としての待機系装置２０と同一の構成に加えて、障害検出部６３を備える。 The standby system device 60 includes a failure detection unit 63 in addition to the same configuration as that of the standby system device 20 according to the first embodiment of the present invention.

障害検出部６３は、稼働系装置５０からの確認情報の受信状況に基づいて、稼働系装置５０における障害発生を検出する。例えば、障害検出部６３は、稼働系装置５０から定期的に送信されるはずの確認情報を、所定期間受信していない場合に、その稼働系装置５０において障害が発生したと判断してもよい。 The failure detection unit 63 detects the occurrence of a failure in the active device 50 based on the reception status of confirmation information from the active device 50. For example, the failure detection unit 63 may determine that a failure has occurred in an active device 50 when confirmation information that should be transmitted periodically from that device has not been received for a predetermined period.

以上のように構成されたフォールトトレラントシステム２の動作について、図面を参照して説明する。なお、フォールトトレラントシステム２のメモリ領域データ二重化動作およびフェイルオーバー動作については、図４および図６を参照して説明した本発明の第１の実施の形態の動作と同様であるため、本実施の形態における説明を省略する。ここでは、フォールトトレラントシステム２の障害検出動作を、図８に示す。なお、図８では、左図は稼働系装置５０の動作を示し、右図は待機系装置６０の動作を示す。また、左右を結ぶ破線の矢印は、データの流れを示すものとする。 The operation of the fault tolerant system 2 configured as described above will be described with reference to the drawings. The memory area data duplication operation and the failover operation of the fault tolerant system 2 are the same as those of the first embodiment described with reference to FIG. 4 and FIG. 6, so their description is omitted here. FIG. 8 shows the failure detection operation of the fault tolerant system 2. In FIG. 8, the left side shows the operation of the active device 50 and the right side shows the operation of the standby device 60. The broken arrows connecting the left and right sides indicate the flow of data.

図８において、まず、稼働系装置５０の障害検出部５２は、自装置が正常動作しているか否かを確認し、確認情報を待機系装置６０に送信する（ステップＳ３１）。 In FIG. 8, first, the failure detection unit 52 of the active device 50 confirms whether or not the device itself is operating normally, and transmits confirmation information to the standby device 60 (step S31).

なお、稼働系装置５０が正常動作していない場合、このステップは実行されない。 Note that this step is not executed when the active device 50 is not operating normally.

そして、障害検出部５２は、所定間隔毎に、ステップＳ３１の動作を繰り返す。 The failure detection unit 52 then repeats the operation of step S31 at predetermined intervals.

次に、待機系装置６０の障害検出部６３は、稼働系装置５０から確認情報を受信する（ステップＳ３３）。 Next, the failure detection unit 63 of the standby system device 60 receives confirmation information from the active system device 50 (step S33).

次に、障害検出部６３は、前回の確認情報の受信からの経過時間が所定時間以上となっている稼働系装置５０があるか否かを判断する（ステップＳ３４）。 Next, the failure detection unit 63 determines whether or not there is an active device 50 whose elapsed time from the reception of the previous confirmation information is equal to or longer than a predetermined time (step S34).

ここで、前回の確認情報の受信からの経過時間が所定時間以上となっている稼働系装置５０がある場合、障害検出部６３は、その稼働系装置５０に障害が発生したと判断する。そして、障害検出部６３は、障害が発生した稼働系装置５０を、切替部２２に通知する（ステップＳ３５）。 Here, when there is an active device 50 whose elapsed time from the reception of the previous confirmation information is equal to or longer than a predetermined time, the failure detection unit 63 determines that a failure has occurred in the active device 50. Then, the failure detection unit 63 notifies the switching unit 22 of the active device 50 in which the failure has occurred (step S35).

以上で、フォールトトレラントシステム２は、障害検出動作を終了する。 Thus, the fault tolerant system 2 ends the failure detection operation.
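By way of a hedged illustration, the receiving side of this operation (steps S33 to S35) may be sketched as follows. The class `HeartbeatMonitor` and its methods are hypothetical names, and timestamps are passed in explicitly (here as integer milliseconds) rather than read from a clock so that the behavior is easy to follow.

```python
class HeartbeatMonitor:
    """Illustrative counterpart of the failure detection unit 63 (hypothetical sketch)."""

    def __init__(self, timeout_ms):
        self.timeout_ms = timeout_ms
        self.last_seen = {}  # device ID -> time of the most recent confirmation information

    def receive(self, device_id, now_ms):
        """Step S33: record the arrival of confirmation information."""
        self.last_seen[device_id] = now_ms

    def failed_devices(self, now_ms):
        """Step S34: devices whose elapsed time since the last reception is at least the timeout."""
        return sorted(
            d for d, seen in self.last_seen.items()
            if now_ms - seen >= self.timeout_ms
        )
```

Each identifier returned by `failed_devices` would then be passed to the switching unit 22 (step S35). An active device that has never sent confirmation information is not tracked in this sketch; handling that case is left open.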

そして、障害の発生を通知された切替部２２は、メモリ領域データ取得部２１を用いて、図６におけるステップＳ２２〜Ｓ２４を実行し、フェイルオーバーを実施する。 Then, the switching unit 22 notified of the failure uses the memory area data acquisition unit 21 to execute steps S22 to S24 in FIG. 6 and performs failover.

なお、本実施の形態において、各稼働系装置５０の障害検出部５２は、自装置が正常動作していることを表す確認情報を、待機系装置６０だけでなく、さらに他の稼働系装置５０に送信してもよい。この場合、障害検出部５２は、他の稼働系装置５０からの確認情報の受信状況に基づいて、他の稼働系装置５０における障害発生を検出する。例えば、障害検出部５２は、他の稼働系装置５０から定期的に送信されるはずの確認情報を、所定期間受信していない場合に、その稼働系装置５０において障害が発生したと判断してもよい。そして、この場合、障害検出部５２は、障害発生を検出した他の稼働系装置５０を表す情報を、待機系装置６０に送信すればよい。 Note that, in this embodiment, the failure detection unit 52 of each active device 50 may transmit the confirmation information indicating that its own device is operating normally not only to the standby device 60 but also to the other active devices 50. In this case, the failure detection unit 52 detects the occurrence of a failure in another active device 50 based on the reception status of confirmation information from that device. For example, the failure detection unit 52 may determine that a failure has occurred in another active device 50 when confirmation information that should be transmitted periodically from that device has not been received for a predetermined period. In this case, the failure detection unit 52 may transmit, to the standby device 60, information identifying the other active device 50 in which it detected the failure.

また、さらに、本実施の形態において、待機系装置６０の障害検出部６３は、自装置が正常動作していることを表す確認情報を、稼働系装置５０に送信してもよい。この場合、稼働系装置５０の障害検出部５２は、待機系装置６０からの確認情報の受信状況に基づいて、待機系装置６０における障害発生を検出する。そして、待機系装置６０に障害発生を検出した場合、稼働系装置５０は、その旨を出力してもよい。あるいは、フォールトトレラントシステム２が複数の待機系装置６０を備える場合、稼働系装置５０の障害検出部５２は、障害発生の待機系装置６０を他の待機系装置６０に通知してもよい。これにより、障害が発生していない待機系装置６０が、フェイルオーバーを実施可能となる。 Further, in the present embodiment, the failure detection unit 63 of the standby system device 60 may transmit confirmation information indicating that the device itself is operating normally to the active system device 50. In this case, the failure detection unit 52 of the active device 50 detects the occurrence of a failure in the standby device 60 based on the reception status of confirmation information from the standby device 60. Then, when a failure occurrence is detected in the standby system device 60, the active system device 50 may output that fact. Alternatively, when the fault tolerant system 2 includes a plurality of standby system devices 60, the failure detection unit 52 of the active system device 50 may notify the standby system device 60 in which the failure has occurred to the other standby system devices 60. As a result, the standby device 60 in which no failure has occurred can perform failover.

次に、本発明の第２の実施の形態の効果について述べる。 Next, the effect of the second exemplary embodiment of the present invention will be described.

本発明の第２の実施の形態としてのフォールトトレラントシステムは、仮想サーバによって構成されるサーバシステムにおいて、待機系にかかるコストを抑えてフォールトトレランスを実現する際に、稼働系装置の障害をより確実に検出することができる。 The fault tolerant system according to the second embodiment of the present invention can detect a failure of an active device more reliably when realizing fault tolerance, at reduced standby cost, in a server system composed of virtual servers.

その理由は、稼働系装置の障害検出部が、自装置の正常動作を確認すると確認情報を待機系装置に送信し、待機系装置の障害検出部が、稼働系装置からの確認情報の受信状況に基づいて、稼働系装置における障害発生を検出するからである。 This is because the failure detection unit of the active device transmits confirmation information to the standby device upon confirming that its own device is operating normally, and the failure detection unit of the standby device detects the occurrence of a failure in the active device based on the reception status of that confirmation information.

また、さらには、稼働系装置の障害検出部が、他の稼働系装置との間で確認情報を送受信することにより他の稼働系装置における障害発生を検出する構成をとる場合の効果について説明する。この場合、本実施の形態は、稼働系装置の障害発生をさらに精度よく検出してフェイルオーバーを実施可能となる。なぜなら、待機系装置は、ある稼働系装置Ａからの確認情報の受信状況に加えて、他の稼働系装置Ｂからの稼働系装置Ａに関する障害発生の通知に基づいて、稼働系装置Ａの障害を検出可能となるからである。例えば、ある稼働系装置Ａが正常動作していても、待機系装置および稼働系装置Ａ間のネットワークに障害が発生する場合がある。このような場合、待機系装置は、稼働系装置Ａからの確認情報の受信状況に基づくだけでは、稼働系装置Ａに障害が発生していると判断することになる。このとき、稼働系装置Ｂからの稼働系装置Ａに関する障害発生の通知がなければ、待機系装置は、ネットワーク障害であると判断可能になるからである。 Furthermore, the effect obtained when the failure detection unit of an active device detects failures in other active devices by exchanging confirmation information with them will be described. In this case, this embodiment can detect a failure of an active device more accurately before performing failover. This is because the standby device can detect a failure of an active device A based not only on the reception status of confirmation information from the active device A but also on a failure notification concerning the active device A from another active device B. For example, even if the active device A is operating normally, a failure may occur in the network between the standby device and the active device A. In such a case, a standby device relying solely on the reception status of confirmation information from the active device A would determine that the active device A has failed. If, however, no failure notification concerning the active device A arrives from the active device B, the standby device can determine that the cause is a network failure.
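The decision described in this paragraph may be sketched as follows, purely as an illustration. The function name `diagnose` and the returned labels are hypothetical; the disclosure states only that the absence of a failure notification from another active device allows the standby device to judge that a network failure has occurred.

```python
def diagnose(device_id, heartbeat_overdue, peer_failure_reports):
    """Hypothetical sketch of how a standby device might combine both signals.

    heartbeat_overdue    -- True if confirmation information from device_id is overdue
    peer_failure_reports -- set of device IDs that other active devices reported as failed
    """
    if not heartbeat_overdue:
        return "normal"
    if device_id in peer_failure_reports:
        # Both the standby device and a peer lost contact: treat as a device failure.
        return "device_failure"
    # Only the standby device lost contact: the link to device_id is suspect.
    return "network_failure_suspected"
```

For example, `diagnose("A", True, {"A"})` yields `"device_failure"`, whereas `diagnose("A", True, set())` yields `"network_failure_suspected"`, in which case failover for device A could be withheld.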

また、さらには、待機系装置の障害検出部が、自装置が正常動作していることを表す確認情報を稼働系装置に送信し、稼働系装置が待機系装置の障害発生を検出する構成をとる場合の効果について説明する。この場合、本実施の形態は、待機系装置における障害発生を利用者に報知することができる。また、本実施の形態は、複数の待機系装置のうち正常動作している待機系装置を用いて、より確実にフェイルオーバーを実施することができる。 Furthermore, the failure detection unit of the standby device transmits confirmation information indicating that the device is operating normally to the active device, and the active device detects the occurrence of a failure of the standby device. The effect of taking this will be described. In this case, the present embodiment can notify the user of the occurrence of a failure in the standby system device. In addition, according to the present embodiment, failover can be more reliably performed using a standby system device that is operating normally among a plurality of standby system devices.

このように、本実施の形態は、障害検出部を各稼働系装置および待機系装置に分散配置して互いに正常動作を確認することにより、より精度よく、より確実にフォールトトレランスを実現することができる。 As described above, according to the present embodiment, fault tolerance can be more accurately and more reliably realized by distributing failure detection units to each active system device and standby system device and confirming normal operations with each other. it can.

（第３の実施の形態） (Third Embodiment)
次に、本発明の第３の実施の形態について図面を参照して詳細に説明する。本発明の第１および第２の実施の形態では、共有ストレージ３０上にある稼働系仮想マシンのメモリ領域データを、待機系の仮想マシン用メモリ領域へ書き込む処理を、フェイルオーバーを発生させるタイミングで行っていた。このため、本発明の第１および第２の実施の形態は、通常時から待機系を動作させておくシステムに比べて、待機系にかかるコストを削減できる代わりに、フェイルオーバーで発生するダウンタイムが増加するという課題があった。本実施の形態では、このような第１および第２の実施の形態の課題に対する対策を施す例について説明する。 Next, a third embodiment of the present invention will be described in detail with reference to the drawings. In the first and second embodiments, the memory area data of the active virtual machine held on the shared storage 30 is written to the memory area for the standby virtual machine at the time failover occurs. Consequently, compared with a system that keeps the standby system running at all times, the first and second embodiments reduce the cost of the standby system but increase the downtime incurred by failover. This embodiment describes an example that addresses this problem of the first and second embodiments.

なお、本実施の形態の説明において参照する各図面において、本発明の第１および第２の実施の形態と同一の構成および同様に動作するステップには同一の符号を付して本実施の形態における詳細な説明を省略する。 Note that, in the drawings referred to in the description of this embodiment, components identical to those of the first and second embodiments and steps that operate in the same manner are given the same reference numerals, and their detailed description is omitted here.

まず、本発明の第３の実施の形態としてのフォールトトレラントシステム３の構成を図９に示す。図９において、フォールトトレラントシステム３は、本発明の第２の実施の形態としてのフォールトトレラントシステム２に対して、待機系装置６０に替えて待機系装置７０を備える点が異なる。また、待機系装置７０は、本発明の第２の実施の形態における待機系装置６０に対して、メモリ領域データ取得部２１に替えてメモリ領域データ取得部７１を備える点が異なる。なお、フォールトトレラントシステム３のハードウェア構成は、図２に示した本発明の第１の実施の形態のハードウェア要素のうち、障害検出装置４０を除いたハードウェア要素によって構成可能である。ただし、フォールトトレラントシステム３を構成する各装置およびその各機能ブロックのハードウェア構成は、上述の構成に限定されない。 First, FIG. 9 shows a configuration of a fault tolerant system 3 as a third embodiment of the present invention. In FIG. 9, the fault tolerant system 3 is different from the fault tolerant system 2 as the second embodiment of the present invention in that a standby system device 70 is provided instead of the standby system device 60. The standby system device 70 is different from the standby system device 60 according to the second embodiment of the present invention in that a memory area data acquisition unit 71 is provided instead of the memory area data acquisition unit 21. The hardware configuration of the fault tolerant system 3 can be configured by hardware elements excluding the failure detection device 40 among the hardware elements of the first embodiment of the present invention shown in FIG. However, the hardware configuration of each device and each functional block constituting the fault-tolerant system 3 is not limited to the above-described configuration.

メモリ領域データ取得部７１は、稼働系装置５０に障害が発生していない通常時に、共有ストレージ３０に記憶される各稼働系仮想マシン１００のメモリ領域データについて、それぞれの一部を取得しておく。詳細には、メモリ領域データ取得部７１は、通常時において、所定タイミング毎に、各メモリ領域データの一部を取得し更新すればよい。また、メモリ領域データ取得部７１は、取得した各稼働系仮想マシン１００のメモリ領域データの一部を、自装置のメモリ２００２内の仮想マシン用メモリ領域２０２に書き込んでおく。つまり、仮想マシン用メモリ領域２０２には、各稼働系仮想マシン１００のメモリ領域データの一部が、保存され更新されることになる。各メモリ領域データのうち取得する一部分のサイズについては、あらかじめ定められていてもよい。例えば、各稼働系仮想マシン１００に割り当てられた仮想マシン用メモリ領域１０２のサイズが３ＧＢ（ギガバイト）である場合、メモリ領域データ取得部７１は、そのうち１ＧＢずつを所定タイミング毎に取得・更新してもよい。 During normal operation, when no failure has occurred in the active devices 50, the memory area data acquisition unit 71 acquires in advance a part of the memory area data of each active virtual machine 100 stored in the shared storage 30. Specifically, during normal operation, the memory area data acquisition unit 71 may acquire and update a part of each set of memory area data at predetermined timings. The memory area data acquisition unit 71 also writes the acquired part of the memory area data of each active virtual machine 100 into the virtual machine memory area 202 in the memory 2002 of its own device. That is, a part of the memory area data of each active virtual machine 100 is stored and kept updated in the virtual machine memory area 202. The size of the part to be acquired from each set of memory area data may be determined in advance. For example, when the virtual machine memory area 102 allocated to each active virtual machine 100 is 3 GB (gigabytes) in size, the memory area data acquisition unit 71 may acquire and update 1 GB of each at the predetermined timings.

また、例えば、メモリ領域データのうち取得する一部の内容は、更新頻度に基づき決定されてもよい。例えば、メモリ領域データ取得部７１は、共有ストレージ３０上に記憶されている稼働系仮想マシン１００のメモリ領域データのうち、更新頻度の少ない部分を優先して取得してもよい。具体的には、メモリ領域データ取得部７１は、共有ストレージ３０上に記憶されている稼働系仮想マシン１００のメモリ領域データのうち、所定の低頻度更新条件を満たすデータを取得してもよい。あるいは、メモリ領域データ取得部７１は、共有ストレージ３０上に記憶されている稼働系仮想マシン１００のメモリ領域データのうち、その更新頻度の低い方から順に所定サイズまでのデータを取得してもよい。 Further, for example, which part of the memory area data is acquired may be determined based on update frequency. For example, the memory area data acquisition unit 71 may preferentially acquire the infrequently updated portion of the memory area data of the active virtual machine 100 stored on the shared storage 30. Specifically, the memory area data acquisition unit 71 may acquire, from that memory area data, the data that satisfies a predetermined low-update-frequency condition. Alternatively, the memory area data acquisition unit 71 may acquire data from that memory area data in ascending order of update frequency, up to a predetermined size.
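The last alternative, selecting data in ascending order of update frequency up to a predetermined size, may be sketched as follows. This is only an illustration under assumptions not found in the disclosure: memory is divided into fixed-size pages, an update count per page is available, and the names `select_prefetch_pages`, `update_counts`, `page_size`, and `budget_bytes` are hypothetical.

```python
def select_prefetch_pages(update_counts, page_size, budget_bytes):
    """Pick the least frequently updated pages that fit within a size budget.

    update_counts -- dict mapping page number -> number of updates observed (assumed metric)
    page_size     -- size of one page in bytes
    budget_bytes  -- predetermined size of the part to acquire in advance
    """
    max_pages = budget_bytes // page_size
    # Sort by update count, breaking ties by page number for a stable result.
    ranked = sorted(update_counts, key=lambda page: (update_counts[page], page))
    return ranked[:max_pages]
```

With `update_counts = {0: 5, 1: 0, 2: 9, 3: 1}`, a page size of 4096 bytes, and a budget of 8192 bytes, the function selects pages 1 and 3.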

そして、メモリ領域データ取得部７１は、障害発生時には、障害発生が検出された稼働系装置５０において動作する稼働系仮想マシン１００のメモリ領域データの残りを、共有ストレージ３０から取得する。例えば、前述のように、各稼働系仮想マシン１００のメモリ領域データ３ＧＢのうち１ＧＢずつを仮想マシン用メモリ領域２０２にロードしていた場合について説明する。この場合、メモリ領域データ取得部７１は、仮想マシン用メモリ領域２０２にロードされているデータのうち、障害が発生していない稼働系仮想マシン１００のメモリ領域データを破棄する。そして、メモリ領域データ取得部７１は、障害が発生した稼働系仮想マシン１００のメモリ領域データの残り２ＧＢを、共有ストレージ３０から取得して仮想マシン用メモリ領域２０２にロードすればよい。 Then, when a failure occurs, the memory area data acquisition unit 71 acquires from the shared storage 30 the remaining memory area data of the active virtual machine 100 that operates in the active device 50 in which the failure has been detected. For example, as described above, a case where 1 GB of the memory area data 3 GB of each active virtual machine 100 is loaded into the virtual machine memory area 202 will be described. In this case, the memory area data acquisition unit 71 discards the memory area data of the active virtual machine 100 in which no failure has occurred among the data loaded in the virtual machine memory area 202. Then, the memory area data acquisition unit 71 may acquire the remaining 2 GB of memory area data of the active virtual machine 100 in which a failure has occurred from the shared storage 30 and load it into the virtual machine memory area 202.
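The two-phase acquisition described above may be sketched as follows, as an illustration only. The class `StandbyPrefetcher` and its methods are hypothetical, the shared storage is modeled as a dictionary keyed by (device ID, page number), and the sketch assumes that the part acquired in advance is still current when the failure occurs.

```python
class StandbyPrefetcher:
    """Illustrative counterpart of the memory area data acquisition unit 71 (hypothetical)."""

    def __init__(self, shared_pages):
        self.shared_pages = shared_pages  # (device ID, page number) -> page data
        self.local = {}                   # part already loaded into the VM memory area

    def prefetch(self, keys):
        """Normal operation: load (or refresh) the selected part of each VM's data."""
        for key in keys:
            self.local[key] = self.shared_pages[key]

    def complete(self, failed_device_id):
        """On failure: discard the other VMs' data and fetch only the remainder."""
        self.local = {
            key: data for key, data in self.local.items()
            if key[0] == failed_device_id
        }
        fetched = 0
        for key, data in self.shared_pages.items():
            if key[0] == failed_device_id and key not in self.local:
                self.local[key] = data
                fetched += 1
        return fetched  # number of pages that had to be transferred at failure time
```

In the 3 GB example above, `complete` would transfer only the remaining 2 GB for the failed virtual machine, which is the source of the reduced downtime.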

以上のように構成されたフォールトトレラントシステム３の動作について、図面を参照して説明する。なお、メモリ領域データ二重化動作については、図４を参照して説明した本発明の第１の実施の形態と同様であるため、本実施の形態における説明を省略する。また、障害検出動作については、図８を参照して説明した本発明の第２の実施の形態と同様であるため、本実施の形態における説明を省略する。 The operation of the fault tolerant system 3 configured as described above will be described with reference to the drawings. The memory area data duplication operation is the same as that of the first embodiment described with reference to FIG. 4, so its description is omitted here. Likewise, the failure detection operation is the same as that of the second embodiment described with reference to FIG. 8, so its description is omitted here.

ここでは、フォールトトレラントシステム３のフェイルオーバー動作を図１０に示す。なお、待機系装置７０は、各稼働系装置５０について、以下の動作を実行するものとする。 Here, the failover operation of the fault tolerant system 3 is shown in FIG. Note that the standby system device 70 performs the following operation for each active system device 50.

図１０において、まず、メモリ領域データ取得部７１は、通常時において、共有ストレージ３０に記憶される稼働系仮想マシン１００のメモリ領域データについて、その一部を取得する（ステップＳ４１）。 In FIG. 10, first, during normal operation, the memory area data acquisition unit 71 acquires a part of the memory area data of the active virtual machine 100 stored in the shared storage 30 (step S41).

次に、メモリ領域データ取得部７１は、障害検出部６３により障害発生が検出されたか否かを判断する（ステップＳ４２）。 Next, the memory area data acquisition unit 71 determines whether or not the failure detection unit 63 has detected a failure (step S42).

ここで、この稼働系装置５０に障害が発生していなければ、メモリ領域データ取得部７１は、所定間隔経過後に、ステップＳ４１からの動作を繰り返す。 If no failure has occurred in the active system device 50, the memory area data acquisition unit 71 repeats the operation from step S41 after a predetermined interval has elapsed.

一方、この稼働系装置５０に障害が発生した場合、メモリ領域データ取得部７１は、次のように動作する。すなわち、メモリ領域データ取得部７１は、障害発生が検出されたこの稼働系装置５０において動作する稼働系仮想マシン１００のメモリ領域データの残りを、共有ストレージ３０から取得する（ステップＳ４４）。 On the other hand, when a failure occurs in the active device 50, the memory area data acquisition unit 71 operates as follows. That is, the memory area data acquisition unit 71 acquires the remaining memory area data of the active virtual machine 100 operating in the active system device 50 in which the failure has been detected from the shared storage 30 (step S44).

以降、待機系装置７０は、ステップＳ２３〜Ｓ２４まで本発明の第１の実施の形態と同様に動作する。これにより、障害が発生した稼働系仮想マシン１００のメモリ領域データを用いて、待機系仮想マシン２００が起動される。そして、稼働系仮想マシン１００から待機系仮想マシン２００に運用が切り替えられる。 Thereafter, the standby system device 70 operates in the same manner as in the first embodiment of the present invention from step S23 to S24. As a result, the standby virtual machine 200 is activated using the memory area data of the active virtual machine 100 in which the failure has occurred. Then, the operation is switched from the active virtual machine 100 to the standby virtual machine 200.

次に、本発明の第３の実施の形態の効果について述べる。 Next, effects of the third exemplary embodiment of the present invention will be described.

本発明の第３の実施の形態としてのフォールトトレラントシステムは、仮想サーバによって構成されるサーバシステムにおいて、待機系にかかるコストを抑えてフォールトトレランスを実現しながら、フェイルオーバー時のダウンタイムを短縮することができる。 The fault tolerant system according to the third embodiment of the present invention reduces the downtime at the time of failover in the server system constituted by virtual servers while realizing the fault tolerance while suppressing the cost of the standby system. be able to.

その理由について説明する。本実施の形態では、それぞれの稼働系装置のメモリ領域データ二重化部が、稼働系仮想マシンのメモリ領域データを、共有ストレージに記憶させることによって二重化する。そして、待機系装置のメモリ領域データ取得部が、通常時に、共有ストレージから各稼働系仮想マシンのメモリ領域データの一部を取得する。そして、稼働系装置に障害が検出されると、メモリ領域データ取得部が、障害を発生した稼働系装置上の仮想マシンのメモリ領域データの残りを、共有ストレージから取得するからである。そして、待機系装置の切替部が、通常時に取得しておいた一部および障害発生時に取得した残りを合わせたメモリ領域データを用いて待機系仮想マシンを起動し、障害が発生した稼働系仮想マシンの機能を待機系仮想マシンで継続するよう切り替えるからである。 The reason will be described. In the present embodiment, the memory area data duplication unit of each active system device duplicates the memory area data of the active virtual machine by storing it in the shared storage. Then, the memory area data acquisition unit of the standby system apparatus acquires a part of the memory area data of each active virtual machine from the shared storage at the normal time. When a failure is detected in the active device, the memory area data acquisition unit acquires the remaining memory area data of the virtual machine on the active device in which the failure has occurred from the shared storage. Then, the switching unit of the standby device starts up the standby virtual machine using the memory area data that combines the part acquired during normal operation and the rest acquired when a failure occurs, and the active virtual machine in which the failure occurred This is because the function of the machine is switched to continue in the standby virtual machine.

このように、本実施の形態は、共有ストレージに各稼働系仮想マシンのメモリ領域データを二重化しておき、障害発生時に、該当する稼働系仮想マシンを起動し切替を行うので、本発明の第１の実施の形態と同様に、少なくとも１台の待機系装置があればよい。さらに、本実施の形態は、障害が発生した稼働系仮想マシンのメモリ領域データを、全て障害発生時に待機系装置に転送するのではなく、通常時に一部を転送しておき、障害発生時には残りを転送する。これにより、本実施の形態は、待機系にかかるコストを削減しながらも、障害発生時に転送が必要となるメモリ領域データ量を減らして、ダウンタイムを短縮することができる。 As described above, this embodiment duplicates the memory area data of each active virtual machine in the shared storage and, when a failure occurs, starts a virtual machine for the affected active virtual machine and switches over, so, as in the first embodiment, at least one standby device suffices. Furthermore, rather than transferring all of the memory area data of the failed active virtual machine to the standby device at the time of the failure, this embodiment transfers a part of it during normal operation and transfers only the remainder when the failure occurs. As a result, this embodiment can shorten downtime by reducing the amount of memory area data that must be transferred at the time of a failure, while still reducing the cost of the standby system.

さらに、本実施の形態は、待機系装置のメモリ領域データ取得部が、通常時に共有ストレージから取得するメモリ領域データの一部として、更新頻度の少ないものを優先する場合、待機系装置の通常時におけるメモリ領域データの取得負荷を減らすことができる。 Furthermore, in the present embodiment, when the memory area data acquisition unit of the standby system device gives priority to a part of the memory area data acquired from the shared storage that is less frequently updated at normal time, The memory area data acquisition load can be reduced.

（第４の実施の形態） (Fourth Embodiment)
次に、本発明の第４の実施の形態について図面を参照して詳細に説明する。本実施の形態では、本発明の第１および第２の実施の形態におけるフェイルオーバー時のダウンタイム増加という前述の課題に対して、本発明の第３の実施の形態とは異なる対策を施す例について説明する。 Next, a fourth embodiment of the present invention will be described in detail with reference to the drawings. This embodiment describes an example that applies a countermeasure, different from that of the third embodiment, to the aforementioned problem of increased downtime at failover in the first and second embodiments.

まず、本発明の第４の実施の形態としてのフォールトトレラントシステム４の構成を図１１に示す。図１１において、フォールトトレラントシステム４は、本発明の第２の実施の形態としてのフォールトトレラントシステム２に対して、待機系装置６０に替えて待機系装置８０を備える点が異なる。また、待機系装置８０は、本発明の第２の実施の形態における待機系装置６０に対して、メモリ領域データ取得部２１に替えてメモリ領域データ取得部８１と、障害検出部６３に替えて障害検出部８３とを備える点が異なる。なお、フォールトトレラントシステム４のハードウェア構成は、図２に示した本発明の第１の実施の形態のハードウェア要素のうち、障害検出装置４０を除いたハードウェア要素によって構成可能である。ただし、フォールトトレラントシステム４を構成する各装置およびその各機能ブロックのハードウェア構成は、上述の構成に限定されない。 First, FIG. 11 shows the configuration of a fault tolerant system 4 as a fourth embodiment of the present invention. In FIG. 11, the fault tolerant system 4 differs from the fault tolerant system 2 of the second embodiment in that it includes a standby device 80 in place of the standby device 60. The standby device 80 differs from the standby device 60 of the second embodiment in that it includes a memory area data acquisition unit 81 in place of the memory area data acquisition unit 21 and a failure detection unit 83 in place of the failure detection unit 63. The hardware of the fault tolerant system 4 can be configured from the hardware elements of the first embodiment shown in FIG. 2, excluding the failure detection device 40. However, the hardware configuration of each device constituting the fault tolerant system 4 and of each of its functional blocks is not limited to the configuration described above.

The failure detection unit 83 detects a sign of failure in an active device 50 in addition to detecting the occurrence of a failure in the active device 50. Specifically, for each active device 50, the failure detection unit 83 may determine both a sign of failure and the occurrence of a failure based on the time elapsed since confirmation information was last received.

For example, for each active device 50, the failure detection unit 83 may determine that a sign of failure has been detected when the time elapsed since confirmation information was last received exceeds a first threshold. In this case, the failure detection unit 83 may determine that a failure has occurred when the elapsed time further exceeds a second threshold. Here, the second threshold is larger than the first threshold, and the first threshold is larger than the predetermined interval at which confirmation information is to be transmitted.

As a specific example, assume that an active device 50 is designed so that its shared memory is updated every 100 milliseconds while the device is operating normally. Assume also that the failure detection unit 52 of the active device 50 is configured to check every 100 milliseconds whether the shared memory area of its own device has been updated, and to transmit confirmation information. Further assume that the first threshold is set to 400 milliseconds and the second threshold to 800 milliseconds. In this case, the failure detection unit 83 determines that a sign of failure has been detected in any active device 50 from which no confirmation information has been received for 400 milliseconds or longer. The failure detection unit 83 then determines that a failure has occurred in any active device 50 from which no confirmation information has been received for a further 400 milliseconds (800 milliseconds in total) or longer.
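The two-threshold decision in the specific example above can be illustrated with a minimal Python sketch. The names `Health` and `classify` are hypothetical and do not appear in the specification; only the 100, 400, and 800 millisecond values are taken from the example. The comparison uses "or longer" (`>=`) as in the specific example, whereas the general description speaks of the elapsed time "exceeding" each threshold, so the boundary handling here is an assumption.

```python
from enum import Enum


class Health(Enum):
    NORMAL = "normal"
    SIGN = "sign_of_failure"
    FAILED = "failure"


# Values taken from the specific example in the text (milliseconds).
CONFIRM_INTERVAL_MS = 100
FIRST_THRESHOLD_MS = 400
SECOND_THRESHOLD_MS = 800


def classify(elapsed_ms: int,
             first_ms: int = FIRST_THRESHOLD_MS,
             second_ms: int = SECOND_THRESHOLD_MS,
             interval_ms: int = CONFIRM_INTERVAL_MS) -> Health:
    """Classify one active device from the time since its last confirmation."""
    # The text requires: interval < first threshold < second threshold.
    if not interval_ms < first_ms < second_ms:
        raise ValueError("need interval < first threshold < second threshold")
    if elapsed_ms >= second_ms:
        return Health.FAILED   # failure detected
    if elapsed_ms >= first_ms:
        return Health.SIGN     # sign of failure detected
    return Health.NORMAL
```

With the default values, a device that has been silent for 400 to 799 milliseconds is reported as showing a sign of failure, and one silent for 800 milliseconds or longer as failed.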

The memory area data acquisition unit 81 is configured to operate as follows when the failure detection unit 83 detects a sign of failure. In this case, the memory area data acquisition unit 81 starts acquiring the memory area data stored in the shared storage 30 for the active virtual machine 100 operating on the active device 50 in which the sign of failure was detected. Hereinafter, an active virtual machine 100 operating on an active device 50 in which a sign of failure was detected is also referred to simply as "the active virtual machine 100 in which a sign of failure was detected". If a failure is subsequently detected in the active device 50 in which the sign of failure was detected, the memory area data acquisition unit 81 continues and completes the acquisition of the memory area data of the active virtual machine 100 in which the failure was detected.

If no failure is detected in the active device 50 in which the sign of failure was detected, the memory area data acquisition unit 81 may cancel the acquisition of the memory area data of the active virtual machine 100 in which the sign of failure was detected. For example, if no failure is detected even after the elapsed time corresponding to the second threshold has passed since the sign of failure was detected, the memory area data acquisition unit 81 may cancel the acquisition of the memory area data of the corresponding active virtual machine 100.

The operation of the fault tolerant system 4 configured as described above will be described with reference to the drawings. The memory area data duplication operation is the same as in the first embodiment of the present invention described with reference to FIG. 4, so its description is omitted in this embodiment.

FIG. 12 shows the failure detection and failover operation of the fault tolerant system 4. The standby device 80 performs the following operation for each active device 50.

In FIG. 12, the failure detection unit 83 of the standby device 80 first checks whether there is a sign of failure in the active device 50 (step S51). For example, as described above, the failure detection unit 83 may determine whether the time elapsed since confirmation information was last received has exceeded the first threshold.

If it is determined that there is a sign of failure, the memory area data acquisition unit 81 starts acquiring, from the shared storage 30, the memory area data of the active virtual machine 100 in which the sign of failure was detected (step S52).

Next, the failure detection unit 83 checks whether a failure has occurred in the active device 50 that shows the sign of failure (step S53). For example, as described above, the failure detection unit 83 may determine whether the time elapsed since confirmation information was last received from that active device 50 has exceeded the second threshold.

If it is determined that a failure has occurred, the memory area data acquisition unit 81 continues and completes the acquisition of the memory area data of the corresponding active virtual machine 100 (step S54).

If no failure is detected in the active device 50, the memory area data acquisition unit 81 may cancel the acquisition of the memory area data of the corresponding active virtual machine 100.

Thereafter, the standby device 80 performs steps S23 and S24 in the same manner as in the first embodiment of the present invention. As a result, the standby virtual machine 200 is started using the memory area data of the active virtual machine 100 in which the failure was detected, and operation is switched from the active virtual machine 100 to the standby virtual machine 200.
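The flow of FIG. 12 described above (check for a sign, start acquisition, check for failure, then complete or cancel) can be illustrated with a minimal Python sketch. Everything here is a hypothetical model, not the patented implementation: the class name `Prefetcher`, the representation of the shared storage as a dictionary of page lists, and the fixed number of pages fetched per call are all assumptions. Cancellation is modeled as the elapsed time dropping back below the first threshold, that is, confirmation information having been received again.

```python
from dataclasses import dataclass, field


@dataclass
class Prefetcher:
    """Hypothetical model of the standby side for one or more active VMs."""
    shared_storage: dict            # vm_id -> list of memory pages (assumed layout)
    first_ms: int = 400             # first threshold (sign of failure)
    second_ms: int = 800            # second threshold (failure)
    pages_per_step: int = 2         # assumed background transfer rate
    local: dict = field(default_factory=dict)  # vm_id -> pages fetched so far

    def step(self, vm_id: str, elapsed_ms: int) -> str:
        pages = self.shared_storage[vm_id]
        if elapsed_ms < self.first_ms:
            # No sign of failure, or the sign has cleared:
            # discard any partially acquired copy.
            if self.local.pop(vm_id, None) is not None:
                return "cancelled"
            return "idle"
        got = self.local.setdefault(vm_id, [])  # start acquisition (cf. S52)
        if elapsed_ms < self.second_ms:
            # Sign only (cf. S53: no failure yet): keep acquiring in the background.
            got.extend(pages[len(got):len(got) + self.pages_per_step])
            return "prefetching"
        # Failure detected: acquire whatever is still missing (cf. S54).
        got.extend(pages[len(got):])
        return "failover"  # the caller would now start the standby VM (cf. S23, S24)
```

In this model only the pages not yet acquired when the failure is detected have to be transferred during the outage, which is the source of the downtime reduction the embodiment describes.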

Next, the effects of the fourth embodiment of the present invention will be described.

In a server system composed of virtual machines, the fault tolerant system of the fourth embodiment of the present invention can shorten downtime at failover while achieving fault tolerance at a reduced cost for the standby side.

The reason is as follows. In this embodiment, the memory area data duplication unit of each active device duplicates the memory area data of its active virtual machine by storing the data in the shared storage. The failure detection unit checks each active device for a sign of failure. When a sign of failure is detected, the memory area data acquisition unit of the standby device starts acquiring, from the shared storage, the memory area data of the active virtual machine in which the sign of failure was detected. If a failure is then detected in the active device in which the sign of failure was detected, the acquisition of that memory area data is continued and completed. The switching unit then starts the standby virtual machine using the fully acquired memory area data and switches over so that the function of the active virtual machine on the failed device is continued by the standby virtual machine.

Thus, this embodiment duplicates the memory area data of each active virtual machine in the shared storage and, when a failure occurs, starts a standby virtual machine for the affected active virtual machine and switches over to it. As in the first embodiment of the present invention, at least one standby device is therefore sufficient. Furthermore, instead of transferring all the memory area data of the active virtual machine on the failed active device to the standby device only after the failure occurs, this embodiment starts acquiring the data at the point when a sign of failure is detected. When the failure occurs, it continues and completes the acquisition of that memory area data. As a result, this embodiment reduces the cost of the standby device while also reducing the amount of memory area data that must be transferred after a failure occurs, thereby shortening downtime.

(Fifth embodiment)
Next, a fifth embodiment of the present invention will be described in detail with reference to the drawings. This embodiment describes an example that combines the third and fourth embodiments of the present invention. In the drawings referred to in the description of this embodiment, components that are the same as in the third and fourth embodiments, and steps that operate in the same way, are given the same reference signs, and their detailed description is omitted in this embodiment.

First, FIG. 13 shows the configuration of a fault tolerant system 5 as the fifth embodiment of the present invention. In FIG. 13, the fault tolerant system 5 differs from the fault tolerant system 4 of the fourth embodiment in that it includes a standby device 90 in place of the standby device 80. The standby device 90 differs from the standby device 80 of the fourth embodiment in that it includes a memory area data acquisition unit 91 in place of the memory area data acquisition unit 81. The hardware configuration of the fault tolerant system 5 can be built from the hardware elements of the first embodiment shown in FIG. 2, excluding the failure detection device 40. However, the hardware configuration of each device constituting the fault tolerant system 5, and of each of its functional blocks, is not limited to the configuration described above.

During normal operation, the memory area data acquisition unit 91 is configured in the same way as the memory area data acquisition unit 71 of the third embodiment of the present invention. That is, during normal operation, when no failure has occurred in the active devices 50, the memory area data acquisition unit 91 acquires in advance a part of the memory area data of each active virtual machine 100 stored in the shared storage 30. The memory area data acquisition unit 91 then writes the acquired memory area data of each active virtual machine 100 to the virtual machine memory area 202 in the memory 2002 of its own device.

The memory area data acquisition unit 91 is also configured to operate as follows when the failure detection unit 83 detects a sign of failure. In this case, for the active virtual machine 100 in which the sign of failure was detected, the memory area data acquisition unit 91 starts acquiring the remaining part of the memory area data stored in the shared storage 30, namely the part that was not acquired during normal operation. If a failure is subsequently detected in the active device 50 in which the sign of failure was detected, the memory area data acquisition unit 91 may continue and complete the acquisition of the memory area data.

The operation of the fault tolerant system 5 configured as described above will be described with reference to the drawings. The memory area data duplication operation of the fault tolerant system 5 is the same as in the first embodiment of the present invention described with reference to FIG. 4, so its description is omitted in this embodiment. FIG. 14 shows the failure detection and failover operation of the fault tolerant system 5. The standby device 90 performs the following operation for each active device 50.

In FIG. 14, the memory area data acquisition unit 91 first acquires a part of the memory area data stored in the shared storage 30 for the active virtual machine 100 operating on the active device 50 (step S41).

Next, the memory area data acquisition unit 91 determines whether a sign of failure has been detected in the active device 50 (step S51).

If no sign of failure has been detected, the memory area data acquisition unit 91 repeats the operation from step S41 after a predetermined interval has elapsed.

If, on the other hand, a sign of failure is detected in the active device 50, the memory area data acquisition unit 91 operates as follows. It starts acquiring, from the shared storage 30, the remaining part of the memory area data of the active virtual machine 100 in which the sign of failure was detected, namely the part that was not acquired during normal operation (step S61).

Thereafter, the standby device 90 performs steps S53, S54, S23, and S24 in the same manner as the standby device 80 of the fourth embodiment of the present invention. That is, if a failure is further detected in the active device 50 in which the sign of failure was detected, the standby device 90 continues and completes the acquisition of the memory area data of the corresponding active virtual machine 100. The standby virtual machine 200 is then started using the memory area data of the active virtual machine 100 in which the failure was detected, and operation is switched from the active virtual machine 100 to the standby virtual machine 200.
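The combination described above, acquiring a part of the memory area data during normal operation (step S41) and only the remainder once a sign of failure is detected (step S61 onward), can be illustrated with a minimal Python sketch. The class `PartialCache`, the page-indexed dictionary, and the choice of "the first half of the pages" as the normally acquired part are all assumptions made for illustration; the description here does not specify which part is acquired in advance.

```python
class PartialCache:
    """Hypothetical standby-side cache of one active VM's memory area data."""

    def __init__(self, shared_storage: dict, vm_id: str,
                 normal_fraction: float = 0.5):
        self.shared_storage = shared_storage    # vm_id -> list of pages (assumed)
        self.vm_id = vm_id
        self.normal_fraction = normal_fraction  # assumed share acquired in advance
        self.pages = {}                         # page index -> page contents

    def refresh_part(self) -> None:
        """Normal operation (cf. S41): acquire, or re-acquire, the leading part."""
        src = self.shared_storage[self.vm_id]
        count = int(len(src) * self.normal_fraction)
        for i in range(count):
            self.pages[i] = src[i]

    def fetch_remaining(self) -> list:
        """Sign or failure detected (cf. S61, S54): acquire only missing pages."""
        src = self.shared_storage[self.vm_id]
        missing = [i for i in range(len(src)) if i not in self.pages]
        for i in missing:
            self.pages[i] = src[i]
        return missing

    def image(self) -> list:
        """Complete memory image, valid once all pages have been acquired."""
        return [self.pages[i] for i in range(len(self.pages))]
```

Because `refresh_part` is repeated at each interval while no sign of failure is present, the cached part tracks updates made by the active virtual machine; how stale pages are handled between the last refresh and the failure is outside the scope of this sketch.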

Next, the effects of the fifth embodiment of the present invention will be described.

In a server system composed of virtual machines, the fault tolerant system of the fifth embodiment of the present invention can further shorten downtime at failover while achieving fault tolerance at a reduced cost for the standby side.

The reason is as follows. In this embodiment, the memory area data acquisition unit of the standby device acquires a part of the memory area data of each active device in advance during normal operation. When a sign of failure is detected, the memory area data acquisition unit starts acquiring the remaining memory area data, which was not acquired during normal operation, for the active virtual machine in which the sign was detected. When a failure is then detected in the active device in which the sign of failure was detected, the memory area data acquisition unit continues and completes the acquisition of the remaining memory area data, and the switching unit starts the standby virtual machine and switches operation over to it.

Thus, by combining the downtime-reducing configurations of the third and fourth embodiments of the present invention, this embodiment can shorten downtime even further.

The fourth and fifth embodiments of the present invention described above use an example in which the failure detection unit detects the progression from a sign of failure to the occurrence of a failure in two stages, using the first threshold and the second threshold. The failure detection unit is not limited to this and may detect the progression from a sign of failure to the occurrence of a failure in three or more stages.
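Detection in three or more stages can be illustrated by generalizing the two thresholds to an ordered list. The following Python sketch is hypothetical: the function name `stage`, the example threshold values, and the convention that reaching a threshold (elapsed time equal to or longer than it) counts as entering the next stage are assumptions, not part of the specification.

```python
from bisect import bisect_right


def stage(elapsed_ms: int, thresholds_ms: list) -> int:
    """Return how many thresholds the elapsed time has reached.

    0 means normal operation; len(thresholds_ms) means the final stage,
    which would be treated as the occurrence of a failure.
    """
    # Thresholds must be strictly increasing for the stages to be well defined.
    if thresholds_ms != sorted(set(thresholds_ms)):
        raise ValueError("thresholds must be strictly increasing")
    return bisect_right(thresholds_ms, elapsed_ms)
```

For instance, with assumed thresholds of 400, 600, and 800 milliseconds, an intermediate stage could be used to raise the priority of the background acquisition before the failure itself is declared.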

The second to fifth embodiments of the present invention described above use an example in which the failure detection unit of each active device checks whether the device is operating normally by checking whether the shared memory has been updated. The failure detection unit in each embodiment is not limited to this and can also be implemented using any of various known techniques for detecting a failure in an active device.

The embodiments of the present invention described above focus on an example in which the fault tolerant system has one standby device for a plurality of active devices. The fault tolerant system is not limited to this and may include N standby devices (N > 1). In this case, each embodiment can also handle failures that occur simultaneously in N active devices. Specifically, the standby devices can perform failover for each of the active virtual machines on the failed devices. Configured in this way, each embodiment can cope with failures occurring simultaneously in a plurality of active devices, which improves reliability. Even in this case, each embodiment does not require as many standby virtual machines as active virtual machines; it is sufficient to be able to run fewer standby virtual machines than active virtual machines. Each embodiment can therefore still reduce the cost of the standby side.
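The case of N standby devices can be illustrated with a minimal Python sketch that pairs failed active devices with idle standby devices. The function `assign_standbys` and the one-failover-per-standby-device policy are assumptions for illustration; the specification does not prescribe how failed devices are matched to standby devices.

```python
def assign_standbys(failed_active: list, idle_standby: list) -> dict:
    """Pair each failed active device with one idle standby device.

    Failed devices beyond the number of idle standby devices are left
    unassigned, reflecting that N standby devices cover up to N
    simultaneous failures in this simplified model.
    """
    # zip stops at the shorter list, so surplus failures get no entry.
    return dict(zip(failed_active, idle_standby))
```

A caller would start the failover procedure on each assigned standby device and treat any failed device missing from the result as not recoverable until a standby device becomes available.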

The embodiments of the present invention described above focus on an example in which each active device and each standby device runs one virtual machine. The embodiments are not limited to this and may include active devices or standby devices capable of running a plurality of virtual machines. Even in such a case, each embodiment does not require as many standby virtual machines as active virtual machines; it is sufficient to be able to run fewer standby virtual machines than active virtual machines. Each embodiment can therefore still reduce the cost of the standby side.

In the embodiments of the present invention described above, the local data of the active devices is stored in the shared storage, and at failover the standby virtual machine takes over, on the shared storage, the local data that the active virtual machine had been using. The embodiments are not limited to this; the local data of each active virtual machine need only be stored in a location from which the standby virtual machine can take it over. For example, the standby device may have storage with enough capacity to hold all the local data used by every active virtual machine. In this case, the storage of the standby device may be made to function as a distributed storage system such as DRBD (Distributed Replicated Block Device). This allows each active virtual machine to store its local data in the storage of the standby device. At failover, the standby virtual machine then takes over the local data of the corresponding active virtual machine from the storage of the standby device on which it runs.

In the embodiments of the present invention described above, after failover has been performed because of a failure in an active device, the standby device may operate as an active device of the present invention. In this case, the active device, once it has recovered from the failure, may operate as a standby device of the present invention. To allow this, the active device of each embodiment need only additionally have the functional blocks of the standby device, and the standby device of each embodiment need only additionally have the functional blocks of the active device. In this case, information indicating which of the devices included in the fault tolerant system serves as the standby device may be stored in the shared storage.

The embodiments of the present invention described above focus on an example in which each functional block of each device constituting the fault tolerant system is implemented by a CPU executing a computer program stored in a storage device or ROM. The embodiments are not limited to this; some or all of the functional blocks, or a combination of them, may be implemented by dedicated hardware.

In the embodiments of the present invention described above, the operation of each device described with reference to the flowcharts may be stored in a storage device (storage medium) of a computer as a computer program of the present invention, and the CPU may read and execute that computer program. In such a case, the present invention is constituted by the code of the computer program or by the storage medium.

The embodiments described above may also be implemented in any appropriate combination.

The present invention is not limited to the embodiments described above and can be implemented in various forms.

1, 2, 3, 4, 5 Fault tolerant system
10, 50 Active device
11 Memory area data duplication unit
20, 60, 70, 80, 90 Standby device
21, 71, 81, 91 Memory area data acquisition unit
22 Switching unit
30 Shared storage
40 Failure detection device
41 Failure detection unit
52, 63, 83 Failure detection unit
100 Active virtual machine
200 Standby virtual machine
101, 201 Hypervisor
102, 202 Virtual machine memory area
1001, 2001, 4001 CPU
1002, 2002, 4002 Memory
1003, 2003, 4003 Local storage
1004, 2004, 4004 Network interface

Claims

A fault tolerant system comprising:
shared storage shared by an active device and a standby device;
a memory area data duplication unit that duplicates the contents of the virtual machine memory area (memory area data) of an active virtual machine operating on the active device by transferring them from the active device to the shared storage;
a failure detection unit that detects the occurrence of a failure in the active device;
a memory area data acquisition unit that transfers, from the shared storage to the standby device, the memory area data of the active virtual machine operating on the active device in which the failure was detected; and
a switching unit that, when the failure detection unit detects the occurrence of a failure, runs a standby virtual machine on the standby device using the memory area data acquired by the memory area data acquisition unit, and switches over so that the function of the active virtual machine operating on the active device in which the failure was detected is continued by the standby virtual machine,
wherein the memory area data acquisition unit transfers a part of the memory area data of the active virtual machine stored in the shared storage to the standby device during normal operation, when no failure is detected by the failure detection unit, and, when the failure detection unit detects the occurrence of a failure, transfers the remaining memory area data of the active virtual machine operating on the active device in which the failure was detected from the shared storage to the standby device.

The fault tolerant system according to claim 1, wherein
the failure detection unit detects a sign of failure in the active device in addition to detecting the occurrence of a failure in the active device, and
when the failure detection unit detects the sign of failure, the memory area data acquisition unit starts transferring to the standby device the memory area data stored in the shared storage for the active virtual machine operating on the active device in which the sign was detected, and thereafter, when the occurrence of a failure is further detected in the active device in which the sign of failure was detected, continues and completes the transfer of the memory area data.

3. The fault tolerant system according to claim 1 or 2, wherein the active device has the memory area data duplication unit.

The fault tolerant system according to claim 1 or 2, wherein the standby device includes the memory area data acquisition unit and the switching unit.

Using a shared storage shared by an active device and a standby device,
The contents of the virtual machine memory area (memory area data) of the active virtual machine operating on the active device are transferred from the active device to the shared storage and duplicated,
When the occurrence of a failure in the active device is detected,
The memory area data of the active virtual machine operating in the active device in which the failure was detected is transferred from the shared storage to the standby device,
In the standby device, a standby virtual machine is operated using the memory area data transferred from the shared storage,
Switching is performed so that the function of the active virtual machine operating in the active device in which the failure was detected is continued by the standby virtual machine,
During normal operation, when no failure of the active device is detected, a part of the memory area data of the active virtual machine stored in the shared storage is transferred to the standby device,
When the occurrence of a failure in the active device is detected, the remaining memory area data of the active virtual machine operating in the active device in which the failure was detected is transferred from the shared storage to the standby device. A failover method characterized by the above.
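The order of operations in this method can be traced with a short simulation. This is only a sketch under stated assumptions: memory area data is modelled as a dictionary of pages, failure detection is simulated by a log entry rather than a real detector, and `StandbyDevice`, `run_failover`, and the log labels are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class StandbyDevice:
    """Hypothetical standby device state."""
    memory: Dict[int, bytes] = field(default_factory=dict)
    vm_running: bool = False
    serving: bool = False


def run_failover(active_memory: Dict[int, bytes],
                 early_pages: Set[int],
                 log: List[str]) -> StandbyDevice:
    shared: Dict[int, bytes] = {}
    standby = StandbyDevice()

    # Duplicate the active virtual machine's memory area data to shared storage.
    shared.update(active_memory)
    log.append("duplicate")

    # Normal operation: transfer a part of the data to the standby device.
    for n in early_pages:
        standby.memory[n] = shared[n]
    log.append("partial_transfer")

    # Failure of the active device is detected (simulated here).
    log.append("failure_detected")

    # Transfer only the remaining data from shared storage.
    for n in shared.keys() - standby.memory.keys():
        standby.memory[n] = shared[n]
    log.append("remaining_transfer")

    # Operate the standby virtual machine on the transferred data, then switch over.
    standby.vm_running = True
    log.append("standby_vm_started")
    standby.serving = True
    log.append("switched")
    return standby


log: List[str] = []
active_memory = {n: bytes([n]) for n in range(6)}
standby = run_failover(active_memory, {0, 1}, log)
```

After the call, `log` lists the steps in the order the method recites them, and `standby.memory` holds the same pages as `active_memory`.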

The standby device,
Using a shared storage in which the contents of the virtual machine memory area (memory area data) of the active virtual machine operating on the active device are duplicated,
When the occurrence of a failure in the active device is detected,
Acquires, from the shared storage, the memory area data of the active virtual machine operating in the active device in which the failure was detected,
Operates a standby virtual machine using the memory area data acquired from the shared storage,
Switches so that the function of the active virtual machine operating in the active device in which the failure was detected is continued by the standby virtual machine,
During normal operation, when no failure of the active device is detected, transfers a part of the memory area data of the active virtual machine stored in the shared storage to the standby device,
When the occurrence of a failure in the active device is detected, acquires from the shared storage the remaining memory area data of the active virtual machine operating in the active device in which the failure was detected.
A failover method characterized by the above.

Using a shared storage in which the contents of the virtual machine memory area (memory area data) of the active virtual machine operating on the active device are duplicated,
A memory area data acquisition step of acquiring from the shared storage, when the occurrence of a failure in the active device is detected, the memory area data of the active virtual machine operating in the active device in which the failure was detected;
A standby virtual machine operation step of operating a standby virtual machine using the memory area data acquired in the memory area data acquisition step;
A switching step of switching so that the function of the active virtual machine operating in the active device in which the failure was detected is continued by the standby virtual machine;
A transfer step of transferring a part of the memory area data of the active virtual machine stored in the shared storage to the standby device during normal operation, when no failure of the active device is detected; and
An acquisition step of acquiring from the shared storage, when the occurrence of a failure in the active device is detected, the remaining memory area data of the active virtual machine operating in the active device in which the failure was detected.
A failover program that causes the standby device to execute the above steps.