JP2017045084A

JP2017045084A - Failure detection apparatus and failure detection method

Info

Publication number: JP2017045084A
Application number: JP2015164432A
Authority: JP
Inventors: 大輔宮本; Daisuke Miyamoto; 英治石川; Eiji Ishikawa
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2015-08-24
Filing date: 2015-08-24
Publication date: 2017-03-02

Abstract

PROBLEM TO BE SOLVED: To allow a virtual machine to acquire the hardware state of a host machine and to respond properly to a failure in the host machine.

SOLUTION: A server failure detection function 11 acquires the hardware state of host machines 1A, 1B. A failure detection cooperation function 12 transmits the machine state of the host machines 1A, 1B to virtual machines 2A, 2B. A failure detection cooperation function 21 of the virtual machines 2A, 2B receives the machine state of the host machines 1A, 1B and associates it with the machine state of the virtual machines 2A, 2B. A system running on the virtual machines 2A, 2B acquires the machine state of the host machines 1A, 1B and thereby detects a failure of the host machines 1A, 1B.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a failure control technique for a system that uses virtual machines operating on physical machines.

Carrier-grade servers use an HA cluster to maintain fault tolerance: an active server and a standby server are prepared, and when an abnormality occurs in the active server, the standby system and the active system are switched to improve service continuity (see, for example, Patent Document 1). A system operating on the active server detects the occurrence of an abnormality by acquiring the state of the hardware, such as the machine on which it is running, and performing state determination.

Patent Document 1: JP 2014-165632 A

Suppose a system that acquires the state of the hardware on which it runs, performs state determination, and switches between the active system and the standby system is built using virtual machines. In that case, the hardware state acquisition function and the state determination function of an ordinary system end up acquiring and judging the state of the virtual machine. No matter how long the state of the virtual machine is monitored, hardware-specific temperature abnormalities and failures due to aging cannot be detected, so there has been a problem that an abnormality in the host machine running the virtual machine cannot be detected.

A common approach to this problem is to place a virtualization-layer monitoring device externally to detect the state of the physical machine. However, an external monitoring device cannot detect abnormalities outside the functions it monitors (such as progressive abnormalities). In addition, if information can no longer be conveyed from the external monitoring device to the server program operating on the virtual machine, the server program cannot carry out machine control on its own initiative.

The present invention has been made in view of the above, and its object is to enable a virtual machine to acquire the hardware state of a host machine and to respond appropriately when a failure occurs in the host machine.

A failure detection apparatus according to a first aspect of the present invention is a failure detection apparatus in which a virtual machine operates on a physical machine, wherein the physical machine includes state acquisition means for acquiring the state of the physical machine and state transmission means for transmitting the state to the virtual machine, and the virtual machine includes state cooperation means for acquiring the state from the physical machine and enabling a program operating on the virtual machine to acquire the state.

The failure detection apparatus is used in a redundant system including an active system and a standby system, and the virtual machine includes failure control means for switching between the active system and the standby system when an abnormality of the physical machine is detected.

A failure detection method according to a second aspect of the present invention is a failure detection method performed by a failure detection apparatus in which a virtual machine operates on a physical machine, the method including, by the physical machine, a step of acquiring the state of the physical machine and a step of transmitting the state to the virtual machine, and, by the virtual machine, a step of acquiring the state from the physical machine and enabling a program operating on the virtual machine to acquire the state.

According to the present invention, a virtual machine can acquire the hardware state of a host machine and can respond appropriately when a failure occurs in the host machine.

FIG. 1 is a functional block diagram showing the configuration of the failure control system in the present embodiment.

FIG. 2 is a sequence diagram showing the operation of the failure control system in the present embodiment.

Hereinafter, an embodiment of the present invention will be described with reference to the drawings.

FIG. 1 is a functional block diagram showing the configuration of the failure control system in the present embodiment. In the failure control system shown in FIG. 1, virtual machines 2A and 2B operating on host machines 1A and 1B execute programs to run a desired system. The system running on one virtual machine 2A is the active system and the system running on the other virtual machine 2B is the standby system; when an abnormality is detected in the active system, operation is switched to the standby system so that the service continues. For example, a telecommunications carrier uses this system to build a communication network. The failure control system also has the function of providing services by running programs on the virtual machines 2A and 2B, but FIG. 1 shows only the functions used for failure control.

The host machines 1A and 1B shown in FIG. 1 each include a server failure detection function 11 and a failure detection cooperation function 12. The virtual machines 2A and 2B operating on the host machines 1A and 1B each include a failure detection cooperation function 21 and a server failure detection function 22.

The server failure detection function 11 acquires the state of the host machines 1A and 1B themselves by using the machine state acquisition functions that the host machines 1A and 1B provide. The host machines 1A and 1B are physical machines, and their hardware components may develop abnormalities or fail. Functions for acquiring the state of a physical machine include, for example, S.M.A.R.T., a self-diagnosis function built into hard disks, and memory checks.
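As an illustration only, the state acquisition performed by the server failure detection function 11 might be sketched as follows. The names `ProbeResult`, `acquire_host_state`, and `has_abnormality` are hypothetical, and the two probes are stand-ins that return fixed values; a real host would query its actual disk self-diagnosis and memory check facilities instead.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class ProbeResult:
    """Outcome of one hardware check (hypothetical structure)."""
    component: str
    healthy: bool
    detail: str = ""


def acquire_host_state(probes: Dict[str, Callable[[], ProbeResult]]) -> Dict[str, ProbeResult]:
    """Run every probe and collect the results as the host machine state."""
    state: Dict[str, ProbeResult] = {}
    for name, probe in probes.items():
        try:
            state[name] = probe()
        except Exception as exc:
            # A probe that cannot run is reported as unhealthy rather than ignored.
            state[name] = ProbeResult(name, False, f"probe failed: {exc}")
    return state


def has_abnormality(state: Dict[str, ProbeResult]) -> bool:
    """True if at least one component reports an unhealthy result."""
    return any(not result.healthy for result in state.values())


# Stand-in probes with fixed results (assumptions for the example).
def fake_disk_probe() -> ProbeResult:
    return ProbeResult("disk", True, "self-test passed")


def fake_memory_probe() -> ProbeResult:
    return ProbeResult("memory", False, "uncorrectable error reported")


host_state = acquire_host_state({"disk": fake_disk_probe, "memory": fake_memory_probe})
print(has_abnormality(host_state))
```

Passing the probes in as callables keeps the collection logic independent of how each hardware check is actually performed.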

The failure detection cooperation function 12 transmits the machine state of the host machines 1A and 1B acquired by the server failure detection function 11 to the virtual machines 2A and 2B, so that the system operating on the virtual machines 2A and 2B can acquire the machine state of the host machines 1A and 1B.

One way to transmit the machine state is to implement a proprietary IP-layer function on the host machines 1A and 1B. Other examples include the following.

(1) Using an SNMP trap, the SNMP agent of the host machines 1A and 1B notifies the SNMP manager of the virtual machines 2A and 2B when an abnormality occurs.
(2) The virtual machines 2A and 2B send a heartbeat to the host machines 1A and 1B and perform alive monitoring based on whether a reply is received. If no heartbeat is returned, an abnormality is detected.
(3) The host machines 1A and 1B log in to the virtual machines 2A and 2B and break the virtual machines 2A and 2B at the part corresponding to the location where the host machines 1A and 1B are malfunctioning.
(4) A mail server is started on the host machines 1A and 1B, a mail client is started on the virtual machines 2A and 2B, and the host machines 1A and 1B convey the state to the virtual machines 2A and 2B by mail.
(5) The host machines 1A and 1B connect to the virtual machines 2A and 2B via Telnet and edit a file on the virtual machines 2A and 2B.
(6) The host machines 1A and 1B connect to the virtual machines 2A and 2B via FTP and place a file on the virtual machines 2A and 2B.
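The heartbeat approach in item (2) can be sketched as below. This is a minimal illustration under stated assumptions: the class name `HeartbeatMonitor`, the timeout value, and the use of explicit timestamps in place of real network traffic are all hypothetical and are not specified in the source.

```python
class HeartbeatMonitor:
    """Tracks heartbeat replies from the host machine on the virtual machine side."""

    def __init__(self, timeout_s: float, started_at: float) -> None:
        self.timeout_s = timeout_s
        # Treat the start time as the last known reply so that the host
        # is not flagged before the first timeout has elapsed.
        self.last_reply_at = started_at

    def record_reply(self, now: float) -> None:
        """Call whenever a heartbeat reply arrives from the host machine."""
        self.last_reply_at = now

    def host_is_abnormal(self, now: float) -> bool:
        """True when no reply has been received within the timeout."""
        return (now - self.last_reply_at) > self.timeout_s


# Timestamps are passed in explicitly (seconds) to keep the example deterministic.
monitor = HeartbeatMonitor(timeout_s=3.0, started_at=0.0)
monitor.record_reply(now=1.0)
status_at_2s = monitor.host_is_abnormal(now=2.0)    # reply 1.0 s ago, within timeout
status_at_5s = monitor.host_is_abnormal(now=5.5)    # reply 4.5 s ago, timeout exceeded
print(status_at_2s, status_at_5s)
```

In a real deployment the timestamps would come from a monotonic clock, and `record_reply` would be driven by whatever transport carries the heartbeat.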

Meanwhile, the failure detection cooperation function 21 of the virtual machines 2A and 2B receives the machine state of the host machines 1A and 1B transmitted by one of the above methods and links the machine state of the host machines 1A and 1B with that of the virtual machines 2A and 2B, so that the server failure detection function 22 of the virtual machines 2A and 2B can detect the machine state of the host machines 1A and 1B.
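One possible way for the failure detection cooperation function 21 to make the received host state available to programs on the virtual machine is a local state file, sketched below. The class `HostStateStore`, the JSON layout, and the file name are assumptions made for this example; the source does not prescribe how the state is exposed.

```python
import json
import os
import tempfile
from typing import Optional


class HostStateStore:
    """Keeps the latest host machine state in a file readable by local programs."""

    def __init__(self, path: str) -> None:
        self.path = path

    def publish(self, host_state: dict) -> None:
        """Store the received state; called by the cooperation function."""
        directory = os.path.dirname(self.path) or "."
        # Write to a temporary file, then rename, so that a reader
        # never observes a partially written state file.
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(host_state, handle)
        os.replace(tmp_path, self.path)

    def read(self) -> Optional[dict]:
        """Return the latest state, or None if nothing has been received yet."""
        try:
            with open(self.path, encoding="utf-8") as handle:
                return json.load(handle)
        except FileNotFoundError:
            return None


workdir = tempfile.mkdtemp()
store = HostStateStore(os.path.join(workdir, "host_state.json"))
before = store.read()
store.publish({"host": "1A", "disk": "ok", "temperature": "abnormal"})
after = store.read()
print(before, after)
```

A server program on the virtual machine would then call `read()` as part of its own state determination.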

The server failure detection function 22 is a function of the system (server program) operating on the virtual machines 2A and 2B. It performs state determination for both the virtual machines 2A and 2B and the host machines 1A and 1B and detects abnormalities. Because the server failure detection function 22 judges both the state of the virtual machines 2A and 2B and the state of the host machines 1A and 1B, it may perform the determinations separately so that each state can be distinguished, or it may superimpose the state determination of the host machines 1A and 1B on that of the virtual machines 2A and 2B.
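The two determination styles can be contrasted in a short sketch. The `Status` levels and the reading of "superimposed" as "take the more severe of the two states" are assumptions for illustration; the source does not define how the states are combined.

```python
from enum import IntEnum
from typing import Dict


class Status(IntEnum):
    """Hypothetical severity levels; a higher value means a worse state."""
    NORMAL = 0
    WARNING = 1
    FAILED = 2


def judge_separately(vm: Status, host: Status) -> Dict[str, Status]:
    """Report each state on its own so the origin of a problem stays visible."""
    return {"virtual_machine": vm, "host_machine": host}


def judge_superimposed(vm: Status, host: Status) -> Status:
    """Fold the host state into the virtual machine state (assumed: worst wins)."""
    return max(vm, host)


separate = judge_separately(Status.NORMAL, Status.FAILED)
combined = judge_superimposed(Status.NORMAL, Status.FAILED)
print(separate, combined)
```

The separate form suits operators who need to know which layer failed, while the superimposed form lets an existing single-state decision routine react to host problems without modification.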

When the server failure detection function 22 detects an abnormality, it starts a failure control operation that switches to the standby system. Alternatively, it may output an alarm to maintenance personnel when an abnormality is detected, and the maintenance personnel may then carry out maintenance work.

Next, the operation of the failure control system in the present embodiment will be described.

FIG. 2 is a sequence diagram showing the operation of the failure control system in the present embodiment.

The server failure detection function 11 of the host machine 1A acquires the state of the host machine 1A and detects an abnormality (step S11).

The failure detection cooperation function 12 of the host machine 1A transmits the state of the host machine 1A to the failure detection cooperation function 21 of the virtual machine 2A (step S12).

When the failure detection cooperation function 21 of the virtual machine 2A determines from the state of the host machine 1A that the active system needs to be switched to the standby system, it issues a switching instruction to the virtual machine 2B (step S13).
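Steps S11 to S13 can be traced in a small in-process simulation. The classes `VirtualMachine` and `HostMachine`, the role strings, and the choice to mark the former active machine as standby after the switch are hypothetical simplifications; a direct method call stands in for whichever transmission method is actually used.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VirtualMachine:
    name: str
    role: str  # "active" or "standby"
    log: List[str] = field(default_factory=list)
    peer: Optional["VirtualMachine"] = field(default=None, repr=False)

    def receive_host_state(self, host_abnormal: bool) -> None:
        """Receive the host state sent in S12 and decide on switching (S13)."""
        self.log.append(f"received host state: abnormal={host_abnormal}")
        if host_abnormal and self.role == "active" and self.peer is not None:
            self.peer.take_over()
            # Assumption: the former active machine becomes the standby.
            self.role = "standby"

    def take_over(self) -> None:
        """React to a switching instruction from the peer."""
        self.role = "active"
        self.log.append("took over as active")


@dataclass
class HostMachine:
    name: str
    guest: VirtualMachine

    def detect_and_notify(self, hardware_ok: bool) -> None:
        host_abnormal = not hardware_ok                # S11: acquire state, detect abnormality
        self.guest.receive_host_state(host_abnormal)   # S12: transmit state to the guest


vm_b = VirtualMachine("2B", "standby")
vm_a = VirtualMachine("2A", "active", peer=vm_b)
host_a = HostMachine("1A", vm_a)
host_a.detect_and_notify(hardware_ok=False)
print(vm_a.role, vm_b.role)
```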

As described above, according to the present embodiment, the server failure detection function 11 acquires the hardware state of the host machines 1A and 1B, the failure detection cooperation function 12 transmits the machine state of the host machines 1A and 1B to the virtual machines 2A and 2B, and the failure detection cooperation function 21 of the virtual machines 2A and 2B receives that machine state and links the machine state of the host machines 1A and 1B with that of the virtual machines 2A and 2B. As a result, the system operating on the virtual machines 2A and 2B can acquire the machine state of the host machines 1A and 1B and detect an abnormality in the host machines 1A and 1B.

1A, 1B ... Host machine
11 ... Server failure detection function
12 ... Failure detection cooperation function
2A, 2B ... Virtual machine
21 ... Failure detection cooperation function
22 ... Server failure detection function

Claims

A failure detection apparatus in which a virtual machine operates on a physical machine, wherein
the physical machine comprises:
state acquisition means for acquiring the state of the physical machine; and
state transmission means for transmitting the state to the virtual machine, and
the virtual machine comprises:
state cooperation means for acquiring the state from the physical machine and enabling a program operating on the virtual machine to acquire the state.

The failure detection apparatus according to claim 1, wherein
the failure detection apparatus is used in a redundant system including an active system and a standby system, and
the virtual machine further comprises failure control means for switching between the active system and the standby system when an abnormality of the physical machine is detected.

A failure detection method performed by a failure detection apparatus in which a virtual machine operates on a physical machine, the method comprising:
by the physical machine,
acquiring the state of the physical machine; and
transmitting the state to the virtual machine, and
by the virtual machine,
acquiring the state from the physical machine and enabling a program operating on the virtual machine to acquire the state.