JP5747765B2

JP5747765B2 - Failure analysis apparatus, failure analysis method, and program

Info

Publication number: JP5747765B2
Application number: JP2011211411A
Authority: JP
Inventors: 尾崎　哲也
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2011-09-27
Filing date: 2011-09-27
Publication date: 2015-07-15
Anticipated expiration: 2031-09-27
Also published as: JP2013073389A

Description

The present invention relates to a failure analysis apparatus, a failure analysis method, and a program used for the operation management of, for example, a data center that centrally manages a large number of virtual servers using virtualization technology.

In recent years, computer systems that use virtualization technology have come into operation. In such a system, for example, a plurality of virtual machines run on a single server computer (see, for example, Patent Document 1). The virtual computer system described in Patent Document 1 sets up virtual devices using logical devices on a plurality of physical devices, and runs virtual machines using those virtual devices.

This virtual computer system also has a virtualization environment operation support system. The virtualization environment operation support system includes an influence range identification table unit that stores: first related information identifying a physical device in which a failure has occurred and the logical devices affected by that physical device; second related information identifying a logical device in which a failure has occurred and the virtual devices affected by that logical device; and third related information identifying a virtual device in which a failure has occurred and the virtual machines affected by that virtual device.

The virtualization environment operation support system also includes a control unit. The control unit receives, from the virtual computer system, failure location information identifying the physical device, logical device, or virtual device in which a failure has occurred, and refers to the influence range identification table unit based on that information to identify the virtual machines affected by the failure. When the failure location information indicates a virtual device, the control unit identifies the virtual machines affected by that virtual device by referring to the third related information.

Patent Document 1: JP 2010-009411 A (abstract, claim 1)

When monitoring failures on server computers (cloud systems) that make heavy use of virtualization technology, typified by VMware (registered trademark), a single underlying cause often makes it look as if multiple failures occurred at around the same time. In such cases it is unclear which failure to address first, and identifying the cause and the affected range takes time. Moreover, in a virtual environment the resources in use change frequently, which makes the affected range difficult to identify.

With the configuration described in Patent Document 1, identifying the affected range of a failure in a virtual machine presupposes that the causative device can be identified from the failure event. That is, the relationship between each failure event and its causative device must be registered in advance. Consequently, the cause of a failure that has not been registered in advance cannot be analyzed.

An example of an object of the present invention is to provide a failure analysis apparatus, a failure analysis method, and a program that can analyze the cause of a failure even when the relationship between error content and error cause has not been identified in advance.

To achieve the above object, a failure analysis apparatus according to one aspect of the present invention includes:
an obtaining unit that obtains information identifying the physical devices and logical devices used by each of a plurality of hosts including virtual hosts; and
an error presence/absence determination unit that identifies, from the host name of an error-occurring host (a host among the plurality of hosts in which an error has occurred) and the information obtained by the obtaining unit, the physical devices and logical devices used by the error-occurring host, and that determines whether an error has occurred in any other host sharing the identified physical devices and logical devices.

To achieve the above object, a failure analysis method according to one aspect of the present invention includes:
(a) obtaining information identifying the physical devices and logical devices used by each of a plurality of hosts including virtual hosts;
(b) identifying, from the host name of an error-occurring host among the plurality of hosts and the obtained information, the physical devices and logical devices used by the error-occurring host; and
(c) determining whether an error has occurred in any other host sharing the identified physical devices and logical devices.

Furthermore, to achieve the above object, a program according to one aspect of the present invention is a program for causing a computer to analyze failures occurring in a plurality of hosts including virtual hosts, the program causing the computer to execute:
(a) obtaining information identifying the physical devices and logical devices used by each of the plurality of hosts including the virtual hosts;
(b) identifying, from the host name of an error-occurring host among the plurality of hosts and the obtained information, the physical devices and logical devices used by the error-occurring host; and
(c) determining whether an error has occurred in any other host sharing the identified physical devices and logical devices.

As described above, according to the present invention, the cause of a failure can be analyzed even when the relationship between error content and error cause has not been identified in advance.

FIG. 1 is a block diagram showing a schematic configuration of a failure analysis system including a failure analysis apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram showing the data tables, included in the configuration information storage unit, that are necessary for the operation of the present embodiment.
FIG. 3 is a diagram showing the host management table.
FIG. 4 is a diagram showing the physical disk management table, the logical disk management table, and the NIC management table.
FIG. 5 is a diagram showing the resource allocation table.
FIG. 6 is a diagram showing an example of the message structure.
FIG. 7 is a flowchart for explaining the flow of the error cause narrowing process.
FIG. 8(a) is a flowchart for explaining the process performed by the occurrence node determination unit, and shows the details of step S2 in FIG. 7. FIG. 8(b) is a flowchart for explaining the process performed by the message search range calculation unit, and shows the details of step S3 in FIG. 7.
FIG. 9 is a flowchart for explaining the process performed by the error presence/absence determination unit, and shows the details of step S4 in FIG. 7.
FIG. 10 is a flowchart for explaining the detailed processing flow of steps S301 to S303 in FIG. 9.
FIG. 11(a) is a flowchart showing the flow of the process performed by the suspect narrowing unit, and FIG. 11(b) is a flowchart showing the flow of the process performed by the sort processing unit.
FIG. 12 is a flowchart for explaining steps S9 to S12.

Hereinafter, a failure analysis apparatus according to an embodiment of the present invention will be described with reference to the drawings.

[Device configuration]
FIG. 1 is a block diagram showing a schematic configuration of a failure analysis system 1 including a failure analysis apparatus 100 according to an embodiment of the present invention. In the present embodiment, the failure analysis system 1 includes the failure analysis apparatus 100, a monitoring target server 200, and a monitoring terminal 300 operated by an operator.

As described later, the failure analysis apparatus 100 can be built on, for example, a server computer. The monitoring target server 200 can likewise be built on a server computer. The monitoring target server 200 has a plurality of physical devices, such as hard disks, and a plurality of logical devices. It also has a plurality of hosts that use these physical and logical devices. The hosts include physical OSs (operating systems), hypervisors, and virtual OSs.

The monitoring terminal 300 includes a computer. The failure analysis apparatus 100 is connected to the monitoring target server 200 via a network 400, and is likewise connected to the monitoring terminal 300 via the network 400.

The failure analysis apparatus 100 includes an obtaining unit 131 and an error presence/absence determination unit 153. The obtaining unit 131 obtains, from the monitoring target server 200, information identifying the physical devices and logical devices used by each host of the monitoring target server 200 (device information).

The error presence/absence determination unit 153 identifies the physical devices and logical devices used by an error-occurring host, that is, a host among the plurality of hosts of the monitoring target server 200 in which an error has occurred, from the host name of that host and the device information obtained by the obtaining unit 131. The error presence/absence determination unit 153 then determines whether an error has occurred in any other host that shares the identified physical and logical devices.

As described above, in the present embodiment the cause of a failure can be analyzed even if the failure analysis apparatus 100 has not identified the relationship between error content and error cause in advance. More specifically, if an error has also occurred in another host that shares at least one of the physical devices and logical devices with the error-occurring host, the shared device can be judged to be the cause of the error. Conversely, if the error-occurring host shows an error but the other hosts do not, the error-occurring host itself can be judged to be the cause. As a result, the operator can learn the cause of the error accurately and start resolving it quickly.
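The judgment described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the apparatus's actual implementation: the dictionary `DEVICE_INFO`, the function names `analyze` and `likely_cause`, and the host-to-device assignments are all hypothetical, and the device names `LogicalA` to `LogicalC` merely stand in for logical devices.

```python
from typing import Dict, List, Set

# Hypothetical device information: host name -> devices the host uses.
DEVICE_INFO: Dict[str, Set[str]] = {
    "Guest1": {"DiskA", "LogicalA", "MAC1"},
    "Guest2": {"DiskA", "LogicalB", "MAC1"},
    "Guest3": {"DiskB", "LogicalC", "MAC2"},
}


def analyze(error_host: str,
            hosts_with_errors: Set[str],
            device_info: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each device used by error_host, return the other hosts that
    share the device and also show errors."""
    result: Dict[str, Set[str]] = {}
    for device in device_info[error_host]:
        sharing = {host for host, devices in device_info.items()
                   if host != error_host and device in devices}
        result[device] = sharing & hosts_with_errors
    return result


def likely_cause(error_host: str,
                 hosts_with_errors: Set[str],
                 device_info: Dict[str, Set[str]]) -> List[str]:
    """Return the shared devices suspected as the cause, or the
    error-occurring host itself when no sharing host shows an error."""
    shared_errors = analyze(error_host, hosts_with_errors, device_info)
    suspects = sorted(dev for dev, hosts in shared_errors.items() if hosts)
    return suspects if suspects else [error_host]
```

With this sample data, an error on Guest1 together with an error on Guest2 points to the devices the two hosts share, whereas an error on Guest1 alone points back to Guest1.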

This concludes the overview of the failure analysis apparatus 100. Next, a more specific configuration of the failure analysis system 1 is described with reference to FIGS. 2 to 6 in addition to FIG. 1.

As shown in FIG. 1, in the present embodiment the monitoring target server 200 includes a configuration information acquisition unit 201 and an event storage unit 210. The configuration information acquisition unit 201 constantly monitors the configuration information of the hardware and software that make up the monitoring target server 200, and acquires this configuration information, including configuration changes. In general, a physical OS and a hypervisor manage information on the physical devices and logical devices that they themselves use. Examples of concrete ways to acquire the configuration information include using an API (Application Program Interface) provided by the monitoring target server 200, or running an agent program. In the present embodiment, the configuration information acquired by the configuration information acquisition unit 201 is stored in the configuration information storage unit 140 via the data analysis unit 132 and the table update unit 133. The event storage unit 210 of the monitoring target server 200 stores information on events that occur in the monitoring target server 200.

In the present embodiment, the monitoring terminal 300 is operated by an operator. The monitoring terminal 300 includes a message search unit 301, a request unit 310, a display control unit 313, and a display unit 314.

In the present embodiment, the message search unit 301 searches the messages stored in the message storage unit 110 (described later) for, for example, a single message in response to the operator's operation.

In the present embodiment, the request unit 310 includes a cause narrowing request unit 311 and a message list request unit 312. The cause narrowing request unit 311 is operated by an operator who has viewed the message found by the message search unit 301, and instructs the failure analysis apparatus 100 to narrow down the cause of the error message. The message list request unit 312 is operated by an operator who has viewed the cause narrowing result that the failure analysis apparatus 100 returned in response to the operation of the cause narrowing request unit 311, and requests the failure analysis apparatus 100 to create a list of the results of searching for error messages.

In the present embodiment, the display control unit 313 displays an image based on the data it is given on the display screen of the display unit 314, such as a liquid crystal display.

In the present embodiment, the failure analysis apparatus 100 includes a message monitoring unit 101, a message storage unit 110, a device information organizing unit 120, a processing unit 130, and a result output unit 111.

In the present embodiment, the message monitoring unit 101 monitors the events stored in the event storage unit 210 of the monitoring target server 200, acquires each event as a message, and stores the acquired message in the message storage unit 110.

In the present embodiment, the device information organizing unit 120 includes the obtaining unit 131, a data analysis unit 132, a table update unit 133, and a configuration information storage unit 140.

In the present embodiment, the obtaining unit 131, which serves as a configuration information monitoring unit, acquires configuration information, such as information on the hardware and software that make up the monitoring target server 200, from the configuration information acquisition unit 201.

In the present embodiment, the information obtained by the obtaining unit 131 is grouped by type of information, analyzed, and classified by the data analysis unit 132. The analyzed data is registered in the configuration information storage unit 140 by the table update unit 133.

FIG. 2 is a diagram showing the data tables 141 to 145, included in the configuration information storage unit 140, that are necessary for the operation of the present embodiment. As shown in FIG. 2, in the present embodiment the configuration information storage unit 140 stores tables that indicate, among other things, the relationships between hosts and the physical and logical devices. More specifically, the configuration information storage unit 140 stores a host management table 141, a physical disk management table 142, a logical disk management table 143, a NIC (Network Interface Card) management table 144, and a resource allocation table 145.

FIG. 3 is a diagram showing the host management table 141. As shown in FIG. 3, in the present embodiment the host management table 141 includes a table 1411 that manages all host names, a table 1412 that manages the host names of hypervisors, and a table 1413 that manages the names of the virtual hosts configured on the hypervisors. The information in tables 1411, 1412, and 1413 is mutually related.

In the present embodiment, table 1411 stores host IDs and the host name corresponding to each host ID. For example, table 1411 stores VM1 as the host name of host ID 1 and Guest1 as the host name of host ID 2. It further stores Guest2, VM2, Guest3, host1, ... as the host names of host IDs 3, 4, 5, 6, ..., respectively. Table 1412 stores hypervisor IDs and the host name corresponding to each hypervisor ID. For example, table 1412 stores VM1 as the host name of hypervisor ID 1. Table 1413 stores virtual host IDs and the host name corresponding to each virtual host ID. For example, table 1413 stores Guest1 as the host name of virtual host ID 1 and Guest2 as the host name of virtual host ID 2.
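As a rough illustration, tables 1412 and 1413 can be modeled as ID-to-name mappings, which is enough to classify a host name. This is only a sketch: the dictionary names and the function `host_kind` are hypothetical, and the entries are limited to the examples given in the text.

```python
from typing import Dict

# Entries taken from the examples in the text.
HYPERVISORS: Dict[int, str] = {1: "VM1"}                     # table 1412
VIRTUAL_HOSTS: Dict[int, str] = {1: "Guest1", 2: "Guest2"}   # table 1413


def host_kind(host_name: str) -> str:
    """Classify a host name as 'virtual', 'hypervisor', or 'other'
    by looking it up in the two tables."""
    if host_name in VIRTUAL_HOSTS.values():
        return "virtual"
    if host_name in HYPERVISORS.values():
        return "hypervisor"
    return "other"
```

A lookup of this kind is what a check such as "is the message source a virtual host?" would rely on.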

FIG. 4 is a diagram showing the physical disk management table 142, the logical disk management table 143, and the NIC management table 144. As shown in FIG. 4, the physical disk management table 142 includes a table 1421 that manages all monitored physical disks and a table 1422 that manages the hosts using each physical disk. With these tables, the physical disks used by a host can be identified from its host name.

In the present embodiment, table 1421 stores physical disk IDs and the disk name corresponding to each physical disk ID. For example, table 1421 stores DiskA as the disk name of physical disk ID 1, DiskB as the disk name of physical disk ID 2, and DiskC as the disk name of physical disk ID 3. Table 1422 stores the IDs of the hosts that use physical disk DiskA and the host name corresponding to each host ID. For example, for physical disk DiskA, table 1422 stores VM1 as the host name of host ID 1 and VM2 as the host name of host ID 4. Although FIG. 4 shows only the host IDs and host names for physical disk A, table 1422 also stores the host IDs and host names for each of physical disks B, C, and so on.

In the present embodiment, the logical disk management table 143 includes a table 1431 that stores the logical disk names corresponding to each physical disk in table 1421 of the physical disk management table 142. Table 1431 is linked to a table 1432 that stores the names of the hosts using each logical disk. This makes it possible to identify the logical disks a host uses from its host name. Table 1431 stores logical disk IDs and the logical disk name corresponding to each logical disk ID. For example, table 1431 stores logical A as the name of logical disk ID 1, logical B as the name of logical disk ID 2, and logical C as the name of logical disk ID 3. Although FIG. 4 shows table 1431 only for the logical disks corresponding to physical disk A, table 1431 also stores information on the logical disks corresponding to each of physical disks B, C, and so on.

In the present embodiment, the NIC management table 144 manages all monitored physical NICs, and includes a table 1441 and a table 1442 that manages the hosts using each NIC. This makes it possible to identify the NICs a host uses from its host name. Table 1441 stores NIC IDs and the MAC (Media Access Control) address name corresponding to each NIC ID. For example, table 1441 stores MAC1 as the MAC address of NIC ID 1, MAC2 as the MAC address of NIC ID 2, and MAC3 as the MAC address of NIC ID 3. Although FIG. 4 shows only the host IDs and host names for MAC1, table 1442 also stores information on the hosts corresponding to the MAC addresses MAC2 and MAC3.
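The per-device host tables (1422 and 1442) support a reverse lookup from a host name to the devices it uses. A minimal sketch follows, assuming each table is a mapping from device name to its using hosts. Only the DiskA entry comes from the text; every other entry, along with the names `DISK_USERS`, `NIC_USERS`, and `devices_used_by`, is a hypothetical placeholder.

```python
from typing import Dict, List

# device name -> {host ID: host name}
DISK_USERS: Dict[str, Dict[int, str]] = {
    "DiskA": {1: "VM1", 4: "VM2"},     # example given in the text
    "DiskB": {6: "host1"},             # hypothetical entry
}
NIC_USERS: Dict[str, Dict[int, str]] = {
    "MAC1": {1: "VM1"},                # hypothetical entry
    "MAC2": {4: "VM2", 6: "host1"},    # hypothetical entry
}


def devices_used_by(host_name: str,
                    *tables: Dict[str, Dict[int, str]]) -> List[str]:
    """Return the sorted names of all devices whose using-host table
    contains host_name."""
    found: List[str] = []
    for table in tables:
        for device, users in table.items():
            if host_name in users.values():
                found.append(device)
    return sorted(found)
```

For example, `devices_used_by("VM1", DISK_USERS, NIC_USERS)` collects the disks and NICs registered for VM1 across both tables.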

FIG. 5 is a diagram showing the resource allocation table 145. As shown in FIG. 5, in the present embodiment the resource allocation table 145 includes a hypervisor table 1451, a NIC table 1452, a logical disk table 1453, and a virtual host table 1454.

In the present embodiment, the hypervisor table 1451 stores hypervisor IDs and the host name corresponding to each hypervisor ID. For example, the hypervisor table 1451 stores VM1 as the host name of hypervisor ID 1. The NIC table 1452 stores NIC IDs and the MAC address corresponding to each NIC ID. For example, the NIC table 1452 stores MAC1, MAC2, and MAC3 as the MAC address names of NIC ID 1, NIC ID 2, and NIC ID 3, respectively. The logical disk table 1453 stores logical disk IDs and the logical disk name corresponding to each logical disk ID. For example, the logical disk table 1453 stores logical A, logical B, and logical C as the names of logical disk IDs 1, 2, and 3, respectively. The virtual host table 1454 stores virtual host IDs and the host name corresponding to each virtual host ID. For example, the virtual host table 1454 stores Guest1 as the host name of virtual host ID 1.

In the present embodiment, within the resource allocation table 145, the virtual host table 1454, which contains the resource information that the hypervisor allocates to each virtual host, is related to the NIC table 1452 and the logical disk table 1453. This makes it possible to identify the devices allocated to a virtual host from its virtual host name. The NIC table 1452 and the logical disk table 1453 are each also related to the hypervisor table 1451. This makes it possible to identify, from a host name, the hypervisor that the host uses.
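The two lookups that these relationships enable (virtual host to allocated devices, and from there to the hypervisor) might be sketched as follows. The concrete allocations, the names `VHOST_DEVICES`, `DEVICE_HYPERVISOR`, `allocated_devices`, and `hypervisor_of`, and the device labels `LogicalA` and `LogicalB` are assumptions made for illustration; the source does not list which devices are allocated to which virtual host.

```python
from typing import Dict, List, Optional

# Hypothetical allocations: virtual host name -> allocated NICs / logical disks.
VHOST_DEVICES: Dict[str, List[str]] = {
    "Guest1": ["MAC1", "LogicalA"],
    "Guest2": ["MAC2", "LogicalB"],
}

# Hypothetical relation from each NIC / logical disk to its hypervisor.
DEVICE_HYPERVISOR: Dict[str, str] = {
    "MAC1": "VM1",
    "MAC2": "VM1",
    "LogicalA": "VM1",
    "LogicalB": "VM1",
}


def allocated_devices(virtual_host: str) -> List[str]:
    """Return the devices allocated to the given virtual host."""
    return list(VHOST_DEVICES.get(virtual_host, []))


def hypervisor_of(virtual_host: str) -> Optional[str]:
    """Follow virtual host -> device -> hypervisor; None if unknown."""
    for device in allocated_devices(virtual_host):
        hypervisor = DEVICE_HYPERVISOR.get(device)
        if hypervisor is not None:
            return hypervisor
    return None
```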

Next, the processing unit 130, which performs the cause narrowing in the present embodiment as shown in FIG. 1, is described. In the present embodiment, the processing unit 130 includes an occurrence node determination unit 151, a message search range calculation unit 152 (also called the message extraction range calculation unit), the error presence/absence determination unit 153, a suspect narrowing unit 154, and a sort processing unit 155. The processing unit 130 is connected to both the configuration information storage unit 140 and the message storage unit 110.

In the present embodiment, the occurrence node determination unit 151 determines whether the source of a message is a virtual host. The message search range calculation unit 152 takes the time of occurrence of the error selected by the operator as a reference and determines how long a period before and after that event is to be searched for messages. The error presence/absence determination unit 153 determines whether error messages occurred at the occurrence node and related hosts within the extracted period. The suspect narrowing unit 154 narrows down the suspect devices based on the proportion of hosts that have issued error messages. The sort processing unit 155 sorts the output data from the suspect narrowing unit 154.
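Two of these steps, computing the search period and ranking devices by the proportion of their hosts that show errors, can be sketched as below. The source does not state the width of the period or the exact ranking rule, so the five-minute defaults, the descending-ratio ordering, and the function names are all assumptions.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Set, Tuple


def search_window(occurred_at: datetime,
                  before: timedelta = timedelta(minutes=5),  # assumed width
                  after: timedelta = timedelta(minutes=5),   # assumed width
                  ) -> Tuple[datetime, datetime]:
    """Return the (start, end) period around the selected error."""
    return occurred_at - before, occurred_at + after


def rank_suspects(device_users: Dict[str, Set[str]],
                  hosts_with_errors: Set[str]) -> List[Tuple[str, float]]:
    """Rank devices by the fraction of their using hosts that issued
    error messages, highest first (ties broken by device name)."""
    ratios = {
        device: len(users & hosts_with_errors) / len(users)
        for device, users in device_users.items()
        if users  # skip devices with no registered hosts
    }
    return sorted(ratios.items(), key=lambda item: (-item[1], item[0]))
```

Under this ordering, a device whose hosts all report errors ranks above one where only some of its hosts do, which is one plausible reading of narrowing by "proportion of hosts".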

In the present embodiment, the data sorted by the sort processing unit 155 of the processing unit 130 is output by the result output unit 111 to the display control unit 313 of the monitoring terminal 300 over the network 400. The display control unit 313 displays the received data on the display unit 314 of the monitoring terminal 300. In this way, the failure analysis apparatus 100 returns the result of the operation of the request unit 310 to the operator using the monitoring terminal 300. In the present embodiment, the processes of the occurrence node determination unit 151, the message search range calculation unit 152, the error presence/absence determination unit 153, the suspect narrowing unit 154, and the sort processing unit 155 are executed in response to a request from the cause narrowing request unit 311. The request from the cause narrowing request unit 311 is given to the occurrence node determination unit 151.

また、本実施の形態では、処理部１３０は、要求対象判定部１６１と、メッセージ検索部１６２と、を更に含んでいる。要求対象判定部１６１は、メッセージ要求対象ホストを特定する。メッセージ検索部１６２は、特定のホストについてのメッセージをメッセージ蓄積部１１０内から検索する。ソート処理部１５５は、検索されたメッセージを所定のルールに従って並び替える。本実施の形態では、当該メッセージに関する処理は、メッセージ一覧要求部３１２からの要求により実行される。メッセージ一覧要求部３１２の要求は、要求対象判定部１６１へ与えられる。 In the present embodiment, the processing unit 130 further includes a request target determination unit 161 and a message search unit 162. The request target determination unit 161 specifies a message request target host. The message search unit 162 searches the message storage unit 110 for a message about a specific host. The sort processing unit 155 sorts the retrieved messages according to a predetermined rule. In the present embodiment, processing related to the message is executed by a request from the message list request unit 312. The request from the message list request unit 312 is given to the request target determination unit 161.

前述したように、監視端末３００は、メッセージ検索部３０１を含んでいる。メッセージ検索部３０１は、ネットワーク４００を介してメッセージ蓄積部１１０に接続されている。メッセージ検索部３０１は、メッセージ蓄積部１１０に保存されているメッセージに対して検索を行うことで、任意のメッセージを参照することができる。なお、メッセージ蓄積部１１０に蓄積されているメッセージの構造例は、図６に示すとおりである。 As described above, the monitoring terminal 300 includes the message search unit 301. The message search unit 301 is connected to the message storage unit 110 via the network 400. The message search unit 301 can refer to an arbitrary message by searching for a message stored in the message storage unit 110. An example of the structure of messages stored in the message storage unit 110 is as shown in FIG.

図６は、メッセージの構造例を示す図である。本実施の形態では、メッセージ蓄積部１１０に蓄積されるメッセージは、発生ノード１１０１、メッセージＩＤ１１０２、メッセージ内容１１０３、アラートレベル１１０４、発生日１１０５、および発生時間１１０６を含んでいる。 FIG. 6 is a diagram illustrating an example of a message structure. In the present embodiment, the message stored in the message storage unit 110 includes the generation node 1101, the message ID 1102, the message content 1103, the alert level 1104, the generation date 1105, and the generation time 1106.

発生ノード１１０１には、イベントが発生したホスト名が示されている。メッセージ内容１１０３には、具体的なイベント内容（エラー内容）が示されている。アラートレベル１１０４には、エラーが生じたイベントについて、”Ｅｒｒｏｒ”が示されている。発生日１１０５には、イベントの発生日が示されている。発生時間１１０６には、イベントが発生した時刻が示されている。 The generation node 1101 indicates the name of the host where the event has occurred. The message content 1103 indicates specific event content (error content). The alert level 1104 indicates “Error” for an event in which an error has occurred. The occurrence date 1105 indicates the date of occurrence of the event. The occurrence time 1106 indicates the time when the event occurred.
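
The message structure of FIG. 6 can be pictured as a small record type. The following is a minimal sketch only: the class name, the field names, and the sample values are assumptions introduced for illustration, and only the reference numerals in the comments come from the description.

```python
from dataclasses import dataclass
from datetime import date, datetime, time


@dataclass(frozen=True)
class Message:
    """One stored message; field names are illustrative, numerals follow FIG. 6."""
    node: str          # generation node 1101: host name where the event occurred
    message_id: str    # message ID 1102
    content: str       # message content 1103
    alert_level: str   # alert level 1104, "Error" for error events
    occurred_on: date  # occurrence date 1105
    occurred_at: time  # occurrence time 1106

    @property
    def timestamp(self) -> datetime:
        # Combine date 1105 and time 1106 into one comparable value.
        return datetime.combine(self.occurred_on, self.occurred_at)

    @property
    def is_error(self) -> bool:
        return self.alert_level == "Error"


# Hypothetical sample record.
sample = Message("vm-01", "M0001", "disk I/O timeout", "Error",
                 date(2011, 9, 27), time(10, 15, 30))
```

Merging the date and time fields into a single timestamp is a convenience of this sketch; it makes the later time-window comparison a plain range check.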

［本実施の形態における動作の説明］ [Description of operation in this embodiment]
[動作の概要] [Overview of operation]
図１に示すように、本実施の形態では、監視対象サーバ２００にエラーが発生した場合、オペレータは、監視端末３００の原因絞込み要求部３１１を操作することで、エラーに関連すると判断したメッセージを１つ選択する。これに基づき、障害分析装置１００は、そのメッセージを発生したエラー発生ホストと、デバイス情報とを関連づける。そして、障害分析装置１００は、エラーメッセージを発生したホストと、デバイスを共用している１または複数のホストを抽出する。 As shown in FIG. 1, in the present embodiment, when an error occurs in the monitoring target server 200, the operator operates the cause narrowing request unit 311 of the monitoring terminal 300 to select one message judged to be related to the error. Based on this selection, the failure analysis apparatus 100 associates the host that issued the message (the error occurrence host) with the device information. The failure analysis apparatus 100 then extracts one or more hosts that share a device with the host that issued the error message.

さらに、障害分析装置１００は、上記エラー発生時近辺に、上記共用のホストにエラーが発生しているか判定する。そして、障害分析装置１００は、エラーの発生の有無を判定した判定結果に基づき、エラーの発生原因として疑われる被疑対象デバイス、および被疑対象デバイスを利用するホストの一覧を作成する。そして、障害分析装置１００は、作成したホストの一覧を監視端末３００に返す。被疑対象デバイス、および被疑対象デバイスを利用するホストに関して、メッセージ一覧要求部３１２の操作に基づいて、メッセージ一覧の要求がある場合、障害分析装置１００は、メッセージ蓄積部１１０内を検索する。そして、障害分析装置１００は、被疑対象デバイスまたは当該デバイスを利用するホストのメッセージを検索し、監視端末３００へ検索結果を出力する。 Furthermore, the failure analysis apparatus 100 determines whether an error has occurred in the shared host near the time of the error occurrence. Then, the failure analysis apparatus 100 creates a list of suspected devices suspected of causing the error and hosts that use the suspected device based on the determination result of determining whether or not an error has occurred. Then, the failure analysis apparatus 100 returns the created host list to the monitoring terminal 300. When there is a message list request for the suspected device and the host using the suspected device, based on the operation of the message list requesting unit 312, the failure analysis apparatus 100 searches the message storage unit 110. Then, the failure analysis apparatus 100 searches for a suspected device or a message of a host that uses the device, and outputs the search result to the monitoring terminal 300.

[本実施の形態における動作の詳細な説明] [Detailed description of operation in this embodiment]
次に、本実施の形態における障害分析装置１００の動作の詳細について、図７〜図１２を用いて説明する。図７は、エラー原因絞込みの処理の流れを説明するためのフローチャートである。また、以下の説明においては、適宜、図１〜図６を参照する。また、本実施の形態では、障害分析装置１００を動作させることによって、障害分析方法が実施される。よって、本実施の形態における、障害分析方法の説明は、以下の障害分析装置１００の動作説明に代える。 Next, details of the operation of the failure analysis apparatus 100 according to the present embodiment will be described with reference to FIGS. 7 to 12. FIG. 7 is a flowchart for explaining the flow of processing for narrowing down the cause of an error. In the following description, FIGS. 1 to 6 are referred to as appropriate. In the present embodiment, the failure analysis method is carried out by operating the failure analysis apparatus 100. Therefore, the following description of the operation of the failure analysis apparatus 100 also serves as the description of the failure analysis method in the present embodiment.

図７に示すように、本実施の形態では、障害分析装置１００は、任意のエラーメッセージ１件に対して、原因絞込み要求があるか否かを判定する（ステップＳ１）。たとえば、オペレータが原因絞込み要求部３１１を操作することにより、任意のエラーメッセージ１件に関する原因絞込み要求が発せられると（ステップＳ１でＹＥＳ）、発生ノード判定部１５１が処理を行う（ステップＳ２）。次に、メッセージ検索範囲算出部１５２が処理を行い（ステップＳ３）、以後、順に、エラー有無判定部１５３、被疑対象絞込み部１５４、ソート処理部１５５が処理を行う（ステップＳ４、Ｓ５、Ｓ６）。ソート処理部１５５でソートされた、障害原因の被疑対象の推定結果は、結果出力部１１１が、ネットワーク４００を介して監視端末３００の表示制御部３１３へ出力する（ステップＳ７）。これにより、上記被疑対象の推定結果は、表示部３１４に表示される。 As shown in FIG. 7, in the present embodiment, the failure analysis apparatus 100 determines whether there is a cause narrowing request for one arbitrary error message (step S1). For example, when a cause narrowing request for one arbitrary error message is issued by operating the cause narrowing request unit 311 (YES in step S1), the occurrence node determination unit 151 performs processing (step S2). Next, the message search range calculation unit 152 performs processing (step S3), and thereafter, the error presence / absence determination unit 153, the suspicious object narrowing unit 154, and the sort processing unit 155 perform processing in order (steps S4, S5, and S6). . The result output unit 111 outputs the estimation result of the suspected cause of failure sorted by the sort processing unit 155 to the display control unit 313 of the monitoring terminal 300 via the network 400 (step S7). Thereby, the estimation result of the suspected object is displayed on the display unit 314.

次に、被疑対象デバイスの推定結果を見たオペレータによって、メッセージ一覧要求部３１２が操作されることにより、上記被疑対象に対する検索要求が障害分析装置１００へ発せられると（ステップＳ８でＹＥＳ）、要求対象判定部１６１、メッセージ検索部１６２、およびソート処理部１５５が、順に処理を行う（ステップＳ９、Ｓ１０、Ｓ１１）。ソート処理部１５５でソートされた、被疑対象に関するメッセージの一覧は、結果出力部１１１がネットワーク４００を介して、監視端末３００の表示制御部３１３へ出力する（ステップＳ１２）。これにより、当該メッセージ一覧は、表示部３１４に表示される。 Next, when a search request for the suspected object is issued to the failure analysis apparatus 100 by operating the message list request unit 312 by the operator who has seen the estimation result of the suspected object device (YES in step S8), the request is made. The object determination unit 161, the message search unit 162, and the sort processing unit 155 perform processing in order (steps S9, S10, and S11). The result output unit 111 outputs the list of messages related to the suspected object sorted by the sort processing unit 155 to the display control unit 313 of the monitoring terminal 300 via the network 400 (step S12). Thereby, the message list is displayed on the display unit 314.

上記したように、本実施の形態において、障害原因絞込みの処理は大きく分けて２段階ある。１段階目の処理は、１件のエラーメッセージから、原因と考えられるデバイス、および関連するホスト名を列挙する処理（ステップＳ２〜Ｓ６）である。２段階目の処理は、列挙された対象についてメッセージ検索を行う処理（ステップＳ９〜Ｓ１１）である。まず、前者の処理（ステップＳ２〜Ｓ６）について、図８〜図１１を用いて説明する。 As described above, in the present embodiment, failure cause narrowing-down processing is roughly divided into two stages. The first-stage process is a process (steps S2 to S6) for enumerating a possible device and a related host name from one error message. The second stage process is a process (steps S9 to S11) for performing a message search for the listed objects. First, the former process (steps S2 to S6) will be described with reference to FIGS.

図８（ａ）は、発生ノード判定部１５１が行う処理について説明するためのフローチャートであり、図７のステップＳ２の処理の詳細を示している。図８（ｂ）は、メッセージ検索範囲算出部１５２が行う処理について説明するためのフローチャートであり、図７のステップＳ３の処理の詳細を示している。図９は、エラー有無判定部１５３が行う処理について説明するためのフローチャートであり、図７のステップＳ４の処理の詳細を示している。 FIG. 8A is a flowchart for explaining the process performed by the generation node determination unit 151, and shows details of the process in step S2 of FIG. FIG. 8B is a flowchart for explaining the process performed by the message search range calculation unit 152, and shows details of the process in step S3 of FIG. FIG. 9 is a flowchart for explaining the process performed by the error presence / absence determination unit 153, and shows details of the process in step S4 of FIG.

図７に示すように、たとえば、オペレータが、監視端末３００の原因絞込み要求部３１１を操作することにより、原因絞込み要求部３１１から、あるエラーメッセージ１件について、障害分析装置１００に処理を行う要求が発せられると、図８（ａ）に示すように、発生ノード判定部１５１は、メッセージ蓄積部１１０に蓄積されている指定されたメッセージの発生ノード１１０１のホスト名と、構成情報蓄積部１４０中のホスト管理テーブル１４１とを照合し、発生ノード１１０１のホスト名と一致するホスト名を取得する（ステップＳ１０１）。次に、発生ノード判定部１５１は、取得したホスト名に仮想ホストテーブル１４１３のホスト名が含まれるか否か判定する（ステップＳ１０２）。取得したホスト名に仮想ホストＩＤが含まれている場合（ステップＳ１０２でＹＥＳ）、発生ノード判定部１５１は、メッセージ発生ノードが仮想ホストであると判定する（ステップＳ１０３）。一方、取得したホスト名に仮想ホストＩＤが含まれていない場合（ステップＳ１０２でＮＯ）、メッセージ発生ノードは、物理ホスト、またはハイパバイザであると判定する（ステップＳ１０４）。 As shown in FIG. 7, for example, when the operator operates the cause narrowing request unit 311 of the monitoring terminal 300 and the cause narrowing request unit 311 issues a request to the failure analysis apparatus 100 to process one error message, the generation node determination unit 151, as shown in FIG. 8A, collates the host name in the generation node 1101 of the specified message stored in the message storage unit 110 against the host management table 141 in the configuration information storage unit 140, and acquires the host name that matches the host name of the generation node 1101 (step S101). Next, the generation node determination unit 151 determines whether or not the acquired host name includes a host name in the virtual host table 1413 (step S102). When the acquired host name includes a virtual host ID (YES in step S102), the generation node determination unit 151 determines that the message generation node is a virtual host (step S103). On the other hand, when the acquired host name does not include a virtual host ID (NO in step S102), the generation node determination unit 151 determines that the message generation node is a physical host or a hypervisor (step S104).
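
Steps S101 to S104 amount to a table lookup followed by a membership test. The sketch below illustrates that idea under stated assumptions: the three containers stand in for the host management table 141 and the virtual host table 1413, and their layout, the host names, and the function name `classify_node` are hypothetical.

```python
from typing import Optional

# Hypothetical stand-ins for host management table 141.
PHYSICAL_HOSTS = {"phys-01"}
HYPERVISORS = {"hv-01"}
# Hypothetical stand-in for virtual host table 1413: virtual host -> hypervisor.
VIRTUAL_HOSTS = {"vm-01": "hv-01", "vm-02": "hv-01"}


def classify_node(host_name: str) -> Optional[str]:
    """Classify the generation node of a message (steps S101 to S104)."""
    known = PHYSICAL_HOSTS | HYPERVISORS | set(VIRTUAL_HOSTS)
    if host_name not in known:
        # S101 found no matching host name; the description does not cover
        # this case, so returning None is an assumption of this sketch.
        return None
    if host_name in VIRTUAL_HOSTS:       # S102 YES -> S103
        return "virtual"
    return "physical_or_hypervisor"      # S102 NO -> S104
```

The result is stored once and reused later, which matches the remark in the description that the virtual-host determination has already been made before step S306.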

図８（ｂ）に示すように、メッセージ検索範囲算出部１５２は、原因絞込み要求のあったエラーメッセージ中の発生日１１０５、および発生時間１１０６を中心に、すなわち、エラー発生状況に基づいて、エラー有無判定部１５３でエラーの判定対象とするメッセージ取得範囲を算出する（ステップＳ２０１）。メッセージ取得範囲の指定方法については特に限定されないが、本実施の形態では、原因絞込みを行っているエラーメッセージの発生時刻を基準として前後数秒または数十秒の期間に発生したメッセージを検索範囲対象とする。なお、メッセージ取得範囲の指定方法として、上記エラーメッセージの発生時刻の前後に発生した数十件のメッセージを検索対象範囲としてもよい。 As shown in FIG. 8B, the message search range calculation unit 152 calculates the message acquisition range that the error presence / absence determination unit 153 will examine for errors, centered on the occurrence date 1105 and the occurrence time 1106 in the error message for which the cause narrowing request was made, that is, based on the error occurrence status (step S201). The method of specifying the message acquisition range is not particularly limited; in the present embodiment, messages issued within several seconds or several tens of seconds before and after the occurrence time of the error message under investigation are used as the search range. Alternatively, the message acquisition range may be specified as the several tens of messages issued immediately before and after the occurrence time of the error message.
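
The two ways of specifying the range in step S201 (a time margin, or a number of neighbouring messages) can be sketched as follows. The function names, the default margin of 30 seconds, and the default count of 20 are assumptions chosen for illustration; the description only says "several seconds or several tens of seconds" and "several tens of messages".

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def time_window(error_time, margin_seconds=30):
    """Range by time: [error_time - margin, error_time + margin]."""
    margin = timedelta(seconds=margin_seconds)
    return error_time - margin, error_time + margin


def neighbours_by_count(timestamps, error_time, n=20):
    """Range by count: up to n messages before and n after the error message."""
    ordered = sorted(timestamps)
    i = bisect_left(ordered, error_time)   # position of the error message
    return ordered[max(0, i - n): i + n + 1]


# Hypothetical message timestamps, one every 10 seconds.
stamps = [datetime(2011, 9, 27, 10, 0, s) for s in range(0, 60, 10)]
start, end = time_window(datetime(2011, 9, 27, 10, 0, 30), margin_seconds=10)
nearby = neighbours_by_count(stamps, datetime(2011, 9, 27, 10, 0, 30), n=1)
```

A time margin keeps the range bounded in time even when logging is sparse, while a count-based range bounds the amount of data examined when logging is dense.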

次に、図９に示すように、エラー有無判定部１５３は、構成情報蓄積部１４０を検索することにより、指定された発生ノード１１０１のホストが利用しているデバイス情報を抽出する（ステップＳ３０１）。 Next, as shown in FIG. 9, the error presence / absence determination unit 153 searches the configuration information storage unit 140 to extract device information used by the host of the specified generation node 1101 (step S301). .

次に、エラー有無判定部１５３は、指定された発生ノード１１０１のホストと論理デバイスを共用している全てのホストを抽出する（ステップＳ３０２）。次に、エラー有無判定部１５３は、指定された発生ノード１１０１のホストが利用する論理デバイスに紐づく全てのホストについて、ステップＳ２０１で算出されたメッセージ検索範囲に該当するメッセージを、メッセージ蓄積部１１０から検索する。そして、エラー有無判定部１５３は、当該期間にエラーが発生しているか否か、すなわち、アラートレベル１１０４がＥｒｒｏｒであるメッセージが発生しているか否かを判定する（ステップＳ３０３）。 Next, the error presence / absence determination unit 153 extracts all the hosts that share a logical device with the host of the specified generation node 1101 (step S302). Next, for all hosts associated with the logical devices used by the host of the specified generation node 1101, the error presence / absence determination unit 153 searches the message storage unit 110 for messages that fall within the message search range calculated in step S201. Then, the error presence / absence determination unit 153 determines whether or not an error occurred during that period, that is, whether or not a message whose alert level 1104 is “Error” was issued (step S303).

次に、エラー有無判定部１５３は、ステップＳ３０４、およびステップＳ３０５の処理を行う。ステップＳ３０４およびステップＳ３０５の処理は、論理デバイスについての検索処理（ステップＳ３０２、Ｓ３０３）と同様の処理であり、物理デバイスについての検索処理を行う。 Next, the error presence / absence determining unit 153 performs the processes of steps S304 and S305. The processes in step S304 and step S305 are the same as the search process for logical devices (steps S302 and S303), and the search process for physical devices is performed.

具体的には、エラー有無判定部１５３は、指定された発生ノード１１０１のホストと物理デバイスを共用している全てのホストを抽出する（ステップＳ３０４）。次に、エラー有無判定部１５３は、指定された発生ノード１１０１のホストが利用する物理デバイスに紐づく全てのホストについて、ステップＳ２０１で算出されたメッセージ検索範囲に該当するメッセージを、メッセージ蓄積部１１０から検索する。そして、エラー有無判定部１５３は、当該期間にエラーが発生しているか否か、すなわち、アラートレベル１１０４がＥｒｒｏｒであるメッセージが発生しているか否かを判定する（ステップＳ３０５）。 Specifically, the error presence / absence determination unit 153 extracts all the hosts that share a physical device with the host of the specified generation node 1101 (step S304). Next, for all hosts associated with the physical devices used by the host of the specified generation node 1101, the error presence / absence determination unit 153 searches the message storage unit 110 for messages that fall within the message search range calculated in step S201. Then, the error presence / absence determination unit 153 determines whether or not an error occurred during that period, that is, whether or not a message whose alert level 1104 is “Error” was issued (step S305).

次に、エラー有無判定部１５３は、指定された発生ノード１１０１のホストが、仮想ホストであるか否かを確認する（ステップＳ３０６）。なお、発生ノード１１０１が仮想ホストであるか否かの判定は、予め発生ノード判定部１５１においてされているものである。エラー有無判定部１５３は、発生ノード１１０１が仮想ホストであると確認した場合（ステップＳ３０６でＹＥＳ）、ステップＳ３０７に進む。ステップＳ３０７では、エラー有無判定部１５３は、仮想ホストの基盤となるハイパバイザを構成情報蓄積部１４０より特定し、ハイパバイザをエラー発生ノードと見立てる。次に、エラー有無判定部１５３は、ステップＳ３０１での処理と同様に、構成情報蓄積部１４０を検索することで、ハイパバイザが利用しているデバイス情報を抽出する（ステップＳ３０８）。次に、エラー有無判定部１５３は、ステップＳ３０９〜Ｓ３１２の処理を行うことで、エラー発生状況を判定する。 Next, the error presence / absence determination unit 153 confirms whether or not the host of the specified generation node 1101 is a virtual host (step S306). Whether or not the generation node 1101 is a virtual host is determined in advance in the generation node determination unit 151. If the error presence / absence determination unit 153 confirms that the generation node 1101 is a virtual host (YES in step S306), the process proceeds to step S307. In step S307, the error presence / absence determination unit 153 identifies the hypervisor serving as the base of the virtual host from the configuration information storage unit 140, and regards the hypervisor as an error occurrence node. Next, the error presence / absence determination unit 153 extracts the device information used by the hypervisor by searching the configuration information storage unit 140 in the same manner as the processing in step S301 (step S308). Next, the error presence / absence determination unit 153 determines the error occurrence status by performing the processing of steps S309 to S312.

なお、ステップＳ３０９およびステップＳ３１０の処理は、それぞれ、ステップＳ３０２およびステップＳ３０３の処理と同様である。また、ステップＳ３１１およびステップＳ３１２の処理は、それぞれ、ステップＳ３０４およびステップＳ３０５の処理と同様である。 Note that the processes in steps S309 and S310 are the same as the processes in steps S302 and S303, respectively. Likewise, the processes in steps S311 and S312 are the same as the processes in steps S304 and S305, respectively.

具体的には、エラー有無判定部１５３は、仮想ホストの基盤となるハイパバイザと論理デバイスを共用している全てのホストを抽出する（ステップＳ３０９）。次に、エラー有無判定部１５３は、上記ハイパバイザが利用する論理デバイスに紐づく全てのホストについて、ステップＳ２０１で算出されたメッセージ検索範囲に該当するメッセージを、メッセージ蓄積部１１０から検索する。そして、エラー有無判定部１５３は、当該期間にエラーが発生しているか否か、すなわち、アラートレベル１１０４がＥｒｒｏｒであるメッセージが発生しているか否かを判定する（ステップＳ３１０）。 Specifically, the error presence / absence determination unit 153 extracts all the hosts that share the logical device with the hypervisor that is the basis of the virtual host (step S309). Next, the error presence / absence determination unit 153 searches the message storage unit 110 for messages corresponding to the message search range calculated in step S201 for all the hosts associated with the logical devices used by the hypervisor. Then, the error presence / absence determination unit 153 determines whether or not an error has occurred during the period, that is, whether or not a message with an alert level 1104 has occurred (step S310).

次に、エラー有無判定部１５３は、上記ハイパバイザと物理デバイスを共用している全てのホストを抽出する（ステップＳ３１１）。次に、エラー有無判定部１５３は、上記ハイパバイザが利用する物理デバイスに紐づく全てのホストについて、ステップＳ２０１で算出されたメッセージ検索範囲に該当するメッセージを、メッセージ蓄積部１１０から検索する。そして、エラー有無判定部１５３は、当該期間にエラーが発生しているか否か、すなわち、アラートレベル１１０４がＥｒｒｏｒであるメッセージが発生しているか否かを判定する（ステップＳ３１２）。 Next, the error presence / absence determination unit 153 extracts all hosts sharing the physical device with the hypervisor (step S311). Next, the error presence / absence determination unit 153 searches the message storage unit 110 for messages corresponding to the message search range calculated in step S201 for all hosts associated with the physical device used by the hypervisor. Then, the error presence / absence determination unit 153 determines whether or not an error has occurred during the period, that is, whether or not a message having an alert level 1104 is generated (step S312).
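
Steps S306 to S308 effectively add a second analysis target when the generation node is a virtual host: the hypervisor underneath it is treated as an error occurrence node as well. A minimal sketch of that branching follows; the two dictionaries stand in for the configuration information storage unit 140, and every name and value in them is hypothetical.

```python
# Hypothetical configuration data: virtual host -> underlying hypervisor.
VIRTUAL_HOSTS = {"vm-01": "hv-01"}
# Hypothetical configuration data: host -> devices it uses.
DEVICES_BY_HOST = {
    "vm-01": {"logical": ["lun-1"], "physical": ["disk-1"]},
    "hv-01": {"logical": ["lun-9"], "physical": ["hba-1", "nic-1"]},
}


def analysis_targets(host):
    """Return (node, device info) pairs whose shared devices must be checked."""
    targets = [(host, DEVICES_BY_HOST[host])]        # S301: devices of the node
    hypervisor = VIRTUAL_HOSTS.get(host)             # S306: virtual host?
    if hypervisor is not None:
        # S307: regard the hypervisor as an error occurrence node,
        # S308: and extract the devices it uses.
        targets.append((hypervisor, DEVICES_BY_HOST[hypervisor]))
    return targets
```

Each returned pair would then go through the same logical-device and physical-device checks, which is why steps S309 to S312 mirror steps S302 to S305.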

次に、図１０に示すステップＳ３０１〜Ｓ３０３についての詳細な処理の流れを説明する。図１０は、図９のステップＳ３０１〜Ｓ３０３についての詳細な処理の流れを説明するためのフローチャートである。 Next, a detailed processing flow for steps S301 to S303 illustrated in FIG. 10 will be described. FIG. 10 is a flowchart for explaining the detailed processing flow of steps S301 to S303 in FIG.

ステップＳ３００１は、ステップＳ３０１と同一の処理であり、エラー有無判定部１５３は、構成情報蓄積部１４０を検索することにより、指定された発生ノード１１０１のホストが利用している論理デバイスおよび物理デバイスの構成情報を抽出する。次に、ステップＳ３００２では、エラー有無判定部１５３は、ステップＳ３００１で取得した、指定された発生ノード１１０１のホストが利用している論理デバイス一覧と紐づくホストがあるか否かを、構成情報蓄積部１４０を参照して判定する（ステップＳ３００２）。 Step S3001 is the same process as step S301, and the error presence / absence determination unit 153 searches the configuration information storage unit 140 to search for the logical device and physical device used by the host of the specified generation node 1101. Extract configuration information. Next, in step S3002, the error presence / absence determination unit 153 stores configuration information as to whether there is a host associated with the list of logical devices used by the host of the designated generation node 1101 acquired in step S3001. The determination is made with reference to the unit 140 (step S3002).

指定された発生ノード１１０１のホスト以外に論理デバイスを利用しているホストが存在しない場合（ステップＳ３００２でＮＯ）、エラー有無判定部１５３は、ステップＳ３００８に進む。一方、発生ノード１１０１のホスト以外に論理デバイスを利用しているホストが存在する場合（ステップＳ３００２でＹＥＳ）、エラー有無判定部１５３は、指定された発生ノード１１０１のホストと論理デバイスを共用する全てのホストを抽出する（ステップＳ３００３）。次に、エラー有無判定部１５３は、抽出した中の一のホストについて、ステップＳ２０１で算出したメッセージ検索対象期間に該当するメッセージを、メッセージ蓄積部１１０内から検索する（ステップＳ３００４）。次に、エラー有無判定部１５３は、検索したメッセージのうち、アラートレベル１１０４が”Ｅｒｒｏｒ”となっているものが１件以上存在するか否かを判定する（ステップＳ３００５）。アラートレベル１１０４が”Ｅｒｒｏｒ”となっているものが１件以上存在する場合（ステップＳ３００５でＹＥＳ）、エラー有無判定部１５３は、カウント値を１つ加算する（ステップＳ３００６）。一方、アラートレベル１１０４が”Ｅｒｒｏｒ”となっているものが無い場合（ステップＳ３００５でＮＯ）、エラー有無判定部１５３は、カウント値を加算しない。 If no host other than the host of the specified generation node 1101 uses the logical device (NO in step S3002), the error presence / absence determination unit 153 proceeds to step S3008. On the other hand, if a host other than the host of the generation node 1101 uses the logical device (YES in step S3002), the error presence / absence determination unit 153 extracts all the hosts that share the logical device with the host of the specified generation node 1101 (step S3003). Next, for one of the extracted hosts, the error presence / absence determination unit 153 searches the message storage unit 110 for messages that fall within the message search target period calculated in step S201 (step S3004). Next, the error presence / absence determination unit 153 determines whether or not at least one of the retrieved messages has an alert level 1104 of “Error” (step S3005). If at least one such message exists (YES in step S3005), the error presence / absence determination unit 153 increments the count value by one (step S3006). On the other hand, if no such message exists (NO in step S3005), the error presence / absence determination unit 153 does not increment the count value.

次に、エラー有無判定部１５３は、指定された発生ノード１１０１のホストと論理デバイスを共用するホストのうち、ステップＳ３００４〜Ｓ３００６の処理が行われていないホストが存在しているか否かを判定する（ステップＳ３００７）。指定された発生ノード１１０１のホストと論理デバイスを共用しているホストのうち、ステップＳ３００４〜Ｓ３００６の処理が行われていないホストが存在している場合（ステップＳ３００７でＹＥＳ）、エラー有無判定部１５３は、メッセージ検索対象のホストを当該ホストへシフトする（ステップＳ３００８）。そして、エラー有無判定部１５３は、当該ホストについて、ステップＳ３００４〜Ｓ３００６の処理を繰り返す。 Next, the error presence / absence determination unit 153 determines whether there is a host that has not been subjected to the processing of steps S3004 to S3006 among the hosts that share the logical device with the host of the specified generation node 1101. (Step S3007). Of the hosts sharing the logical device with the host of the specified generation node 1101, if there is a host that has not been subjected to the processing of steps S3004 to S3006 (YES in step S3007), an error presence determination unit 153 Shifts the message search target host to the host (step S3008). Then, the error presence / absence determination unit 153 repeats the processing of steps S3004 to S3006 for the host.

一方、指定された発生ノード１１０１のホストと論理デバイスを共用しているホストの全てについて、ステップＳ３００４〜Ｓ３００６の処理が行われた場合（ステップＳ３００７でＮＯ）、エラー有無判定部１５３は、ステップＳ３００９に進む。 On the other hand, when the processing of steps S3004 to S3006 has been performed for all of the hosts that share the logical device with the host of the specified generation node 1101 (NO in step S3007), the error presence / absence determination unit 153 determines whether or not step S3009 Proceed to

ステップＳ３００９では、エラー有無判定部１５３は、指定された発生ノード１１０１のホストと、当該ホストと論理デバイスを共用する他のホストとを合わせた、当該論理デバイス上の全ホスト数に対する、ステップＳ３００６のカウント値（エラー発生ホスト数）の割合をエラー発生割合として算出する。例えば、ステップＳ３００３で抽出されたホスト数が３、ステップＳ３００６で加算されたカウント値が３であった場合、ステップＳ３００９での算出値は、３／４≒０．８となる。上記の割合算出後、エラー有無判定部１５３は、ステップＳ３０１０に進む。ステップＳ３０１０では、エラー有無判定部１５３は、ステップＳ３００１で抽出された論理デバイスのうち、ステップＳ３００２〜Ｓ３００９の処理が行われていない論理デバイスがあるか否かを判定する。 In step S3009, the error presence / absence determination unit 153 calculates, as the error occurrence ratio, the ratio of the count value of step S3006 (the number of hosts with errors) to the total number of hosts on the logical device, that is, the host of the specified generation node 1101 together with the other hosts that share the logical device with it. For example, if the number of hosts extracted in step S3003 is 3 and the count value accumulated in step S3006 is 3, the value calculated in step S3009 is 3/4 = 0.75 ≈ 0.8. After calculating the ratio, the error presence / absence determination unit 153 proceeds to step S3010. In step S3010, the error presence / absence determination unit 153 determines whether any of the logical devices extracted in step S3001 has not yet undergone the processing of steps S3002 to S3009.

ステップＳ３００２〜Ｓ３００９の処理が行われていない論理デバイスがある場合には、エラー有無判定部１５３は、上記の処理が行われていない他の論理デバイスを処理対象にシフトし（ステップＳ３０１１）、ステップＳ３００２〜Ｓ３００９の処理を繰り返し行う。一方、ステップＳ３００２〜Ｓ３００９の処理が行われていない論理デバイスが無い場合（ステップＳ３０１０でＮＯ）、エラー有無判定部１５３は、処理を終了する。 If there is a logical device for which the processes in steps S3002 to S3009 have not been performed, the error presence / absence determination unit 153 shifts another logical device for which the above process has not been performed to a processing target (step S3011), The processes of S3002 to S3009 are repeated. On the other hand, when there is no logical device for which the processes of steps S3002 to S3009 are not performed (NO in step S3010), the error presence / absence determining unit 153 ends the process.
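
The loop of FIG. 10 can be summarised as: for each device, count the sharing hosts that logged at least one “Error” message inside the search window, then divide by the total number of hosts on that device. The sketch below reproduces the 3/4 example from the description. The data layout, the host and device names, and the function names are assumptions, and returning a ratio of 0 for a device with no other hosts is a choice of this sketch rather than something the description states.

```python
from datetime import datetime, timedelta

T0 = datetime(2011, 9, 27, 10, 0, 0)  # occurrence time of the selected error

# Hypothetical configuration data: device -> hosts that use it.
HOSTS_BY_DEVICE = {
    "lun-1": ["vm-01", "vm-02", "vm-03", "vm-04"],
    "lun-2": ["vm-01"],
}
# Hypothetical message store: (host, alert level, timestamp).
MESSAGES = [
    ("vm-02", "Error", T0 + timedelta(seconds=5)),
    ("vm-03", "Error", T0 - timedelta(seconds=3)),
    ("vm-04", "Info", T0 + timedelta(seconds=1)),
    ("vm-04", "Error", T0 + timedelta(seconds=10)),
]


def has_error(host, start, end):
    """S3004-S3005: at least one Error message from this host in the window?"""
    return any(h == host and level == "Error" and start <= ts <= end
               for h, level, ts in MESSAGES)


def error_ratio(device, origin_host, start, end):
    """S3003-S3009 for one device."""
    all_hosts = HOSTS_BY_DEVICE[device]                   # includes origin_host
    sharing = [h for h in all_hosts if h != origin_host]  # S3003
    count = sum(1 for h in sharing if has_error(h, start, end))  # S3004-S3008
    return count / len(all_hosts)                         # S3009


window = (T0 - timedelta(seconds=30), T0 + timedelta(seconds=30))
# S3010-S3011: repeat for every device used by the generation node.
ratios = {d: error_ratio(d, "vm-01", *window) for d in HOSTS_BY_DEVICE}
```

With three sharing hosts that all reported errors, `lun-1` yields 3/4 = 0.75, matching the worked example; the same function applies unchanged to physical devices.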

なお、図９に示すステップＳ３０４〜３０５の処理は、図１０に示すステップＳ３００２〜Ｓ３０１１の処理における「論理デバイス」を、「物理デバイス」に置き換えた場合と同一の処理となるので、詳細な説明は省略する。 The processing in steps S304 to S305 shown in FIG. 9 is the same as that in the case where “logical device” in the processing in steps S3002 to S3011 shown in FIG. 10 is replaced with “physical device”. Is omitted.

また、図９に示すステップＳ３０８〜Ｓ３１０の処理は、図１０に示すステップＳ３００１〜Ｓ３０１１の処理における「発生ノード１１０１のホスト」を「ハイパバイザ」に置き換えた場合と同一の処理となるので、詳細な説明は省略する。さらに、図９に示すステップＳ３１１〜Ｓ３１２の処理は、図１０に示すステップＳ３００２〜Ｓ３０１１の処理における「発生ノード１１０１のホスト」を「ハイパバイザ」に置き換え、かつ、「論理デバイス」を「物理デバイス」に置き換えた場合と同一の処理となるので、詳細な説明は省略する。 The processing of steps S308 to S310 shown in FIG. 9 is the same as the processing of steps S3001 to S3011 shown in FIG. 10 with "host of the generation node 1101" replaced by "hypervisor", so a detailed description is omitted. Similarly, the processing of steps S311 to S312 shown in FIG. 9 is the same as the processing of steps S3002 to S3011 shown in FIG. 10 with "host of the generation node 1101" replaced by "hypervisor" and "logical device" replaced by "physical device", so a detailed description is omitted.

次に、図１１（ａ）および図１１（ｂ）に示す、被疑対象絞込み部１５４、およびソート処理部１５５での処理の流れを説明する。図１１（ａ）は、被疑対象絞込み部１５４が行う処理の流れを示すフローチャートであり、図１１（ｂ）は、ソート処理部１５５が行う処理の流れを示すフローチャートである。 Next, the flow of processing in the suspicious object narrowing unit 154 and the sort processing unit 155 shown in FIGS. 11A and 11B will be described. FIG. 11A is a flowchart illustrating a flow of processing performed by the suspicious object narrowing unit 154, and FIG. 11B is a flowchart illustrating a flow of processing performed by the sort processing unit 155.

図１１（ａ）に示すように、被疑対象絞込み部１５４は、ステップＳ３００９（図１０参照）で算出された、各論理デバイスおよび各物理デバイスのそれぞれにおける、エラー発生割合を基に、エラー原因として疑われる被疑デバイスを絞り込む基準を決定する（ステップＳ４０１）。本実施の形態では、絞込み基準は、たとえば、各論理デバイスおよび各物理デバイスのそれぞれにおいて、エラー発生割合の値が高いもの上位５件を被疑デバイスとすること、または、エラー発生割合が５０%を超えるものは全て被疑対象デバイスとすることなどが考えられる。 As shown in FIG. 11A, the suspected object narrowing unit 154 determines the criterion for narrowing down the devices suspected of causing the error, based on the error occurrence ratio calculated in step S3009 (see FIG. 10) for each logical device and each physical device (step S401). In the present embodiment, possible criteria include, for example, treating the five devices with the highest error occurrence ratios among the logical devices and among the physical devices as suspected devices, or treating every device whose error occurrence ratio exceeds 50% as a suspected device.

被疑対象絞込み部１５４は、絞り込む基準を決定した後、各論理デバイスおよび各物理デバイスのそれぞれについて、エラー発生割合と絞込み基準とを照合する。そして、被疑対象絞込み部１５４は、基準を満たすデバイスを被疑対象デバイスとして抽出する（ステップＳ４０２）。その後、被疑対象絞込み部１５４は、被疑対象デバイスを利用する全ホスト名の一覧を、構成情報蓄積部１４０を検索することで取得し（ステップＳ４０３）、処理を終える。 After determining the narrowing criterion, the suspected object narrowing unit 154 checks the error occurrence ratio of each logical device and each physical device against the criterion. The suspected object narrowing unit 154 then extracts the devices that satisfy the criterion as suspected devices (step S402). After that, the suspected object narrowing unit 154 obtains a list of the names of all hosts that use the suspected devices by searching the configuration information storage unit 140 (step S403), and ends the process.
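
The two example criteria named for step S401 (the top five by error occurrence ratio, or every device above 50%) can each be written in a line or two. This is an illustrative sketch: the function names, the device names, and the sample ratios are hypothetical, and only the default values 5 and 0.5 come from the description.

```python
def narrow_top_n(ratios, n=5):
    """Criterion A: the n devices with the highest error occurrence ratio."""
    return sorted(ratios, key=ratios.get, reverse=True)[:n]


def narrow_by_threshold(ratios, threshold=0.5):
    """Criterion B: every device whose ratio exceeds the threshold."""
    return [device for device, ratio in ratios.items() if ratio > threshold]


# Hypothetical error occurrence ratios per device (output of step S3009).
sample_ratios = {"lun-1": 0.75, "disk-1": 0.5, "nic-1": 0.2}

top_two = narrow_top_n(sample_ratios, n=2)
over_half = narrow_by_threshold(sample_ratios)
```

Note that the threshold criterion uses a strict comparison, following the wording "exceeds 50%", so a device at exactly 0.5 is not selected by it.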

次に、図１１（ｂ）に示すように、ソート処理部１５５では、被疑対象絞込み部１５４で得られた、被疑対象デバイスの全ホスト名の情報を、監視端末３００を利用するオペレータに返すために、情報の整理を行う（ステップＳ５０１）。監視端末３００に返す情報は、被疑対象デバイス、被疑デバイスの全ホスト名、および、各被疑対象デバイスのエラー発生割合である。ソート処理部１５５は、これらの情報をソートする。その後ソートされた結果は、図７に示すように、結果出力部１１１へ出力される。結果出力部１１１は、これらの情報を、ネットワークを介して監視端末３００の表示制御部３１３へ出力する（ステップＳ７）。これにより、上記の情報は、表示部３１４に表示され、表示内容をオペレータが確認可能となる。 Next, as shown in FIG. 11B, the sort processing unit 155 returns the information on all the host names of the suspected device obtained by the suspected target narrowing unit 154 to the operator who uses the monitoring terminal 300. Then, the information is organized (step S501). The information returned to the monitoring terminal 300 is the suspected device, all host names of the suspected device, and the error occurrence rate of each suspected device. The sort processing unit 155 sorts these pieces of information. The sorted results are output to the result output unit 111 as shown in FIG. The result output unit 111 outputs these pieces of information to the display control unit 313 of the monitoring terminal 300 via the network (step S7). Thus, the above information is displayed on the display unit 314, and the operator can confirm the display contents.

次に、ステップＳ８〜Ｓ１２のフロー、すなわち、エラー原因として推定される被疑対象デバイスが抽出された後の処理について、図１２を用いて説明する。図１２は、ステップＳ９〜Ｓ１２について説明するためのフローチャートである。 Next, the flow of steps S8 to S12, that is, the processing performed after the suspected devices presumed to be the cause of the error have been extracted, will be described with reference to FIG. 12. FIG. 12 is a flowchart for explaining steps S9 to S12.

図７に示すように、監視端末３００の表示部３１４に表示された情報に基づいて、オペレータが、メッセージ一覧要求部３１２を操作することで、メッセージ検索要求が出された場合（ステップＳ８でＹＥＳ）、図１２に示すように、要求対象判定部１６１におけるステップＳ６０１が開始される。具体的には、要求対象判定部１６１は、メッセージ検索要求の対象がデバイス名であるか、またはホスト名であるかを判定する。メッセージ検索要求の対象の指定は、たとえば、オペレータがメッセージ一覧要求部３１２を操作することにより行われる。 As shown in FIG. 7, when a message search request is issued by the operator operating the message list request unit 312 based on the information displayed on the display unit 314 of the monitoring terminal 300 (YES in step S8). ), Step S601 in the request target determination unit 161 is started as shown in FIG. Specifically, the request target determination unit 161 determines whether the target of the message search request is a device name or a host name. The target of the message search request is specified, for example, when the operator operates the message list request unit 312.

被疑デバイスに対してメッセージ検索の要求があった場合（ステップＳ６０１で被疑デバイス）、メッセージ検索部１６２は、被疑対象デバイスを共用する全ホストについてメッセージ蓄積部１１０を検索し、該当するメッセージを抽出し、ステップＳ８０１に進む。 If there is a message search request for the suspect device (the suspect device in step S601), the message search unit 162 searches the message storage unit 110 for all hosts sharing the suspect device, and extracts the corresponding message. The process proceeds to step S801.

一方、オペレータによるメッセージ一覧要求部３１２の操作による、メッセージ検索要求の対象がホストであった場合（ステップＳ６０１でホスト）、メッセージ検索部１６２は、メッセージ蓄積部１１０を検索し、該当するメッセージを抽出し、ステップＳ８０１に進む。 On the other hand, when the target of the message search request by the operator's operation of the message list request unit 312 is a host (host in step S601), the message search unit 162 searches the message storage unit 110 and extracts the corresponding message. Then, the process proceeds to step S801.

ステップＳ８０１では、ソート処理部１５５は、ステップＳ７０１、またはＳ７０２で得られたメッセージを、それぞれ、発生ノード毎に並べる。なお、ステップＳ６０１は、図７のステップＳ９に相当し、ステップＳ７０１、Ｓ７０２は、図７のステップＳ１０に相当し、ステップＳ８０１は、図７のステップＳ１１に相当する。 In step S801, the sort processing unit 155 arranges the messages obtained in step S701 or S702 for each occurrence node. Note that step S601 corresponds to step S9 in FIG. 7, steps S701 and S702 correspond to step S10 in FIG. 7, and step S801 corresponds to step S11 in FIG.
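
The second stage (steps S601 to S801) branches on whether the request names a device or a host, collects the matching messages, and arranges them per generation node. A minimal sketch under stated assumptions follows: the data, the `kind` parameter, and the function name are hypothetical, and sorting by host name is one possible reading of "arranges the messages for each generation node".

```python
# Hypothetical configuration data: device -> hosts that use it.
HOSTS_BY_DEVICE = {"lun-1": ["vm-02", "vm-01"]}
# Hypothetical message store: (generation node, message content).
MESSAGES = [
    ("vm-02", "I/O error on lun-1"),
    ("vm-01", "disk I/O timeout"),
    ("vm-03", "login succeeded"),
    ("vm-01", "filesystem remounted read-only"),
]


def list_messages(kind, name):
    """Return the messages for a suspected device or a single host."""
    if kind == "device":                     # S601: target is a device
        hosts = set(HOSTS_BY_DEVICE[name])   # all hosts sharing the device
    elif kind == "host":                     # S601: target is a host
        hosts = {name}
    else:
        raise ValueError(f"unknown target kind: {kind!r}")
    hits = [m for m in MESSAGES if m[0] in hosts]   # message search (S10)
    # S801: group by generation node; sorted() is stable, so the stored
    # order of messages within each node is preserved.
    return sorted(hits, key=lambda m: m[0])


by_device = list_messages("device", "lun-1")
by_host = list_messages("host", "vm-03")
```

Because the sort is stable, messages from the same node keep their stored order, which is usually chronological in a log store.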

ソート処理部１５５の処理に次いで、図７に示すステップＳ１２が実行される。すなわち、ステップＳ８０１でソートされたメッセージ一覧が、結果出力部１１１へ出力され、結果出力部１１１は、メッセージ一覧を、ネットワーク４００を介して、表示制御部３１３へ出力する（ステップＳ１２）。これにより、監視端末３００を利用するオペレータは、表示制御部３１３が表示部３１４に表示するメッセージ一覧を確認することができる。 Following the processing of the sort processing unit 155, step S12 shown in FIG. 7 is executed. That is, the message list sorted in step S801 is output to the result output unit 111, and the result output unit 111 outputs the message list to the display control unit 313 via the network 400 (step S12). As a result, the operator using the monitoring terminal 300 can check the message list that the display control unit 313 displays on the display unit 314.

本発明の実施の形態におけるプログラムは、コンピュータに、図７〜図１２に示すステップＳ１〜Ｓ１２、Ｓ１０１〜Ｓ１０４、Ｓ２０１、Ｓ３０１〜Ｓ３１２、Ｓ３００１〜Ｓ３０１１、Ｓ４０１〜Ｓ４０３、Ｓ５０１、Ｓ６０１、Ｓ７０１、Ｓ７０２、およびＳ８０１を実行させるプログラムであればよい。このプログラムをコンピュータにインストールし、実行することによって、本実施の形態における障害分析装置１００を実現することができる。この場合、コンピュータのＣＰＵ（Central Processing Unit）は、メッセージ監視部１０１、結果出力部１１１、入手部１３１、データ分析部１３２、テーブル更新部１３３、発生ノード判定部１５１、メッセージ抽出範囲算出部１５２、エラー有無判定部１５３、被疑対象絞込み部１５４、ソート処理部１５５、要求対象判定部１６１、およびメッセージ検索部１６２として機能し、処理を行なう。 The program according to the embodiment of the present invention may be any program that causes a computer to execute steps S1 to S12, S101 to S104, S201, S301 to S312, S3001 to S3011, S401 to S403, S501, S601, S701, S702, and S801 shown in FIGS. 7 to 12. By installing this program on a computer and executing it, the failure analysis apparatus 100 according to the present embodiment can be realized. In this case, the CPU (Central Processing Unit) of the computer functions as the message monitoring unit 101, the result output unit 111, the acquisition unit 131, the data analysis unit 132, the table update unit 133, the generation node determination unit 151, the message extraction range calculation unit 152, the error presence / absence determination unit 153, the suspicious object narrowing unit 154, the sort processing unit 155, the request target determination unit 161, and the message search unit 162, and performs the processing.

また、本実施の形態では、メッセージ蓄積部１１０および構成情報蓄積部１４０は、コンピュータに備えられたハードディスク等の記憶装置に、これらを構成するデータファイルを格納することによって実現できる。また、メッセージ蓄積部１１０および構成情報蓄積部１４０は、別のコンピュータによって構築されてもよい。 Further, in the present embodiment, the message storage unit 110 and the configuration information storage unit 140 can be realized by storing the data files that constitute them in a storage device, such as a hard disk, provided in the computer. Alternatively, the message storage unit 110 and the configuration information storage unit 140 may be built on a separate computer.

以上のように本実施の形態によれば、エラー内容と、エラー原因との関係を予め障害分析装置１００で特定できていなくても、障害原因を分析することができ得る。具体的には、障害分析装置１００は、監視対象サーバ２００からエラーメッセージなどのメッセージを収集しており、障害分析装置１００のデバイス情報整理部１２０は、ホスト（物理ＯＳ、ハイパバイザ、仮想ＯＳ）が利用するデバイス情報を管理している。そして、障害分析装置１００は、任意に選択した１つのエラーメッセージについて、エラーメッセージを発生するホストと、デバイス情報とを関連づけることで、エラー発生ホストとデバイスを共用している複数のホストを抽出する。さらに、障害分析装置１００は、着目したエラー発生時近辺に上記デバイスを共用する複数のホストでエラーが発生しているか判定する。そして、当該複数のホストでのエラーの発生割合などに基づいて、エラー原因を分析し、被疑対象デバイスを特定する。これにより、デバイス障害観点からのエラー原因の絞込みを可能にしている。 As described above, according to the present embodiment, the cause of a failure can be analyzed even if the relationship between error content and error cause has not been identified in advance by the failure analysis apparatus 100. Specifically, the failure analysis apparatus 100 collects messages such as error messages from the monitoring target server 200, and the device information organizing unit 120 of the failure analysis apparatus 100 manages information on the devices used by the hosts (physical OS, hypervisor, virtual OS). Then, for one arbitrarily selected error message, the failure analysis apparatus 100 associates the host that generated the error message with the device information, and thereby extracts the hosts that share a device with the error-occurring host. Further, the failure analysis apparatus 100 determines whether errors have occurred, around the time of the error of interest, in the hosts sharing that device. Then, based on, for example, the proportion of those hosts in which errors have occurred, it analyzes the cause of the error and identifies the suspected device. This makes it possible to narrow down error causes from the viewpoint of device failure.

このような本実施の形態の構成により、仮想化環境のソフトウェアと物理デバイスとが複雑に構成されているシステムにおいて、エラーイベント発生原因の追及を容易にすることができる。より具体的には、ホストが利用する物理デバイス、論理デバイスの構成情報を用いて、特定のエラーイベントに含まれるホスト名から、利用しているデバイスを割り出す。そして、当該デバイスを共用する他のホストで同時期に障害が発生している割合を判定する。これにより、特定のデバイスに基づく連鎖障害であるか、または、ホスト自身で発生している障害であるか絞り込むことができる。 With such a configuration of the present embodiment, it is possible to easily pursue the cause of the occurrence of an error event in a system in which software and physical devices in a virtual environment are configured in a complex manner. More specifically, the device being used is determined from the host name included in the specific error event using the configuration information of the physical device and logical device used by the host. Then, the rate at which a failure occurs at the same time in another host sharing the device is determined. Thereby, it is possible to narrow down whether it is a chain fault based on a specific device or a fault occurring in the host itself.
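The lookup described above (host name, then devices in use, then other hosts sharing each device, then the proportion of those hosts with errors) might be sketched as follows. The `host_devices` table layout and all function names are assumptions made for illustration; the patent does not specify a data format.

```python
def devices_of(host_devices, host):
    """Devices (physical or logical) that a host uses, per the configuration table."""
    return host_devices.get(host, set())


def sharing_hosts_by_device(host_devices, error_host):
    """For each device used by the error host, list the other hosts that share it."""
    result = {}
    for device in devices_of(host_devices, error_host):
        result[device] = {
            host
            for host, devices in host_devices.items()
            if host != error_host and device in devices
        }
    return result


def error_ratio_by_device(sharing, hosts_with_errors):
    """Proportion of sharing hosts that also reported errors, per device."""
    return {
        device: (len(hosts & hosts_with_errors) / len(hosts) if hosts else 0.0)
        for device, hosts in sharing.items()
    }
```

A high ratio for one device points toward a chain failure rooted in that device, whereas low ratios across all devices suggest a failure local to the host itself.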

また、本実施の形態では、エラー有無判定部１５３は、監視端末３００で選択されたエラーを基準として、当該基準から所定の範囲内で、エラー発生ノードのホストとデバイスを共用する他のホストでエラーが発生しているか否かを判定する。これにより、監視端末３００で選択されたエラーと同時期に発生した他のエラーを特定することができるので、エラー発生原因に適した分析材料を得ることができ、より正確にエラー原因を分析することができる。 Further, in the present embodiment, the error presence / absence determination unit 153 takes the error selected on the monitoring terminal 300 as a reference and determines whether an error has occurred, within a predetermined range from that reference, in other hosts that share a device with the host of the error occurrence node. Other errors that occurred at about the same time as the error selected on the monitoring terminal 300 can thus be identified, which yields analysis material suited to the cause of the error and allows the cause to be analyzed more accurately.
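As a rough illustration of the "predetermined range" check, the sketch below treats the range as a symmetric time window in seconds around the selected error. That interpretation, the `(host, timestamp)` tuple format, and the function name are all assumptions; the text does not define how the range is expressed.

```python
def hosts_with_errors_near(error_messages, reference_time, window_seconds, candidate_hosts):
    """Return the candidate hosts that logged an error close to the reference error.

    error_messages: iterable of (host, timestamp) pairs for error messages.
    reference_time: timestamp of the error selected on the monitoring terminal.
    window_seconds: assumed symmetric search range around the reference.
    """
    earliest = reference_time - window_seconds
    latest = reference_time + window_seconds
    return {
        host
        for host, timestamp in error_messages
        if host in candidate_hosts and earliest <= timestamp <= latest
    }
```

Hosts outside `candidate_hosts` are ignored even if they logged errors inside the window, since only hosts sharing a device with the error-occurring host are relevant to the determination.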

また、本実施の形態では、エラー発生ホストが仮想ホストである場合には、エラー発生ホストのハイパバイザが利用している物理デバイスおよび論理デバイスを共用する他のホストを特定し、当該特定されたホストのエラーを抽出する。これにより、エラー発生ホストが仮想ホストである場合でも、エラーの原因をより正確に特定することができる。 Further, in the present embodiment, when the error-occurring host is a virtual host, other hosts that share the physical device and the logical device used by the hypervisor of the error-occurring host are identified, and the errors of the identified hosts are extracted. Thus, even when the error-occurring host is a virtual host, the cause of the error can be identified more accurately.
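One way to picture the virtual-host case is to resolve each virtual host to its hypervisor before comparing devices. In the sketch below, the `hypervisor_of` mapping, the `host_devices` table, and both function names are hypothetical, and counting the hypervisor itself among the "other hosts" is an interpretation rather than something the text states.

```python
def resolve_device_owner(host, hypervisor_of):
    """A virtual host is resolved to its hypervisor; other hosts resolve to themselves."""
    return hypervisor_of.get(host, host)


def other_hosts_sharing(error_host, hypervisor_of, host_devices):
    """Hosts (virtual or not) whose underlying devices overlap with the error host's."""
    owner = resolve_device_owner(error_host, hypervisor_of)
    devices = host_devices.get(owner, set())
    all_hosts = set(host_devices) | set(hypervisor_of)
    return {
        host
        for host in all_hosts
        if host != error_host
        and host_devices.get(resolve_device_owner(host, hypervisor_of), set()) & devices
    }
```

With this resolution step, an error on one virtual host leads to its sibling virtual hosts on the same hypervisor, whose messages can then be searched for errors in the same period.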

また、本実施の形態では、被疑対象絞込み部１５４は、エラー発生ホストとデバイスを共用する他のホストのうち、エラーが発生しているホストの数が所定の基準を超えている場合に、エラー発生ホストが利用しているデバイスにエラーが生じていると判定する。これにより、エラーを生じているデバイスを、より正確に分析することができる。 Further, in the present embodiment, the suspicious object narrowing unit 154 determines that an error has occurred in a device used by the error-occurring host when, among the other hosts that share a device with the error-occurring host, the number of hosts in which an error has occurred exceeds a predetermined criterion. The device in which the error has occurred can thus be analyzed more accurately.
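The narrowing decision can be illustrated with a simple count comparison. The function name and the use of a plain integer as the "predetermined criterion" are assumptions; the actual criterion could equally be a proportion or another rule.

```python
def is_device_suspected(sharing_hosts, hosts_with_errors, criterion):
    """Decide whether a shared device is suspected of having failed.

    sharing_hosts: other hosts that share the device with the error-occurring host.
    hosts_with_errors: hosts known to have logged errors in the relevant period.
    criterion: assumed integer threshold; the count must strictly exceed it.
    """
    error_count = len(set(sharing_hosts) & set(hosts_with_errors))
    return error_count > criterion
```

For example, with three sharing hosts of which two logged errors, a criterion of 1 marks the device as suspected, while a criterion of 2 does not.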

（変形例）
上記実施の形態では、構成情報蓄積部１４０が管理するデバイスとして、監視対象サーバ２００の物理ディスク、論理ディスクおよび物理ＮＩＣを例示している。そして、これらのデバイスとホスト名とを関連付けることにより、エラー原因を絞り込む構成としているが、これに限定されない。たとえば、ディスクとＮＩＣ以外にも、監視対象サーバ２００の構成情報取得部２１０がＡＰＩで提供可能な物理、論理デバイスであり、且つ複数のホストで共用する（部分的なリソース割り当てが可能な）デバイスがあれば、構成情報蓄積部１４０で管理することができる。 (Modification)
In the above embodiment, the physical disk, logical disk, and physical NIC of the monitoring target server 200 are given as examples of devices managed by the configuration information storage unit 140, and the cause of an error is narrowed down by associating these devices with host names. However, the invention is not limited to this. For example, besides disks and NICs, any physical or logical device that the configuration information acquisition unit 210 of the monitoring target server 200 can provide through an API, and that is shared by a plurality of hosts (that is, allows partial resource allocation), can be managed by the configuration information storage unit 140.

また、図８（ｂ）のステップＳ２０１において、エラーメッセージの検索範囲は、監視端末３００を操作することで設定できてもよいし、障害分析装置１００に算出範囲決定パターンを設定する装置を設けることで、適宜設定されてもよい。 Further, in step S201 of FIG. 8B, the error message search range may be set by operating the monitoring terminal 300, or may be set as appropriate by providing the failure analysis apparatus 100 with a device for setting a calculation range determination pattern.

また、図１１のステップＳ４０１において、被疑対象デバイス決定の絞り込みの基準は、監視端末３００を操作することで設定されてもよいし、障害分析装置１００に絞り込み基準パターンを設定する装置を設けることで、適宜設定されてもよい。 Further, in step S401 of FIG. 11, the narrowing criterion for determining the suspected device may be set by operating the monitoring terminal 300, or may be set as appropriate by providing the failure analysis apparatus 100 with a device for setting a narrowing criterion pattern.

上述した実施の形態の一部又は全部は、以下に記載する（付記１）〜（付記１２）によって表現することができるが、以下の記載に限定されるものではない。 Part or all of the above-described embodiments can be expressed by (Appendix 1) to (Appendix 12) described below, but is not limited to the following description.

（付記１）
仮想ホストを含む複数のホストのそれぞれが利用する物理デバイスおよび論理デバイスを特定する情報を入手する入手部と、
複数の前記ホストのうちエラーが発生したエラー発生ホストのホスト名と、前記入手部で入手した前記情報とから、前記エラー発生ホストが利用している前記物理デバイスおよび前記論理デバイスを特定し、かつ、特定された前記物理デバイスおよび前記論理デバイスを共用する他の前記ホストでエラーが発生しているか否かを判定するエラー有無判定部と、
を備えていることを特徴とする、障害分析装置。 (Appendix 1)
A failure analysis apparatus comprising:
an obtaining unit that obtains information identifying the physical devices and logical devices used by each of a plurality of hosts including a virtual host; and
an error presence/absence determination unit that identifies, from the host name of an error-occurring host in which an error has occurred among the plurality of hosts and from the information obtained by the obtaining unit, the physical device and the logical device used by the error-occurring host, and that determines whether an error has occurred in another of the hosts sharing the identified physical device and logical device.

（付記２）
前記エラー有無判定部は、前記エラー発生ホストで発生した前記エラーの発生状況に基づいて、他の前記ホストでエラーが発生しているか否かを判定する、付記１に記載の障害分析装置。 (Appendix 2)
The failure analysis device according to appendix 1, wherein the error presence / absence determination unit determines whether an error has occurred in another host based on the occurrence state of the error that has occurred in the error-occurring host.

（付記３）
前記エラー有無判定部は、前記エラー発生ホストが前記仮想ホストである場合には、前記エラー発生ホストのハイパバイザが利用している前記物理デバイスおよび前記論理デバイスを共用する他の前記ホストを特定し、当該特定されたホストのエラーを抽出する、付記１または付記２に記載の障害分析装置。 (Appendix 3)
The failure analysis apparatus according to appendix 1 or appendix 2, wherein, when the error-occurring host is the virtual host, the error presence/absence determination unit identifies other hosts that share the physical device and the logical device used by the hypervisor of the error-occurring host, and extracts errors of the identified hosts.

（付記４）
エラーが生じている前記物理デバイスおよび前記論理デバイスを絞込む被疑対象絞込み部をさらに備え、
前記被疑対象絞込み部は、前記エラー発生ホストおよび他の前記ホストのなかで、エラーが発生しているホストの数が所定の基準を超えている場合に、前記エラー発生ホストが利用している前記物理デバイスおよび前記論理デバイスの少なくとも一方に障害が生じていると判定する、付記１〜付記３のいずれかに記載の障害分析装置。 (Appendix 4)
The failure analysis apparatus according to any one of appendix 1 to appendix 3, further comprising a suspicious object narrowing unit that narrows down the physical device and the logical device in which an error has occurred,
wherein the suspicious object narrowing unit determines that a failure has occurred in at least one of the physical device and the logical device used by the error-occurring host when, among the error-occurring host and the other hosts, the number of hosts in which an error has occurred exceeds a predetermined criterion.

（付記５）
（ａ）仮想ホストを含む複数のホストのそれぞれが利用する物理デバイスおよび論理デバイスを特定する情報を入手するステップと、
（ｂ）複数の前記ホストのうちエラーが発生したエラー発生ホストのホスト名と、前記入手部で入手した前記情報とから、前記エラー発生ホストが利用している前記物理デバイスおよび前記論理デバイスを特定するステップと、
（ｃ）特定された前記物理デバイスおよび前記論理デバイスを共用する他の前記ホストでエラーが発生しているか否かを判定するステップと、
を含むことを特徴とする、障害分析方法。 (Appendix 5)
A failure analysis method comprising:
(a) obtaining information identifying the physical devices and logical devices used by each of a plurality of hosts including a virtual host;
(b) identifying, from the host name of an error-occurring host in which an error has occurred among the plurality of hosts and from the information obtained by the obtaining unit, the physical device and the logical device used by the error-occurring host; and
(c) determining whether an error has occurred in another of the hosts sharing the identified physical device and logical device.

（付記６）
前記エラーが発生しているか否かを判定するステップでは、前記エラー発生ホストで発生した前記エラーの発生状況に基づいて、他の前記ホストでエラーが発生しているか否かを判定する、付記５に記載の障害分析方法。 (Appendix 6)
The failure analysis method according to appendix 5, wherein, in the step of determining whether an error has occurred, whether an error has occurred in another of the hosts is determined based on the occurrence status of the error that occurred in the error-occurring host.

（付記７）
前記エラーが発生しているか否かを判定するステップでは、前記エラー発生ホストが前記仮想ホストである場合には、前記エラー発生ホストのハイパバイザが利用している前記物理デバイスおよび前記論理デバイスを共用する他の前記ホストを特定し、当該特定されたホストのエラーを抽出する、付記５または付記６に記載の障害分析方法。 (Appendix 7)
The failure analysis method according to appendix 5 or appendix 6, wherein, in the step of determining whether an error has occurred, when the error-occurring host is the virtual host, other hosts that share the physical device and the logical device used by the hypervisor of the error-occurring host are identified, and errors of the identified hosts are extracted.

（付記８）
エラーが生じている前記物理デバイスおよび前記論理デバイスを絞込むステップをさらに備え、
前記絞込むステップでは、前記エラー発生ホストおよび他の前記ホストのなかで、エラーが発生しているホストの数が所定の基準を超えている場合に、前記エラー発生ホストが利用している前記物理デバイスおよび前記論理デバイスの少なくとも一方に障害が生じていると判定する、付記５〜付記７のいずれかに記載の障害分析方法。 (Appendix 8)
The failure analysis method according to any one of appendix 5 to appendix 7, further comprising a step of narrowing down the physical device and the logical device in which an error has occurred,
wherein, in the narrowing step, it is determined that a failure has occurred in at least one of the physical device and the logical device used by the error-occurring host when, among the error-occurring host and the other hosts, the number of hosts in which an error has occurred exceeds a predetermined criterion.

（付記９）
仮想ホストを含む複数のホストで発生する障害をコンピュータによって分析するためのプログラムであって、前記コンピュータに、
（ａ）仮想ホストを含む複数のホストのそれぞれが利用する物理デバイスおよび論理デバイスを特定する情報を入手するステップと、
（ｂ）複数の前記ホストのうちエラーが発生したエラー発生ホストのホスト名と、前記入手部で入手した前記情報とから、前記エラー発生ホストが利用している前記物理デバイスおよび前記論理デバイスを特定するステップと、
（ｃ）特定された前記物理デバイスおよび前記論理デバイスを共用する他の前記ホストでエラーが発生しているか否かを判定するステップと、
を実行させる、プログラム。 (Appendix 9)
A program for analyzing, by a computer, failures occurring in a plurality of hosts including a virtual host, the program causing the computer to execute:
(a) obtaining information identifying the physical devices and logical devices used by each of a plurality of hosts including a virtual host;
(b) identifying, from the host name of an error-occurring host in which an error has occurred among the plurality of hosts and from the information obtained by the obtaining unit, the physical device and the logical device used by the error-occurring host; and
(c) determining whether an error has occurred in another of the hosts sharing the identified physical device and logical device.

（付記１０）
前記エラーが発生しているか否かを判定するステップでは、前記エラー発生ホストで発生した前記エラーの発生状況に基づいて、他の前記ホストでエラーが発生しているか否かを判定する、付記９に記載のプログラム。 (Appendix 10)
The program according to appendix 9, wherein, in the step of determining whether an error has occurred, whether an error has occurred in another of the hosts is determined based on the occurrence status of the error that occurred in the error-occurring host.

（付記１１）
前記エラーが発生しているか否かを判定するステップでは、前記エラー発生ホストが前記仮想ホストである場合には、前記エラー発生ホストのハイパバイザが利用している前記物理デバイスおよび前記論理デバイスを共用する他の前記ホストを特定し、当該特定されたホストのエラーを抽出する、付記９または付記１０に記載のプログラム。 (Appendix 11)
The program according to appendix 9 or appendix 10, wherein, in the step of determining whether an error has occurred, when the error-occurring host is the virtual host, other hosts that share the physical device and the logical device used by the hypervisor of the error-occurring host are identified, and errors of the identified hosts are extracted.

（付記１２）
エラーが生じている前記物理デバイスおよび前記論理デバイスを絞込むステップをさらに備え、
前記絞込むステップでは、前記エラー発生ホストおよび他の前記ホストのなかで、エラーが発生しているホストの数が所定の基準を超えている場合に、前記エラー発生ホストが利用している前記物理デバイスおよび前記論理デバイスの少なくとも一方に障害が生じていると判定する、付記９〜付記１１のいずれかに記載のプログラム。 (Appendix 12)
The program according to any one of appendix 9 to appendix 11, further causing the computer to execute a step of narrowing down the physical device and the logical device in which an error has occurred,
wherein, in the narrowing step, it is determined that a failure has occurred in at least one of the physical device and the logical device used by the error-occurring host when, among the error-occurring host and the other hosts, the number of hosts in which an error has occurred exceeds a predetermined criterion.

本発明は、仮想化技術により大量の仮想サーバを一元管理するデータセンタなどの運用管理に用いられる、障害分析装置、障害分析方法、およびプログラムに適用することができる。 The present invention can be applied to a failure analysis apparatus, a failure analysis method, and a program used for operation management of a data center or the like that centrally manages a large number of virtual servers using a virtualization technique.

１００障害分析装置
１３１入手部
１５３エラー有無判定部
１５４被疑対象絞込み部
100 Failure analysis apparatus
131 Obtaining unit
153 Error presence/absence determination unit
154 Suspicious object narrowing unit

Claims

A failure analysis apparatus comprising:
an obtaining unit that obtains information identifying the physical devices and logical devices used by each of a plurality of hosts including a virtual host; and
an error presence/absence determination unit that identifies, from the host name of an error-occurring host in which an error has occurred among the plurality of hosts and from the information obtained by the obtaining unit, the physical device and the logical device used by the error-occurring host, and that determines whether an error has occurred in another of the hosts sharing the identified physical device and logical device.

The failure analysis apparatus according to claim 1, wherein the error presence / absence determination unit determines whether an error has occurred in another host based on the occurrence state of the error that has occurred in the error-occurring host.

The failure analysis apparatus according to claim 1, wherein, when the error-occurring host is the virtual host, the error presence/absence determination unit identifies other hosts that share the physical device and the logical device used by the hypervisor of the error-occurring host, and extracts errors of the identified hosts.

The failure analysis apparatus according to claim 1, further comprising a suspicious object narrowing unit that narrows down the physical device and the logical device in which an error has occurred,
wherein the suspicious object narrowing unit determines that a failure has occurred in at least one of the physical device and the logical device used by the error-occurring host when, among the error-occurring host and the other hosts, the number of hosts in which an error has occurred exceeds a predetermined criterion.

A failure analysis method comprising:
(a) obtaining information identifying the physical devices and logical devices used by each of a plurality of hosts including a virtual host;
(b) identifying, from the host name of an error-occurring host in which an error has occurred among the plurality of hosts and from the information obtained by the obtaining unit, the physical device and the logical device used by the error-occurring host; and
(c) determining whether an error has occurred in another of the hosts sharing the identified physical device and logical device.

A program for analyzing, by a computer, failures occurring in a plurality of hosts including a virtual host, the program causing the computer to execute:
(a) obtaining information identifying the physical devices and logical devices used by each of the plurality of hosts including the virtual host;
(b) identifying, from the host name of an error-occurring host in which an error has occurred among the plurality of hosts and from the information obtained by the obtaining unit, the physical device and the logical device used by the error-occurring host; and
(c) determining whether an error has occurred in another of the hosts sharing the identified physical device and logical device.