JP5893513B2

JP5893513B2 - Monitoring device, monitoring method and monitoring program

Info

Publication number: JP5893513B2
Application number: JP2012131285A
Authority: JP
Inventors: 励野元; 宗之川谷; 竜二山田; 浩二徳丸; 西村　徹; 徹西村; 麦佐藤
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2012-06-08
Filing date: 2012-06-08
Publication date: 2016-03-23
Anticipated expiration: 2032-06-08
Also published as: JP2013254451A

Description

Embodiments of the present invention relate to a monitoring device, a monitoring method, and a monitoring program.

Server virtualization technology, typified by the cloud, is driving the consolidation of devices and growth in system scale. While IT resources are thereby used more effectively, system operation and management has become more complicated. A system maintainer must therefore select an appropriate countermeasure, from among a wide variety of countermeasures, for failures involving a broad range of hardware and software.

One example of a technique that supports the operation and management of such a system identifies the countermeasure from the content of a failure by associating failures with countermeasures in advance. This aims to make searching the manual by the maintainer unnecessary altogether. Another example automatically searches the manual using the content of the failure. This aims to reduce the maintainer's effort in searching the manual.

Jan Goyvaerts, "Regular-Expressions.info", [online], [retrieved May 30, 2012], Internet <http://www.regular-expressions.info/index.html>

However, the conventional techniques described above have the problem that, as explained below, the effort of searching the manual cannot be reduced without advance configuration.

For example, the former technique requires a great deal of labor to configure the association between failures and countermeasures in advance. That is, when a failure occurs in a monitoring target of the system, a message describing the content of the failure is generated. Besides the essence that identifies the failure, such as keywords, this message also contains information irrelevant to identifying the failure, such as the date and time. Extracting from such a message the essence to be associated with a countermeasure demands knowledge and experience of the person performing the configuration, and also requires the labor of associating an appropriate countermeasure with every anticipated message.

With the latter technique, a search does not necessarily narrow the countermeasures down to one, so when there are many search results the manual may have to be searched again. Furthermore, with the latter technique, a search returns no hit unless key information contained in the failure message, such as a keyword, is also contained in the manual, which limits the situations in which automatic search can be applied.

Embodiments of the present invention have been made in view of the above, and an object thereof is to provide a monitoring device, a monitoring method, and a monitoring program that can reduce the effort of searching the manual without advance configuration.

A monitoring device according to an embodiment includes: a manual storage unit that stores a manual in which a classification of failures of a monitoring target device connected via a network is associated with a method of handling failures that fall under the classification; a history accumulation unit that accumulates history information in which failure information on a failure of the monitoring target device is associated with the classification of the failure to which the failure information corresponds; a procedure determination unit that determines, using elements constituting the failure information included in each piece of history information accumulated in the history accumulation unit, a procedure for converting each piece of failure information into a data format expressed as a vector; a first conversion unit that converts, in accordance with the procedure determined by the procedure determination unit, the failure information included in the history information accumulated in the history accumulation unit into failure information in the vector data format; a generation unit that generates, using as learning data the failure information in the vector data format converted by the first conversion unit and the classification associated with that failure information, a determination model applied to determination processing that determines the classification of a failure from failure information input in the vector data format; a monitoring unit that monitors the state of the monitoring target device; a second conversion unit that converts, in accordance with the procedure determined by the procedure determination unit, failure information generated by the monitoring unit when a failure occurs into the vector data format; a determination unit that determines, using the determination model generated by the generation unit, the classification of the failure from the failure information converted into the vector data format by the second conversion unit; and an extraction unit that extracts a handling method from the manual that, among the manuals stored in the manual storage unit, is associated with the classification of the failure determined by the determination unit.

According to one aspect of the monitoring device according to the embodiment, the effort of searching the manual can be reduced without advance configuration.

FIG. 1 is a diagram illustrating the configuration of a monitoring system including a monitoring server according to the first embodiment.
FIG. 2 is a diagram illustrating a configuration example of failure classification information.
FIG. 3 is a diagram illustrating a configuration example of handling information.
FIG. 4 is a diagram illustrating an example of a history input screen displayed on the monitoring terminal.
FIG. 5 is a diagram illustrating an example of a method of converting failure information.
FIG. 6 is a flowchart illustrating the procedure of determination model generation processing according to the first embodiment.
FIG. 7 is a flowchart illustrating the procedure of failure monitoring processing according to the first embodiment.
FIG. 8 is a diagram illustrating that information processing by a monitoring program according to the second embodiment is concretely realized using a computer.

Hereinafter, a monitoring device, a monitoring method, and a monitoring program according to embodiments of the present invention will be described in detail with reference to the drawings. These embodiments do not limit the present invention. The embodiments can be combined as appropriate to the extent that their processing contents do not contradict each other.

[First Embodiment]
[Configuration of monitoring system 1]
FIG. 1 is a diagram illustrating the configuration of a monitoring system including a monitoring server according to the first embodiment. The monitoring system 1 shown in FIG. 1 accommodates a monitoring server 10, monitoring target devices 30A and 30B, and a monitoring terminal 50. Although the example of FIG. 1 shows two monitoring target devices and one monitoring terminal, the system is not limited to the illustrated configuration, and the monitoring system 1 can accommodate any number of monitoring target devices and monitoring terminals. Hereinafter, the monitoring target devices 30A and 30B are referred to as the "monitoring target device 30" when they are mentioned collectively without distinction.

The monitoring target device 30, the monitoring terminal 50, and the monitoring server 10 are connected so that they can communicate with one another via a network (not shown). The connection between the monitoring target device 30 and the monitoring server 10 and the connection between the monitoring terminal 50 and the monitoring server 10 may use different types of communication networks. Any type of communication network, wired or wireless, such as the Internet, a LAN (Local Area Network), or a VPN (Virtual Private Network), can be adopted as the network.

Of these, the monitoring target device 30 is a device whose resource state is monitored by the monitoring server 10. Examples of the monitoring target device 30 include server devices such as a Web server that provides a Web service and a database server equipped with a DBMS (DataBase Management System) or the like. The functions of these Web servers and database servers can be implemented on-premises or in the cloud.

The monitoring terminal 50 is a terminal device used by a maintainer of the monitoring target device 30. As the monitoring terminal 50, not only a fixed terminal such as a personal computer (PC) but also a mobile terminal such as a mobile phone, a PHS (Personal Handyphone System), or a PDA (Personal Digital Assistant) can be adopted.

[Configuration of the monitoring server 10]
The monitoring server 10 is a server device that monitors the state of the monitoring target device 30 and, when a failure occurs, notifies the monitoring terminal 50 of failure information on the failure. As shown in FIG. 1, the monitoring server 10 includes a monitoring unit 11, an output unit 12, a manual storage unit 13, a search unit 14, an acquisition unit 15a, a history accumulation unit 15b, a conversion procedure determination unit 16a, a conversion procedure storage unit 16b, a first conversion unit 17a, a model generation unit 18a, a second conversion unit 17b, a determination unit 18b, and an extraction unit 19. In addition to the functional units shown in FIG. 1, the monitoring server 10 may include various functional units of a known server device, such as input devices and audio output devices.

Of these, the monitoring unit 11 is a processing unit that monitors the state of the monitoring target device 30. Specifically, the monitoring unit 11 collects, as monitoring information, logs of applications running on the monitoring target device 30, communication logs, and the like in accordance with SNMP (Simple Network Management Protocol). The monitoring unit 11 then uses the monitoring information collected from the monitoring target device 30 to determine whether a failure has occurred in the monitoring target device 30. If a failure has occurred, the monitoring unit 11 generates failure information from the application logs and communication logs. The monitoring unit 11 then outputs the generated failure information to the output unit 12 and the second conversion unit 17b.

The output unit 12 is a processing unit that controls the output of information to the monitoring terminal 50. Specifically, when failure information is generated by the monitoring unit 11, the output unit 12 determines whether learning of the association between failure information and failure classifications has reached a certain learning level. If that level has not been reached, a correct answer is not necessarily obtained even when the determination unit 18b is made to determine the classification of the failure from the failure information using the determination model generated by the model generation unit 18a. For this reason, the output unit 12 determines whether the learning level has been reached according to, for example, whether the number of learning data samples used by the model generation unit 18a to generate the determination model, that is, the number of pieces of failure handling history information acquired from the monitoring terminal 50, is equal to or greater than a predetermined threshold. When the learning level has been reached, the output unit 12 outputs to the monitoring terminal 50 the failure information together with the failure classification and handling method output by the extraction unit 19. When the learning level has not been reached, the output unit 12 outputs only the failure information to the monitoring terminal 50. In addition, when the search unit 14 executes a manual search, the output unit 12 outputs the search result to the monitoring terminal 50.
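The output decision described above can be sketched as follows. This is only an illustration under stated assumptions: the names `Notification`, `build_notification`, and `MIN_TRAINING_SAMPLES`, as well as the threshold value of 100, are hypothetical and do not appear in the specification, which leaves the threshold unspecified.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold; the specification only says "a predetermined threshold".
MIN_TRAINING_SAMPLES = 100


@dataclass
class Notification:
    """What the output unit sends to the monitoring terminal (illustrative)."""
    failure_info: str
    classification: Optional[str] = None
    countermeasure: Optional[str] = None


def build_notification(failure_info: str,
                       num_training_samples: int,
                       classification: str,
                       countermeasure: str,
                       threshold: int = MIN_TRAINING_SAMPLES) -> Notification:
    """Attach the predicted classification and countermeasure only when
    enough history samples were used to train the determination model."""
    if num_training_samples >= threshold:
        return Notification(failure_info, classification, countermeasure)
    # Learning level not reached: send the failure information alone.
    return Notification(failure_info)
```

Using the sample count as a proxy for the learning level keeps the check cheap, at the cost of not measuring the model's actual accuracy.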

The manual storage unit 13 is a storage unit that stores a manual containing failures of the monitoring target device 30 and the countermeasures for those failures. The manual storage unit 13 stores failure classification information, in which a failure classification number, a failure classification, and a manual number are associated with one another, and handling information, in which a manual number and a handling method are associated with each other. Here, the "failure classification number" is a number that identifies the classification of a failure occurring in the monitoring target device 30, and the "manual number" is a number that identifies a manual.

FIG. 2 is a diagram illustrating a configuration example of failure classification information. FIG. 3 is a diagram illustrating a configuration example of handling information. As shown in FIG. 2, failures classified as the CPU-related error "001", the MEMORY-related error "002", and the WEB application-related error "005" all correspond to manual number "a", and as shown in FIG. 3, manual number "a" is associated with a PC reboot. This means that failures in these three classifications share a common handling method and are to be handled by restarting the computer. Likewise, as shown in FIG. 2, a failure classified as the DB-related error "003" corresponds to manual number "b", and as shown in FIG. 3, manual number "b" is associated with a DB reboot. This means that DB-related errors are to be handled by restarting the database. Further, as shown in FIG. 2, a failure classified as the HTTP-related error "004" corresponds to manual number "c", and as shown in FIG. 3, manual number "c" is associated with an NW reboot. This means that HTTP-related errors are to be handled by restarting the network.
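The two-step lookup from failure classification number to handling method can be sketched as follows. The table contents are taken from the description of FIG. 2 and FIG. 3 above; the dictionary layout and the names `FAILURE_CLASSIFICATIONS`, `COUNTERMEASURES`, and `countermeasure_for` are assumptions made for illustration.

```python
# Failure classification information (FIG. 2):
# classification number -> (classification, manual number)
FAILURE_CLASSIFICATIONS = {
    "001": ("CPU-related error", "a"),
    "002": ("MEMORY-related error", "a"),
    "003": ("DB-related error", "b"),
    "004": ("HTTP-related error", "c"),
    "005": ("WEB application-related error", "a"),
}

# Handling information (FIG. 3): manual number -> handling method
COUNTERMEASURES = {
    "a": "PC reboot",
    "b": "DB reboot",
    "c": "NW reboot",
}


def countermeasure_for(classification_number: str) -> str:
    """Resolve a classification number to its handling method via the manual number."""
    _classification, manual_number = FAILURE_CLASSIFICATIONS[classification_number]
    return COUNTERMEASURES[manual_number]
```

Keeping the two tables separate lets several classifications share one manual entry, as "001", "002", and "005" do here.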

In the example of FIG. 2 above, a failure classification number and a failure classification are stored as the failure classification information. In addition, so that keyword searches from the monitoring terminal 50 can also be supported, elements constituting failure information that was actually generated, for example failure messages or keywords extracted from failure messages, are assumed to be stored together with them. Although FIG. 2 and FIG. 3 illustrate the case where the failure classification information and the handling information contained in the manual are each configured as a separate table, they can also be configured as a single table.

The search unit 14 is a processing unit that uses the manual storage unit 13 to search for the manual corresponding to key information specified from the monitoring terminal 50. Specifically, the search unit 14 receives from the monitoring terminal 50 a search request that specifies key information such as a keyword. The search unit 14 then searches the manuals stored in the manual storage unit 13 for a manual containing a failure message, or a keyword extracted from a failure message, that partially or exactly matches the keyword specified in the search request. When the search produces a hit, the search unit 14 outputs the matching manual to the output unit 12 as the search result.
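The partial-or-exact keyword match described above can be sketched as follows. The names `ManualEntry` and `search_manuals` are hypothetical, and case-insensitive substring matching is an assumption; the specification only requires a partial or exact match.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ManualEntry:
    """A manual together with the failure messages or keywords stored for it (illustrative)."""
    manual_number: str
    countermeasure: str
    keywords: List[str] = field(default_factory=list)


def search_manuals(entries: List[ManualEntry], query: str) -> List[ManualEntry]:
    """Return the manuals whose stored messages or keywords contain the query.

    A substring test covers both partial and exact matches.
    """
    needle = query.lower()
    return [
        entry for entry in entries
        if any(needle in keyword.lower() for keyword in entry.keywords)
    ]
```

An empty result corresponds to the case where the search produces no hit and nothing is passed to the output unit.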

The acquisition unit 15a is a processing unit that acquires failure handling history information from the monitoring terminal 50. Specifically, when failure information has been notified to the monitoring terminal 50, the acquisition unit 15a determines whether a manual search has been executed. If a manual search has been executed, the acquisition unit 15a outputs to the monitoring terminal 50 a history input screen on which the manual whose countermeasure was adopted can be selected from among the manual search results. The acquisition unit 15a then acquires the history information entered via the history input screen displayed on the monitoring terminal 50, for example the classification of the failure to which the failure information corresponds, together with the failure information output by the output unit 12. On the other hand, when the countermeasure was carried out without a manual search, the failure information, failure classification, and handling method output by the output unit 12 are returned to the monitoring server 10 unchanged as history information. In this case, instead of receiving the failure information, failure classification, and handling method back from the monitoring terminal 50, the acquisition unit 15a may receive a notification that failure handling is complete and, on receiving it, acquire as history information the failure information, failure classification, and handling method that the output unit 12 output to the monitoring terminal 50. After acquiring the history information in this way, the acquisition unit 15a stores the history information, which includes the failure information and failure classification acquired from the monitoring terminal 50, in the history accumulation unit 15b.

FIG. 4 is a diagram illustrating an example of the history input screen displayed on the monitoring terminal 50. As shown in FIG. 4, the history input screen 200 displays manuals in which failures of the five classifications with failure classification numbers "001" to "005" are associated with their handling methods. For example, when the execution button 200A is pressed with the manual for the WEB application-related error of failure classification number "004" selected among the radio buttons laid out to the left of the five manuals, history information is transmitted from the monitoring terminal 50 to the monitoring server 10. This history information associates the failure information, which the output unit 12 notified separately from the history input screen 200, with the failure classification number "004" entered via the history input screen 200. The acquisition unit 15a can thereby acquire the association between failure information and a failure classification.

The history accumulation unit 15b is a storage unit that accumulates failure handling history information. As one example, each time the acquisition unit 15a acquires history information, that history information is additionally registered in the history accumulation unit 15b. As another example, the history accumulation unit 15b is referred to by the conversion procedure determination unit 16a in order to generate a determination model for determining the classification of failure information from the elements constituting it, for example a failure message.

The conversion procedure determination unit 16a is a processing unit that refers to the history accumulation unit 15b and, using the elements constituting the failure information of each piece of history information, determines a procedure for converting each piece of failure information into a data format expressed as a vector.

Specifically, the conversion procedure determination unit 16a starts processing when the number of pieces of history information newly registered since learning of the association between failure information and failure classifications was last executed reaches or exceeds a predetermined threshold. That is, when the new history information reaches the threshold, the conversion procedure determination unit 16a reads all the history information accumulated in the history accumulation unit 15b. Processing is withheld while the learning data generated from the history information differs little from that of the previous learning so that frequent learning does not increase the processing load on the monitoring server 10. Although the case where processing starts when the new history information reaches the threshold is illustrated here, processing may instead be started each time new history information is added, or may be started as a batch process.
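The threshold-based start condition can be sketched as a small counter. The class name `RetrainTrigger` and its method are hypothetical; the specification does not prescribe how the count of new history information is kept.

```python
class RetrainTrigger:
    """Counts history records added since the last learning run (illustrative)."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.new_since_last_training = 0

    def record_history(self) -> bool:
        """Register one new history record.

        Returns True when learning should be started, and resets the counter
        so that the next run waits for another `threshold` records.
        """
        self.new_since_last_training += 1
        if self.new_since_last_training >= self.threshold:
            self.new_since_last_training = 0
            return True
        return False
```

Setting `threshold` to 1 reproduces the variant in which processing starts each time new history information is added.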

Next, the conversion procedure determination unit 16a determines the number of dimensions of the vector from the elements constituting the failure information of each piece of history information, for example from the number of distinct words contained in the failure messages, and then assigns a word to each component of the vector. For example, the conversion procedure determination unit 16a finds all words contained in the failure messages, among the elements constituting each piece of failure information read earlier, by executing morphological analysis or the like, and extracts the distinct words, counting a word only once even when it appears in more than one failure message. The conversion procedure determination unit 16a then sets the total number of these distinct words as the number of dimensions of the vector. Next, it assigns the distinct words to the components of the vector in order. It then defines a procedure that derives the value of each component of the vector according to whether the word assigned to that component is contained in a failure message. Hereinafter, the procedure for converting to a data format expressed as a vector may be called the "conversion procedure", and the process of converting to the vector data format may be called "vectorization". Thereafter, the conversion procedure determination unit 16a stores the conversion procedure defined as described above in the conversion procedure storage unit 16b.

In this embodiment, the conversion procedure is defined using the failure message among the elements constituting the failure information. However, the conversion procedure may instead be defined using other elements, such as the host name of the monitoring target device 30, the monitoring method, the monitoring type, the OS, the server type, the date and time, the name of the installed system, or the monitored port number, or using a combination of the failure message and other elements. Also, although the case where the distinct words among all words contained in the failure messages are assigned to the vector components is illustrated here, not all distinct words need be assigned. For example, only the words whose frequency of appearance ranks from the top down to a predetermined rank among the words contained in the failure messages may be assigned to vector components. This excludes from the assignment words that act as noise when identifying the failure classification, such as dates and the content of a Web service, and thereby improves the accuracy of learning the association between failure messages and failure classifications.
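Restricting the vector components to the most frequent words can be sketched as follows. The function name `top_n_vocabulary` is hypothetical, whitespace splitting stands in for the morphological analysis mentioned earlier, and counting total occurrences across all messages is an assumption about how "frequency of appearance" is measured.

```python
from collections import Counter
from typing import List


def top_n_vocabulary(messages: List[str], n: int) -> List[str]:
    """Return the n most frequent words across all failure messages.

    Words outside the top n (for example one-off dates or identifiers)
    are left out of the vector components.
    """
    counts = Counter(word for message in messages for word in message.split())
    return [word for word, _count in counts.most_common(n)]
```

For instance, with the hypothetical messages "postgres error xxxx", "apache error yyyy", and "apache warning zzzz", only "error" and "apache" occur twice, so `top_n_vocabulary(messages, 2)` keeps exactly those two words.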

The conversion procedure storage unit 16b is a storage unit that stores the procedure for converting failure information into a data format expressed as a vector. As one example, when the conversion procedure determination unit 16a determines a conversion procedure, the conversion procedure storage unit 16b is updated with that conversion procedure. As another example, the conversion procedure storage unit 16b is referred to by the first conversion unit 17a when the failure information included in history information is converted into the vector data format. As a further example, the conversion procedure storage unit 16b is referred to by the second conversion unit 17b when the failure information generated by the monitoring unit 11 is converted into the vector data format.

The first conversion unit 17a is a processing unit that converts the failure information contained in the history information accumulated in the history storage unit 15b into failure information in the vector data format, in accordance with the conversion procedure stored in the conversion procedure storage unit 16b.

FIG. 5 is a diagram illustrating an example of a method of converting failure information. The upper part of FIG. 5 shows three pieces of history information, the middle part shows an example of a conversion procedure, and the lower part shows the failure information after conversion into the vector data format. For convenience of explanation, the example of FIG. 5 determines the conversion procedure using the words contained in the failure messages of three pieces of history information, but any number of pieces of history information may be used to determine the conversion procedure.

As shown in the upper part of FIG. 5, suppose that three pieces of history information, history 1 to history 3, have been read from the history storage unit 15b. All the words contained in the failure messages of the failure information in each piece of history information are searched, and the distinct words are extracted, with words that recur across failure messages counted only once. In the example of FIG. 5, of the nine words in total, "error" and "apache" each appear in two failure messages, so seven words are extracted: "postgres", "error", "xxxx", "apache", "yyyy", "warning", and "zzzz". Since the total number of distinct words is seven, the number of dimensions of the vector is determined to be 7. Next, "postgres", "error", "xxxx", "apache", "yyyy", "warning", and "zzzz" are assigned in order to components 1 to 7 of the vector. A procedure is then defined that derives the value of each component according to whether the word assigned to that component is contained in the failure message. For component 1, for example, the procedure assigns the value 1 when "postgres" is present in the failure message and the value 0 when it is absent. In this way, the conversion procedure for converting a failure message into the vector data format is defined.

With this conversion procedure defined, the failure information of history 1 to history 3 is vectorized. For history 1, the failure message constituting the failure information is "postgres error xxxx", so components 1 to 3 of the vector are given the value 1 and components 4 to 7 are given the value 0. The converted failure information for history 1 is therefore (1, 1, 1, 0, 0, 0, 0). For history 2, the failure message is "apache error yyyy", so components 2, 4, and 5 are given the value 1 and the other components are given the value 0. The converted failure information for history 2 is therefore (0, 1, 0, 1, 1, 0, 0). For history 3, the failure message is "apache warning zzzz", so components 4, 6, and 7 are given the value 1 and the other components are given the value 0. The converted failure information for history 3 is therefore (0, 0, 0, 1, 0, 1, 1). The association between the converted failure information and the failure classification numbers is used as learning data when the determination model is generated.
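The conversion procedure and the vectorization of FIG. 5 can be reproduced with a short sketch. It assumes that words are separated by whitespace and that components are assigned in order of first appearance; the function names build_vocabulary and vectorize are hypothetical and are used here only for illustration.

```python
def build_vocabulary(messages):
    """Collect the distinct words of all messages in order of first appearance.

    The length of the returned list is the number of vector dimensions.
    """
    vocabulary = []
    seen = set()
    for message in messages:
        for word in message.split():
            if word not in seen:
                seen.add(word)
                vocabulary.append(word)
    return vocabulary


def vectorize(message, vocabulary):
    """Give each component 1 if its assigned word is in the message, else 0."""
    words = set(message.split())
    return tuple(1 if word in words else 0 for word in vocabulary)


# Failure messages of history 1 to history 3 in FIG. 5.
histories = ["postgres error xxxx", "apache error yyyy", "apache warning zzzz"]
vocabulary = build_vocabulary(histories)
vectors = [vectorize(message, vocabulary) for message in histories]
```

Under these assumptions the vocabulary has seven entries, and the three vectors match the converted failure information given above for history 1 to history 3.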

The model generation unit 18a is a processing unit that generates the determination model applied to the determination process, which determines the failure classification from failure information input in the vector data format. In one aspect, the model generation unit 18a uses, as learning data, the vector-format failure information converted by the first conversion unit 17a together with the failure classification associated with that failure information, and generates a determination model to be applied to a determiner realized by any of various machine learning algorithms, such as a support vector machine or the k-nearest neighbor method.
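As one possible illustration of such a determiner, the following sketch implements the k-nearest neighbor method in plain Python over the binary vectors of FIG. 5. The use of the Hamming distance, the classification numbers 1 to 3, and the function names are assumptions made for this example; the embodiment does not specify them.

```python
from collections import Counter


def hamming_distance(a, b):
    """Count the components in which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def make_knn_determiner(learning_data, k=1):
    """Return a determiner built from (vector, classification) pairs.

    Here the "determination model" is simply the stored learning data,
    as is usual for the k-nearest neighbor method.
    """
    def determine(vector):
        nearest = sorted(
            learning_data,
            key=lambda pair: hamming_distance(pair[0], vector),
        )[:k]
        votes = Counter(classification for _, classification in nearest)
        return votes.most_common(1)[0][0]

    return determine


# Vectors from FIG. 5 paired with hypothetical classification numbers.
learning_data = [
    ((1, 1, 1, 0, 0, 0, 0), 1),
    ((0, 1, 0, 1, 1, 0, 0), 2),
    ((0, 0, 0, 1, 0, 1, 1), 3),
]
determine = make_knn_determiner(learning_data, k=1)
# A new message containing only "error" and "apache".
result = determine((0, 1, 0, 1, 0, 0, 0))
```

The new vector differs from the history 2 vector in one component and from the other two in three components, so the sketch assigns it classification 2.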

The second conversion unit 17b is a processing unit that converts failure information generated by the monitoring unit 11 when a failure occurs into failure information in the vector data format, in accordance with the conversion procedure stored in the conversion procedure storage unit 16b. The second conversion unit 17b performs the same processing as the first conversion unit 17a, except that the failure information to be converted is generated by the monitoring unit 11 rather than contained in the history information.

The determination unit 18b is a processing unit that uses the determination model generated by the model generation unit 18a to determine the failure classification from the vector-format failure information converted by the second conversion unit 17b. In one aspect, each time the model generation unit 18a generates a new determination model, the determination unit 18b re-creates the determiner, which responds to input failure information by returning the failure classification to which that failure information corresponds. Then, when vector-format failure information is input from the second conversion unit 17b, the determination unit 18b outputs the failure classification determined by the determiner to the output unit 12. If the determiner has been created so as to output, together with the failure classification, the likelihood that the failure information corresponds to that classification, the determination unit 18b may output the failure classification only when the likelihood is equal to or greater than a predetermined threshold.
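The likelihood-based gating described above might look like the following sketch. It assumes the determiner returns a likelihood for each candidate classification as a dictionary; that interface, the function name, and the example values are hypothetical.

```python
def classify_with_threshold(likelihoods, threshold):
    """Return the most likely classification, or None below the threshold.

    likelihoods: mapping from classification number to likelihood
    (assumed interface of the determiner).
    """
    best = max(likelihoods, key=likelihoods.get)
    if likelihoods[best] >= threshold:
        return best
    return None


# Hypothetical likelihoods output by a determiner for one failure.
likelihoods = {1: 0.15, 2: 0.70, 3: 0.15}
confident = classify_with_threshold(likelihoods, threshold=0.6)
uncertain = classify_with_threshold(likelihoods, threshold=0.8)
```

Returning None when no classification reaches the threshold lets the caller suppress the output rather than present a low-confidence classification.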

The extraction unit 19 is a processing unit that extracts a coping method from the manual that, among the manuals stored in the manual storage unit 13, is associated with the failure classification determined by the determination unit 18b. Specifically, when the determination unit 18b determines the failure classification, the extraction unit 19 extracts, from the failure classification information stored in the manual storage unit 13, the manual number associated with that failure classification. The extraction unit 19 then extracts, from the coping information stored in the manual storage unit 13, the coping method associated with that manual number. Finally, the extraction unit 19 outputs the failure classification and the coping method to the output unit 12.
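The two-stage lookup performed by the extraction unit 19 can be sketched with two dictionaries standing in for the failure classification information and the coping information. The table contents, the manual numbers, and the function name are hypothetical examples, not data from the embodiment.

```python
# Failure classification information: classification number -> manual number.
failure_classification_info = {1: 101, 2: 102, 3: 102}

# Coping information: manual number -> coping method.
coping_info = {
    101: "Restart the database service",
    102: "Restart the web server",
}


def extract_coping_method(classification, classification_info, coping_table):
    """Look up the manual number for a classification, then its coping method."""
    manual_number = classification_info[classification]
    return manual_number, coping_table[manual_number]


manual_number, coping_method = extract_coping_method(
    3, failure_classification_info, coping_info
)
```

Keeping the two tables separate allows different failure classifications, here 2 and 3, to share one manual number and therefore one coping method.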

Various integrated circuits and electronic circuits can be employed for the functional units shown in FIG. 1, such as the monitoring unit 11, the output unit 12, the search unit 14, the acquisition unit 15a, the conversion procedure determination unit 16a, the first conversion unit 17a, the second conversion unit 17b, the model generation unit 18a, the determination unit 18b, and the extraction unit 19. Examples of integrated circuits include an ASIC (Application Specific Integrated Circuit) and an FPGA (Field Programmable Gate Array). Examples of electronic circuits include a CPU (Central Processing Unit) and an MPU (Micro Processing Unit).

The following devices can be employed for the storage units shown in FIG. 1, such as the manual storage unit 13, the history storage unit 15b, and the conversion procedure storage unit 16b. For example, semiconductor memory elements such as RAM (Random Access Memory), ROM (Read Only Memory), and flash memory can be employed. Storage devices such as hard disks and optical disks can also be employed.

[Process flow]
Next, the flow of processing performed by the monitoring server 10 according to this embodiment is described. The description covers (1) the determination model generation process and then (2) the failure monitoring process, both of which are executed by the monitoring server 10.

(1) Determination model generation process
FIG. 6 is a flowchart illustrating the procedure of the determination model generation process according to the first embodiment. This generation process is executed repeatedly as long as the monitoring server 10 is powered on.

As shown in FIG. 6, when failure handling history information is acquired from the monitoring terminal 50 (step S101, Yes), the monitoring server 10 stores the acquired history information, which includes the failure information and the failure classification, in the history storage unit 15b (step S102).

The monitoring server 10 then repeats steps S101 and S102 until the amount of history information newly registered since the association between failure information and failure classifications was last learned reaches a predetermined threshold (step S103, No).

When the amount of history information newly registered since the association between failure information and failure classifications was last learned is equal to or greater than the predetermined threshold (step S103, Yes), the monitoring server 10 reads all the history information accumulated in the history storage unit 15b (step S104).

Next, the monitoring server 10 uses the words contained in the failure messages of the failure information in each piece of history information to determine the conversion procedure for converting each piece of failure information into the vector data format (step S105).

The monitoring server 10 then converts the failure information contained in the history information accumulated in the history storage unit 15b into failure information in the vector data format, in accordance with the conversion procedure determined in step S105 (step S106).

The monitoring server 10 then uses, as learning data, the vector-format failure information converted in step S106 together with the failure classification associated with that failure information, and generates a determination model to be applied to a determiner realized by any of various machine learning algorithms, such as a support vector machine or the k-nearest neighbor method (step S107).

After that, the monitoring server 10 saves the conversion procedure determined in step S105 in the conversion procedure storage unit 16b (step S108) and returns to step S101.

Although the flowchart of FIG. 6 illustrates the case where the conversion procedure is registered in the conversion procedure storage unit 16b in step S108, the present invention is not limited to this. The conversion procedure can be registered in the conversion procedure storage unit 16b at any time after step S105, in which the conversion procedure is determined.
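The flow of steps S101 to S108 can be summarized in the following sketch. It assumes that the threshold of step S103 is a count of newly registered pieces of history information and uses the stored learning data as a stand-in for the determination model; the class name ModelTrainer, its attributes, and the threshold value are hypothetical.

```python
class ModelTrainer:
    """Accumulate history information and retrain once enough has arrived."""

    def __init__(self, retrain_threshold):
        self.retrain_threshold = retrain_threshold
        self.history = []            # stands in for history storage unit 15b
        self.new_since_training = 0
        self.vocabulary = None       # stands in for the stored conversion procedure
        self.learning_data = None    # stand-in for the determination model

    def add_history(self, message, classification):
        """Register one piece of history information; return True if retrained."""
        self.history.append((message, classification))           # S102
        self.new_since_training += 1
        if self.new_since_training < self.retrain_threshold:     # S103, No
            return False
        # S104-S105: read all history, take distinct words in first-seen order.
        vocabulary = list(dict.fromkeys(
            word for msg, _ in self.history for word in msg.split()
        ))
        # S106: convert every stored failure message into a binary vector.
        vectors = [
            tuple(1 if word in msg.split() else 0 for word in vocabulary)
            for msg, _ in self.history
        ]
        # S107: pair the vectors with their classifications as learning data.
        self.learning_data = [
            (vector, cls) for vector, (_, cls) in zip(vectors, self.history)
        ]
        self.vocabulary = vocabulary                              # S108
        self.new_since_training = 0
        return True


trainer = ModelTrainer(retrain_threshold=2)
first = trainer.add_history("postgres error xxxx", 1)
second = trainer.add_history("apache error yyyy", 2)
```

In this sketch the first call only stores the history information, and the second call reaches the threshold and triggers steps S104 to S108 over all accumulated history.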

(2) Failure monitoring process
FIG. 7 is a flowchart illustrating the procedure of the failure monitoring process according to the first embodiment. This process starts when the occurrence of a failure in the monitoring target device 30 is detected.

As shown in FIG. 7, when failure information is generated (step S301), the monitoring server 10 converts the failure information into failure information in the vector data format, in accordance with the conversion procedure stored in the conversion procedure storage unit 16b (step S302).

Next, the monitoring server 10 uses the determination model to determine the failure classification from the vector-format failure information converted in step S302 (step S303). The monitoring server 10 then identifies, from the failure classification information stored in the manual storage unit 13, the manual number associated with that failure classification (step S304).

The monitoring server 10 then extracts, from the coping information stored in the manual storage unit 13, the coping method associated with that manual number (step S305). Finally, the monitoring server 10 outputs the failure information, the failure classification, and the coping method to the monitoring terminal 50 (step S306) and ends the process.
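Steps S301 to S306 can be strung together as in the sketch below. The determiner is passed in as a function so that any determination model can be plugged in; the stub determiner, the table contents, and all names are hypothetical and serve only to make the example self-contained.

```python
def handle_failure(message, vocabulary, determine, classification_info, coping_info):
    """Run the failure monitoring flow for one failure message."""
    words = set(message.split())
    # S302: vectorize; words outside the stored vocabulary are ignored.
    vector = tuple(1 if word in words else 0 for word in vocabulary)
    classification = determine(vector)                    # S303
    manual_number = classification_info[classification]   # S304
    coping_method = coping_info[manual_number]             # S305
    return {                                               # S306
        "failure_information": message,
        "classification": classification,
        "coping_method": coping_method,
    }


vocabulary = ["postgres", "error", "xxxx", "apache", "yyyy", "warning", "zzzz"]


def stub_determine(vector):
    """Hypothetical determiner: classification 2 if "apache" is present, else 1."""
    return 2 if vector[3] == 1 else 1


classification_info = {1: 101, 2: 102}
coping_info = {101: "Restart the database service", 102: "Restart the web server"}

report = handle_failure(
    "apache error timeout", vocabulary, stub_determine, classification_info, coping_info
)
```

The word "timeout" is not in the stored vocabulary, so it does not affect the vector; only "apache" and "error" contribute to the determination in this example.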

[Effects of Example 1]
As described above, the monitoring server 10 according to this embodiment uses the history information acquired from the monitoring terminal 50 to generate a determination model by machine learning of the association between failure information converted into the vector data format and failure classifications. When a failure occurs in the monitoring target device 30, the monitoring server 10 uses the determination model to determine the failure classification from the failure information converted into the vector data format, and outputs the coping method corresponding to that failure classification to the monitoring terminal 50.

Consequently, the monitoring server 10 according to this embodiment can use machine learning to extract a coping method for a failure from an element of the failure information, such as the failure message, without the laborious advance setup of extracting from the failure information an essence such as keywords that can identify the failure and associating it with coping methods. Furthermore, because the monitoring server 10 extracts the coping method associated with the failure classification obtained by machine learning, the manuals presented to the maintainer are narrowed down to one. As a result, unlike the conventional techniques described above, the maintainer does not have to search the manuals again when a search returns many results.

Therefore, the monitoring server 10 according to this embodiment can reduce the effort of manual searching without advance setup. In addition, unlike the conventional techniques described above, the monitoring server 10 does not need to perform a search using key information such as keywords. Even when the keywords contained in the elements of the failure information do not appear in the coping method, the coping method can be extracted through vectorization of the failure information and determination by machine learning, so the range of application is wider than when coping methods are presented by automatic search.

The monitoring server 10 according to this embodiment also determines the number of dimensions of the vector from the number of distinct words contained in the failure messages, which are among the elements of the failure information contained in each piece of history information, assigns a word to each component of the vector, and defines a procedure that derives the value of each component according to whether the word assigned to that component is contained in the failure message. The failure information can therefore be vectorized while retaining an essence such as keywords that can identify the failure. As a result, the failure information can be categorized into the appropriate failure classification, which improves the accuracy of determination by machine learning.

[Second Embodiment]
An embodiment of the present invention has been described above, but the present invention may be implemented in various forms other than the embodiment described above. Other embodiments included in the present invention are described below.

[Adding manuals]
Although not specifically described in the first embodiment, the manuals stored in the manual storage unit 13 can be added, updated, or deleted as desired. For example, a coping method for the failure may be acquired as part of the history information, in addition to the failure information and the failure classification. When the coping method contained in the acquired history information is not registered in the manual storage unit 13, a new manual number is generated. The new manual number and the failure classification number contained in the history information are then added to the failure classification information stored in the manual storage unit 13, and the new manual number and the coping method contained in the history information are added to the coping information stored in the manual storage unit 13. This automates the addition of manuals, so the manual storage unit 13 does not have to be maintained by manual configuration.
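The automatic addition of a manual might be sketched as follows. The sketch assumes integer manual numbers generated by incrementing the largest existing number, and it leaves both tables unchanged when the coping method is already registered; neither the numbering scheme nor the handling of the already-registered case is specified in the embodiment, and the function name is hypothetical.

```python
def register_coping_method(classification_number, coping_method,
                           classification_info, coping_info):
    """Add a new manual if the coping method is not yet registered.

    Returns the manual number associated with the coping method.
    """
    for manual_number, registered_method in coping_info.items():
        if registered_method == coping_method:
            # Already registered: assumed here to require no update.
            return manual_number
    # Assumed numbering scheme: one more than the largest existing number.
    new_manual_number = max(coping_info, default=0) + 1
    classification_info[classification_number] = new_manual_number
    coping_info[new_manual_number] = coping_method
    return new_manual_number


# Hypothetical contents of the manual storage unit.
classification_info = {1: 1}
coping_info = {1: "Restart the database service"}

added = register_coping_method(
    2, "Restart the web server", classification_info, coping_info
)
repeated = register_coping_method(
    2, "Restart the web server", classification_info, coping_info
)
```

The first call generates manual number 2 and updates both tables; the second call finds the coping method already registered and returns the same manual number without adding a duplicate.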

[Distribution and integration]
The components of each illustrated device do not necessarily have to be physically configured as illustrated. That is, the specific form of distribution and integration of the devices is not limited to the illustrated one, and all or part of them may be functionally or physically distributed or integrated in any units according to various loads, usage conditions, and the like. For example, some of the functional units among the monitoring unit 11, the output unit 12, the search unit 14, the acquisition unit 15a, the conversion procedure determination unit 16a, the first conversion unit 17a, the second conversion unit 17b, the model generation unit 18a, the determination unit 18b, and the extraction unit 19 may be connected via a network as devices external to the monitoring server 10. Alternatively, separate devices may each have one of the monitoring unit 11, the output unit 12, the search unit 14, the acquisition unit 15a, the conversion procedure determination unit 16a, the first conversion unit 17a, the second conversion unit 17b, the model generation unit 18a, the determination unit 18b, and the extraction unit 19, and these devices may be connected over a network and cooperate to realize the functions of the monitoring server 10 described above. The first conversion unit 17a and the second conversion unit 17b shown in FIG. 1 may also be integrated into a single functional unit as a conversion unit.

[Monitoring program]
FIG. 8 is a diagram showing that the information processing by the monitoring program according to the second embodiment is concretely realized using a computer. As illustrated in FIG. 8, the computer has, for example, a memory, a CPU, a hard disk drive interface, a disk drive interface, a serial port interface, a video adapter, and a network interface, and these components are connected by a bus.

The memory includes a ROM and a RAM, as illustrated in FIG. 8. The ROM stores, for example, a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface is connected to a hard disk drive, as illustrated in FIG. 8. The disk drive interface is connected to a disk drive, as illustrated in FIG. 8. A removable storage medium such as a magnetic disk or an optical disk is inserted into the disk drive. The serial port interface is connected to, for example, a mouse and a keyboard, as illustrated in FIG. 8. The video adapter is connected to, for example, a display, as illustrated in FIG. 8.

As illustrated in FIG. 8, the hard disk drive stores, for example, an OS (Operating System), application programs, program modules, and program data. That is, the monitoring program according to the second embodiment is stored, for example in the hard disk drive, as a program module describing instructions to be executed by the computer. Specifically, a program module describing a monitoring procedure that executes the same information processing as the functional units of the monitoring server described in the above embodiment is stored in the hard disk drive. Data used for information processing by the monitoring program, such as the data stored in the storage units described in the above embodiment, is stored as program data, for example in the hard disk drive. The CPU reads the program modules and program data stored in the hard disk drive into the RAM as needed and executes the monitoring procedure.

The program modules and program data of the monitoring program are not limited to being stored in the hard disk drive. They may, for example, be stored in a removable storage medium and read by the CPU via the disk drive or the like. Alternatively, they may be stored in another computer connected via a LAN, a WAN (Wide Area Network), or the like, and read by the CPU via the network interface.

[Description of symbols]
1 Monitoring system
10 Monitoring server
11 Monitoring unit
12 Output unit
13 Manual storage unit
14 Search unit
15a Acquisition unit
15b History storage unit
16a Conversion procedure determination unit
16b Conversion procedure storage unit
17a First conversion unit
17b Second conversion unit
18a Model generation unit
18b Determination unit
19 Extraction unit
30A, 30B Monitoring target device
50 Monitoring terminal

Claims

A monitoring apparatus comprising:
a manual storage unit that stores manuals in which a failure classification of a monitoring target device connected via a network is associated with a coping method for a failure corresponding to that classification, the manuals including a combination in which the same coping method is associated with different failure classifications;
a history storage unit that stores history information in which failure information related to a failure of the monitoring target device is associated with the classification of the failure to which the failure information corresponds;
a determining unit that extracts every type of word contained in the failure messages constituting the failure information included in each piece of history information stored in the history storage unit, determines the total number of types as the number of dimensions of a vector, assigns each type of word to a component of the vector, and determines, as the procedure for converting each piece of failure information into a data format represented by a vector, a procedure of deriving the value of each component of the vector according to whether the failure information includes the word assigned to that component;
a first conversion unit that converts the failure information included in the history information stored in the history storage unit into failure information in the vector data format, in accordance with the procedure determined by the determining unit;
a generation unit that, using as learning data the failure information in the vector data format converted by the first conversion unit and the classification associated with that failure information, rather than the failure information and the coping method, generates a determination model to be applied to a determination process that determines the failure classification from failure information input in the vector data format;
a monitoring unit that monitors the state of the monitoring target device;
a second conversion unit that converts failure information generated by the monitoring unit upon occurrence of a failure into the vector data format, in accordance with the procedure determined by the determining unit;
a determination unit that uses the determination model generated by the generation unit to determine the failure classification from the failure information converted into the vector data format by the second conversion unit; and
an extraction unit that extracts a coping method from the manual that, among the manuals stored in the manual storage unit, is associated with the failure classification determined by the determination unit.

The monitoring apparatus according to claim 1, wherein
The manual storage unit stores a manual including failure classification information, in which manual identification information identifying the manual is associated with failure classification identification information identifying the failure classification, and handling information, in which the manual identification information is associated with the coping method, and
The extraction unit refers to the failure classification information stored in the manual storage unit to specify the manual identification information corresponding to the failure classification identification information indicating the failure classification determined by the determination unit, and then refers to the handling information stored in the manual storage unit to extract the coping method corresponding to that manual identification information.
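As a non-limiting illustration, the two-stage lookup recited above might be sketched with two mappings. The identifiers (`C001`, `M01`, and so on) and the coping-method texts are hypothetical, and plain dictionaries are only an assumed representation of the stored information.

```python
# Failure classification information:
# failure classification identification information -> manual identification information.
failure_classification_info = {
    "C001": "M01",
    "C002": "M01",  # a different classification may point at the same manual
    "C003": "M02",
}

# Handling information: manual identification information -> coping method.
handling_info = {
    "M01": "Replace the failed disk and rebuild the array.",
    "M02": "Check the cable and restart the interface.",
}


def extract_coping_method(classification_id):
    """Resolve classification -> manual ID, then manual ID -> coping method."""
    manual_id = failure_classification_info[classification_id]
    return handling_info[manual_id]


print(extract_coping_method("C002"))
```

Keeping the manual identification information as an intermediate key lets several failure classifications share one coping method without duplicating its text.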

The monitoring apparatus according to claim 2, further comprising:
An acquisition unit that acquires history information including the failure information, the failure classification identification information, and a coping method for the failure; and
An addition unit that, when the coping method included in the history information acquired by the acquisition unit is not registered in the manual storage unit, generates new manual identification information, adds the new manual identification information and the failure classification identification information included in the history information to the failure classification information stored in the manual storage unit, and adds the new manual identification information and the coping method included in the history information to the handling information stored in the manual storage unit.
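As a non-limiting illustration, the registration of a previously unknown coping method might be sketched as follows. The record layout, the `M01`-style identifier scheme, and the choice to return `None` when the coping method is already registered are assumptions made for this sketch; the claim only recites the case in which the coping method is not yet registered.

```python
def new_manual_id(handling_info):
    """Generate an unused manual ID (hypothetical 'M' + two-digit scheme)."""
    n = len(handling_info) + 1
    while f"M{n:02d}" in handling_info:
        n += 1
    return f"M{n:02d}"


def add_if_unregistered(record, failure_classification_info, handling_info):
    """Register the record's coping method if the manual storage lacks it.

    Returns the new manual ID, or None when nothing was added.
    """
    coping_method = record["coping_method"]
    if coping_method in handling_info.values():
        return None  # assumed behavior: already registered, leave storage unchanged
    manual_id = new_manual_id(handling_info)
    # Add (classification ID, new manual ID) to the failure classification information.
    failure_classification_info[record["classification_id"]] = manual_id
    # Add (new manual ID, coping method) to the handling information.
    handling_info[manual_id] = coping_method
    return manual_id


failure_classification_info = {"C001": "M01"}
handling_info = {"M01": "Restart the service."}

record = {
    "failure_information": ["power supply unit 2 failed"],
    "classification_id": "C002",
    "coping_method": "Replace the power supply unit.",
}
first = add_if_unregistered(record, failure_classification_info, handling_info)
second = add_if_unregistered(record, failure_classification_info, handling_info)
print(first, second)
```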

A monitoring method executed by a monitoring device, the method comprising:
A decision step of referring to a history storage unit that stores history information in which failure information related to a failure of a monitoring target device connected via a network is associated with the classification of the failure corresponding to that failure information, extracting every distinct word contained in the failure messages constituting the failure information included in each piece of history information, setting the total number of distinct words as the number of vector dimensions, assigning each distinct word to one component of the vector, and deciding, as the procedure for converting each piece of failure information into a data format of vector representation, a procedure that derives the value of each component of the vector according to whether the failure information includes the word assigned to that component;
A first conversion step of converting the failure information included in the history information stored in the history storage unit into failure information in the data format of vector representation according to the procedure decided in the decision step;
A generation step of generating a determination model, to be applied to determination processing that determines the classification of a failure from failure information input in the data format of vector representation, using as learning data not the failure information and the coping method but the failure information converted into the data format of vector representation in the first conversion step and the classification associated with that failure information;
A monitoring step of monitoring a state of the monitoring target device;
A second conversion step of converting failure information generated in the monitoring step upon occurrence of a failure into the data format of vector representation according to the procedure decided in the decision step;
A determination step of determining the classification of the failure from the failure information converted into the data format of vector representation in the second conversion step, using the determination model generated in the generation step; and
An extraction step of referring to a manual storage unit that stores manuals in which each failure classification of the monitoring target device is associated with a coping method for the failure corresponding to that classification, the manuals including combinations in which the same coping method is associated with different failure classifications, and extracting a coping method from the manual associated with the failure classification determined in the determination step.

A monitoring program for causing a computer to execute:
A decision step of referring to a history storage unit that stores history information in which failure information related to a failure of a monitoring target device connected via a network is associated with the classification of the failure corresponding to that failure information, extracting every distinct word contained in the failure messages constituting the failure information included in each piece of history information, setting the total number of distinct words as the number of vector dimensions, assigning each distinct word to one component of the vector, and deciding, as the procedure for converting each piece of failure information into a data format of vector representation, a procedure that derives the value of each component of the vector according to whether the failure information includes the word assigned to that component;
A first conversion step of converting the failure information included in the history information stored in the history storage unit into failure information in the data format of vector representation according to the procedure decided in the decision step;
A generation step of generating a determination model, to be applied to determination processing that determines the classification of a failure from failure information input in the data format of vector representation, using as learning data not the failure information and the coping method but the failure information converted into the data format of vector representation in the first conversion step and the classification associated with that failure information;
A monitoring step of monitoring a state of the monitoring target device;
A second conversion step of converting failure information generated in the monitoring step upon occurrence of a failure into the data format of vector representation according to the procedure decided in the decision step;
A determination step of determining the classification of the failure from the failure information converted into the data format of vector representation in the second conversion step, using the determination model generated in the generation step; and
An extraction step of referring to a manual storage unit that stores manuals in which each failure classification of the monitoring target device is associated with a coping method for the failure corresponding to that classification, the manuals including combinations in which the same coping method is associated with different failure classifications, and extracting a coping method from the manual associated with the failure classification determined in the determination step.