JP4896573B2

JP4896573B2 - Fault monitoring system and method, and program

Info

Publication number: JP4896573B2
Application number: JP2006117234A
Authority: JP
Inventors: 弘二福井; 真司林; 敏鵜飼
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2006-04-20
Filing date: 2006-04-20
Publication date: 2012-03-14
Anticipated expiration: 2026-04-20
Also published as: JP2007293393A

Description

The present invention relates to a fault monitoring technique for monitoring the occurrence of failures in a monitored computer system composed of a plurality of monitored computers connected via a network.

With recent advances in network technology, computer systems composed of a plurality of computers connected via a network have come into use in various fields such as plant monitoring. Because the functions of such a system are shared among a plurality of computers, a failure in even one computer can affect the entire computer system. For this reason, various fault monitoring techniques have been developed to monitor computer systems for the occurrence of failures.

One conventional method of monitoring a computer system detects an anomaly by simultaneously collecting and comparing logs from a plurality of computers, and then collects more detailed logs to verify the cause of the anomaly (see, for example, Patent Document 1). Another method receives the logs of a monitored computer, identifies the logs needed for failure analysis based on those logs and on information about events that indicate a failure or performance degradation in the monitored target, and then performs the failure analysis (see, for example, Patent Document 2).

Patent Document 1: JP 11-143738 A
Patent Document 2: JP 2004-178336 A

The conventional fault monitoring techniques described above use the logs of the monitored computers to verify the cause of a failure once its occurrence in the computer system has been detected. However, these conventional techniques have the following problems.

First, merely verifying the cause once a failure has been detected may not be enough to identify the cause from the failure that has surfaced. Among failures that surface in computers, there is a growing number of cases in which a problem has secondary or tertiary effects, so that a function unrelated to the original problem is affected and the failure surfaces there. Among such problems with secondary or tertiary effects, those whose influence builds up gradually over time are especially hard to trace from the content of the failure that eventually occurs.

In addition, recent computers are composed of many hardware and software components. These components have become increasingly general-purpose, so that products from many manufacturers can be used in combination. For so many hardware and software components, it is difficult to inspect and verify in advance whether each contains unknown problems and, if it does, whether those problems affect the computer system.

Furthermore, over the long-term operation of a computer system, an identical part is not always available when failed hardware is replaced, and a successor model or another product may have to be used. In such cases as well, it is difficult to inspect and verify all of the above matters in advance.

Therefore, there is a need not only to verify the cause when a failure is detected, but also to predict future failures. However, predicting future failures by applying the conventional techniques described above, which verify the cause using the logs of the monitored computers, would require collecting all logs of the monitored computers, and this would place an excessive load on the network.

The present invention has been proposed to solve the above problems of the prior art. Its object is to provide a fault monitoring system, method, and program capable of predicting the occurrence of failures, in addition to estimating the cause when a failure occurs in the monitored computer system, without imposing an excessive load on the network.

To achieve the above object, in the present invention the information collection device edits changes in computer configuration or state, obtained from the collected log information, into time-series data, and replaces identical event content data in the log information with abbreviated data. This limits the amount of data communicated while still allowing the state monitoring/analysis server to compare the analysis information of new and old time-series data, including configuration and state change information, and thereby predict the occurrence of failures.

The fault monitoring system of the present invention monitors the occurrence of failures in a monitored computer system composed of a plurality of monitored computers connected via a network. It is characterized by comprising an information collection device that collects, edits, and stores the log information held by each monitored computer, and a state monitoring/analysis server that analyzes the log information stored by the information collection device to obtain analysis information and, based on the log information and the analysis information, estimates the cause when a failure occurs and predicts the occurrence of failures. The information collection device and the state monitoring/analysis server further have the following features.

The information collection device has data communication means, data editing means, and log storage means. The data communication means receives log information from each monitored computer and exchanges data, including log information, with the state monitoring/analysis server. The data editing means compares the configuration information or state information of the monitored computer contained in the received log information with the corresponding past information; when the configuration or state has changed, it treats the change as the occurrence of an event, adds a time, and edits it into time-series data. In addition, when the received log information contains identical event content data, it replaces that event content data with abbreviated data that represents it. The log storage means stores the edited log information.
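The change detection performed by the data editing means can be illustrated with a minimal sketch. It assumes that configuration or state information arrives as a flat dictionary snapshot; the function name `detect_changes` and the record fields `time`, `item`, `old`, and `new` are hypothetical and do not appear in the specification.

```python
from datetime import datetime, timezone


def detect_changes(previous, current, observed_at):
    """Compare two snapshots and return one time-stamped event per changed item.

    The snapshot layout (a flat dict) and the event fields are assumptions
    made for illustration only.
    """
    events = []
    for key in sorted(set(previous) | set(current)):
        old, new = previous.get(key), current.get(key)
        if old != new:
            # A configuration or state change is handled as an event occurrence.
            events.append(
                {"time": observed_at.isoformat(), "item": key, "old": old, "new": new}
            )
    return events


previous = {"nic": "VendorA-100", "driver": "1.0"}
current = {"nic": "VendorB-200", "driver": "1.0"}
events = detect_changes(
    previous, current, datetime(2006, 4, 20, 12, 0, tzinfo=timezone.utc)
)
```

In this sketch only the replaced network interface produces an event, so unchanged items add nothing to the time-series data.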

The state monitoring/analysis server has data communication means, data storage means, data classification means, log analysis means, failure cause estimation means, and failure occurrence prediction means. The data communication means exchanges data, including log information, with the information collection device. The data storage means stores the received log information and the various data obtained within the server. When the received log information contains abbreviated data, the data classification means restores the abbreviated data to the original event content data; it then classifies the information in the log information that indicates the occurrence of a failure as failure information and stores it in the data storage means. The log analysis means uses the received log information to analyze the occurrence of log entries along the time axis, obtains analysis information, and stores the obtained analysis information in the data storage means.

For a failure indicated by the failure information, the failure cause estimation means detects candidate causes from the failure information recorded when similar failures occurred in the past, calculates for each candidate a confidence that it is the cause, and stores the cause estimation result obtained from that calculation in the data storage means as part of the failure information indicating the occurrence of the failure. The failure occurrence prediction means determines, by comparing data along the time axis, whether past analysis information is similar to the analysis information for a given period. When it determines that they are similar, it uses the failure information corresponding to the past analysis information to calculate a confidence that the failure indicated by that failure information will occur, and stores the prediction result obtained from that calculation in the data storage means as prediction information.
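The passage above does not specify how the confidence of a cause candidate is calculated. As one possible illustration only, the sketch below assumes that "similar" past failures are those with the same symptom label and that the confidence is the relative frequency of each recorded cause; the function `estimate_causes` and the sample history are hypothetical.

```python
from collections import Counter


def estimate_causes(symptom, history):
    """Rank candidate causes for a symptom by their frequency in past failures.

    `history` is a list of (symptom, confirmed_cause) pairs. Matching on an
    identical symptom label and using relative frequency as the confidence
    are assumptions, not the method defined in the specification.
    """
    causes = Counter(cause for past, cause in history if past == symptom)
    total = sum(causes.values())
    return [(cause, count / total) for cause, count in causes.most_common()]


history = [
    ("io timeout", "disk wear"),
    ("io timeout", "disk wear"),
    ("io timeout", "driver bug"),
    ("link down", "cable"),
]
candidates = estimate_causes("io timeout", history)
```

With this sample history, "disk wear" is ranked first because it accounts for two of the three similar past failures. A symptom with no similar past failure yields an empty candidate list.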

The fault monitoring method and the fault monitoring program of the present invention express the features of the above system from the viewpoint of a method and of a computer program, respectively.

According to the present invention having the above features, the information collection device handles not only the original time-series data indicating the occurrence of various events. When it compares the configuration information or state information of a monitored computer in the received log information with past information and detects a change, it can also treat that change as the occurrence of an event, add a time, and edit it into time-series data indicating the occurrence of a change event.

Consequently, in addition to the original time-series data indicating the occurrence of various events, the state monitoring/analysis server can use time-series data in which changes to the configuration or state of the monitored computers have been edited as event occurrences. It can therefore analyze all time-series data, including the time-series data representing computer configuration and state, and compare new analysis information with old. By comparing new and old analysis information that includes configuration and state information, it can accurately detect past analysis information with a similar configuration and state, and can use the failure information corresponding to that analysis information to predict the occurrence of failures accurately.

In addition, when the collected log information contains identical event content data, the information collection device can replace that event content data with abbreviated data. This reduces the amount of data transmitted from the information collection device to the state monitoring/analysis server, so no excessive load is placed on the network.
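The replacement of identical event content data with abbreviated data, and its later restoration on the server side, can be sketched as a simple round trip. The encoding below (an integer token assigned on first occurrence, with the full content sent only once) is an assumption chosen for illustration; the specification does not define the format of the abbreviated data at this point.

```python
def abbreviate(records):
    """Replace repeated event content with a short token.

    Each output entry is (time, token, content); `content` is None when the
    same content has already been sent. The token scheme is hypothetical.
    """
    tokens = {}
    encoded = []
    for time, content in records:
        if content in tokens:
            encoded.append((time, tokens[content], None))  # repeat: token only
        else:
            tokens[content] = len(tokens)
            encoded.append((time, tokens[content], content))  # first occurrence
    return encoded


def restore(encoded):
    """Rebuild the original (time, content) records from the encoded form."""
    contents = {}
    records = []
    for time, token, content in encoded:
        if content is not None:
            contents[token] = content
        records.append((time, contents[token]))
    return records


records = [
    (1, "disk read retry on sda"),
    (2, "link down"),
    (3, "disk read retry on sda"),
]
encoded = abbreviate(records)
```

The third record carries only its time and token, and `restore(encoded)` reproduces the original records, which corresponds to the restoration step performed before classification on the server.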

According to the present invention, it is possible to provide a fault monitoring system, method, and program capable of predicting the occurrence of failures, in addition to estimating the cause when a failure occurs in the monitored computer system, without imposing an excessive load on the network.

[First Embodiment]
[Configuration of the fault monitoring system]
FIG. 1 is a block diagram showing the configuration of a fault monitoring system according to a first embodiment to which the present invention is applied.

As shown in FIG. 1, the fault monitoring system according to this embodiment is configured by connecting an information collection device 101 and a state monitoring/analysis server 102 via the Internet 100. The information collection device 101 is connected via a network 203 to each of the plurality of monitored computers 202 that constitute the monitored computer system 201. A monitoring terminal 104 in the same facility is connected to the state monitoring/analysis server 102 via a monitoring intranet 103, and a remote or mobile monitoring terminal 105 is connected to it via the Internet 100.

Although FIG. 1 shows three monitored computers 202 constituting the monitored computer system 201, the number of computers constituting a monitored computer system 201 to which the present invention applies may be four or more, or two or fewer, and the system may also be connected via a plurality of networks.

The Web site 106 connected to the Internet 100 in FIG. 1 is shown as one example of a Web site that provides information about the monitored computers 202 on its Web page 161.

The information collection device 101 collects, edits, and stores the log information 301 held by each monitored computer 202 constituting the monitored computer system 201. It has data communication means 111, data editing means 112, and log storage means 113.

The data communication means 111 receives the log information 301 from each monitored computer 202 and exchanges data, including log information, with the state monitoring/analysis server 102. The data editing means 112 compares the configuration information or state information of the monitored computer 202 contained in the received log information with the corresponding past information; when the configuration or state has changed, it treats the change as the occurrence of an event, adds a time, and edits it into time-series data. In addition, when the received log information 301 contains identical event content data, it replaces that event content data with abbreviated data that represents it. The log storage means 113 stores the edited log information 311.

The state monitoring/analysis server 102 analyzes the log information 311 stored by the information collection device 101 to obtain analysis information and, based on the log information and the analysis information, estimates the cause when a failure occurs and predicts the occurrence of failures. It has data communication means 121, data storage means 122, data classification means 123, log analysis means 124, failure cause estimation means 125, and failure occurrence prediction means 126.

The data communication means 121 exchanges data, including the log information 311, with the information collection device 101. The data storage means 122 stores the various data obtained within the state monitoring/analysis server 102, such as normal log information 321, failure information 331, analysis information 341, and prediction information 351. When the received log information 311 contains abbreviated data, the data classification means 123 restores the abbreviated data to the original event content data; it then classifies the log information 311 into normal log information 321 and failure information 331, which indicates the occurrence of a failure, and stores both in the data storage means 122.

In the following description, "log information 301" means log information before it is edited in the information collection device 101, and "log information 311" means log information after it has been edited in the information collection device 101, including time-series data indicating changes in configuration or state. Because "normal log information 321" and "failure information 331" are obtained by classifying "log information 311", the bare term "log information" means "log information 311", including both "normal log information 321" and "failure information 331".

The log analysis means 124 uses the received log information 311 to analyze the occurrence of log entries along the time axis, obtains analysis information 341, and stores the obtained analysis information 341 in the data storage means 122. For a failure indicated by the failure information 331, the failure cause estimation means 125 detects candidate causes from the failure information 331 recorded when similar failures occurred in the past, calculates for each candidate a confidence that it is the cause, and stores the cause estimation result obtained from that calculation in the data storage means 122 as part of the failure information 331 indicating the occurrence of the failure.
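As an illustration of analyzing log occurrence along the time axis, the sketch below counts events per hour and per event type. The one-hour bucket, the `(timestamp, event)` record shape, and the function name `count_per_hour` are assumptions; the passage above does not specify the analysis performed by the log analysis means 124 at this level of detail.

```python
from collections import Counter
from datetime import datetime


def count_per_hour(records):
    """Count occurrences of each event type in one-hour buckets.

    `records` is an iterable of (timestamp, event) pairs. The bucket width
    is an assumption made for illustration.
    """
    counts = Counter()
    for timestamp, event in records:
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        counts[(hour, event)] += 1
    return counts


records = [
    (datetime(2006, 4, 20, 9, 5), "retry"),
    (datetime(2006, 4, 20, 9, 40), "retry"),
    (datetime(2006, 4, 20, 10, 1), "retry"),
]
analysis = count_per_hour(records)
```

A per-bucket count of this kind gives a series that can be stored and later compared with series from other periods.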

The failure occurrence prediction means 126 determines, by comparing data along the time axis, whether past analysis information 341 is similar to the analysis information 341 for a given period. When it determines that they are similar, it uses the failure information 331 corresponding to the past analysis information 341 to calculate a confidence that the failure indicated by that failure information 331 will occur, and stores the prediction result obtained from that calculation in the data storage means 122 as prediction information 351.
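The similarity judgment and confidence calculation are not defined in the passage above. The following sketch substitutes a common, simple choice for illustration: cosine similarity between equal-length count series, with the similarity score itself used as the confidence. The threshold value, the function names, and the sample series are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric series (0.0 if either is all zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def predict(recent, past_windows, threshold=0.9):
    """Return (failure, confidence) pairs for past windows similar to `recent`.

    `past_windows` is a list of (series, failure_that_followed) pairs.
    Using the similarity score as the confidence is an assumption.
    """
    predictions = []
    for series, failure in past_windows:
        score = cosine_similarity(recent, series)
        if score >= threshold:
            predictions.append((failure, score))
    return sorted(predictions, key=lambda p: p[1], reverse=True)


recent = [1, 2, 4, 8]  # e.g. retry counts per hour in the period being checked
past_windows = [
    ([1, 2, 4, 9], "disk failure"),
    ([5, 5, 5, 5], "memory exhaustion"),
]
predictions = predict(recent, past_windows)
```

Here only the first past window resembles the recent series closely enough to pass the threshold, so only "disk failure" is reported as a predicted failure.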

FIG. 1 shows, as an example, a single monitoring terminal 104 connected via the monitoring intranet 103. However, one or more monitoring terminals 104 may instead be connected directly to the state monitoring/analysis server 102, and monitoring terminals 104 using both connection methods may be combined. Similarly, FIG. 1 shows, as an example, a single remote or mobile monitoring terminal 105 connected via the Internet 100, but a plurality of monitoring terminals 105 may be connected via the Internet 100. Furthermore, such a remote or mobile monitoring terminal 105 is not an essential component of the present invention.

FIG. 1 also shows only one information collection device 101 and one state monitoring/analysis server 102, but each may be provided in plurality. Because they are connected via the Internet 100, any number of information collection devices 101 can be freely connected to any number of state monitoring/analysis servers 102, which gives a high degree of freedom in configuring the fault monitoring system.

The information collection device 101 described above is realized by controlling a personal computer, or any of various higher-performance computers, with software specialized for information collection. The state monitoring/analysis server 102 is realized by controlling a large-capacity, high-performance server computer with software specialized for the state monitoring/analysis server. It receives various instructions and input data from the person in charge through the monitoring terminals 104 and 105, and displays or outputs the results of its data processing to that person.

[Configuration of the monitored computer system]
The general configuration of the monitored computer system 201 is described below with reference to FIGS. 2 to 4.

FIG. 2 is a block diagram showing a configuration example of a monitored computer 202, in which (a) shows the hardware configuration of the computer 202 and (b) shows its software configuration.

As shown in FIG. 2(a), the computer is configured by connecting a plurality of devices (hardware), such as a CPU 212, a memory 213, a disk 214, and a communication interface 215, through a bus 211. FIG. 2(a) shows only the basic devices that constitute a computer; in general, a wide variety of other devices are also used.

Many of these devices are general-purpose, and each manufacturer can produce them using its own technology, so the devices to be used can be selected from various viewpoints such as price, performance, and function. In addition, to improve performance, add functions, or replace a failed part, a new device can be added or a device can be replaced with a product from a different manufacturer.

FIG. 2(b) shows, as one example of a computer's software configuration, the configuration used when communicating via the communication interface 215 shown in FIG. 2(a). In this figure, under the OS (basic software) 221, an application 222 communicates with the communication interface 215 via an interface 223 and a driver 224.

With the software configuration of FIG. 2(b), even when a new device is added to the hardware configuration of FIG. 2(a), or a device is replaced with a product from a different manufacturer, the effect on the application 222 can be kept to a minimum by switching the hardware-dependent driver 224 or interface 223.

In general, each piece of software shown in FIG. 2(b), that is, the OS 221, the application 222, the interface 223, the driver 224, and so on, outputs log information 301 according to its operation and processing, and this log information is stored in the computer itself.

FIG. 3 is a data structure diagram showing an example of the log information 301 of a computer 202, stored within that computer as described for FIG. 2(b). The log information 301 shown in FIG. 3 is recognized automatically by the recognition functions that a computer provides as standard, and consists of configuration information 302, state information 303, and event log information 304.

The configuration information 302 indicates the configuration of the individual hardware and software components that make up the computer 202, and consists of hardware configuration information 302a and software configuration information 302b.

As the configuration information 302, items such as product name, serial number, date of manufacture, model, and version are stored as information for identifying each hardware or software component. The configuration information need not identify the hardware or software in detail; for example, it may consist only of the name of the hardware or software, or only of the date and time at which the hardware or software was installed in the computer.

The state information 303 indicates the state of the individual hardware and software components that make up the computer 202 and its peripheral devices, and consists of hardware state information 303a and software state information 303b.

As the state information 303, information such as the following is stored: for the CPU 212, the CPU load over one second; for the memory 213, the memory usage at that point in time; and for the disk 214, the amount of data transferred in one second.

In general, the configuration information 302 and state information 303 hold the latest information at the current point in time. For example, when the communication interface 215 is replaced with another product, the information from before the replacement is discarded from the log information 301 and the information from after the replacement is added. However, when the storage capacity for the log information 301 allows, the configuration information 302 and state information 303 from before the replacement need not be discarded, and information for a certain period may be retained.

The event log information 304, on the other hand, stores in time series the events (occurrences of events) detected by the computer 202 itself. In the example of FIG. 3, the event log information 304 consists of OS log information 304a output by the OS 221 and application log information 304b output by the application 222.

Information indicating various events, such as information indicating that hardware was accessed or that some abnormality was detected in hardware or software, is appended to the event log information 304 each time such an event is detected.
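The structure of the log information 301 described above can be pictured as a simple record type. The sketch below mirrors the reference numerals 302a to 304b in comments; the class name, field names, and sample values are hypothetical, and the specification does not prescribe any particular storage format.

```python
from dataclasses import dataclass, field


@dataclass
class LogInformation:
    """Hypothetical in-memory layout of log information 301."""

    hardware_config: dict = field(default_factory=dict)   # 302a: e.g. product name, model
    software_config: dict = field(default_factory=dict)   # 302b: e.g. name, version
    hardware_state: dict = field(default_factory=dict)    # 303a: e.g. CPU load, memory use
    software_state: dict = field(default_factory=dict)    # 303b
    os_log: list = field(default_factory=list)            # 304a: appended per event
    application_log: list = field(default_factory=list)   # 304b: appended per event


log = LogInformation(
    hardware_config={"communication_interface": "VendorA-100"},
    hardware_state={"cpu_load_1s": 0.42},
)
# Event log entries accumulate in time order, while configuration and state
# fields are normally overwritten with the latest values.
log.os_log.append(("2006-04-20T12:00:00", "hardware accessed"))
log.hardware_config["communication_interface"] = "VendorB-200"
```

The contrast between the overwritten dictionaries and the appended lists reflects the distinction drawn above between latest-value information and the time-series event log.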

FIGS. 2 and 3 show the case in which the log information 301 of a monitored computer 202 is stored in that computer itself. However, the log information 301 may instead be output via the communication interface 215 to another computer or device and stored there.

図２および図３に示したように、ログ情報３０１を保存した監視対象計算機２０２は、ログ情報３０１の一部として、自計算機の構成や現在の状態および過去のイベント情報を保存しているので、これらの情報を用いて、監視対象計算機システム２０１の異常を検出できるようになっている。また、監視対象計算機システム２０１の異常を検出するためには、ログ情報３０１として、構成情報３０２、状態情報３０３、イベントログ情報３０４の全てを有している必要もなく、これらの情報の一部のみを使用しても異常の検出は可能である。 As shown in FIGS. 2 and 3, the monitoring target computer 202 that stores the log information 301 holds, as part of the log information 301, its own configuration, current state, and past event information, so an abnormality in the monitoring target computer system 201 can be detected using this information. Further, in order to detect an abnormality in the monitoring target computer system 201, the log information 301 need not include all of the configuration information 302, the state information 303, and the event log information 304; an abnormality can be detected using only part of this information.

図４は、図１に示す監視対象計算機システム２０１の一例として、機能を多重化した監視対象計算機システム２０１の機能構成例を示すブロック図である。 FIG. 4 is a block diagram illustrating a functional configuration example of the monitoring target computer system 201 in which functions are multiplexed as an example of the monitoring target computer system 201 illustrated in FIG. 1.

この図４に示す例は、複数の機能Ａ〜Ｆを持つ監視対象計算機システム２０１において、機能Ａ〜Ｆを３台の計算機２０２１〜２０２３に配置する場合に、機能の幾つかを少なくとも２台以上の計算機に配置し、１台の計算機に異常が生じた場合に、他の正常な計算機でその異常が発生した計算機での機能を代替できるように構成したものである。 In the example shown in FIG. 4, in the monitoring target computer system 201 having a plurality of functions A to F, the functions A to F are distributed over three computers 2021 to 2023, and some of the functions are placed on at least two computers so that, when an abnormality occurs in one computer, another normal computer can take over the functions of the computer in which the abnormality occurred.

例えば、第１の計算機２０２１で何らかの異常が発生し、当該計算機２０２１の機能Ａ（４０１ａ）、機能Ｂ（４０２ａ）、機能Ｃ（４０３ａ）が実行できなくなった場合に、計算機２０２１の機能Ａ（４０１ａ）と機能Ｂ（４０２ａ）は、第２の計算機２０２２の機能Ａ（４０１ｂ）と機能Ｂ（４０２ｂ）で代替処理され、また、計算機２０２１の機能Ｃ（４０３ａ）は、第３の計算機２０２３の機能Ｃ（４０３ｂ）で代替処理されるようになっている。 For example, when some abnormality occurs in the first computer 2021 and the function A (401a), function B (402a), and function C (403a) of the computer 2021 can no longer be executed, the function A (401a) and function B (402a) of the computer 2021 are taken over by the function A (401b) and function B (402b) of the second computer 2022, and the function C (403a) of the computer 2021 is taken over by the function C (403b) of the third computer 2023.

このように、図４に示すような監視対象計算機システム２０１の多重構成は、例えば、プラント監視や制御を行うプラント監視システムなどに適用した場合に、当該監視対象計算機システムの一部に異常が発生した場合でも、プラント監視システムとしての機能喪失を防止できる構成である。 Thus, the multiplexed configuration of the monitoring target computer system 201 shown in FIG. 4 is a configuration that, when applied to, for example, a plant monitoring system that performs plant monitoring and control, can prevent loss of function as a plant monitoring system even if an abnormality occurs in part of the monitoring target computer system.

なお、この図４の例では、３つの機能Ａ〜Ｃのみについて多重化した場合について示しているが、他の３つの機能Ｄ〜Ｆについても、同様に別の計算機で代替可能に設定することも可能である。 Note that although the example of FIG. 4 shows the case where only the three functions A to C are multiplexed, the other three functions D to F can likewise be set up so that another computer can take them over.
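The substitution described for FIG. 4 can be illustrated with a small lookup table. The following is a minimal sketch, not the patented implementation: the `PLACEMENT` table, the function names, and the hosts chosen for the non-multiplexed functions D to F are all assumptions made for illustration; only the placement of functions A to C follows the text.

```python
from typing import Dict, List, Optional

# Hypothetical placement table modeled on FIG. 4: function -> computers hosting a copy.
# A to C follow the text; the hosts of D to F are assumptions for illustration.
PLACEMENT: Dict[str, List[str]] = {
    "A": ["2021", "2022"],
    "B": ["2021", "2022"],
    "C": ["2021", "2023"],
    "D": ["2022"],
    "E": ["2023"],
    "F": ["2023"],
}


def find_substitute(function: str, failed: str,
                    placement: Dict[str, List[str]] = PLACEMENT) -> Optional[str]:
    """Return another computer hosting `function`, or None if it is not multiplexed."""
    for computer in placement.get(function, []):
        if computer != failed:
            return computer
    return None


def failover_plan(failed: str,
                  placement: Dict[str, List[str]] = PLACEMENT) -> Dict[str, Optional[str]]:
    """Map each function hosted on the failed computer to its substitute (or None)."""
    return {
        function: find_substitute(function, failed, placement)
        for function, hosts in placement.items()
        if failed in hosts
    }
```

With this table, a failure of computer 2021 maps functions A and B to computer 2022 and function C to computer 2023, matching the example in the text; a function that maps to `None` has no substitute.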

［障害監視システムの動作］ [Operation of the fault monitoring system]
［動作の概略］ [Outline of operation]
図５は、以上のような構成を有する本実施形態に係る障害監視システムの通常運転時の動作と通信されるデータの概略を示すフローチャートである。 FIG. 5 is a flowchart outlining the operation of the fault monitoring system according to the present embodiment, configured as described above, during normal operation and the data that is communicated.

この図５に示すように、通常運転時において、監視対象計算機２０２は、ログ情報送信処理（Ｓ５１０）として、ログ情報が変化した場合には、情報収集装置１０１に対して情報変化通知を送信し、情報収集装置１０１からデータ送信要求がなされた場合には、要求に応じたログ情報を情報収集装置１０１に送信する。なお、監視対象計算機２０２によるこのログ情報送信処理（Ｓ５１０）は、通常運転時において連続的に繰り返し実行される。 As illustrated in FIG. 5, during normal operation, the monitoring target computer 202 transmits an information change notification to the information collection device 101 when the log information changes as log information transmission processing (S510). When a data transmission request is made from the information collecting apparatus 101, log information corresponding to the request is transmitted to the information collecting apparatus 101. The log information transmission process (S510) by the monitoring target computer 202 is continuously and repeatedly executed during normal operation.

また、通常運転時において、情報収集装置１０１は、データ編集手段１１２によりデータ通信手段１１１を通じてデータ収集・編集処理（Ｓ５２０）と要求データ送信処理（Ｓ５３０）を行う。 During normal operation, the information collection device 101 performs data collection / edit processing (S520) and request data transmission processing (S530) through the data communication unit 111 by the data editing unit 112.

データ収集・編集処理（Ｓ５２０）は、監視対象計算機２０２に対してデータ送信要求を行い、ログ情報を収集して編集し、ログ保存手段１１３に保存すると共に、受信したログ情報中に障害発生情報がある場合には、当該障害発生の影響度を算出して、状態監視・解析サーバ１０２に対して障害発生通知と障害発生詳細データを送信する処理である。情報収集装置１０１はまた、要求データ送信処理（Ｓ５３０）として、状態監視・解析サーバ１０２からデータ送信要求がなされた場合には、要求に応じた送信可能リストやログ情報を状態監視・解析サーバ１０２に送信する。 The data collection / editing process (S520) makes a data transmission request to the monitoring target computer 202, collects and edits the log information, and stores it in the log storage means 113; when the received log information contains failure occurrence information, the process calculates the degree of influence of the failure and transmits a failure occurrence notification and failure occurrence detailed data to the state monitoring / analysis server 102. As the request data transmission process (S530), the information collection apparatus 101 also transmits the transmittable list or log information corresponding to the request to the state monitoring / analysis server 102 when the state monitoring / analysis server 102 makes a data transmission request.

なお、図５においては、便宜上の理由から、データ収集・編集処理（Ｓ５２０）の後段に要求データ送信処理（Ｓ５３０）を示しているが、これらの処理は、通常運転時において、実際には同時並行的に行われ、各処理はそれぞれ連続的に繰り返し実行される。 Note that, for convenience, FIG. 5 shows the request data transmission process (S530) after the data collection / editing process (S520), but during normal operation these processes are actually performed concurrently, and each is executed repeatedly and continuously.

また、通常運転時において、状態監視・解析サーバ１０２は、障害発生緊急通知処理（Ｓ５４０）、ログ情報取得・データ分類処理（Ｓ５５０）、ログ解析処理（Ｓ５６０）、障害原因推定処理（Ｓ５７０）、障害発生予知処理（Ｓ５８０）を行う。 Also during normal operation, the state monitoring / analysis server 102 performs failure occurrence emergency notification processing (S540), log information acquisition / data classification processing (S550), log analysis processing (S560), failure cause estimation processing (S570), and failure occurrence prediction processing (S580).

障害発生緊急通知処理（Ｓ５４０）は、情報収集装置１０１から障害発生通知がなされた場合に、データ分類手段１２３により、データ通信手段１２１を通じてデータ送信要求を行い、障害発生詳細データを障害情報３３１として取得し、データ保存手段１２２に保存すると共に、緊急度の高い障害情報３３１が存在する場合には、データ通信手段１２１を通じて監視端末１０４，１０５に緊急通知を行う処理である。 In the failure occurrence emergency notification process (S540), when a failure occurrence notification is received from the information collection device 101, the data classification unit 123 makes a data transmission request through the data communication unit 121, acquires the failure occurrence detailed data as failure information 331, and stores it in the data storage unit 122; when failure information 331 with a high degree of urgency exists, an emergency notification is sent to the monitoring terminals 104 and 105 through the data communication unit 121.

ログ情報取得・データ分類処理（Ｓ５５０）は、データ分類手段１２３により、データ通信手段１２１を通じて情報収集装置１０１に対してデータ送信要求を行い、送信可能リストやログ情報３１１を取得して、受信したデータを通常のログ情報３２１と障害情報３３１に分類し、データ保存手段１２２に保存する処理である。 In the log information acquisition / data classification process (S550), the data classification unit 123 makes a data transmission request to the information collecting apparatus 101 through the data communication unit 121, acquires the transmittable list and the log information 311, classifies the received data into normal log information 321 and failure information 331, and stores them in the data storage unit 122.

ログ解析処理（Ｓ５６０）は、ログ解析手段１２４により、新規のログ情報３１１についてログの発生状況を時間軸で解析し、解析結果を解析情報３４１としてデータ保存手段１２２に保存し、データ通信手段１２１を通じて監視端末１０４，１０５に送信する処理である。 In the log analysis process (S560), the log analysis unit 124 analyzes, along the time axis, how logs occur in the new log information 311, stores the analysis result as analysis information 341 in the data storage unit 122, and transmits it to the monitoring terminals 104 and 105 through the data communication unit 121.

障害原因推定処理（Ｓ５７０）は、障害原因推定手段１２５により、発生した障害に係る障害情報３３１を過去の障害情報３３１と比較して障害の原因の候補を検出し、原因である可能性の確信度が設定値以上の候補を障害原因の推定結果とし、障害情報３３１の一部としてデータ保存手段１２２に保存し、データ通信手段１２１を通じて監視端末１０４，１０５に送信する処理である。 In the failure cause estimation process (S570), the failure cause estimation means 125 compares the failure information 331 on the failure that has occurred with past failure information 331 to detect candidate causes of the failure, takes the candidates whose confidence of being the cause is equal to or higher than a set value as the estimated failure cause, stores the result in the data storage unit 122 as part of the failure information 331, and transmits it to the monitoring terminals 104 and 105 through the data communication unit 121.

障害発生予知処理（Ｓ５８０）は、障害発生予知手段１２６により、与えられた期間の解析情報３４１を過去の解析情報３４１と比較して、類似の解析情報３４１に対応する障害の発生可能性の確信度が設定値以上の障害を予知結果を示す予知情報３５１としてデータ保存手段１２２に保存し、データ通信手段１２１を通じて監視端末１０４，１０５に送信する処理である。 In the failure occurrence prediction process (S580), the failure occurrence prediction means 126 compares the analysis information 341 for a given period with past analysis information 341, stores in the data storage unit 122, as prediction information 351 indicating the prediction result, any failure that corresponds to similar analysis information 341 and whose confidence of occurring is equal to or higher than a set value, and transmits it to the monitoring terminals 104 and 105 through the data communication unit 121.

なお、図５においては、便宜上の理由から、障害発生緊急通知処理（Ｓ５４０）、ログ情報取得・データ分類処理（Ｓ５５０）、ログ解析処理（Ｓ５６０）、障害原因推定処理（Ｓ５７０）、障害発生予知処理（Ｓ５８０）をこの順で示しているが、これらの処理は、通常運転時において、実際には同時並行的に行われ、各処理はそれぞれ連続的に繰り返し実行される。 In FIG. 5, for the sake of convenience, failure occurrence emergency notification processing (S540), log information acquisition / data classification processing (S550), log analysis processing (S560), failure cause estimation processing (S570), failure occurrence prediction Although the process (S580) is shown in this order, these processes are actually performed concurrently in normal operation, and each process is repeatedly executed continuously.

［各処理の詳細］ [Details of each process]
以下には、上記のような図５に示す各処理（Ｓ５１０〜Ｓ５８０）の手順について、図面を参照して説明する。 The procedure of each process (S510 to S580) shown in FIG. 5 is described below with reference to the drawings.

［監視対象計算機のログ情報送信処理］ [Log information transmission processing of the monitoring target computer]
図６は、監視対象計算機２０２によるログ情報送信処理（Ｓ５１０）の手順の一例を示すフローチャートである。 FIG. 6 is a flowchart illustrating an example of the procedure of the log information transmission process (S510) by the monitoring target computer 202.

この図６に示すように、ログ情報送信処理（Ｓ５１０）において、監視対象計算機２０２は、自計算機のログ情報３０１に何らかの変化、すなわち、新しい情報が加わったか、削除された項目があるか、あるいは内容が変化したか、などの項目が変化したか否かを判定する（Ｓ５１１）。そして、ログ情報３０１の変化を検出した場合（Ｓ５１１のＹＥＳ）には、その変化項目が時刻情報を付加すべき項目であれば、時刻情報を付加した後、ログ情報３０１に変化があったことを示す情報変化通知を情報収集装置１０１へ通知する（Ｓ５１２）。 As shown in FIG. 6, in the log information transmission process (S510), the monitoring target computer 202 determines whether there has been any change in its own log information 301, that is, whether new information has been added, an item has been deleted, or the contents of an item have changed (S511). If a change in the log information 301 is detected (YES in S511) and the changed item is one to which time information should be added, the time information is added; an information change notification indicating that the log information 301 has changed is then sent to the information collecting apparatus 101 (S512).

監視対象計算機２０２はまた、情報収集装置１０１からデータ送信要求がある場合（Ｓ５１３のＹＥＳ）には、要求された項目をログ情報３０１より取り出し、情報収集装置１０１に送信する（Ｓ５１４）。 In addition, when there is a data transmission request from the information collection apparatus 101 (YES in S513), the monitoring target computer 202 extracts the requested item from the log information 301 and transmits it to the information collection apparatus 101 (S514).

なお、ログ情報３０１の変化時に、変化項目に時刻を付加すべきか否かは、例えば、その項目の変化が本来は時刻を付加しないで記録されるような項目に対して、記録後における変化の検出時に後付で時刻を付加するように監視対象計算機２０２を設定することで実現可能である。このように、ログ情報３０１のうち、本来は時刻しないで記録されるような項目に対しても、監視対象計算機２０２自身でその変化を検出した段階で時刻を付加することにより、その変化項目を情報収集装置１０１が取得した段階で、情報収集装置１０１は、その変化項目の変化した時刻により近い時刻を取得することができる。 Whether a time should be added to a changed item when the log information 301 changes can be handled, for example, by configuring the monitoring target computer 202 so that, for items that are normally recorded without a time, a time is added afterwards when the change is detected. By having the monitoring target computer 202 itself add a time at the moment it detects the change, even for items of the log information 301 that are normally recorded without a time, the information collection apparatus 101 can obtain, when it acquires the changed item, a time closer to when the item actually changed.
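Steps S511 to S514, including the after-the-fact time stamping, can be sketched as follows. This is a minimal illustration under stated assumptions: the class `MonitoredComputer`, its method names, the dictionary layout of the log information, and the message fields are all hypothetical, and the set of items that receive a time is passed in as plain configuration.

```python
import time
from typing import Any, Callable, Dict, Iterable, Optional

_MISSING = object()  # sentinel distinguishing "absent" from any stored value


class MonitoredComputer:
    """Hypothetical model of the monitored side of S510."""

    def __init__(self, log_info: Dict[str, Any], timestamped_items: Iterable[str],
                 clock: Callable[[], float] = time.time) -> None:
        self.log_info = dict(log_info)             # current log information
        self._last_seen = dict(log_info)           # snapshot used for change detection
        self.timestamped_items = set(timestamped_items)
        self.change_times: Dict[str, float] = {}   # time added when a change is detected
        self.clock = clock

    def check_for_changes(self) -> Optional[Dict[str, Any]]:
        """S511/S512: return a change notification, or None if nothing changed."""
        keys = set(self._last_seen) | set(self.log_info)
        changed = sorted(
            key for key in keys
            if self._last_seen.get(key, _MISSING) != self.log_info.get(key, _MISSING)
        )
        if not changed:
            return None
        for key in changed:
            if key in self.timestamped_items:
                self.change_times[key] = self.clock()
        self._last_seen = dict(self.log_info)
        return {"type": "information_change", "items": changed}

    def handle_request(self, items: Iterable[str]) -> Dict[str, Dict[str, Any]]:
        """S513/S514: return the requested items together with any added time."""
        return {
            key: {"value": self.log_info[key], "changed_at": self.change_times.get(key)}
            for key in items
            if key in self.log_info
        }
```

The notification carries only the names of the changed items; the values are sent when the collector asks for them, which mirrors the notify-then-request flow in the text.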

あるいはまた、ログ情報３０１の項目に対して時刻を付加するか否かの判断を行う条件を、予め情報収集装置１０１側で用意しておき、その条件を、情報収集装置１０１から監視対象計算機２０２に対して送信して保存することも可能である。 Alternatively, the conditions for deciding whether to add a time to an item of the log information 301 can be prepared in advance on the information collecting apparatus 101 side, transmitted from the information collecting apparatus 101 to the monitoring target computer 202, and stored there.

また、ログ情報送信処理（Ｓ５１０）の手順の変形例として、ログ情報の変化を情報収集装置１０１に通知せず、情報収集装置１０１からのデータ送信要求に応じてログ情報３０１の中から要求された項目を送信する処理（Ｓ５１３、Ｓ５１４）を行うだけでもよい。 As a modification of the procedure of the log information transmission process (S510), a change in log information is not notified to the information collection apparatus 101, but is requested from the log information 301 in response to a data transmission request from the information collection apparatus 101. It is also possible to simply perform the process of transmitting the item (S513, S514).

なお、監視対象計算機２０２において、以上のようなログ情報送信処理（Ｓ５１０）を実現するためには、監視対象計算機２０２に、ログ情報送信専用のログ情報通信手段を設けてもよいが、あるいはまた、計算機に標準的に備わっているネットワーク通信の汎用手段を用いて、情報収集装置１０１により、監視対象計算機２０２からログ情報３０１を直接取り出すことも可能である。 In order to realize the log information transmission process (S510) as described above in the monitoring target computer 202, the monitoring target computer 202 may be provided with log information communication means dedicated to log information transmission. The log information 301 can be directly extracted from the monitoring target computer 202 by the information collecting apparatus 101 using a general-purpose means for network communication provided in the computer.

［情報収集装置の処理］ [Processing of the information collecting apparatus]
［データ収集・編集処理］ [Data collection / editing processing]
図７は、情報収集装置１０１のデータ編集手段１１２によるデータ収集・編集処理（Ｓ５２０）の手順の一例を示すフローチャートである。 FIG. 7 is a flowchart illustrating an example of the procedure of the data collection / editing process (S520) by the data editing unit 112 of the information collecting apparatus 101.

この図７に示すように、データ収集・編集処理（Ｓ５２０）において、情報収集装置１０１のデータ編集手段１１２は、実行開始条件が満たされた場合（Ｓ５２１のＹＥＳ）に、データ通信手段１１１を通じて、監視対象計算機２０２に対しデータ送信要求を行う（Ｓ５２２）。図７の例では、実行開始条件は、一例として、監視対象計算機２０２からの情報変化通知を受信した場合、あるいは、予め周期的実行を行うように設定されていて、その実行時刻になった場合、のいずれかである。 As shown in FIG. 7, in the data collection / editing process (S520), the data editing unit 112 of the information collecting apparatus 101 makes a data transmission request to the monitoring target computer 202 through the data communication unit 111 when the execution start condition is satisfied (YES in S521) (S522). In the example of FIG. 7, the execution start condition is, for example, either that an information change notification has been received from the monitoring target computer 202, or that periodic execution has been set up in advance and the execution time has arrived.

データ編集手段１１２は、データ送信要求に対して、監視対象計算機２０２からデータ通信手段１１１によりログ情報３０１のデータを受信すると（Ｓ５２３のＹＥＳ）、受信したデータを時系列順に編集する（Ｓ５２４）。以下には、データを時系列順に編集する具体的な方法として、ハードウェア構成情報３０２ａを時系列順に編集する場合の一例について説明する。 In response to the data transmission request, the data editing unit 112 receives the data of the log information 301 from the monitoring target computer 202 by the data communication unit 111 (YES in S523), and edits the received data in chronological order (S524). Hereinafter, an example in which the hardware configuration information 302a is edited in time series will be described as a specific method for editing data in time series.

まず、監視対象計算機２０２から受信したログ情報３０１中のハードウェア構成情報３０２ａは、現在の監視対象計算機２０２を構成する全ハードウェア機器の情報を含む。データ編集手段１１２は、今回受信したデータ中のハードウェア構成情報を編集する場合、今回のハードウェア構成情報を、以前に受信した過去のハードウェア構成情報と比較して、構成の変更を検出した場合には、このハードウェア構成の変更を事象の発生として取り扱い、ハードウェア構成の変更を示すデータに時刻を付加して時系列データとして編集する。これにより、本来は時刻が付加されていないハードウェア構成の変更に関する構成変更情報を、時系列データとして編集することができる。 First, the hardware configuration information 302 a in the log information 301 received from the monitoring target computer 202 includes information on all hardware devices that make up the current monitoring target computer 202. When editing the hardware configuration information in the data received this time, the data editing unit 112 compares the current hardware configuration information with the previously received hardware configuration information and detects a configuration change. In this case, this hardware configuration change is handled as an occurrence of an event, and time is added to data indicating the hardware configuration change and edited as time series data. Thereby, the configuration change information regarding the change of the hardware configuration to which the time is not originally added can be edited as time series data.

なお、ハードウェア構成の変更を示すデータに付加する時刻は、例えば、今回の受信時刻または処理時刻、あるいは、ログ情報送信処理（Ｓ５１０）において監視対象計算機２０２で付加された時刻、のいずれかである。 The time added to the data indicating the hardware configuration change is, for example, either the time of the current reception or processing, or the time added by the monitoring target computer 202 in the log information transmission process (S510).

さらに、監視対象計算機２０２のログ情報３０１中のソフトウェア構成情報３０２ｂ、ハードウェア状態情報３０３ａ、ソフトウェア状態情報３０３ｂについても、ハードウェア構成情報３０２ａと同様の上記方法を用いて、構成や状態の変更に関する構成変更情報や状態変更情報を、時系列データとして編集することが可能である。 Further, the software configuration information 302b, hardware status information 303a, and software status information 303b in the log information 301 of the monitoring target computer 202 are also related to the configuration and status change using the same method as the hardware configuration information 302a. The configuration change information and the state change information can be edited as time series data.
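The comparison-based editing described above, in which a configuration or state change is treated as an event and given a time, can be sketched as follows. This is an illustrative sketch only: the function name, the dictionary representation of a configuration snapshot, and the event fields are assumptions. Preferring a time reported by the monitored computer over the reception time is one possible policy; the text allows either.

```python
from typing import Any, Dict, List, Optional


def configuration_changes_to_events(
    previous: Dict[str, Any],
    current: Dict[str, Any],
    received_at: float,
    reported_times: Optional[Dict[str, float]] = None,
) -> List[Dict[str, Any]]:
    """Compare two configuration snapshots and emit one timed event per change."""
    reported_times = reported_times or {}
    events = []
    for device in sorted(set(previous) | set(current)):
        before, after = previous.get(device), current.get(device)
        if before == after:
            continue  # unchanged items produce no event
        events.append({
            # Use the time added by the monitored computer if present,
            # otherwise fall back to the reception time at the collector.
            "time": reported_times.get(device, received_at),
            "device": device,
            "before": before,   # None means the device was newly added
            "after": after,     # None means the device was removed
        })
    events.sort(key=lambda event: event["time"])
    return events
```

The same function applies unchanged to software configuration or to hardware and software state snapshots, since it only compares keyed values.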

データ編集手段１１２は、次に、データの簡略化が可能な場合（Ｓ５２５のＹＥＳ）には、可能な限りデータの簡略化（Ｓ５２６）を行い、データ量を低減する。以下には、データを簡略化する具体的な方法として、イベントログ情報３０４を簡略化する場合の一例について説明する。 Next, when the data can be simplified (YES in S525), the data editing unit 112 performs data simplification (S526) as much as possible to reduce the data amount. Hereinafter, an example of simplifying the event log information 304 will be described as a specific method for simplifying data.

まず、イベントログ情報３０４中のイベントログは、一定期間に監視対象計算機２０２で発生した事象を、時刻と事象内容を１組として時系列に並べたテキスト形式で構成されている。すなわち、同一事象が発生した場合には、発生時刻の異なるデータが複数存在している。また、同一事象でなくとも、事象内容の一部の表現が同一である場合も存在する。データの簡略化においては、このような同一事象あるいは同一の表現を、ある特定の記号に置換えて簡略化し、簡略前のデータは、その少なくとも１つのみを元データとして残し、その特定の記号と対応させる。 First, the event log in the event log information 304 is in text format, listing the events that occurred in the monitoring target computer 202 during a certain period in time series as pairs of time and event content. Consequently, when the same event occurs repeatedly, there are multiple entries that differ only in occurrence time. Even entries for different events may share part of the wording of their event content. In data simplification, such identical events or identical expressions are replaced with a specific symbol, and at least one copy of the pre-simplification data is kept as original data and associated with that symbol.

あるいはまた、簡略前のデータを残さなくても、置換えた記号と対応可能な場合は、簡略化前の元データを全く残さずに、同一事象あるいは同一の表現を全ての簡略化することも可能である。 Alternatively, if the replaced symbol can be mapped back without keeping the pre-simplification data, all identical events or identical expressions can be simplified without keeping any of the original data.
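The symbol substitution can be sketched as a simple dictionary encoding. This is a minimal illustration, assuming whole event contents are replaced (partial-expression replacement is not shown); the symbol format `E1`, `E2`, ... and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

Entry = Tuple[float, str]  # (occurrence time, event content)


def simplify_event_log(entries: List[Entry]) -> Tuple[List[Entry], Dict[str, str]]:
    """Replace each distinct event content with a short symbol.

    Returns the simplified log and a symbol -> original content table,
    which plays the role of the single retained copy of the original data.
    """
    symbol_of: Dict[str, str] = {}
    simplified: List[Entry] = []
    for timestamp, content in entries:
        if content not in symbol_of:
            symbol_of[content] = f"E{len(symbol_of) + 1}"
        simplified.append((timestamp, symbol_of[content]))
    table = {symbol: content for content, symbol in symbol_of.items()}
    return simplified, table


def restore_event_log(simplified: List[Entry], table: Dict[str, str]) -> List[Entry]:
    """Expand symbols back to the original event content."""
    return [(timestamp, table[symbol]) for timestamp, symbol in simplified]
```

Because every occurrence keeps its own time, only the repeated text is shortened and no timing information is lost.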

データ編集手段１１２は、以上のような一連の処理により編集した時系列データを、図８に示すように、一定周期毎に区切って、周期単位の時系列データグループＧ１〜Ｇｎからなるログ情報３１１としてログ保存手段１１３に保存する（Ｓ５２７）。 As shown in FIG. 8, the data editing unit 112 divides the time series data edited by the above-described series of processes at regular intervals, and logs information 311 including time series data groups G1 to Gn in units of periods. Is stored in the log storage means 113 (S527).
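Dividing the edited time-series data into per-period groups can be sketched as follows. The function name, the `origin` parameter, and the group naming scheme `G1`, `G2`, ... keyed by period index are assumptions for illustration; the sketch also assumes every event time is at or after `origin`.

```python
from collections import defaultdict
from typing import Any, Dict, List


def group_by_period(events: List[Dict[str, Any]], period_seconds: float,
                    origin: float = 0.0) -> Dict[str, List[Dict[str, Any]]]:
    """Split time-ordered events into fixed-length periods named G1, G2, ..."""
    groups: Dict[str, List[Dict[str, Any]]] = defaultdict(list)
    for event in sorted(events, key=lambda e: e["time"]):
        # Index of the period that contains this event, counted from `origin`.
        index = int((event["time"] - origin) // period_seconds)
        groups[f"G{index + 1}"].append(event)
    return dict(groups)
```

Periods in which no event occurred simply produce no group, so the group names need not be contiguous.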

データ編集手段１１２はさらに、影響度判定・通知処理（Ｓ５２８）として、図９に示すような一連の処理を行う。 Further, the data editing unit 112 performs a series of processes as shown in FIG. 9 as the influence determination / notification process (S528).

図９に示すように、受信したデータ中に障害発生情報がある場合（Ｓ５２８１のＹＥＳ）には、データ編集手段１１２は、その障害発生情報に係る障害を発生した監視対象計算機の機能について、当該障害の発生が監視対象計算機システム２０１に与える影響度を算出する（Ｓ５２８２）。算出された影響度が予め設定された設定値以上である場合（Ｓ５２８３のＹＥＳ）には、データ編集手段１１２は、代替可能な別の計算機の有無を判定し（Ｓ５２８４）、代替可能な計算機が存在する場合（Ｓ５２８４のＹＥＳ）には、影響度を下方修正する（Ｓ５２８５）。 As shown in FIG. 9, when there is failure occurrence information in the received data (YES in S5281), the data editing means 112 calculates, for the function of the monitoring target computer in which the failure indicated by the failure occurrence information occurred, the degree of influence that the failure has on the monitoring target computer system 201 (S5282). If the calculated degree of influence is equal to or greater than a preset value (YES in S5283), the data editing unit 112 determines whether there is another computer that can take over (S5284), and if such a computer exists (YES in S5284), revises the degree of influence downward (S5285).

この場合、例えば、影響度の高低を、「緊急」、「警告」、「通常」、等の複数のレベルに区分して判定するための複数の判断基準値を予め設定しておく。この場合には、一例として、算出された影響度が「緊急」である場合には、代替可能な別の計算機の有無を判定し、代替可能な計算機が存在する場合には、影響度を「緊急」から「警告」に下方修正するなどの運用が可能である。 In this case, for example, a plurality of judgment reference values are set in advance for classifying the degree of influence into a plurality of levels such as “emergency”, “warning”, and “normal”. Then, as one example, when the calculated degree of influence is “emergency”, the presence or absence of another computer that can take over is determined, and if such a computer exists, the degree of influence is revised downward from “emergency” to “warning”.
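The level classification and downward revision (S5282 to S5285) can be sketched as follows. The text does not say how the degree of influence is computed, so the numeric score in the range 0 to 1, the threshold values, and all names here are assumptions for illustration only.

```python
from typing import Dict, List

LEVELS: List[str] = ["normal", "warning", "emergency"]            # ascending severity
THRESHOLDS: Dict[str, float] = {"emergency": 0.8, "warning": 0.4}  # hypothetical values


def classify(score: float) -> str:
    """Map a numeric influence score to a level using the reference values."""
    if score >= THRESHOLDS["emergency"]:
        return "emergency"
    if score >= THRESHOLDS["warning"]:
        return "warning"
    return "normal"


def assess_impact(score: float, has_substitute: bool,
                  review_level: str = "emergency") -> str:
    """Classify the score, then lower it one level if a substitute computer exists.

    The downward revision is only considered when the level is at or above
    `review_level`, mirroring the "equal to or greater than a set value" check.
    """
    level = classify(score)
    index = LEVELS.index(level)
    if index >= LEVELS.index(review_level) and has_substitute:
        index = max(index - 1, 0)  # never go below the lowest level
    return LEVELS[index]
```

With the default `review_level`, only an "emergency" is ever revised, which corresponds to the emergency-to-warning example in the text.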

データ編集手段１１２は、影響度の算出・修正後、状態監視・解析サーバ１０２に対し、データ通信手段１１１を通じて、障害発生に関する詳細情報を示すデータではなく、障害発生の要旨のみを示す簡略なデータを障害発生通知として送信する（Ｓ５２８６）。この障害発生通知は、例えば、障害の発生時刻、影響度の程度、障害の概要、などの項目を、短い文章や記号などにより表現する簡略なデータである。 The data editing unit 112, after the calculation / correction of the degree of influence, provides the state monitoring / analysis server 102 with simple data indicating only the gist of the failure occurrence, not the data indicating the detailed information about the failure occurrence through the data communication unit 111. Is transmitted as a failure occurrence notification (S5286). This failure occurrence notification is, for example, simple data that expresses items such as the failure occurrence time, the degree of influence, and the outline of the failure with short sentences or symbols.

障害発生通知の送信後に、データ通信手段１１１により状態監視・解析サーバ１０２からのデータ送信要求を受信した場合（Ｓ５２８７のＹＥＳ）には、データ編集手段１１２は、状態監視・解析サーバ１０２に対して、データ通信手段１１１を通じて、障害発生に関する詳細情報を示す障害発生詳細データを送信する（Ｓ５２８８）。 If, after transmitting the failure occurrence notification, the data communication unit 111 receives a data transmission request from the state monitoring / analysis server 102 (YES in S5287), the data editing unit 112 transmits failure occurrence detailed data indicating detailed information on the failure to the state monitoring / analysis server 102 through the data communication unit 111 (S5288).
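The two-stage exchange in S5286 to S5288, a short notification first and the detailed data only on request, can be sketched as follows. The class `FailureNotifier`, the `failure_id` field used to correlate the request with the notification, and the message layout are hypothetical; the text does not specify how the two messages are matched.

```python
from typing import Any, Dict, Optional


class FailureNotifier:
    """Hypothetical collector-side helper for summary-then-detail notification."""

    def __init__(self) -> None:
        self._details: Dict[int, Any] = {}
        self._next_id = 1

    def notify(self, occurred_at: float, impact: str, summary: str,
               details: Any) -> Dict[str, Any]:
        """S5286: build the short notification and keep the details for later."""
        failure_id = self._next_id
        self._next_id += 1
        self._details[failure_id] = details
        return {"failure_id": failure_id, "time": occurred_at,
                "impact": impact, "summary": summary}

    def handle_detail_request(self, failure_id: int) -> Optional[Any]:
        """S5287/S5288: return the detailed data; it stays available for retries."""
        return self._details.get(failure_id)
```

Keeping the details after the first request means a failed transfer can simply be requested again.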

以上のように、情報収集装置１０１において、図７に示すようなデータ収集・編集処理（Ｓ５２０）を行うことにより、各種の事象の発生を時刻付で示すイベントログ情報などの本来の時系列データだけでなく、受信したログ情報中の監視対象計算機の構成情報または状態情報を過去の情報と比較して変更を検出した場合に、この変更を事象の発生として取り扱い、時刻を付加して変更事象の発生を示す時系列データとして編集することができる。 As described above, by performing the data collection / editing process (S520) shown in FIG. 7, the information collection apparatus 101 can handle not only inherently time-series data, such as event log information that records the occurrence of various events with times, but also configuration or state information: when a change is detected by comparing the configuration information or state information of the monitoring target computer in the received log information with past information, the change is treated as the occurrence of an event, a time is added, and it is edited as time-series data indicating the occurrence of a change event.

したがって、情報収集装置１０１でこのように編集したデータを状態監視・解析サーバ１０２に送信することにより、状態監視・解析サーバ１０２においては、各種の事象の発生を示す時系列データに加えて、監視対象計算機の構成や状態に関する変更についても、変更事象の発生を示す時系列データを用いることができ、それによって、障害発生の予知を行うことが可能となる。 Therefore, by transmitting the data edited in this way by the information collection device 101 to the state monitoring / analysis server 102, the state monitoring / analysis server 102 can monitor in addition to time-series data indicating the occurrence of various events. Time series data indicating the occurrence of a change event can also be used for changes related to the configuration and state of the target computer, thereby making it possible to predict the occurrence of a failure.

また、情報収集装置１０１において、収集したログ情報に同一の事象内容データが含まれる場合には、当該事象内容データを略式データに置き換えて編集することができる。したがって、情報収集装置１０１から状態監視・解析サーバ１０２に送信するデータ量を低減することができるため、両者を接続するインターネット１０４等のネットワークに過度の負荷を与えることがない。 Further, in the information collecting apparatus 101, when the collected log information contains identical event content data, that event content data can be replaced with abbreviated data during editing. This reduces the amount of data transmitted from the information collection apparatus 101 to the state monitoring / analysis server 102, so no excessive load is placed on the network, such as the Internet 104, that connects them.

また、簡略化したデータを保存することにより、必要なログ情報をできる限りコンパクトに格納できることから、ログ保存手段１１３に要求されるデータ容量を抑制することも可能となる。 Further, by saving the simplified data, the necessary log information can be stored as compactly as possible, so that the data capacity required for the log storage means 113 can be suppressed.

さらに、情報収集装置１０１において、図９に示すような影響度判定・通知処理（Ｓ５２８）を行うことにより、受信したデータ中に、システムに与える影響度の高い障害発生情報がある場合には、障害発生の要旨のみを示す簡略なデータをまず通知することにより、状態監視・解析サーバ１０２に対し、少量のデータで迅速かつ確実に障害発生通知を送信することができる。 Furthermore, in the information collection apparatus 101, by performing the influence degree determination / notification process (S528) as shown in FIG. 9, when there is failure occurrence information having a high influence degree on the system in the received data, By first notifying simple data indicating only the gist of the failure occurrence, it is possible to quickly and reliably send the failure occurrence notification to the state monitoring / analysis server 102 with a small amount of data.

そして、通知を受けた状態監視・解析サーバ１０２からのデータ送信要求に応じて障害発生詳細データを送信することにより、状態監視・解析サーバ１０２側の担当者に対して、障害発生の要旨を通知した後に、障害発生に関する詳細情報を通知することができるため、システムに与える影響度の高い障害発生に対する迅速かつ適切な対応の実現に貢献可能である。 Then, the failure occurrence detailed data is transmitted in response to the data transmission request from the status monitoring / analysis server 102 that has received the notification, thereby notifying the person in charge of the status monitoring / analysis server 102 of the gist of the failure occurrence. After that, detailed information regarding the occurrence of the failure can be notified, which can contribute to the realization of a prompt and appropriate response to the occurrence of the failure having a high impact on the system.

また、図９に示す影響度判定・通知処理（Ｓ５２８）とは異なり、簡略なデータによる障害発生通知を行わず、初めから障害発生詳細データを障害発生通知として送信した場合には、データ量の大きな障害発生詳細データの送信が不成功となった場合に、障害発生の通知自体が不成功となってしまう。 Unlike the influence determination / notification process (S528) shown in FIG. 9, when the failure occurrence detailed data is transmitted as the failure occurrence notification from the beginning without performing the failure occurrence notification using simple data, the data amount When transmission of large failure occurrence detail data is unsuccessful, the failure occurrence notification itself is unsuccessful.

これに対して、図９に示す影響度判定・通知処理（Ｓ５２８）におけるような、簡略なデータによる障害発生通知は、不成功になる可能性を極力低減できるため、状態監視・解析サーバ１０２に対して障害発生の通知を確実に行うことができる。そして、通知を確認した状態監視・解析サーバ１０２からのデータ送信要求に応じて障害発生詳細データを送信することにより、仮に、障害発生詳細データの送信が不成功になった場合でも、状態監視・解析サーバ１０２からの再度のデータ送信要求に応じて障害発生詳細データを再度送信するなどの運用が可能となる。 On the other hand, since the failure occurrence notification based on simple data as in the influence determination / notification processing (S528) shown in FIG. 9 can reduce the possibility of failure as much as possible, the state monitoring / analysis server 102 It is possible to reliably notify the occurrence of a failure. Then, by transmitting the failure occurrence detailed data in response to the data transmission request from the state monitoring / analysis server 102 that has confirmed the notification, even if the failure occurrence detailed data transmission is unsuccessful, the state monitoring / In response to a data transmission request from the analysis server 102 again, it is possible to perform operations such as transmitting the failure occurrence detailed data again.

また、簡略なデータによる障害発生通知は、通知先のデータ処理容量が小さい場合であっても確実に送信することができるため、緊急時においては、情報収集装置１０１から直接、あるいは、後述するような状態監視・解析サーバ１０２の障害発生緊急通知処理（Ｓ５４０）を介して、携帯電話等への緊急通知が可能となる。 In addition, since the failure occurrence notification based on simple data can be reliably transmitted even when the data processing capacity of the notification destination is small, in the event of an emergency, either directly from the information collection device 101 or as described later. The emergency notification to the mobile phone or the like becomes possible through the failure occurrence emergency notification processing (S540) of the state monitoring / analysis server 102.

［要求データ送信処理］ [Request data transmission processing]
図１０は、情報収集装置１０１のデータ編集手段１１２による要求データ送信処理（Ｓ５３０）の手順の一例を示すフローチャートである。 FIG. 10 is a flowchart illustrating an example of the procedure of the request data transmission process (S530) by the data editing unit 112 of the information collection apparatus 101.

この図１０に示すように、要求データ送信処理（Ｓ５３０）において、情報収集装置１０１のデータ編集手段１１２は、データ通信手段１１１により通常のデータ送信要求を受信した場合（Ｓ５３１のＹＥＳ）に、それがリスト要求であるか否かを判定する（Ｓ５３２）。リスト要求である場合（Ｓ５３２のＹＥＳ）は、ログ保存手段１１３に保存されているログ情報のうち、データ送信可能となっている項目を示す送信可能リストを送信データとして生成する（Ｓ５３３）。 As shown in FIG. 10, in the request data transmission process (S530), when the data communication unit 111 receives a normal data transmission request (YES in S531), the data editing unit 112 of the information collection device 101 determines whether it is a list request (S532). If it is a list request (YES in S532), a transmittable list indicating the items of the log information stored in the log storage unit 113 that are available for transmission is generated as transmission data (S533).

この場合、図８に示したように、ログ保存手段１１３に保存されているログ情報３１１は、一定周期毎に区切られた周期単位の時系列データグループＧ１〜Ｇｎから構成されているため、送信可能リストは、例えば、その周期単位の時系列データグループ名を列挙したリストとなる。 In this case, as shown in FIG. 8, the log information 311 stored in the log storage unit 113 consists of the per-period time-series data groups G1 to Gn, divided at fixed intervals, so the transmittable list is, for example, a list of the names of those per-period time-series data groups.

また、データ送信要求がリスト要求ではなく、ログ情報要求である場合（Ｓ５３２のＮＯ）は、データ編集手段１１２は、ログ保存手段１１３に保存されているログ情報の中から要求されたログ情報を取り出して送信データを生成する（Ｓ５３４）。データ送信要求がログ情報要求である場合は、一般的には、時刻または時間の指定を含むため、その指定された時刻または時間に該当するデータを、図８に示すような時系列データグループ単位で取り出して、必要に応じたデータ編集処理を行い、送信データを生成する。 When the data transmission request is not a list request but a log information request (NO in S532), the data editing unit 112 extracts the requested log information from the log information stored in the log storage unit 113 and generates transmission data (S534). A log information request generally includes a specified time or time range, so the data corresponding to the specified time or time range is extracted in units of the time-series data groups shown in FIG. 8, edited as necessary, and turned into transmission data.

The data editing unit 112 transmits the generated transmission data, either the transmittable list or the log information 311, to the state monitoring / analysis server 102 via the data communication unit 111 (S535). When log information 311 has been transmitted, information indicating that the corresponding item of the log information 311 has been transmitted is added (S536).

As described above, by performing the request data transmission process (S530) shown in FIG. 10 in the information collection apparatus 101, operations such as the following become possible: the state monitoring / analysis server 102 issues a list request to the information collection apparatus 101, checks which items of the log information 311 it has not yet received, and then requests and receives those items. In this case, whether the state monitoring / analysis server 102 has already received a given item of log information can be determined by including in the transmittable list the transmitted-status information managed on the information collection apparatus 101 side. As a modification, information indicating that an item has been received may instead be managed on the state monitoring / analysis server 102 side.

In either case, by managing information on transmittable items together with information on which items have been transmitted or received, the information collection apparatus 101 can transmit to the state monitoring / analysis server 102 only those untransmitted items in the transmittable list that the server needs. This avoids wastefully retransmitting items that have already been sent and keeps the amount of data transmitted to the state monitoring / analysis server 102 as small as possible.
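The branching in steps S532 to S536 can be pictured with a minimal Python sketch. The class `LogStore`, the function `handle_request`, and the dictionary-based request and reply formats are hypothetical names chosen for illustration; the description does not specify any data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class LogStore:
    """Hypothetical stand-in for the log storage unit 113."""
    groups: Dict[str, List[str]]                 # group name (G1..Gn) -> records
    sent: Set[str] = field(default_factory=set)  # names already transmitted


def handle_request(store: LogStore, request: dict) -> dict:
    """Build the reply to one data transmission request (S532 to S536)."""
    if request["type"] == "list":
        # S533: list every transmittable group with its transmitted flag.
        items = [{"name": n, "sent": n in store.sent} for n in store.groups]
        return {"type": "list", "items": items}
    # S534: extract only the requested groups that actually exist.
    data = {n: store.groups[n] for n in request["groups"] if n in store.groups}
    # S536: record that these groups have now been transmitted.
    store.sent.update(data)
    return {"type": "log", "data": data}
```

Carrying the transmitted flag inside the list reply corresponds to the variant in which the information collection apparatus 101 manages the transmitted status; the server-side variant would instead keep its own set of received group names.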

[Processing of the state monitoring / analysis server]
FIG. 11 is a block diagram showing the processes performed by the state monitoring / analysis server 102 and the data acquired, stored, or used in those processes. Each process of the state monitoring / analysis server 102 is described below in turn with reference to FIG. 11 and the flowcharts in FIG. 12 and subsequent figures.

[Failure occurrence emergency notification process]
FIG. 12 is a flowchart showing an example of the procedure of the failure occurrence emergency notification process (S540) performed by the data classification unit 123 of the state monitoring / analysis server 102.

As shown in FIG. 12, in the failure occurrence emergency notification process (S540), when the data classification unit 123 of the state monitoring / analysis server 102 receives a failure occurrence notification from the information collection apparatus 101 via the data communication unit 121 (YES in S541), it determines whether the degree of influence included in the failure occurrence notification is equal to or greater than a preset value (S542).

If the degree of influence is equal to or greater than the preset value (YES in S542), the data classification unit 123 selects notification destinations according to the degree of influence from among a plurality of preset notification destinations, including the monitoring terminals 104 and 105, and sends a failure occurrence emergency notification to the selected destinations via the data communication unit 121 (S543). For example, when the degree of influence is "emergency", all notification destinations including the monitoring terminals 104 and 105 may be selected, and when it is "warning", only the monitoring terminals 104 and 105 may be selected. The data sent as the failure occurrence emergency notification is compact: it essentially reuses the failure occurrence notification data received from the information collection apparatus 101 as is.
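As a rough illustration of S542 and S543, the following Python sketch selects destinations by degree of influence. The `Influence` levels (including the `INFO` level), the destination names, and the per-destination minimum levels are assumptions made for this example; only the "emergency" and "warning" levels and the monitoring terminals 104 and 105 come from the description.

```python
from enum import IntEnum


class Influence(IntEnum):
    """Hypothetical ordering of degrees of influence."""
    INFO = 0
    WARNING = 1
    EMERGENCY = 2


# Hypothetical preset destinations, each with the lowest degree of
# influence at which it should be notified.
DESTINATIONS = {
    "monitoring-terminal-104": Influence.WARNING,
    "monitoring-terminal-105": Influence.WARNING,
    "on-call-mobile-phone": Influence.EMERGENCY,
}


def select_destinations(influence, threshold=Influence.WARNING):
    """Return the destinations to notify for one failure notification."""
    if influence < threshold:
        return []  # NO in S542: no emergency notification is sent
    # S543: pick every destination whose minimum level is reached.
    return [name for name, minimum in DESTINATIONS.items() if influence >= minimum]
```

With these assumed presets, a "warning" reaches only the two monitoring terminals, while an "emergency" reaches every destination.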

The failure occurrence emergency notification received by the monitoring terminal 104 or 105 is displayed on that terminal. When the person operating the monitoring terminal 104 or 105 checks the displayed notification and performs an acknowledgment operation, or when the monitoring terminal 104 or 105 itself recognizes the failure occurrence emergency notification, the monitoring terminal sends a data transmission request to the state monitoring / analysis server 102.

In such a case, when the state monitoring / analysis server 102 receives the data transmission request from the monitoring terminal 104 or 105 via the data communication unit 121 (YES in S544), the data classification unit 123 sends a data transmission request, via the data communication unit 121, to the information collection apparatus 101 that issued the notification (S545). As a modification, the condition for starting this data transmission request is not limited to receipt of a data transmission request from the monitoring terminals 104 and 105; it can be changed as appropriate for the actual system operation, for example to the point at which the data classification unit 123 recognizes the content of a failure occurrence notification sent to the state monitoring / analysis server 102.

If, after sending the data transmission request, the data classification unit 123 receives detailed failure occurrence data from the requested information collection apparatus 101 via the data communication unit 121 (YES in S546), it classifies the received detailed failure occurrence data as failure information 331, distinct from the normal log information 321, and stores it in the data storage unit 122 in chronological order (S547). It also transmits the detailed failure occurrence data to the notification destinations, such as the monitoring terminals 104 and 105, to which the failure occurrence emergency notification was sent (S548).

As described above, by performing the failure occurrence emergency notification process (S540) shown in FIG. 12 in the state monitoring / analysis server 102, the person operating the monitoring terminals 104 and 105 can first grasp, from the failure occurrence emergency notification, the gist of a failure that has a high degree of influence on the system and requires urgent action, and can then obtain detailed information on the failure. This enables a quick and appropriate response to failures that have a high degree of influence on the system.

Also, as with the influence determination / notification process (S528), a failure occurrence emergency notification consisting of compact data keeps the possibility of an unsuccessful transmission as low as possible, so the monitoring terminals 104 and 105 can be reliably notified that a failure has occurred. Furthermore, because the detailed failure occurrence data is transmitted in response to a data transmission request from a monitoring terminal 104 or 105 that has acknowledged the notification, even if transmission of the detailed data fails, it can be transmitted again in response to a repeated data transmission request from the monitoring terminal.

In addition, because a failure occurrence emergency notification consisting of compact data can be transmitted reliably even when the notification destination has a small data processing capacity, emergency notification to mobile phones and the like is possible, as described for the influence determination / notification process (S528).

Furthermore, because the received detailed failure occurrence data is stored in chronological order as failure information, it can be used effectively as time-series data of the failure information 331 in the log analysis process (S560) by the log analysis unit 124, the failure cause estimation process (S570) by the failure cause estimation unit 125, the failure occurrence prediction process (S580) by the failure occurrence prediction unit 126, and so on.

[Log information acquisition / data classification process]
FIG. 13 is a flowchart showing an example of the procedure of the log information acquisition / data classification process (S550) performed by the data classification unit 123 of the state monitoring / analysis server 102.

As shown in FIG. 13, in the log information acquisition / data classification process (S550), when an execution start condition is satisfied (YES in S551), the data classification unit 123 of the state monitoring / analysis server 102 sends a data transmission request for the transmittable list to the information collection apparatus 101 via the data communication unit 121 (S552). In the example of FIG. 13, the execution start condition is, for example, either that a log information acquisition request has been made from the monitoring terminal 104 or 105, or that periodic execution has been configured in advance and the scheduled execution time has arrived.

When the data classification unit 123 receives the transmittable list from the information collection apparatus 101 via the data communication unit 121 in response to that request (YES in S553), it determines whether the received transmittable list contains any log information items that have not yet been received (S554). If such items exist (YES in S554), it sends the information collection apparatus 101, via the data communication unit 121, a data transmission request for the unreceived data of those log information items (S555).

If, after sending the data transmission request for the unreceived data, the data classification unit 123 receives that data from the requested information collection apparatus 101 via the data communication unit 121 (YES in S556), it restores any abbreviated data contained in the received data to its original form (S557). The data classification unit 123 then classifies the received log information item data into normal log information 321 and failure information 331, further classifies it by items such as monitored computer and monitored computer system, stores it in the data storage unit 122 in chronological order (S558), and sends a log information acquisition notification to the requesting monitoring terminal 104 or 105 (S559).

If there are no unreceived log information items (NO in S554), the data classification unit 123 sends, via the data communication unit 121, a log information non-acquisition notification to the information collection apparatus 101, indicating that there was no log information to acquire this time and that nothing was therefore acquired (S559).

As described above, by performing the log information acquisition / data classification process (S550) shown in FIG. 13 in the state monitoring / analysis server 102, the data of any log information item not yet received can be acquired as soon as it is detected from the transmittable list. The state monitoring / analysis server 102 can therefore reliably acquire all the log information it needs.
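The server-side check in S554 and the classification in S558 can be sketched in Python as follows. The function names, the record fields `time` and `is_failure`, and the result keys are hypothetical; the description does not state how a record is recognized as failure information.

```python
def unreceived_items(transmittable, received):
    """S554: names in the transmittable list that the server does not hold yet."""
    return [name for name in transmittable if name not in received]


def classify(records):
    """S558: split records into normal log information and failure
    information, each kept in chronological order."""
    classified = {"log_321": [], "failure_331": []}
    for record in sorted(records, key=lambda r: r["time"]):
        # Assumption: each record carries a boolean "is_failure" flag.
        key = "failure_331" if record.get("is_failure") else "log_321"
        classified[key].append(record)
    return classified
```

An empty result from `unreceived_items` corresponds to the NO branch of S554, where nothing is requested.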

Also, because the acquired log information data is classified into failure information and normal log information and stored in chronological order, it can be used effectively as time-series data of failure information and normal log information in the log analysis process (S560) by the log analysis unit 124, the failure cause estimation process (S570) by the failure cause estimation unit 125, the failure occurrence prediction process (S580) by the failure occurrence prediction unit 126, and so on.

[Log analysis process]
FIG. 14 is a flowchart showing an example of the procedure of the log analysis process (S560) performed by the log analysis unit 124 of the state monitoring / analysis server 102.

As shown in FIG. 14, in the log analysis process (S560), when an execution start condition for log analysis is met, for example when new failure information 331 or log information 311 has been acquired, or when a log analysis request has been received from the monitoring terminal 104 or 105 (YES in S561), the log analysis unit 124 of the state monitoring / analysis server 102 analyzes, along the time axis, the occurrence of not-yet-analyzed logs in the target failure information 331 or log information 311 (S562).

FIG. 15 shows an example 341a of a specific analysis result stored as analysis information 341. The horizontal axis represents elapsed time, and the vertical axis shows the result of analyzing, as event information, two kinds of values, the maximum and the minimum, for the CPU load, which is one state quantity of the monitored computer 202. Each event, shown as a dot, is a single state value representing that state quantity, calculated for each preset calculation unit time. For the CPU load maximum value, two changes A1 and A2 are shown.

In FIG. 15, for example, the calculation unit time is 5 minutes, and the average CPU load over 5 minutes is calculated as the single state value. That is, each event of the CPU load maximum value, shown by a dot in the upper row, is the 5-minute average of the maximum CPU load, and each event of the CPU load minimum value, shown by a dot in the lower row, is the 5-minute average of the minimum CPU load. The time axis spans, for example, from about 300 hours to about half a year.
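The per-unit-time reduction described for FIG. 15 can be sketched as below. The sample format, a tuple of (time in seconds, maximum CPU load, minimum CPU load), and the function name `window_events` are assumptions for this example; the description only states that one state value is computed per calculation unit time.

```python
from collections import defaultdict


def window_events(samples, unit_seconds=300):
    """Reduce raw samples to one event per calculation unit time.

    Each returned event is (window start time, average of the maximum
    CPU load, average of the minimum CPU load) for that window.
    """
    buckets = defaultdict(list)
    for t, cpu_max, cpu_min in samples:
        buckets[int(t // unit_seconds)].append((cpu_max, cpu_min))
    events = []
    for index in sorted(buckets):
        values = buckets[index]
        avg_max = sum(v[0] for v in values) / len(values)
        avg_min = sum(v[1] for v in values) / len(values)
        events.append((index * unit_seconds, avg_max, avg_min))
    return events
```

Each tuple in the result would correspond to one pair of dots in FIG. 15, one in the upper row and one in the lower row.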

FIG. 16 shows another example 341b of a specific analysis result stored as analysis information 341. The horizontal axis represents elapsed time, and along the vertical axis the following are shown side by side for the monitored computer 202: a failure graph (analysis result of failure information) 3411, an event graph (event analysis result of the CPU load maximum value) 3412, a state graph (analysis result of the CPU load average value) 3413, and a configuration graph (analysis result of configuration change information) 3414. For the CPU load maximum value in the event graph 3412, one change B1 is shown, and comparison with the failure graph 3411 shows that failure C1 occurs after this change B1.

The log analysis unit 124 stores the obtained analysis result in the data storage unit 122 as analysis information 341 (S563) and transmits it to the monitoring terminals 104 and 105 via the data communication unit 121 (S564).

As described above, by performing the log analysis process (S560) shown in FIG. 14 in the state monitoring / analysis server 102, the occurrence of not-yet-analyzed logs can be analyzed as soon as new time-series data of failure information or log information is acquired. Analysis information can therefore be reliably obtained for all acquired log information without wastefully analyzing the same log twice.

Also, because the analysis information obtained by analysis along the time axis is stored, it can be used effectively as past analysis information for comparison with the current analysis information in the failure cause estimation process (S570) by the failure cause estimation unit 125, the failure occurrence prediction process (S580) by the failure occurrence prediction unit 126, and so on.

Furthermore, using the time-series data on changes in the configuration and state of the monitored computer 202 contained in the log information 311 acquired from the information collection apparatus 101, changes in the configuration and state of the monitored computer 202 can also be analyzed along the time axis to obtain analysis information. That is, analysis information can be obtained not only for inherently time-series data, such as event log information that records the occurrence of various events with timestamps, but also, as time-series data, for changes in the configuration and state of the monitored computer 202, which are not inherently time-series data. Graphs such as those shown in FIGS. 15 and 16 can then be displayed on the monitoring terminals 104 and 105.

Therefore, in the failure cause estimation process (S570) by the failure cause estimation unit 125 and the failure occurrence prediction process (S580) by the failure occurrence prediction unit 126, both described later, time-series analysis information can easily be compared with past analysis information, for example by graph display, not only for inherently time-series data such as event log information but also for changes in the configuration and state of the monitored computer 202, which are not inherently time-series data.

[Failure cause estimation process]
FIG. 17 is a flowchart showing an example of the procedure of the failure cause estimation process (S570) performed by the failure cause estimation unit 125 of the state monitoring / analysis server 102.

As shown in FIG. 17, in the failure cause estimation process (S570), when an execution start condition for failure cause estimation is met, for example when the data classification unit 123 has performed the failure occurrence emergency notification process (S540) upon occurrence of a failure, or when a failure cause estimation request has been received from the monitoring terminal 104 or 105 (YES in S571), the failure cause estimation unit 125 of the state monitoring / analysis server 102 compares the target failure information 331 with past failure information 331 (S572). That is, the target failure information 331 is compared with past failure information 331 for the monitored computer in which the failure occurred and for other monitored computers.

In this comparison with past failure information, product information on the monitored computer 202 may also be obtained from the Web site 106 and compared with the target failure information 331. FIG. 18 shows an example of product information posted on a Web page 161 of the Web site 106. It shows product identification information 601, such as a product name, and a plurality of items of product phenomenon information 602a to 602c, each describing the cause of, and countermeasures for, some abnormal phenomenon. The product phenomenon information 602a to 602c corresponds to past failure information 331.

If the comparison shows that past failure information 331 similar to the target failure information 331 exists (YES in S573), the failure cause estimation unit 125 identifies candidate failure causes for the target failure information 331 on the basis of that past failure information (S574). Then, based on the configuration information and state information of the computer at the time of the similar past failure and at the time of the current failure, it calculates, for each candidate, a confidence level that the candidate is the cause (S575). As the configuration information and state information of the computer, the configuration change information and state change information converted to time-series data that are contained in the log information at the time of the failure, or their analysis information, can be used as appropriate.

Candidates whose confidence level is equal to or greater than a set value are taken as the failure cause estimation result, stored as part of the failure information 331 (S576), and transmitted to the requesting monitoring terminal 104 or 105 (S577). If no similar failure information 331 exists (NO in S573), an estimation error notification is sent to the requesting monitoring terminal 104 or 105 (S577).
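The description does not give a formula for the confidence level in S575. Purely as an illustration, the Python sketch below assumes it is the fraction of configuration and state attributes that match between the current failure and a similar past failure; the function names, the attribute dictionaries, and the threshold value are all hypothetical.

```python
def confidence(current, past):
    """Assumed measure: share of attributes with identical values."""
    keys = set(current) | set(past)
    if not keys:
        return 0.0
    matches = sum(
        1 for k in keys
        if k in current and k in past and current[k] == past[k]
    )
    return matches / len(keys)


def estimate_causes(current, past_cases, threshold=0.5):
    """S574 to S576: keep candidate causes whose confidence reaches the
    set value, highest confidence first."""
    results = []
    for case in past_cases:
        score = confidence(current, case["config"])
        if score >= threshold:
            results.append((case["cause"], score))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

An empty result here only means that no candidate reached the set value; the estimation error notification on the NO branch of S573 is decided earlier, when no similar past failure information exists at all.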

As described above, by performing the failure cause estimation process (S570) shown in FIG. 17 in the state monitoring / analysis server 102, when a failure occurs, the cause of the failure can be easily estimated by comparing the current failure information with information accumulated in the past. In addition, because the confidence level of each candidate cause is calculated from the configuration information and state information of the computers, the failure cause can be estimated more accurately according to the similarity of computer configuration and state. Furthermore, using the product information posted on the Web page 161 of the Web site 106 in the estimation allows the failure cause to be estimated more accurately.

[Failure occurrence prediction process]
FIG. 19 is a flowchart showing an example of the procedure of the failure occurrence prediction process (S580) performed by the failure occurrence prediction unit 126 of the state monitoring / analysis server 102.

As shown in FIG. 19, in the failure occurrence prediction process (S580), when an execution start condition for failure occurrence prediction is met, for example when a failure occurrence prediction request has been made from the monitoring terminal 104 or 105, or when periodic execution has been configured in advance and the scheduled execution time has arrived (YES in S581), the failure occurrence prediction unit 126 of the state monitoring / analysis server 102 compares the analysis information 341 for a certain period, given as the prediction target, with analysis information 341 from past failures (S582). Here, the "analysis information for a given period" is generally analysis information for a certain period that includes the most recently acquired analysis information.

If the comparison shows that past analysis information 341 similar to the analysis information 341 for that period exists (YES in S583), the failure occurrence prediction unit 126 further determines whether failure information 331 corresponding to the similar past analysis information 341 exists (S584).

For example, suppose that the analysis result 341a shown in FIG. 15 is given as the analysis information for a certain period for a certain monitored computer 202. Of the two changes A1 and A2 in the CPU load maximum value in the analysis result 341a of FIG. 15, the first change A1 is a discontinuous change; from the time-series data of configuration change information contained in the log information 321 of the same monitored computer 202, it is determined to be due to a configuration change and unrelated to a failure. In contrast, the continuous rise of the next change A2 is determined, according to preset comparison conditions or the like, to be a phenomenon that should be compared, and is compared with past analysis results 341.

Suppose further that, for the same monitored computer 202 or another monitored computer 202 with a similar configuration or operation, an analysis result 341b that includes an analysis result of failure information, as shown in FIG. 16, is stored as past analysis information 341. Then the change A2, a continuous rise in the CPU load maximum value in the analysis result 341a of FIG. 15, and the change B1, a continuous rise in the CPU load maximum value in the event graph 3412 of the analysis result 341b of FIG. 16, are determined to be similar. Moreover, failure information indicating the failures C1 and C2 that occur after the change B1 exists as the failure information corresponding to this analysis result 341b.

If failure information 331 corresponding to the similar past analysis information 341 exists (YES in S584), the failure occurrence prediction unit 126 calculates, based on that failure information 331, a confidence level for the likelihood that a failure will occur (S585). If there is a failure whose calculated confidence level is equal to or greater than a preset value (YES in S586), the failure occurrence prediction unit 126 stores a prediction result stating that the failure is likely to occur as prediction information 351 (S587) and transmits it to the requesting monitoring terminal 104 or 105 (S588).

For example, in the above example using the analysis results 341a and 341b of FIGS. 15 and 16, the prediction result indicates that the past analysis result 341b contains a change B1 similar to the change A2 in the given analysis result 341a, and that the failures C1 and C2, which occurred after the change B1, are likely to occur.

When no failure has a calculated certainty factor equal to or greater than the preset value (NO in S586), or when no failure information 331 corresponds to the similar past analysis information 341 (NO in S584), a prediction result indicating that a failure is unlikely to occur is stored as prediction information 351 (S587) and transmitted to the requesting monitoring terminals 104 and 105 (S588). When no similar analysis information 341 exists (NO in S583), a prediction error notification is transmitted to the requesting monitoring terminals 104 and 105 (S588).
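The branching in steps S583 to S588 can be illustrated with a minimal Python sketch. The description does not specify how similarity or the certainty factor is computed, so the `similarity` function (based on step-to-step differences), the two thresholds, the `PastAnalysis` record, and the rule that a failure's certainty equals the best similarity score of a past result that preceded it are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


@dataclass
class PastAnalysis:
    """Hypothetical record: a past state-value series and the failures that followed it."""
    series: Sequence[float]
    failures: List[str] = field(default_factory=list)


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Assumed measure: 1 / (1 + mean absolute gap between step-to-step changes)."""
    da = [y - x for x, y in zip(a, a[1:])]
    db = [y - x for x, y in zip(b, b[1:])]
    pairs = list(zip(da, db))
    if not pairs:
        return 0.0
    mean_gap = sum(abs(x - y) for x, y in pairs) / len(pairs)
    return 1.0 / (1.0 + mean_gap)


def predict(current: Sequence[float], history: List[PastAnalysis],
            sim_threshold: float = 0.8, conf_threshold: float = 0.9) -> dict:
    # S583: search for similar past analysis information.
    similar = [(similarity(current, p.series), p) for p in history]
    similar = [(s, p) for s, p in similar if s >= sim_threshold]
    if not similar:
        return {"status": "error", "failures": []}  # NO in S583: prediction error
    # S584: is there failure information for any similar past result?
    with_failures = [(s, p) for s, p in similar if p.failures]
    if not with_failures:
        return {"status": "low", "failures": []}    # NO in S584
    # S585: certainty per failure (assumed: best similarity score seen for it).
    certainty: Dict[str, float] = {}
    for s, p in with_failures:
        for name in p.failures:
            certainty[name] = max(certainty.get(name, 0.0), s)
    # S586: keep failures whose certainty reaches the preset value.
    likely = sorted(n for n, c in certainty.items() if c >= conf_threshold)
    if likely:
        return {"status": "high", "failures": likely}
    return {"status": "low", "failures": []}        # NO in S586


history = [PastAnalysis([12, 22, 32, 42], ["C1", "C2"]),
           PastAnalysis([10, 10, 10, 10])]
result = predict([10, 20, 30, 40], history)
print(result["status"], result["failures"])
```

In this sketch the returned dictionary stands in for the prediction information 351. Storing it (S587) and sending it to the monitoring terminals (S588) are left out.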

As described above, by performing the failure occurrence prediction process (S580) shown in FIG. 19, the state monitoring / analysis server 102 compares the given analysis information with previously accumulated information on the time axis and can thereby predict the possibility of a failure accurately. It is therefore possible to catch signs of an impending failure and to predict its content accurately before it occurs.

That is, in addition to the original time-series data indicating the occurrence of various events, changes to the configuration or state of a monitored computer are also regarded as event occurrences and edited into time-series data. All time-series data, including the time-series data describing the computer's configuration and state, can therefore be analyzed, and new and old analysis information can be compared. By comparing new and old analysis information that includes configuration and state information, past analysis information with a similar configuration and state can be detected accurately, and the failure information corresponding to that analysis information can be used to predict failures accurately.

In addition, because the failure information includes the failure cause estimation results obtained by the failure cause estimation process described above, using these estimation results allows failures to be predicted more accurately. It therefore becomes possible to predict not only the content of a failure but also its cause before it occurs, so that the person in charge who checks the prediction result on a monitoring terminal can take appropriate, cause-specific measures in advance.

The above example using the analysis results 341a and 341b of FIGS. 15 and 16 described a comparison of analysis information on the time axis for a single element, the maximum CPU load. A more accurate prediction result can be obtained by performing the same time-axis comparison for a plurality of elements and calculating the certainty factor for the possibility of failure according to the number of elements determined to be similar.
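A minimal Python sketch of the multi-element variant follows. The description says only that the certainty factor depends on the number of elements judged similar. The specific rule used here (fraction of shared elements judged similar), the trend-based `is_similar` test, and the element names are hypothetical choices for illustration.

```python
from typing import Dict, Sequence


def trend(series: Sequence[float], tol: float = 1.0) -> int:
    """Assumed classifier: +1 rising, -1 falling, 0 roughly flat over the period."""
    delta = series[-1] - series[0]
    if delta > tol:
        return 1
    if delta < -tol:
        return -1
    return 0


def is_similar(a: Sequence[float], b: Sequence[float]) -> bool:
    """Assumed similarity test: both series show the same overall trend."""
    return trend(a) == trend(b)


def certainty_from_elements(current: Dict[str, Sequence[float]],
                            past: Dict[str, Sequence[float]]) -> float:
    """Fraction of the elements present in both results that are judged similar."""
    shared = sorted(set(current) & set(past))
    if not shared:
        return 0.0
    hits = sum(1 for name in shared if is_similar(current[name], past[name]))
    return hits / len(shared)


# Hypothetical element names and values.
current = {"cpu_max": [10, 30, 60], "mem": [40, 41, 40], "disk_io": [5, 20, 50]}
past = {"cpu_max": [20, 40, 70], "mem": [30, 50, 80], "disk_io": [1, 10, 30]}
score = certainty_from_elements(current, past)
print(round(score, 3))
```

Here two of the three shared elements show the same trend, so the score is two thirds. A real implementation might weight elements differently or use a finer similarity measure.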

For failures that occurred in the past, the failure information may store the countermeasures taken against the failure in addition to the failure cause estimation result, so that information about countermeasures can be attached to the failure prediction result. This makes it possible to present not only the predicted failure content but also appropriate countermeasures for it, so the person in charge who checks the prediction result on a monitoring terminal can take the presented measures in advance. Information about countermeasures may also be obtained, as in the failure cause estimation process described above, from the product information posted on the Web page 161 of the Web site 106 as shown in FIG. 18.

[Second Embodiment]
FIG. 20 is a block diagram showing the configuration of a failure monitoring system according to a second embodiment to which the present invention is applied.

As shown in FIG. 20, the failure monitoring system according to this embodiment has the configuration of the failure monitoring system according to the first embodiment, except that the data communication means 111 of the information collection device 101 and the data communication means 121 of the state monitoring / analysis server 102 are connected via the monitoring intranet 103 instead of the Internet 100. The rest of the configuration is the same as in the first embodiment.

According to this embodiment, communication between the data communication means 111 of the information collection device 101 and the data communication means 121 of the state monitoring / analysis server 102 is carried over the monitoring intranet 103, so the failure monitoring system and method of the present invention can be applied even when the Internet 100 cannot be used. In addition, because the monitoring intranet 103 is used, a plurality of information collection devices 101 can be connected to a single state monitoring / analysis server 102.

[Third Embodiment]
FIG. 21 is a block diagram showing the configuration of a failure monitoring system according to a third embodiment to which the present invention is applied.

As shown in FIG. 21, the failure monitoring system according to this embodiment has the configuration of the failure monitoring system according to the first embodiment, except that the information collection device 101 is implemented as a separate program within the same computer as the state monitoring / analysis server 102, and the data communication means 111 of the information collection device 101 and the data communication means 121 of the state monitoring / analysis server 102 are connected via inter-program communication. The rest of the configuration is the same as in the first embodiment.

According to this embodiment, the information collection device 101 and the state monitoring / analysis server 102 can be implemented on a single computer, which simplifies the system configuration and reduces cost.

Furthermore, as a modification, by also connecting to an information collection device 101 implemented on another computer, the state monitoring / analysis server 102 can be connected to a plurality of information collection devices 101, including not only the information collection device 101 in the same computer but also other information collection devices 101.

[Other Embodiments]
The present invention is not limited to the embodiments described above, and a wide variety of other modifications can be implemented within the scope of the invention.

For example, the specific configurations, processing procedures, and processing details of the information collection device, the state monitoring / analysis server, and the means that constitute them, as shown in the above embodiments, are merely examples. That is, as long as the information collection device edits changes in the computer's configuration or state obtained from the collected log information into time-series data and replaces identical event content data in the log information with abbreviated data, and the state monitoring / analysis server compares the analysis information of new and old time-series data to predict the occurrence of failures, the specific configurations, processing procedures, and processing details can be changed freely.
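The replacement of identical event content data with abbreviated data on the information collection device, and its restoration on the state monitoring / analysis server, might be sketched in Python as follows. The record layout (`"full"` and `"ref"` entries that refer back to the first occurrence by index) is a hypothetical encoding chosen for illustration. The actual format of the abbreviated data is not given in this passage.

```python
from typing import List, Tuple, Union

Event = Tuple[int, str]                         # (timestamp, event content)
Edited = Tuple[int, str, Union[str, int]]       # (timestamp, "full" | "ref", value)


def abbreviate(events: List[Event]) -> List[Edited]:
    """Collector side: keep the first occurrence, replace repeats with a short reference."""
    first_index = {}                            # event content -> index of first occurrence
    edited: List[Edited] = []
    for ts, content in events:
        if content in first_index:
            edited.append((ts, "ref", first_index[content]))
        else:
            first_index[content] = len(first_index)
            edited.append((ts, "full", content))
    return edited


def restore(edited: List[Edited]) -> List[Event]:
    """Server side: expand each reference back to the original event content."""
    seen: List[str] = []                        # full contents in order of first appearance
    restored: List[Event] = []
    for ts, kind, value in edited:
        if kind == "full":
            seen.append(value)
            restored.append((ts, value))
        else:
            restored.append((ts, seen[value]))
    return restored


# Hypothetical log entries.
events = [(1, "disk error on /dev/sda"), (2, "link down"), (3, "disk error on /dev/sda")]
edited = abbreviate(events)
print(edited[2])
print(restore(edited) == events)
```

Because each reference points to an entry earlier in the same stream, the server in this sketch can restore the data without a separately transmitted lookup table.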

The present invention is best suited as a failure monitoring system for monitoring computer systems used for plant monitoring and the like, but it is equally applicable to various other network-connected computer systems, with the same excellent effects.

FIG. 1 is a block diagram showing the configuration of a failure monitoring system according to a first embodiment to which the present invention is applied.
FIG. 2 is a block diagram showing a configuration example of the monitored computer shown in FIG. 1, in which (a) shows the hardware configuration and (b) shows the software configuration.
FIG. 3 is a data structure diagram showing an example of the computer's own log information stored in the monitored computer shown in FIG. 1.
FIG. 4 is a block diagram showing a functional configuration example of a monitored computer system with multiplexed functions, as an example of the monitored computer system shown in FIG. 1.
FIG. 5 is a flowchart outlining the operation of the failure monitoring system according to the first embodiment during normal operation and the data communicated.
FIG. 6 is a flowchart showing an example of the procedure of the log information transmission process shown in FIG. 5.
FIG. 7 is a flowchart showing an example of the procedure of the data collection / editing process shown in FIG. 5.
FIG. 8 is a data structure diagram showing the storage format of the time-series data edited by the data collection / editing process shown in FIG. 7.
FIG. 9 is a flowchart showing an example of the procedure of the influence degree determination / notification process shown in FIG. 7.
FIG. 10 is a flowchart showing an example of the procedure of the request data transmission process shown in FIG. 5.
FIG. 11 is a block diagram showing the processes performed by the state monitoring / analysis server shown in FIG. 5 and the data acquired, stored, or used in those processes.
FIG. 12 is a flowchart showing an example of the procedure of the failure occurrence emergency notification process shown in FIG. 5.
FIG. 13 is a flowchart showing an example of the procedure of the log information acquisition / data classification process shown in FIG. 5.
FIG. 14 is a flowchart showing an example of the procedure of the log analysis process shown in FIG. 5.
FIG. 15 is a diagram showing an example of a specific analysis result stored as analysis information by the log analysis process shown in FIG. 14.
FIG. 16 is a diagram showing another example of a specific analysis result stored as analysis information by the log analysis process shown in FIG. 14.
FIG. 17 is a flowchart showing an example of the procedure of the failure cause estimation process shown in FIG. 5.
FIG. 18 is a diagram showing an example of the product information posted on the Web page of the Web site shown in FIG. 1.
FIG. 19 is a flowchart showing an example of the procedure of the failure occurrence prediction process shown in FIG. 5.
FIG. 20 is a block diagram showing the configuration of a failure monitoring system according to a second embodiment to which the present invention is applied.
FIG. 21 is a block diagram showing the configuration of a failure monitoring system according to a third embodiment to which the present invention is applied.

Explanation of symbols

100 ... Internet
101 ... Information collection device
102 ... State monitoring / analysis server
103 ... Monitoring intranet
104, 105 ... Monitoring terminal
106 ... Web site
111 ... Data communication means
112 ... Data editing means
113 ... Log storage means
121 ... Data communication means
122 ... Data storage means
123 ... Data classification means
124 ... Log analysis means
125 ... Failure cause estimation means
126 ... Failure occurrence prediction means
161 ... Web page
201 ... Monitored computer system
202 ... Monitored computer
203 ... Network
301 ... Log information
302 ... Configuration information
303 ... State information
304 ... Event log information
311 ... Log information
321 ... (Normal) log information
331 ... Failure information
341 ... Analysis information
351 ... Prediction information

Claims

In a failure monitoring system that monitors the occurrence of a failure in a monitored computer system composed of a plurality of monitored computers connected via a network,
An information collection device for collecting, editing and storing log information of the computer stored in the monitored computer;
And a state monitoring / analysis server that analyzes the log information stored in the information collection device to obtain analysis information and, based on the log information and the analysis information, estimates the cause when a failure occurs and predicts the occurrence of failures, are provided,
The information collecting device includes:
Data communication means for receiving log information of the computer from the monitoring target computer and communicating data including log information with the state monitoring / analysis server;
Data editing means for comparing the configuration information or status information of the monitored computer included in the received log information with the corresponding past information, treating any change in configuration or status as the occurrence of an event and editing it, together with its time, as time-series data, and, when the received event information contains identical event content data, replacing that data with abbreviated data representing the event content data;
Having log storage means for storing the edited log information;
The state monitoring / analysis server
Data communication means for communicating data including log information with the information collection device;
Data storage means for storing the received log information and various data acquired in the server;
Data classification means for, when the received log information includes the abbreviated data, editing the log information to restore the abbreviated data to the original event content data, and for classifying information in the log information that indicates the occurrence of a failure as failure information and storing it in the data storage means;
Log analysis means for analyzing the occurrence of the log on the time axis using the received log information to acquire analysis information, and storing the acquired analysis information in the data storage means;
Failure cause estimation means for, with respect to the failure indicated by the failure information, detecting candidates for the cause of the failure from the failure information recorded when similar failures occurred in the past, calculating a certainty factor for the possibility that each candidate is the cause, and storing the failure cause estimation result obtained from the calculation in the data storage means as part of the failure information indicating the occurrence of the failure; and
Failure occurrence prediction means for determining, by comparing data on the time axis, whether past analysis information is similar to the analysis information for a given period and, when it is determined to be similar, using the failure information corresponding to that past analysis information to calculate a certainty factor for the possibility that the failure indicated by the failure information will occur, and storing the prediction result obtained from the calculation in the data storage means as prediction information,
Wherein the failure occurrence prediction means is configured to compare, on the time axis, the analysis information for the given period with the past analysis information for a plurality of elements including various state quantities of the monitored computer, and to calculate the certainty factor for the possibility of the failure occurring according to the number of elements determined to be similar.

In a failure monitoring system that monitors the occurrence of a failure in a monitored computer system composed of a plurality of monitored computers connected via a network,
An information collection device for collecting, editing and storing log information of the computer stored in the monitored computer;
And a state monitoring / analysis server that analyzes the log information stored in the information collection device to obtain analysis information and, based on the log information and the analysis information, estimates the cause when a failure occurs and predicts the occurrence of failures, are provided,
The information collecting device includes:
Data communication means for receiving log information of the computer from the monitoring target computer and communicating data including log information with the state monitoring / analysis server;
Data editing means for comparing the configuration information or status information of the monitored computer included in the received log information with the corresponding past information, treating any change in configuration or status as the occurrence of an event and editing it, together with its time, as time-series data, and, when the received event information contains identical event content data, replacing that data with abbreviated data representing the event content data;
Log storage means for storing the edited log information;
And has influence degree determination / notification means that, when the received or edited log information includes information indicating the occurrence of a failure, calculates the degree of influence that the failure has on the monitored computer system and, when the calculated degree of influence is equal to or greater than a preset value, notifies the state monitoring / analysis server of the occurrence of the failure via the data communication means,
The state monitoring / analysis server
Data communication means for communicating data including log information with the information collection device;
Data storage means for storing the received log information and various data acquired in the server;
Data classification means for, when the received log information includes the abbreviated data, editing the log information to restore the abbreviated data to the original event content data, and for classifying information in the log information that indicates the occurrence of a failure as failure information and storing it in the data storage means;
Log analysis means for analyzing the occurrence of the log on the time axis using the received log information to acquire analysis information, and storing the acquired analysis information in the data storage means;
Failure cause estimation means for, with respect to the failure indicated by the failure information, detecting candidates for the cause of the failure from the failure information recorded when similar failures occurred in the past, calculating a certainty factor for the possibility that each candidate is the cause, and storing the failure cause estimation result obtained from the calculation in the data storage means as part of the failure information indicating the occurrence of the failure; and
Failure occurrence prediction means for determining, by comparing data on the time axis, whether past analysis information is similar to the analysis information for a given period and, when it is determined to be similar, using the failure information corresponding to that past analysis information to calculate a certainty factor for the possibility that the failure indicated by the failure information will occur, and storing the prediction result obtained from the calculation in the data storage means as prediction information.

The failure monitoring system according to claim 2, wherein, in the information collection device,
The influence degree determination / notification means is configured so that, when notifying the state monitoring / analysis server of the occurrence of the failure, it transmits via the data communication means, as transmission data, only data indicating the gist of the occurrence of the failure rather than data indicating its details, and transmits the data indicating the details of the occurrence of the failure when a request for detailed data is received from the state monitoring / analysis server in response to the notification by that transmission data.

The failure monitoring system according to claim 2 or claim 3, wherein, in the information collection device,
The influence degree determination / notification means is configured so that, when calculating the degree of influence of the occurrence of the failure, it determines whether the monitored computer system includes another monitored computer capable of substituting for the function of the monitored computer in which the failure has occurred, and determines the degree of influence according to the result of that determination.

The failure monitoring system according to any one of claims 2 to 4, wherein the state monitoring / analysis server
Has notification means that, upon receiving the notification of the occurrence of the failure from the information collection device, selects a notification destination from a plurality of preset notification destinations according to the degree of influence of the failure, and notifies the selected destination of the occurrence of the failure via the data communication means.

The failure monitoring system according to any one of claims 1 to 5, wherein, in the state monitoring / analysis server,
The log analysis means is configured to calculate, for data in the log information of the monitored computer that indicates an arbitrary state quantity, a single state value representing that state quantity for each preset calculation unit time, and to store data indicating the state values of a plurality of consecutive calculation unit times in the data storage means as analysis information for that state quantity, and
The single state value representing the state quantity for a calculation unit time is a value selected from among the average value, the maximum value, and the minimum value.

The failure monitoring system according to any one of claims 1 to 6, wherein, in the state monitoring / analysis server,
The log analysis means is configured to calculate, for data in the log information of the monitored computer that indicates a repeatedly occurring event, the frequency of occurrence within each preset event unit time, and to store data indicating the frequencies of a plurality of consecutive event unit times in the data storage means as analysis information for that event.

The failure monitoring system according to any one of claims 1 to 7, wherein the state monitoring / analysis server
Has failure notification means that, when the data received from the information collection device includes information indicating the occurrence of a failure requiring an urgent response, notifies a monitoring terminal connected to the state monitoring / analysis server of the occurrence of that failure.

The failure monitoring system according to any one of claims 1 to 8, wherein the information collection device and the state monitoring / analysis server are connected via a wide area network or a LAN.

The failure monitoring system according to any one of claims 1 to 8, wherein the data communication means of the information collection device and the data communication means of the state monitoring / analysis server are connected via inter-program communication.

In a failure monitoring method for monitoring the occurrence of a failure in a monitored computer system including a plurality of monitored computers connected via a network,
An information collection device for collecting, editing and storing log information of the computer stored in the monitored computer;
And using a state monitoring / analysis server that analyzes the log information stored in the information collection device to acquire analysis information and, based on the log information and the analysis information, estimates the cause when a failure occurs and predicts the occurrence of failures,
By the information collecting device,
A data communication step of receiving log information of the computer from the monitoring target computer and performing communication of data including log information with the state monitoring / analysis server;
A data editing step of comparing the configuration information or status information of the monitored computer included in the received log information with the corresponding past information, treating any change in configuration or status as the occurrence of an event and editing it, together with its time, as time-series data, and, when the received log information contains identical event content data, replacing that data with abbreviated data representing the event content data,
Perform a log saving step to save the edited log information,
By the state monitoring / analysis server,
A data communication step of communicating data including log information with the information collection device;
A data storage step for storing the received log information and various data acquired in the server;
A data classification step of, when the received log information includes the abbreviated data, editing the log information to restore the abbreviated data to the original event content data, and of classifying information in the log information that indicates the occurrence of a failure as failure information and storing it in the data storage step;
A log analysis step for analyzing the occurrence of the log on the time axis using the received log information to acquire analysis information, and storing the acquired analysis information in the data storage step;
A failure cause estimation step of, with respect to the failure indicated by the failure information, detecting candidates for the cause of the failure from the failure information recorded when similar failures occurred in the past, calculating a certainty factor for the possibility that each candidate is the cause, and storing the failure cause estimation result obtained from the calculation in the data storage step as part of the failure information indicating the occurrence of the failure; and
A failure occurrence prediction step of determining, by comparing data on the time axis, whether past analysis information is similar to the analysis information for a given period and, when it is determined to be similar, using the failure information corresponding to that past analysis information to calculate a certainty factor for the possibility that the failure indicated by the failure information will occur, and storing the prediction result obtained from the calculation in the data storage step as prediction information, are performed,
Wherein, in the failure occurrence prediction step, the analysis information for the given period and the past analysis information are compared on the time axis for a plurality of elements including various state quantities of the monitored computer, and the certainty factor for the possibility of the failure occurring is calculated according to the number of elements determined to be similar.

In a failure monitoring method for monitoring the occurrence of a failure in a monitored computer system including a plurality of monitored computers connected via a network,
An information collection device for collecting, editing and storing log information of the computer stored in the monitored computer;
And using a state monitoring / analysis server that analyzes the log information stored in the information collection device to acquire analysis information and, based on the log information and the analysis information, estimates the cause when a failure occurs and predicts the occurrence of failures,
By the information collecting device,
A data communication step of receiving log information of the computer from the monitoring target computer and performing communication of data including log information with the state monitoring / analysis server;
The configuration information or status information of the monitored computer included in the received log information is compared with the corresponding past information, and when the configuration or status has changed, the change is treated as the occurrence of an event and edited, together with its time, as time-series data, and when the received log information includes identical event content data, a data editing step of editing by replacing that data with abbreviated data representing the event content data,
A log saving step for saving the edited log information;
When the received or edited log information includes information indicating the occurrence of a failure, the degree of influence that the occurrence of the failure has on the monitored computer system is calculated, and the calculated degree of influence is greater than or equal to a preset value If it is, the influence determination / notification step of notifying the occurrence of the failure to the state monitoring / analysis server by the data communication step is performed,
By the state monitoring / analysis server,
A data communication step of communicating data including log information with the information collection device;
A data storage step for storing the received log information and various data acquired in the server;
If the received log information includes the abbreviated data, editing is performed to restore the abbreviated data to the original event content data, and a data classification step of classifying information indicating the occurrence of a failure that is included in the log information as failure information and storing it in the data storage step;
A log analysis step for analyzing the occurrence of logs on the time axis using the received log information to acquire analysis information, and storing the acquired analysis information in the data storage step;
For the occurrence of the failure indicated by the failure information, candidates for the cause of the failure are detected from the failure information of similar failures that occurred in the past, the certainty of the possibility that each candidate is the cause is calculated, and a failure cause estimation step of storing the failure cause estimation result obtained according to the calculation result in the data storage step, as part of the failure information indicating the occurrence of the failure;
Whether past analysis information is similar to the analysis information for a given period is determined by comparing the data on the time axis, and if they are determined to be similar, the failure information corresponding to the past analysis information is used to calculate the certainty of the possibility of occurrence of the failure indicated by that failure information, and a failure occurrence prediction step of storing the prediction result obtained according to the calculation result as prediction information in the data storage step is performed. A fault monitoring method characterized by the above.
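
The influence determination / notification step of this claim can be sketched as a threshold check on the collecting side. The claim does not say how the degree of influence is computed, so the per-host weight table, the default weight, and the names `LogRecord`, `degree_of_influence` and `determine_and_notify` are hypothetical choices for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class LogRecord:
    host: str          # monitored computer that produced the record
    message: str       # log text
    is_failure: bool   # True if the record indicates a failure


# Hypothetical weights: how strongly a failure on each monitored computer
# affects the monitored computer system as a whole (0.0 to 1.0).
HOST_WEIGHT: Dict[str, float] = {"db-server": 0.9, "operator-console": 0.3}


def degree_of_influence(record: LogRecord, weights: Dict[str, float],
                        default: float = 0.5) -> float:
    """Assumed rule: the influence of a failure is the weight of its host."""
    return weights.get(record.host, default)


def determine_and_notify(records: Iterable[LogRecord],
                         weights: Dict[str, float],
                         threshold: float,
                         notify: Callable[[LogRecord], None]) -> int:
    """Notify the server of each failure whose influence is >= threshold.

    `notify` stands in for the data communication step; the function
    returns the number of notifications sent.
    """
    sent = 0
    for record in records:
        if not record.is_failure:
            continue
        if degree_of_influence(record, weights) >= threshold:
            notify(record)
            sent += 1
    return sent


records = [
    LogRecord("db-server", "heartbeat ok", False),
    LogRecord("operator-console", "display driver error", True),
    LogRecord("db-server", "disk I/O error", True),
]
notified = []
sent_count = determine_and_notify(records, HOST_WEIGHT, 0.5, notified.append)
```

With the preset value 0.5, only the `db-server` failure is forwarded; the lower-weight `operator-console` failure remains in the saved log without an immediate notification.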

In a fault monitoring program for realizing by a computer a fault monitoring system that monitors the occurrence of a fault in a monitored computer system composed of a plurality of monitored computers connected via a network,
Where the fault monitoring system comprises
An information collection device for collecting, editing and storing log information of the computer stored in the monitored computer, and
A state monitoring / analysis server for analyzing the log information stored in the information collecting device to obtain analysis information, and for estimating the cause when a failure occurs and predicting the occurrence of a failure based on the log information and the analysis information,
In the computer constituting the information collecting apparatus,
A data communication function for receiving log information of the computer from the monitoring target computer and performing communication of data including log information with the state monitoring / analysis server;
The configuration information or status information of the monitored computer included in the received log information is compared with the corresponding past information, and when the configuration or status has changed, the change is treated as the occurrence of an event and edited, together with its time, as time-series data, and when the received log information includes identical event content data, a data editing function for editing by replacing that data with abbreviated data representing the event content data, and
A log saving function for saving the edited log information are realized,
In the computer constituting the state monitoring / analysis server,
A data communication function for communicating data including log information with the information collection device;
A data storage function for storing received log information and various data acquired in the server;
If the received log information includes the abbreviated data, editing is performed to restore the abbreviated data to the original event content data, and a data classification function for classifying information indicating the occurrence of a failure that is included in the log information as failure information and storing it by the data storage function;
A log analysis function for analyzing the occurrence of a log on the time axis using the received log information to acquire analysis information, and storing the acquired analysis information by the data storage function;
For the occurrence of the failure indicated by the failure information, candidates for the cause of the failure are detected from the failure information of similar failures that occurred in the past, the certainty of the possibility that each candidate is the cause is calculated, and a failure cause estimation function for storing the failure cause estimation result obtained according to the calculation result by the data storage function, as part of the failure information indicating the occurrence of the failure;
Whether past analysis information is similar to the analysis information for a given period is determined by comparing the data on the time axis, and if they are determined to be similar, the failure information corresponding to the past analysis information is used to calculate the certainty of the possibility of occurrence of the failure indicated by that failure information, and a failure occurrence prediction function for storing the prediction result obtained according to the calculation result as prediction information by the data storage function is realized,
By the failure occurrence prediction function, for a plurality of elements including various state quantities of the monitored computer, the analysis information of the given period and the past analysis information are compared on the time axis, and the certainty of the possibility of occurrence of the failure is calculated according to the number of elements determined to be similar. A failure monitoring program characterized by the above.
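
The pairing of the data editing function (replacing identical event content data with abbreviated data) and the data classification function (restoring it on the server) can be sketched as a simple round trip. The claim does not define the form of the abbreviated data; the sketch assumes a fixed marker that means "same content as the preceding event", and the names `ABBREVIATED`, `abbreviate` and `restore` are hypothetical.

```python
from typing import List, Optional, Tuple

Event = Tuple[str, str]  # (timestamp, event content)

# Hypothetical marker standing in for the abbreviated data. It is assumed
# never to occur as real event content.
ABBREVIATED = "<same>"


def abbreviate(events: List[Event]) -> List[Event]:
    """Collecting side: replace content identical to the previous event."""
    edited: List[Event] = []
    previous: Optional[str] = None
    for timestamp, content in events:
        if content == previous:
            edited.append((timestamp, ABBREVIATED))
        else:
            edited.append((timestamp, content))
            previous = content
    return edited


def restore(events: List[Event]) -> List[Event]:
    """Server side: expand each marker back to the original content."""
    restored: List[Event] = []
    previous: Optional[str] = None
    for timestamp, content in events:
        if content == ABBREVIATED:
            if previous is None:
                raise ValueError("abbreviated data with no preceding event")
            restored.append((timestamp, previous))
        else:
            restored.append((timestamp, content))
            previous = content
    return restored


log = [
    ("10:00:00", "disk full"),
    ("10:00:05", "disk full"),
    ("10:00:10", "link down"),
    ("10:00:15", "disk full"),
]
edited_log = abbreviate(log)
round_trip = restore(edited_log)
```

Only the second entry is abbreviated here, because the fourth "disk full" follows a different event; restoring the edited log reproduces the original sequence, which is what lets the server analyze complete event content while the collecting device stores and transmits less data.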

In a fault monitoring program for realizing by a computer a fault monitoring system that monitors the occurrence of a fault in a monitored computer system composed of a plurality of monitored computers connected via a network,
Where the fault monitoring system comprises
An information collection device for collecting, editing and storing log information of the computer stored in the monitored computer, and
A state monitoring / analysis server for analyzing the log information stored in the information collecting device to obtain analysis information, and for estimating the cause when a failure occurs and predicting the occurrence of a failure based on the log information and the analysis information,
In the computer constituting the information collecting apparatus,
A data communication function for receiving log information of the computer from the monitoring target computer and performing communication of data including log information with the state monitoring / analysis server;
The configuration information or status information of the monitored computer included in the received log information is compared with the corresponding past information, and when the configuration or status has changed, the change is treated as the occurrence of an event and edited, together with its time, as time-series data, and when the received log information includes identical event content data, a data editing function for editing by replacing that data with abbreviated data representing the event content data,
Log saving function to save the edited log information,
When the received or edited log information includes information indicating the occurrence of a failure, the degree of influence that the occurrence of the failure has on the monitored computer system is calculated, and the calculated degree of influence is greater than or equal to a preset value If it is, the impact determination / notification function for notifying the occurrence of the failure to the state monitoring / analysis server by the data communication function is realized,
In the computer constituting the state monitoring / analysis server,
A data communication function for communicating data including log information with the information collection device;
A data storage function for storing received log information and various data acquired in the server;
If the received log information includes the abbreviated data, editing is performed to restore the abbreviated data to the original event content data, and a data classification function for classifying information indicating the occurrence of a failure that is included in the log information as failure information and storing it by the data storage function;
A log analysis function for analyzing the occurrence of logs on the time axis using the received log information to acquire analysis information, and storing the acquired analysis information by the data storage function;
For the occurrence of the failure indicated by the failure information, candidates for the cause of the failure are detected from the failure information of similar failures that occurred in the past, the certainty of the possibility that each candidate is the cause is calculated, and a failure cause estimation function for storing the failure cause estimation result obtained according to the calculation result by the data storage function, as part of the failure information indicating the occurrence of the failure;
Whether past analysis information is similar to the analysis information for a given period is determined by comparing the data on the time axis, and if they are determined to be similar, the failure information corresponding to the past analysis information is used to calculate the certainty of the possibility of occurrence of the failure indicated by that failure information, and a failure occurrence prediction function for storing the prediction result obtained according to the calculation result as prediction information by the data storage function is realized. A fault monitoring program characterized by the above.
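
The failure cause estimation function can be sketched as a lookup over past failure information. The claim does not specify how similar failures are identified or how the certainty is computed; the sketch assumes that failures of the same type count as similar and that the certainty of a candidate is its relative frequency among them. The function name `estimate_causes` and the sample data are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

# Past failure information as (failure type, confirmed cause) pairs.
PastFailure = Tuple[str, str]


def estimate_causes(failure_type: str,
                    history: List[PastFailure]) -> List[Tuple[str, float]]:
    """Return candidate causes with a certainty, most likely first.

    Assumptions: past failures of the same type are treated as "similar",
    and the certainty of a candidate is the share of those past failures
    that were attributed to it.
    """
    causes = [cause for ftype, cause in history if ftype == failure_type]
    total = len(causes)
    if total == 0:
        return []
    counts = Counter(causes)
    return [(cause, n / total) for cause, n in counts.most_common()]


history = [
    ("disk error", "bad sector"),
    ("disk error", "bad sector"),
    ("disk error", "cable fault"),
    ("network timeout", "switch down"),
]
candidates = estimate_causes("disk error", history)
```

For a new "disk error", the sketch ranks "bad sector" (two of three past cases) ahead of "cable fault" (one of three). The resulting list corresponds to the estimation result that the claim stores as part of the failure information.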