JP2005327261A

JP2005327261A - Performance monitoring device, performance monitoring method and program

Info

Publication number: JP2005327261A
Application number: JP2005114821A
Authority: JP
Inventors: Yoshifumi Sakai; 良文坂井; Yoshitaka Ikeda; 佳隆池田; Tomokazu Shindo; 朋和進藤; Yuichi Yokoyama; 雄一横山
Original assignee: NS Solutions Corp
Current assignee: NS Solutions Corp
Priority date: 2004-04-16
Filing date: 2005-04-12
Publication date: 2005-11-24
Anticipated expiration: 2025-04-12
Also published as: JP4980581B2

Abstract

PROBLEM TO BE SOLVED: To accurately detect or predict complex failures that can arise in an information processing system in which a plurality of information processors operate in cooperation, for example a shortage of resources such as memory relative to the distribution of processing load among the information processors or to the number of transactions generated.

SOLUTION: A monitoring data acquisition part 1001 acquires monitoring data indicating the operation states of the plurality of information processors and the data communication states of each communication line connecting the information processors. Based on the monitoring data acquired by the monitoring data acquisition part 1001, a failure detection/prediction part 1006 detects a failure currently occurring in the information processing system or predicts the probability that a failure will occur in the information processing system in the future.

COPYRIGHT: (C)2006,JPO&NCIPI

Description

The present invention relates to a performance monitoring apparatus, a performance monitoring method, and a program that monitor the operation of an information processing system in which a plurality of information processing apparatuses operate in cooperation, and that detect or predict the occurrence of a failure in the information processing system.

Methods for monitoring apparatus failures and for performing operation management have been proposed. For example, Patent Document 1 discloses a system in which a table holding a failure prediction algorithm and failure detection parameters is stored in memory, and customer names, product names, model numbers, maintenance histories, failure histories, and the like are stored in a database. Using the failure prediction algorithm, the system sends a notification e-mail when the information stored in the database satisfies a failure occurrence condition. Patent Document 2 discloses an apparatus that actively collects and analyzes the hardware state and the operating status of programs and, when there is a risk of operational trouble, issues an instruction for avoiding the failure.

JP 2001-84276 A; JP 9-311733 A

The invention disclosed in Patent Document 1 predicts the occurrence of a failure by monitoring a specific apparatus, but it assumes that the monitoring target is only the apparatus itself. In a system in which a plurality of functions operate in cooperation, such as a three-tier web system consisting of a web server, an application server, and a database server, failures can arise from various causes, for example a shortage of resources such as memory relative to the distribution of processing load among the apparatuses or to the number of transactions generated. The invention disclosed in Patent Document 1 does not take this point into consideration at all.

In the invention disclosed in Patent Document 2, an information collection unit collects information based on the hardware and software operating information to be collected, which is stored in a knowledge base storage device, and the collected information is used to output instructions for corrective action based on empirical rules. In the invention disclosed in Patent Document 2 as well, the monitoring target is only the computer itself, and nothing is said about the failures described above that can occur in a system in which a plurality of computers operate in cooperation.

As described above, conventional monitoring and operation management systems were able to monitor individual computers, but they did not anticipate predicting the more complex failures that arise when, as is common today, a plurality of computers operate in cooperation. When monitoring complex computer systems, detecting and predicting failures and isolating their causes was therefore often difficult or time-consuming.

Accordingly, an object of the present invention is to make it possible to accurately detect or predict complex failures that can occur in an information processing system in which a plurality of information processing apparatuses operate in cooperation, for example a shortage of resources such as memory relative to the distribution of processing load among the information processing apparatuses or to the number of transactions generated.

The performance monitoring apparatus of the present invention is a performance monitoring apparatus that monitors the performance of an information processing system in which a plurality of information processing apparatuses operate in cooperation, and is characterized by comprising: monitoring means for monitoring the operating status of the plurality of information processing apparatuses and the data communication status of each communication line connecting the plurality of information processing apparatuses; and failure detection/prediction means for, based on monitoring data from the monitoring means, detecting a failure currently occurring in the information processing system or predicting the possibility that a failure will occur in the information processing system in the future.

The performance monitoring method of the present invention is a performance monitoring method performed by a performance monitoring apparatus that monitors the performance of an information processing system in which a plurality of information processing apparatuses operate in cooperation, and is characterized by including: a monitoring step of monitoring the operating status of the plurality of information processing apparatuses and the data communication status of each communication line connecting the plurality of information processing apparatuses; and a failure detection/prediction step of, based on monitoring data from the monitoring step, detecting a failure currently occurring in the information processing system or predicting the possibility that a failure will occur in the information processing system in the future.

The program of the present invention is characterized by causing a computer to execute the performance monitoring method.

According to the present invention, the operating status of the plurality of information processing apparatuses constituting the information processing system and the data communication status of each communication line connecting those apparatuses are monitored. This makes it possible, for example, to detect or predict the occurrence of a failure from the fact that more or fewer resources are being used, relative to the volume of transactions generated, than would be used if processing among the information processing apparatuses were operating normally. By detecting on which server this phenomenon occurs, it is also possible to know in which part of the system the failure is occurring. As a result, complex failures that can occur in an information processing system in which a plurality of information processing apparatuses operate in cooperation can be detected or predicted accurately.

Hereinafter, a preferred first embodiment to which the present invention is applied will be described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram schematically showing the configuration of a performance monitoring system according to the first embodiment of the present invention. In FIG. 1, the performance monitoring system of this embodiment includes a performance monitoring apparatus 10, a Web server 11, an AP (application) server 12, and a DB (database) server 13. The performance monitoring apparatus 10 is connected, via a communication line such as a LAN (Local Area Network), to the information processing system composed of the Web server 11, the AP server 12, and the DB server 13, and can monitor the state of each server via this communication line.

The performance monitoring apparatus 10 of this embodiment is composed of a storage server 101 and an analysis server 102. By monitoring each server, the storage server 101 acquires resource usage data indicating the usage amount and usage rate of resources such as the CPU and memory of each server, log data indicating the processing history, and the like. It also acquires transaction data indicating the throughput, processing names, and the like of the transactions communicated over each communication line connecting the Web server 11, the AP server 12, and the DB server 13. Each of these is accumulated internally as monitoring data. The storage server 101 also accumulates information that can be obtained either by monitoring the servers or by monitoring the communication lines, such as the response time to a given processing command. Based on the monitoring data accumulated in the storage server 101, the analysis server 102 detects a failure currently occurring in the information processing system or predicts a failure that may occur in the information processing system in the future.

As described above, in this embodiment the performance monitoring apparatus 10 monitors both the operating status of each of the plurality of apparatuses and the data communication status of each communication line connecting them. This makes it possible to accurately detect or predict failures that occur in an information processing system in which a plurality of information processing apparatuses operate in cooperation.

FIG. 2 is a diagram schematically showing the hardware configuration of the computer system in the performance monitoring apparatus 10 (storage server 101, analysis server 102).
As shown in FIG. 2, the computer system 1200 has a configuration in which a CPU 1201, a ROM 1202, a RAM 1203, a keyboard controller (KBC) 1205 for a keyboard (KB) 1209, a CRT controller (CRTC) 1206 for a CRT display (CRT) 1210 serving as a display unit, a disk controller (DKC) 1207 for a hard disk (HD) 1211 and a flexible disk (FD) 1212, and a network interface card (NIC) 1208 for connection to a network 1220 are communicably connected to one another via a system bus 1204.

The CPU 1201 executes software that reads information from the ROM 1202, the HD 1211, or the like, thereby comprehensively controlling each component connected to the system bus 1204 and executing the processing shown in FIGS. 4 and 5, described later, among other processing.

The RAM 1203 functions as the main memory, work area, or the like of the CPU 1201. The KBC 1205 controls instruction input from the KB 1209, a pointing device (not shown), and the like. The CRTC 1206 controls display on the CRT 1210. The DKC 1207 controls access to a boot program, various applications, edit files, user files, and a network management program. The NIC 1208 controls the transmission and reception of data between the performance monitoring apparatus 10 and the Web server 11, the AP server 12, the DB server 13, and the communication lines connecting the servers.

FIG. 3 is a block diagram showing the functional configuration of the performance monitoring apparatus 10 (storage server 101 and analysis server 102).
The performance monitoring apparatus 10 includes a monitoring data acquisition unit 1001, a monitoring data storage unit 1002, an abnormality detection unit 1003, a correlation extraction unit 1004, a correlation storage unit 1005, a failure detection/prediction unit 1006, and a notification unit 1007. The monitoring data acquisition unit 1001 is configured by, for example, the CPU 1201, a program in the ROM 1202, and the NIC 1208. The abnormality detection unit 1003, the correlation extraction unit 1004, and the failure detection/prediction unit 1006 are configured by, for example, the CPU 1201 and a program in the ROM 1202. The monitoring data storage unit 1002 and the correlation storage unit 1005 are configured by, for example, a recording medium such as the RAM 1203 or the HD 1211. The notification unit 1007 is configured by, for example, the CPU 1201, the CRTC 1206, and the CRT 1210.

The monitoring data acquisition unit 1001 acquires resource usage data and log data from the Web server 11, the AP server 12, and the DB server 13, and acquires transaction data and the like from the communication lines connecting these servers. Although not shown, the log data of the AP server 12 and the DB server 13 may be stored in the AP server 12 or the DB server 13 itself, or in a separately provided log storage server; in either case, the monitoring data acquisition unit 1001 acquires this log data via the communication line using ftp or the like. If the AP server 12 or the DB server 13 has a function for transmitting log data, the monitoring data acquisition unit 1001 may instead acquire the log data passively. The monitoring data storage unit 1002 accumulates the monitoring data acquired so far by the monitoring data acquisition unit 1001.

The abnormality detection unit 1003 reads monitoring data from the monitoring data storage unit 1002 and detects an abnormality in the information processing system based on the monitoring data it has read. The correlation extraction unit 1004 reads two types of monitoring data from the monitoring data storage unit 1002 and obtains the correlation between them. The details of this correlation are described later; the correlation extraction unit 1004 obtains both the correlation that holds when the information processing system is operating normally and the correlation that holds when an abnormality has occurred in the information processing system. More than one correlation may be created from a single pair of monitoring data types, for both the normal and the abnormal case. The correlation storage unit 1005 stores each correlation obtained by the correlation extraction unit 1004 with an ID assigned to it.

The failure detection/prediction unit 1006 detects a failure currently occurring in the information processing system or predicts a failure that may occur in the information processing system in the future. Specifically, the failure detection/prediction unit 1006 detects a failure currently occurring in the information processing system by comparing the correlation between the two types of monitoring data obtained while the information processing system was operating normally with the latest values of those two types of monitoring data accumulated in the monitoring data storage unit 1002. It predicts a failure that may occur in the future from the similarity between the correlation of the two types of monitoring data obtained when an abnormality occurred in the information processing system and the correlation of the two types of monitoring data obtained recently.
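The extraction and similarity steps described above can be sketched as follows. This is only an illustrative sketch: the text up to this point does not specify the mathematical form of a correlation, so a least-squares line through paired samples is assumed here, and the names `Correlation`, `extract_correlation`, and `is_similar`, as well as the tolerance values, are hypothetical.

```python
import math
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class Correlation:
    """Assumed form of a stored correlation: the line y = slope * x + intercept."""
    correlation_id: str
    slope: float
    intercept: float


def extract_correlation(correlation_id: str,
                        xs: Sequence[float],
                        ys: Sequence[float]) -> Correlation:
    """Fit a least-squares line to paired samples of two monitoring data types."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired samples")
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return Correlation(correlation_id, slope, my - slope * mx)


def is_similar(a: Correlation, b: Correlation,
               rel_tol: float = 0.1, abs_tol: float = 1e-9) -> bool:
    """Hypothetical similarity test: slopes and intercepts agree within tolerance."""
    return (math.isclose(a.slope, b.slope, rel_tol=rel_tol, abs_tol=abs_tol)
            and math.isclose(a.intercept, b.intercept,
                             rel_tol=rel_tol, abs_tol=abs_tol))
```

Under these assumptions, a recently fitted correlation that passes `is_similar` against a stored abnormal-time correlation would be treated as a sign that a failure may be approaching.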

The notification unit 1007 reports the details when the failure detection/prediction unit 1006 has detected a failure or has predicted that a failure will occur. In this embodiment the notification unit 1007 notifies the operator of the detected or predicted content by displaying it on a screen, but in other embodiments notification by e-mail or the like may be used.

In this embodiment, it is assumed that the monitoring data acquisition unit 1001 and the monitoring data storage unit 1002 are located in the storage server 101, and that the abnormality detection unit 1003, the correlation extraction unit 1004, the correlation storage unit 1005, the failure detection/prediction unit 1006, and the notification unit 1007 are located in the analysis server 102. In another embodiment, however, the performance monitoring apparatus 10 may be consolidated into a single server instead of being divided into the storage server 101 and the analysis server 102.

Next, the operation of the performance monitoring apparatus 10 will be described in detail with reference to the flowcharts of FIGS. 4 and 5. The performance monitoring system of the first embodiment to which the present invention is applied performs, broadly, the following five processes. (1) A process of storing the monitoring data acquired by the monitoring data acquisition unit 1001 in the monitoring data storage unit 1002. (2) A process of obtaining (generating) a correlation based on the data read from the monitoring data storage unit 1002. (3) A process of storing the correlation obtained by the correlation extraction unit 1004 in the correlation storage unit 1005. Processes (1) to (3) are performed as batch processing or real-time processing, depending on the monitoring purpose. (4) A process of comparing monitoring data with a correlation, or correlations with each other. (5) A process of detecting an abnormality from the monitoring data and a correlation. FIG. 4 is a flowchart showing the operations of the monitoring data acquisition unit 1001, the abnormality detection unit 1003, and the correlation extraction unit 1004, and FIG. 5 is a flowchart showing the operation of the failure detection/prediction unit 1006.

The various data accumulated in the monitoring data storage unit 1002 in process (1) should, as a rule, be retained rather than erased even after they have been used in the subsequent processes. For example, as explained in the second embodiment described later, this has the advantage that a large amount of data is available for comparison with past data when the system configuration is changed.

First, the operations of the monitoring data acquisition unit 1001, the abnormality detection unit 1003, and the correlation extraction unit 1004 will be described with reference to FIG. 4. In FIG. 4, the storing processes (1) and (3) described above are shown together with the other processes, but they need not be performed in parallel. First, the monitoring data acquisition unit 1001 acquires monitoring data for the Web server 11, the AP server 12, the DB server 13, and the communication lines connecting the servers, and accumulates the acquired monitoring data in the monitoring data storage unit 1002 (steps S401 and S402).

Subsequently, the abnormality detection unit 1003 reads two types of monitoring data from the monitoring data storage unit 1002, then reads the normal-time correlation corresponding to those two types of monitoring data from the correlation storage unit 1005, and detects an abnormality in the information processing system by comparing the two types of monitoring data read from the monitoring data storage unit 1002 with the normal-time correlation read from the correlation storage unit 1005 (step S403). The monitoring data and the correlation are read and compared at an arbitrary interval chosen according to the monitoring purpose. It is assumed that the two types of monitoring data read from the monitoring data storage unit 1002 by the abnormality detection unit 1003 were acquired at the same time by the monitoring data acquisition unit 1001. The normal-time correlation used here by the abnormality detection unit 1003 is the one obtained for the two types of monitoring data in the most recent previous execution of step S406.

When an abnormality in the information processing system is detected, the correlation extraction unit 1004 calculates the correlation between the two types of monitoring data from the past values of those two types of monitoring data read from the monitoring data storage unit 1002 (steps S403/YES, S404). The correlation extraction unit 1004 then stores the calculated correlation, together with a correlation ID, in the correlation storage unit 1005 as an abnormal-time correlation (step S407). At this point, in the correlation storage unit 1005, the abnormal-time correlation obtained for the two types of monitoring data in the previous execution of step S404 is replaced by the abnormal-time correlation obtained in the current execution of step S404. In this embodiment, therefore, an abnormal-time correlation that tracks the operation of the information processing system and is always up to date can be used in the error prediction processing of step S505 described later.

On the other hand, if no abnormality is detected in step S403, the correlation extraction unit 1004 determines whether a predetermined time has elapsed since acquisition of the two types of monitoring data began (steps S403/NO, S405).

If the predetermined time has elapsed since acquisition of the two types of monitoring data began, the correlation extraction unit 1004 calculates the correlation between the two types of monitoring data from the values read from the monitoring data storage unit 1002 during that period, and stores it, together with a correlation ID, in the correlation storage unit 1005 as a normal-time correlation (steps S405/YES, S406, S407). At this point, in the correlation storage unit 1005, the normal-time correlation obtained for the two types of monitoring data in the previous execution of step S406 is replaced by the normal-time correlation obtained in the current execution of step S406. In this embodiment, therefore, a normal-time correlation that tracks the operation of the information processing system and is always up to date can be used in the error detection processing of step S503 described later.

If, in step S405, the predetermined time has not elapsed since acquisition of the monitoring data began, the process returns to the monitoring data acquisition of step S401. As described above, in this embodiment normal-time correlations continue to be accumulated as long as the monitored system shows no abnormality, and when an abnormality occurs, an abnormal-time correlation is newly generated and accumulated.
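The branching of FIG. 4 described above (steps S403 to S407) can be sketched as a single update function. This is an illustrative sketch only: the function name `update_correlation`, the dictionary-based store keyed by data pair and kind, and the `fit` callable that turns samples into a correlation are all assumptions, not details given in the text.

```python
from typing import Callable, Dict, Hashable, Optional, Sequence, Tuple

Sample = Tuple[float, float]               # one simultaneous reading of both data types
Store = Dict[Tuple[Hashable, str], object]  # (pair id, "normal" | "abnormal") -> correlation


def update_correlation(store: Store,
                       pair_id: Hashable,
                       samples: Sequence[Sample],
                       anomaly_detected: bool,
                       elapsed_s: float,
                       period_s: float,
                       fit: Callable[[Sequence[Sample]], object]) -> Optional[str]:
    """Run one pass of the FIG. 4 decision logic; return which kind was stored."""
    if anomaly_detected:            # S403/YES -> S404
        kind = "abnormal"
    elif elapsed_s >= period_s:     # S403/NO -> S405/YES -> S406
        kind = "normal"
    else:                           # S405/NO -> caller goes back to S401
        return None
    # S407: the newly computed correlation overwrites the previous one of that kind.
    store[(pair_id, kind)] = fit(samples)
    return kind
```

Because each kind has a single slot per data pair in this sketch, the store always holds the most recent normal-time and abnormal-time correlation, which mirrors the update behavior described for steps S404 and S406.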

Next, the operation of the failure detection/prediction unit 1006 will be described with reference to FIG. 5. The failure detection/prediction unit 1006 reads two types of monitoring data from the monitoring data storage unit 1002 (step S501). It is assumed that the two types of monitoring data read here were acquired at the same time by the monitoring data acquisition unit 1001 and are the latest values of those two types stored in the monitoring data storage unit 1002. The interval at which monitoring data is read from the monitoring data storage unit 1002 can be set freely according to the monitoring purpose, but for failure detection the reading should be as close to real time as possible. It is therefore preferable to read the data as soon as the monitoring data acquisition unit 1001 has acquired it and it has been stored in the monitoring data storage unit 1002.

Subsequently, the failure detection/prediction unit 1006 compares the two types of monitoring data with the normal-time correlation for those two types stored in the correlation storage unit 1005, and determines from the result of the comparison whether an error (abnormality) has occurred in the information processing system (steps S502, S503).

If, in step S503, the failure detection/prediction unit 1006 determines that an error has occurred in the information processing system, the notification unit 1007 notifies the operator of the details (steps S503/YES, S506).

On the other hand, if the failure detection/prediction unit 1006 does not determine in step S503 that an error has occurred in the information processing system, it obtains the correlation between the two types of monitoring data from the multiple values of those two types obtained between the execution of step S501 a predetermined number of iterations earlier and the current execution of step S501. Using this correlation and the past correlation of the two types of monitoring data accumulated in the correlation storage unit 1005, it predicts whether an error may occur (steps S503/NO, S504, S505).

If the failure detection/prediction unit 1006 determines in step S505 that an error may occur in the information processing system in the future, the notification unit 1007 notifies the operator of the details (steps S505/YES, S507).

On the other hand, if the failure detection/prediction unit 1006 determines that the two correlations are not similar, that is, that no future error is predicted, the process returns to the reading of monitoring data in step S501 (steps S505/NO, S501).
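By way of non-limiting illustration, one pass through the flow of FIG. 5 can be sketched as follows. The function name `monitor_once` and the callables `is_abnormal`, `predicts_error`, and `notify` are hypothetical names introduced only for this sketch; the description does not prescribe any particular implementation of the detection or prediction decisions.

```python
def monitor_once(sample, is_abnormal, predicts_error, notify):
    """One pass of the FIG. 5 flow for a sample already read in step S501.

    is_abnormal, predicts_error, and notify are caller-supplied callables
    (assumptions for this sketch, not elements named in the description).
    """
    # Steps S502/S503: compare the sample with the normal-time correlation.
    if is_abnormal(sample):
        notify("error detected")      # step S506
        return "error"
    # Steps S504/S505: predict from the recent and past correlations.
    if predicts_error(sample):
        notify("error predicted")     # step S507
        return "predicted"
    # S505/NO: nothing to report; the caller returns to step S501.
    return "normal"
```

A caller would invoke `monitor_once` each time new monitoring data is stored, which matches the preference stated above for reading the data as soon as it becomes available.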

Here, the error detection processing in step S503 will be described specifically with reference to FIG. 6. In FIG. 6, transaction data and resource usage status data are used as the two types of monitoring data, and the figure shows a correlation 601 calculated from the throughput indicated by the transaction data and the disk I/O amount indicated by the resource usage status data. Each "×" mark in FIG. 6 is a point plotted from the throughput and disk I/O amount indicated by one pair of the two types of monitoring data; twelve such points are plotted. The hatched range region 604 is the range regarded as normal with respect to the normal-time correlation 601, and is determined in advance according to the correlation. Although the range region 604 is set parallel to the correlation 601 in FIG. 6, the region does not necessarily have to be set with a constant width centered on the correlation.

The correlation extraction unit 1004 obtains an approximate expression for the twelve points (corresponding to the straight line in FIG. 6). This approximate expression is the correlation 601 between the throughput and the disk I/O amount. Suppose that this correlation 601 is the normal-time correlation obtained in step S406. If the point plotted for the two types of monitoring data read in step S501 (that is, for the throughput and disk I/O amount they indicate) is point 602 in FIG. 6, in other words if the point lies outside the range region 604 of predetermined width defined with reference to the correlation 601 and above that region, the failure detection/prediction unit 1006 determines, with reference to the normal-time correlation 601, that the disk I/O amount is currently too large relative to the throughput. It can thereby detect an error in the information processing system caused by the excessive disk I/O amount. The notification unit 1007 notifies the operator, by screen display, of the system error and its cause (the disk I/O amount is too large relative to the throughput).

Likewise, if the point plotted for the two types of monitoring data read in step S501 (that is, for the throughput and disk I/O amount they indicate) is point 603 in FIG. 6, in other words if the point lies outside the range region 604 of predetermined width defined with reference to the correlation 601 and below that region, the failure detection/prediction unit 1006 determines, with reference to the normal-time correlation 601, that the throughput is currently too high relative to the disk I/O amount. It can thereby detect an error in the information processing system caused by the high throughput. The notification unit 1007 notifies the operator, by screen display, of the system error and its cause (the throughput is too high relative to the disk I/O amount).
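The detection described with reference to FIG. 6 may be illustrated by the following minimal sketch. It assumes that the approximate expression is an ordinary least-squares line and that the range region 604 is a band of constant half-width around that line; the description itself permits other region shapes. The names `fit_line`, `classify`, and `half_width` are hypothetical.

```python
from statistics import mean

def fit_line(points):
    """Least-squares line y = a*x + b through (throughput, disk_io) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    a = sxy / sxx
    return a, my - a * mx

def classify(point, line, half_width):
    """Compare one new (throughput, disk_io) point with the normal band."""
    x, y = point
    a, b = line
    residual = y - (a * x + b)   # vertical distance from the normal-time line
    if residual > half_width:    # above the band, like point 602
        return "disk I/O too large relative to throughput"
    if residual < -half_width:   # below the band, like point 603
        return "throughput too high relative to disk I/O"
    return "normal"
```

Returning the cause as text mirrors the way the notification unit 1007 reports both the error and its cause to the operator.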

In the embodiment described above, the kind of processing to which the throughput relates is not limited. The throughput may therefore be that of a specific process or the combined throughput of several processes. For example, the correlation between throughput and disk I/O amount may be obtained for each of process a, process b, and process c, and the sum of the amounts given by these three correlations may be treated as the reference disk I/O amount for that throughput.
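As a hedged illustration of summing per-process correlations, the sketch below assumes that each correlation is a line given as a (slope, intercept) pair and that the throughput of each process is known separately. The function name `expected_disk_io` and the dictionary layout are assumptions made for this example only.

```python
def expected_disk_io(correlations, throughputs):
    """Reference disk I/O amount as the sum of per-process linear estimates.

    correlations: {process_name: (slope, intercept)}
    throughputs:  {process_name: current throughput of that process}
    """
    return sum(
        slope * throughputs[name] + intercept
        for name, (slope, intercept) in correlations.items()
    )
```

For example, with lines (2.0, 1.0), (0.5, 0.0), and (1.0, 3.0) for processes a, b, and c and throughputs of 10, 20, and 5, the reference amount is 21 + 10 + 8 = 39.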

Because the performance monitoring system of this embodiment is characterized by monitoring a plurality of servers, the operator is notified of the system error and its cause together with an indication of which server's behavior led to the detection of the error.

In this embodiment, various other kinds of error detection can be performed based on the monitoring data acquired by the monitoring data acquisition unit 1001. For example, using transaction data obtained by monitoring transactions to a certain server together with the resource usage status data of that server, and based on the throughput indicated by the transaction data and the CPU usage rate indicated by the resource usage status data, it can be determined that the cause of an error in the information processing system is that the CPU usage rate is low even though the throughput of the server is high, or that the CPU usage rate is high even though the throughput of the server is low.

The following kind of error cause can also be identified from the resource usage status data of two different servers. For example, suppose that in a normal operating state the CPU usage rates of the Web server 11 and the AP server 12 should be in the ratio N:M. If, based on the CPU usage rate indicated by the resource usage status data obtained from the Web server 11 and the CPU usage rate indicated by the resource usage status data obtained from the AP server 12, only the usage rate of the Web server 11 is high, it can be determined that the cause of the error in the information processing system is a failure in the AP server 12.
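The N:M comparison may be sketched as follows, purely as an example. The relative `tolerance` and the returned labels are assumptions; the description states only that a usage rate that is high on the Web server 11 alone points to a failure in the AP server 12, and says nothing about the opposite case, which is labeled here without attributing a cause.

```python
def check_cpu_ratio(web_cpu, ap_cpu, n, m, tolerance=0.25):
    """Compare the observed Web:AP CPU usage ratio with the normal ratio N:M."""
    expected = n / m
    if ap_cpu <= 0:
        # AP server shows no CPU activity at all: treat as Web-only load.
        return "web_high"
    observed = web_cpu / ap_cpu
    if observed > expected * (1 + tolerance):
        return "web_high"   # per the description: suspect the AP server 12
    if observed < expected * (1 - tolerance):
        return "ap_high"    # cause not stated in the description
    return "normal"
```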

The following kind of error cause can also be identified from the resource usage status data and log data of a certain server. For example, based on the CPU usage rate indicated by the resource usage status data and the frequency of occurrence of process 1 determined from the log data, if process 1 occurs more frequently than usual during a time period in which the CPU usage rate of the server takes an abnormally high value, it can be determined that the cause of the error in the information processing system is the increased frequency of process 1 on that server during that time period.

Furthermore, the following kind of error cause can be identified from the log data of two different servers. For example, based on the throughput of the Web server 11 determined from its log data and the throughput of the AP server 12 determined from its log data, if the throughput of the AP server 12 does not increase even though the throughput of the Web server 11 is trending upward, it can be detected that there is a problem in the AP server 12, that processing which uses the AP server 12 is therefore stalled, and that the proportion of processing which uses only the Web server 11 is increasing.

Next, the error prediction processing in step S505 of FIG. 5 will be described specifically with reference to FIG. 7.
FIG. 7 shows correlations calculated using the log data of different servers (here, the Web server 11 and the AP server 12), based on the throughput of process 1 on the Web server 11 and the throughput of process 2 on the AP server 12 as determined from those log data. The range region 701 indicates the range regarded as normal, obtained during normal operation, for the number of occurrences of process 2 on the AP server 12 relative to the number of occurrences of process 1 on the Web server 11.

FIG. 7 shows, as correlations 702, the past correlations 702(a) and 702(b) accumulated in the correlation storage unit 1005. It also shows the correlation 702(c), which the correlation extraction unit 1004 obtained from the log data of the Web server 11 and the AP server 12 acquired between the execution of step S501 a predetermined number of times earlier and the current execution of step S501. In chronological order, 702(a) is the correlation obtained first, 702(b) is the next, and 702(c) is the latest. Further, the correlation 702(d) indicates the correlation expected for the monitored system in the future. To keep the figure easy to read, the correlation line corresponding to the range region 701 is not shown in FIG. 7.

In step S504, an error is predicted from the past trend and current state of the monitored system on the basis of the correlations 702(a) to 702(c), that is, using data obtained by periodically monitoring a given monitored system.

In step S505, the failure detection/prediction unit 1006 evaluates how the correlation 702 changes over time and, if the correlation is likely to leave the normal-time range region 701, predicts that an abnormality may occur in the information processing system in the future. At this time, the future correlation 702(d) is generated as necessary. In this example, the criterion is that the correlation created from the latest monitoring data is likely to leave the range region 701 of the normal-time correlation. Alternatively, the criterion may be that the correlation created from the latest monitoring data is becoming similar to an abnormal-time correlation, or the determination may be based on, for example, the slopes of the normal-time and abnormal-time correlations rather than on whether the correlation falls within the range region.
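One possible way to realize this prediction is sketched below, under assumptions that go beyond the description: each correlation is a line (slope, intercept), the future correlation 702(d) is obtained by linear extrapolation from the two most recent correlations, and the range region 701 is bounded by a lower and an upper line checked at a few sample throughputs. All function and parameter names are hypothetical.

```python
def extrapolate(history):
    """Next (slope, intercept), extrapolated linearly from the last two entries."""
    (a1, b1), (a2, b2) = history[-2], history[-1]
    return (2 * a2 - a1, 2 * b2 - b1)

def leaves_region(line, lower, upper, xs):
    """True if the line exits the band [lower, upper] at any sample x."""
    a, b = line
    for x in xs:
        y = a * x + b
        if y < lower[0] * x + lower[1] or y > upper[0] * x + upper[1]:
            return True
    return False

def predict_error(history, lower, upper, xs):
    """True if the extrapolated correlation is expected to leave the region."""
    future = extrapolate(history)   # plays the role of correlation 702(d)
    return leaves_region(future, lower, upper, xs)
```

With a history of slopes 1.0, 1.2, and 1.4 and an upper bound of slope 1.5, the extrapolated slope of about 1.6 falls outside the band, so an error would be predicted even though the latest correlation is still inside it.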

The notification unit 1007 notifies the operator of the prediction made by the failure detection/prediction unit 1006.

In this embodiment, when an information processing system with a configuration similar to that of the present information processing system is newly installed, the normal-time and abnormal-time correlations stored in the correlation storage unit 1005 of the present system can be stored in the correlation storage unit of the new system, so that appropriate error detection processing and error prediction processing can be performed in the new system in the same way. Because the performance monitoring apparatus 10 can monitor information processing systems of various configurations and is not limited to the system shown in FIG. 1, the correlations that can be reused are of course not limited to the examples described above.

As described above, according to this embodiment, the cause of a failure can be traced from the types of the two kinds of monitoring data used when the failure was detected or predicted. Although this embodiment uses the correlation between two types of monitoring data, the correlations applicable to the present invention are not limited to those calculated from two types of monitoring data and may be correlations among a larger number of types of monitoring data.

For convenience of explanation, the abnormality detection unit 1003 and the failure detection/prediction unit 1006 have been described as separate components. However, because both perform the processing of comparing monitoring data read from the monitoring data storage unit 1002 with a correlation read from the correlation storage unit 1005, common software/hardware may be used for that processing.

Next, another example of processing according to this embodiment will be described. A reference ratio between the number of occurrences of process 1 on the Web server 11 and the number of occurrences of process 2 on the AP server 12 may be set in advance, and an error may be predicted when the current ratio between the two types of monitoring data tends to move away from the reference ratio. For example, when the reference ratio is set to 1:1 but the ratio drifts away from the reference over time, as in 1:1.1, 1:1.2, 1:1.3, and so on, this tendency is detected and reported to the operator.
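The drift check can be illustrated with the following sketch. It assumes that each ratio is expressed as a single number (for example, 1:1.1 as 1.1) and that a "tendency" means the deviation from the reference has grown at every one of the last `min_steps` observations; both the function name and that threshold are assumptions for illustration.

```python
def drifting_away(ratios, reference, min_steps=3):
    """True if the last min_steps changes each moved further from the reference."""
    deviations = [abs(r - reference) for r in ratios]
    recent = deviations[-(min_steps + 1):]
    if len(recent) < min_steps + 1:
        return False   # not enough observations to call it a tendency
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

Applied to the sequence 1.0, 1.1, 1.2, 1.3 with a reference of 1.0, the deviation grows at each of the last three steps, so the drift would be reported.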

Furthermore, an abnormality can also be detected from a single piece of correlation information obtained from two types of monitoring data. FIG. 8 shows an example of the correlation between throughput and response time. In this figure, the response time becomes longer as the throughput increases, and the response time deteriorates sharply once the throughput exceeds a certain amount. By detecting the point at which the response time begins to deteriorate as an error, the operator can take early countermeasures against the degradation in response. Specifically, such a correlation is stored in the correlation storage unit 1005, and when the performance monitoring apparatus 10 detects that the monitoring data is approaching this extreme point of the correlation, it determines that an error has occurred and notifies the operator.
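As one hedged example of locating the point at which response time begins to deteriorate sharply, the sketch below scans a stored throughput/response-time curve for the first segment whose slope exceeds a multiple of the initial slope. The `factor` threshold and the function name `find_knee` are assumptions; the description does not specify how the extreme point is identified.

```python
def find_knee(curve, factor=3.0):
    """Throughput at which the curve first steepens sharply, or None.

    curve: list of (throughput, response_time) pairs sorted by throughput.
    """
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in zip(curve, curve[1:])
    ]
    if not slopes:
        return None
    base = slopes[0]                  # slope in the lightly loaded range
    for i, slope in enumerate(slopes[1:], start=1):
        if slope > factor * base:
            return curve[i][0]        # start of the first steep segment
    return None
```

A monitor could then treat any observed throughput at or beyond the returned value as having reached the point of sharp deterioration and notify the operator.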

As described above, this embodiment captures changes in the correlation over time and, when a change in the slope of the correlation, a shift of the correlation along the X axis or Y axis, or the like is not permitted, reports an error on the basis of that situation. The invention is not limited to this, however; an error may also be predicted by comparison with the normal-time correlation at a single point in time.

In the embodiment described above, resource usage status data, transaction data, and log data were given as examples of the monitoring data acquired by the performance monitoring apparatus 10. The monitoring data applicable to the present invention is not limited to these, however: any data from which the operating status of the Web server 11, the AP server 12, and the DB server 13 can be identified can be collected by the performance monitoring apparatus 10, and error detection processing and error prediction processing can be performed by the same operations. Further, although the information processing system monitored by the performance monitoring apparatus 10 in the above embodiment is the system composed of the Web server 11, the AP server 12, and the DB server 13 shown in FIG. 1, information processing systems of other configurations can of course also be monitored by the performance monitoring apparatus of the present invention.

The embodiment described above was explained using an example in which one performance monitoring apparatus 10 monitors a system composed of one Web server 11, one AP server 12, and one DB server 13, but there need not be exactly one of each. Because the performance monitoring apparatus 10 can monitor servers and communication lines connected to a network, a single performance monitoring apparatus 10 can also monitor a system composed of two or more sets of the Web server 11, the AP server 12, and the DB server 13.

The numbers of Web servers 11, AP servers 12, and DB servers 13 also need not be in the ratio 1:1:1; the system may include a plurality of each, in the ratio M:N:L. As one example, as shown in FIG. 9, six Web servers 11 are connected, three each, to two AP servers 12, and the two AP servers 12 are connected to one DB server 13. In this case, the performance monitoring apparatus 10 monitors the individual servers and communication lines and can detect errors in fine detail from their behavior. If necessary, the communication with the three Web servers 11 connected to one AP server 12 can also be aggregated and monitored as if it came from a single Web server 11. In this case, it is preferable to store system configuration information in the performance monitoring apparatus 10 so that the monitoring target can be set arbitrarily.

Next, a preferred second embodiment to which the present invention is applied will be described. As noted above, it is preferable to store system configuration information for the system to be monitored in the performance monitoring apparatus 10 so that the monitoring target can be set arbitrarily. The second embodiment therefore adds, to the functional configuration of the first embodiment, management of the system configuration information of the system to be monitored, so that more varied monitoring and failure prediction can be performed.

FIG. 10 schematically shows the configuration of the performance monitoring system according to the second embodiment. The system is described in detail below with reference to the drawings; description of functions identical to those of the first embodiment is omitted. FIG. 10 shows a configuration for monitoring the performance of the system shown in FIG. 9, which is composed of six Web servers 11, two AP servers 12, and one DB server 13. As in the first embodiment, the performance monitoring apparatus 10, composed of the storage server 101 and the analysis server 102, collects, accumulates, and analyzes the information that can be acquired from the communication lines. The second embodiment further includes a configuration information management device 20, which is connected to the performance monitoring apparatus 10. In the following description, the configuration information management device 20 is configured as a device separate from the performance monitoring apparatus 10, but the two may be implemented on a single computer.

The configuration information management device 20 stores information relating to the configuration of the entire system to be monitored. Specifically, it stores in a database information about the information processing apparatuses themselves, such as the number of servers for each function, hardware attributes, network configuration, network attributes, software, and firmware, together with information indicating the relationships among the information processing apparatuses. To simplify the description, the following example deals with configuration information relating to hardware. For example, the overall configuration shown in FIG. 9 is stored with an ID assigned to it. When the configuration of the monitored system changes, for example when a server is newly added, the new configuration is stored in the configuration information management device 20 as new configuration information with a separate ID. When the configuration information management device 20 is configured as a standalone computer, it has the basic functions of a computer as shown in FIG. 2.

FIG. 11 illustrates in detail the configurations of the performance monitoring apparatus 10 and the configuration information management device 20 used in the second embodiment. In addition to the functions described in the first embodiment, the performance monitoring apparatus 10 includes a monitoring target designation unit 1008 for designating the range to be monitored within the overall system configuration, and holds, in the monitoring data storage unit 1002, monitoring target range data that records the designated monitoring target range.

As will be described later, in the second embodiment the overall configuration of a system composed of a plurality of pieces of hardware is stored in the configuration information management device 20 with a configuration information ID assigned to it. Any range within the stored overall configuration can then be designated as the monitoring target. For example, for the system in FIG. 9 composed of a total of nine computers (six Web servers 11, two AP servers 12, and one DB server 13), the entire system can be monitored, or only some of the computers can be monitored. For this purpose, the monitoring target designation unit 1008 has a function of accepting from the operator information that identifies the monitoring target. Specifically, it receives the range designation information through the operator's keyboard or mouse operations or the like.

The range designation information received by the monitoring target designation unit 1008 is stored in the monitoring data storage unit 1002 as monitoring target range data with a monitoring target ID assigned to it. When the monitoring data acquisition unit 1001 acquires resource usage status data and log data from the Web server 11, the AP server 12, and the DB server 13, and transaction data and the like from the communication lines connecting these servers, it refers to the monitoring target range data and acquires only the information within the designated range. When the monitoring data acquisition unit 1001 acquires monitoring data actively, it accesses the designated servers and the like to obtain log data and so on. When it acquires monitoring data passively, it selects (filters) from the received log data and so on only the data of the servers and the like designated as the monitoring target range.
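The passive (filtering) case may be sketched as follows. The record layout with a `"host"` key and the use of IP addresses as identifiers are assumptions for this example; the addresses shown are taken from the documentation range and do not appear in the description.

```python
def filter_records(records, target_hosts):
    """Keep only records whose source host is in the monitoring target range."""
    return [record for record in records if record["host"] in target_hosts]

# Hypothetical monitoring target range data and received log records.
target_range = {"192.0.2.11", "192.0.2.21"}
received = [
    {"host": "192.0.2.11", "cpu": 35},
    {"host": "192.0.2.12", "cpu": 90},   # outside the designated range
    {"host": "192.0.2.21", "cpu": 50},
]
selected = filter_records(received, target_range)
```

Holding the target range as a set keeps the membership test cheap when many records arrive, which is one reasonable design choice among several.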

The configuration information management device 20 is composed of a configuration information registration unit 2001 for inputting and registering configuration information, a configuration information storage unit 2002 for storing the input configuration information, and a configuration information extraction unit 2003 for extracting, in response to a request from the performance monitoring apparatus 10, the configuration information stored in the configuration information storage unit 2002 and transmitting it to the performance monitoring apparatus 10.

The configuration information registration unit 2001 is a function, realized by a keyboard, a mouse, or the like, that accepts information input by the operator. In the case of FIG. 9, for example, the operator inputs, as the overall configuration of the system to be monitored, information on the quantity of hardware (six Web servers, two AP servers, one DB server, and so on), how the pieces of hardware are connected to one another, the transfer rate of the network connecting them, the specifications of each piece of hardware and software, and so on. As the specifications of each piece of hardware and software, it is advisable to register not only the specifications at the time of purchase but also the firmware and software versions. In addition to input from the operator, system configuration information that a computer can obtain via the network may be acquired automatically.

The configuration information storage unit 2002 stores the information accepted by the configuration information registration unit 2001 for each monitored system. In addition to the configuration information ID, attribute information such as the date and time at which the configuration information was accepted and stored is attached to the configuration information.

The configuration information extraction unit 2003 is a function that extracts the configuration information stored in the configuration information storage unit 2002 in response to an instruction from the performance monitoring apparatus 10 or the operator. As will be described later, in the second embodiment performance is monitored and abnormalities are detected according to the system configuration, so the correlation must be obtained from the system to be monitored and the normal-time behavior of that system. The performance monitoring apparatus 10 therefore reads configuration information from the configuration information storage unit 2002 as needed and creates correlation data and the like.

Here, the correlations in the correlation storage unit 1005 are stored separately for each environment in which a correlation was obtained. For example, the behavior of the system differs between when there are ten servers and when there are eleven. The correlation for ten servers and the correlation for eleven servers are therefore obtained separately, and each is stored with its own correlation ID. Each correlation is then linked to the monitoring target ID and/or the configuration information ID in effect when the correlation was obtained. Such links can easily be set up by managing them in a relational database or the like. The pieces of information associated by ID in this way may also be stored separately as history information. Naturally, because a plurality of correlations are generated for one monitoring target, correlation IDs and monitoring target IDs are linked in a many-to-many relationship. The same applies to configuration information IDs.
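The ID-based linking can be illustrated with a small relational schema. The sketch below uses Python's standard `sqlite3` module with an in-memory database; the table names, column names, and IDs are hypothetical, and a separate link table is only one way to represent the many-to-many relationship.

```python
import sqlite3

def build_store():
    """Create an in-memory store for correlations and their ID links."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE correlation (
            correlation_id TEXT PRIMARY KEY,
            slope REAL,
            intercept REAL
        );
        CREATE TABLE correlation_link (
            correlation_id TEXT NOT NULL REFERENCES correlation(correlation_id),
            target_id TEXT NOT NULL,
            config_id TEXT NOT NULL,
            PRIMARY KEY (correlation_id, target_id, config_id)
        );
    """)
    return conn

def register(conn, correlation_id, slope, intercept, target_id, config_id):
    """Store a correlation and link it to a monitoring target and configuration."""
    with conn:  # commits on success, rolls back on error
        conn.execute("INSERT OR IGNORE INTO correlation VALUES (?, ?, ?)",
                     (correlation_id, slope, intercept))
        conn.execute("INSERT OR IGNORE INTO correlation_link VALUES (?, ?, ?)",
                     (correlation_id, target_id, config_id))

def correlations_for(conn, target_id, config_id):
    """Correlation IDs recorded for one monitoring target and configuration."""
    rows = conn.execute(
        "SELECT c.correlation_id FROM correlation c "
        "JOIN correlation_link l ON c.correlation_id = l.correlation_id "
        "WHERE l.target_id = ? AND l.config_id = ? "
        "ORDER BY c.correlation_id",
        (target_id, config_id))
    return [row[0] for row in rows]
```

With this layout, the correlation obtained with ten servers and the one obtained with eleven servers are retrieved by different configuration IDs, while a single correlation can still be linked to several monitoring targets.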

Next, the operation of the performance monitoring apparatus 10 and the configuration information management apparatus 20 will be described with reference to FIG. 12. In the second embodiment, the monitoring and correlation extraction processing itself is the same as in the first embodiment described with reference to FIG. 4, but a process for specifying the range to be monitored is performed before the monitoring process. First, the configuration information registration unit 2001 receives information on the overall configuration of the system, entered by an operator or a computer, and transfers it to the configuration information storage unit 2002 (step S1201). On receiving this information, the configuration information storage unit 2002 assigns an ID to the configuration information and stores it sequentially. At this time, the date and time of reception are stored together with it, as described above (step S1202).

Next, the operator enters information on the range to be monitored within the overall system structure stored in the configuration information storage unit 2002, and the monitoring target designation unit 1008 accepts the entered information (step S1203). One way of designating the range is to identify hardware uniquely, for example by the IP addresses of the target servers. Based on the accepted information, the monitoring data acquisition unit 1001 instructs the configuration information extraction unit 2003 to perform extraction, and the configuration information extraction unit 2003 extracts the relevant system information from the configuration information storage unit 2002 and returns it to the monitoring data acquisition unit 1001 (step S1204).

For example, when the monitoring target designation unit 1008 receives an instruction from the operator to monitor the eight servers other than the DB server in FIG. 9, the monitoring data acquisition unit 1001 sends that information to the configuration information extraction unit 2003 as an extraction condition, and the configuration information extraction unit 2003 identifies the servers using the eight IP addresses and the like. The IP addresses of the identified target servers are sent to the monitoring data acquisition unit 1001, which assigns a monitoring target ID to them and stores them in the monitoring data storage unit 1002 as monitoring target range data (step S1205).

When performing the monitoring process, the monitoring data acquisition unit 1001 acquires monitoring data for the hardware group identified by the monitoring target range data stored in the monitoring data storage unit 1002. Processing then proceeds in the same way as in the first embodiment described with reference to FIGS. 4 and 5. At this time, the correlation used for comparison is extracted based on the monitoring target ID and the corresponding correlation ID, and each process is performed. The processing of steps S1201 to S1205 in FIG. 12 is performed each time the system configuration is changed or the monitoring target range is changed.

As described above, the second embodiment to which the present invention is applied further includes information specifying the hardware configuration and software configuration to be monitored, which makes it possible to monitor a range chosen to suit the purpose, for example only a specific part of the overall system. The embodiment described above shows an example in which one performance monitoring apparatus 10 and one configuration information management apparatus 20 are provided for one system, but the present invention is not limited to this and can also be applied to forms such as an ASP (application service provider) service. That is, there may be multiple systems to be monitored, and only a specific range within each of those systems may be set as the monitoring target. In that case, configuration information is stored for each system and monitoring target range data is held for each system.

As another form, multiple monitoring target ranges may be set within one system according to the purpose. For example, in a system consisting of 10 servers, server A to server J, the first monitoring target range may be the five servers A to E and the second the three servers F to H. Furthermore, one server may be designated in more than one monitoring target range, for example a first range consisting of the seven servers A to G and a second consisting of the eight servers C to J. In either case, when performing the monitoring process, the monitoring data acquisition unit 1001 refers to the monitoring target range data stored in the monitoring data storage unit 1002, identifies the servers to be monitored, and acquires the necessary monitoring data.
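The overlapping-range case can be sketched as follows. The data layout (a dictionary keyed by monitoring target ID), the IDs "T1" and "T2", and the server naming are assumptions made for illustration only. The specification does not prescribe how monitoring target range data is represented.

```python
from string import ascii_uppercase


def server_range(first, last):
    """Return server names from `first` to `last` inclusive, e.g. 'A'..'E'."""
    i, j = ascii_uppercase.index(first), ascii_uppercase.index(last)
    return [f"server{c}" for c in ascii_uppercase[i:j + 1]]


# Hypothetical monitoring target range data, keyed by monitoring target ID.
target_ranges = {
    "T1": server_range("A", "G"),  # seven servers
    "T2": server_range("C", "J"),  # eight servers, overlapping T1
}


def ranges_containing(server):
    """Return the IDs of every monitoring target range that includes `server`."""
    return sorted(tid for tid, servers in target_ranges.items() if server in servers)
```

With this layout, a server such as server C belongs to both ranges, so monitoring data collected for it can be used when evaluating either range.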

Next, a preferred third embodiment to which the present invention is applied will be described. The first and second embodiments described above both collect data on the operating status of the computers, such as resource usage data, log data, and transaction data. The third embodiment additionally collects information other than the computer operating status and uses it to obtain correlations.

The hardware and software configuration of a computer system is changed for various reasons, and these changes alter the performance of the system. Performance also changes with changes in the environment surrounding the computer system. This embodiment is characterized in that it captures these changes and handles them as one kind of monitoring data, referred to specifically as "event data". Event data is data on events, including operating status, that occur inside or outside the target system to be monitored. Internal events include, for example, the occurrence of an error or a system change such as an increase in the number of CPUs installed in a computer. External events include temperature changes and shaking caused by earthquakes or impacts. Depending on the event, changes may follow, such as reduced computing performance leading to lower throughput. Therefore, for example, when the monitoring data acquisition unit 1001 catches event data, processing such as analysis or abnormality detection is performed according to the event.

FIG. 13 is a diagram schematically showing the configuration of the performance monitoring system according to the third embodiment. The basic information processing in the third embodiment is the same as in the first and second embodiments, and only the components needed to explain the features of this embodiment clearly are shown. Description of identical processing is therefore omitted. One feature of the third embodiment is that the input data sources include an "operation management tool" and "user input" in addition to monitored devices such as the "Web server", "AP server", and "DB server". The data stored in the monitoring data storage unit 1002 is shown separately as monitoring data (1002) and event data (1002').

Event data may be a signal emitted by the monitored system and used as is, data received from an operation management tool (not shown), or data entered by a person. The operation management tool manages the hardware and software of the system, and holds information such as how each piece of hardware is configured and which software versions are installed.

In addition, as described later, some event data is generated from log data and other data received from the monitored system. In every case, each item of event data is assigned an event data ID and stored in a predetermined location in the monitoring data storage unit 1002.

Next, the data flow in the third embodiment will be described. Each item of data received via the monitoring data acquisition unit 1001 is stored in a storage unit according to its type. Data on the configuration of the monitored system is stored in the configuration information storage unit 2002 of the configuration information management apparatus 20, as described in the second embodiment. Monitoring data received from the monitored system, such as log data and throughput, is stored in the monitoring data storage unit 1002, and event data received via the monitoring data acquisition unit 1001 is likewise stored in the monitoring data storage unit 1002'.

Information on events can also be derived from the monitoring data stored in the monitoring data storage unit 1002. For example, when a monitored server goes down, monitoring data is no longer received. If the point at which periodically received monitoring data stops being stored in the monitoring data storage unit 1002 can be detected, an event relating to an error (failure), namely that the monitored server has gone down, can be extracted. Similarly, if the CPU usage rate exceeds 90% for about 10 minutes, the system can be regarded as overloaded, so an event relating to the operating status of the system can be extracted.

In the third embodiment, therefore, an event data generation unit 1009 is provided to generate event data from the monitoring data. The event data generation unit 1009 generates event data from the monitoring data stored in the monitoring data storage unit 1002 according to event data generation rules stored in a rule storage unit (not shown). An event data generation rule defines at what timing, from which data, and what kind of event data is to be generated. In the error-related example above, event data is generated according to the rule: "always" extract "monitoring data", and "if no monitoring data is received for a certain period, generate 'server down'". In the operating-status example, event data is generated according to the rule: "always" extract "CPU usage rate", and "if 90% or more continues for 10 minutes, generate 'overload'". The generated event data is assigned an event data ID and stored in the monitoring data storage unit 1002'.
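The two example rules could be expressed in code along the following lines. This is a sketch under stated assumptions: the Sample type, the function names, the 120-second reception timeout (the specification says only "a certain period"), and the sequential numbering of event data IDs are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    t: float    # timestamp in seconds
    cpu: float  # CPU usage rate in percent


def detect_server_down(samples, now, timeout):
    """Return True if no sample has arrived within `timeout` seconds of `now`."""
    if not samples:
        return True
    return now - samples[-1].t > timeout


def detect_overload(samples, threshold=90.0, duration=600.0):
    """Return True if CPU usage stayed at or above `threshold` for `duration` seconds."""
    run_start = None
    for s in samples:  # samples are assumed to be sorted by time
        if s.cpu >= threshold:
            if run_start is None:
                run_start = s.t
            if s.t - run_start >= duration:
                return True
        else:
            run_start = None
    return False


def generate_events(samples, now, timeout=120.0):
    """Apply both rules and return a list of (event_data_id, name) tuples."""
    names = []
    if detect_server_down(samples, now, timeout):
        names.append("server down")
    if detect_overload(samples):
        names.append("overload")
    return [(i + 1, name) for i, name in enumerate(names)]
```

Keeping each rule as a separate predicate mirrors the idea of a rule storage unit: further rules can be added without changing the generation loop.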

Thus, in the third embodiment, for every kind of event occurring in the monitored system, signals emitted by the monitored system, signals received from an operation management tool (not shown), information entered by a person, and data generated by the event data generation unit 1009 are stored as event data in the monitoring data storage unit 1002.

The correlation extraction unit 1004 obtains correlations using the information stored in the monitoring data storage unit 1002 and the configuration information storage unit 2002, and stores them in the correlation storage unit 1005.

Next, processing that uses event data will be described. The first and second embodiments performed (2) processing to obtain (generate) a correlation based on data read from the monitoring data storage unit 1002, (4) processing to compare monitoring data with a correlation, or correlations with each other, and (5) processing to detect an abnormality from monitoring data and a single correlation. This embodiment additionally performs (6) processing to compare monitoring data with a correlation generated with event data as the trigger.

As an example of processing (6), a series of analysis steps using the correlation between the monitoring data described above and the "server down" event data is described here. "Disk I/O" and "server throughput" are assumed to be monitored as the monitoring data.

First, "disk I/O" and "server throughput" are measured continuously for the monitored system. The measured data is acquired by the monitoring data acquisition unit 1001 and stored sequentially in the monitoring data storage unit 1002 as "disk I/O" and "server throughput". The event data generation unit 1009 continually extracts the monitoring data, and if no monitoring data is received for a certain period, it regards this as "server down", generates "server down" event data, and stores it in the monitoring data storage unit 1002'.

Next, the correlation extraction unit 1004 extracts a correlation based on the "disk I/O" and "server throughput" stored in the monitoring data storage unit 1002 and the "server down" event data stored in the monitoring data storage unit 1002', and stores it in the correlation storage unit 1005. Specifically, if monitoring data stops being received for a certain period immediately after the "disk I/O" and "server throughput" monitoring data stored in the monitoring data storage unit 1002 has risen sharply, the unit obtains a correlation such as that shown in FIG. 14 based on "disk I/O" and "server processing count", and additionally generates the information that "server down" occurred when "disk I/O" or "server throughput" exceeded a certain value. In FIG. 14, the hatched region shows the relationship between "disk I/O" and "server throughput" at the times "server down" occurred in the past.

Next, the failure detection/prediction unit 1006 reads the "disk I/O" and "server throughput" monitoring data stored sequentially in the monitoring data storage unit 1002 and determines whether the data lies within the normal range of the correlation shown in FIG. 14, whether "server down" may occur (failure prediction), or whether "server down" has occurred (failure detection). If it determines failure prediction or failure detection, it displays a message such as "'Server down' may occur" or "'Server down' has occurred" on the notification unit 1007.
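One way to picture this three-way determination is the sketch below. It rests on assumptions that are not in the specification: the hatched region of FIG. 14 is approximated by two fixed limits, a sample at or beyond either limit is treated as failure detection, and a sample within an assumed margin (80% of a limit) is treated as failure prediction. The names DownRegion, classify, and notify are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DownRegion:
    """Assumed stand-in for the hatched region: limits learned from past 'server down' events."""
    disk_io_limit: float
    throughput_limit: float
    margin: float = 0.8  # assumed fraction of a limit at which a warning is raised


def classify(region, disk_io, throughput):
    """Return 'detected', 'predicted', or 'normal' for one monitoring sample."""
    if disk_io >= region.disk_io_limit or throughput >= region.throughput_limit:
        return "detected"
    if (disk_io >= region.margin * region.disk_io_limit
            or throughput >= region.margin * region.throughput_limit):
        return "predicted"
    return "normal"


MESSAGES = {
    "predicted": '"Server down" may occur',
    "detected": '"Server down" has occurred',
}


def notify(region, disk_io, throughput):
    """Return the message for the notification unit, or None when the sample is normal."""
    return MESSAGES.get(classify(region, disk_io, throughput))
```

A real implementation would derive the region from the stored correlation rather than from hard-coded limits. The fixed limits here only make the decision logic easy to follow.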

In the operating-status example given earlier, correlations can be compared as follows based on the generated "overload" event data. In general, when throughput rises, CPU processing increases and the load becomes high. By contrast, a state in which throughput is high but the CPU load is not is considered abnormal. Therefore, for the correlation between CPU usage rate and throughput, the correlation during normal operation is compared with the correlation at the time the "overload" event occurred, and a failure is judged from the comparison.
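As a rough illustration of comparing two correlations, the sketch below uses the Pearson correlation coefficient between throughput and CPU usage rate. This is a swapped-in technique chosen for brevity: the specification does not state how correlations are represented or compared, and the 0.3 tolerance and function names are assumptions. The inputs are assumed to be non-constant sequences of equal length.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length, non-constant sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlation_shifted(normal, observed, tolerance=0.3):
    """Return True if the observed coefficient differs from the normal one by more than `tolerance`.

    `normal` and `observed` are (throughput, cpu_usage) pairs of sequences.
    """
    r_normal = pearson(*normal)
    r_observed = pearson(*observed)
    return abs(r_normal - r_observed) > tolerance
```

Under these assumptions, data in which CPU usage no longer rises with throughput yields a coefficient far from the normal one, which flags the abnormal state described above.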

As described above, the third embodiment extracts every kind of event occurring inside or outside the monitored system as event data, and extracts correlations using the extracted event data and the monitoring data. The embodiment above describes analysis using only event data and monitoring data, but more detailed abnormality detection is also possible by obtaining correlations from data that also includes the configuration information described in the second embodiment.

In each of the embodiments described above, correlations are extracted based on correlation extraction rules stored in a rule storage unit (not shown). These correlation extraction rules are registered by the user in advance, but the rules themselves may instead be generated automatically, by inferring from the stored monitoring data and event data what kind of correlation should be extracted. That is, monitoring data and event data may be accumulated continuously, with the state in which no errors or the like occur treated as the normal value, and when some monitoring data deviates from this normal value, a correlation extraction rule generation function may operate to generate a new correlation rule from that data.

As described in detail above, the present invention obtains correlations either from multiple types of quantitative information on the operating status of the system, as in the first and second embodiments, or from quantitative information on the operating status of the system together with information on events that have occurred in the system, as in the third embodiment. The correlations obtained in this way are stored in the correlation storage unit 1005, and monitoring data is compared with them to detect or predict failures.

It goes without saying that the object of the present invention is also achieved by supplying a system or apparatus with a storage medium on which the program code of software implementing the functions of the embodiments described above is recorded, and having the computer (or CPU or MPU) of that system or apparatus read and execute the program code stored on the storage medium.

In this case, the program code read from the storage medium itself implements the functions of the embodiments described above, and the program code itself and the storage medium storing it constitute the present invention.

Storage media that can be used to supply the program code include, for example, a flexible disk, hard disk, optical disk, magneto-optical disk, CD-ROM, CD-R, magnetic tape, nonvolatile memory card, and ROM.

It also goes without saying that the invention covers not only the case where the functions of the embodiments described above are implemented by the computer executing the program code it has read, but also the case where an OS (operating system) or the like running on the computer performs part or all of the actual processing based on the instructions of the program code, and the functions of the embodiments are implemented by that processing.

It further goes without saying that the invention covers the case where the program code read from the storage medium is written to memory provided on a function expansion board inserted in the computer or in a function expansion unit connected to the computer, after which a CPU or the like provided on that board or unit performs part or all of the actual processing based on the instructions of the program code, and the functions of the embodiments are implemented by that processing.

A diagram schematically showing the configuration of the performance monitoring system according to the first embodiment of the present invention.
A diagram schematically showing the hardware configuration of the computer system in the performance monitoring apparatus.
A block diagram showing the functional configuration of the performance monitoring apparatus.
A flowchart showing the operation of the monitoring data acquisition unit, the abnormality detection unit, and the correlation extraction unit.
A flowchart showing the operation of the failure detection/prediction unit.
A diagram for concretely explaining the error detection process in step S503 of FIG. 5.
A diagram for concretely explaining the error prediction process in step S505 of FIG. 5.
A diagram showing the correlation between throughput data and response time.
A diagram showing another configuration example of a performance monitoring system to which the present invention can be applied.
A diagram schematically showing the configuration of the performance monitoring system according to the second embodiment of the present invention.
A diagram schematically showing the hardware configuration of the computer system in the performance monitoring apparatus.
A flowchart showing the configuration information registration and extraction process.
A diagram schematically showing the hardware configuration of the computer system in the performance monitoring apparatus.
A diagram showing the correlation in the third embodiment of the present invention.

Explanation of symbols

10: Performance monitoring apparatus
11: Web server
12: AP server
13: DB server
20: Configuration information management apparatus
101: Storage server
102: Analysis server
1001: Monitoring data acquisition unit
1002: Monitoring data storage unit
1003: Abnormality detection unit
1004: Correlation extraction unit
1005: Correlation storage unit
1006: Failure detection/prediction unit
1007: Notification unit
1008: Monitoring target designation unit
1009: Event data generation unit
1200: Computer system
1201: CPU
1202: ROM
1203: RAM
1204: System bus
1205: Keyboard controller (KBC)
1206: CRT controller (CRTC)
1207: Disk controller (DKC)
1208: Network interface card (NIC)
1209: Keyboard (KB)
1210: CRT display (CRT)
1211: Hard disk (HD)
1212: Flexible disk (FD)
1220: LAN
2001: Configuration information registration unit
2002: Configuration information storage unit
2003: Configuration information extraction unit

Claims

A performance monitoring apparatus that monitors the performance of an information processing system in which a plurality of information processing devices operate in cooperation, comprising:
monitoring means for monitoring the operating status of the plurality of information processing devices and the data communication status of each communication line connecting the plurality of information processing devices; and
failure detection/prediction means for detecting a failure currently occurring in the information processing system, or predicting the possibility that a failure will occur in the information processing system in the future, based on monitoring data from the monitoring means.

The performance monitoring apparatus according to claim 1, wherein the failure detection/prediction means detects a failure currently occurring in the information processing system, or predicts the possibility that a failure will occur in the information processing system in the future, based on a plurality of types of monitoring data obtained by the monitoring means.

The performance monitoring apparatus according to claim 2, further comprising:
monitoring data storage means for storing monitoring data obtained by the monitoring means; and
correlation calculation means for reading the plurality of types of monitoring data from the monitoring data storage means and calculating a correlation between the plurality of types of monitoring data,
wherein the failure detection/prediction means detects a failure currently occurring in the information processing system based on the correlation between the plurality of types of monitoring data calculated by the correlation calculation means and the current plurality of types of monitoring data obtained by the monitoring means.

The performance monitoring apparatus according to claim 2, further comprising:
monitoring data storage means for storing monitoring data obtained by the monitoring means; and
correlation calculation means for reading the plurality of types of monitoring data from the monitoring data storage means and calculating a correlation between the plurality of types of monitoring data,
wherein the failure detection/prediction means predicts the possibility that a failure will occur in the information processing system in the future based on the correlation between the plurality of types of monitoring data calculated by the correlation calculation means and the transition, up to the present, of the plurality of types of monitoring data obtained by the monitoring means.

The performance monitoring apparatus according to claim 3, wherein the correlation calculation means calculates, based on the plurality of types of monitoring data read from the monitoring data storage means, the correlation for at least one of normal operation and abnormal operation of the information processing system, and the failure detection/prediction means uses the correlation for normal operation or the correlation for abnormal operation to detect a failure currently occurring in the information processing system or to predict the possibility that a failure will occur in the information processing system in the future.

The performance monitoring apparatus according to claim 3, wherein the failure detection/prediction means determines, from the type of correlation used to detect a failure currently occurring in the information processing system or to predict the possibility that a failure will occur in the information processing system in the future, the cause of the failure currently occurring in the information processing system or the cause of the failure that may occur in the information processing system in the future.
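
Determining a cause "from the type of correlation used" could, for instance, be implemented by attaching a cause label to each correlation and reporting the labels of those that are violated. The sketch below assumes linear normal-operation correlations; the `CorrelationRule` type, the metric names, and the cause strings are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CorrelationRule:
    """A normal-operation correlation y = slope * x + intercept with a cause label."""
    x_metric: str
    y_metric: str
    slope: float
    intercept: float
    tolerance: float  # allowed absolute deviation from the expected value
    cause: str        # cause reported when this correlation is violated

def diagnose(rules, sample):
    """Return the causes attached to every correlation the sample violates."""
    causes = []
    for rule in rules:
        expected = rule.slope * sample[rule.x_metric] + rule.intercept
        if abs(sample[rule.y_metric] - expected) > rule.tolerance:
            causes.append(rule.cause)
    return causes

# Hypothetical rules and a hypothetical current sample.
rules = [
    CorrelationRule("tx_per_sec", "heap_mb", 2.0, 100.0, 50.0,
                    "memory shortage relative to transaction count"),
    CorrelationRule("tx_per_sec", "server_b_requests", 0.5, 0.0, 20.0,
                    "uneven load distribution between servers"),
]
sample = {"tx_per_sec": 200, "heap_mb": 900, "server_b_requests": 100}
causes = diagnose(rules, sample)
```

Here only the heap correlation is violated, so only the memory-related cause is reported.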

The performance monitoring apparatus according to any one of claims 1 to 5, further comprising notification means for notifying of the failure currently occurring in the information processing system as detected by the failure detection/prediction means, or of the possibility, as predicted by the failure detection/prediction means, that a failure will occur in the information processing system in the future.

The performance monitoring apparatus according to claim 6, further comprising notification means for notifying of the failure currently occurring in the information processing system as detected by the failure detection/prediction means, or of the possibility, as predicted by the failure detection/prediction means, that a failure will occur in the information processing system in the future, together with the cause of the failure determined by the failure detection/prediction means.

The performance monitoring apparatus according to claim 1, further comprising:
configuration information storage means for storing configuration information on the relationships between the plurality of information processing devices in the information processing system in which the plurality of information processing devices operate in cooperation; and
monitoring target specifying means for specifying, based on the stored configuration information, a range to be monitored by the monitoring means,
wherein the monitoring means monitors the range specified by the monitoring target specifying means.
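
Specifying a monitoring range from configuration information can be pictured as a graph traversal. The sketch below assumes the configuration information is an adjacency mapping from each device to the devices it calls, and that the range is everything reachable from one entry point; `specify_monitoring_range` and the device names are hypothetical.

```python
from collections import deque

def specify_monitoring_range(config, entry_point):
    """Breadth-first walk over the configuration information.

    config: mapping of device name -> list of devices it connects to.
    Returns (devices, links): the devices and communication lines
    reachable from entry_point, i.e. the range to be monitored.
    """
    devices = {entry_point}
    links = set()
    queue = deque([entry_point])
    while queue:
        device = queue.popleft()
        for peer in config.get(device, ()):
            links.add((device, peer))
            if peer not in devices:
                devices.add(peer)
                queue.append(peer)
    return devices, links

# Hypothetical three-tier system plus an unrelated batch server.
config = {
    "web": ["app1", "app2"],
    "app1": ["db"],
    "app2": ["db"],
    "batch": ["db"],
}
devices, links = specify_monitoring_range(config, "web")
```

In this example the batch server is left out of the range because nothing reachable from the web server connects to it.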

The performance monitoring apparatus according to claim 1, further comprising event data storage means for storing event data relating to an event that has occurred in at least one of the information processing devices to be monitored, the communication lines connecting the information processing devices, and the environment surrounding the information processing devices,
wherein the monitoring means acquires the event data in addition to the operating status of the information processing devices and the data communication status of each communication line connecting the plurality of information processing devices, and stores the event data in the event data storage means.

The performance monitoring apparatus according to claim 1, further comprising event data generating means for generating event data based on the monitoring data acquired by the monitoring means,
wherein the event data generating means stores the generated event data in the event data storage means.
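
The claim leaves open how event data is derived from monitoring data. One simple possibility, assumed here purely for illustration, is to emit an event each time a metric rises above a threshold; `generate_events`, the event fields, and the threshold value are hypothetical.

```python
def generate_events(samples, metric, threshold):
    """Generate one event each time the metric rises above the threshold.

    samples: list of (timestamp, value) pairs in time order.
    Only the upward crossing produces an event, so a metric that stays
    high does not flood the event data store.
    """
    events = []
    above = False
    for timestamp, value in samples:
        if value > threshold and not above:
            events.append({
                "time": timestamp,
                "metric": metric,
                "type": "threshold_exceeded",
                "value": value,
            })
        above = value > threshold
    return events

# Hypothetical CPU utilisation samples in percent.
cpu_samples = [(0, 50), (1, 95), (2, 97), (3, 60), (4, 92)]
events = generate_events(cpu_samples, "cpu_percent", 90)
```

The returned list could then be appended to whatever structure plays the role of the event data storage means.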

The performance monitoring apparatus according to claim 10 or 11, wherein the failure detection/prediction means predicts the possibility that a failure will occur in the information processing system in the future based on a correlation relating to the event data stored in the event data storage means.

A performance monitoring method performed by a performance monitoring device that monitors the performance of an information processing system in which a plurality of information processing devices operate in cooperation, the method comprising:
a monitoring step of monitoring the operating status of the plurality of information processing devices and the data communication status of each communication line connecting the plurality of information processing devices; and
a failure detection/prediction step of detecting a failure currently occurring in the information processing system, or predicting the possibility that a failure will occur in the information processing system in the future, based on monitoring data obtained in the monitoring step.

A program for causing a computer to execute the performance monitoring method according to claim 13.