JP2004348640A

JP2004348640A - Method and system for managing network

Info

Publication number: JP2004348640A
Application number: JP2003147663A
Authority: JP
Inventors: Hajime Hirose; 肇広瀬
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2003-05-26
Filing date: 2003-05-26
Publication date: 2004-12-09

Abstract

PROBLEM TO BE SOLVED: To provide a network management system that can accurately identify the cause of trouble even when operation information is partly missing.

SOLUTION: The network management system comprises: acquisition units 121, 131, and 141 that acquire operation information of each monitored machine 150, 160, 170, and 180; an operation information database 108 that stores, for each monitored machine, the operation information acquired by the acquisition units; a status database 109 that stores, for each monitored computer, status information indicating whether the operation information of each monitored machine is missing; and an operation information analysis unit 101 that, based on the status information stored in the status database and the operation information of the monitored machines stored in the operation information database, identifies by correlation analysis the monitored machines that are candidate causes of a change in the operation rate of a specific monitored machine, and displays the result.

COPYRIGHT: (C)2005,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明はネットワーク管理システム及びネットワーク管理方法にかかり、特にコンピュータネットワークに発生したトラブルの原因を特定することのできるネットワーク管理システム及びネットワーク管理方法に関する。
【０００２】
【従来の技術】
近年、インターネット上で構築されているＷｅｂシステム等の安定運用が切望されている。Ｗｅｂシステム等を構成するコンピュータネットワークシステムは、通常複数のネットワーク機器で構成されており、例えば特許文献１に示されるように、種々の専用ソフトウェアを用いて稼動情報等を取得することによりその状態を監視している。例えば、Ｗｅｂページのレスポンスが低下した場合においては、管理者は、その原因を前記稼働情報をもとに手作業で調べている。
【０００３】
また、特許文献２には、マルチメディアネットワークにおいて、標準的に取得可能なトラヒック情報を用い、運用サーバへの負荷等の影響を最小限に抑えて、特定サーバが提示する特定アプリケーションの性能劣化要因を分析可能にすることが示されている。
【０００４】
また、特許文献３には、管理対象となるシステムの構成要素間の関係について、稼働情報をもとに定量化することにより、性能のボトルネックや障害の原因となる構成要素を絞り込み、原因の特定を早期に実現できるようにしたものが示されている。
【０００５】
【特許文献１】
特開２００１−１４４７６１号公報
【０００６】
【特許文献２】
特開２００１−１９５２８５号公報
【０００７】
【特許文献３】
特開２００２−３４２１８２号公報
【０００８】
【発明が解決しようとする課題】
前記特許文献１の方法によれば、Ｗｅｂページのレスポンスが低下するというようなトラブルが発生した場合には、あらかじめ蓄積していた多数のネットワーク機器の稼動情報から、問題の根本原因となっている可能性のあるものを、手作業で調査しなければならない。このような作業は熟練したネットワーク管理者でないと困難な作業である。
【０００９】
また、特許文献２，３に示す方法では、取得する稼働情報等の情報に欠損が存在する場合前記トラブルの原因を正確に特定することは困難である。
【００１０】
本発明はこれらの問題点に鑑みてなされたもので、稼働情報に欠損がある場合においてもトラブルの原因を正確に特定することのできるネットワーク管理システム及びネットワーク管理方法を提供する。
【００１１】
【課題を解決するための手段】
本発明は、上記の課題を解決するために次のような手段を採用した。
【００１２】
各被監視マシンの稼働情報を取得する稼働情報取得部と、前記可動情報取得部が取得した被監視マシンの稼働情報を各被監視マシン毎に格納する稼働情報データベースと、各被監視マシン毎の稼働情報の欠損の有無を表すステータス情報を各監視対象計算機毎に格納するステータスデータベースと、前記稼働情報データベースに格納した被監視マシンの稼働情報及びステータスデータベースに格納したステータス情報をもとに特定の被監視マシンの稼働率変化の原因となる被監視マシンの候補を相関分析により特定して表示する稼働情報分析部を備えた。
【００１３】
【発明の実施の形態】
以下、本発明の実施形態を添付図面を参照しながら説明する。図１は、本発明の実施形態にかかるネットワーク管理システムを説明する図である。稼動情報分析計算機１００は稼動情報採取計算機１２０が取得した稼動情報（ＣＰＵ利用率、メモリ利用率、Ｗｅｂページ応答時間等）を取得し、取得した稼動情報をもとに相関分析等を実施し、コンピュータシステムのトラブル（問題）の原因を探索する。
【００１４】
稼動情報分析計算機１００は、分析部１０１、稼動情報収集部１０５、画面表示部１０７、稼動情報データベース１０８、ステータスデータベース１０９を備える。分析部１０１は、実際に分析を実施する部署であり、相関分析部１０２、危険度計算部１０３、原因度計算部１０４を備える。稼動情報収集部１０５は、稼動情報採取計算機１２０から定期的に稼動情報を取得する。また、前記稼動情報が取得できない場合には、その旨を表すステータスを生成するステータス生成部１０６を備える。
【００１５】
画面表示部１０７は、分析対象の選択、分析結果の表示及び分析範囲の絞込み等の各種処理に対応した表示を行う。稼動情報データベース１０８は、稼動情報採取計算機１２０から定期的に取得した稼動情報を記憶しておく記憶手段である。ステータスデータベース１０９は、稼動情報採取計算機１２０から抽出した稼動情報に一部欠損があった場合に生成するステータス情報を記憶しておく記憶手段である。なお、稼動情報分析計算機１００は任意の数の稼動情報採取計算機１２０から稼動情報を取得することが可能である。
【００１６】
また、稼動情報採取計算機１２０は、コンピュータネットワーク上で実際に監視対象となる被監視マシン１５０の稼動情報を採取し、採取した稼動情報分析計算機１００からの稼動情報取得要求に答えて、稼動情報を送信する機能を持つ。また、稼動情報採取計算機１２０は、稼動情報取得部１２１、稼動情報採取ツール１２２を備える。稼動情報取得部１２１は、稼動情報採取ツール１２２が採取した稼動情報を取得し、稼動情報分析計算機に送信する。稼動情報採取ツール１２２は一般的な市販のネットワーク管理ツールであり、複数の被監視マシン１５０から稼動情報を採取する。
【００１７】
被監視マシン１５０はコンピュータネットワークを構成するネットワーク機器であり、一般的にはルータ、ハブ、スイッチ、ワークステーション、ＰＣ等が該当する。
【００１８】
図２は、本発明のネットワーク管理システムを適用するコンピュータネットワークの例を示す図である。この例では、Ｗｅｂショッピングモール等を実施する場合に構築される典型的なＷｅｂシステムの例である。
【００１９】
図に示すように、ネットワーク（ＷＡＮ）２１０を挟んで、クライアントＰＣ２２０とＷｅｂシステム２２０を構成する。Ｗｅｂシステム２００は、ファイヤウオール２０１、ルータ２０２、Ｗｅｂサーバ２０３、ＡＰ（アプリケーション）サーバ２０４、２０５、及びＤＢ（データベース）サーバ２０６、２０７、２０８等のネットワーク機器で構成する。
【００２０】
各ネットワーク機器の稼動情報は一般的には複数のネットワーク管理アプリケーションによって採取する。図の例の場合では、ネットワーク管理アプリケーションが設置されているマシン（ルータ２０２，サーバ２０４等）が稼動情報採取計算機１２０となる。
【００２１】
図３は、ステータス生成処理を説明する図である。ステータスは相関分析の欠点を補うために導入した手段であり、相関分析は、二つの異なる稼動情報を比較しその時系列データに相関性があるか（因果関係があるか）を調べる統計学的手法である。
【００２２】
相関分析は、その対象とするデータの一部に欠損がある場合には正確な相関分析を行うことができない。ネットワークを構成するネットワーク機器が、トラブルの発生により一時的に停止した場合、停止期間中には稼動情報が採取されなくなる。この場合、一時停止した前記ネットワーク機器あるいはサーバはネットワークのトラブルの原因である可能性が高いにもかかわらず相関分析の対象とすることができない。
【００２３】
本発明ではこの問題を解決するためにステータスという概念を導入している。ステータスは、稼動情報が採取できている期間には例えば「１」、稼動情報が採取できていない期間には例えば「０」を割り当て、全てのネットワーク機器に対してステータスを稼動情報とは別個に割り当てて蓄積しておく。そして、相関分析を行う際には、前記稼働情報の外にステータスを参照して行う。これにより、稼動情報が採取できなかったネットワーク機器（トラブルにより停止したネットワーク機器）に対しては、稼働情報に代えてステータスを参照することにより相関分析を実施することが可能となる。
【００２４】
ステータス生成処理は、稼動情報分析計算機１００の稼動情報収集部１０５のステータス生成部１０６で行われる。まず、ステップ３００において、稼動情報収集部１０５は稼動情報採取計算機１２０の稼動情報取得部１２１から稼動情報を取得する。ステップ３０１において、採取した稼動情報を時系列に調査し、その時間帯で稼動情報が取得できているかどうかを判定する。稼動情報が取得できていれば、ステップ３０２においてステータスを１としてステータスデータベース１０９に格納する。同時にステップ３０４において稼動情報自体を稼動情報データベース１０８に格納する。稼動情報が取得できていなければ、ステップ３０３においてステータスを０としてステータスデータベース１０９に格納する。この処理を収集した稼動情報がなくなるまで実施する。
【００２５】
図４は、ステータス情報のイメージを説明する図である。ネットワーク機器あるいはサーバの停止などにより、情報に欠損がある稼働情報４００を取得した場合、図３に示すステータス生成処理によりステータスを生成すると、ステータスデータ４０１が得られる。
【００２６】
ステータスは、前述のように監視対象期間のうち、稼動情報が採取できている期間には「１」、稼動情報が採取できていない期間には「０」を割り当てる。
【００２７】
図に示すように、分析対象とするネットワーク機器等の稼動情報４０２の変化が、ネットワーク機器等の停止に影響されている場合、ステータスデータ４０１を用いて、ステータスデータ４０１と稼動情報４０２との相関等を分析することにより、稼動情報４０２の変化の原因を特定することができる。
【００２８】
図５は、稼働率変化の原因となる被監視マシンの候補（原因候補）を相関分析により特定する処理を説明する図である。
【００２９】
まず、ステップ５００において、分析対象とする被監視マシンの稼動情報を選択し、分析の期間（時刻範囲）を決定する。この作業はネットワーク管理者が行う。前述したように、Ｗｅｂシステムの場合、一般的にはＷｅｂページのレスポンス時間等が分析対象となる。次に、ステップ５０１において、原因候補として調査するネットワーク機器の範囲を決定する。この作業もネットワーク管理者が行う。なお、確実に原因候補とならない要素はここで調査範囲から外しておく。
【００３０】
ステップ５０２以降は稼動情報分析計算機１００により自動的に行う。まず、ステップ５０２において原因候補として調査するネットワーク機器の稼動情報に対して、分析対象との相関分析を実施し、０から１の範囲の相関係数を計算する。この相関係数は大きいほど、分析対象との相関が高いことを表す。稼動情報に欠損がある場合には、ステップ５０３において、ステータスデータベースからステータス情報を取得し、分析対象の稼働率及びステータス情報をもとに相関分析を実施し、相関係数を計算する。ステップ５０４において、相関係数の上位幾つか（あらかじめ指定した値で、例えば１０個）の稼動情報を、分析対象とした稼動情報に影響を与えた原因候補と決定する。
【００３１】
図６は、原因候補の原因度の計算処理を説明する図である。まず、ステップ６００において、図４で示す処理により原因候補となったネットワーク機器の稼動情報のしきい値を読み込む。なお、しきい値はあらかじめネットワーク管理者が適切に設定しておく。ステップ６０１で原因候補となったネットワーク機器の稼動情報が、分析対象となった期間でどれくらいの期間しきい値を超えていたかを計算し、０〜１の範囲の危険割合を算出する。例えば、１時間中の１５分間、稼動情報がしきい値を超えていた場合、その稼動情報の危険割合は０．２５とする。ステップ６０２において、相関係数と危険割合から原因度を計算する。原因度は、（（相関係数×α）＋（危険割合×（１００−α）））／１００で計算し、０〜１の範囲とする。なお、αは重み付けの為の係数で０〜１００の任意の値を指定可能である。αを大きくすると、相関の高さを重要視し、αを小さくすると危険割合を重要視していることになる。
【００３２】
図７は、稼動情報データベース１０８の構成を説明する図である。稼動情報データベースは、各ネットワーク機器の稼動情報を所定時間毎に格納する。図に示すように、稼動情報として、Ｗｅｂページ応答時間、回線利用率、ＣＰＵ利用率、キャッシュヒット率等を格納する。なお、稼動情報が取得できていない場合は、値が無い状態となっている。
【００３３】
図８は、ステータスデータベース１０９の構成を説明する図である。ステータスデータベースは、各ネットワーク機器のステータスを格納する。ステータスは各ネットワーク機器毎に１つだけ存在し、一定の時間おきに０か１の値を格納する。
【００３４】
図９は、分析結果一覧画面９００のイメージ例を説明する図である。分析結果一覧画面９００は、稼動情報分析計算機１００の画面表示部１０７に表示する。分析結果一覧画面９００は、分析時刻ビュー９０１、分析対象ビュー９０２、分析結果ビュー９０３、グラフ表示ボタン９０４を備える。分析時刻ビュー９０１には、分析の対象範囲（期間）を時刻で表示する。分析対象ビュー９０３には分析対象として指定した稼動情報とその稼動情報を取得したネットワーク機器を表すマシン名を表示する。分析結果ビュー９０３には、最終的な分析結果が表示されるが、原因候補として可能性が高いもの（原因度が大きいもの）から順にリスト表示される。このリストの上位にある稼動情報が、分析対象としたＷｅｂページの応答時間が劣化したことの原因となるマシンの稼働情報である可能性が高いと考えらられる。分析結果ビュー９０３には順位、原因候補名、原因度、相関係数、危険割合が表示される。グラフ表示ボタン９０４を押すと、分析結果グラフ画面が表示される。
【００３５】
図１０は、分析結果グラフ画面１０００のイメージ例を説明する図である。分析結果グラフ画面１０００は、分析結果一覧画面９００のグラフ表示ボタン９０４を押すと、稼動情報分析計算機１００の画面表示部１０７に表示される。分析結果グラフ画面１０００は、グラフビュー１００１、グラフ要素ビュー１００２、分析時刻ビュー１００３を備える。グラフビュー１００１は、分析対象および原因候補となった稼動情報のグラフを同時に表示する。グラフ要素ビュー１００２は表示しているグラフのそれぞれがどの稼動情報のものであるかを表示する。分析時刻ビュー１００３には、分析の対象範囲を時刻で表示する。
【００３６】
以上説明したように、本実施形態によれば、コンピュータシステムにトラブルが発生した場合に、稼働情報に欠損がある場合においてもトラブルの原因を正確に特定し、その原因となっている可能性のあるネットワーク機器をネットワーク管理者に示すことができる。これにより、ネットワーク管理者は問題点の一次切り分け、及び復旧処理を速やかに行うことができる。なお、稼動情報分析計算機１００及び稼動情報採取計算機１２０が備える明細書記載の各機能はソフトウエアにより実現することができる。
【００３７】
また、前述のように監視対象期間を所定時間毎に分割し、各分割期間のうち、稼動情報が採取できている期間には「１」、稼動情報が採取できていない期間には「０」のステータスデータを割り当てたステータスデータを用いて相関分析を行う。このため、トラブルの発生により一時的に停止したネットワーク機器等（稼働率低下の原因である可能性が高い）を相関分析の対象とすることができる。
【００３８】
【発明の効果】
以上説明したように本発明によれば、稼働情報に欠損がある場合においてもトラブルの原因を正確に特定することのできるネットワーク管理システムを提供することができる。
【図面の簡単な説明】
【図１】本発明の実施形態にかかるネットワーク管理システムを説明する図である。
【図２】本発明のネットワーク管理システムを適用するコンピュータネットワークの例を示す図である。
【図３】ステータス生成処理を説明する図である。
【図４】ステータス情報のイメージを説明する図である。
【図５】稼働率変化の原因となる被監視マシンの候補を相関分析により特定する処理を説明する図である。
【図６】原因候補の原因度の計算処理を説明する図である。
【図７】稼動情報データベース１０８の構成を説明する図である。
【図８】ステータスデータベース１０９の構成を説明する図である。
【図９】分析結果一覧画面９００のイメージ例を説明する図である。
【図１０】分析結果グラフ画面１０００のイメージ例を説明する図である。
【符号の説明】
１０１稼働情報分析計算機
１０２相関分析部
１０３危険度計算部
１０４原因度計算部
１０５稼働情報収集部
１０６ステータス生成部
１０７画面表示部
１０８稼働情報データベース
１０９ステータスデータベース
１１０ネットワーク
１２０，１３０，１４０稼働情報採取計算機
１２１，１３１，１４１稼働情報取得部
１２２，１３２，１４２稼働情報採取ツール
１５０，１６０，１７０，１８０被監視マシン
２００Ｗｅｂシステム
２０１ファイヤウォール
２０２ルーター
２０３Ｗｅｂサーバ
２０４，２０５ＡＰサーバ
２０６，２０７，２０８ＤＢサーバ
２１０ネットワーク
２２０クライアントＰＣ

[0001]
[Technical Field of the Invention]
The present invention relates to a network management system and a network management method, and more particularly to a network management system and a network management method capable of identifying the cause of trouble that has occurred in a computer network.
[0002]
[Prior art]
In recent years, stable operation of Web systems and the like built on the Internet has been strongly desired. A computer network system constituting such a Web system usually comprises a plurality of network devices and, as shown for example in Patent Document 1, its state is monitored by acquiring operation information and the like using various kinds of dedicated software. For example, when the response of a Web page degrades, the administrator manually investigates the cause based on the operation information.
[0003]
Patent Document 2 discloses making it possible, in a multimedia network, to analyze the factors that degrade the performance of a specific application provided by a specific server, using traffic information that can be acquired by standard means while minimizing effects such as load on the operational server.
[0004]
Patent Document 3 discloses quantifying, based on operation information, the relationships among the components of the system to be managed, thereby narrowing down the components that cause performance bottlenecks or failures so that the cause can be identified early.
[0005]
[Patent Document 1]
JP 2001-144761 A
[0006]
[Patent Document 2]
JP 2001-195285 A
[0007]
[Patent Document 3]
JP 2002-342182 A
[0008]
[Problems to be Solved by the Invention]
According to the method of Patent Document 1, when trouble such as a degraded Web page response occurs, the items that may be the root cause of the problem must be found by manually examining the previously accumulated operation information of a large number of network devices. Such work is difficult for anyone other than a skilled network administrator.
[0009]
Further, with the methods described in Patent Documents 2 and 3, it is difficult to accurately identify the cause of the trouble when the acquired information, such as operation information, contains missing data.
[0010]
The present invention has been made in view of these problems, and provides a network management system and a network management method capable of accurately identifying the cause of trouble even when operation information is partly missing.
[0011]
[Means for Solving the Problems]
The present invention employs the following means in order to solve the above problems.
[0012]
The system comprises: an operation information acquisition unit that acquires operation information of each monitored machine; an operation information database that stores, for each monitored machine, the operation information acquired by the operation information acquisition unit; a status database that stores, for each monitored computer, status information indicating whether the operation information of each monitored machine is missing; and an operation information analysis unit that, based on the operation information of the monitored machines stored in the operation information database and the status information stored in the status database, identifies by correlation analysis, and displays, the monitored machines that are candidate causes of a change in the operation rate of a specific monitored machine.
[0013]
[Embodiments of the Invention]
Hereinafter, embodiments of the present invention will be described with reference to the accompanying drawings. FIG. 1 is a diagram illustrating a network management system according to an embodiment of the present invention. The operation information analysis computer 100 obtains the operation information (CPU utilization, memory utilization, Web page response time, etc.) collected by the operation information collection computer 120, performs correlation analysis and the like on that operation information, and searches for the cause of trouble (problems) in the computer system.
[0014]
The operation information analysis computer 100 includes an analysis unit 101, an operation information collection unit 105, a screen display unit 107, an operation information database 108, and a status database 109. The analysis unit 101 is the unit that actually performs the analysis, and includes a correlation analysis unit 102, a risk degree calculation unit 103, and a cause degree calculation unit 104. The operation information collection unit 105 periodically obtains operation information from the operation information collection computer 120. It also includes a status generation unit 106 that, when operation information cannot be obtained, generates a status indicating that fact.
[0015]
The screen display unit 107 provides the displays needed for various operations, such as selecting an analysis target, showing analysis results, and narrowing the analysis range. The operation information database 108 is a storage unit that stores the operation information periodically acquired from the operation information collection computer 120. The status database 109 is a storage unit that stores the status information generated when the operation information obtained from the operation information collection computer 120 is partly missing. The operation information analysis computer 100 can acquire operation information from any number of operation information collection computers 120.
[0016]
The operation information collection computer 120 collects operation information of the monitored machines 150 that are actually monitored on the computer network, and has a function of transmitting the collected operation information in response to an operation information acquisition request from the operation information analysis computer 100. The operation information collection computer 120 includes an operation information acquisition unit 121 and an operation information collection tool 122. The operation information acquisition unit 121 obtains the operation information collected by the operation information collection tool 122 and transmits it to the operation information analysis computer 100. The operation information collection tool 122 is an ordinary commercially available network management tool, and collects operation information from a plurality of monitored machines 150.
[0017]
The monitored machine 150 is a network device constituting the computer network, typically a router, hub, switch, workstation, PC, or the like.
[0018]
FIG. 2 is a diagram illustrating an example of a computer network to which the network management system of the present invention is applied. The example shows a typical Web system of the kind built to run a Web shopping mall or the like.
[0019]
As shown in the figure, a client PC 220 and a Web system 200 are arranged on either side of a network (WAN) 210. The Web system 200 comprises network devices such as a firewall 201, a router 202, a Web server 203, AP (application) servers 204 and 205, and DB (database) servers 206, 207, and 208.
[0020]
The operation information of each network device is generally collected by a plurality of network management applications. In the case of the example shown in the figure, the machine (router 202, server 204, etc.) on which the network management application is installed is the operation information collection computer 120.
[0021]
FIG. 3 is a diagram illustrating the status generation process. The status is a means introduced to compensate for a shortcoming of correlation analysis. Correlation analysis is a statistical method that compares two different pieces of operation information and examines whether their time-series data are correlated (whether there is a causal relationship).
[0022]
Correlation analysis cannot be performed accurately when part of the target data is missing. If a network device constituting the network stops temporarily because of trouble, no operation information is collected during the stoppage. In this case, the temporarily stopped network device or server cannot be included in the correlation analysis, even though it is highly likely to be the cause of the network trouble.
[0023]
The present invention introduces the concept of a status to solve this problem. The status takes, for example, the value “1” for periods in which operation information was collected and “0” for periods in which it was not. A status is assigned to every network device separately from its operation information and is accumulated. Correlation analysis is then performed with reference to the status in addition to the operation information. As a result, for a network device whose operation information could not be collected (a network device stopped by trouble), correlation analysis can still be carried out by referring to the status in place of the operation information.
[0024]
The status generation process is performed by the status generation unit 106 of the operation information collection unit 105 in the operation information analysis computer 100. First, in step 300, the operation information collection unit 105 obtains operation information from the operation information acquisition unit 121 of the operation information collection computer 120. In step 301, the collected operation information is examined in time order, and it is determined for each time slot whether operation information was obtained. If it was, the status is set to 1 in step 302 and stored in the status database 109, and at the same time the operation information itself is stored in the operation information database 108 in step 304. If it was not, the status is set to 0 in step 303 and stored in the status database 109. This process is repeated until all of the collected operation information has been processed.
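The loop in steps 300 to 304 can be illustrated with a minimal Python sketch. It assumes that the collected operation information arrives as a list with one entry per time slot and that a missing sample is represented by `None`. The function name `generate_status`, the device name, and the two dictionaries standing in for the operation information database 108 and the status database 109 are hypothetical and are not part of the specification.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory stand-ins for the operation information database 108
# and the status database 109; the specification does not define a storage format.
operation_db: Dict[str, List[Optional[float]]] = {}
status_db: Dict[str, List[int]] = {}


def generate_status(device: str, samples: List[Optional[float]]) -> List[int]:
    """Examine samples in time order (step 301) and record a status per slot."""
    statuses: List[int] = []
    stored: List[Optional[float]] = []
    for value in samples:
        if value is not None:
            statuses.append(1)    # step 302: information was obtained
            stored.append(value)  # step 304: keep the operation information
        else:
            statuses.append(0)    # step 303: information is missing
            stored.append(None)   # the slot holds no value
    operation_db[device] = stored
    status_db[device] = statuses
    return statuses


# Assumed example: CPU utilization with a two-slot outage in the middle.
cpu_samples = [35.0, 40.0, None, None, 38.0]
print(generate_status("db_server_206", cpu_samples))  # [1, 1, 0, 0, 1]
```

The status list has the same length as the sample list, so it can later be compared slot by slot with the operation information of the analysis target.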
[0025]
FIG. 4 is a diagram illustrating the concept of the status information. When operation information 400 that contains gaps, caused for example by the stoppage of a network device or server, is acquired, applying the status generation process shown in FIG. 3 yields the status data 401.
[0026]
As described above, within the monitoring period, the status is “1” for periods in which operation information was collected and “0” for periods in which it was not.
[0027]
As shown in the figure, when the change in the operation information 402 of the network device or the like under analysis is affected by the stoppage of a network device or the like, the cause of the change in the operation information 402 can be identified by analyzing, for example, the correlation between the status data 401 and the operation information 402.
[0028]
FIG. 5 is a diagram illustrating a process of identifying a monitored machine candidate (cause candidate) that causes a change in the operation rate by correlation analysis.
[0029]
First, in step 500, the operation information of the monitored machine to be analyzed is selected, and the analysis period (time range) is determined. This is done by the network administrator. As noted earlier, for a Web system the analysis target is typically the response time of a Web page or the like. Next, in step 501, the range of network devices to be investigated as cause candidates is determined. This is also done by the network administrator. Elements that certainly cannot be cause candidates are excluded from the scope of investigation at this point.
[0030]
Step 502 and the subsequent steps are performed automatically by the operation information analysis computer 100. First, in step 502, correlation analysis against the analysis target is performed on the operation information of each network device to be investigated as a cause candidate, and a correlation coefficient in the range of 0 to 1 is calculated. The larger the correlation coefficient, the higher the correlation with the analysis target. If the operation information contains gaps, then in step 503 the status information is obtained from the status database, correlation analysis is performed between the operation rate of the analysis target and the status information, and a correlation coefficient is calculated. In step 504, the operation information with the highest correlation coefficients (a number specified in advance, for example the top 10) is determined to be the cause candidates that affected the operation information under analysis.
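Steps 502 to 504 can be sketched as follows. This is only an illustration under stated assumptions: the specification does not name the correlation measure, so the sketch uses the absolute value of the Pearson correlation coefficient to obtain a value between 0 and 1. Missing samples are again represented by `None`, and the function names `pearson_abs` and `rank_candidates` and all sample data are hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def pearson_abs(x: Sequence[float], y: Sequence[float]) -> float:
    """Absolute Pearson correlation of two equal-length series (assumed measure)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation information
    return abs(cov / math.sqrt(var_x * var_y))


def rank_candidates(
    target: Sequence[float],
    candidates: Dict[str, Sequence[Optional[float]]],
    top_n: int = 10,
) -> List[Tuple[str, float]]:
    """Return the top_n candidates by correlation with the analysis target."""
    scores: List[Tuple[str, float]] = []
    for name, series in candidates.items():
        if any(v is None for v in series):
            # Step 503: the series has gaps, so correlate the status (1/0) instead.
            used = [0.0 if v is None else 1.0 for v in series]
        else:
            # Step 502: correlate the operation information itself.
            used = [float(v) for v in series]
        scores.append((name, pearson_abs(target, used)))
    # Step 504: keep the candidates with the highest coefficients.
    scores.sort(key=lambda item: item[1], reverse=True)
    return scores[:top_n]


# Assumed example: Web page response time rises while the DB server is down.
response_time = [1.0, 1.1, 5.0, 5.2, 1.0]
candidates = {
    "router_202_line_usage": [10.0, 20.0, 10.0, 20.0, 10.0],
    "db_server_206_cpu": [30.0, 32.0, None, None, 31.0],
}
ranked = rank_candidates(response_time, candidates, top_n=10)
for name, coefficient in ranked:
    print(f"{name}: {coefficient:.3f}")
```

In this example the stopped DB server ranks first even though its CPU figures are missing during the outage, which is the situation the status is meant to handle.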
[0031]
FIG. 6 is a diagram illustrating the process of calculating the degree of cause of a cause candidate. First, in step 600, the threshold value for the operation information of each network device that became a cause candidate through the process shown in FIG. 5 is read. The threshold value is set appropriately in advance by the network administrator. In step 601, the length of time for which the operation information of the cause candidate exceeded the threshold value within the analysis period is calculated, and a risk ratio in the range of 0 to 1 is derived. For example, if the operation information exceeded the threshold value for 15 minutes out of one hour, its risk ratio is 0.25. In step 602, the degree of cause is calculated from the correlation coefficient and the risk ratio as ((correlation coefficient × α) + (risk ratio × (100 − α))) / 100, which lies in the range of 0 to 1. Here, α is a weighting coefficient that can be set to any value from 0 to 100. Increasing α places more weight on the correlation; decreasing α places more weight on the risk ratio.
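The arithmetic of steps 601 and 602 can be checked with a short Python sketch. It assumes evenly spaced samples, so that the fraction of samples above the threshold equals the fraction of time above it. The function names, the sample values, and the choice of α = 60 are hypothetical.

```python
from typing import Sequence


def risk_ratio(samples: Sequence[float], threshold: float) -> float:
    """Step 601: fraction of the analysis period spent above the threshold."""
    if not samples:
        return 0.0
    over = sum(1 for value in samples if value > threshold)
    return over / len(samples)


def cause_degree(correlation: float, risk: float, alpha: float) -> float:
    """Step 602: weighted combination of correlation coefficient and risk ratio."""
    if not 0 <= alpha <= 100:
        raise ValueError("alpha must be between 0 and 100")
    return (correlation * alpha + risk * (100 - alpha)) / 100


# Assumed example: 60 one-minute CPU samples, 15 of them above a threshold of 90.
samples = [95.0] * 15 + [50.0] * 45
risk = risk_ratio(samples, threshold=90.0)
print(risk)  # 0.25, matching the 15-minutes-in-one-hour example

# With an assumed correlation coefficient of 0.8 and alpha = 60:
print(round(cause_degree(0.8, risk, alpha=60), 2))  # 0.58
```

Setting α to 100 makes the degree of cause equal to the correlation coefficient, and setting it to 0 makes it equal to the risk ratio.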
[0032]
FIG. 7 is a diagram illustrating the configuration of the operation information database 108. The operation information database stores the operation information of each network device at predetermined time intervals. As shown in the figure, Web page response time, line usage rate, CPU usage rate, cache hit rate, and the like are stored as operation information. Where operation information could not be acquired, the corresponding entry holds no value.
[0033]
FIG. 8 is a diagram illustrating the configuration of the status database 109. The status database stores the status of each network device. Only one status exists for each network device, and a value of 0 or 1 is stored at regular intervals.
[0034]
FIG. 9 is a diagram illustrating an example of the analysis result list screen 900. The analysis result list screen 900 is displayed on the screen display unit 107 of the operation information analysis computer 100. It includes an analysis time view 901, an analysis target view 902, an analysis result view 903, and a graph display button 904. The analysis time view 901 displays the analysis target range (period) as times. The analysis target view 902 displays the operation information designated as the analysis target and the machine name of the network device from which that operation information was acquired. The analysis result view 903 displays the final analysis result as a list ordered from the most likely cause candidate (the one with the highest degree of cause) downward. The operation information near the top of this list is considered highly likely to be the operation information of the machine responsible for the degraded response time of the Web page under analysis. The analysis result view 903 displays the rank, cause candidate name, degree of cause, correlation coefficient, and risk ratio. Pressing the graph display button 904 displays the analysis result graph screen.
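The ordering used in the analysis result view 903 can be sketched as a simple sort by degree of cause. The `CandidateResult` class, the function `build_result_rows`, and the numeric values below are hypothetical illustrations and are not taken from the figures.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CandidateResult:
    name: str            # cause candidate name
    cause_degree: float  # degree of cause
    correlation: float   # correlation coefficient
    risk_ratio: float    # risk ratio


def build_result_rows(
    results: List[CandidateResult],
) -> List[Tuple[int, str, float, float, float]]:
    """Order candidates by degree of cause, highest first, and attach a rank."""
    ordered = sorted(results, key=lambda r: r.cause_degree, reverse=True)
    return [
        (rank, r.name, r.cause_degree, r.correlation, r.risk_ratio)
        for rank, r in enumerate(ordered, start=1)
    ]


# Assumed example values.
rows = build_result_rows([
    CandidateResult("Router 202 line usage rate", 0.22, 0.20, 0.25),
    CandidateResult("DB server 206 status", 0.70, 0.99, 0.40),
    CandidateResult("AP server 204 CPU usage rate", 0.45, 0.50, 0.40),
])
for rank, name, degree, correlation, risk in rows:
    print(f"{rank}  {name}  {degree:.2f}  {correlation:.2f}  {risk:.2f}")
```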
[0035]
FIG. 10 is a diagram illustrating an example of an image of the analysis result graph screen 1000. When the graph display button 904 of the analysis result list screen 900 is pressed, the analysis result graph screen 1000 is displayed on the screen display unit 107 of the operation information analysis computer 100. The analysis result graph screen 1000 includes a graph view 1001, a graph element view 1002, and an analysis time view 1003. The graph view 1001 simultaneously displays a graph of operation information as an analysis target and a candidate for a cause. The graph element view 1002 displays which operation information each of the displayed graphs belongs to. The analysis time view 1003 displays the analysis target range by time.
[0036]
As described above, according to the present embodiment, when trouble occurs in a computer system, the cause of the trouble can be accurately identified even if the operation information is partly missing, and the network devices that may be the cause can be presented to the network administrator. As a result, the network administrator can quickly perform initial isolation of the problem and the recovery work. The functions of the operation information analysis computer 100 and the operation information collection computer 120 described in this specification can be realized by software.
[0037]
Further, as described above, the monitoring period is divided into intervals of a predetermined length, and correlation analysis is performed using status data in which “1” is assigned to each interval in which operation information was collected and “0” to each interval in which it was not. For this reason, a network device or the like that stopped temporarily because of trouble (and is therefore highly likely to be the cause of a drop in the operation rate) can be included in the correlation analysis.
[0038]
[Effects of the Invention]
As described above, according to the present invention, it is possible to provide a network management system capable of accurately identifying the cause of trouble even when operation information is partly missing.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating a network management system according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating an example of a computer network to which the network management system of the present invention is applied.
FIG. 3 is a diagram illustrating a status generation process.
FIG. 4 is a diagram illustrating an image of status information.
FIG. 5 is a diagram illustrating a process of identifying, by correlation analysis, a monitored machine candidate that causes a change in the operation rate.
FIG. 6 is a diagram illustrating a process of calculating the degree of cause of a cause candidate.
FIG. 7 is a diagram illustrating a configuration of an operation information database 108.
FIG. 8 is a diagram illustrating a configuration of a status database 109.
FIG. 9 is a diagram illustrating an example of an image of an analysis result list screen 900.
FIG. 10 is a diagram illustrating an example of an image of an analysis result graph screen 1000.
[Explanation of symbols]
101 Operation information analysis computer
102 Correlation analysis unit
103 Risk degree calculation unit
104 Cause degree calculation unit
105 Operation information collection unit
106 Status generation unit
107 Screen display unit
108 Operation information database
109 Status database
110 Network
120, 130, 140 Operation information collection computer
121, 131, 141 Operation information acquisition unit
122, 132, 142 Operation information collection tool
150, 160, 170, 180 Monitored machine
200 Web system
201 Firewall
202 Router
203 Web server
204, 205 AP server
206, 207, 208 DB server
210 Network
220 Client PC

Claims

A network management system comprising:
an operation information acquisition unit that acquires operation information of each monitored machine;
an operation information database that stores, for each monitored machine, the operation information of the monitored machines acquired by the operation information acquisition unit;
a status database that stores, for each monitored computer, status information indicating whether the operation information of each monitored machine is missing; and
an operation information analysis unit that, based on the operation information of the monitored machines stored in the operation information database and the status information stored in the status database, identifies by correlation analysis, and displays, candidate monitored machines that cause a change in the operation rate of a specific monitored machine.

The network management system according to claim 1,
wherein the analysis unit acquires operation information of a computer that is a candidate monitored machine causing a change in the operation rate of the specific monitored machine, and calculates a degree of risk based on the period during which the operation rate indicated by that operation information exceeds a predetermined threshold.

The network management system according to claim 2,
wherein the analysis unit calculates a degree of cause based on the correlation coefficient obtained by the correlation analysis and the degree of risk.

A network management method comprising:
a step of acquiring operation information of each monitored machine;
a step of storing the acquired operation information of the monitored machines in an operation information database for each monitored machine;
a step of storing, in a status database for each monitored computer, status information indicating whether the operation information of each monitored machine is missing; and
a step of identifying by correlation analysis, and displaying, candidate monitored machines that cause a change in the operation rate of a specific monitored machine, based on the operation information of the monitored machines stored in the operation information database and the status information stored in the status database.

The network management method according to claim 4,
further comprising acquiring operation information of a computer that is a candidate monitored machine causing a change in the operation rate of the specific monitored machine, and calculating a degree of risk based on the period during which the operation rate indicated by that operation information exceeds a predetermined threshold.