JP2014157412A - Event aggregation device, event aggregation method, and event aggregation program

Info

Publication number: JP2014157412A
Application number: JP2013026859A
Authority: JP
Inventors: Masahiro Ono (大野 允裕)
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2013-02-14
Filing date: 2013-02-14
Publication date: 2014-08-28

Abstract

PROBLEM TO BE SOLVED: To provide an event aggregation device, an event aggregation method, and an event aggregation program that make it easy to determine which failure caused an event occurring in a computer system, and that reduce the workload required to investigate and diagnose the event. SOLUTION: The event aggregation device includes: an information acquisition unit 1 for acquiring event history information, which shows the occurrence history of events indicating a failure or a sign of a failure in a computer system, and metric history information, which represents the usage state of resources in the computer system as a measured value for each metric; a storage unit 2 for storing a system sketch in which the state of each metric at the time of failure occurrence in the computer system is defined; and an aggregation unit 3 for aggregating, from the event history information, the events that occurred in time periods in which the measurement result of each metric shown by the metric history information matches the state of each metric shown by the system sketch, as an event group corresponding to the system sketch.

Description

The present invention relates to an event aggregation device, an event aggregation method, and an event aggregation program for a computer system composed of a plurality of components, and in particular to an event aggregation device, an event aggregation method, and an event aggregation program that aggregate, for each failure of the computer system, the events occurring in each part of each computer.

In the operation management of a computer system composed of a plurality of components, users must be able to use IT (Information Technology) services stably. The management server therefore has a function for verifying events that indicate a state defined as a failure, for example a state in which a user cannot receive an IT service normally, or a state in which the user may become unable to receive it normally.

For example, as described in Patent Document 1, a management server detects a plurality of events that indicate a failure of the computer system or a sign of a failure, and accumulates those events in a database. The management server also has an analysis function for analyzing the causal relationships among the events.

The analysis method in the management server uses rules, each of which takes a group of events related to a failure (related events) as its condition and an event expressing the cause of the failure (cause event) as its analysis result. The cause event is then extracted from a plurality of events that include the related events and the cause event.

The method described in Patent Document 1 calculates the occurrence rate of the related events defined in a rule's condition and uses the result as the certainty factor of the cause event that is the rule's analysis result. This method can therefore estimate the cause event even when not all of the related events defined in the rule's condition have occurred. The method described in Patent Document 2 addresses the case where a large number of events occur because of multiple failures: it lists, in order of the occurrence rate of the related events, the cause events of every rule for which some of the related events defined in its condition are present. The method described in Patent Document 3 treats the pattern of events in the same time period on the same day of the week as a rule, and detects event patterns that appear with a frequency different from that pattern.
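The certainty factor attributed to Patent Document 1 above can be illustrated with a minimal sketch. It assumes the certainty factor is simply the fraction of a rule's related events that were observed; the function name and the event names are hypothetical and do not come from the cited document.

```python
def certainty_factor(rule_related_events, observed_events):
    """Fraction of the rule's related events that actually occurred (assumed definition)."""
    rule = set(rule_related_events)
    if not rule:
        return 0.0
    return len(rule & set(observed_events)) / len(rule)


# Hypothetical rule condition: three related events point to one cause event.
rule_condition = {"disk_latency_high", "io_timeout", "app_retry"}
# Hypothetical observation: only two of the three related events occurred.
observed = {"disk_latency_high", "io_timeout", "unrelated_login"}

print(certainty_factor(rule_condition, observed))
```

Because the factor is a ratio, the cause event can still be ranked even when only part of the rule's condition is satisfied, which is the property the paragraph above describes.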

Patent Document 1: US Pat. No. 7,107,185
Patent Document 2: JP 2012-59063 A
Patent Document 3: Japanese Patent No. 4944391

The methods described in Patent Documents 1, 2, and 3 require rules that express the combinations of all related events caused by a failure to be defined in advance. However, as the variety and scale of the components of a computer system grow, it may still be possible to define combinations of some of the related events caused by a failure, but it becomes difficult to define combinations of all of them.

As a result, a related event that is caused by a failure but is not defined in a rule's condition may be treated as an event unrelated to that failure. The administrator therefore spends a large amount of effort investigating and diagnosing whether each related event not defined in a rule is an event of a new failure or a related event of an existing failure.

Accordingly, an object of the present invention is to provide an event aggregation device, an event aggregation method, and an event aggregation program that make it easy to determine which failure caused an event occurring in a computer system and that reduce the workload required to investigate and diagnose events.

An event aggregation device according to the present invention includes: an information acquisition unit that acquires, from a computer system, event history information indicating the occurrence history of events that indicate a failure or a sign of a failure in the computer system, and metric history information representing the usage state of resources in the computer system as a measured value for each metric; a storage unit that stores a system sketch defining the state of each metric at the time a failure occurs in the computer system; and an aggregation unit that extracts, from the event history information, the events that occurred in time periods in which the measurement result of each metric indicated by the metric history information matches the state of each metric indicated by the system sketch, and aggregates the extracted events as an event group corresponding to that system sketch.
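The aggregation step described above can be sketched as follows. This is only an illustration under stated assumptions: time periods are modeled as integer indices, a metric's state is one of "high", "low", or "normal" derived from hypothetical thresholds, and a system sketch is a mapping from metric to expected state. None of these representations, names, or values is specified in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    period: int      # index of the time period in which the event occurred
    message: str


def metric_state(value, lower, upper):
    """Classify a measured value against hypothetical abnormality thresholds."""
    if value > upper:
        return "high"
    if value < lower:
        return "low"
    return "normal"


def aggregate(events, metric_history, thresholds, sketch):
    """Return the events that occurred in periods whose metric states match the sketch.

    metric_history: {period: {metric_id: measured value}}
    thresholds:     {metric_id: (lower, upper)}
    sketch:         {metric_id: expected state at failure time}
    """
    matching_periods = {
        period
        for period, values in metric_history.items()
        if all(
            metric_id in values
            and metric_state(values[metric_id], *thresholds[metric_id]) == state
            for metric_id, state in sketch.items()
        )
    }
    return [event for event in events if event.period in matching_periods]


# Hypothetical data: only period 1 has high CPU usage with normal memory usage.
thresholds = {"cpu_usage": (5.0, 90.0), "memory_usage": (10.0, 80.0)}
sketch = {"cpu_usage": "high", "memory_usage": "normal"}
metric_history = {
    0: {"cpu_usage": 50.0, "memory_usage": 40.0},
    1: {"cpu_usage": 95.0, "memory_usage": 50.0},
    2: {"cpu_usage": 97.0, "memory_usage": 85.0},
}
events = [
    Event(0, "login failed"),
    Event(1, "request timeout"),
    Event(1, "queue overflow"),
    Event(2, "out of memory"),
]

event_group = aggregate(events, metric_history, thresholds, sketch)
print([event.message for event in event_group])
```

Note that the events are grouped by when they occurred relative to the metric states, not by matching event contents against a rule, so an event that no rule anticipates still lands in the group.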

An event aggregation method according to the present invention includes: acquiring, from a computer system, event history information indicating the occurrence history of events that indicate a failure or a sign of a failure in the computer system, and metric history information representing the usage state of resources in the computer system as a measured value for each metric; extracting, from the event history information, the events that occurred in time periods in which the measurement result of each metric indicated by the metric history information matches the state of each metric at the time a failure occurs in the computer system, as defined in a system sketch stored in a storage unit; and aggregating the extracted events as an event group corresponding to that system sketch.

An event aggregation program according to the present invention causes a computer to execute: a process of acquiring, from a computer system, event history information indicating the occurrence history of events that indicate a failure or a sign of a failure in the computer system, and metric history information representing the usage state of resources in the computer system as a measured value for each metric; and a process of extracting, from the event history information, the events that occurred in time periods in which the measurement result of each metric indicated by the metric history information matches the state of each metric at the time a failure occurs in the computer system, as defined in a system sketch stored in a storage unit, and aggregating the extracted events as an event group corresponding to that system sketch.

According to the present invention, it is easy to determine which failure caused an event occurring in a computer system, and the workload required to investigate and diagnose events can be reduced.

FIG. 1 is a block diagram showing the configuration of the first embodiment of the event aggregation system.
FIG. 2 is a block diagram showing the configuration of the first embodiment of the computer system.
FIG. 3 is a block diagram showing the configuration of the first embodiment of a computer included in the computer system.
FIG. 4 is a block diagram showing the configuration of the first embodiment of the management server.
FIG. 5 is an explanatory diagram showing an example of the system topology management table stored in the secondary storage device of the management server.
FIG. 6 is an explanatory diagram showing an example of the metric definition table stored in the secondary storage device of the management server.
FIG. 7 is an explanatory diagram showing an example of the operation topology management table stored in the secondary storage device of the management server.
FIG. 8 is an explanatory diagram showing an example of a system sketch.
FIG. 9 is an explanatory diagram showing an example of a system sketch.
FIG. 10 is an explanatory diagram showing an example of a system sketch expressing the state of a metric of a computer group.
FIG. 11 is an explanatory diagram showing an example of a system sketch expressing the state of a metric group of a computer group.
FIG. 12 is an explanatory diagram showing an example of the metric history table stored in the secondary storage device of the management server.
FIG. 13 is an explanatory diagram showing an example of the event history table stored in the secondary storage device of the management server.
FIG. 14 is an explanatory diagram showing an example of the system sketch management table stored in the secondary storage device of the management server.
FIG. 15 is an explanatory diagram showing an example of the event aggregation result table stored in the secondary storage device of the management server.
FIG. 16 is an explanatory diagram showing an example of the event aggregation result table stored in the secondary storage device of the management server.
FIG. 17 is a flowchart showing the system sketch generation processing executed by the system sketch generation module of the management server.
FIG. 18 is a flowchart showing the event aggregation processing executed by the event aggregation processing module of the management server.
FIG. 19 is an explanatory diagram showing an example of a display screen showing event aggregation results.
FIG. 20 is a block diagram showing the minimum configuration of the event aggregation device according to the present invention.
FIG. 21 is a block diagram showing another minimum configuration of the event aggregation device according to the present invention.

Embodiment 1.
A first embodiment of the present invention will be described below with reference to the drawings.

This embodiment is merely one example of realizing the present invention and does not limit its technical scope.

Components common to the drawings are given the same reference numerals.

In the description of this embodiment, some passages describe "software", a "program", or a "module" as the subject of an operation. These passages may be read as processing performed by a processor, because software, a program, or a module performs its defined processing, using a memory and a communication interface (communication control device), by being executed by a processor.

Processing described with a program or module as its subject may also be regarded as processing performed by a computer such as a management server, or by an information processing device. Part or all of a program may be realized by dedicated hardware. The various programs may be installed on each computer from a program distribution server or from storage media.

FIG. 1 is a block diagram showing the configuration of the first embodiment of the event aggregation system.

The following describes processing that aggregates events for each failure of the computer system.

The distributed system shown in FIG. 1 has a computer system 10, a management server 20, a management terminal 30, user terminals 40-1 to 40-n, and IP (Internet Protocol) switches 50-1 and 50-2. The computer system 10 and the user terminals 40-1 to 40-n are connected via a network 60-1. The computer system 10, the management server 20, and the management terminal 30 are connected via a network 60-2.

The computer system 10 receives file I/O requests from the user terminals 40-1 to 40-n and, in response, accesses a storage device such as a magnetic disk. The computer system 10 also generates a monitoring information log indicating the usage state of resources and the operating state of applications and the operating system.

The user terminals 40-1 to 40-n accept service execution requests from users and the like, and transmit file I/O requests to the computer system 10. They also receive the results of the file I/O requests executed by the computer system 10.

The management server 20 refers to the monitoring information log held by the computer system 10 and acquires a plurality of events that indicate a failure of the computer system 10 or a sign of a failure. The management server 20 aggregates those events for each failure and stores the aggregation result in a storage device.

The management terminal 30 communicates with the UI (User Interface) display processing module of the management server 20 via the network 60-2 and displays the information obtained through that communication on its own output device. An administrator or the like refers to the information displayed on the management terminal 30 and uses the input device of the management terminal 30 to perform various setting operations on the management server 20, the IP switches 50-1 and 50-2, and the computer system 10.

FIG. 2 is a block diagram showing the configuration of the first embodiment of the computer system 10.

The computer system 10 includes a front-end node 11 and processing nodes 12-1 to 12-n. In this embodiment, these nodes are connected via a network 14 that includes an IP switch 13. Although FIG. 2 shows one front-end node, there may be any number of front-end nodes.

The front-end node 11 is, for example, a computer such as a communication terminal, and is connected to the user terminals 40-1 to 40-n. It accepts service execution requests from the user terminals 40-1 to 40-n and forwards them to the processing nodes. It also forwards the results of services executed by the processing nodes 12-1 to 12-n to the user terminals 40-1 to 40-n.

The processing nodes 12-1 to 12-n are, for example, computers such as communication terminals, and are connected to the management server 20 and the management terminal 30. The processing nodes 12-1 to 12-n execute the processing corresponding to a service. They may be virtual computers constructed on a computer.

The computer system 10 is realized, for example, by a cloud computing system, a grid computing system, a parallel distributed computer, a supercomputer, a server computer, a personal computer, or any combination of these.

FIG. 3 is a block diagram showing the configuration of the first embodiment of a computer (front-end node or processing node) included in the computer system 10.

The computer 100 includes a communication I/F (interface) 110, a processor 120, a memory 130, and a secondary storage device 140. The computer 100 may also include an output device 150, such as a display, for outputting processing results, and an input device 160, such as a keyboard, through which an administrator or the like enters instructions. The components of the computer 100 are connected to one another via circuitry such as an internal bus.

The communication I/F 110 connects the computer 100 to a network.

The processor 120 is, for example, a CPU.

The memory 130 is, for example, a cache memory. In this embodiment, the memory 130 stores data processing software 131 and monitoring software 132.

The data processing software 131 executes part or all of the processing corresponding to a service. In this embodiment, the data processing software 131 consists of an application and an operating system.

The application uses a storage area provided by the operating system and performs data input and output on that storage area.

The operating system executes processing that makes the processor appear to the application as a plurality of logical processors. Likewise, it makes the memory appear as a plurality of logical memories, and the secondary storage device as a plurality of logical secondary storage areas.

The monitoring software 132 monitors the usage state of the resources of the computer 100 at a predetermined interval and stores the monitoring results in the monitoring information log. Here, the resources include the processor 120 or a logical processor, the memory 130 or a logical memory, the secondary storage device 140 or a logical secondary storage area, and the communication I/F 110.

The monitoring software 132 also monitors, at a predetermined interval, the operating state of the data processing software 131, that is, the application and the operating system, and stores the monitoring results in the monitoring information log.

The secondary storage device 140 is a storage device such as a hard disk drive. It is composed of semiconductor memory, a magnetic disk, or both. The secondary storage device 140 stores the monitoring information log.

The monitoring information log contains metric history information and event history information. These are used to update the metric history table and the event history table held by the management server 20, which are described later. The expression "xxx table" is used below, but it does not imply any particular data structure. Expressions such as "xxx list", "xxx database", or "xxx queue", or other expressions, may be used instead.

FIG. 4 is a block diagram showing the configuration of the first embodiment of the management server 20.

The management server 20 includes a communication I/F 21, a processor 22, a memory 23, and a secondary storage device 24. The management server 20 may also include an output device 25, such as a display, for outputting processing results, and an input device 26, such as a keyboard, through which an administrator or the like enters instructions. The output device 25 and the input device 26 may be separate devices or may be contained in a single device. The components of the management server 20 are connected to one another via circuitry such as an internal bus.

The communication I/F 21 connects the management server 20 to a network.

The processor 22 is, for example, a CPU.

The memory 23 is, for example, a cache memory.

The memory 23 stores a program control module 231, a configuration information acquisition module 232, a performance information acquisition module 233, an event acquisition module 234, a UI display processing module 235, a system sketch generation module 236, and an event aggregation processing module 237.

The secondary storage device 24 is a storage device such as a hard disk drive. It is composed of semiconductor memory, a magnetic disk, or both.

The secondary storage device 24 stores a system topology management table, a metric definition table, an operation topology management table, a metric history table, an event history table, a system sketch management table, and an event aggregation result table.

At a predetermined interval, the program control module 231 instructs the configuration information acquisition module 232 to acquire system topology management information from the computers and IP switches of the computer system to be managed.

At a predetermined interval, the program control module 231 also instructs the performance information acquisition module 233 to acquire metric history information from the computers and IP switches of the computer system to be managed.

At a predetermined interval, the program control module 231 also instructs the event acquisition module 234 to acquire event history information from the computers and IP switches of the computer system to be managed.

At a predetermined interval, the program control module 231 also instructs the system sketch generation module 236 to generate system sketch management information.

At a predetermined interval, the program control module 231 also instructs the event aggregation processing module 237 to update the event aggregation result information.
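The role of the program control module can be pictured with the following minimal sketch. It assumes a single shared interval and a fixed invocation order, neither of which is stated in the text; the class name, the module labels, and the returned strings are placeholders for illustration only.

```python
class ProgramControl:
    """Invokes registered modules once per cycle, in registration order (assumed order)."""

    def __init__(self):
        self._modules = []

    def register(self, name, task):
        # task is any zero-argument callable standing in for a module's processing.
        self._modules.append((name, task))

    def run_cycle(self):
        # One cycle stands for one expiry of the predetermined interval.
        return [(name, task()) for name, task in self._modules]


control = ProgramControl()
control.register("configuration information acquisition", lambda: "system topology management table updated")
control.register("performance information acquisition", lambda: "metric history table updated")
control.register("event acquisition", lambda: "event history table updated")
control.register("system sketch generation", lambda: "system sketch management table updated")
control.register("event aggregation processing", lambda: "event aggregation result table updated")

results = control.run_cycle()
for name, outcome in results:
    print(f"{name}: {outcome}")
```

In a long-running process, `run_cycle` would be called from a loop or timer at the chosen interval; that driver is omitted here so the sketch stays deterministic.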

The configuration information acquisition module 232 acquires system topology management information from the computers and IP switches of the computer system to be managed, and updates the system topology management table.

The performance information acquisition module 233 acquires metric history information from the computers and IP switches of the computer system to be managed, and updates the metric history table.

The event acquisition module 234 acquires event history information from the computers and IP switches of the computer system to be managed, and updates the event history table.

In response to a request from an administrator via the input device 26, the UI display processing module 235 displays the information stored in the secondary storage device 24 via the output device 25.

The system sketch generation module 236 refers to the system topology management table, the metric definition table, and the metric history table, executes the system sketch generation processing described later, and updates the system sketch management table.

The event aggregation processing module 237 refers to the system topology management table, the metric definition table, the operation topology management table, the metric history table, the event history table, the system sketch management table, and the event aggregation result table, executes the event aggregation processing described later, and updates the event aggregation result table.

Each module may be provided as a hardware module rather than as a software module stored in the memory. The processing performed by each module may be provided as one or more program codes, or a plurality of modules may be provided as a single program code. In the description, "module" may be read as "program".

The management server 20 may have a serial interface or an Ethernet (registered trademark) interface as the communication I/F 21, and the management terminal 30, which has a display, a keyboard, or a pointing device, may be connected to the communication I/F 21 as a display computer. The management server 20 can then display information by sending display information to the display computer, and accept input by receiving input information from the display computer.

FIG. 5 is an explanatory diagram showing an example of the system topology management table stored in the secondary storage device 24 of the management server 20.

As shown in FIG. 5, the system topology management table contains information indicating a "system topology key ID", a "computer group ID", a "computer ID", and a "computer part ID".

The "system topology key ID" is the identifier of a record in the system topology management table. The "computer group ID" is the identifier of a computer group in the computer system managed by the management server 20. The "computer ID" is the identifier of a computer managed by the management server 20. The "computer part ID" is the identifier of a part that makes up the inside of a computer managed by the management server 20.

For example, the record in FIG. 5 whose system topology key ID is "STKeyID1" indicates that the computer group "NODEGrp1" includes the computer "NODE1" and that the computer "NODE1" has a processor.

図６は、管理サーバ２０の二次記憶装置２４に格納されたメトリック定義表の一例を示す説明図である。 FIG. 6 is an explanatory diagram showing an example of a metric definition table stored in the secondary storage device 24 of the management server 20.

The metric definition table holds information that defines the measurement items (metrics). Specifically, as shown in FIG. 6, it includes information indicating “metric key ID”, “metric group ID”, “computer part ID”, “metric ID”, “metric upper limit abnormal threshold”, and “metric lower limit abnormal threshold”.

「メトリックキーＩＤ」は、メトリック定義表のレコードの識別子である。「メトリックグループＩＤ」は、管理サーバ２０の管理対象となるコンピュータの監視対象となるメトリックのグループの識別子である。「コンピュータ部位ＩＤ」は、管理サーバ２０の管理対象となるコンピュータの内部を構成する部位の識別子である。「メトリック上限異常閾値」は、管理サーバ２０の管理対象となるコンピュータの内部を構成する部位の監視対象となるメトリックの値の正常範囲の上限を示す閾値である。「メトリック下限異常閾値」は、管理サーバ２０の管理対象となるコンピュータの内部を構成する部位の監視対象となるメトリックの値の正常範囲の下限を示す閾値である。 The “metric key ID” is an identifier of a record in the metric definition table. The “metric group ID” is an identifier of a metric group to be monitored by the computer to be managed by the management server 20. The “computer part ID” is an identifier of a part that constitutes the inside of the computer to be managed by the management server 20. The “metric upper limit abnormal threshold value” is a threshold value that indicates the upper limit of the normal range of the metric value that is the monitoring target of the part that constitutes the inside of the computer that is the management target of the management server 20. The “metric lower limit abnormality threshold value” is a threshold value that indicates the lower limit of the normal range of the metric value that is the monitoring target of the part that constitutes the inside of the computer that is the management target of the management server 20.

For example, the record whose metric key ID is “MKeyID1” in FIG. 6 indicates that the monitored metric belongs to the resource metric group and is the utilization rate of the processor per unit time. It also indicates that an upper limit abnormality is detected when the utilization rate of the processor per unit time exceeds 80%.

図７は、管理サーバ２０の二次記憶装置２４に格納された運用トポロジ管理表の一例を示す説明図である。 FIG. 7 is an explanatory diagram illustrating an example of an operation topology management table stored in the secondary storage device 24 of the management server 20.

運用トポロジ管理表は、図７に示すように、「運用トポロジキーＩＤ」、「運用ドメイン」、「運用障害区分」および「システムスケッチマップＩＤ（ＳＳＭａｐＩＤ）」を示す情報を含む。 As shown in FIG. 7, the operation topology management table includes information indicating “operation topology key ID”, “operation domain”, “operation failure classification”, and “system sketch map ID (SSMap ID)”.

「運用トポロジキーＩＤ」は、運用トポロジ管理表のレコードの識別子である。「運用ドメイン」は、管理サーバ２０の管理対象となるコンピュータシステムの運用領域の識別子である。「運用障害区分」は、管理サーバ２０の管理対象となるコンピュータシステムで発生する障害の識別子である。「ＳＳＭａｐＩＤ」は、当該運用障害区分における各コンピュータの各メトリックの状態を表すシステムスケッチの識別子である。以下、識別子が“ＳＳＭａｐＩＤｘ−ｘ”であるシステムスケッチを、システムスケッチ「ＳＳＭａｐＩＤｘ−ｘ」と表現する。 “Operation topology key ID” is an identifier of a record in the operation topology management table. “Operation domain” is an identifier of the operation area of the computer system to be managed by the management server 20. The “operation failure classification” is an identifier of a failure that occurs in the computer system that is the management target of the management server 20. “SSMapID” is an identifier of a system sketch representing the state of each metric of each computer in the operation failure category. Hereinafter, a system sketch whose identifier is “SSMapIDx-x” is expressed as a system sketch “SSMapIDx-x”.

A system sketch is a rule that expresses the state of the metrics when a failure occurs. Specifically, a system sketch defines the state of each metric at the time of a failure in each computer of the computer system managed by the management server 20. In the present embodiment, the metric states are “normal”, “upper limit abnormality”, “lower limit abnormality”, “all-state match”, and “not applicable”.

“正常”は、メトリックの値が正常範囲である状態を示す。 “Normal” indicates a state where the metric value is in a normal range.

“上限異常”は、メトリックの値が正常範囲の上限を示す閾値を超えた状態を示す。 “Upper limit abnormality” indicates a state in which the metric value exceeds a threshold value indicating the upper limit of the normal range.

“下限異常”は、メトリックの値が正常範囲の下限を示す閾値を下回った状態を示す。 “Lower limit abnormality” indicates a state in which the metric value falls below a threshold value indicating the lower limit of the normal range.

“All-state match” indicates that the metric value is in any one of the “normal”, “upper limit abnormality”, or “lower limit abnormality” states. In a system sketch, “all-state match” is represented by “*”.

“対象外”は、メトリックがルールの対象外であることを示す。なお、システムスケッチでは、“対象外”を“−”で表す。 “Not applicable” indicates that the metric is not subject to the rule. In the system sketch, “not applicable” is represented by “−”.
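As an illustration only, the three measured states defined above could be derived from a measured value and the thresholds of the metric definition table roughly as follows. This is a minimal sketch in Python, not the embodiment's implementation: the names `MetricState` and `classify` are hypothetical, and treating an absent threshold as "no limit on that side" is an assumption made here.

```python
from enum import Enum
from typing import Optional


class MetricState(Enum):
    NORMAL = "normal"
    UPPER_ABNORMAL = "upper limit abnormality"
    LOWER_ABNORMAL = "lower limit abnormality"


def classify(value: float,
             upper: Optional[float],
             lower: Optional[float]) -> MetricState:
    """Map a measured value to a state using the abnormal thresholds.

    Assumption: a threshold of None means that side is not checked.
    """
    # "Exceeds the upper threshold" is taken as strictly greater than.
    if upper is not None and value > upper:
        return MetricState.UPPER_ABNORMAL
    # "Falls below the lower threshold" is taken as strictly less than.
    if lower is not None and value < lower:
        return MetricState.LOWER_ABNORMAL
    return MetricState.NORMAL
```

With the 80% upper threshold given for “MKeyID1”, a measured value of 40 would be classified as “normal” and a value of 85 as “upper limit abnormality”.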

For example, the record whose operation topology key ID is “OTKeyID1” in FIG. 7 indicates that a “response degradation” failure in the “Application” operation domain corresponds to the state of each metric of each computer expressed by the system sketch “SSMapID1-1” or “SSMapID1-2”.

図８および図９は、システムスケッチの一例を示す説明図である。 8 and 9 are explanatory diagrams illustrating an example of a system sketch.

FIG. 8 shows rules that express the state of each metric of each computer when the response degrades in the “Application” operation domain.

The system sketch “SSMapID1-1” shown in FIG. 8 is a rule that expresses the state of each metric of the computer “NODE1” and of the computer “NODE2” when the response of the data processing software “SW1” of the computer system 10 managed by the management server 20 degrades. The system sketch “SSMapID1-2” shown in FIG. 8 is a rule that expresses the state of each metric of each computer when the response of the data processing software “SW2” of the computer system 10 managed by the management server 20 degrades.

For example, the system sketch “SSMapID1-1” shown in FIG. 8 indicates that, on the computer “NODE1”, the values of the metrics “MKeyID1”, “MKeyID2”, and “MKeyID3” are “normal”, the value of the metric “MKeyID4” is “upper limit abnormality”, and the metric “MKeyID5” is “not applicable”. It also indicates that, on the computer “NODE2”, the values of the metrics “MKeyID1”, “MKeyID2”, and “MKeyID3” are “normal”, the metric “MKeyID4” is “not applicable”, and the value of the metric “MKeyID5” is in any one of the “normal”, “upper limit abnormality”, or “lower limit abnormality” states.
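The rule just described for the system sketch “SSMapID1-1” can be written down as data and compared against observed metric states. The following Python sketch is illustrative only: the dictionary layout, the short state labels `"normal"`, `"upper"`, and `"lower"`, and the function `sketch_matches` are hypothetical, and the decision to treat a missing observation as a mismatch is an assumption not stated in the embodiment.

```python
from typing import Dict

# Rule cells: a concrete state, "*" (all-state match), or "-" (not applicable).
Sketch = Dict[str, Dict[str, str]]
# Observed cells: "normal", "upper", or "lower".
Observed = Dict[str, Dict[str, str]]

SSMAPID1_1: Sketch = {
    "NODE1": {"MKeyID1": "normal", "MKeyID2": "normal", "MKeyID3": "normal",
              "MKeyID4": "upper", "MKeyID5": "-"},
    "NODE2": {"MKeyID1": "normal", "MKeyID2": "normal", "MKeyID3": "normal",
              "MKeyID4": "-", "MKeyID5": "*"},
}


def sketch_matches(sketch: Sketch, observed: Observed) -> bool:
    """Return True when every applicable rule cell agrees with the observation."""
    for node, rules in sketch.items():
        for metric, rule in rules.items():
            if rule == "-":
                continue  # metric is outside the scope of the rule
            state = observed.get(node, {}).get(metric)
            if state is None:
                return False  # assumption: no measurement means no match
            if rule != "*" and rule != state:
                return False
    return True


observed_example: Observed = {
    "NODE1": {"MKeyID1": "normal", "MKeyID2": "normal",
              "MKeyID3": "normal", "MKeyID4": "upper"},
    "NODE2": {"MKeyID1": "normal", "MKeyID2": "normal",
              "MKeyID3": "normal", "MKeyID5": "lower"},
}
```

Here `sketch_matches(SSMAPID1_1, observed_example)` evaluates to `True`, because the only abnormal applicable metric is “MKeyID4” on “NODE1” and “MKeyID5” on “NODE2” is covered by “*”.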

図９は、“Ｉｎｆｒａｓｔｒｕｃｔｕｒｅ”の運用領域におけるリソース障害時における各コンピュータのそれぞれのメトリックの状態を表現するルールである。 FIG. 9 is a rule representing the state of each metric of each computer at the time of a resource failure in the “Infrastructure” operation area.

The system sketches “SSMapID2-1” and “SSMapID2-4” shown in FIG. 9 are rules that express the state of each metric of the computer “NODE1” and of the computer “NODE2” at the time of a processor (CPU) resource failure in the computer system managed by the management server 20. The system sketches “SSMapID2-2” and “SSMapID2-5” shown in FIG. 9 are rules that express the state of each metric of each computer at the time of a memory resource failure in the computer system managed by the management server 20. The system sketches “SSMapID2-3” and “SSMapID2-6” shown in FIG. 9 are rules that express the state of each metric of each computer at the time of a disk resource failure in the computer system managed by the management server 20.

For example, the system sketch “SSMapID2-1” shown in FIG. 9 indicates that, on the computer “NODE1”, the value of the metric “MKeyID1” is “upper limit abnormality”, the values of the metrics “MKeyID2”, “MKeyID3”, and “MKeyID4” are in any one of the “normal”, “upper limit abnormality”, or “lower limit abnormality” states, and the metric “MKeyID5” is “not applicable”. It also indicates that, on the computer “NODE2”, the values of the metrics “MKeyID1”, “MKeyID2”, “MKeyID3”, and “MKeyID5” are in any one of the “normal”, “upper limit abnormality”, or “lower limit abnormality” states, and the metric “MKeyID4” is “not applicable”.

なお、システムスケッチは、管理サーバ２０の管理対象となるコンピュータシステムで発生する障害時において、各コンピュータグループのそれぞれのメトリックの状態を表現するルールであってもよい。図１０は、コンピュータグループのメトリックの状態を表現するシステムスケッチの一例を示す説明図である。例えば、図１０に示すように、システムスケッチは、コンピュータグループ「ＮＯＤＥＧｒｐ１」におけるそれぞれのメトリックの状態を表現するルールであってもよい。この場合、図１０に示すシステムスケッチ「ＳＳＭａｐＩＤ１−３」は、図８に示すシステムスケッチ「ＳＳＭａｐＩＤ１−１」とシステムスケッチ「ＳＳＭａｐＩＤ１−２」とを表現するルールとなる。 The system sketch may be a rule that expresses the state of each metric of each computer group when a failure occurs in a computer system that is a management target of the management server 20. FIG. 10 is an explanatory diagram illustrating an example of a system sketch that represents a metric state of a computer group. For example, as illustrated in FIG. 10, the system sketch may be a rule expressing the state of each metric in the computer group “NODEGrp1”. In this case, the system sketch “SSMapID1-3” illustrated in FIG. 10 is a rule that expresses the system sketch “SSMapID1-1” and the system sketch “SSMapID1-2” illustrated in FIG.

さらに、システムスケッチは、管理サーバの管理対象となるコンピュータシステムで発生する障害時において、各コンピュータグループのそれぞれのメトリックグループの状態を表現するルールであってもよい。図１１は、コンピュータグループのメトリックグループの状態を表現するシステムスケッチの一例を示す説明図である。例えば、図１１に示すように、コンピュータグループ「ＮＯＤＥＧｒｐ１」における、メトリックグループ「リソース」と、メトリック「ＭＫｅｙＩＤ４」と、メトリック「ＭＫｅｙＩＤ５」との状態を表現するルールであってもよい。この場合、図１１に示すシステムスケッチ「ＳＳＭａｐＩＤ２−７」は、図９に示すシステムスケッチ「ＳＳＭａｐＩＤ２−１」からシステムスケッチ「ＳＳＭａｐＩＤ２−６」までのシステムスケッチを表現するルールとなる。 Furthermore, the system sketch may be a rule that expresses the state of each metric group of each computer group in the event of a failure that occurs in a computer system managed by the management server. FIG. 11 is an explanatory diagram showing an example of a system sketch expressing the state of a metric group of a computer group. For example, as shown in FIG. 11, the rule may represent a state of a metric group “resource”, a metric “MKeyID4”, and a metric “MKeyID5” in the computer group “NODEGrp1”. In this case, the system sketch “SSMapID2-7” shown in FIG. 11 is a rule expressing the system sketches from the system sketch “SSMapID2-1” to the system sketch “SSMapID2-6” shown in FIG.

図１２は、管理サーバ２０の二次記憶装置２４に格納されたメトリック履歴表の一例を示す説明図である。 FIG. 12 is an explanatory diagram illustrating an example of a metric history table stored in the secondary storage device 24 of the management server 20.

As shown in FIG. 12, the metric history table includes information indicating “metric history key ID”, “system topology key ID”, “metric key ID”, “measurement date and time”, “measured value”, and “system sketch generation process completed flag”.

「メトリック履歴キーＩＤ」は、メトリック履歴表のレコードの識別子である。「システムトポロジキーＩＤ」は、システムトポロジ管理表のレコードの識別子である。「メトリックキーＩＤ」は、メトリック定義表のレコードの識別子である。「計測日時」は、メトリックの計測日時である。「計測値」は、メトリックの計測値である。「システムスケッチ生成処理済みフラグ」は、メトリック履歴表レコードに対するシステムスケッチ生成処理が済んでいるか否かを示すフラグである。 The “metric history key ID” is an identifier of a record in the metric history table. “System topology key ID” is an identifier of a record in the system topology management table. The “metric key ID” is an identifier of a record in the metric definition table. “Measurement date and time” is the measurement date and time of the metric. “Measured value” is a measured value of a metric. The “system sketch generation process completed flag” is a flag indicating whether or not the system sketch generation process for the metric history table record has been completed.

For example, the record whose metric history key ID is “MHistory1” in FIG. 12 indicates that the measured utilization rate of the processor per unit time on the computer “NODE1” of the computer group “NODEGrp1” is “40%”, and that the system sketch generation process has been completed for the record. It also indicates that the measurement date and time is “2012/1/1 13:00:00”.

図１３は、管理サーバ２０の二次記憶装置２４に格納されたイベント履歴表の一例を示す説明図である。 FIG. 13 is an explanatory diagram illustrating an example of an event history table stored in the secondary storage device 24 of the management server 20.

As shown in FIG. 13, the event history table includes information indicating “event key ID”, “occurrence date and time”, “system topology key ID”, “event ID”, “event message”, and “event aggregation processing completed flag”.

「イベントキーＩＤ」は、イベント履歴表のレコードの識別子である。「発生日時」は、イベントの発生日時である。「システムトポロジキーＩＤ」は、当該イベントの発生個所に対応するシステムトポロジキーＩＤである。「イベントＩＤ」、「イベントメッセージ」は、当該イベントにより通知されるメッセージの識別子とその内容である。「イベント集約処理済みフラグ」は、当該イベントに対するイベント集約処理が済んでいるか否かを示すフラグである。 “Event key ID” is an identifier of a record in the event history table. “Occurrence date and time” is the occurrence date and time of the event. The “system topology key ID” is a system topology key ID corresponding to the occurrence location of the event. “Event ID” and “event message” are an identifier of the message notified by the event and its contents. The “event aggregation processing completed flag” is a flag indicating whether or not the event aggregation processing for the event has been completed.

For example, the record whose event key ID is “EventKeyID1” in FIG. 13 indicates that an event occurred at “2012/1/1 13:05:00” in the data processing software “SW1” of the computer “NODE1” of the computer group “NODEGrp1”, that its event ID is “SW1Event4”, and that its event message is “SW1 error”. It also indicates that the event aggregation process has been completed for the event. Hereinafter, an event history whose identifier is “EventKeyIDx” is referred to as the event history “EventKeyIDx”.

図１４は、管理サーバ２０の二次記憶装置２４に格納されたシステムスケッチ管理表の一例を示す説明図である。 FIG. 14 is an explanatory diagram showing an example of a system sketch management table stored in the secondary storage device 24 of the management server 20.

As shown in FIG. 14, the system sketch management table includes information indicating “RunTimeSystemSketch (RSS) key ID”, “metric key ID list”, “start date and time”, and “end date and time”.

「ＲＳＳキーＩＤ」は、システムスケッチ管理表のレコードの識別子である。「メトリック履歴キーＩＤリスト」は、システムスケッチ生成処理の対象となるメトリック履歴レコードの集合である。「開始日時」は、当該システムスケッチ生成処理の開始日時である。「終了日時」は、当該システムスケッチ生成処理の終了日時である。 “RSS key ID” is an identifier of a record in the system sketch management table. The “metric history key ID list” is a set of metric history records that are targets of system sketch generation processing. The “start date / time” is the start date / time of the system sketch generation process. The “end date / time” is the end date / time of the system sketch generation process.

For example, the record whose RSS key ID is “RSSMapKeyID1” in FIG. 14 indicates that the metrics for the time period from “2012/1/1 13:00:00” to “2012/1/1 13:09:59” are the metric history records “MHistory1” through “MHistory8”.

図１５および図１６は、管理サーバ２０の二次記憶装置２４に格納されたイベント集約結果表の一例を示す説明図である。 15 and 16 are explanatory diagrams illustrating an example of an event aggregation result table stored in the secondary storage device 24 of the management server 20.

イベント集約結果表は、図１５および図１６に示すように、「イベント集約グループキーＩＤ」、「運用トポロジキーＩＤ」、「イベントキーＩＤ」および「ステータス」を示す情報を含む。 As shown in FIGS. 15 and 16, the event aggregation result table includes information indicating “event aggregation group key ID”, “operation topology key ID”, “event key ID”, and “status”.

「イベント集約グループキーＩＤ」は、イベント集約結果表のレコードの識別子である。「運用トポロジキーＩＤ」は、運用トポロジ管理表のレコードの識別子である。「イベントキーＩＤ」は、イベント履歴表のレコードの識別子である。「ステータス」は、当該イベント集約グループに対する管理者の着手状態を示す。 “Event aggregation group key ID” is an identifier of a record in the event aggregation result table. “Operation topology key ID” is an identifier of a record in the operation topology management table. “Event key ID” is an identifier of a record in the event history table. “Status” indicates the start state of the administrator for the event aggregation group.

For example, the record whose event aggregation group key ID is “EventGroupKeyID1” in FIG. 15 indicates that the event group for response degradation in the “Application” operation domain consists of the events corresponding to the event histories “EventKeyID1”, “EventKeyID2”, and “EventKeyID3”. It also indicates that the administrator has not yet started work on the record. As shown in FIG. 13, the event corresponding to the event history “EventKeyID1” occurred in the data processing software “SW1” of the computer “NODE1” and has the message “SW1 error”. The event corresponding to the event history “EventKeyID2” occurred in the data processing software “SW1” of the computer “NODE1” and has the message “response abnormality”. The event corresponding to the event history “EventKeyID3” occurred in the data processing software “SW2” of the computer “NODE2” and has the message “SW2 error”.

次に、本実施形態の動作を説明する。 Next, the operation of this embodiment will be described.

図１７は、管理サーバ２０のシステムスケッチ生成モジュール２３６が実行するシステムスケッチ生成処理を示すフローチャートである。 FIG. 17 is a flowchart showing a system sketch generation process executed by the system sketch generation module 236 of the management server 20.

なお、本実施形態では、システムスケッチ生成モジュール２３６がシステムスケッチ生成処理を実行する前に、予め、管理者が、システムトポロジ管理表とメトリック定義表とを、それぞれ二次記憶装置２４に格納する。 In the present embodiment, before the system sketch generation module 236 executes the system sketch generation process, the administrator stores the system topology management table and the metric definition table in the secondary storage device 24 in advance.

システムスケッチ生成モジュール２３６は、システムトポロジ管理表から、管理対象となるコンピュータのすべてのコンピュータ部位ＩＤの一覧を取得する。また、システムスケッチ生成モジュール２３６は、メトリック定義表から、取得したコンピュータ部位ＩＤに一致するメトリックキーＩＤを取得する（ステップＳ１０１）。 The system sketch generation module 236 acquires a list of all computer part IDs of computers to be managed from the system topology management table. Further, the system sketch generation module 236 acquires a metric key ID that matches the acquired computer part ID from the metric definition table (step S101).

The system sketch generation module 236 acquires, from the metric history table, the metric history records whose system sketch generation process completed flag is “No” and which contain a metric key ID acquired in step S101. In doing so, the system sketch generation module 236 acquires at least one record for each metric key ID, starting from the oldest in chronological order, and updates the system sketch generation process completed flag of each acquired record to “Yes” (step S102).

From the metric history table as updated in step S102, the system sketch generation module 236 acquires the metric history records whose system sketch generation process completed flag is “No” and which contain a metric key ID acquired in step S101. From these records, it then selects the chronologically oldest record (step S103).

システムスケッチ生成モジュール２３６は、ステップＳ１０２で取得したメトリック履歴レコードの集合を、システムスケッチ生成処理の対象とする。つまり、当該メトリック履歴レコードの集合をシステムスケッチ管理レコードのメトリックキーＩＤリストに登録する。また、システムスケッチ生成モジュール２３６は、ステップＳ１０２で取得したメトリック履歴レコードの集合のうち最も時系列の古いレコードのメトリック計測日時を、システムスケッチ管理レコードのシステムスケッチ開始日時に登録する。また、ステップＳ１０３で取得したメトリック履歴レコードのメトリック計測日時の１秒前の日時を、システムスケッチ管理レコードのシステムスケッチ終了日時に登録する。システムスケッチ生成モジュール２３６は、システムスケッチ管理レコードを、レコードの識別子とともにシステムスケッチ管理表に格納する（ステップＳ１０４）。 The system sketch generation module 236 sets the set of metric history records acquired in step S102 as a target of the system sketch generation process. That is, the set of metric history records is registered in the metric key ID list of the system sketch management record. In addition, the system sketch generation module 236 registers the metric measurement date and time of the oldest chronological record in the set of metric history records acquired in step S102 as the system sketch start date and time of the system sketch management record. Also, the date and time one second before the metric measurement date and time of the metric history record acquired in step S103 is registered as the system sketch end date and time of the system sketch management record. The system sketch generation module 236 stores the system sketch management record in the system sketch management table together with the record identifier (step S104).
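Steps S102 to S104 above can be sketched as follows. This Python example is a simplified illustration under stated assumptions rather than the embodiment's implementation: the types `MetricRecord` and `SketchWindow` and the function `generate_window` are hypothetical, the oldest record is picked per pair of system topology key ID and metric key ID (the text says only "for each metric key ID"), and the end date and time is left as `None` when no later unprocessed record exists, a case the text does not cover.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple


@dataclass
class MetricRecord:
    history_id: str
    topology_key: str
    metric_key: str
    measured_at: datetime
    processed: bool = False  # "system sketch generation process completed flag"


@dataclass
class SketchWindow:
    history_ids: List[str]
    start: datetime
    end: Optional[datetime]


def generate_window(records: List[MetricRecord]) -> Optional[SketchWindow]:
    pending = sorted((r for r in records if not r.processed),
                     key=lambda r: r.measured_at)
    # S102: oldest unprocessed record per (topology key, metric key).
    picked: Dict[Tuple[str, str], MetricRecord] = {}
    for rec in pending:
        picked.setdefault((rec.topology_key, rec.metric_key), rec)
    if not picked:
        return None
    for rec in picked.values():
        rec.processed = True
    # S103: oldest record that is still unprocessed after the update.
    remaining = [r for r in pending if not r.processed]
    # S104: window starts at the oldest picked measurement and ends one
    # second before the next unprocessed measurement.
    start = min(r.measured_at for r in picked.values())
    end = remaining[0].measured_at - timedelta(seconds=1) if remaining else None
    return SketchWindow([r.history_id for r in picked.values()], start, end)
```

With measurements taken at 13:00:00 and 13:10:00, the first call would yield a window from 13:00:00 to 13:09:59, consistent with the example record “RSSMapKeyID1”.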

図１８は、管理サーバ２０のイベント集約処理モジュール２３７が実行するイベント集約処理を示すフローチャートである。 FIG. 18 is a flowchart showing the event aggregation processing executed by the event aggregation processing module 237 of the management server 20.

なお、イベント集約処理モジュール２３７がイベント集約処理を実行する前に、予め、管理者が、システムトポロジ管理表とメトリック定義表と運用トポロジ管理表とを、それぞれ二次記憶装置２４に格納する。 Before the event aggregation processing module 237 executes the event aggregation processing, the administrator stores the system topology management table, the metric definition table, and the operation topology management table in the secondary storage device 24 in advance.

イベント集約処理モジュール２３７は、イベント履歴表から、イベント集約処理済みフラグが“Ｎｏ”であるイベント履歴レコードの一覧を取得する（ステップＳ２０１）。 The event aggregation processing module 237 acquires a list of event history records whose event aggregation processing completed flag is “No” from the event history table (step S201).

イベント集約処理モジュール２３７は、ステップＳ２０１で取得したそれぞれのレコードに対して、ステップＳ２０３〜Ｓ２１０に示す処理を繰り返し実行する（ステップＳ２０２）。処理対象となるレコードが存在しない場合は、イベント集約処理を終了する。 The event aggregation processing module 237 repeatedly executes the processes shown in steps S203 to S210 for each record acquired in step S201 (step S202). If there is no record to be processed, the event aggregation process ends.

イベント集約処理モジュール２３７は、イベント集約結果表を参照し、各イベント集約グループのイベントキーＩＤリストに登録されているイベント履歴レコードの中に、処理対象のレコードと一致するものがあるか否かを確認する（ステップＳ２０３）。このとき、イベント集約処理モジュール２３７は、処理対象のレコードが、システムトポロジ管理表に登録されているシステムトポロジキーＩＤを含んでいて、且つ、イベント集約グループのイベントキーＩＤリストに登録されているイベント履歴レコードと同じイベントＩＤを含む場合に、処理対象のレコードと一致するイベント履歴レコードがあると判断する。 The event aggregation processing module 237 refers to the event aggregation result table, and determines whether there is an event history record registered in the event key ID list of each event aggregation group that matches the processing target record. Confirmation (step S203). At this time, the event aggregation processing module 237 includes the event whose record to be processed includes the system topology key ID registered in the system topology management table and is registered in the event key ID list of the event aggregation group. When the same event ID as the history record is included, it is determined that there is an event history record that matches the record to be processed.

一致するイベント履歴レコードがある場合は（ステップＳ２０４のＹＥＳ）、イベント集約処理モジュール２３７は、イベント集約結果表を更新する。具体的には、一致するイベント履歴レコードがあるイベント集約グループのイベントキーＩＤリストに、当該処理対象のレコードの識別子（イベントキーＩＤ）を追加する（ステップＳ２０５）。 If there is a matching event history record (YES in step S204), the event aggregation processing module 237 updates the event aggregation result table. Specifically, the identifier (event key ID) of the record to be processed is added to the event key ID list of the event aggregation group that has a matching event history record (step S205).

一致するイベント履歴レコードがない場合は（ステップＳ２０４のＮＯ）、イベント集約処理モジュール２３７は、システムスケッチ管理表から、処理対象のイベント履歴レコードの発生日時が含まれる、メトリック履歴キーＩＤリストを取得する（ステップＳ２０６）。 If there is no matching event history record (NO in step S204), the event aggregation processing module 237 acquires a metric history key ID list including the occurrence date and time of the event history record to be processed from the system sketch management table. (Step S206).
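The lookup in step S206 amounts to finding the system sketch management record whose time period contains the occurrence date and time of the event. A minimal Python sketch follows; the type `SketchWindow` and the function `find_window` are hypothetical names, and the inclusive comparison at both ends of the period is an assumption.

```python
from datetime import datetime
from typing import List, NamedTuple, Optional


class SketchWindow(NamedTuple):
    rss_key: str
    history_ids: List[str]
    start: datetime
    end: datetime


def find_window(windows: List[SketchWindow],
                occurred_at: datetime) -> Optional[SketchWindow]:
    """Return the first window whose [start, end] period contains the time."""
    for window in windows:
        if window.start <= occurred_at <= window.end:
            return window
    return None


# Values taken from the example records described for FIG. 13 and FIG. 14.
windows = [
    SketchWindow("RSSMapKeyID1",
                 [f"MHistory{i}" for i in range(1, 9)],
                 datetime(2012, 1, 1, 13, 0, 0),
                 datetime(2012, 1, 1, 13, 9, 59)),
]
hit = find_window(windows, datetime(2012, 1, 1, 13, 5, 0))
```

In this example `hit.rss_key` is `"RSSMapKeyID1"`, so the event that occurred at 13:05:00 would be evaluated against the metric history records “MHistory1” through “MHistory8”.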

The event aggregation processing module 237 acquires, from the operation topology management table, the SSMapID corresponding to each operation topology key ID. It then checks whether the metric state of each metric history record in the metric history key ID list matches the state indicated by the system sketch corresponding to the acquired SSMapID (step S207). For example, in the metric state indicated by the metric history records “MHistory1” through “MHistory8” of the system sketch management record “RSSMapKeyID1”, only the metric “MKeyID4” on the computer “NODE1” is in the “upper limit abnormality” state, so the metric state is judged to match the state indicated by the system sketch “SSMapID1-1”.

When the metric state of each metric history record matches the state indicated by a system sketch (YES in step S208), the event aggregation processing module 237 generates a new event aggregation result record whose elements are the identifier (operation topology key ID) of the operation topology in which the identifier (SSMapID) of that system sketch is registered and the identifier (event key ID) of the event history record being processed, and registers the new record in the event aggregation result table (step S209).

When the metric state of each metric history record does not match the state indicated by any system sketch (NO in step S208), the event aggregation processing module 237 registers, in the event aggregation result table, a new event aggregation result record whose elements are the operation topology key ID assigned to the operation topology whose failure classification is “other failure” (for example, “OTKeyID99”) and the identifier (event key ID) of the event history record being processed (step S210).

イベント集約処理モジュール２３７は、ステップＳ２０５の処理を実行後、またはステップＳ２０９の処理を実行後、またはステップＳ２１０の処理を実行後、ステップＳ２０３の処理に戻る（ステップＳ２１１）。 The event aggregation processing module 237 returns to the process of step S203 after executing the process of step S205, executing the process of step S209, or executing the process of step S210 (step S211).
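The loop of steps S202 to S211 can be summarized in the following Python sketch. It is a simplification under stated assumptions, not the embodiment's implementation: `Event`, `EventGroup`, and `aggregate` are hypothetical names, the metric lookup for an event's time period is abstracted into a caller-supplied `observe` function, each system sketch is abstracted into a predicate over the observed states, and the system topology key check of step S203 is omitted. As in the text, steps S209 and S210 each create a new event aggregation result record.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, Dict, List, Optional

OTHER_FAILURE_KEY = "OTKeyID99"  # example key the text gives for "other failure"
States = Dict[str, Dict[str, str]]


@dataclass
class Event:
    key: str
    event_id: str
    occurred_at: datetime


@dataclass
class EventGroup:
    operation_topology_key: str
    events: List[Event] = field(default_factory=list)


def aggregate(events: List[Event],
              groups: List[EventGroup],
              observe: Callable[[datetime], Optional[States]],
              sketches: Dict[str, List[Callable[[States], bool]]]
              ) -> List[EventGroup]:
    for event in events:  # S202: repeat for each unprocessed event record
        # S203/S204: look for a group already holding the same event ID.
        existing = next((g for g in groups
                         if any(e.event_id == event.event_id for e in g.events)),
                        None)
        if existing is not None:
            existing.events.append(event)  # S205
            continue
        # S206: metric states for the period containing the event time.
        observed = observe(event.occurred_at)
        key = OTHER_FAILURE_KEY  # S210 fallback
        if observed is not None:
            # S207/S208: first operation topology with a matching sketch.
            for topology_key, rules in sketches.items():
                if any(rule(observed) for rule in rules):
                    key = topology_key  # S209
                    break
        groups.append(EventGroup(key, [event]))
    return groups
```

A caller would pass the event history records whose event aggregation processing completed flag is “No” as `events`, and the current contents of the event aggregation result table as `groups`.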

図１９は、イベント集約結果を示す表示画面の一例を示す説明図である。 FIG. 19 is an explanatory diagram illustrating an example of a display screen showing an event aggregation result.

管理サーバ２０、具体的にはＵＩ表示処理モジュール２３５は、表示用計算機に表示用情報を送信して、表示用計算機が備えるディスプレイ等にイベント集約結果を示す表示画面を表示する。 The management server 20, specifically, the UI display processing module 235 transmits display information to the display computer, and displays a display screen showing the event aggregation result on a display or the like provided in the display computer.

図１９に示すように、イベント集約結果を示す表示画面には、イベント集約結果表に格納されたイベント集約結果が表示される。なお、管理サーバ２０は、イベント集約結果の表示画面のイベントリスト欄に、イベント履歴表のレコードの識別子に該当するイベント履歴情報のいずれかを表示してもよい。また、管理サーバ２０は、イベント集約結果の表示画面に、イベント履歴表のレコードの件数のみを表示する欄を設けてもよい。 As shown in FIG. 19, the event aggregation result stored in the event aggregation result table is displayed on the display screen showing the event aggregation result. The management server 20 may display any of the event history information corresponding to the record identifier of the event history table in the event list field of the event aggregation result display screen. In addition, the management server 20 may provide a column for displaying only the number of records in the event history table on the event aggregation result display screen.

As described above, in this embodiment the event aggregation processing aggregates and groups the events caused by a failure according to a rule (system sketch) that expresses the state of each metric when that failure occurs in the computer system managed by the management server. The event aggregation result is then presented to the administrator. The administrator can thus easily determine which failure an event belongs to, and the workload required for event investigation and diagnosis can be reduced.

In addition, the larger the computer system to be managed, the more likely it is that failures occur simultaneously at multiple locations. The effect of the present invention is therefore greater when it is applied to a large-scale computer system.

FIG. 20 is a block diagram showing the minimum configuration of the event aggregation device according to the present invention. FIG. 21 is a block diagram showing another minimum configuration of the event aggregation device according to the present invention.

As shown in FIG. 20, the event aggregation device (corresponding to the management server 20 shown in FIGS. 1 and 4) includes an information acquisition unit 1, a storage unit 2, and an aggregation unit 3. The information acquisition unit 1 (corresponding to the program control module 231, the performance information acquisition module 232, and the event acquisition module 234 stored in the memory 23 of the management server 20 shown in FIG. 4) acquires, from a computer system (corresponding to the computer system 10 shown in FIGS. 1 and 2), event history information indicating the occurrence history of events that indicate a failure or a sign of failure of the computer system, and metric history information representing the usage state of resources in the computer system as a measured value for each metric. The storage unit 2 (corresponding to the secondary storage device 24 of the management server 20 shown in FIG. 4) stores a system sketch that defines the state of each metric when a failure occurs in the computer system. The aggregation unit 3 (corresponding to the program control module 231 and the event aggregation processing module 237 stored in the memory 23 of the management server 20 shown in FIG. 4) extracts, from the event history information, the events that occurred in a time period in which the measurement result of each metric indicated by the metric history information matches the state of each metric indicated by the system sketch, and aggregates the extracted events as an event group corresponding to that system sketch.
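The role of the aggregation unit 3 in this minimum configuration can be sketched as follows, under stated assumptions: the metric history is already summarized as time periods `(start, end, states)`, a system sketch is a mapping from metric name to expected state, and events carry a single timestamp. The function name `aggregate_by_sketch` and all data shapes are hypothetical illustrations, not the structures used by the embodiment.

```python
def aggregate_by_sketch(events, periods, sketches):
    """Group events by the system sketch whose metric states match.

    events:   list of (timestamp, event_id) from the event history
    periods:  list of (start, end, states), where states maps a metric
              name to its measured state in the half-open span [start, end)
    sketches: dict mapping a sketch name to {metric name: expected state}
    """
    groups = {name: [] for name in sketches}
    for name, expected in sketches.items():
        for start, end, states in periods:
            # A period matches when every metric named in the sketch
            # has the expected state.
            if all(states.get(m) == s for m, s in expected.items()):
                groups[name].extend(
                    event_id for ts, event_id in events if start <= ts < end
                )
    return groups
```

For example, with a sketch `{"cpu_overload": {"cpu": "high"}}`, only events whose timestamps fall inside periods where `cpu` was measured as `"high"` are placed in the `cpu_overload` group.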

With this configuration, the events caused by a failure can be aggregated according to a rule that expresses the state of each metric when the failure occurs. It is therefore easy to determine which failure caused a given event, and the workload required for event investigation and diagnosis can be reduced.

The above embodiment also discloses the following event aggregation devices.

(1) As shown in FIG. 21, an event aggregation device including a display processing unit 4 (corresponding to the UI display processing module 235 stored in the memory 23 of the management server 20 shown in FIG. 4) that generates display information for displaying the aggregated event groups.

With this configuration, outputting the display information to a display computer or the like makes it possible to present the events caused by a failure in the computer system to the administrator. Because the events caused by a failure are aggregated into the event group whose metric states match those of the failure, and the aggregated result is displayed on the display computer or the like, the administrator can easily determine which failure an event belongs to, and the workload required for event investigation and diagnosis can be reduced.

(2) As shown in FIG. 21, an event aggregation device including a system sketch generation unit 5 (corresponding to the program control module 231 and the system sketch generation module 236 stored in the memory 23 of the management server 20 shown in FIG. 4) that generates, for each predetermined period, a system sketch management record including the set of metric history information acquired during that period, wherein, when the measurement result of each metric indicated by the set of metric history information included in a system sketch management record matches the state of each metric indicated by any one of the system sketches stored in the storage unit 2, the aggregation unit 3 aggregates the events that occurred in the time period in which that system sketch management record was generated as an event group corresponding to that system sketch.
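The per-period record generation in (2) can be sketched as a bucketing step followed by a match against a stored system sketch. The sketch assumes integer timestamps, a fixed period length, and a rule that the last observation of a metric within a period represents that period; these choices, and the names `build_management_records` and `events_in_matching_periods`, are hypothetical and are not specified by the embodiment.

```python
def build_management_records(metric_history, period):
    """Bucket metric history into one record per predetermined period.

    metric_history: iterable of (timestamp, metric name, state)
    Returns {period start: {metric name: state}}.
    """
    records = {}
    for ts, metric, state in metric_history:
        start = (ts // period) * period
        # Assumption: a later observation in the same period overwrites
        # an earlier one for the same metric.
        records.setdefault(start, {})[metric] = state
    return records


def events_in_matching_periods(records, period, sketch_states, events):
    """Collect events from periods whose record matches the system sketch."""
    matched = []
    for start, states in sorted(records.items()):
        if all(states.get(m) == s for m, s in sketch_states.items()):
            matched.extend(
                event_id for ts, event_id in events
                if start <= ts < start + period
            )
    return matched
```

Bucketing first means each period is compared against the stored system sketches once, rather than once per event, which is one plausible reading of why the record is generated per predetermined period.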

With this configuration, for a related event that has not yet been defined, it is possible to determine accurately whether it is an event of a new failure or a related event of an existing failure, so events can be aggregated more efficiently.

Description of Symbols

1 Information acquisition unit
2 Storage unit
3 Aggregation unit
4 Display processing unit
5 System sketch generation unit
10 Computer system
11 Front-end node
12-1 to 12-n Processing nodes
13 IP switch
14 Network
20 Management server
21 Communication I/F
22 Processor
23 Memory
231 Program control module
232 Configuration information acquisition module
233 Performance information acquisition module
234 Event acquisition module
235 UI display processing module
236 System sketch generation module
237 Event aggregation processing module
24 Secondary storage device
25 Output device
26 Input device
30 Management terminal
40-1 to 40-n User terminals
50-1, 50-2 IP switches
60-1, 60-2 Networks
100 Computer
110 Communication I/F
120 Processor
130 Memory
131 Data processing software
132 Monitoring software
140 Secondary storage device
150 Output device
160 Input device

Claims

An event aggregation device comprising:
an information acquisition unit that acquires, from a computer system, event history information indicating the occurrence history of events that indicate a failure or a sign of failure of the computer system, and metric history information representing the usage state of resources in the computer system as a measured value for each metric;
a storage unit that stores a system sketch defining the state of each metric when a failure occurs in the computer system; and
an aggregation unit that extracts, from the event history information, events that occurred in a time period in which the measurement result of each metric indicated by the metric history information matches the state of each metric indicated by the system sketch, and aggregates the extracted events as an event group corresponding to the system sketch.

The event aggregation device according to claim 1, further comprising: a display processing unit that generates display information for displaying the aggregated event group.

The event aggregation device according to claim 1 or 2, further comprising a system sketch generation unit that generates, for each predetermined period, a system sketch management record including the set of metric history information acquired during that predetermined period,
wherein, when the measurement result of each metric indicated by the set of metric history information included in the system sketch management record matches the state of each metric indicated by any one of the system sketches stored in the storage unit, the aggregation unit aggregates the events that occurred in the time period in which the system sketch management record was generated as an event group corresponding to that system sketch.

An event aggregation method comprising:
acquiring, from a computer system, event history information indicating the occurrence history of events that indicate a failure or a sign of failure of the computer system, and metric history information representing the usage state of resources in the computer system as a measured value for each metric; and
extracting, from the event history information, events that occurred in a time period in which the measurement result of each metric indicated by the metric history information matches the state of each metric at the time of failure of the computer system as defined in a system sketch stored in a storage unit, and aggregating the extracted events as an event group corresponding to the system sketch.

The event aggregation method according to claim 4, wherein screen information for displaying the aggregated event group is generated.

The event aggregation method according to claim 4 or 5, further comprising:
generating, for each predetermined period, a system sketch management record including the set of metric history information acquired during that predetermined period; and
when the measurement result of each metric indicated by the set of metric history information included in the system sketch management record matches the state of each metric indicated by any one of the system sketches stored in the storage unit, aggregating the events that occurred in the time period in which the system sketch management record was generated as an event group corresponding to that system sketch.

An event aggregation program that causes a computer to execute:
a process of acquiring, from a computer system, event history information indicating the occurrence history of events that indicate a failure or a sign of failure of the computer system, and metric history information representing the usage state of resources in the computer system as a measured value for each metric; and
a process of extracting, from the event history information, events that occurred in a time period in which the measurement result of each metric indicated by the metric history information matches the state of each metric at the time of failure of the computer system as defined in a system sketch stored in a storage unit, and aggregating the extracted events as an event group corresponding to the system sketch.

The event aggregation program according to claim 7, which causes the computer to execute a process of generating screen information for displaying the aggregated event group.

The event aggregation program according to claim 7 or 8, which causes the computer to execute:
a process of generating, for each predetermined period, a system sketch management record including the set of metric history information acquired during that predetermined period; and
a process of aggregating, when the measurement result of each metric indicated by the set of metric history information included in the system sketch management record matches the state of each metric indicated by any one of the system sketches stored in the storage unit, the events that occurred in the time period in which the system sketch management record was generated as an event group corresponding to that system sketch.