JP2007323193A - System, method and program for detecting abnormality of performance load

Info

Publication number: JP2007323193A
Application number: JP2006150447A
Authority: JP
Inventors: Yuichi Nakanishi; 佑一中西
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2006-05-30
Filing date: 2006-05-30
Publication date: 2007-12-13
Anticipated expiration: 2026-05-30
Also published as: JP4573179B2

Abstract

PROBLEM TO BE SOLVED: To efficiently detect a performance abnormality of a single server belonging to a group of servers.

SOLUTION: A system for detecting abnormality of performance load comprises: a server group 1, which is the group of servers whose performance is to be monitored; a performance data collecting part 2 that obtains performance data from the servers to be monitored and stores it in a storage device 6; a statistical calculation part 3 that retrieves from the storage device 6 the performance data meeting designated requirements and performs a statistical calculation process; an abnormality detecting part 4 that determines whether the data processed by the statistical calculation part 3 indicates an abnormal state of any of the servers to be monitored; an external output part 5 that notifies users when an abnormality is detected; the storage device 6, which stores server resource information 7 containing resource information of the servers to be monitored, server configuration information 8 containing information on the server group and the servers to be monitored, and performance data 9 collected from the servers to be monitored; an input control part 10 that records setting information input by users in the storage device 6; and an external input part 11 serving as an interface to the users.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a performance load abnormality detection system, and in particular provides a performance load abnormality detection system that detects a performance abnormality of an individual server belonging to a group of servers.

In current computer systems, an operation mode is sometimes adopted in which multiple servers with the same specifications are deployed and the load is distributed evenly among them to cope with high load. In such an operation mode, the deployed servers (hereinafter referred to as a server group) may no longer hold sufficient resources when the load rises sharply. This problem is often addressed by increasing the number of servers among which the load is distributed, thereby replenishing the server group's resources.

One conceivable method of detecting such a shortage of server group resources is to monitor the average value of the server group's performance data against a threshold. With this method, however, it is difficult to detect a load abnormality that occurs in a single server within the group, and a load abnormality in a single server may lead to a future shortage of server group resources.

One conceivable method of detecting a load abnormality in a single server is to periodically monitor the server's performance data against a threshold. However, because the scale of the performance data values differs depending on the type of resource (CPU, memory, disk, network, and so on), know-how is required to determine an appropriate threshold.

As a related technique, Japanese Patent Laid-Open No. 2005-165673 (Patent Document 1) discloses a performance monitoring system.
This performance monitoring system connects a plurality of performance monitoring devices and a management server device via a communication network. Each of the performance monitoring devices includes a performance monitoring unit that monitors the performance of one or more information processing devices connected to it, and a performance data transmission unit that transmits performance data indicating the monitoring result of the performance monitoring unit. The management server device includes a performance data receiving unit that receives the performance data from each of the performance monitoring devices.

Japanese Patent Laid-Open No. 2005-316808 (Patent Document 2) discloses a performance monitoring apparatus.
This performance monitoring apparatus comprises: a performance value collecting unit that collects performance values of a monitored device at a predetermined collection interval; an extraction unit that extracts, from the performance values collected by the performance value collecting unit during a predetermined sampling period, those performance values whose difference from their average value is equal to or less than an exclusion level value; a threshold calculation unit that calculates a threshold based on the performance values extracted by the extraction unit; and a determination unit that determines whether a performance problem has occurred in the monitored device based on the threshold calculated by the threshold calculation unit and the performance values collected by the performance value collecting unit.

Japanese Patent Laid-Open No. 2005-327261 (Patent Document 3) discloses a performance monitoring apparatus.
This performance monitoring apparatus monitors the performance of an information processing system in which a plurality of information processing apparatuses operate in cooperation. It comprises a monitoring unit that monitors the operation status of the information processing apparatuses and the data communication status of each communication line connecting them, and a failure detection/prediction unit that, based on the monitoring data from the monitoring unit, detects a failure currently occurring in the information processing system or predicts the possibility of a future failure in the information processing system.

Japanese Patent Laid-Open No. 5-94342 (Patent Document 4) discloses a computer performance monitoring apparatus.
This computer performance monitoring apparatus comprises a performance data collection mechanism that detects operation data of a central processing unit, a main storage unit, an input/output control unit, and a communication control unit; a performance information storage unit that receives and stores the operation data; and means for analyzing and displaying the operation data.

Patent Document 1: JP 2005-165673 A
Patent Document 2: JP 2005-316808 A
Patent Document 3: JP 2005-327261 A
Patent Document 4: JP 5-94342 A

An object of the present invention is to provide a performance load abnormality detection system that efficiently detects a performance abnormality of an individual server belonging to a group of servers.

In the following, means for solving the problem are described using, in parentheses, the reference numbers used in [Best Mode for Carrying Out the Invention]. These numbers are added to clarify the correspondence between the description in [Claims] and [Best Mode for Carrying Out the Invention]. However, these numbers must not be used to interpret the technical scope of the invention described in [Claims].

The performance load abnormality detection system of the present invention comprises: a server group (1) to which monitored servers belong; a storage device (6) that stores server resource information (7) containing resource information of the monitored servers, server configuration information (8) containing information on the server group (1) and the monitored servers, and performance data (9) collected from the monitored servers; a performance data collection unit (2) that acquires the performance data (9) from the monitored servers and stores it in the storage device (6); a statistical calculation unit (3) that retrieves from the storage device (6) the performance data (9) satisfying a specified condition and performs statistical calculation processing; an abnormality detection unit (4) that determines whether the data statistically processed by the statistical calculation unit (3) indicates an abnormal state of a monitored server; an external output unit (5) for notifying the user when an abnormality is detected; and an input control unit (10) that records the server resource information (7) and the server configuration information (8) set by the user in the storage device (6).

The performance data collection unit (2) refers to the server resource information (7) and the server configuration information (8) in the storage device (6) to identify the targets from which the performance data (9) is to be collected. It accesses the monitored servers based on the server configuration information (8), and issues a command or executes a function to acquire the performance data (9) for the target server resource based on the server resource information (7). It then records the performance data (9) in the storage device (6) together with a reference to the server resource information (7), a reference to the server configuration information (8), and the collection time.

The statistical calculation unit (3) retrieves from the storage device (6) the performance data (9) collected within a predetermined period up to the current time, specifying the server resource information (7) and the server group (1), and performs statistical processing on it. The abnormality detection unit (4) verifies the statistically processed data and, when it determines that there is an abnormality, notifies the user via the external output unit (5).

The statistical calculation unit (3) calculates a deviation value for each of the servers in the server configuration list using the following formula:

[Formula not reproduced in this text]
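
As a reference point for the formula mentioned above, the conventional definition of a deviation value (a standard score rescaled to a mean of 50 and a standard deviation of 10) is sketched below. This is an assumption, not the patent's confirmed expression; the symbols $x_i$, $\mu$, $\sigma$, $N$, and $T_i$ are introduced here for illustration only. The definition is at least consistent with the description's statement that the range from 30 to 70 covers about 95% of the data, since that range corresponds to two standard deviations on either side of the mean.

```latex
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad
\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \mu\right)^2}, \qquad
T_i = 50 + 10\,\frac{x_i - \mu}{\sigma}
```

Here $x_i$ would be the most recent performance value of server $i$ for one resource type, $N$ the number of servers in the server configuration list, and $T_i$ the deviation value of server $i$.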

The performance load abnormality detection method of the present invention comprises:
(a1) creating a server configuration list in which pairs of monitored server information and performance data (9) are listed for the servers belonging to the target server group (1);
(a2) acquiring a list of performance data (9) by specifying a monitoring period, the monitored server information, and server resource information (7), and creating a performance data (9) list;
(a3) checking whether at least one item of performance data (9) exists in the performance data (9) list;
(a4) when performance data (9) exists in the performance data (9) list, taking data out of the performance data (9) list in order from the oldest collection time, and reflecting the performance data (9) in the element of the server configuration list that corresponds to the server information referenced by the data taken out;
(a5) after the above operation has been completed for all data in the performance data (9) list, checking the server configuration list to confirm that there is no server in which the performance data (9) has not been reflected;
(a6) when it is confirmed that the performance data (9) has been reflected for all servers in the server configuration list, calculating a deviation value for each server;
(a7) after the calculation of deviation values for all servers is completed, transitioning to the processing of the abnormality detection unit (4); and
(a8) when there is even one server in which the performance data (9) has not been reflected, interrupting the processing because the statistical calculation result may be invalid.

Step (a4) comprises: (a41) reflecting the performance data (9) for all servers in the server configuration list; and (a42) for a server in which data has already been reflected, overwriting it with performance data (9) having newer time information.

Step (a6) comprises: (a61) calculating the deviation value using the following formula:

[Formula not reproduced in this text]

The performance load abnormality detection method of the present invention further comprises:
(b1) verifying, for the deviation values calculated for each server, whether there is any server whose deviation value is less than 30 or greater than 70;
(b2) when there is a server whose deviation value is less than 30 or greater than 70, notifying the user that the target server group (1) may be in an abnormal state; and
(b3) when there is no server whose deviation value is less than 30 or greater than 70, ending the processing on the assumption that the target server group (1) is in a normal state.

The first effect is that it is possible to detect an excessive load increase when some failure occurs in a server in the server group, as well as an abnormal load decrease when processing efficiency deteriorates because of a bottleneck caused by the failure.
The second effect is that it is not necessary to set a threshold for each resource when monitoring server performance. The reason is that using the deviation value standardizes the deviation of the performance data from its average value.
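
The second effect can be illustrated with a small sketch. It assumes the conventional deviation value definition (50 plus 10 times the standard score, using the population standard deviation), which is not confirmed by this text; the function name `deviation_values` and the sample readings are hypothetical. The same load pattern expressed on two very different scales yields the same deviation values, so a single pair of limits such as 30 and 70 can serve every resource type.

```python
from statistics import mean, pstdev


def deviation_values(values):
    """Return 50 + 10 * z-score for each value (population standard deviation)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Assumption: when all values are identical, every server is exactly average.
        return [50.0 for _ in values]
    return [50 + 10 * (v - mu) / sigma for v in values]


# Hypothetical readings from six servers in one group.
cpu_percent = [10.0, 10.0, 10.0, 10.0, 10.0, 40.0]        # CPU usage rate, percent
disk_bytes_per_s = [v * 1_000_000 for v in cpu_percent]   # same pattern, larger scale

cpu_scores = [round(t, 2) for t in deviation_values(cpu_percent)]
disk_scores = [round(t, 2) for t in deviation_values(disk_bytes_per_s)]
```

Both lists come out identical, with the sixth server scoring above 70 in each, even though the raw values differ by six orders of magnitude.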

A first embodiment of the present invention will be described below with reference to the accompanying drawings.
As shown in FIG. 1, the computer system of the present invention includes a performance monitoring target server group 1, a performance data collection unit 2, a statistical calculation unit 3, an abnormality detection unit 4, an external output unit 5, a storage device 6, server resource information 7, server configuration information 8, performance data 9, an input control unit 10, and an external input unit 11.

The performance monitoring target server group (hereinafter referred to as the server group) 1 is the server group to which the servers to be monitored belong. The performance data collection unit 2 acquires performance data from the monitoring target servers and stores the acquired performance data in the storage device 6. The statistical calculation unit 3 retrieves performance data satisfying a specified condition from the storage device 6 and performs statistical calculation processing. The abnormality detection unit 4 determines whether the data statistically processed by the statistical calculation unit 3 indicates an abnormal state of a monitoring target server. When an abnormality is detected, the external output unit 5 notifies the user to that effect. The storage device 6 stores the server resource information 7, the server configuration information 8, and the performance data 9. The server resource information 7 contains resource information of the monitoring target servers. The server configuration information 8 contains server information on the server group 1 and the monitoring target servers. The performance data 9 is the performance data collected from the monitoring target servers. The input control unit 10 records setting information input by the user in the storage device 6. The external input unit 11 is an interface for receiving operations from the user.

An example of the server resource information 7 will be described with reference to FIG. 2.
FIG. 2 shows "CPU usage rate", "free physical memory", "disk transfer rate", "packet transfer rate", and the like as items of server resource information. "CPU usage rate" indicates the ratio of the processing actually being executed to the processing capacity (processing limit) of the CPU; in other words, it indicates how busy the CPU is. "Free physical memory" indicates the free capacity obtained by subtracting the capacity in use from the maximum capacity of the physical memory. "Disk transfer rate" indicates the speed at which data is read or written. "Packet transfer rate" indicates the packet transfer rate when communication takes place between servers or between a monitoring target server and another device. In practice, the server resource information is not limited to these examples, and other commonly used server resource information can also be used.

An example of the server configuration information 8 will be described with reference to FIG. 3.
FIG. 3 shows, as server configuration information, combinations of "identification information for each server group" and "IP address of the monitoring target server". This makes it possible to determine which server belongs to which server group. In FIG. 3, "server group 1" and "server group 2" are shown as examples of the identification information for each server group; in practice, it is not limited to these examples. The "IP address of the monitoring target server" is likewise not limited to an IP address, and identification information unique to each monitoring target server (for example, a terminal name) may be used instead. Any identification information that can identify the monitoring target server will do.

An example of the performance data 9 will be described with reference to FIG. 4.
FIG. 4 shows, as performance data, combinations of "date and time of performance measurement", "IP address of the monitoring target server", "measurement item", and "measurement result". The "date and time of performance measurement" indicates the date and time at which the performance of the monitoring target server was measured, for example "2006/01/01 00:00:00". The "IP address of the monitoring target server" indicates the IP address of the monitoring target server; identification information that can identify the monitoring target server may be used instead of the IP address. The "measurement item" is one of the items listed as server resource information, such as "CPU usage rate", "free physical memory", "disk transfer rate", or "packet transfer rate". The "measurement result" is the measured value of that item.

Using the external input unit 11, the user sets the server resource information to be monitored (CPU usage rate, memory usage, and so on) and the server information of the servers to be monitored (IP address, host name). The input data is recorded in the storage device 6 through the input control unit 10.

The performance data collection unit 2 refers to the server resource information 7 and the server configuration information 8 in the storage device 6 to identify the targets from which performance data is to be collected. It accesses the monitoring target servers based on the server configuration information 8, and issues a command or executes a function to acquire performance data for the target server resource based on the server resource information 7. The acquired performance data is recorded in the storage device 6 together with a reference to the server resource information, a reference to the server information, and the collection time.
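
The collection step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: `PerformanceRecord`, `collect_performance_data`, and the `probe` callable (standing in for whatever command or function actually queries a server) are hypothetical names, and servers and resources are identified by plain strings.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class PerformanceRecord:
    collected_at: datetime  # collection time
    server: str             # reference to server configuration info (e.g. IP address)
    resource: str           # reference to server resource info (e.g. "cpu_usage")
    value: float            # measurement result


def collect_performance_data(
    servers: Iterable[str],
    resources: Iterable[str],
    probe: Callable[[str, str], float],
    now: Callable[[], datetime] = datetime.now,
) -> List[PerformanceRecord]:
    """Query every (server, resource) pair and return timestamped records."""
    resource_list = list(resources)  # iterated once per server
    records: List[PerformanceRecord] = []
    for server in servers:
        for resource in resource_list:
            # `probe` is a hypothetical stand-in for the command or function
            # that obtains the value from the monitored server.
            value = probe(server, resource)
            records.append(PerformanceRecord(now(), server, resource, value))
    return records
```

Storing the server and resource references alongside each value is what later allows the data to be selected by server group, resource type, and period.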

The statistical calculation unit 3 retrieves from the storage device 6 the performance data collected within a certain period up to the current time, specifying the server resource information and the server group 1, and performs statistical processing on it. The statistically processed data is verified by the abnormality detection unit 4, and when it is determined that there is an abnormality, the user is notified through the external output unit 5.

Next, the operation of the part shown as the statistical calculation unit 3 in FIG. 1 will be described with reference to the flowchart of FIG. 5.
(1) Step S11
The statistical calculation unit 3 first creates a server configuration list. As shown in FIG. 6, the server configuration list is a list of pairs of monitoring target server information and performance data for the servers belonging to the target server group 1. When the list is created, the performance data part of each list element contains no information. FIG. 6 shows an example of a server configuration list for a predetermined time span (for example, 5 minutes) and an example for which statistical calculation cannot be performed. In FIG. 6, the server configuration list holds combinations of "IP address of the monitoring target server" and "CPU usage rate". The example for which statistical calculation cannot be performed illustrates the case where a "CPU usage rate" value is empty (no data within the specified time). Any or all of "free physical memory", "disk transfer rate", and "packet transfer rate" may be included in the server configuration list together with, or instead of, "CPU usage rate".
(2) Step S12
Next, a list of performance data is acquired from the storage device 6 by specifying the period, the monitoring target server information for the servers belonging to the target server group 1, and the server resource information, and a performance data list is created.
(3) Step S13
Next, it is checked whether at least one item of performance data exists in the performance data list.
(4) Step S14
If performance data exists in the performance data list, data is taken out of the performance data list in order from the oldest collection time, and the performance data is reflected in the element of the server configuration list that corresponds to the server information referenced by the data taken out. This operation is performed for all data in the performance data list. For a server in which data has already been reflected, the data is overwritten with performance data having newer time information.
(5) Step S15
After the above operation has been completed for all data in the performance data list, the server configuration list is checked to confirm that there is no server in which performance data has not been reflected.
(6) Step S16
When it is confirmed that performance data has been reflected for all servers in the server configuration list, a deviation value is calculated for each server. The formula for calculating the deviation value is shown below.

[Formula not reproduced in this text]

(7) Step S17
After the calculation of deviation values for all servers is completed, the processing moves to the abnormality detection unit.
(8) Step S18
If there is even one server in which performance data has not been reflected, the processing is interrupted because the statistical calculation result may be invalid.
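
Steps S11 to S18 can be sketched in Python as follows. Several points are assumptions rather than statements from the source: the deviation value is taken to be the conventional 50 plus 10 times the standard score with the population standard deviation; an empty performance data list (step S13) and a zero standard deviation are handled as shown in the comments; and the names `Record` and `compute_deviation_values` are hypothetical. Step S12 (querying the storage device) is assumed to have been done by the caller.

```python
from datetime import datetime
from statistics import mean, pstdev
from typing import Dict, Iterable, List, Optional, Tuple

# (collection time, server identifier, measured value) for one resource type
Record = Tuple[datetime, str, float]


def compute_deviation_values(
    group_servers: Iterable[str], records: List[Record]
) -> Optional[Dict[str, float]]:
    """Return {server: deviation value}, or None when the calculation is aborted."""
    # S11: server configuration list with empty performance data slots.
    latest: Dict[str, Optional[float]] = {server: None for server in group_servers}

    # S13: assumption - with no performance data at all, abort.
    if not records:
        return None

    # S14: apply records oldest first, so newer data overwrites older data.
    for _, server, value in sorted(records, key=lambda r: r[0]):
        if server in latest:
            latest[server] = value

    # S15 / S18: abort if any server has no data in the period.
    if any(v is None for v in latest.values()):
        return None

    # S16: deviation value per server (assumed conventional definition).
    values = [v for v in latest.values() if v is not None]
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Assumption: identical values mean every server is exactly average.
        return {server: 50.0 for server in latest}
    # S17: the result is handed on to abnormality detection by the caller.
    return {server: 50 + 10 * (v - mu) / sigma for server, v in latest.items()}
```

Returning `None` for the aborted cases keeps "no result" distinct from "all servers normal", which matters because the source treats missing data as a reason not to trust the statistics.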

Next, the operation of the part shown as the abnormality detection unit 4 in FIG. 1 will be described with reference to the flowchart of FIG. 7.

(1) Step S21
The abnormality detection unit 4 first verifies, for the deviation values calculated for each server by the statistical calculation unit 3, whether there is any server whose value is less than 30 or greater than 70. The range of deviation values from 30 to 70 inclusive contains about 95% of all performance data, so performance data outside this range can be regarded as anomalous.
(2) Step S22
If there is a server whose value is less than 30 or greater than 70, the user is notified through the external output unit (5 in FIG. 1) that the target server group 1 may be in an abnormal state.
(3) Step S23
If there is no server whose value is less than 30 or greater than 70, the target server group 1 is regarded as being in a normal state, and the processing ends.
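
The detection flow of steps S21 to S23 can be sketched as below. The function name `detect_abnormal_servers`, the `notify` callable (standing in for the external output unit), and the message wording are hypothetical; only the limits of 30 and 70 come from the description.

```python
from typing import Callable, Dict, List

LOWER_LIMIT = 30.0  # deviation values below this are treated as abnormal
UPPER_LIMIT = 70.0  # deviation values above this are treated as abnormal


def detect_abnormal_servers(
    deviation_values: Dict[str, float],
    notify: Callable[[str], None],
) -> List[str]:
    """Return the servers outside [30, 70]; notify once if there are any."""
    # S21: look for servers strictly below 30 or strictly above 70.
    abnormal = sorted(
        server
        for server, score in deviation_values.items()
        if score < LOWER_LIMIT or score > UPPER_LIMIT
    )
    # S22: report a possible abnormal state of the group to the user.
    if abnormal:
        notify("Server group may be in an abnormal state: " + ", ".join(abnormal))
    # S23: otherwise the group is regarded as normal and nothing is sent.
    return abnormal
```

Two caveats are worth keeping in mind when using fixed limits like these. The "about 95%" figure holds when the values are roughly normally distributed. Also, if the deviation value is computed with the population standard deviation, a single server among N can reach at most 50 + 10·sqrt(N − 1), so in groups of five or fewer servers no server can ever exceed 70.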

In the present invention, the sensitivity of abnormality detection could be made customizable by allowing the user to set the deviation value thresholds. In addition, a stricter monitoring system can be realized by using the invention in combination with monitoring of the server group's average value.
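
A rough sketch of both ideas follows: user-adjustable deviation value limits, combined with a check on the group average. The function `monitor_group`, its parameters, and the alert wording are hypothetical. The sketch also assumes a non-empty group, a resource for which a higher average is worse (such as CPU usage rate), and the conventional deviation value definition with the population standard deviation.

```python
from statistics import mean, pstdev
from typing import Dict, List


def monitor_group(
    latest_values: Dict[str, float],
    group_mean_limit: float,
    lower: float = 30.0,
    upper: float = 70.0,
) -> List[str]:
    """Return alert messages for the group average and for outlier servers."""
    alerts: List[str] = []
    values = list(latest_values.values())
    mu = mean(values)
    sigma = pstdev(values)

    # Group-level check: the whole group is running short of resources.
    if mu > group_mean_limit:
        alerts.append(f"group mean {mu:.1f} exceeds limit {group_mean_limit:.1f}")

    # Per-server check with user-supplied limits; skipped when all values match.
    if sigma > 0:
        for server, value in sorted(latest_values.items()):
            score = 50 + 10 * (value - mu) / sigma
            if score < lower or score > upper:
                alerts.append(
                    f"{server} deviation value {score:.1f} outside [{lower:.0f}, {upper:.0f}]"
                )
    return alerts
```

Widening the band (for example 25 to 75) makes detection less sensitive, while narrowing it flags smaller departures from the group. The two checks complement each other: the average catches a group that is uniformly overloaded, which deviation values alone cannot see.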

FIG. 1 is a block diagram of a configuration example of the present invention.
FIG. 2 is a diagram illustrating an example of server resource information.
FIG. 3 is a diagram illustrating an example of server configuration information.
FIG. 4 is a diagram illustrating an example of performance data.
FIG. 5 is a flowchart illustrating statistical calculation of server group performance data.
FIG. 6 is a diagram illustrating an example of a server configuration list.
FIG. 7 is a flowchart illustrating abnormality detection for server group performance data.

Explanation of symbols

1 ... Performance monitoring target server group (server group)
2 ... Performance data collection unit
3 ... Statistical calculation unit
4 ... Abnormality detection unit
5 ... External output unit
6 ... Storage device
7 ... Server resource information
8 ... Server configuration information
9 ... Performance data
10 ... Input control unit
11 ... External input unit

Claims

A performance load abnormality detection system comprising:
a server group to which monitored servers belong;
a storage device that stores server resource information containing resource information of the monitored servers, server configuration information containing information on the server group and the monitored servers, and performance data collected from the monitored servers;
a performance data collection unit that acquires the performance data from the monitored servers and stores the performance data in the storage device;
a statistical calculation unit that retrieves, from the storage device, the performance data that satisfies a specified condition and performs statistical calculation processing on it;
an abnormality detection unit that determines whether the data statistically processed by the statistical calculation unit indicates an abnormal state of any of the monitored servers;
an external output unit that notifies the user when an abnormality is detected; and
an input control unit that records, in the storage device, the server resource information and the server configuration information set by the user.

In the performance load abnormality detection system according to claim 1,
The performance data collection unit refers to the server resource information and the server configuration information in the storage device to identify the targets for collecting the performance data, accesses the monitored server based on the server configuration information, issues a command or executes a function to obtain the performance data for the target server resource based on the server resource information, and records the performance data in the storage device together with a reference to the server resource information, a reference to the server configuration information, and the collection time.
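As an illustrative sketch of the collection described in this claim, the following assumes a simple in-memory layout. The names `PerformanceRecord`, `collect`, and `run_command`, and the dictionary shapes used for the server configuration information and server resource information, are hypothetical and do not come from the specification.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Dict, List


@dataclass(frozen=True)
class PerformanceRecord:
    resource_id: str       # reference to the server resource information
    server_id: str         # reference to the server configuration information
    collected_at: datetime  # collection time
    value: float           # the collected performance data


def collect(servers: Dict[str, str],
            resources: Dict[str, str],
            run_command: Callable[[str, str], float],
            storage: List[PerformanceRecord]) -> None:
    """Collect one value per (server, resource) pair and store it.

    servers:   server_id -> address (assumed shape of the configuration info)
    resources: resource_id -> command (assumed shape of the resource info)
    run_command: caller-supplied function that accesses the server at the
                 given address, runs the command, and returns a number.
    """
    for server_id, address in servers.items():
        for resource_id, command in resources.items():
            value = run_command(address, command)
            storage.append(PerformanceRecord(
                resource_id=resource_id,
                server_id=server_id,
                collected_at=datetime.now(timezone.utc),
                value=value,
            ))
```

Passing `run_command` in as a parameter keeps the sketch independent of how a real system would reach the servers, which the claim leaves open ("a command ... is issued or a function is executed").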

In the performance load abnormality detection system according to claim 1,
The statistical calculation unit designates the server resource information and the server group, retrieves from the storage device the performance data collected within a predetermined period before the current time, and performs statistical processing on it, and
the abnormality detection unit verifies the statistically processed data and notifies the user via the external output unit when it determines that there is an abnormality.

In the performance load abnormality detection system according to claim 1,
The statistical calculation unit calculates a deviation value for each of the servers in the server configuration list using the following formula:

(formula not reproduced in this text)
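The formula referred to in this claim does not appear in this text. As a hedged sketch only, the code below assumes the conventional definition of the deviation value (偏差値), namely 50 + 10 (x - μ) / σ, where x is a server's value and μ and σ are the mean and standard deviation over the servers in the group. Whether the patent uses the population or the sample standard deviation is not stated here; the population form is an assumption, as are the function name `deviation_values` and the choice to return 50 for every server when all values are equal.

```python
from statistics import mean, pstdev
from typing import Dict


def deviation_values(values: Dict[str, float]) -> Dict[str, float]:
    """Map each server to its deviation value, 50 + 10 * (x - mean) / sigma.

    Assumes the conventional definition and the population standard
    deviation; neither is confirmed by the text of the claim.
    """
    data = list(values.values())
    mu = mean(data)       # raises statistics.StatisticsError if empty
    sigma = pstdev(data)  # population standard deviation (assumption)
    if sigma == 0:
        # All servers report the same value: nobody deviates from the group.
        return {server: 50.0 for server in values}
    return {server: 50.0 + 10.0 * (x - mu) / sigma
            for server, x in values.items()}
```

Under this assumed definition, a deviation value below 30 or above 70 corresponds to a server more than two standard deviations from the group mean. With the population standard deviation, a single server can cross those limits only when the group has at least six servers.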

A performance load abnormality detection method comprising:
(A1) creating a server configuration list that lists, for each server belonging to the target server group, a pair of monitored server information and performance data;
(A2) specifying a monitoring period, the monitored server information, and server resource information, acquiring the matching performance data, and creating a performance data list;
(A3) confirming whether one or more items of performance data exist in the performance data list;
(A4) if performance data exists in the performance data list, extracting the data from the performance data list in order of collection time and reflecting each extracted item in the element of the server configuration list that corresponds to the server information referenced by that item;
(A5) after the above operation has been completed for all data in the performance data list, checking the server configuration list for any server in which no performance data has been reflected;
(A6) when it is confirmed that performance data has been reflected in all servers in the server configuration list, calculating a deviation value for each server;
(A7) after the deviation values have been calculated for all servers, transitioning to the processing of the abnormality detection unit; and
(A8) interrupting the process if there is even one server in which no performance data has been reflected, because the statistical calculation result may be invalid.
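Steps (A1) to (A5) and (A8) can be sketched as below. This is a minimal illustration under assumed data shapes: `Record`, `IncompleteDataError`, and `latest_values` are hypothetical names, the server configuration list is modeled as a dictionary, and the deviation value calculation of (A6) and the handoff of (A7) are left to the caller.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, NamedTuple, Optional


class Record(NamedTuple):
    server_id: str
    resource_id: str
    collected_at: datetime
    value: float


class IncompleteDataError(Exception):
    """Raised for (A8): at least one server has no performance data."""


def latest_values(group_servers: Iterable[str],
                  records: Iterable[Record],
                  resource_id: str,
                  now: datetime,
                  period: timedelta) -> Dict[str, float]:
    # (A1) Server configuration list: server -> performance data (none yet).
    config_list: Dict[str, Optional[float]] = {s: None for s in group_servers}

    # (A2) Performance data list for the period, the servers and the resource.
    data_list = [r for r in records
                 if r.resource_id == resource_id
                 and r.server_id in config_list
                 and now - period <= r.collected_at <= now]

    # (A3)/(A4) Reflect data in order of collection time, so that newer
    # data overwrites older data for the same server. An empty list
    # simply reflects nothing and is caught by the check below.
    for r in sorted(data_list, key=lambda rec: rec.collected_at):
        config_list[r.server_id] = r.value

    # (A5)/(A8) Interrupt if any server has no reflected data.
    missing = [s for s, v in config_list.items() if v is None]
    if missing:
        raise IncompleteDataError("no performance data for: " + ", ".join(missing))

    # Ready for the deviation value calculation of (A6).
    return {s: v for s, v in config_list.items() if v is not None}
```

Raising an exception for (A8) is a design choice of this sketch; it makes the interruption explicit, so that a caller cannot compute statistics over a partially populated group by accident.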

In the performance load abnormality detection method according to claim 5,
step (A4) comprises:
(A41) reflecting the performance data in all servers in the server configuration list; and
(A42) overwriting, for a server in which performance data has already been reflected, that data with performance data having newer time information.

In the performance load abnormality detection method according to claim 5 or 6,
step (A6) comprises:
(A61) calculating the deviation value using the following formula:

(formula not reproduced in this text)

In the performance load abnormality detection method according to any one of claims 5 to 7,
the method further comprises:
(B1) verifying whether, among the deviation values calculated for the servers, any server has a value less than 30 or greater than 70;
(B2) notifying the user that the target server group may be in an abnormal state when there is a server whose deviation value is less than 30 or greater than 70; and
(B3) ending the process, on the assumption that the target server group is in a normal state, when no server has a deviation value less than 30 or greater than 70.

A program for causing a computer to execute the performance load abnormality detection method according to any one of claims 5 to 8.