JP2013200899A

JP2013200899A - Operation management apparatus, and operation management method

Info

Publication number: JP2013200899A
Application number: JP2013143069A
Authority: JP
Inventors: Tsuyoshi Ishio; 堅石王
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2013-07-08
Filing date: 2013-07-08
Publication date: 2013-10-03
Anticipated expiration: 2029-02-12
Also published as: JP5459431B2

Abstract

PROBLEM TO BE SOLVED: To provide an operation management apparatus, method, and program for accurately detecting performance deterioration of a system. SOLUTION: An operation management apparatus comprises: normal model distribution storage means for holding the range of the distribution of correlation breakdowns of a normal model, observed when the system operates normally; failure model distribution storage means for holding a failure model, observed when an abnormality occurs in the system; correlation change distribution determination means for determining whether the breakdown distribution of each correlation model of the performance information falls within the range of the breakdown distribution of the normal model; correlation change history storage means for holding each breakdown distribution determined not to fall within that range; and correlation breakdown increase determination means for determining whether the breakdown distributions held by the correlation change history storage means tend to approach the breakdown distribution of the failure model held by the failure model distribution storage means.

Description

The present invention relates to an information processing apparatus that provides an information communication service such as a WEB service or a business service, and more particularly to an operation management apparatus and an operation management method that accurately detect and localize system performance degradation.

As a first conventional technique, there is an operation management apparatus that sets a threshold for each item of performance information and detects a failure when an item exceeds its threshold. In this conventional apparatus, a value that clearly indicates an abnormality is set in advance as the threshold, and a performance abnormality is detected for each element individually.

As a second conventional technique, an operation management apparatus has been devised that generates a correlation model by deriving, for the time series of any two performance information values, a conversion function that takes one as input and the other as output. When new performance information is detected, this apparatus determines whether the performance values conform to the conversion functions of the correlation model, and detects a failure based on the number and magnitude of the correlations that have broken down.

JP 2007-293393 A
JP 2008-293441 A

However, the operation management apparatus according to the first conventional technique has the following problems. If the threshold is set low, false alarms occur frequently, for example when the performance information fluctuates widely, and the administrator is confused. If the threshold is set high, nothing but serious failures can be detected, and it becomes difficult to detect performance abnormalities such as a system that is still running but whose response time has degraded. Furthermore, although an abnormal value can be detected for each individual element, an abnormality that arises from the relationship with the performance values of other elements in an input/output relationship, such as a bottleneck, cannot be detected.

The operation management apparatus according to the second conventional technique detects a failure based on the number and magnitude of correlation breakdowns. Consequently, in a system whose components are unevenly distributed in number, even if many correlation breakdowns occur in a component type that has few instances, no failure is detected unless the number of breakdowns across the whole system is large.

In other words, to detect performance degradation failures, the operation management apparatus according to the second conventional technique generates a model of the correlations that hold in normal operation, detects a failure from the state of correlation breakdowns during operation, and identifies the abnormality. This conventional method has two problems. The first problem is that, because a failure is detected from the number and magnitude of correlation breakdowns, in a system whose components are unevenly distributed in number, many breakdowns in a component type that has few instances are not detected as a failure unless the number of breakdowns across the whole system is large. For example, in a typical three-tier system composed of Web, AP, and DB servers, many Web and AP servers are installed for load distribution, whereas DB servers are usually few. In such a system, even if many correlation breakdowns occur in the DB tier, they are few compared with the correlations of the Web and AP tiers, so the system as a whole may be regarded as having few breakdowns and no abnormality. The second problem concerns models such as the network traffic volume between two points, whose correlation never breaks down in the normal state and for which a breakdown almost certainly indicates a failure: even for such a model, the breakdown is not detected as a failure.
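The first problem can be illustrated with a small worked calculation. The tier sizes and breakdown counts below are hypothetical numbers chosen only for illustration, and `breakdown_ratios` is a name introduced here, not one used in the source.

```python
from typing import Dict, Tuple


def breakdown_ratios(
    broken: Dict[str, int], totals: Dict[str, int]
) -> Tuple[float, Dict[str, float]]:
    """Return the system-wide broken ratio and the broken ratio per tier."""
    overall = sum(broken.values()) / sum(totals.values())
    per_tier = {tier: broken[tier] / totals[tier] for tier in totals}
    return overall, per_tier


# Hypothetical three-tier system: many Web/AP correlations, few DB correlations.
totals = {"Web": 40, "AP": 40, "DB": 4}
# Every DB correlation is broken, but only one each in the Web and AP tiers.
broken = {"Web": 1, "AP": 1, "DB": 4}

overall, per_tier = breakdown_ratios(broken, totals)
print(f"overall: {overall:.2f}")  # 6 of 84 correlations, about 0.07
print(per_tier)                   # the DB tier is fully broken (ratio 1.0)
```

A detector that looks only at the system-wide count sees about 7 percent of correlations broken and may stay silent, while the per-tier view shows that the DB tier is entirely affected.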

Accordingly, an object of the present invention is to provide an operation management apparatus and an operation management method having a function of accurately detecting system performance degradation or a function of localizing system performance degradation.

The present invention has been made to solve the problems described above. It is an operation management apparatus that monitors performance information of a system executing a service, extracts the correlations that hold among a plurality of performance values of the system in normal operation, detects a plurality of performance values while the system is in operation, extracts changes in the correlations from the detection results, and presents them to an administrator, thereby detecting and localizing performance degradation of the system. The apparatus comprises: normal model distribution storage means for holding the range of the breakdown distribution of a normal correlation model, that is, the range of the distribution of breakdowns of the correlations between performance values when the system is operating normally; failure model distribution storage means for holding a failure model that indicates the distribution of correlation breakdowns when an abnormality has occurred in the system; correlation change distribution determination means for comparing the breakdown distribution of each correlation model of the performance information with the range held in the normal model distribution storage means and determining whether the distribution falls within that range; correlation change history storage means for holding each breakdown distribution that the correlation change distribution determination means has determined not to fall within the range; and correlation breakdown increase determination means for determining, once a predetermined number of such breakdown distributions have accumulated as a history in the correlation change history storage means, whether the history tends to approach the breakdown distribution of the failure model held in the failure model distribution storage means.
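The interplay of the five means can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: a breakdown distribution is represented as a mapping from component name to the fraction of its correlations that are broken, closeness to the failure model is measured by Euclidean distance, and "tends to approach" is read as strictly decreasing distance over the accumulated history. The source specifies none of these choices, and the names `CollapseMonitor`, `within_normal`, and `distance` are hypothetical.

```python
from collections import deque
from math import sqrt
from typing import Deque, Dict, Tuple

Distribution = Dict[str, float]  # component -> fraction of broken correlations


def within_normal(dist: Distribution, normal_range: Dict[str, Tuple[float, float]]) -> bool:
    """Correlation change distribution determination: is every component in range?"""
    return all(lo <= dist.get(name, 0.0) <= hi for name, (lo, hi) in normal_range.items())


def distance(a: Distribution, b: Distribution) -> float:
    """Euclidean distance between two distributions (an assumed similarity measure)."""
    keys = set(a) | set(b)
    return sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


class CollapseMonitor:
    def __init__(self, normal_range: Dict[str, Tuple[float, float]],
                 failure_model: Distribution, history_size: int = 3) -> None:
        self.normal_range = normal_range    # normal model distribution storage
        self.failure_model = failure_model  # failure model distribution storage
        # Correlation change history storage, limited to the predetermined number.
        self.history: Deque[Distribution] = deque(maxlen=history_size)

    def observe(self, dist: Distribution) -> bool:
        """Return True when the history tends to approach the failure model."""
        if within_normal(dist, self.normal_range):
            return False  # normal: nothing is added to the history
        self.history.append(dist)
        if len(self.history) < (self.history.maxlen or 0):
            return False  # not enough history accumulated yet
        d = [distance(h, self.failure_model) for h in self.history]
        # Correlation breakdown increase determination: distances strictly shrink.
        return all(later < earlier for earlier, later in zip(d, d[1:]))
```

With a failure model of `{"Web": 0.0, "DB": 1.0}` and a history size of 3, successive out-of-range observations with DB fractions 0.3, 0.5, and 0.7 make `observe` return `True` on the third one. Whether the history should be cleared when a normal observation arrives is left open by the source; this sketch simply keeps it.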

The present invention is also an operation management method that monitors performance information of a system executing a service, extracts the correlations that hold among a plurality of performance values of the system in normal operation, detects a plurality of performance values while the system is in operation, extracts changes in the correlations from the detection results, and presents them to an administrator. The method comprises: a normal model distribution storage step of holding the range of the breakdown distribution of a normal correlation model, that is, the range of the distribution of breakdowns of the correlations between performance values when the system is operating normally; a failure model distribution storage step of holding a failure model that indicates the distribution of correlation breakdowns when an abnormality has occurred in the system; a correlation change distribution determination step of comparing the breakdown distribution of each correlation model of the performance information with the range held in the normal model distribution storage step and determining whether the distribution falls within that range; a correlation change history storage step of holding each breakdown distribution determined in the correlation change distribution determination step not to fall within the range; and a correlation breakdown increase determination step of determining, once a predetermined number of such breakdown distributions have accumulated as a history in the correlation change history storage step, whether the history tends to approach the breakdown distribution of the failure model held in the failure model distribution storage step.

The present invention is also a program that causes a computer to execute processing of monitoring performance information of a system executing a service, extracting the correlations that hold among a plurality of performance values of the system in normal operation, detecting a plurality of performance values while the system is in operation, extracting changes in the correlations from the detection results, and presenting them to an administrator. The program causes the computer to execute: a normal model distribution storage process of holding the range of the breakdown distribution of a normal correlation model, that is, the range of the distribution of breakdowns of the correlations between performance values when the system is operating normally; a failure model distribution storage process of holding a failure model that indicates the distribution of correlation breakdowns when an abnormality has occurred in the system; a correlation change distribution determination process of comparing the breakdown distribution of each correlation model of the performance information with the range held in the normal model distribution storage process and determining whether the distribution falls within that range; a correlation change history storage process of holding each breakdown distribution determined in the correlation change distribution determination process not to fall within the range; and a correlation breakdown increase determination process of determining, once a predetermined number of such breakdown distributions have accumulated as a history in the correlation change history storage process, whether the history tends to approach the breakdown distribution of the failure model held in the failure model distribution storage process.

According to the present invention, the correlations among the items of performance information in normal operation are modeled, and the per-element distribution of correlation breakdowns is monitored during operation. This makes it possible to detect signs of a failure and to identify where it occurs. The invention can therefore provide an operation management apparatus, an operation management method, and a program having a function of accurately detecting system performance degradation or a function of localizing system performance degradation.

Further, according to the present invention, attention is paid to how correlation model breakdowns are distributed across the components of the system. The breakdown distribution at the time of an abnormality is registered in advance, and if the breakdown distribution observed during operation tends to approximate that abnormal distribution over a certain period, this is regarded as a sign of a failure and the administrator is notified of the abnormality. As a result, an abnormality can be detected even when the number of broken correlation models is small, which solves the first and second problems described above. The present invention also determines whether the breakdown distribution observed during operation is within the normal range. Thus, if no abnormality is reported, the administrator can regard the system as operating normally.

FIG. 1 is a block diagram showing the operation management apparatus on which the present invention is premised.
FIG. 2 is a diagram showing an example of performance information.
FIG. 3 is a flowchart showing the steps of analyzing a correlation change in performance information.
FIG. 4 is a diagram showing an example of a correlation model.
FIG. 5 is a diagram showing an example of a screen presented by the operation management apparatus of FIG. 1.
FIG. 6 is a block diagram showing an operation management apparatus according to the first embodiment of the present invention.
FIG. 7 is a flowchart showing the operation of the operation management apparatus shown in FIG. 6.
FIG. 8 is a diagram showing the information registered as the breakdown distribution of the correlation model at the time of an abnormality.
FIG. 9 is a diagram outlining the comparison between the breakdown distribution of each correlation model of the performance information and the range of the breakdown distribution of the normal model, according to an embodiment of the present invention.
FIG. 10 is a diagram showing an example of a screen presented by the operation management apparatus of FIG. 6.
FIG. 11 is a diagram showing an example of a screen presented by an operation management apparatus according to the second embodiment of the present invention.
FIG. 12 is a flowchart showing the operation of an operation management apparatus according to the third embodiment of the present invention.

(Prerequisite configuration)
First, with reference to FIGS. 1 to 5, the configuration and operation of the operation management apparatus on which the present invention is premised will be described.

Referring to FIG. 1, the operation management apparatus on which the present invention is premised comprises service execution means 1, performance information storage means 2, information collection means 3, failure analysis means 4, administrator interaction means 5, countermeasure execution means 6, correlation model generation means 7, correlation model storage means 8, and correlation change analysis means 9.

The service execution means 1 is, for example, an information processing apparatus that provides an information communication service such as a WEB service or a business service.

The performance information storage means 2 stores the performance information of each element of the service execution means 1.

The information collection means 3 detects and outputs the operating state of the service execution means 1, such as performance information or error messages, and stores the performance information contained in the operating state in the performance information storage means 2.

The failure analysis means 4 receives the outputs of the information collection means 3 and the correlation change analysis means 9 and performs failure analysis.

The administrator interaction means 5 receives the result of the failure analysis from the failure analysis means 4 and presents it to the administrator, and also receives input from the administrator and outputs it to the countermeasure execution means 6.

The countermeasure execution means 6 causes the service execution means 1 to execute processing that deals with the failure, in accordance with the output of the administrator interaction means 5.

The correlation model generation means 7 retrieves performance information for a certain period from the performance information storage means 2 and derives a conversion function between the time series of values of any two items of performance information, thereby generating a correlation model of the overall operating state of the service execution means 1.

The correlation model storage means 8 stores the correlation model generated by the correlation model generation means 7.

The correlation change analysis means 9 receives newly detected performance information from the information collection means 3, analyzes whether the performance values it contains satisfy, within a certain error range, the relationships expressed by the conversion functions between the items of performance information in the correlation model stored in the correlation model storage means 8, and outputs the result.

The operation of the operation management apparatus on which the present invention is premised will now be described with reference to FIGS. 1 to 5.

First, the information collection means 3 shown in FIG. 1 detects the operating state of the service execution means 1 and stores performance information in the performance information storage means 2. For example, when a WEB service is running on the service execution means 1, the information collection means 3 detects the CPU utilization or the remaining memory of each server that provides the WEB service at regular time intervals. The performance information 101 shown in FIG. 2 is an example of performance information detected in this way. In FIG. 2, for example, "A.CPU" indicates the CPU utilization of one server, and its value at 17:25 on October 5, 2007 is 12. Values such as 15, 34, and 63 are then detected at one-minute intervals from 17:26. Similarly, "A.MEM" is the remaining memory of the same server and "B.CPU" is the CPU utilization of another server, each detected at the same times.

Next, the failure analysis means 4 performs failure analysis by a predetermined method. For example, following a rule such as "present a warning message to the administrator if the CPU utilization is at or above a certain value", the failure analysis means 4 applies a threshold test to the performance information values detected by the information collection means 3 to determine whether the load on a specific server is high.
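A threshold test of this kind might look like the following sketch. The threshold of 80 percent, the convention that CPU metrics are named with a ".CPU" suffix, and the function name `threshold_warnings` are assumptions made for illustration; the source gives only the "A.CPU" style of metric names.

```python
from typing import Dict, List

CPU_WARN_THRESHOLD = 80.0  # hypothetical threshold, in percent


def threshold_warnings(
    sample: Dict[str, float], threshold: float = CPU_WARN_THRESHOLD
) -> List[str]:
    """Return one warning message per CPU metric at or above the threshold."""
    return [
        f"WARNING: {name} = {value} (threshold {threshold})"
        for name, value in sorted(sample.items())
        if name.endswith(".CPU") and value >= threshold
    ]


# One sample: only B.CPU is at or above the threshold; A.MEM is not a CPU metric.
for message in threshold_warnings({"A.CPU": 63, "A.MEM": 95, "B.CPU": 91}):
    print(message)
```

As the discussion of the first conventional technique points out, the result depends entirely on where the threshold is set, which is the weakness the correlation-based analysis is meant to address.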

The administrator interaction means 5 presents the result of the failure analysis by the failure analysis means 4 to the administrator. When the administrator enters an instruction for some countermeasure, the administrator interaction means 5 causes the service execution means 1 to execute a countermeasure command via the countermeasure execution means 6. For example, on learning from the administrator interaction means 5 that the CPU load is high, the administrator can reduce the workload or change the configuration to distribute the load. By repeating this cycle of information collection, analysis, and countermeasure, failures of the service execution means 1 are handled continuously.

Furthermore, by means of the correlation model generation means 7, the correlation model storage means 8, and the correlation change analysis means 9, the operation management apparatus shown in FIG. 1 can detect performance abnormalities more accurately in the failure analysis described above.

FIG. 3 shows the processing for detecting such performance abnormalities more accurately, namely the steps of analyzing a correlation change in the performance information.

First, with the information shown in the performance information 101 of FIG. 2 stored in the performance information storage means 2, the correlation model generation means 7 creates a correlation model by deriving a conversion function between each pair of items of performance information, and stores the correlation model in the correlation model storage means 8 (step S501 in FIG. 3).

The correlation model 201 shown in FIG. 4 is an example of a correlation model generated in this way. In the correlation model 201, for example, for the conversion function "Y = αX + β" with "A.CPU" as input X and "A.MEM" as output Y, the values "-0.6" and "100" are determined for α and β respectively by referring to the time series of values shown in the performance information 101 of FIG. 2. The time series of values generated by this conversion function is then compared with the time series of actual values of the output performance information, and the weight "0.88" of the conversion function is calculated from the conversion error, which is the difference between the two. In the same way, a conversion function is derived between any two items of performance information, those with at least a certain weight are extracted as valid correlations, and the correlation model 201 is generated. Although the conversion function "Y = αX + β" is described here, the invention is not limited to this example; any function that converts between the time series of values of two items of performance information may be used.
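One way to derive such a conversion function is an ordinary least-squares fit, sketched below. The source does not say how α and β are fitted or how the weight is computed from the conversion error, so both are assumptions here: the weight is taken as 1 minus the mean absolute conversion error divided by the range of the output values, clipped at 0. The function names are hypothetical.

```python
from statistics import mean
from typing import Sequence, Tuple


def fit_linear(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of Y = alpha * X + beta."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("input series is constant; no linear fit possible")
    alpha = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    beta = my - alpha * mx
    return alpha, beta


def weight_from_error(
    x: Sequence[float], y: Sequence[float], alpha: float, beta: float
) -> float:
    """Assumed weight: 1 - (mean absolute conversion error / output range)."""
    span = max(y) - min(y)
    if span == 0:
        return 0.0
    mae = mean([abs(yi - (alpha * xi + beta)) for xi, yi in zip(x, y)])
    return max(0.0, 1.0 - mae / span)


# A.CPU values taken from the example; A.MEM values are synthetic, generated
# from Y = -0.6 X + 100 so that the fit recovers the coefficients in the text.
a_cpu = [12, 15, 34, 63]
a_mem = [-0.6 * v + 100 for v in a_cpu]
alpha, beta = fit_linear(a_cpu, a_mem)
print(alpha, beta, weight_from_error(a_cpu, a_mem, alpha, beta))
```

Because the synthetic output series lies exactly on the line, the assumed weight comes out at essentially 1; real measurements with scatter would yield a lower weight, such as the 0.88 in the example.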

Next, the correlation change analysis means 9 analyzes whether the performance information newly acquired from the information collection means 3 is consistent with the correlations indicated in the correlation model (step S502).

For example, when the correlation change analysis means 9 obtains the performance information for "2007/11/07 8:30" in the bottom row of the performance information 101 shown in FIG. 2, it examines in turn the conversion functions described in the correlation model 201 shown in FIG. 4. If the converted value calculated from the input performance information using the conversion function and the newly acquired value of the output performance information agree within a certain conversion error range, it determines that the correlation is maintained; if the conversion error range is exceeded, it determines that the correlation has broken down. The correlation change analysis means 9 repeats this processing for all conversion functions to determine whether a correlation change has occurred for all of the newly acquired performance information. It then creates correlation change information, which includes abnormality degree information indicating the extent of the correlation change and abnormal element information indicating the elements involved in the correlation change, and outputs it to the failure analysis means 4.
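The check described above can be sketched as a loop over the conversion functions. The data layout is an assumption: each correlation carries its own allowed conversion error (`tolerance`), the degree of abnormality is taken to be the fraction of broken correlations, and the abnormal elements are the metrics that appear in any broken correlation. The source does not define these quantities precisely, and the names `Correlation` and `analyze` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Correlation:
    src: str          # input performance item X
    dst: str          # output performance item Y
    alpha: float
    beta: float
    tolerance: float  # allowed conversion error (hypothetical field)


def analyze(
    sample: Dict[str, float], model: Sequence[Correlation]
) -> Tuple[float, List[str]]:
    """Return (degree of abnormality, abnormal elements) for one new sample."""
    broken = [
        c for c in model
        if abs(sample[c.dst] - (c.alpha * sample[c.src] + c.beta)) > c.tolerance
    ]
    degree = len(broken) / len(model) if model else 0.0
    elements = sorted({name for c in broken for name in (c.src, c.dst)})
    return degree, elements


model = [
    Correlation("A.CPU", "A.MEM", -0.6, 100.0, 5.0),
    Correlation("A.CPU", "B.CPU", 1.0, 0.0, 5.0),  # hypothetical second correlation
]
# A.MEM matches its predicted value of 88; B.CPU is far from its predicted 20.
print(analyze({"A.CPU": 20, "A.MEM": 88, "B.CPU": 60}, model))
```

In this sample one of the two correlations is broken, so the assumed degree of abnormality is 0.5 and both metrics of the broken pair are reported as abnormal elements.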

The failure analysis means 4 receives this correlation change information and, when the degree of abnormality exceeds a predetermined value (step S503), causes the administrator interaction means 5 to present the possibility of a failure (step S504).

FIG. 5 shows an example of the screen that the administrator interaction means 5 presents to the administrator in this way. The display screen 401 includes the number of correlation breakdowns, which indicates the degree of abnormality (a), a correlation diagram indicating where the abnormality is located (b), a list of elements with a high degree of abnormality (c), and so on. In this way, the administrator can be informed, for example, that the element "C.CPU", which has a high degree of abnormality, may have a failure.

As described above, the operation management apparatus on which the present invention is premised generates a correlation model from performance information of normal periods in which no failure has occurred, and calculates the proportion by which detected performance information deviates from this normal correlation model. It can thereby detect the occurrence of a performance abnormality such as response degradation and identify where the abnormality occurred.

However, in the operation management apparatus on which the present invention is premised, a possible failure is presented to the administrator only when the number of correlation collapses, which indicates the degree of abnormality, is reasonably large. Consequently, if the numbers of the applications constituting the system are unevenly distributed, there is a problem that a performance abnormality is not presented even when many abnormalities occur on the servers of an application that is few in number.

For example, consider a case in which the operation management apparatus on which the present invention is premised manages WEB servers and DB servers as the applications constituting a WEB service. In this case, there are generally more WEB servers than DB servers, so even if many abnormalities occur on the DB servers, the overall number of correlation collapses does not become very large, and the failure may not be presented.
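The arithmetic behind this problem can be illustrated with a small Python sketch. The numbers of correlations per server group and the alert threshold are hypothetical values chosen only to make the effect visible; they do not come from the source.

```python
# Hypothetical system: number of correlations per server group (assumed values).
total_correlations = {"WEB": 90, "DB": 10}
collapsed_correlations = {"WEB": 0, "DB": 8}
ALERT_THRESHOLD = 0.20  # assumed: alert when 20% of all correlations have collapsed


def overall_collapse_ratio(collapsed, total):
    """Share of all correlations in the system that have collapsed."""
    return sum(collapsed.values()) / sum(total.values())


def per_group_collapse_ratio(collapsed, total):
    """Share of collapsed correlations within each server group."""
    return {group: collapsed[group] / total[group] for group in total}


overall = overall_collapse_ratio(collapsed_correlations, total_correlations)
per_group = per_group_collapse_ratio(collapsed_correlations, total_correlations)
alert_raised = overall > ALERT_THRESHOLD
```

Under these assumed numbers, most DB correlations have collapsed, yet the overall ratio stays below the threshold, so a count-based alert alone would not fire.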

(First Embodiment)
Next, the operation management apparatus according to the first embodiment of the present invention will be described with reference to FIGS. 6 to 10.

[Configuration of the First Embodiment]
FIG. 6 is a block diagram showing the operation management apparatus according to the first embodiment of the present invention. In addition to the configuration of the operation management apparatus shown in FIG. 1, the operation management apparatus of this embodiment comprises normal model distribution storage means 10, failure model distribution storage means 11, correlation change distribution determination means 12, correlation change history storage means 13, and correlation collapse increase determination means 14.

The normal model distribution storage means 10 stores the range of the collapse distribution of the correlation models of the performance information in the normal state (see the distribution range 802 in FIG. 9), as entered by the administrator.

The failure model distribution storage means 11 stores the collapse distribution of the correlation models of the performance information in the abnormal state, as entered by the administrator.

The correlation change distribution determination means 12 receives performance information from the correlation change analysis means 9, compares the collapse distribution of the correlation models for that performance information with the range of the collapse distribution held in the normal model distribution storage means 10, and, based on the comparison, analyzes whether the performance information falls within the normal range.

The correlation change history storage means 13 stores the number of collapses of each correlation model of the performance information.

Based on the history of the number of collapses of each correlation model stored in the correlation change history storage means 13, the correlation collapse increase determination means 14 analyzes, for each failure model, whether the collapse distribution of the correlation models approximates the abnormal-state collapse distribution (that is, the failure model distribution) stored in the failure model distribution storage means 11. If it determines that there is a failure model distribution being approximated, the correlation collapse increase determination means 14 notifies the failure analysis means 4 of the failure model that the performance information approximates and of the degree of approximation calculated by comparison with that failure model.

In addition, the failure analysis means 4 has a new function: it receives from the correlation collapse increase determination means 14 the failure model that the performance information approximates and the degree of approximation calculated by comparison with that failure model, and presents this information to the administrator via the administrator dialogue means 5.

[Operation of the First Embodiment]
Next, the operation of the operation management apparatus of this embodiment will be described with reference to FIGS. 6 to 10.

First, as a precondition, the administrator registers the range of collapse of the correlation models in the normal state in the normal model distribution storage means 10 (step S711 in FIG. 7). This may be done before the operation management apparatus is started, or ranges may be added as needed during operation. For example, suppose there is a correlation model A that represents the correlation between the CPU usage rate of server A and the CPU usage rate of server B. If a collapse share of 5 to 10% of the total is to be regarded as normal, the range for correlation model A is registered as 5 to 10. The other correlation models are registered in the same way.

As a further precondition, the administrator registers the collapse distribution of the correlation models in the abnormal state in the failure model distribution storage means 11 (step S712). This too may be done before the operation management apparatus is started, or distributions may be added as needed during operation.

FIG. 8 shows the information registered as the collapse distribution of the correlation models in the abnormal state. For example, the administrator registers, as this distribution, sets each consisting of a correlation model name, an importance, and the collapse distribution of the correlation model.
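One possible in-memory shape for such a registration is sketched below in Python. The class names, the integer importance scale, and the concrete shares are hypothetical; the source states only that a correlation model name, an importance, and a collapse distribution are registered together.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FailureModelEntry:
    correlation_model: str  # correlation model name
    importance: int         # importance (scale and meaning are assumptions)
    collapse_share: float   # share of all collapses, in percent


@dataclass
class FailureModel:
    name: str
    entries: List[FailureModelEntry] = field(default_factory=list)

    def distribution(self) -> Dict[str, float]:
        """Collapse share per correlation model for this failure model."""
        return {e.correlation_model: e.collapse_share for e in self.entries}


class FailureModelStore:
    """Hypothetical stand-in for the failure model distribution storage means 11."""

    def __init__(self) -> None:
        self._models: Dict[str, FailureModel] = {}

    def register(self, model: FailureModel) -> None:
        self._models[model.name] = model

    def all(self) -> List[FailureModel]:
        return list(self._models.values())


store = FailureModelStore()
store.register(FailureModel(
    name="DB connection delay failure model",
    entries=[
        FailureModelEntry("A", importance=1, collapse_share=5.0),
        FailureModelEntry("B", importance=1, collapse_share=5.0),
        FailureModelEntry("C", importance=2, collapse_share=10.0),
        FailureModelEntry("D", importance=3, collapse_share=80.0),
    ],
))
```

Keying the store by failure model name means that registering a model with an existing name replaces it, which suits adding or revising models during operation.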

As in the operation management apparatus on which this embodiment is premised, described with reference to FIGS. 1 to 5, the correlation model generation means 7 generates a correlation model based on the performance information that the information collection means 3 collects from the service execution means 1 (step S713). Then, when the information collection means 3 collects performance information during operation, the correlation change analysis means 9 analyzes whether this performance information agrees with the correlations in the correlation model and calculates the abnormality degree from the changes in correlation (step S714).

Next, the correlation change distribution determination means 12 compares the collapse distribution of each correlation model for the received performance information with the range of the collapse distribution held in the normal model distribution storage means 10, and analyzes whether the performance information falls within that range (step S715). If it does, the correlation change distribution determination means 12 clears the collapse counts of all correlation models stored in the correlation change history storage means 13 (step S716).

FIG. 9 outlines a comparison between the collapse distribution of each correlation model of the performance information and the range of the collapse distribution of the normal model in this embodiment. In FIG. 9, of the correlation models A, B, C, and D shown in the graph 801, correlation model D exceeds the range of the normal-model collapse distribution held in the normal model distribution storage means 10, shown as the distribution range 802. Specifically, the value "20.4" of correlation model D in the graph 801 exceeds the range "10 to 15%" given for correlation model D in the distribution range 802. In such a case, the correlation change history storage means 13 stores the number of collapses of each correlation model of the performance information (step S717).
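Steps S715 to S717 can be sketched in Python as follows. The function names and the list used as the history are hypothetical. The ranges for models A and D and the value 20.4 for model D follow the examples in the text, while the ranges and values for models B and C are assumed so that the shares sum to 100.

```python
from typing import Dict, List, Tuple

Distribution = Dict[str, float]          # correlation model name -> collapse share (%)
Ranges = Dict[str, Tuple[float, float]]  # correlation model name -> (low %, high %)


def models_outside_range(dist: Distribution, normal: Ranges) -> List[str]:
    """Names of correlation models whose collapse share is outside its normal range.

    Assumes every model in `dist` has a registered range in `normal`.
    """
    return [name for name, share in dist.items()
            if not (normal[name][0] <= share <= normal[name][1])]


def record_or_clear(history: List[Distribution], dist: Distribution, normal: Ranges) -> bool:
    """Clear the history if `dist` is normal (S716); otherwise append it (S717).

    Returns True when the distribution was outside the normal range.
    """
    if models_outside_range(dist, normal):
        history.append(dict(dist))
        return True
    history.clear()
    return False


normal_ranges: Ranges = {"A": (5.0, 10.0), "B": (30.0, 40.0), "C": (30.0, 40.0), "D": (10.0, 15.0)}
observed: Distribution = {"A": 8.0, "B": 35.0, "C": 36.6, "D": 20.4}

history: List[Distribution] = []
out_of_range = models_outside_range(observed, normal_ranges)
was_abnormal = record_or_clear(history, observed, normal_ranges)
```

The sketch stores collapse shares rather than raw collapse counts in the history; either would serve, provided the later comparison with the failure models uses the same unit.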

Steps S713 to S717 are repeated, and the correlation collapse increase determination means 14 determines whether a predetermined number of items of performance information have been stored in the correlation change history storage means 13 (step S718). When it determines that they have, the correlation collapse increase determination means 14 queries the failure model distribution storage means 11 and obtains the collapse distributions of the correlation models in the abnormal state (step S719). Then, based on the history of the number of collapses of each correlation model stored in the correlation change history storage means 13, it analyzes whether the collapse distribution of the correlation models tends to approach a distribution obtained in step S719 (step S720).
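The source does not state how the degree of approximation or the tendency is computed. The Python sketch below assumes one simple choice: similarity is one minus the total variation distance between the normalized distributions, and a tendency exists when the similarity never decreases across the history and ends higher than it started. All function names and sample values are hypothetical.

```python
from typing import Dict, List, Tuple

Distribution = Dict[str, float]  # correlation model name -> collapse share or count


def similarity(p: Distribution, q: Distribution) -> float:
    """Assumed measure: 1 - total variation distance, in [0, 1].

    Both distributions must have a positive total.
    """
    total_p, total_q = sum(p.values()), sum(q.values())
    keys = set(p) | set(q)
    return 1.0 - 0.5 * sum(abs(p.get(k, 0.0) / total_p - q.get(k, 0.0) / total_q)
                           for k in keys)


def approaches_failure_model(history: List[Distribution],
                             failure_model: Distribution,
                             min_samples: int = 3) -> Tuple[bool, float]:
    """Return (tendency found, latest similarity) once enough history exists (S718-S720)."""
    if len(history) < min_samples:
        return False, 0.0
    sims = [similarity(h, failure_model) for h in history]
    rising = all(b >= a for a, b in zip(sims, sims[1:])) and sims[-1] > sims[0]
    return rising, sims[-1]


failure_model = {"A": 5.0, "B": 5.0, "C": 10.0, "D": 80.0}
history = [
    {"A": 8.0, "B": 35.0, "C": 36.6, "D": 20.4},
    {"A": 7.0, "B": 25.0, "C": 28.0, "D": 40.0},
    {"A": 6.0, "B": 15.0, "C": 19.0, "D": 60.0},
]
tendency, degree = approaches_failure_model(history, failure_model)
```

Other measures, such as cosine similarity or a weighting by the registered importance, would fit the same structure; the monotonic-rise test is likewise only one way to express "tends to approach".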

If step S720 determines that there is a failure model being approached, the failure analysis means 4 receives from the correlation collapse increase determination means 14 that failure model, the distribution of the most recent entry in the history of collapse counts per correlation model, and the degree of approximation, and presents the result to the administrator via the administrator dialogue means 5 (step S721).

FIG. 10 shows an example of the display screen presented to the administrator in this way. In FIG. 10, the graph 901 shows the current collapse distribution of the service's correlation models together with the collapse distribution of the failure model being approximated. The information 902 shows which failure model is being approximated and the degree of approximation. The graph 903 shows how the abnormality degree changes over time.

In this case, the time series of the abnormality degree has not reached the threshold at which an abnormality is judged at the current time, so no abnormality is reported. The administrator is therefore very likely to be unaware that an abnormality is occurring. Looking at the collapse distribution, however, shows that the collapses are concentrated in a particular correlation model and that the distribution approximates the "DB connection delay failure model" registered in advance by the administrator (see the information 902). The administrator can thus judge that there are signs of a delay failure in DB connections and respond appropriately.
For example, the administrator can check the DB logs for problems to track down the cause, or investigate the impact on the AP servers that connect to the DB.

[Effects of the First Embodiment]
In the operation management apparatus of this embodiment, the correlation collapse increase determination means 14 decides whether to notify the administrator by determining whether the collapse distribution of the correlation models of the performance information approximates a pre-registered abnormal-state correlation model. As a result, the operation management apparatus of this embodiment can detect an abnormality even when correlation model collapses are concentrated in a small number of the elements constituting the service, which a conventional operation management apparatus could not do. The operation management apparatus of this embodiment therefore overcomes the first problem addressed by the present invention, namely that an abnormality cannot be detected when correlation model collapses are concentrated in a small number of the elements constituting the service.

The operation management apparatus of this embodiment can also detect an abnormality in a model for which correlation collapse does not occur in the normal state, so that any collapse that does occur almost certainly indicates a failure. It therefore overcomes the second problem addressed by the present invention, namely that an abnormality cannot be detected even for such a model.

Furthermore, because the operation management apparatus of this embodiment determines whether the distribution approximates an abnormal-state correlation model based on past cases, the action to take for the abnormality is likely to be clear from past experience, which reduces the administrator's burden in responding. In addition, only performance information that falls outside the range of the normal-state collapse distribution stored in the normal model distribution storage means 10 is analyzed, so the system can be identified as operating normally during periods in which no abnormality is detected. Consequently, when the administrator investigates the cause of an abnormality by referring to logs, the logs for periods of normal operation can be excluded from the investigation, which further reduces the burden.

(Second Embodiment)
As with the first embodiment, the configuration and operation of this embodiment will be described with reference to FIG. 6.

In the operation management apparatus of this embodiment, the operation of storing a predetermined number of items of performance information that fall outside the range of the normal-state collapse distribution is the same as described for the first embodiment. In addition, the correlation collapse increase determination means 14 of this embodiment compares the history of the number of collapses of each correlation model stored in the correlation change history storage means 13 with the multiple abnormal-state collapse distributions obtained from the failure model distribution storage means 11, and analyzes, for each failure model, whether there is a tendency to approach it.

Next, the effects of this embodiment will be described. FIG. 11 shows an example of the display screen presented to the administrator when the correlation collapse increase determination means 14 of this embodiment has obtained an approximation value for each failure model. In FIG. 11, the information 906 shows how closely the current performance information approximates each of several failure models. This lets the administrator infer various possible failures from the combination of failure models presented. For example, FIG. 11 shows a DB-related failure as having the highest degree of approximation, but it also shows that the next three failures are all Web-related. From the display screen of FIG. 11, the administrator can consider the possibility that an abnormality has occurred on the Web side as well as on the DB side.
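Producing a ranked list like the one in the information 906 can be sketched in Python as below. The similarity measure (one minus the total variation distance) is an assumption, since the source does not define the approximation value, and the "Web overload failure model" and all distributions are hypothetical.

```python
from typing import Dict, List, Tuple

Distribution = Dict[str, float]  # correlation model name -> collapse share (%)


def similarity(p: Distribution, q: Distribution) -> float:
    """Assumed measure: 1 - total variation distance between normalized distributions."""
    total_p, total_q = sum(p.values()), sum(q.values())
    keys = set(p) | set(q)
    return 1.0 - 0.5 * sum(abs(p.get(k, 0.0) / total_p - q.get(k, 0.0) / total_q)
                           for k in keys)


def rank_failure_models(current: Distribution,
                        failure_models: Dict[str, Distribution]) -> List[Tuple[str, float]]:
    """Return (failure model name, similarity) pairs, most similar first."""
    scores = [(name, similarity(current, dist)) for name, dist in failure_models.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


failure_models = {
    "DB connection delay failure model": {"A": 5.0, "B": 5.0, "C": 10.0, "D": 80.0},
    "Web overload failure model": {"A": 40.0, "B": 40.0, "C": 10.0, "D": 10.0},
}
current = {"A": 6.0, "B": 15.0, "C": 19.0, "D": 60.0}
ranking = rank_failure_models(current, failure_models)
```

Showing the full ranking rather than only the best match is what lets the administrator notice secondary candidates, as in the Web-related example above.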

(Third Embodiment)
The configuration and operation of the operation management apparatus of this embodiment will be described with reference to FIG. 12.

The operation management apparatus of this embodiment shown in FIG. 12 is characterized by comprising normal model distribution automatic generation means 15 in addition to the configuration of the operation management apparatus according to the first embodiment.

When the correlation change analysis means 9 finds that all the acquired performance information agrees, within a fixed error range, with the correlations in the correlation model, the normal model distribution automatic generation means 15 receives the correlation change information and calculates the share of the total accounted for by the collapses of each correlation model. It repeats this processing a fixed number of times, obtains the minimum and maximum collapse share for each correlation model, and registers the result in the normal model distribution storage means 10 as the collapse distribution of the correlation models in the normal state.
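The range learning described above can be sketched in Python as follows. The function names and the sample observations are hypothetical; the sketch assumes each observation is a mapping from correlation model name to its collapse count in one normal-state sampling period.

```python
from typing import Dict, List, Tuple

Counts = Dict[str, int]                  # correlation model name -> collapse count
Ranges = Dict[str, Tuple[float, float]]  # correlation model name -> (min %, max %)


def collapse_shares(counts: Counts) -> Dict[str, float]:
    """Share of all collapses attributed to each correlation model, in percent."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: 100.0 * n / total for name, n in counts.items()}


def learn_normal_ranges(observations: List[Counts]) -> Ranges:
    """Track the minimum and maximum collapse share seen for each correlation model."""
    ranges: Ranges = {}
    for counts in observations:
        for name, share in collapse_shares(counts).items():
            low, high = ranges.get(name, (share, share))
            ranges[name] = (min(low, share), max(high, share))
    return ranges


# Hypothetical normal-state observations.
observations = [
    {"A": 1, "D": 3},
    {"A": 2, "D": 2},
    {"A": 1, "D": 1},
]
normal_ranges = learn_normal_ranges(observations)
```

The resulting (min, max) pairs have the same form as the ranges an administrator would otherwise register by hand in step S711.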

Next, the effects of this embodiment will be described. Because the normal-state collapse distribution of the correlation models is accumulated automatically in the normal model distribution storage means 10 as described above, failures can be detected from the collapse distribution of the correlation models without the administrator having to calculate and register the normal-state collapse distribution personally, as in step S711 of FIG. 7.

The processing of the operation management apparatus according to each embodiment of the present invention described above may be stored in program form on a computer-readable recording medium, and the processing may be carried out by a computer reading and executing the program. Here, a computer-readable recording medium means a magnetic disk, a magneto-optical disk, a CD-ROM, a DVD-ROM, a semiconductor memory, or the like. The computer program may also be delivered to a computer over a communication line, and the computer receiving it may execute the program.

DESCRIPTION OF SYMBOLS
1 Service execution means
2 Performance information storage means
3 Information collection means
4 Failure analysis means
5 Administrator dialogue means
6 Countermeasure execution means
7 Correlation model generation means
8 Correlation model storage means
9 Correlation change analysis means
10 Normal model distribution storage means
11 Failure model distribution storage means
12 Correlation change distribution determination means
13 Correlation change history storage means
14 Correlation collapse increase determination means
15 Normal model distribution automatic generation means

Claims

An operation management apparatus that monitors performance information of a system, extracts as a model a relationship established in the system between a plurality of performance values, detects a plurality of performance values during operation of the system, and detects a change in the relationship from the detection result, the apparatus comprising:
storage means for holding, as a first distribution, a distribution of the collapse of the relationship for each model; and
first determination means for taking the current distribution of collapse for each model as a second distribution, comparing the first distribution with the second distribution, and determining whether or not the second distribution approximates the first distribution.

The operation management apparatus according to claim 1, further comprising:
history storage means for holding the determination results of the first determination means as a history; and
second determination means for determining, from the history, whether or not the second distribution tends to approach the first distribution.

The operation management apparatus according to claim 1, wherein the first distribution and the second distribution are ratios of numbers indicating a collapse of a relationship for each model.

An operation management method that monitors performance information of a system, extracts as a model a relationship established in the system between a plurality of performance values, detects a plurality of performance values during operation of the system, and detects a change in the relationship from the detection result, the method comprising:
a storage step of holding, as a first distribution, a distribution of the collapse of the relationship for each model; and
a first determination step of taking the current distribution of collapse for each model as a second distribution, comparing the first distribution with the second distribution, and determining whether or not the second distribution approximates the first distribution.

The operation management method according to claim 4, further comprising:
a history storage step of holding the determination results of the first determination step as a history; and
a second determination step of determining, from the history, whether or not the second distribution tends to approach the first distribution.

The operation management method according to claim 4, wherein the first distribution and the second distribution are ratios of numbers indicating the collapse of the relationship for each model.