JP4761229B2

JP4761229B2 - Operation management apparatus, operation management method and program

Info

Publication number: JP4761229B2
Application number: JP2008043858A
Authority: JP
Inventors: 啓介山口
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2008-02-26
Filing date: 2008-02-26
Publication date: 2011-08-31
Anticipated expiration: 2028-02-26
Also published as: JP2009205208A

Description

The present invention relates to an operation management apparatus that uses virtual machines (VMs).

Operation management apparatuses have been developed that use virtual machines (VMs) to control, in software, the servers that make up a system. Because such conventional apparatuses make it easy to create and replicate servers, relocate them, and dynamically change their processing performance beyond the constraints of the hardware, they are expected to enable flexible service provision that keeps pace with changes in the environment.

However, a conventional operation management apparatus determines failures solely from the performance information collected from the monitoring agent of a server built on a VM (a guest computer). It therefore cannot track changes in resource allocation made dynamically on the host computer, and cannot accurately detect performance abnormalities of the guest. As a result, failures are falsely detected or missed, and the efficiency of operation management work drops significantly.

Technologies related to operation management are introduced below.

Japanese Unexamined Patent Application Publication No. 2006-65430 describes a virtual machine performance changing method (Patent Document 1). The method dynamically changes the CPU performance of a virtual machine by changing the value of a dummy cycle included in the instruction execution part of the virtual machine in accordance with fluctuations in the load on the virtual machine.

Japanese Unexamined Patent Application Publication No. 2002-32244 describes a virtual computer (Patent Document 2). The virtual computer includes virtual computer control means for controlling a plurality of guest operating systems, each of which has means for collecting failure information. The virtual computer further includes means by which, when the virtual computer control means can no longer control the guest operating systems because of its own failure, the control means itself locks the working memory area of the running guest operating system, saves a memory image of that area to external storage, and then aborts itself.

Japanese Unexamined Patent Application Publication No. 2005-115751 describes an in-vehicle electronic device (Patent Document 3). The computer system includes a first OS, a service application that runs on the first OS and performs normal business processing, a second OS different from the first OS, and an analysis and prediction application that runs on the second OS. The first OS holds its own state information and operation record information, and the analysis and prediction application analyzes the information held by the first OS to detect signs of failure.

Japanese Unexamined Patent Application Publication No. 2006-39763 describes a guest OS debugging support method (Patent Document 4). The method supports debugging of a guest OS that runs in a virtual machine execution environment provided by a virtual machine manager. The method includes: a step in which the virtual machine manager constructs a second virtual machine execution environment different from the first virtual machine execution environment in which a first guest OS runs; a step in which the virtual machine manager copies the first guest OS, including its execution state and the state of the memory it uses, to the second virtual machine execution environment, thereby generating in that environment a second guest OS that is a copy of the first guest OS; and a step of stopping the second guest OS generated in the second virtual machine execution environment and saving its state.

Japanese Unexamined Patent Application Publication No. 2006-344025 describes an operation performance data acquisition method (Patent Document 5). The method acquires operation performance data using a computing system. The computing system includes a plurality of computers that execute various programs, a business server that assigns programs to the computers and requests the assigned computers to execute them, and a performance monitoring server that collects operation performance data related to the execution environment of the programs from the computers and monitors the performance of the computers based on that data. The method includes: a step in which the business server generates a collection command for a program, the command including business information indicating the assigned program and computer identification information identifying the computer that executes it; a step in which the performance monitoring server reads from memory the collection items corresponding to the business information included in the collection command and distributes them to the computer having the computer identification information; a step in which the computer collects operation performance data for the distributed collection items; and a step in which the performance monitoring server acquires that operation performance data from the computer to which the collection items were distributed.

Patent Document 1: JP 2006-65430 A (Claim 1)
Patent Document 2: JP 2002-32244 A (Claim 1)
Patent Document 3: JP 2005-115751 A (Claim 1)
Patent Document 4: JP 2006-39763 A (Claim 1)
Patent Document 5: JP 2006-344025 A (Claim 1)

The conventional operation management apparatus has the following problems.

The first problem concerns performance information for which the fluctuation range of the load on the guest computer changes when the resource allocation is changed on the host computer. In the conventional operation management apparatus, such a change in resource allocation causes an unexpected threshold crossing, that is, a false detection of a failure. In addition, an appropriate monitoring threshold cannot be set for performance information whose fluctuation range is changed dynamically.

The second problem is that the conventional operation management apparatus detects the load on the guest computer on the host computer. The utilization of the resources allocated to the guest computer can therefore be detected accurately, but it is not known how the load is actually observed on the guest computer, so correlation analysis with other elements, such as an application (AP) on the guest computer, cannot be performed accurately.

For example, CPU load is generally calculated as an average over the detection interval. On the host computer the value can be detected accurately from the full processing time, but while no CPU resource is allocated to the guest computer, the time information counted on the guest computer itself differs from that of the host computer (it becomes thinned-out time information). Consequently, when an AP on the guest computer performs interrupt processing or the like based on an interval timer, comparing the abnormal load state measured on the host computer with the processing messages of the AP, which acted on the load observable on the guest computer, may not yield an accurate failure determination.

An object of the present invention is to provide an operation management apparatus that can solve the above problems.

The operation management apparatus of the present invention includes a performance information collection unit, a performance abnormality analysis unit, a resource information collection unit, and a resource allocation analysis unit. The performance information collection unit collects information that represents, as a value, the load performance of a guest computer virtually realized on a host computer, and outputs it as performance information. The performance abnormality analysis unit receives the performance information, compares the value it represents with a threshold, and, when it determines that the load on the guest computer is abnormal, outputs an abnormality message to that effect. The resource information collection unit collects information representing the number of resources allocated to the guest computer on the host computer, and outputs it as resource information. The resource allocation analysis unit receives the resource information and the performance information, refers to adjustment data that associates a plurality of natural numbers with adjustment magnifications, selects from among the adjustment magnifications the one corresponding to the number represented by the resource information (the selected adjustment magnification), adjusts the value represented by the performance information according to the selected adjustment magnification, and outputs the result to the performance abnormality analysis unit as performance information.

Because the conventional operation management apparatus cannot track such changes in resource allocation, it falsely detects failures. The operation management apparatus of the present invention can track changes in resource allocation, so such false detections are suppressed and failures can be determined accurately. Moreover, because the determination can use a fixed threshold regardless of changes in allocation, the threshold is easy to set. The first problem is thereby solved.

In addition, the operation management apparatus of the present invention determines abnormalities from information adjusted on the basis of the performance information detected on the guest computer. Correlation analysis with the abnormal states of other elements can therefore be performed appropriately, in accordance with the state observable on the guest computer. The second problem is thereby solved.

An operation management apparatus according to embodiments of the present invention is described in detail below with reference to the accompanying drawings.

(First Embodiment)
[Configuration]
FIG. 1 is a diagram for explaining the configuration of an operation management apparatus according to an embodiment of the present invention. FIG. 2 is a block diagram showing the configuration of the operation management apparatus according to the embodiment of the present invention.

As shown in FIG. 1, the operation management apparatus according to the embodiment of the present invention includes a performance information collection unit 1, a performance abnormality analysis unit 2, a failure analysis unit 3, and an administrator dialogue unit 4.

The performance abnormality analysis unit 2, the failure analysis unit 3, and the administrator dialogue unit 4 are computer programs that reside on, and are executed by, a management manager 200, which is a computer. The performance information collection unit 1 is a computer program that resides on, and is executed by, a guest computer 110 virtually realized on a host computer 100, which is a computer. The guest computer 110 is generated in software.

The performance information collection unit 1 collects information that represents the load performance of the guest computer 110 as a value, and outputs it as performance information. The performance abnormality analysis unit 2 has a first function of receiving the performance information from the performance information collection unit 1 and performing the abnormality determination described below. The performance abnormality analysis unit 2 (first function) receives the performance information, compares the value it represents with a threshold, and, based on the comparison, analyzes whether the load on the guest computer 110 is abnormal. When the value represented by the performance information exceeds the threshold, the performance abnormality analysis unit 2 determines that the load on the guest computer 110 is abnormal and outputs an abnormality message to that effect (abnormality determination). The failure analysis unit 3 analyzes whether the system as a whole has a failure from the order of occurrence and the combination of abnormality messages, and, when it determines that the system as a whole has a failure, generates failure state information to that effect (failure determination). In that case, the failure state information is reported to the administrator via the administrator dialogue unit 4. The administrator dialogue unit 4 presents the failure state information to the administrator and receives the administrator's instructions.
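The abnormality determination performed by the first function can be sketched as follows. This is only an illustrative sketch: the names `AbnormalityMessage` and `check_upper_threshold`, and the format of the message text, are assumptions made here and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AbnormalityMessage:
    """Hypothetical container for the abnormality message output by unit 2."""
    source: str   # which guest computer the message refers to
    detail: str   # human-readable description of the abnormality


def check_upper_threshold(guest: str, value: float,
                          threshold: float) -> Optional[AbnormalityMessage]:
    """Return a message only when the value exceeds the threshold."""
    if value > threshold:
        detail = f"load {value:.1f} exceeds threshold {threshold:.1f}"
        return AbnormalityMessage(source=guest, detail=detail)
    return None  # no threshold crossing, so no message is generated
```

A value equal to the threshold is treated here as normal, since the text speaks of the value exceeding the threshold.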

As shown in FIG. 2, the operation management apparatus according to the embodiment of the present invention further includes a resource information collection unit 5 and a resource allocation analysis unit 6. The resource information collection unit 5 is a computer program that resides on the host computer 100 and is executed by that computer. The resource allocation analysis unit 6 is a computer program that resides on the management manager 200 and is executed by that computer.

The resource information collection unit 5 collects information representing the number of resources allocated to the guest computer 110 on the host computer 100, and outputs it as resource information. The resource allocation analysis unit 6 holds adjustment data (see FIG. 5) that associates a plurality of natural numbers with adjustment magnifications. The resource allocation analysis unit 6 receives the resource information and the performance information. It then refers to the adjustment data, selects from among the adjustment magnifications the one corresponding to the number represented by the resource information (the selected adjustment magnification), adjusts the value represented by the performance information according to the selected adjustment magnification, and outputs the result to the performance abnormality analysis unit 2 as the performance information described above.
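The lookup in the adjustment data can be illustrated with a small table. The values below are invented for illustration and are not the contents of FIG. 5; the sketch also assumes that "adjusting according to the selected adjustment magnification" means multiplying the value by it, which the text does not state explicitly.

```python
# Hypothetical adjustment data: natural number (here, a VM count) -> magnification.
ADJUSTMENT_DATA = {1: 1.0, 2: 1.0, 3: 0.75, 4: 0.5}


def select_magnification(resource_count: int,
                         adjustment_data: dict = ADJUSTMENT_DATA) -> float:
    """Pick the adjustment magnification associated with the resource count."""
    if resource_count not in adjustment_data:
        raise KeyError(f"no adjustment magnification for count {resource_count}")
    return adjustment_data[resource_count]


def adjust_performance_value(value: float, resource_count: int) -> float:
    """Scale the performance value by the selected magnification (assumed rule)."""
    return value * select_magnification(resource_count)
```

Under these assumed values, a reported load of 80.0 is passed through unchanged when the count is 2 and becomes 60.0 when the count is 3.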

In addition to the first function, the performance abnormality analysis unit 2 has a second function of receiving the performance information from the resource allocation analysis unit 6 and performing the abnormality determination described above. That is, on receiving the performance information from the resource allocation analysis unit 6, the performance abnormality analysis unit 2 compares the value it represents with the threshold. When the value exceeds the threshold, the unit determines that the guest computer 110 is abnormal and outputs an abnormality message to that effect (abnormality determination).

The performance information collection unit 1, the performance abnormality analysis unit 2, the failure analysis unit 3, the administrator dialogue unit 4, the resource information collection unit 5, and the resource allocation analysis unit 6 may also be computer programs executed by a computer (the operation management apparatus).

[Operation]
With reference to FIGS. 2 to 6, the operation is described for the case in which a change in resource allocation on the host computer 100 causes the load on the guest computer 110 to fluctuate.

FIG. 3 is a flowchart showing the operation of the operation management apparatus according to the embodiment of the present invention. FIG. 4 is a diagram for explaining the operation of the resource allocation analysis unit 6 (generation of adjusted performance information). FIG. 5 shows the adjustment data described above, that is, the relationship between natural numbers and adjustment magnifications. FIG. 6 shows the failure determination process of FIG. 3.

In the following description, the CPU (Central Processing Unit) load on the guest computer 110 is used as the performance information, and the number of guest computers running on the host computer 100, that is, the number of virtual machines (VMs), is used as the resource information. The invention is not limited to this example.

The resource allocation analysis unit 6 receives, as resource information from the resource information collection unit 5, the number of VMs on the host computer 100 (lower part of FIG. 4) (step 901 in FIG. 3). As the number of VMs on the host computer 100 increases, the CPU resource allocated to each guest computer 110 decreases. For example, when there are two VMs, each is allocated one half of the resources of the host computer 100, and when there are three VMs, each is allocated one third. In FIG. 4, the number of VMs increases from 2 to 3 between time t1 and time t2, so the guest computer 110 experiences a fixed reduction in resources during that period.

The resource allocation analysis unit 6 also receives, as performance information from the performance information collection unit 1, the value of the CPU load on the guest computer 110 (middle part of FIG. 4) (step 902). Because the total CPU resource of the guest computer 110 is reduced between time t1 and time t2, the load percentage required to perform the same processing as at other times increases.
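The arithmetic behind this increase can be checked with a short sketch. It assumes an equal split of host CPU among the VMs, as in the one-half and one-third example, and assumes that the guest observes its load as the fraction of its own share that is in use; the function names and the figure of 30% host work are illustrative only.

```python
from fractions import Fraction


def guest_share(vm_count: int) -> Fraction:
    """Fraction of host CPU given to each guest under an equal split."""
    if vm_count < 1:
        raise ValueError("vm_count must be at least 1")
    return Fraction(1, vm_count)


def observed_guest_load(host_work: Fraction, vm_count: int) -> Fraction:
    """Load seen inside the guest for a fixed amount of work, capped at 100%.

    host_work is the work expressed as a fraction of the whole host CPU.
    """
    return min(Fraction(1), host_work / guest_share(vm_count))
```

With work equal to 30% of the host CPU, the guest observes a load of 3/5 (60%) when there are two VMs and 9/10 (90%) when there are three, although the processing itself is unchanged.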

In accordance with this reduction in total resources caused by the number of VMs, the resource allocation analysis unit 6 generates an adjusted CPU load (upper part of FIG. 4) in which the CPU load value is reduced by a fixed proportion only between time t1 and time t2 (step 903). The magnification by which the performance information is adjusted for a given number of VMs depends on the capability of the host computer 100. The adjustment data shown in FIG. 5 is one example of such adjustment magnifications. For example, when the host computer 100 has a plurality of CPUs, processing capability remains high up to a certain number of VMs and then falls sharply as the number increases further. Here, the adjusted CPU load is generated taking into account the difference between the adjustment magnification for two VMs and that for three VMs (each being the selected adjustment magnification in its respective case).

The performance abnormality analysis unit 2 determines whether the CPU load is abnormal according to a threshold given in advance (step 904). Referring to FIG. 6, the failure determination process 904, for example, determines abnormality using the set threshold (step 951) and, when the threshold is exceeded (step 952-YES), generates an abnormality message (step 953). When the threshold is not exceeded (step 952-NO), step 953 is skipped. The failure analysis unit 3 then analyzes the combination with abnormality messages other than those based on performance values, such as abnormality messages output directly by an AP, and determines whether the system as a whole is abnormal (step 954).

When the failure analysis unit 3 determines that the system as a whole has a failure (step 905-YES), it generates failure state information to that effect and notifies the administrator via the administrator dialogue unit 4 (step 906). Processing then returns to step 901. When it determines that the system as a whole has no failure (step 905-NO), step 906 is skipped and processing returns to step 901.
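One pass through steps 901 to 906 might be organized as in the sketch below. The function `run_cycle`, its parameters, the message strings, and in particular the rule used for step 954 (a system failure is declared only when a CPU overload message coincides with an AP error message) are all assumptions made for illustration; the specification does not define a concrete combination rule.

```python
from typing import Callable, Iterable, List


def run_cycle(get_vm_count: Callable[[], int],
              get_cpu_load: Callable[[], float],
              magnification: Callable[[int], float],
              upper_threshold: float,
              other_messages: Iterable[str],
              notify: Callable[[List[str]], None]) -> bool:
    """Run one monitoring cycle and report whether a system failure was found."""
    vm_count = get_vm_count()                      # step 901: resource information
    cpu_load = get_cpu_load()                      # step 902: performance information
    adjusted = cpu_load * magnification(vm_count)  # step 903: adjusted CPU load

    messages = list(other_messages)                # e.g. messages output by an AP
    if adjusted > upper_threshold:                 # steps 951 and 952
        messages.append("cpu-overload")            # step 953

    # Step 954 (hypothetical rule): overload together with an AP error.
    failure = "cpu-overload" in messages and "ap-error" in messages
    if failure:                                    # step 905
        notify(messages)                           # step 906
    return failure
```

A caller would invoke `run_cycle` repeatedly, which corresponds to the return to step 901 in the flowchart.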

FIG. 7 is a diagram for explaining the operation (threshold determination) of the performance abnormality analysis unit 2 of the operation management apparatus according to the embodiment of the present invention.

As shown in FIG. 7, for example, an upper-limit threshold is set, and when it is exceeded, an overload is determined and an abnormality message is generated. The conventional operation management apparatus performs the abnormality determination using the CPU load as it fluctuates with resource allocation (lower part of FIG. 7), so the threshold is exceeded between time t1 and time t2, when the CPU resource is reduced. In contrast, in the operation management apparatus according to the embodiment of the present invention, the adjusted CPU load (upper part of FIG. 7) has been adjusted according to the amount of allocated resources, and the threshold is not exceeded.
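The contrast shown in FIG. 7 can be reproduced with made-up numbers. In the sketch below the sample values, the threshold of 85.0, and the magnifications are all hypothetical, and adjustment is again assumed to be multiplication by the magnification for the current VM count.

```python
# Hypothetical adjustment magnifications for 2 and 3 VMs.
MAGNIFICATION = {2: 1.0, 3: 0.75}
UPPER_THRESHOLD = 85.0


def threshold_crossings(values, upper):
    """Return the sample indices at which the upper threshold is exceeded."""
    return [i for i, v in enumerate(values) if v > upper]


# Samples 2 and 3 play the role of the period from t1 to t2 (three VMs).
vm_counts = [2, 2, 3, 3, 2]
raw_load = [60.0, 62.0, 90.0, 93.0, 61.0]
adjusted_load = [v * MAGNIFICATION[n] for v, n in zip(raw_load, vm_counts)]

raw_alarms = threshold_crossings(raw_load, UPPER_THRESHOLD)
adjusted_alarms = threshold_crossings(adjusted_load, UPPER_THRESHOLD)
```

With these numbers the unadjusted series crosses the threshold at both samples in the three-VM period, while the adjusted series never does.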

With such a change in CPU resource allocation, the processing actually performed on the guest computer 110 does not change at all; only the scale of the detected value changes. Moreover, this increase in load returns to normal through allocation control on the host computer 100, so there is no need to present the load increase during this period to the administrator as a failure.

Because the conventional operation management apparatus cannot track such changes in resource allocation, it falsely detects failures. The operation management apparatus according to the embodiment of the present invention can track changes in resource allocation, so such false detections are suppressed and failures can be determined accurately. Moreover, because the determination can use a fixed threshold regardless of changes in allocation, the threshold is easy to set. The first problem is thereby solved.

In addition, the operation management apparatus according to the embodiment of the present invention determines abnormalities from information adjusted on the basis of the performance information detected on the guest computer 110. Correlation analysis with the abnormal states of other elements can therefore be performed appropriately, in accordance with the state observable on the guest computer 110. The second problem is thereby solved.

(Second Embodiment)
In the second embodiment, descriptions that overlap with the first embodiment are omitted.

A third problem, following the first and second problems described above, concerns performance information for which a change in resource allocation on the host computer cannot be detected on the guest computer. In the conventional operation management apparatus, even when the guest computer can no longer be allocated the resources it needs, for example because of increased load on the host computer, no performance abnormality appears on the guest computer, and the failure goes undetected.

To explain this problem together with the present embodiment, the operation is described with reference to FIGS. 8 and 9 for the case in which a change in resource allocation on the host computer 100 does not affect the load on the guest computer 110.

FIG. 8 is a diagram for explaining the operation of the resource allocation analysis unit 6 (generation of adjusted performance information). FIG. 9 is a diagram for explaining the operation (threshold determination) of the performance abnormality analysis unit 2.

As in the operation described with reference to FIGS. 8 and 9, the resource allocation analysis unit 6 receives the number of VMs (lower part of FIG. 8) as resource information and the CPU load (middle part of FIG. 8) as performance information, and generates the adjusted CPU load (upper part of FIG. 8). In this case, the number of VMs increases from 2 to 4 between time t3 and time t4, but the peak value of the CPU load on the guest computer 110 shows no particular change. This occurs, for example, when the CPU clock count is used to control CPU resources: on the guest computer 110 the total CPU resource given appears unchanged, but the time for which the CPU is allocated has decreased. The processing actually performed on the guest computer 110 is delayed because CPU time is not being allocated, but the performance information collection unit 1 on the guest computer 110 cannot detect this.

As shown in FIG. 9, the abnormality determination of the conventional operation management apparatus (lower part of FIG. 9) cannot detect the abnormality because the peak load shows no change. The abnormality determination based on the adjusted CPU load in the present invention (upper part of FIG. 9) can detect that the amount of processing has dropped between time t3 and time t4. For example, by setting a lower limit threshold for the peak load, a failure such as faulty processing caused by insufficient resource allocation can be presented to the administrator. The third problem can thus be solved.

That is, in the first embodiment, the performance abnormality analysis unit 2 has first and second functions that determine that there is an abnormality in the load on the guest computer 110 when the value represented by the performance information exceeds the threshold (upper limit threshold), and output an abnormality message indicating that. In the second embodiment, by contrast, the performance abnormality analysis unit 2 additionally has a third function that determines that there is an abnormality in the load on the guest computer 110 when the value represented by the performance information falls below a lower limit threshold, which is lower than the upper limit threshold, and outputs an abnormality message indicating that. This solves the third problem.
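
The two-sided determination described above can be sketched in Python as follows. The function name judge_load, the message wording, and the threshold values in the usage lines are hypothetical; only the rule itself (abnormal above the upper limit threshold, and in the second embodiment also below the lower limit threshold) comes from the text.

```python
def judge_load(value, upper, lower=None):
    """Return an abnormality message for one performance value, or None.

    upper: upper limit threshold (first and second embodiments).
    lower: optional lower limit threshold (second embodiment only).
    """
    if lower is not None and lower >= upper:
        raise ValueError("lower limit threshold must be below the upper limit threshold")
    if value > upper:
        return f"abnormal: load {value} exceeds upper limit threshold {upper}"
    if lower is not None and value < lower:
        return f"abnormal: load {value} is below lower limit threshold {lower}"
    return None


# Assumed thresholds, in percent, applied to adjusted CPU load peaks.
too_high = judge_load(95.0, upper=90.0, lower=50.0)
too_low = judge_load(40.0, upper=90.0, lower=50.0)
normal = judge_load(80.0, upper=90.0, lower=50.0)
upper_only = judge_load(40.0, upper=90.0)  # first embodiment: no lower limit
```

Leaving lower unset reproduces the first embodiment, in which a drop in the adjusted load produces no message.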

Instead of the number of VMs on the host computer 100, the CPU load of a guest OS can also be corrected on the basis of the proportion of CPU resources allocated to each guest OS. This can be realized by monitoring the actual CPU clock, which is ordinarily not a monitoring target because it does not normally change dynamically.
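
One possible reading of this share-based correction is sketched below in Python. The source does not give a formula, so the multiplication by the allocated share, the function name adjust_by_cpu_share, and the clock figures are all assumptions made for illustration.

```python
def adjust_by_cpu_share(cpu_load, allocated_clock_mhz, host_clock_mhz):
    """Scale a guest CPU load by the share of the real CPU clock given to the guest.

    Assumption: the correction is a simple multiplication by
    allocated_clock_mhz / host_clock_mhz. The patent text does not specify this.
    """
    if host_clock_mhz <= 0:
        raise ValueError("host clock must be positive")
    if not 0 <= allocated_clock_mhz <= host_clock_mhz:
        raise ValueError("allocated clock must lie between 0 and the host clock")
    share = allocated_clock_mhz / host_clock_mhz
    return cpu_load * share


# A guest reporting 80 percent load while holding half of a 3000 MHz host CPU.
corrected = adjust_by_cpu_share(80.0, allocated_clock_mhz=1500.0, host_clock_mhz=3000.0)
```

Under this assumed formula, the corrected value plays the same role as the adjusted CPU load obtained from the number of VMs.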

Besides the CPU load, the present invention can also be applied to performance information monitoring of any resource on the guest computer 110.

The operation management apparatus according to the first and second embodiments of the present invention has been described above for the case where the guest computer 110 is generated by software. As another embodiment, as shown in FIG. 10, an IO virtualization computer 400 whose functional components can be recombined in hardware may be used. The IO virtualization computer 400 corresponds to the host computer 100 described above. This IO virtualization computer 400 (host computer) includes a processing unit 310 corresponding to the guest computer 110 described above, a memory 320, an HDD 330, and an IO allocation control unit 300 connected to the processing unit 310, the memory 320, and the HDD 330. In this case, the resource information collection unit 5 cannot easily be incorporated into the host computer, so the present invention can be realized by providing, on the monitoring manager 200, a resource information collection unit 5 that remotely monitors the processing unit 310, the memory 320, and the HDD 330 via the IO allocation control unit 300.

FIG. 1 is a diagram for explaining the configuration of the operation management apparatus according to the embodiments of the present invention (first and second embodiments).
FIG. 2 is a block diagram showing the configuration of the operation management apparatus according to the embodiments of the present invention (first and second embodiments).
FIG. 3 is a flowchart showing the operation of the operation management apparatus according to the embodiments of the present invention (first and second embodiments).
FIG. 4 is a diagram for explaining the operation of the resource allocation analysis unit 6 of the operation management apparatus (generation of adjusted performance information) (first embodiment).
FIG. 5 shows the adjustment data in the embodiments of the present invention, namely the relationship between natural numbers and adjustment magnifications (first and second embodiments).
FIG. 6 shows the failure determination process of FIG. 3 (first embodiment).
FIG. 7 is a diagram for explaining the operation of the performance abnormality analysis unit 2 of the operation management apparatus (threshold determination) (first and second embodiments).
FIG. 8 is a diagram for explaining the operation of the resource allocation analysis unit 6 of the operation management apparatus (generation of adjusted performance information) (second embodiment).
FIG. 9 is a diagram for explaining the operation of the performance abnormality analysis unit 2 of the operation management apparatus (threshold determination) (second embodiment).
FIG. 10 is a block diagram showing another configuration of the operation management apparatus according to the embodiments of the present invention (first and second embodiments).

Explanation of symbols

1 Performance information collection unit
2 Performance abnormality analysis unit
3 Failure analysis unit
4 Administrator dialogue unit
5 Resource information collection unit
6 Resource allocation analysis unit
100 Host computer
110 Guest computer
200 Monitoring manager

Claims

An operation management apparatus comprising:
a performance information collection unit that collects information representing the load performance of a guest computer virtually realized on a host computer and outputs it as performance information;
a performance abnormality analysis unit that receives the performance information output from the performance information collection unit and, when the value represented by that performance information exceeds a threshold, determines that there is an abnormality in the load on the guest computer and outputs an abnormality message indicating that;
a resource information collection unit that collects information representing the number of guest computers virtually realized on the host computer and outputs it as resource information; and
a resource allocation analysis unit that receives the resource information output from the resource information collection unit and the performance information output from the performance information collection unit, refers to adjustment data associating a plurality of natural numbers with adjustment magnifications, selects from among the plurality of adjustment magnifications a selected adjustment magnification corresponding to the number represented by the resource information, adjusts the value represented by the performance information according to the selected adjustment magnification, and outputs the result to the performance abnormality analysis unit as the performance information,
wherein the performance abnormality analysis unit receives the performance information output from the resource allocation analysis unit and, when the value represented by that performance information exceeds the threshold, determines that there is an abnormality in the load on the guest computer and outputs the abnormality message indicating that.

The operation management apparatus according to claim 1, wherein the performance abnormality analysis unit determines that there is an abnormality in the load on the guest computer when the value represented by the performance information is below a lower limit threshold that is lower than an upper limit threshold, the upper limit threshold being the threshold, and outputs the abnormality message indicating that.

The operation management apparatus according to claim 1, further comprising:
a failure analysis unit that analyzes, from a combination of the abnormality messages, whether there is a failure in the entire system and, when it determines that there is a failure in the entire system, generates failure state information indicating that; and
an administrator dialogue unit that presents the failure state information to an administrator.

The operation management apparatus according to claim 1, wherein the guest computer is generated by software.

The operation management apparatus according to claim 1, wherein the guest computer is configured in hardware.

An operation management method performed by a computer, comprising the steps of:
(a) receiving performance information representing the load performance of a guest computer virtually realized on a host computer and, when the value represented by the performance information exceeds a threshold, determining that there is an abnormality in the load on the guest computer and outputting an abnormality message indicating that;
(b) receiving resource information representing the number of guest computers virtually realized on the host computer and the performance information, referring to adjustment data associating a plurality of natural numbers with adjustment magnifications, selecting from among the plurality of adjustment magnifications a selected adjustment magnification corresponding to the number represented by the resource information, adjusting the value represented by the performance information according to the selected adjustment magnification, and outputting the result as the performance information; and
(c) when the value represented by the performance information from step (b) exceeds the threshold, determining that there is an abnormality in the load on the guest computer and outputting the abnormality message indicating that.

The operation management method according to claim 6, wherein steps (a) and (c) include determining that there is an abnormality in the load on the guest computer when the value represented by the performance information is below a lower limit threshold that is lower than an upper limit threshold, the upper limit threshold being the threshold, and outputting the abnormality message indicating that.

The operation management method according to claim 6 or 7, further comprising the steps of:
(d) analyzing, from a combination of the abnormality messages, whether there is a failure in the entire system and, when it is determined that there is a failure in the entire system, generating failure state information indicating that; and
(e) presenting the failure state information to an administrator.

A computer program for causing a computer to execute the steps of:
(a) receiving performance information representing the load performance of a guest computer virtually realized on a host computer and, when the value represented by the performance information exceeds a threshold, determining that there is an abnormality in the load on the guest computer and outputting an abnormality message indicating that;
(b) receiving resource information representing the number of guest computers virtually realized on the host computer and the performance information, referring to adjustment data associating a plurality of natural numbers with adjustment magnifications, selecting from among the plurality of adjustment magnifications a selected adjustment magnification corresponding to the number represented by the resource information, adjusting the value represented by the performance information according to the selected adjustment magnification, and outputting the result as the performance information; and
(c) when the value represented by the performance information from step (b) exceeds the threshold, determining that there is an abnormality in the load on the guest computer and outputting the abnormality message indicating that.

The computer program according to claim 9, wherein steps (a) and (c) further include a step of determining that there is an abnormality in the load on the guest computer when the value represented by the performance information is below a lower limit threshold that is lower than an upper limit threshold, the upper limit threshold being the threshold, and outputting the abnormality message indicating that.

The computer program according to claim 9 or 10, further causing the computer to execute the steps of:
(d) analyzing, from a combination of the abnormality messages, whether there is a failure in the entire system and, when it is determined that there is a failure in the entire system, generating failure state information indicating that; and
(e) presenting the failure state information to an administrator.