JP2003345629A

JP2003345629A - System monitor device, system monitoring method used for the same, and program therefor

Info

Publication number: JP2003345629A
Application number: JP2002154884A
Authority: JP
Inventors: Ikuo Kaite; 郁夫飼手
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2002-05-29
Filing date: 2002-05-29
Publication date: 2003-12-05

Abstract

PROBLEM TO BE SOLVED: To provide a system monitoring device that can prevent system failures likely to be caused by a process malfunction and can provide the detailed material needed to investigate the root cause of the malfunction.

SOLUTION: Operation information acquisition means 11 acquires system operation status data at fixed intervals and outputs it to an operation data file 12 in time series. Operation mode recognition means 13 initially creates and then updates a comparison value table 14 based on the operation data file 12 and, when the current system resource usage rate exceeds reference data 141, notifies system failure prevention means 15 of that fact. The system failure prevention means 15 identifies a process showing signs of improper processing based on operation data 142, collects detailed data on that process and outputs it to a detailed data file 16, and at the same time takes emergency preventive action to prevent a system failure.

COPYRIGHT: (C)2004,JPO

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a system monitoring device, a system monitoring method used therefor, and a program therefor, and more particularly to a method of preventing system failures and investigating their root causes in a computer system for the communications industry.

[0002]

2. Description of the Related Art

In recent years, the number of information processing systems that must run stably 24 hours a day, 365 days a year, such as computer systems for the communications industry, has been increasing rapidly. To support stable operation, such critical systems usually employ technology that monitors the system operation status in real time. Examples of such real-time monitoring technology are disclosed in JP-A-9-293004 and JP-A-4-344544.

[0003] The above technology generally consists of the following functions: a function that acquires system operation status data (system resource usage rates, execution information for each process, hardware (H/W) error correction counts, and the like) at regular intervals and edits and visualizes it (as graphs and the like) in real time; a function that displays a warning or sends a warning report to the system administrator when an abnormality (a set threshold) is detected for any item; and a function that saves the collected operation status data to a file as a log (as data for future prediction and improvement proposals).

[0004]

[Problems to Be Solved by the Invention]

The conventional real-time monitoring technology described above currently focuses on centralized management (reducing the number of operators) and on providing visualized data to the system administrator in real time, and it has the following problems.

[0005] The first problem is that the burden on the system administrator is considerable. When a warning report arrives, the system administrator must respond quickly and accurately (collecting detailed material, taking emergency measures, and so on), even if an outline of the countermeasure is given; if the warning is left unattended, the state may deteriorate sharply or the system may hang and become inoperable. The administrator is thus placed under heavy mental strain, which itself induces operating errors. In addition, for a certain period after service-in or after a serious failure, the administrator may be required to check the system operation status graph constantly in order to catch signs of abnormality.

[0006] The second problem is that the threshold for detecting an abnormality must be set in advance. If the threshold is set low (on the safe side), warning reports become excessive, so the threshold tends to be set high (closer to the limit value) in order to cut down the occasions for reporting as far as possible, which in turn increases the urgency and pressure on the system administrator. Moreover, whenever the mode of operation changes, the threshold must be thoroughly reviewed.

[0007] The third problem is that, when the system is stabilized in response to a warning report, it is often difficult to determine the root cause that triggered the warning state. Even if tracing back through the past system operation status identifies the time at which the signs appeared and the process that triggered them, no detailed material from that moment exists, so the root cause cannot be determined. In such cases, the customer's environment is simulated on another computer system (such as an evaluation system of the development department) and a reproduction test is run, work that usually requires an enormous number of man-hours.

[0008] Accordingly, an object of the present invention is to solve the above problems and to provide a system monitoring device, a system monitoring method used therefor, and a program therefor that can prevent the system failures that a process malfunction would cause and can provide the detailed material needed to investigate the root cause of that malfunction.

[0009]

[Means for Solving the Problems]

A system monitoring device according to the present invention monitors system failures in a computer system having an acquisition function for acquiring system operation status data, and comprises: means for detecting, based on the data acquired by the acquisition function, a sign that a process is heading toward a malfunction; means for collecting detailed data relating to the target process at the time the sign is detected; and means for executing an emergency preventive action to halt the deterioration of state caused by the malfunction of the process.

[0010] Another system monitoring device according to the present invention monitors system failures in a computer system having an acquisition function for acquiring system operation status data indicating an operation status in which a plurality of processes are executed in sequence, and comprises: an operation data file into which the system operation status data, acquired by the acquisition function at preset regular intervals, is entered sequentially in time series; means for determining whether the system is in stable operation by comparing the system resource usage rate in the latest entry read from the operation data file with preset reference data; means for determining, upon receiving notification that the system is not in stable operation, whether the system resource usage rate is rising sharply; means for detecting, when a sharp rise in the system resource usage rate is detected, a sign of malfunction of a process and collecting detailed data on the target process; and means for applying, when the sign of malfunction is detected, an emergency preventive action to prevent the system state from deteriorating.

[0011] A system monitoring method according to the present invention monitors system failures in a computer system having an acquisition function for acquiring system operation status data, and comprises the steps of: detecting, based on the data acquired by the acquisition function, a sign that a process is heading toward a malfunction; collecting detailed data relating to the target process at the time the sign is detected; and executing an emergency preventive action to halt the deterioration of state caused by the malfunction of the process.

[0012] Another system monitoring method according to the present invention monitors system failures in a computer system having an acquisition function for acquiring system operation status data indicating an operation status in which a plurality of processes are executed in sequence, and comprises the steps of: entering the system operation status data, acquired by the acquisition function at preset regular intervals, sequentially in time series into an operation data file; determining whether the system is in stable operation by comparing the system resource usage rate in the latest entry read from the operation data file with preset reference data; determining, upon receiving notification that the system is not in stable operation, whether the system resource usage rate is rising sharply; detecting, when a sharp rise in the system resource usage rate is detected, a sign of malfunction of a process and collecting detailed data on the target process; and applying, when the sign of malfunction is detected, an emergency preventive action to prevent the system state from deteriorating.

[0013] A program for a system monitoring method according to the present invention is a program for a method of monitoring system failures in a computer system having an acquisition function for acquiring system operation status data, and causes a computer to execute: a process of detecting, based on the data acquired by the acquisition function, a sign that a process is heading toward a malfunction; a process of collecting detailed data relating to the target process at the time the sign is detected; and a process of executing an emergency preventive action to halt the deterioration of state caused by the malfunction of the process.

[0014] A program for another system monitoring method according to the present invention is a program for a method of monitoring system failures in a computer system having an acquisition function for acquiring system operation status data indicating an operation status in which a plurality of processes are executed in sequence, and causes a computer to execute: a process of entering the system operation status data, acquired by the acquisition function at preset regular intervals, sequentially in time series into an operation data file; a process of determining whether the system is in stable operation by comparing the system resource usage rate in the latest entry read from the operation data file with preset reference data; a process of determining, upon receiving notification that the system is not in stable operation, whether the system resource usage rate is rising sharply; a process of detecting, when a sharp rise in the system resource usage rate is detected, a sign of malfunction of a process and collecting detailed data on the target process; and a process of applying, when the sign of malfunction is detected, an emergency preventive action to prevent the system state from deteriorating.

[0015] That is, the system monitoring device of the present invention prevents, in a computer system, the system failures (extreme performance degradation, hang-ups, and crashes) caused by improper processing in a process, which puts pressure on computer resources, and provides the detailed information needed to investigate the root cause of the malfunctioning process.

[0016] In general, a computer system is equipped with a function for acquiring system operation status data (system resource usage rates and execution information for each process). The feature of the present invention is that, based on the data acquired by this standard function, it automatically detects a sign that a process is heading toward a malfunction, collects detailed data on the target process at that moment (information for investigating the root cause), and halts the deterioration of state that the malfunction would bring about.

[0017] Focusing on the above problems, the present invention captures signs that are likely to impair the stable operation of the system without relying on predefined thresholds, collects detailed material at that moment, applies emergency measures to prevent the system state from deteriorating, and only afterwards (once the urgency has been eased) reports to the system administrator.

[0018] As a result, the system administrator can operate calmly and with time to spare, which eliminates operating errors. That is, the present invention can prevent the system failures (extreme performance degradation, hang-ups, and crashes) that a process malfunction would cause, greatly reducing the burden on the system administrator, and can provide the detailed material (data from the moment the sign appeared) needed to investigate the root cause of that malfunction.

[0019]

[Embodiments of the Invention]

Next, an embodiment of the present invention will be described with reference to the drawings. FIG. 1 is a block diagram showing the configuration of a system monitoring device according to one embodiment of the present invention. In FIG. 1, the system monitoring device 1 comprises operation information acquisition means 11, an operation data file 12, operation mode recognition means 13, a comparison value table 14, system failure prevention means 15, a detailed data file 16, response reporting means 17, a display terminal 18, and a recording medium 19. The system monitoring device 1 consists mainly of a computer (not shown), and each of the above means is realized by the computer executing the program on the recording medium 19.

[0020] The operation information acquisition means 11 acquires, at regular intervals, system operation status data (system resource usage rates and execution information for each process) indicating the operation status of a system in which a plurality of processes are executed in sequence, and outputs the data in time series to the operation data file 12.

[0021] The comparison value table 14 consists of reference data 141 and operation data 142. The reference data 141 holds reference values for the system resource usage rates (index values for stable operation). The operation data 142 holds the latest system operation status data together with the data from the four preceding acquisitions (five generations of data in total).
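The five-generation behavior of the operation data 142 can be sketched with a bounded queue. This is only an illustration under stated assumptions: the class name `ComparisonTable`, the method `add_sample`, and the dictionary layout of a sample are hypothetical and do not appear in the patent.

```python
from collections import deque

GENERATIONS = 5  # latest sample plus the four before it, per the description


class ComparisonTable:
    """Hypothetical stand-in for comparison value table 14."""

    def __init__(self, reference):
        self.reference = dict(reference)            # plays the role of reference data 141
        self.operation = deque(maxlen=GENERATIONS)  # plays the role of operation data 142

    def add_sample(self, sample):
        # Appending to a full deque drops the oldest generation automatically.
        self.operation.append(sample)


table = ComparisonTable({"cpu": 60.0})
for cpu in (10, 20, 30, 40, 50, 60, 70):
    table.add_sample({"cpu": cpu})
```

After seven samples have been added, only the five most recent remain, which matches the "latest plus four previous" retention described above.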

[0022] The operation mode recognition means 13 initially creates and subsequently updates the comparison value table 14 based on the operation data file 12. When the current system resource usage rate exceeds the reference data 141, the operation mode recognition means 13 notifies the system failure prevention means 15 of that fact.

[0023] The system failure prevention means 15 identifies, based on the operation data 142, a process that shows signs of improper processing, collects detailed data on the target process (a processing trace and the like), and outputs it to the detailed data file 16; at the same time, it applies an emergency preventive action to forestall a system failure (extreme performance degradation, hang-up, or crash). Here, the emergency preventive action is a measure that prevents the system state from deteriorating, such as lowering the priority of the target process or suspending the target process.

[0024] After the emergency preventive action is complete, the system failure prevention means 15 carefully rechecks the detailed state of the process. If a malfunction is suspected, it appends the data from this final check, the details of the emergency action, and the like to the detailed data file 16 and notifies the response reporting means 17. The response reporting means 17 outputs the details of the emergency preventive action, the storage location of the detailed data, and the like to the display terminal 18 to report to the system administrator.

[0025] On the other hand, if the process is clearly judged to be operating normally, the system failure prevention means 15 cancels the emergency preventive action it applied, deletes the data for that process from the detailed data file 16, and requests the operation mode recognition means 13 to update the reference data 141.

[0026] FIGS. 2 and 3 are flowcharts showing the operation of the system monitoring device according to one embodiment of the present invention. The operation of the system monitoring device will be described with reference to FIGS. 1 to 3. The processing shown in FIGS. 2 and 3 is realized by the computer of the system monitoring device 1 executing the program on the recording medium 19.

[0027] The operation information acquisition means 11 acquires system operation status data at regular intervals (for example, every two minutes) and enters it sequentially, in time series, into the operation data file 12. Each entry in the operation data file 12 contains the acquisition time, the system resource usage rates at that time (CPU (Central Processing Unit) usage rate, memory usage rate (process portion), usage rate of each peripheral device, and the like), and the execution information for each process (CPU usage rate, memory usage, cumulative number of inputs and outputs to each peripheral device, and the like).
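One possible in-memory shape for an entry of the operation data file 12 is sketched below. The patent specifies only what an entry contains, not its layout, so every class and field name here (`Entry`, `ProcessInfo`, `cpu_percent`, and so on) is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class ProcessInfo:
    """Execution information for one process (field names are illustrative)."""
    pid: int
    cpu_percent: float  # CPU usage rate
    memory_kb: int      # memory usage
    io_count: int       # cumulative I/O operations to peripheral devices


@dataclass
class Entry:
    """One time-stamped entry of the operation data file."""
    acquired_at: datetime
    usage: dict         # system-wide usage rates in percent, keyed by item
    processes: list = field(default_factory=list)


def busiest_process(entry):
    """Return the process with the highest CPU usage rate in this entry."""
    return max(entry.processes, key=lambda p: p.cpu_percent)


entry = Entry(
    acquired_at=datetime(2002, 5, 29, 12, 0),
    usage={"cpu": 72.0, "memory": 55.0, "disk": 10.0},
    processes=[ProcessInfo(101, 5.0, 2048, 30), ProcessInfo(202, 64.0, 8192, 12)],
)
```

Keeping the per-process records inside each entry is what later lets a monitor trace a system-wide rise back to a specific process.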

[0028] The operation mode recognition means 13 judges whether the system is operating stably while updating the reference data 141 as appropriate so that it reflects the current mode of operation. The system is judged to be in stable operation when the system resource usage rate in the latest entry read from the operation data file 12 is at or below the reference data 141.

[0029] Based on the system resource usage rates for the past hour accumulated in the operation data file 12, the operation mode recognition means 13 initially sets and updates the reference data 141 (steps S1, S3, and S5 in FIG. 2). An update takes place when 30 minutes have elapsed since the last update and when an update request is received from the system failure prevention means 15. For each item, the reference data 141 is set to "(average value over the past hour + maximum value) / 2".
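The reference value rule, the midpoint between the past hour's average and its maximum, can be written out directly. The sketch below is a minimal illustration; the function names and the dictionary-per-entry representation are assumptions, not part of the patent.

```python
from statistics import mean


def reference_value(samples):
    """Midpoint between the average and the maximum of the given samples."""
    if not samples:
        raise ValueError("at least one sample is required")
    return (mean(samples) + max(samples)) / 2


def reference_data(entries):
    """Compute the reference value per item from usage dicts covering the past hour."""
    items = entries[0].keys()
    return {item: reference_value([e[item] for e in entries]) for item in items}


# Four CPU usage samples: average 40, maximum 70, so the reference value is 55.
cpu_reference = reference_value([20, 30, 40, 70])
```

At the two-minute acquisition interval given as an example in the description, one hour corresponds to 30 samples per item. Because the maximum is included, the reference sits above the average load, so ordinary fluctuation does not by itself count as leaving stable operation.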

[0030] That is, if there is no past data (step S1 in FIG. 2), the operation mode recognition means 13 waits one hour (step S2 in FIG. 2); if there is past data (step S1 in FIG. 2), it sets the reference data 141 from the data for the past hour (step S3 in FIG. 2). The operation mode recognition means 13 then adds the latest entry to update the operation data 142 (step S4 in FIG. 2).

[0031] When the operation mode recognition means 13 receives an update request from the system failure prevention means 15 (step S5 in FIG. 2), it returns to step S3 and sets the reference data 141. If no update request has been received (step S5 in FIG. 2), it determines whether the system is in stable operation (step S6 in FIG. 2).

[0032] If the operation mode recognition means 13 judges that the system is in stable operation, it waits until the next acquisition time for the reference data 141 (step S7 in FIG. 2). When it is time to update the reference data 141 (step S8 in FIG. 2), it returns to step S3 and sets the reference data 141. If it is not yet time to update the reference data 141 (step S8 in FIG. 2), it returns to step S4 and updates the operation data 142.

[0033] On the other hand, if the operation mode recognition means 13 does not judge the system to be in stable operation, it notifies the system failure prevention means 15 of that fact (step S6 in FIG. 2) and waits until the next acquisition time for the reference data 141 (step S7 in FIG. 2). When it is time to update the reference data 141 (step S8 in FIG. 2), it returns to step S3 and sets the reference data 141. If it is not yet time to update the reference data 141 (step S8 in FIG. 2), it returns to step S4 and updates the operation data 142.
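The branching in steps S5 to S8 can be summarized as a small decision function. This is a sketch under assumptions: the names `is_stable` and `monitor_step` are hypothetical, the wait in step S7 is left out, and stability is taken to mean that every item is at or below its reference value, which the description does not spell out per item.

```python
UPDATE_PERIOD_MIN = 30  # the reference data is refreshed 30 minutes after the last update


def is_stable(usage, reference):
    """Assumed rule: stable when every item is at or below its reference value."""
    return all(usage[item] <= reference[item] for item in reference)


def monitor_step(update_requested, usage, reference, minutes_since_update):
    """Return (notify_prevention_means, next_step) for one pass through S5 to S8."""
    if update_requested:                           # S5: update request received
        return False, "S3"
    notify = not is_stable(usage, reference)       # S6: stability check
    # S7 (waiting for the next acquisition time) is omitted in this sketch.
    if minutes_since_update >= UPDATE_PERIOD_MIN:  # S8: time to refresh the reference
        return notify, "S3"
    return notify, "S4"
```

For example, `monitor_step(False, {"cpu": 80.0}, {"cpu": 55.0}, 10)` signals a notification and continues at step S4, while the same call with an elapsed time of 30 minutes continues at step S3.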

[0034] When the system failure prevention means 15 receives the notification from the operation mode recognition means 13 that the system is not in stable operation (step S11 in FIG. 3), it determines whether the system resource usage rate is rising sharply (step S12 in FIG. 3). To do so, it acquires the system resource usage rate several times at short intervals, for example every three seconds, and judges from the differences between the acquired values.
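The sharp-rise check can be sketched as repeated sampling followed by a comparison of consecutive differences. The patent does not state how large a difference counts as a sharp rise, so the `min_rise` parameter and its default are assumptions, as are the function names and the injected `read_usage` callable.

```python
import time


def sample_usage(read_usage, count=4, interval_s=3.0, sleep=time.sleep):
    """Read the usage rate `count` times, pausing `interval_s` seconds between reads."""
    samples = [read_usage()]
    for _ in range(count - 1):
        sleep(interval_s)
        samples.append(read_usage())
    return samples


def is_surging(samples, min_rise=5.0):
    """Assumed rule: every consecutive difference is at least `min_rise` points."""
    diffs = [later - earlier for earlier, later in zip(samples, samples[1:])]
    return bool(diffs) and all(d >= min_rise for d in diffs)
```

Passing the sleep function as a parameter keeps the sampling loop testable without real delays. A sequence such as 70, 78, 88, 97 counts as a sharp rise under the default setting, whereas 70, 71, 71, 72 reflects the flat, gradual increase that is normal just above the reference value.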

[0035] Here, in the region just above the reference data 141, the usage rate normally shows a flat, gradual increase. Therefore, when a sharp rise in the system resource usage rate is detected, the system failure prevention means 15 concludes that a sign of process malfunction has become apparent and refers to the operation data 142 to identify the signs of malfunction in a specific process (CPU monopolization, an infinite loop, a memory leak, the launching of a large number of processes, and the like) (step S13 in FIG. 3).

[0036] The system failure prevention means 15 collects material on the target process (a processing trace and a stack trace) (step S14 in FIG. 3) and stores it in the detailed data file 16; at the same time, it applies an emergency preventive action (lowering the priority of the target process, suspending the target process, or the like) to prevent the system state from deteriorating (step S15 in FIG. 3). It then determines, based on the collected material and the like, whether the target process has run out of control (step S16 in FIG. 3).

[0037] If, as a result, a runaway is suspected, the system failure prevention means 15 collects further material on the target process (a memory image of its main parts and the like), stores it in the detailed data file 16 together with the process identifier, the details of the emergency preventive action, and the operation data 142 (step S17 in FIG. 3), and then notifies the response reporting means 17 (step S18 in FIG. 3).

[0038] On the other hand, if the system failure prevention means 15 clearly judges the process to be operating normally, it cancels the emergency preventive action it applied (step S19 in FIG. 3), deletes the data for that process from the detailed data file 16 (step S20 in FIG. 3), and requests the operation mode recognition means 13 to update the reference data 141 (step S21 in FIG. 3).

[0039] Also, in this case, when the system failure prevention means 15 senses that the current workload has reached the limit of the hardware's processing capacity (a CPU usage rate of 90%, a memory shortage, and the like) (step S22 in FIG. 3), it suppresses the creation of new processes (step S23 in FIG. 3) and notifies the response reporting means 17 of that fact (step S18 in FIG. 3).

[0040] The response reporting means 17 outputs the process identifier, the details of the emergency preventive action, the storage location of the detailed data, and the like to the display terminal 18 to report to the system administrator. If necessary, the system administrator collects further data and operates on the target process (terminating it, restarting it, restoring its priority, and so on).

[0041] Thus, this embodiment can prevent the system failures (extreme performance degradation, hang-ups, and crashes) that a process malfunction would cause, greatly reducing the burden on the system administrator, and can provide the detailed material (data from the moment the sign appeared) needed to investigate the root cause of that malfunction. This eliminates the need for reproduction tests, which generally require an enormous number of man-hours.

[0042]

[Effects of the Invention]

As described above, in a system monitoring device that monitors system failures in a computer system having an acquisition function for acquiring system operation status data, the present invention detects, based on the data acquired by the acquisition function, a sign that a process is heading toward a malfunction, collects detailed data relating to the target process at the time the sign is detected, and executes an emergency preventive action to halt the deterioration of state caused by the malfunction. The invention thereby prevents the system failures that a process malfunction would cause and provides the detailed material needed to investigate the root cause of that malfunction.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of a system monitoring apparatus according to one embodiment of the present invention.

FIG. 2 is a flowchart showing the operation of the system monitoring apparatus according to one embodiment of the present invention.

FIG. 3 is a flowchart showing the operation of the system monitoring apparatus according to one embodiment of the present invention.

[Explanation of symbols]

1 System monitoring apparatus
11 Operation information acquiring means
12 Operation data file
13 Operation style recognizing means
14 Comparison value table
15 System fault preventing means
16 Detailed data file
17 Response reporting means
18 Display terminal
19 Recording medium
141 Reference data
142 Operation data

Claims

[Claims]

1. A system monitoring apparatus for monitoring a system failure in a computer system having an acquisition function for acquiring system operation status data indicating an operation status in which a plurality of processes are sequentially executed, the system monitoring apparatus comprising: means for detecting, based on the data acquired by the acquisition function, a sign that a process is heading toward an illegal operation; means for collecting detailed data relating to the target process at the time when the sign that the process is heading toward an illegal operation is detected; and means for executing an emergency preventive action for halting a state deterioration caused by the illegal operation of the process.

2. A system monitoring apparatus for monitoring a system failure in a computer system having an acquisition function for acquiring system operation status data indicating an operation status in which a plurality of processes are sequentially executed, the system monitoring apparatus comprising: an operation data file in which the system operation status data acquired by the acquisition function at a preset fixed interval is sequentially entered in time series; means for determining whether the system is in stable operation by comparing a system resource usage rate in the latest entry read from the operation data file with preset reference data; means for determining, upon receiving a notification that the system is not in stable operation, whether the system resource usage rate is tending to increase rapidly; means for detecting a sign of an illegal operation of the process when a rapid increase in the system resource usage rate is detected, and collecting detailed data on the target process; and means for performing an emergency preventive action for halting a deterioration of the state of the system when the sign of the illegal operation of the process is detected.

3. The system monitoring apparatus according to claim 2, further comprising: means for canceling the emergency preventive action and issuing a request to update the reference data when it is determined that the process is operating normally; and means for updating the reference data based on past data when the request to update the reference data is input and when a preset predetermined time has elapsed.

4. The system monitoring apparatus according to claim 2, further comprising: means for determining, based on the detailed data, whether the target process has reached a runaway state; and means for further collecting detailed data on the target process and reporting that detailed data when it is determined that the runaway state has been reached.

5. The system monitoring apparatus according to claim 4, further comprising means for suppressing the creation of new processes and reporting that fact when it is determined that the runaway state has been reached and that the current workload has reached the limit of the hardware's processing capacity.

6. A system monitoring method for monitoring a system failure in a computer system having an acquisition function for acquiring system operation status data indicating an operation status in which a plurality of processes are sequentially executed, the system monitoring method comprising the steps of: detecting, based on the data acquired by the acquisition function, a sign that a process is heading toward an illegal operation; collecting detailed data relating to the target process at the time when the sign that the process is heading toward an illegal operation is detected; and executing an emergency preventive action for halting a state deterioration caused by the illegal operation of the process.

7. A system monitoring method for monitoring a system failure in a computer system having an acquisition function for acquiring system operation status data indicating an operation status in which a plurality of processes are sequentially executed, the system monitoring method comprising the steps of: sequentially entering, in time series in an operation data file, the system operation status data acquired by the acquisition function at a preset fixed interval; determining whether the system is in stable operation by comparing a system resource usage rate in the latest entry read from the operation data file with preset reference data; determining, upon receiving a notification that the system is not in stable operation, whether the system resource usage rate is tending to increase rapidly; detecting a sign of an illegal operation of the process when a rapid increase in the system resource usage rate is detected, and collecting detailed data on the target process; and performing an emergency preventive action for halting a deterioration of the state of the system when the sign of the illegal operation of the process is detected.

8. The system monitoring method according to claim 7, further comprising the steps of: canceling the emergency preventive action and issuing a request to update the reference data when it is determined that the process is operating normally; and updating the reference data based on past data when the request to update the reference data is input and when a preset predetermined time has elapsed.

9. The system monitoring method according to claim 7, further comprising the steps of: determining, based on the detailed data, whether the target process has reached a runaway state; and further collecting detailed data on the target process and reporting that detailed data when it is determined that the runaway state has been reached.

10. The system monitoring method according to claim 9, further comprising the step of suppressing the creation of new processes and reporting that fact when it is determined that the runaway state has been reached and that the current workload has reached the limit of the hardware's processing capacity.

11. A program for a system monitoring method for monitoring a system failure in a computer system having an acquisition function for acquiring system operation status data indicating an operation status in which a plurality of processes are sequentially executed, the program causing a computer to execute: a process of detecting, based on the data acquired by the acquisition function, a sign that a process is heading toward an illegal operation; a process of collecting detailed data relating to the target process at the time when the sign that the process is heading toward an illegal operation is detected; and a process of executing an emergency preventive action for halting a state deterioration caused by the illegal operation of the process.

12. A program for a system monitoring method for monitoring a system failure in a computer system having an acquisition function for acquiring system operation status data indicating an operation status in which a plurality of processes are sequentially executed, the program causing a computer to execute: a process of sequentially entering, in time series in an operation data file, the system operation status data acquired by the acquisition function at a preset fixed interval; a process of determining whether the system is in stable operation by comparing a system resource usage rate in the latest entry read from the operation data file with preset reference data; a process of determining, upon receiving a notification that the system is not in stable operation, whether the system resource usage rate is tending to increase rapidly; a process of detecting a sign of an illegal operation of the process when a rapid increase in the system resource usage rate is detected, and collecting detailed data on the target process; and a process of performing an emergency preventive action for halting a deterioration of the state of the system when the sign of the illegal operation of the process is detected.

13. The program as described, further causing the computer to execute: a process of canceling the emergency preventive action and issuing a request to update the reference data when it is determined that the process is operating normally; and a process of updating the reference data based on past data when the request to update the reference data is input and when a preset predetermined time has elapsed.

14. The program according to claim 12, further causing the computer to execute: a process of determining, based on the detailed data, whether the target process has reached a runaway state; and a process of further collecting detailed data on the target process and reporting that detailed data when it is determined that the runaway state has been reached.

15. The program according to claim 14, further causing the computer to execute a process of suppressing the creation of new processes and reporting that fact when it is determined that the runaway state has been reached and that the current workload has reached the limit of the hardware's processing capacity.
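The detection logic recited in claims 2 and 3 (and in their method and program counterparts) can be sketched as follows. The claims do not define how a "rapid increase" is measured or how the reference data is recomputed from past data, so the rules below are assumptions made for illustration: usage rates are percentages, a rapid increase means a strictly rising run over the last `window` entries with a total rise of at least `min_rise` points, and the new reference is the mean of past usage times a `margin`. All function and parameter names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def is_stable(latest: float, reference: float) -> bool:
    """Stable operation: the latest usage rate does not exceed the reference."""
    return latest <= reference


def is_rapidly_increasing(
    history: Sequence[float], window: int = 3, min_rise: float = 10.0
) -> bool:
    """Assumed rule: strictly rising over the last `window` entries,
    with a total rise of at least `min_rise` percentage points."""
    recent = list(history[-window:])
    if len(recent) < window:
        return False
    rising = all(b > a for a, b in zip(recent, recent[1:]))
    return rising and (recent[-1] - recent[0]) >= min_rise


def sign_detected(history: Sequence[float], reference: float) -> bool:
    """Check the trend only after the stability check fails, as in claim 2."""
    if not history or is_stable(history[-1], reference):
        return False
    return is_rapidly_increasing(history)


def updated_reference(past: Sequence[float], margin: float = 1.2) -> float:
    """Assumed update rule: mean of past usage scaled by a safety margin."""
    return mean(past) * margin
```

With these assumed rules, a time series of 40, 55, 75 against a reference of 60 yields a detected sign, while 70, 71, 70 does not, because the usage is above the reference but not rising.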