JP2022081134A

JP2022081134A - System monitoring device and system monitoring method

Info

Publication number: JP2022081134A
Application number: JP2020192484A
Authority: JP
Inventors: 克文綿引; Katsufumi Watabiki; 秀樹藤井; Hideki Fujii; 輝幸安永; Teruyuki Yasunaga
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2020-11-19
Filing date: 2020-11-19
Publication date: 2022-05-31

Abstract

PROBLEM TO BE SOLVED: To determine an abnormal state from operation information, an abnormality value must be derived from operation information recorded in past abnormal states or in states showing signs of abnormality, so a sign of abnormality cannot be predicted when no data can be clearly identified as abnormal. SOLUTION: In a system monitoring device 10, a monitoring manager 40 acquires the per-process operation information of computers 20 and 21 collected by monitoring agents 50 and 51, automatically selects the important processes to be monitored, and calculates the collection conditions for their operation information. By collecting and accumulating detailed operation information over a long period according to these collection conditions, changes in the values of the operation information relative to the past can be diagnosed in detail and signs leading to a failure can be detected. SELECTED DRAWING: Figure 1

Description

The present invention relates to a technique for monitoring the operating status of a computer system, and in particular to a technique for preventing system failures in a computer system used for control.

In recent years, computer systems that realize various functions have come into operation. For example, computer systems control infrastructure such as plants, water and sewage systems, and power grids, and provide services for financial institutions, distribution, and the like. A growing number of these computer systems are required to operate stably 365 days a year. For this reason, they are operated using techniques that collect system operation information in real time at fixed intervals, determine from the collected operation information whether there are signs of abnormality, and use the operation information when investigating the cause of an abnormality. Examples of such monitoring techniques are disclosed in Patent Document 1 and Patent Document 2.

In Patent Document 1, in order to predict an abnormal event in a plant before it occurs, the contribution of the operation data that constitutes a sign of abnormality is calculated, together with a matching index obtained by comparing operation data from past signs of abnormality with the current operation data, and a sign of abnormality is determined from these.

In Patent Document 2, a reference is set for specific operation information, a sign of abnormality is determined when that operation information tends to deviate from the reference, and related operation information is collected in order to investigate the cause of the failure.

Patent Document 1: Japanese Unexamined Patent Application Publication No. 2019-57164
Patent Document 2: Japanese Unexamined Patent Application Publication No. 2003-345629

The conventional monitoring techniques described above determine abnormal values of the system from the collected operation information, and base that judgment on the relevance and similarity to operation information from past abnormal states or states showing signs of abnormality. Alternatively, a reference range regarded as normal behavior of the operation information is set in advance, and the system is judged to be showing signs of abnormality when the range is exceeded.

When the system to be monitored is running, a plurality of processes are "operating". Monitoring therefore requires that each of the plurality of processes be monitored in the same way. In addition, while the system is running, the set of processes that "operate" changes dynamically according to the work being performed.

For these reasons, simply monitoring whether every process stays within its reference range places a heavy load on the monitoring of the system's operating status.

To solve this problem, the present invention collects operation information for the processes executed in the target system according to the "importance" of each process.

More specifically, the invention is a system monitoring device that monitors a computer system realizing a predetermined function by executing a plurality of processes, the device comprising: a process management table that stores collection conditions for collecting the operation information of each of the plurality of processes; an operation information collection unit that collects the operation information of each process of the computer system according to the stored collection conditions; and a process selection unit that determines an importance indicating how important each process is relative to the plurality of processes as a whole, specifies the collection conditions of the process according to that importance, and updates the stored collection conditions to the specified collection conditions.

The present invention also includes a system monitoring method that uses the system monitoring device, and a computer program for causing a computer to function as the system monitoring device. The present invention further includes a storage medium storing this computer program.

According to the present invention, monitoring of a computer system according to the importance of its processes can be realized more easily.

A configuration diagram of a computer system that controls plant equipment according to one embodiment of the present invention.
A functional block diagram showing the software configuration of the monitoring manager in one embodiment of the present invention.
A functional block diagram showing the software configuration of the monitoring agent in one embodiment of the present invention.
A diagram showing the process management table in one embodiment of the present invention.
A diagram showing the operation information item table in one embodiment of the present invention.
A flowchart showing the processing flow of the system monitoring device in the embodiment of the present invention.
A flowchart showing the details of step S408 of FIG. 4 in one embodiment of the present invention.
A flowchart showing the details of step S410 of FIG. 4 in one embodiment of the present invention.
A diagram showing an example of the hardware configuration of the system monitoring device 10 and the computer 20 in one embodiment of the present invention.

Hereinafter, one embodiment of the present invention is described in detail with reference to the drawings. FIG. 1 shows a computer system 1 that controls plant equipment using computers 20 and 21, which are the monitoring targets of this embodiment. In the computer system 1, a plurality of computers are connected via a network communication path and can exchange information data with one another. This allows the computer system 1 to realize predetermined functions such as the control of plant equipment. The computer system 1 consists of the computer 20, the computer 21, and a communication path 30; the system monitoring device 10, which is the means of realizing this embodiment, is likewise connected to the communication path 30.

The monitoring manager 40, a program running on the system monitoring device 10, communicates with the monitoring agent 50 and the monitoring agent 51, programs running on the computer 20 and the computer 21 respectively. Through this communication, the system monitoring device 10 acquires the operation information used to diagnose the operating status of the computer system 1.

In this embodiment the system monitoring device 10 is included in the computer system 1, but it may be provided separately from the computer system 1. The monitoring manager 40 and the monitoring agents 50 and 51 may also be implemented in hardware such as circuits. In that case, the units of the monitoring manager 40 and the monitoring agents 50 and 51 described later are also implemented in hardware. Furthermore, the monitoring agents 50 and 51 may be omitted. In that case, the computers 20 and 21 transmit operation information to the system monitoring device 10 according to instructions from the monitoring manager 40.

Next, the monitoring manager 40 and the monitoring agent 50 are described with reference to FIGS. 2A and 2B. FIG. 2A is a functional block diagram showing the software configuration of the monitoring manager 40, a program running on the system monitoring device 10, in one embodiment of the present invention.

FIG. 2B is a functional block diagram showing the software configuration of the monitoring agent 50 running on each computer in one embodiment of the present invention. In FIG. 2A, the monitoring manager 40 uses the operation information collection unit 200 to collect, on a per-process basis, operation information from which the operation status of each computer can be confirmed. The operation information to be collected consists of the items defined in the operation information item table 230, which is created for each computer.

The operation information is collected at predetermined intervals and stored in the operation information DB 250. The system operation status diagnosis unit 210 then uses the accumulated operation information to diagnose signs of an impending failure in the computer system 1. In this diagnosis, the system operation status diagnosis unit 210 may calculate changes in the values of the operation information and changes in their mutual relationships, and use the results.

Furthermore, the process selection unit 220 determines the importance of each process using the process information held in the process management table 240, which is created for each computer from which information is collected. The process selection unit 220 may also select processes whose determined importance is at or above a predetermined level as important processes that may affect the operation of the computer system 1 or the business it supports. The operation information items to be collected may be changed according to the importance. In addition, more detailed operation information, or operation information for a larger number of items, may be collected for the selected important processes.

The monitoring agent 50 shown in FIG. 2B is a program that runs on the computer 20 and collects operation information indicating the operation status of that computer. The monitoring agent 51 has the same functional configuration as the monitoring agent 50, so its description is omitted.

The process management table (agent) 270 stores, out of the process information in the process management table 240 of the monitoring manager 40, the process information of the computers 20 and 21 on which the monitoring agents 50 and 51 are deployed. To this end, it is desirable that the relevant process information in the process management table 240 be copied periodically to the process management table (agent) 270.

Similarly, the operation information item table (agent) 260 stores, out of the operation information items in the operation information item table 230 on the monitoring manager 40, the items to be collected on the computers 20 and 21 on which the monitoring agents 50 and 51 are deployed. It is desirable that the relevant operation information items in the operation information item table 230 be copied periodically to the operation information item table (agent) 260.

In this embodiment, the operation information collection unit (agent) 280 refers to the setting values in these tables, collects the required operation information, and records it in the collected data DB 290.

When the monitoring manager 40 is implemented in software, that is, as a computer program, it can be implemented as a monitoring manager program. In this case, the operation information collection unit 200, the system operation status diagnosis unit 210, and the process selection unit 220 can each be implemented as a module. Alternatively, they may be implemented as an operation information collection program 2001, a system operation status diagnosis program 2101, and a process selection program 2201, respectively. This configuration is described later with reference to FIG. 7. These computer programs are stored in a storage medium. They may also be delivered to the system monitoring device 10 via the communication path 30.

An example of the hardware configuration of the system monitoring device 10, which has the monitoring manager 40, and of the computer 20, which has the monitoring agent 50, is now described with reference to FIG. 7. The configuration of the computer 21 is the same as that of the computer 20, so its description is omitted.

In FIG. 7, the system monitoring device 10 is implemented as a so-called computer. The system monitoring device 10 therefore comprises a processing unit 101, an input/output unit 102, a main storage unit 103, and an auxiliary storage unit 104, connected to one another via a bus or the like.

The processing unit 101 is implemented by an arithmetic unit such as a CPU and executes the processing described later according to the programs loaded into the main storage unit 103. The input/output unit 102 is connected to the communication path 30 and thereby to the computer 20 and a terminal device 60. The terminal device 60 can be implemented by a computer with functions that let a user input instructions to the system monitoring device 10 and that output the computation results of the system monitoring device 10. The functions of the terminal device 60 may instead be provided in the system monitoring device 10. The input/output unit 102 may also be divided into a component connected to the communication path 30 and a component connected to the terminal device 60.

The main storage unit 103 is implemented by so-called memory. In this embodiment, the operation information collection program 2001, the system operation status diagnosis program 2101, and the process selection program 2201 described above are loaded into it, and the processing unit 101 executes operations according to these programs.

The auxiliary storage unit 104 is implemented by a storage device such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive). It stores the operation information item table 230, the process management table 240, and the operation information DB 250, which are described in detail later. It also stores the various programs that are loaded into the main storage unit 103. The system monitoring device 10 may be implemented as one component for providing so-called cloud computing. In that case, it is desirable that the communication path 30 be implemented over the Internet.

Next, the configuration of the computer 20 is described with reference to the same figure. The computer 20 is also implemented as a so-called computer. It therefore comprises a processing unit 201, an input/output unit 202, a main storage unit 203, and an auxiliary storage unit 204, connected to one another via a bus or the like. These can be implemented in the same way as the processing unit 101, the input/output unit 102, the main storage unit 103, and the auxiliary storage unit 104. An operation information collection program 2801, corresponding to the operation information collection unit (agent) 280, is loaded into the main storage unit 203.

The auxiliary storage unit 204 stores the operation information item table (agent) 260, the process management table (agent) 270, and the collected data DB 290. The operation information collection program 2801, which implements the monitoring agent 50, is transmitted from the system monitoring device 10 and stored in the auxiliary storage unit 204.

Next, FIGS. 3A and 3B show the two software management tables held by the monitoring manager 40. FIG. 3A shows the operation information item table 230. FIG. 3B shows the process management table 240.

The operation information item table 230 shown in FIG. 3A lists, as items 301, the types of operation information to be collected. The items 301 consist of column information for the common information 302 that is collected, and column information for detailed information 303, whose items are selected for collection based on the operation priority and operation ratio used to calculate the importance of a process.

The process management table 240 shown in FIG. 3B stores operation information indicating the operating state of processes. Here, a process means an executing instance of a program that runs on its own on a computer, and a plurality of different processes can run in parallel. In control-oriented computer systems in particular, it is important that a program complete its operation within a fixed elapsed time. To achieve this, an operation priority is defined for each process so that it can run ahead of other processes. A process with a high priority runs ahead of a process with a low priority, and a low-priority process starts running after the high-priority process has finished. In general, a process with a high priority is an important process.

The term "process" in this embodiment is not limited to a process in the narrow sense and also covers so-called tasks and the like. In other words, a process in this embodiment may be any unit of program execution.

The priority 351 in the process management table 240 distinguishes processes with a high operation priority from processes at the normal level. This embodiment distinguishes two levels, "high" and "normal", but three or more levels may be used, for example by adding "low". The PID 352 is an identifier for each process and has a different number for each process.

The operation level 353 indicates, in steps, the operation priority of each process. The priority 351 corresponds to the operation level 353: the smaller the operation level value, the higher the priority 351. Here, operation levels from 1 to 9 are defined as "high" priority, and levels of 10 or more as "normal" priority. In this embodiment the priority 351 and the operation level 353 thus correspond to each other, but they may be independent values.
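The mapping from operation level 353 to priority 351 described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name is hypothetical, and rejecting levels below 1 is an assumption, since the text only specifies the ranges 1 to 9 and 10 or more.

```python
def priority_from_operation_level(operation_level: int) -> str:
    """Map an operation level 353 to a priority 351 label.

    Levels 1 to 9 are "high" and levels of 10 or more are "normal",
    as stated in the description. Levels below 1 are not covered by
    the text, so rejecting them here is an assumption.
    """
    if operation_level < 1:
        raise ValueError("operation level is assumed to start at 1")
    return "high" if operation_level <= 9 else "normal"
```

For example, a process at operation level 3 would be labeled "high", while one at level 15 would be labeled "normal".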

The process management table 240 also contains an operation ratio 354 for each process. The operation ratio 354 is the ratio of an individual process's running time to the total running time of all processes within a unit of time. The operation ratio 354 is periodically calculated and updated by the operation information collection unit 200. For example, an operation ratio of 8% means that the process in question was running for 8% of the total time during which all processes ran within one second. In general, a process with a large operation ratio is an important process.
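As a rough sketch of how the operation ratio 354 could be derived, the function below divides each process's running time by the combined running time of all processes in the same unit of time. The function name, the use of milliseconds, and the handling of an idle interval (all ratios set to zero) are assumptions made for illustration.

```python
def operation_ratios(run_time_ms: dict[int, float]) -> dict[int, float]:
    """Return the operation ratio of each process, keyed by PID.

    run_time_ms maps a PID to the time that process ran during one
    unit of time (assumed here to be measured in milliseconds).
    Ratios are returned as fractions, so 0.08 corresponds to 8%.
    """
    total = sum(run_time_ms.values())
    if total == 0:
        # Assumption: if nothing ran in the interval, every ratio is zero.
        return {pid: 0.0 for pid in run_time_ms}
    return {pid: t / total for pid, t in run_time_ms.items()}
```

With two processes that ran for 40 ms and 460 ms in the interval, the first would have an operation ratio of 8%, matching the example in the text.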

The collection cycle 355 is a value that specifies the interval at which operation information is collected for each process. When the collection cycle 355 is 100, operation information is collected by the operation information collection function every 100 msec. The collection cycle 355 is determined according to the importance. That is, it is calculated by the process selection unit 220 using the operation level 353 and the operation ratio 354, which are examples of the importance. As one example, the process selection unit 220 calculates the collection cycle using (Equation 1) below and periodically updates it.

Collection cycle = (operation level × 100) − (operation ratio × 1000) ... (Equation 1)
(where the minimum value is 100 and the result is rounded up at the tens place)

Further, the operation information item 356 indicates the result of selecting among the operation information indicated by the detailed information 303 in the operation information item table 230. That is, the operation information item 356 is set to indicate the operation information to be collected for processes whose priority 351 is "high". The operation information collection unit 200 therefore collects the operation information indicated by the operation information item 356 for processes whose priority 351 is "high". In this embodiment, for processes with "normal" priority, the operation information collection unit 200 collects only the common information 302. However, the operation information item 356 may also be set for processes with "normal" priority. In that case, it is desirable that the operation information item 356 for "high" priority contain more items than that for "normal" priority. This processing is also covered in the description of the flowcharts in FIG. 4 and the subsequent figures.
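Equation 1 can be sketched in code as follows. Two points are assumptions rather than statements from the text: the operation ratio is taken as a fraction (8% is 0.08), and "rounded up at the tens place" is read as rounding up to the next multiple of 100. The function and constant names are hypothetical.

```python
import math

MIN_CYCLE_MS = 100  # minimum collection cycle stated with Equation 1


def collection_cycle_ms(operation_level: int, operation_ratio: float) -> int:
    """Compute the collection cycle 355 from Equation 1.

    operation_ratio is assumed to be a fraction (0.08 for 8%).
    """
    raw = operation_level * 100 - operation_ratio * 1000
    # Assumption: rounding up at the tens place yields a multiple of 100.
    rounded = math.ceil(raw / 100) * 100
    return max(MIN_CYCLE_MS, rounded)
```

Under these assumptions, a process at operation level 1 with an operation ratio of 8% is clamped to the 100 msec minimum, while a process at level 10 with a ratio of 2% gets a 1000 msec cycle, so more important processes are sampled more often.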

Next, the processing of this embodiment is described in detail using the flowcharts shown in FIGS. 4 to 6. The entities that perform the processing in these flowcharts are described using the configurations shown in FIGS. 2A and 2B.

FIG. 4 is a flowchart showing the processing flow of the system monitoring device 10 in this embodiment. First, in step S401, the monitoring agent 50 starts on the computer from which operation information is to be collected. This startup is executed when a predetermined condition is satisfied, such as startup of the computer 20, a predetermined time (e.g., 7:00 every day), or a startup instruction from the system monitoring device 10 or the terminal device 60. In each flowchart the monitoring agent 51 is processed in the same way as the monitoring agent 50, so its description is omitted.

In step S402, the monitoring manager 40 on the system monitoring device 10 starts. It is desirable that this startup be triggered by the startup of the monitoring agent 50, in which case the monitoring agent 50 sends a startup instruction. Alternatively, this startup may be executed when a predetermined condition is satisfied, such as startup of the computer 20, a predetermined time (e.g., 7:00 every day), or a startup instruction from the system monitoring device 10 or the terminal device 60. The order of steps S401 and S402 may also be reversed. In that case, it is desirable that the monitoring manager 40 send a startup instruction to the monitoring agent 50.

Next, in step S403, the operation information item table (agent) 260 is updated. To do so, the operation information collection unit 200 of the monitoring manager 40 first sends an operation information collection instruction to the computer 20. As part of this, the operation information collection unit 200 distributes the operation information collection item information in the operation information item table 230 to the computer 20.

As a result, the operation information collection unit (agent) 280 stores the distributed operation information collection item information in the operation information item table (agent) 260 on the computer 20. For example, the operation information collection unit (agent) 280 updates the operation information collection item information by copying it into the operation information item table (agent) 260.

Further, in step S404, the operation information collection unit 200 initializes the process management table 240. To do so, it may set predetermined values in the process management table 240, or it may collect information on the processes running on the computer 20 and set the priority 351, PID 352, and operation level 353 of each process. For example, at initialization the operation information collection unit 200 sets, as initial values, an operation ratio of 0%, a collection cycle of 5000 (5 seconds), and no operation information items (all off).

The operation information collection unit 200 also sends the process management information of the initialized process management table 240 to the monitoring agent 50. The monitoring agent 50 then initializes the process management table (agent) 270 by setting the received process management information in it.
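The initialization in step S404 and the hand-over to the agent can be pictured with a minimal sketch. The class name, field names, and item names below are hypothetical and chosen only for illustration; the only values taken from the description are the initial operation ratio (0%), the initial collection cycle (5000, read here as milliseconds), and the all-off operation information items. The deep copy merely stands in for sending the table to the monitoring agent.

```python
import copy
from dataclasses import dataclass, field

# Hypothetical item names: the description only says every item starts "off".
ITEM_NAMES = ("memory_size", "cpu_time", "io_count")


@dataclass
class ProcessEntry:
    pid: int                          # PID 352
    operation_ratio: float = 0.0      # operation ratio 354, initially 0 %
    collection_cycle_ms: int = 5000   # collection cycle 355, initially 5 s
    items: dict = field(default_factory=lambda: dict.fromkeys(ITEM_NAMES, False))


def init_table(pids):
    """Build a freshly initialized table keyed by PID (step S404)."""
    return {pid: ProcessEntry(pid=pid) for pid in pids}


manager_table = init_table([101, 102])       # process management table 240
agent_table = copy.deepcopy(manager_table)   # stands in for table (agent) 270
```

Keeping the agent-side table as an independent copy mirrors the description, in which the manager and the agent each hold their own process management table.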

Next, in step S405, the operation information of the computer 20 is collected. To this end, the operation information collection unit (agent) 280 of the monitoring agent 50 collects the operation information of the computer 20 in accordance with the process management information stored in the process management table (agent) 270. The monitoring agent 50 then stores the collected operation information in the collected data DB 290.

Then, in accordance with the collection cycle 355 and the operation information item 356 set in the process management table 240, the operation information collection unit 200 collects the operation information stored in the collected data DB 290 for each process and records it in the operation information DB 250.

The transfer of operation information from the collected data DB 290 to the operation information DB 250 may instead be performed as follows: the operation information collection unit (agent) 280 reads the operation information from the collected data DB 290 and sends it to the operation information DB 250.

Next, in step S406, the operation information collection unit 200 determines whether the monitoring manager 40 has been instructed to stop the operation information collection process. If it determines that a stop instruction has been given, the process proceeds to step S411. If it determines that no stop instruction has been given, the process proceeds to step S407.

In step S411, the operation information collection unit 200 sends a stop instruction to the monitoring agent 50 on the computer 20. As a result, the monitoring agent 50 stops operating. Then, in step S412, the operation information collection unit 200 stops the monitoring manager 40 and ends the monitoring process.

In step S407, the process selection unit 220 checks whether the predetermined update cycle of the process management table 240 has been exceeded. An ordinary control system performs little batch processing that handles large amounts of data. For this reason, a value of about one hour is generally sufficient as the update cycle. If the update cycle has not been exceeded, the process returns to step S405 and continues. If it has been exceeded, the process proceeds to step S408.

In step S408, the process selection unit 220 determines the importance of each process running on the computer 20, that is, how important the process is to the target computer system 1. Examples of the importance include the operation level 353 itself, the priority 351 based on it, and other indicators. In this step, the process selection unit 220 also uses the importance to determine the collection conditions under which the operation information is collected.

The determination of the importance and the collection conditions is described below using two examples. In the first example, the process selection unit 220 determines the operation level 353 as the importance. The determination method is described later for step S501 in FIG. 5. Next, the process selection unit 220 uses the determined operation level 353 and the operation ratio 354 of the process to calculate the collection cycle 355, which is one example of a collection condition. This calculation uses (Equation 1) described above.

In the second example, the process selection unit 220 determines the priority 351 as the importance. As described above, the process selection unit 220 preferably specifies the priority ("high" or "normal") according to the operation level 353 specified in the first example. Next, the process selection unit 220 specifies, as the collection condition, the operation information item 356 defined for each priority 351. Here, the priority 351 may be calculated by weighting the operation level 353 more heavily than the operation ratio 354. Specifically, the process selection unit 220 judges a process whose operation level 353 has one digit to have a priority 351 of "high" and a process whose operation level 353 has two digits to have a priority 351 of "normal". Within each priority, the process selection unit 220 may sort the processes in descending order of operation ratio 354 and specify the sorted order as the importance. The process selection unit 220 then stores the processes in the process management table 240 in the sorted order. Of this step, the calculation of the collection cycle in the first example is described with reference to FIG. 5.
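The second example can be illustrated with a short sketch: a one-digit operation level maps to "high", anything else to "normal", and processes are then ordered by priority and, within a priority, by descending operation ratio. The function names and the dictionary layout are hypothetical, and treating every level outside 0 to 9 as "normal" is an assumption made here, since the description mentions only one-digit and two-digit levels.

```python
def priority_from_level(operation_level: int) -> str:
    """One-digit operation level 353 -> "high"; otherwise "normal" (assumption)."""
    return "high" if 0 <= operation_level <= 9 else "normal"


def rank_processes(processes):
    """Return the processes ordered by priority, then by descending operation ratio.

    Each process is a dict with 'pid', 'operation_level' and 'operation_ratio'.
    """
    order = {"high": 0, "normal": 1}
    ranked = [dict(p, priority=priority_from_level(p["operation_level"]))
              for p in processes]
    ranked.sort(key=lambda p: (order[p["priority"]], -p["operation_ratio"]))
    return ranked


procs = [
    {"pid": 1, "operation_level": 12, "operation_ratio": 40.0},
    {"pid": 2, "operation_level": 5, "operation_ratio": 10.0},
    {"pid": 3, "operation_level": 3, "operation_ratio": 30.0},
]
ranked = rank_processes(procs)
ranked_pids = [p["pid"] for p in ranked]  # [3, 2, 1]
```

In this sample, process 1 has the largest operation ratio but still ranks last, because the operation level takes precedence over the operation ratio.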

Next, in step S409, the process selection unit 220 stores the process importance and collection conditions determined in step S408 in the process management table 240. For example, the process selection unit 220 stores the priority 351 and the operation information item 356. At this time, for processes whose priority 351 is "high", that is, important processes, the operation information item 356 is determined in accordance with the collection cycle 355. This updates the on/off values of the operation information items 356 to be collected, so that the higher the importance (the priority 351) of a process, the more operation information items 356 are collected for it.

As another example, the process selection unit 220 stores the operation level 353 and the collection cycle 355. The process management table 240 may also be provided with fields for the importance and the information collection information, which the process selection unit 220 updates.

Next, in step S410, the system operation status diagnosis unit 210 diagnoses the operation status of the computer system 1. Specifically, it reads from the operation information DB 250 the operation information of each process collected in accordance with the settings in the process management table 240. It then diagnoses whether there are signs that the stable operation of the computer system 1 will be impaired. The details of this step are described later with reference to FIG. 6.
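The control flow of steps S405 to S412 can be summarized in a rough sketch. It assumes that a negative result at step S406 leads to the update-cycle check of step S407, and that processing returns to step S405 after step S410; the latter is not stated in this passage. The function name and all four callbacks are hypothetical placeholders for the units described above, and the clock is injectable only so that the loop can be exercised without waiting.

```python
import time


def monitoring_loop(collect, stop_requested, reselect, diagnose,
                    update_cycle_s=3600.0, clock=time.monotonic):
    """Run the collection loop until a stop instruction is seen."""
    last_update = clock()
    while True:
        collect()                                     # S405: collect operation information
        if stop_requested():                          # S406: stop instruction received?
            return                                    # S411, S412: stop agent and manager
        if clock() - last_update <= update_cycle_s:   # S407: update cycle not exceeded
            continue
        reselect()                                    # S408, S409: importance and conditions
        diagnose()                                    # S410: diagnose operation status
        last_update = clock()


# Dry run with a fake clock that advances 30 minutes per reading.
events = []
ticks = iter(range(0, 1_000_000, 1800))
count = {"collect": 0}


def fake_collect():
    count["collect"] += 1
    events.append("collect")


monitoring_loop(
    collect=fake_collect,
    stop_requested=lambda: count["collect"] >= 4,
    reselect=lambda: events.append("reselect"),
    diagnose=lambda: events.append("diagnose"),
    clock=lambda: next(ticks),
)
```

With the default one-hour update cycle, the dry run performs three collections before the first reselection and diagnosis, then stops at the fourth collection.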

Next, the details of step S408 in FIG. 4 are described with reference to FIG. 5, taking the calculation of the collection cycle as an example. In this process, the process selection unit 220 executes the following steps for each process indicated by the process management information registered in the process management table 240.

First, in step S501, the process selection unit 220 determines the operation level of the process at the time this step is performed by extracting it from the operation information data that has been collected and stored in the operation information DB 250.

Next, in step S502, the process selection unit 220 calculates, as the operation ratio 354, the ratio of each individual process's operating time to the sum of the operating times of all processes that ran during the update cycle period.
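The ratio in step S502 is a simple share of the total operating time. A minimal sketch follows, assuming the ratio is expressed as a percentage (the initial value is given as 0%) and that operating times are available per PID; the function name and the handling of a zero total are assumptions added for illustration.

```python
def operation_ratios(operating_times):
    """Map each PID to its share (in percent) of the total operating time.

    operating_times: dict of PID -> operating time during the update cycle period.
    """
    total = sum(operating_times.values())
    if total == 0:
        # Assumption: if nothing ran during the period, every ratio is 0 %.
        return {pid: 0.0 for pid in operating_times}
    return {pid: 100.0 * t / total for pid, t in operating_times.items()}


ratios = operation_ratios({101: 30.0, 102: 10.0, 103: 0.0})
# ratios == {101: 75.0, 102: 25.0, 103: 0.0}
```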

Next, in step S503, the process selection unit 220 applies the calculated operation ratio 354 and the operation level 353 to (Equation 1) described above to calculate the operation information collection cycle 355. This concludes the description of the flowchart of FIG. 5.

Next, the details of step S410 in FIG. 4 are described with reference to FIG. 6. In this flowchart, the system operation status diagnosis unit 210 executes the following processing for each process indicated by the process management information registered in the process management table 240.

First, in step S601, the system operation status diagnosis unit 210 reads, from the operation information stored in the operation information DB 250, the data for the common information 302 (such as memory usage size) of the operation information item table 230. The common information 302 consists of items that can be used in common for diagnosing every process; it is, for example, the most basic operation information, in which changes for better or worse in a process's behavior show up clearly. As a more specific example, the system operation status diagnosis unit 210 reads from the operation information DB 250 the operation information accumulated over the past 24 hours.

Next, in step S602, the system operation status diagnosis unit 210 diagnoses the data indicated by the read common information 302. More specifically, it makes the diagnosis by detecting changes in the values of the read data. If any data value falls outside the reference range defined for that data, the system operation status diagnosis unit 210 diagnoses that there is a sign leading to a system failure. It then outputs the data value and the PID 352, which identifies the process, to the terminal device 60 as message data representing the diagnosis result. The system operation status diagnosis unit 210 may also store the diagnosis result in the auxiliary storage unit 104 of the system monitoring device 10.
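The reference-range check in step S602 can be sketched as follows. The function name, the item name `memory_kb`, the sample values, and the range are all hypothetical; the description specifies only that each kind of data has its own reference range and that the out-of-range value is reported together with the PID.

```python
def diagnose_common(pid, samples, reference_ranges):
    """Return one finding per data value outside its reference range.

    samples: dict of item name -> list of values read from the operation information DB.
    reference_ranges: dict of item name -> (low, high) inclusive bounds.
    """
    findings = []
    for item, values in samples.items():
        low, high = reference_ranges[item]
        for value in values:
            if not (low <= value <= high):
                findings.append({"pid": pid, "item": item, "value": value})
    return findings


findings = diagnose_common(
    pid=4321,
    samples={"memory_kb": [100, 120, 900]},
    reference_ranges={"memory_kb": (0, 500)},
)
# findings == [{"pid": 4321, "item": "memory_kb", "value": 900}]
```

Each finding carries the PID and the offending value, which corresponds to the message data sent to the terminal device 60.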

Next, in step S603, the system operation status diagnosis unit 210 determines whether the process being diagnosed is an important process. This determination is made according to the value of the priority 351 in the process management table 240. In the present embodiment, the system operation status diagnosis unit 210 determines that a process is an important process if its priority 351 is the predetermined value "high". If the process is determined to be an important process, the processing proceeds to step S604. If it is determined not to be an important process, the processing shown in this flowchart ends.

Next, in step S604, the system operation status diagnosis unit 210 reads from the operation information DB 250 the data corresponding to the items set in the operation information item 356 of the process management table 240. As operation information, these data are classified as detailed information, which makes it possible to check the operating status of a process in fine detail. For the detailed information as well, for example, the data accumulated over the past 24 hours is read from the operation information DB 250.

Next, in step S605, the system operation status diagnosis unit 210 makes a diagnosis by detecting changes in the values of the read data. In this diagnosis as well, it determines that there is a sign leading to a failure if a value falls outside the reference range defined for that data, or if a comparison of the data changes hour by hour shows that the shape of the change differs greatly. The system operation status diagnosis unit 210 then outputs the diagnosis result, together with the data value and the value of the PID 352 that identifies the process, to the terminal device 60 as message data. It may also store the diagnosis result in the auxiliary storage unit 104 of the system monitoring device 10. This concludes the description of FIG. 6.
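The description does not define how the hour-by-hour "shape of the change" is compared, so the following is only one possible reading, stated as an assumption: take the net change of a value within each hour and flag an hour whose net change differs from that of the preceding hour by more than a tolerance. The function names, the tolerance, and the sample data are hypothetical.

```python
def hourly_changes(values, samples_per_hour):
    """Net change (last sample minus first sample) within each full hour."""
    changes = []
    for start in range(0, len(values) - samples_per_hour + 1, samples_per_hour):
        window = values[start:start + samples_per_hour]
        changes.append(window[-1] - window[0])
    return changes


def deviating_hours(values, samples_per_hour, tolerance):
    """Indices of hours whose net change differs from the previous hour's
    net change by more than `tolerance` (an assumed comparison criterion)."""
    changes = hourly_changes(values, samples_per_hour)
    return [i for i in range(1, len(changes))
            if abs(changes[i] - changes[i - 1]) > tolerance]


# Three samples per hour: two quiet hours, then a sharp rise in the third.
memory_kb = [10, 11, 12, 12, 13, 14, 14, 30, 50]
flagged = deviating_hours(memory_kb, samples_per_hour=3, tolerance=5)  # [2]
```

Other criteria, such as comparing each hour against the same hour of the previous day, would fit the description equally well; the point of the sketch is only that the comparison works on per-hour summaries of the accumulated data.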

The embodiment described above can also solve the following problems. First, determining an abnormal state from operation information requires operation information from abnormal states, or from states showing signs of abnormality, that occurred in the past; without data that can clearly be judged abnormal, signs of abnormality cannot be predicted.

Second, the following problems arise when the environment changes because peripheral devices are added or removed, or when changes occur inside or outside the computer system because of software updates or the like. The operation information that identifies the past abnormal states, or states showing signs of abnormality, used for comparison cannot serve as a basis for judging abnormal states in the changed system. In addition, the reference range may shift, so the collection of operation information used for the judgment and the reference range must be reviewed each time, which makes it difficult to keep up with changes in the system.

Third, accurately determining an abnormal state of the computer system, or a state showing signs of abnormality, requires shortening the collection cycle of the operation information so that a large amount of information is collected and recorded. This raises the processing load on the computers of both the system from which operation information is collected and the monitoring system that records and holds it. Consequently, high-performance hardware must be prepared so that the computer system is not heavily loaded and system operations are not affected.

In other words, in the present embodiment, even without high-performance computing resources, the processes important to the operation of the system's business can be identified automatically and the operating status of those important processes can be analyzed in detail. Signs that the system is heading toward a failure can therefore be detected in fine detail on a per-process basis, and by taking planned measures to prevent the failure from occurring, stable operation of the entire computer system can be maintained.

1 ... computer system, 10 ... system monitoring device, 20 ... computer, 21 ... computer, 30 ... communication path, 40 ... monitoring manager, 50 ... monitoring agent, 51 ... monitoring agent

Claims

A system monitoring device that monitors a computer system realizing a predetermined function by executing a plurality of processes, the system monitoring device comprising:
a process management table that stores collection conditions for collecting operation information of each of the plurality of processes;
an operation information collection unit that collects the operation information of each process of the computer system in accordance with the stored collection conditions; and
a process selection unit that determines the importance of each process to the plurality of processes as a whole, specifies the collection conditions of the process according to the importance, and updates the stored collection conditions to the specified collection conditions.

The system monitoring device according to claim 1, wherein
the process management table stores, for each process, as the collection conditions, a collection cycle of the operation information of the process, an operation level corresponding to the priority of operation of the process, and an operation ratio indicating the proportion of the operating time of the process among the plurality of processes, and
the process selection unit calculates, as the collection condition, the collection cycle of the operation information of each process by using the operation level and the operation ratio, which constitute the importance.

The system monitoring device according to claim 2, wherein
the process selection unit further specifies, as the importance, the priority of operation of each process according to the operation level, and determines, according to the priority, an operation information item indicating the operation information to be collected, and
the process management table stores the operation information item for each process as the collection condition.

The system monitoring device according to claim 3, wherein
the process selection unit selects an important process from the plurality of processes according to the priority, and
the system monitoring device further comprises a system operation status diagnosis unit that diagnoses the computer system by using the operation information indicated by the operation information item for the selected important process.

A system monitoring method using a system monitoring device that monitors a computer system realizing a predetermined function by executing a plurality of processes, wherein
a process management table of the system monitoring device stores collection conditions for collecting operation information of each of the plurality of processes,
an operation information collection unit of the system monitoring device collects the operation information of each process of the computer system in accordance with the stored collection conditions, and
a process selection unit of the system monitoring device determines the importance of each process to the plurality of processes as a whole, specifies the collection conditions of the process according to the importance, and updates the stored collection conditions to the specified collection conditions.

The system monitoring method according to claim 5, wherein
the process management table stores, for each process, as the collection conditions, a collection cycle of the operation information of the process, an operation level corresponding to the priority of operation of the process, and an operation ratio indicating the proportion of the operating time of the process among the plurality of processes, and
the process selection unit calculates, as the collection condition, the collection cycle of the operation information of each process by using the operation level and the operation ratio, which constitute the importance.

The system monitoring method according to claim 6, wherein
the process selection unit further specifies, as the importance, the priority of operation of each process according to the operation level, and determines, according to the priority, an operation information item indicating the operation information to be collected, and
the process management table stores the operation information item for each process as the collection condition.

The system monitoring method according to claim 7, wherein
the process selection unit selects an important process from the plurality of processes according to the priority, and
a system operation status diagnosis unit of the system monitoring device diagnoses the computer system by using the operation information indicated by the operation information item for the selected important process.