JP2001331330A

JP2001331330A - Process abnormality detection and restoration system

Info

Publication number: JP2001331330A
Application number: JP2000147904A
Authority: JP
Inventors: Tatsuya Hiraishi; 達哉平石; Manabu Takeda; 学竹田
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2000-05-19
Filing date: 2000-05-19
Publication date: 2001-11-30

Abstract

PROBLEM TO BE SOLVED: To automatically and quickly restore an abnormal process and the process group related to it, in a process abnormality detection and restoration system for a process system constituted of plural processes. SOLUTION: The system is provided with a monitoring process 1-1 for monitoring the operation states of respective processes 1-3 distributed in a server, a client or another system, and a database 1-2 for storing the dependency relations of various patterns concerned with restart processing among the respective processes. On detecting a fault in one of the monitored processes 1-3, the monitoring process 1-1 causes the plural processes having a dependency relation with the process concerned to restart.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to a process abnormality detection and recovery system for a process system composed of a plurality of processes. More particularly, it relates to a technique for detecting and recovering from an abnormality in each process in a system that has communication means via distributed processing middleware (such as CORBA: Common Object Request Broker Architecture) and consists of a process group distributed across a plurality of processors or operating systems. Note that the process group may also run on a single processor or operating system.

[0002] A preferred example of such a process system, consisting of a distributed process group that sends and receives objects using communication means via distributed processing middleware, is a network monitoring system that monitors a group of exchanges in an STM network.

[0003] In a system composed of a plurality of processes, the interdependencies among the processes become intricately entangled as the number of processes increases. In addition, when the process group of the system is distributed across a plurality of processors or operating systems, it is necessary to handle not only abnormality monitoring of each process but also abnormality detection and recovery of the communication means between the monitoring process and the monitored processes.

[0004] Furthermore, when one process system cooperates with another process system, or is built on top of another process system, abnormality detection and recovery must cover not only the process group of the own process system but also the process groups of the other process systems related to it.

[0005] In a process system consisting of such process groups, what is needed is a means of detecting, classifying, and recovering from an abnormality automatically, without manual intervention, when one occurs. In response to this need, the present invention detects, classifies, and recovers from operation abnormalities in various forms of process systems with as little manual intervention as possible.

[0006]

2. Description of the Related Art

Conventionally, in a system having a process group, when an operation abnormality occurs in one process, restarting that process alone does not always restore the system. However, restarting all of the system's processes in order to recover temporarily stops all services of the system, so it is not necessarily an effective measure from the viewpoint of continuous service provision.

[0007] For this reason, several recovery methods for process systems have been proposed. For example, Japanese Patent No. 2500745 discloses a method in which a service control node that controls the execution of application services for subscribers compares the number of failure occurrences of each process with a threshold to determine the scope of restart processing, and restores the application service by instructing it to stop and restart within that scope.

[0008] Japanese Unexamined Patent Application Publication No. 11-88471 discloses a test method and a test apparatus in which a fault isolation agent isolates the part in which an abnormality has occurred and, according to the result, a file is repaired or a process is restarted.

[0009]

PROBLEMS TO BE SOLVED BY THE INVENTION

In a system that has communication means via distributed processing middleware (such as CORBA) and a process group consisting of a plurality of processes distributed across a plurality of physically separated processors or a plurality of operating systems, the amount of recovery processing for process abnormalities and failures, and the number of combinations of recovery processing, become enormous. Automatic abnormality detection and recovery are therefore required.

[0010] In particular, when the process system must monitor its targets around the clock, as in a system that monitors a communication network, process abnormalities must be detected automatically and recovered from promptly. Furthermore, when one process system operates in cooperation with another process system, or coupled to it in a master-slave relationship, the scope of monitoring must be extended to the related process groups of the other processor system.

[0011] An object of the present invention is, in a system having a process group, to detect an abnormality in each process and to automatically and quickly restore the abnormal process and the processes linked to it. A further object is to extend the monitoring targets beyond the process group in the own system to related process groups in other systems, and to automatically restore the related process groups when a process abnormality is detected.

[0012]

SUMMARY OF THE INVENTION

In a system composed of a plurality of processes that depend on one another and operate in cooperation, recovering from an abnormality in one process may require restarting a plurality of processes. The interdependencies among the processes take various patterns and become more complicated as the number of processes increases.

[0013] (1) In such a process system, the present invention comprises a monitoring process that monitors the operating state of each process distributed in the own device or in other devices, and a database that holds the various patterns of dependency among the processes with respect to restart processing. When the monitoring process detects a failure in any of the monitored processes, it refers to the database and causes the plurality of processes that have a dependency relation with that process to restart.

[0014] One type of process abnormality is an abnormally large number of process restarts. That is, if a process is restarted frequently, the information exchanged between that process and the processes that depend on it may become inconsistent.

[0015] Therefore, (2) the present invention defines a fixed time period for each process, defines an upper limit on the number of process restarts within that period, and holds both in the database. When the number of restarts exceeds the upper limit, the monitoring process detects this, refers to the database holding the dependency information of the monitored process, and causes the related monitored processes to restart. This makes it possible to detect an abnormal number of process restarts and to quickly and automatically recover from the inconsistencies in information between processes that the restarts cause.

[0016] Another type of process abnormality is an abnormal increase in process size. Here, process size means the proportion of time for which the process occupies the CPU (the CPU occupancy rate). An increase in process size (how many times larger it is than at start) may reduce the processing capacity of the entire system.

[0017] Therefore, (3) the present invention defines an upper limit on process size for each process and holds it in the database. When the process size exceeds the upper limit, the monitoring process detects this, refers to the database holding the dependency information of the monitored process, and causes the related monitored processes to restart. Adverse effects on the system caused by an abnormal increase in process size can thus be detected and recovered from quickly and automatically.

[0018] Another type of process abnormality is an abnormal increase in the memory used by a process. That is, if the memory area used by a process increases abnormally, other processes sharing that memory can no longer operate normally, which may reduce the processing capacity of the entire system.

[0019] Therefore, (4) the present invention defines, for each process, an upper limit on the memory area used by the process (as a multiple of its value at start) and holds it in the database. When the memory area used by the process exceeds the upper limit, the monitoring process detects this, refers to the database holding the dependency information of the monitored process, and causes the related monitored processes to restart. Adverse effects on the system caused by an abnormal increase in the memory area used by a process can thus be detected and recovered from quickly and automatically.

[0020] Another type of process abnormality is an abnormal decrease in the memory work area (swap area) during process execution. That is, if the memory work area (swap area) decreases abnormally, the processing capacity of the entire system may be reduced.

[0021] Therefore, (5) the present invention defines a lower limit on the memory work area (swap area) and holds it in the database. When the memory work area (swap area) falls below the lower limit during the execution of a process, the monitoring process detects this, refers to the database holding the dependency information of the monitored process, and causes the related monitored processes to restart. Adverse effects on the system caused by an abnormal decrease in the memory work area (swap area) can thus be detected and recovered from quickly and automatically.

[0022]

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

FIG. 1 shows an embodiment of the present invention provided with a database holding the restart dependencies among processes. As shown in the figure, the process system of this embodiment includes a monitoring process 1-1 and a monitoring/failure information database 1-2 in a server, and a plurality of processes 1-3 monitored by the monitoring process 1-1 are distributed in the server, or in clients or other systems connected to the server.

[0023] Each client is provided with a “client monitoring process” 1-4, a resident communication process that communicates with the monitoring process 1-1. One of the clients can act as the “system administrator client (Administrator Client)”, logged in with system administrator authority.

[0024] The monitoring process 1-1 has a function 1-11 for checking each monitored process 1-3 periodically or at irregular times as needed, a function 1-12 for detecting a restart of each monitored process 1-3, and a function 1-13 for issuing a restart command to each monitored process 1-3 so that it executes restart processing.

[0025] The monitoring/failure information database 1-2 holds dependency information identifying the group of processes that must be restarted to recover the system when a given process has been restarted. The monitoring process 1-1 obtains the inter-process dependency information held in the monitoring/failure information database 1-2 and issues restart commands to the corresponding process group.

[0026] The monitoring process 1-1 and the monitoring/failure information database 1-2 may be provided in the server of the process system, or in another information processing apparatus (for example, a personal computer) or in a client. The same applies to the embodiments below.

[0027] In the process system shown in FIG. 1, restart processing of a process group having dependencies follows the flow shown in FIG. 2. First, the monitoring process monitors the process ID of each process periodically or irregularly (when monitoring is needed) (step 2-1) and determines whether the process ID has changed (step 2-2). When a process is restarted, the operating system assigns it a process ID different from the previous one, so if the correspondence between a process name and its process ID has changed, the monitoring process detects that the monitored process has been restarted.

[0028] Next, when the monitoring process determines from a change in process ID that a restart has occurred, it refers to the monitoring/failure information database and obtains the inter-process dependency information (step 2-3). Then, based on that dependency information, it issues restart commands to the corresponding process group and causes the restart to be executed (step 2-4).
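
The FIG. 2 flow (steps 2-1 to 2-4) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: the `DEPENDENCIES` table, the process names, and the `restart` callback are hypothetical, and process IDs are passed in as plain dictionaries rather than read from an operating system.

```python
from typing import Callable, Dict, List

# Hypothetical dependency table: restarted process -> processes that must also restart.
DEPENDENCIES: Dict[str, List[str]] = {
    "collector": ["correlator", "notifier"],
    "correlator": ["notifier"],
}


def detect_restarts(previous: Dict[str, int], current: Dict[str, int]) -> List[str]:
    """Steps 2-1 and 2-2: a process whose PID differs from the last check was restarted."""
    return [name for name, pid in current.items()
            if name in previous and previous[name] != pid]


def recover(previous: Dict[str, int], current: Dict[str, int],
            restart: Callable[[str], None]) -> List[str]:
    """Steps 2-3 and 2-4: look up dependents and issue one restart command to each."""
    issued: List[str] = []
    for name in detect_restarts(previous, current):
        for dependent in DEPENDENCIES.get(name, []):
            if dependent not in issued:
                restart(dependent)
                issued.append(dependent)
    return issued
```

A real monitor would presumably also need to remember the new process IDs of the processes it restarted itself, so that they are not treated as fresh restarts on the next check; the text does not say how this is handled.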

[0029] FIG. 3 shows an embodiment of the present invention provided with a database holding the upper limit on the number of process restarts per unit time. As in the embodiment shown in FIG. 1, the process system of this embodiment includes a monitoring process 3-1 and a monitoring/failure information database 3-2 in a server; a plurality of monitored processes 3-3 exist in the server, or in clients or other systems connected to the server; each client is provided with a client monitoring process; and one of the clients can act as the system administrator client.

[0030] The monitoring/failure information database 3-2 includes a restart record database 3-21 (see FIG. 21(a)) that records the time of each process restart and the process that restarted; a restart management database 3-22 (see FIG. 21(b)) that records, for each process type, the unit time, the upper limit on the number of process restarts, and the number of process restarts per unit time; and a dependency database 3-23 (see FIG. 21(c)) that holds dependency information identifying the group of processes that must be restarted to recover the process system when a given process has restarted.
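
One possible in-memory shape for the three databases described above is sketched below. FIG. 21 is not reproduced in this text, so every class and field name here is an assumption made for illustration; only the kinds of information stored (restart time and process, unit time, upper limit, measured count, dependencies) come from the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RestartRecord:
    """One row of the restart record database 3-21."""
    process_name: str
    restarted_at: float  # seconds since the epoch (assumed representation)


@dataclass
class RestartPolicy:
    """One row of the restart management database 3-22."""
    unit_time_s: float        # length of the counting window
    max_restarts: int         # upper limit on restarts per unit time
    recent_restarts: int = 0  # most recently measured count


@dataclass
class MonitoringDatabase:
    """Monitoring/failure information database 3-2."""
    records: List[RestartRecord] = field(default_factory=list)
    policies: Dict[str, RestartPolicy] = field(default_factory=dict)
    # Dependency database 3-23: process -> processes to restart with it.
    dependencies: Dict[str, List[str]] = field(default_factory=dict)
```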

[0031] As in the embodiment of FIG. 1, the monitoring process 3-1 has a function 3-11 for checking each monitored process, a function 3-12 for detecting a restart of each monitored process, and a function 3-13 for issuing a restart command to each monitored process. In addition, it has a function 3-14 for issuing commands to record process restarts in the restart record database 3-21 and the restart management database 3-22, and a function 3-15 for referring to the restart record database 3-21, the restart management database 3-22, and the restart dependency database 3-23.

[0032] In the process system shown in FIG. 3, process restart processing based on the number of process restarts follows the processing flow shown in FIG. 4. First, the monitoring process monitors the process ID of each process periodically or irregularly. When it detects a process restart from a change in process ID (step 4-1), it has the time of the restart and the name of the restarted process recorded in the restart record database within the monitoring/failure information database (step 4-2).

[0033] Next, the monitoring process calculates the time one unit time before the process restart time, refers to the restart record database, counts the process restarts in the period from one unit time earlier up to the most recent restart, and records the count in the restart management database (step 4-3).

[0034] Next, the monitoring process refers to the restart management database (step 4-4) and determines whether the number of process restarts per unit time exceeds the upper limit (step 4-5). If it does, the monitoring process refers to the restart dependency database (step 4-6) and sends restart commands to the dependent process group (step 4-7).
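
The counting and threshold check of steps 4-3 to 4-7 can be sketched as below. This is an illustrative sketch only: restart records are modeled as hypothetical `(process_name, timestamp)` tuples, the window is treated as inclusive at both ends (the text does not specify this), and the function returns the processes to restart instead of sending commands.

```python
from typing import Dict, List, Tuple


def restarts_in_window(records: List[Tuple[str, float]], name: str,
                       latest: float, unit_time_s: float) -> int:
    """Step 4-3: count restarts of `name` from one unit time before `latest` up to `latest`."""
    start = latest - unit_time_s
    return sum(1 for n, t in records if n == name and start <= t <= latest)


def dependents_to_restart(records: List[Tuple[str, float]], name: str,
                          latest: float, unit_time_s: float, max_restarts: int,
                          dependencies: Dict[str, List[str]]) -> List[str]:
    """Steps 4-4 to 4-7: return the dependents only when the count exceeds the limit."""
    count = restarts_in_window(records, name, latest, unit_time_s)
    if count > max_restarts:
        return list(dependencies.get(name, []))
    return []
```

With a 60-second unit time and restarts at 50, 70, and 90 seconds, the count at the latest restart is 3, so a limit of 2 triggers recovery while a limit of 3 does not.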

[0035] FIG. 5 shows an embodiment of the present invention provided with a database holding the upper limit on process size. As in the embodiment of FIG. 1 described above, the process system of this embodiment includes a monitoring process 5-1 and a monitoring/failure information database 5-2 in a server; a plurality of monitored processes 5-3 exist in the server, or in clients or other systems connected to the server; each client is provided with a client monitoring process; and one of the clients can act as the system administrator client.

[0036] The monitoring/failure information database 5-2 includes a process size upper limit database 5-21 (see FIG. 22(a)) that holds the upper limit on process size for each process type, and a dependency database 5-22 (see FIG. 22(b)) that holds dependency information identifying the group of processes that must be restarted for recovery when a given process exceeds its size upper limit. Note that process size is the proportion of time for which the process occupies the CPU (the CPU occupancy rate).

[0037] In addition to the functions in the embodiment of FIG. 1 or FIG. 2, the monitoring process 5-1 has a function 5-11 for referring to the process size upper limit database 5-21 and the dependency database 5-22 in the monitoring/failure information database 5-2.

[0038] In the process system shown in FIG. 5, process restart processing based on process size follows the processing flow shown in FIG. 6. First, the monitoring process monitors each process periodically or irregularly to obtain its process size, and refers to the process size upper limit database in the monitoring/failure information database (step 6-1).

[0039] Next, the monitoring process determines whether the process size obtained from the processor exceeds the upper limit (step 6-2). If it does, the monitoring process refers to the restart dependency database (step 6-3) and sends restart commands to the dependent process group (step 6-4).
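
Steps 6-1 to 6-4 amount to a per-process upper-limit check followed by a dependency lookup, which might look like the sketch below. The dictionaries stand in for the process size upper limit database and the dependency database; keying them by process name (the text keys the limits by process type), the example values, and the function names are assumptions made for illustration.

```python
from typing import Dict, List


def over_limit(sizes: Dict[str, float], limits: Dict[str, float]) -> List[str]:
    """Steps 6-1 and 6-2: names whose measured CPU occupancy exceeds the stored upper limit."""
    return [name for name, value in sizes.items()
            if name in limits and value > limits[name]]


def restart_targets(sizes: Dict[str, float], limits: Dict[str, float],
                    dependencies: Dict[str, List[str]]) -> List[str]:
    """Steps 6-3 and 6-4: collect the dependents of every offending process, without duplicates."""
    targets: List[str] = []
    for name in over_limit(sizes, limits):
        for dependent in dependencies.get(name, []):
            if dependent not in targets:
                targets.append(dependent)
    return targets
```

For example, with a limit of 0.50 for a hypothetical `collector` process and a measured occupancy of 0.85, `restart_targets` returns whatever group the dependency table lists for `collector`.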

[0040] FIG. 7 shows an embodiment of the present invention provided with a database holding the upper limit on the memory area used by a process. As in the embodiment of FIG. 1 described above, the process system of this embodiment includes a monitoring process 7-1 and a monitoring/failure information database 7-2 in a server; a plurality of monitored processes 7-3 exist in the server, or in clients or other systems connected to the server; each client is provided with a client monitoring process; and one of the clients can act as the system administrator client.

[0041] The monitoring/failure information database 7-2 includes a process memory upper limit database 7-21 (see FIG. 22(c)) that holds the upper limit on the memory area used by a process for each process type, and a dependency database 7-22 (see FIG. 22(d)) that holds dependency information identifying the group of processes that must be restarted for recovery when the memory area used by a given process exceeds its upper limit.

[0042] In addition to the functions in the embodiments of FIG. 1, FIG. 2, or FIG. 3 described above, the monitoring process 7-1 has a function 7-11 for referring to the process memory upper limit database 7-21 and the restart dependency database 7-22.

[0043] In the process system shown in FIG. 7, restart processing triggered when the memory area used by a process exceeds its upper limit follows the flow shown in FIG. 8. First, the monitoring process monitors each process periodically or irregularly to obtain the size of the memory area it uses, and refers to the process memory upper limit database (step 8-1).

[0044] Next, the monitoring process determines whether the memory area used by the process exceeds the upper limit (step 8-2). If it does, the monitoring process refers to the restart dependency database (step 8-3) and sends restart commands to the dependent process group (step 8-4).
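
Because the summary defines the memory upper limit as a multiple of the value at process start, the comparison in step 8-2 can be expressed as a growth factor. The sketch below assumes the start value and the current value are both available in bytes; the function names and the `max_factor` parameter are hypothetical.

```python
def memory_growth(start_bytes: int, current_bytes: int) -> float:
    """Growth factor of a process's memory use relative to its value at start."""
    if start_bytes <= 0:
        raise ValueError("start_bytes must be positive")
    return current_bytes / start_bytes


def exceeds_memory_limit(start_bytes: int, current_bytes: int, max_factor: float) -> bool:
    """Step 8-2: True when memory use has grown beyond max_factor times the start value."""
    return memory_growth(start_bytes, current_bytes) > max_factor
```

Expressing the limit as a ratio means one stored value can serve processes with very different baseline footprints, which is one plausible reason for defining it relative to the start value.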

[0045] FIG. 9 shows an embodiment of the present invention provided with a database holding the lower limit on the memory work area (swap area). As in the embodiment of FIG. 1 described above, the process system of this embodiment includes a monitoring process 9-1 and a monitoring/failure information database 9-2 in a server; a plurality of monitored processes 9-3 exist in the server, or in clients or other systems connected to the server; each client is provided with a client monitoring process; and one of the clients can act as the system administrator client.

[0046] The monitoring/failure information database 9-2 includes a memory work area (swap area) lower limit database 9-21 (see FIG. 22(e)) that holds the lower limit on the memory work area (swap area), and a dependency database 9-22 (see FIG. 22(f)) that holds dependency information identifying the group of processes that must be restarted for recovery when the memory work area (swap area) falls below the lower limit.

[0047] In addition to the functions in the embodiments shown in FIGS. 1 to 8 described above, the monitoring process 9-1 has a function 9-11 for referring to the memory work area (swap area) lower limit database 9-21 and the restart dependency database 9-22.

[0048] In the process system shown in FIG. 9, restart processing triggered when the memory work area (swap area) falls below its lower limit follows the flow shown in FIG. 10. First, the monitoring process monitors each process periodically or irregularly to obtain the size of the swap area, and refers to the memory work area (swap area) lower limit database 9-21 (step 10-1).

[0049] Next, the monitoring process determines whether the memory work area (swap area) is below the lower limit (step 10-2). If it is, the monitoring process refers to the restart dependency database (step 10-3) and sends restart commands to the dependent process group (step 10-4).

[0050] FIG. 11 shows an embodiment of the present invention that performs both periodic monitoring and irregular monitoring by a system administrator. As in the embodiment of FIG. 1 described above, the process system of this embodiment includes a monitoring process 11-1 and a monitoring/failure information database 11-2 in a server; a plurality of monitored processes 11-3 exist in the server, or in clients or other systems connected to the server; each client is provided with a client monitoring process 11-4; and one of the clients can act as the system administrator client.

[0051] In this embodiment, monitoring is executed in one of the following two ways.
a. Irregular monitoring started by a user with system administrator authority
b. Periodic monitoring led by the monitoring process

[0052] The irregular monitoring of (a) above is carried out as follows. Using the "system administrator client (Administrator Client)" logged in with system administrator authority, the system administrator sends a monitoring request (A) at irregular times (whenever monitoring is needed). Alternatively, the administrator logs in to the "client monitoring process", a communication process resident on a client, with a system administrator account and sends the monitoring request (A).

[0053] The monitoring request (A) is passed to the monitoring process 11-1 in the server via the client monitoring process 11-4. In accordance with the monitoring request (A), the monitoring process 11-1 performs the abnormality detection and recovery processing for each process described above.

[0054] The periodic monitoring of (b) above, led by the monitoring process, is an autonomous operation of the monitoring process 11-1. At predetermined time intervals set by a timer or the like, it performs a periodic check (B) on the monitored processes 11-3 and carries out the abnormality detection and recovery processing for each process described above.

[0055] With this configuration, the operating state of each process is monitored periodically by the timer of the monitoring process, and in addition a user of the process system can start the monitoring process and run monitoring whenever a failure check is wanted.
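The two triggers described above, the periodic check (B) and the administrator's monitoring request (A), can share one check routine. The sketch below is a minimal illustration, assuming a hypothetical `Monitor` class, an injectable clock, and a boolean `is_admin` flag in place of the login step; none of these names come from the specification.

```python
import time


class Monitor:
    """Runs the same check either on a timer (B) or on an admin request (A)."""

    def __init__(self, check, interval_s, clock=time.monotonic):
        self.check = check              # callable doing detection and recovery
        self.interval_s = interval_s    # assumed timer setting, in seconds
        self.clock = clock
        self.next_due = clock() + interval_s
        self.log = []                   # which trigger caused each run

    def tick(self):
        """Periodic check (B): run only once the interval has elapsed."""
        now = self.clock()
        if now < self.next_due:
            return False
        self.next_due = now + self.interval_s
        self._run("periodic")
        return True

    def handle_request(self, is_admin):
        """Monitoring request (A): run at once, for administrators only."""
        if not is_admin:
            return False
        self._run("on-demand")
        return True

    def _run(self, trigger):
        self.log.append(trigger)
        self.check()
```

Injecting the clock makes the timer behaviour easy to exercise without waiting; a deployment would simply call `tick()` from a scheduler loop.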

[0056] Next, an embodiment of the present invention that detects the root cause of a failure will be described. In a process system built on communication means that use distributed processing middleware (CORBA or the like), failure information sent from the monitored processes travels through a communication protocol stack spanning multiple layers (physical layer, TCP/IP layer, middleware layer/application layer).

[0057] Consequently, when some failure occurs in the communication means, some of the communication failures observed in upper protocol layers may actually be caused by a communication failure in a lower protocol layer. In such cases, the monitoring process classifies the failure information received from the monitored processes by protocol layer, which makes it possible to clearly identify the root cause of the failure in the process system.

[0058] That is, for the many alarms raised at the monitoring process, failures are detected per protocol layer, and only the cause information of the lower layer that is the root cause of the failure is reported, as an alarm or event, to the devices executing the processes (for example, clients), so that alarm information that is easy for the monitoring operator to understand is displayed.

[0059] FIG. 12 shows an embodiment of the present invention that detects the root cause of a failure. As in the embodiment of FIG. 1 described above, the process system of this embodiment includes a monitoring process 12-1 and a monitoring/failure information database 12-2 in a server, and a plurality of monitored processes 12-3 exist in the server or in a client or other system connected to the server. Each client is provided with a client monitoring process 12-4, and one of the clients is a system that can serve as the system administrator client.

[0060] FIG. 13 shows the flow of monitoring execution in the embodiment of FIG. 12. In this embodiment as well, monitoring is executed in one of the following two ways.
a. Irregular monitoring triggered by a user with system administrator authority
b. Periodic monitoring led by the monitoring process

[0061] In the irregular monitoring of (a) above, the system administrator uses the "system administrator client (Administrator Client)", logged in with system administrator authority, to send a monitoring request (A) at irregular times (step 13-1). The monitoring request (A) is passed to the monitoring process 12-1 in the server via the client monitoring process 12-4, which is a process resident on the client (step 13-2).

[0062] Thereafter, as in the periodic monitoring (B) of (b) above led by the monitoring process, the monitoring process 12-1 performs abnormality detection and recovery processing for each process (step 13-3). When the monitoring process 12-1 receives the monitoring request (A) from the client monitoring process 12-4, or when the time for periodic monitoring (B) arrives, it determines the root cause of the failure by the procedure detailed below (step 13-4).

[0063] The procedure for determining the root cause of a failure is as follows. First, the monitoring process 12-1 accesses the monitoring/failure information database 12-2 and obtains, as the information needed for monitoring, the monitoring methods (commands) and the IP addresses of the monitored processes. The monitoring methods (commands) are obtained from the command registration database 12-21 (see FIG. 23(a)), and the monitored-process IP addresses are obtained from the monitored process name/IP address correspondence database 12-22 (see FIG. 23(b)). The obtained data has the form "monitored process name, monitored process IP address".

[0064] Next, the monitoring process 12-1 starts monitoring based on the obtained data (monitored process name, monitored process IP address) and issues every command registered in the command registration database 12-21 to each monitored-process IP address.

[0065] After receiving the responses to all of the commands, the monitoring process 12-1 uses them to determine whether a failure has occurred for the monitored process and in which layer it occurred, then sorts the failure information and determines the root cause.
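The probing step, issuing every registered command to a monitored IP address and collecting the responses by layer, might look like the sketch below. The command records, the `"low"`/`"high"` layer labels, and the `run` callback are assumptions made for illustration; the specification does not define the contents of the command registration database beyond FIG. 23(a).

```python
def probe(ip, commands, run):
    """Issue every registered command to one IP and summarise by layer.

    commands : list of dicts with hypothetical keys "name" and "layer"
               (stands in for command registration database 12-21)
    run      : callable(name, ip) -> bool, True when a normal response arrived
    """
    outcomes = {}
    for cmd in commands:
        ok = run(cmd["name"], ip)
        outcomes.setdefault(cmd["layer"], []).append(ok)
    # A layer counts as healthy only if all of its commands succeeded.
    return {layer: all(oks) for layer, oks in outcomes.items()}
```

The per-layer summary is what a later classification step would consume; waiting for all commands before summarising mirrors the description that the monitoring process evaluates the results only after every response has been received.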

[0066] FIG. 14 shows the root-cause determination logic. First, it is determined whether a normal response was obtained (that is, whether transmission succeeded) for the commands of the lower-level protocols such as the physical layer and the TCP/IP layer (step 14-1). If so, it is then determined whether a normal response was obtained (whether transmission succeeded) for the commands of the higher-level protocols such as the CORBA layer and the application layer (step 14-2).

[0067] Next, if transmission did not succeed in the checks above, it is determined whether the monitored-process IP address belongs to a client (step 14-3). These determinations yield the determination results 1 to 4 shown in FIG. 14.

[0068] The determination results 1 to 4 are as follows.
- Result 1: Normal (both the lower-level and higher-level protocols are normal)
- Result 2: Normal (because the monitored process is on a client machine)
- Result 3: Abnormality in the physical layer or TCP/IP layer
- Result 4: Abnormality in the CORBA layer or application layer
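The determination logic of steps 14-1 to 14-3 can be written as a small decision function. The text does not state which failed check leads to the client test of step 14-3, so this sketch assumes that the client test is applied after either failure; that mapping, and the function name, are assumptions rather than something the specification confirms.

```python
def classify(low_ok, high_ok, is_client):
    """Map the probe outcome to determination results 1 to 4.

    low_ok    : lower-level commands (physical, TCP/IP) answered normally
    high_ok   : higher-level commands (CORBA, application) answered normally
    is_client : the monitored IP address belongs to a client machine

    Assumption: the client test (step 14-3) follows any failed check.
    """
    if low_ok and high_ok:       # steps 14-1 and 14-2 both succeed
        return 1                 # normal
    if is_client:                # step 14-3
        return 2                 # treated as normal for a client machine
    if not low_ok:
        return 3                 # physical / TCP/IP layer abnormality
    return 4                     # CORBA / application layer abnormality
```

Checking the lower layer first is what lets an upper-layer symptom be attributed to its lower-layer cause: a target whose lower-level commands fail is reported as result 3 regardless of the higher-level outcome.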

[0069] When result 1 is obtained, the communication means is judged to be free of abnormality, so the application is checked. When result 2, 3, or 4 is obtained, the faulty machine is identified by its IP address. For results 3 and 4, the failure information and the recovery method, as shown in FIG. 23(c), are reported to the process-executing devices such as clients. Because the failures are sorted by layer, the monitoring operator can easily determine which protocol layer is the root cause of the failure.

[0070] FIG. 15 shows an embodiment of failure information notification according to the present invention, specifically one in which failure information is reported to clients when the root cause of the failure lies in an upper layer. In this embodiment, the server includes a monitoring process 15-1, a monitoring/failure information database 15-2, and a failure notification process 15-5, and the plurality of processes monitored by the monitoring process 15-1 are distributed in the server or in clients or other systems connected to the server.

[0071] Each client is provided with a "client monitoring process" 15-4, a resident communication process capable of communicating with the monitoring process 15-1, and one of the clients is a system that can serve as the "system administrator client (Administrator Client)" logged in with system administrator authority.

[0072] After the root-cause determination means described above has classified the root cause of a failure, if the root cause lies in the uppermost layer (middleware layer/application layer), failure information is reported by the following procedure, whose flow is shown in FIG. 16.

[0073] First, the monitoring process accesses the database and, if a notification message and handling information for recovery are held for the detected failure, obtains them (step 16-1). The obtained information has the form "failure result, monitored process name, monitored process IP address, handling information (recovery method)".

[0074] Next, the monitoring process passes the information obtained above to the failure notification process (step 16-2). The data passed has the form "failure result, monitored process name, monitored process IP address, handling information (recovery method)".

[0075] Next, the failure notification process sends a failure notification to each client according to the notification data received from the monitoring process (step 16-3). The failure notification process then determines whether the destination client is in automatic recovery mode (step 16-4) and, if so, sends an automatic recovery request to the monitoring process (step 16-5).

[0076] On receiving the automatic recovery request from the failure notification process, the monitoring process sends a restart command to the group of processes that have a dependency relationship, including the monitored process, as described above, and thereby performs recovery automatically (step 16-6).
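The notification path of steps 16-3 to 16-5 can be sketched as follows. The record mirrors the four-field data form given in the text; the field names, the `auto_recovery` flag on each client entry, and the two callbacks are hypothetical. The sketch also assumes that a single recovery request is sent even when several clients are in automatic recovery mode, which the specification does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultRecord:
    # "failure result, monitored process name, monitored process IP address,
    #  handling information (recovery method)"
    result: str
    process_name: str
    process_ip: str
    remedy: str


def notify_and_recover(record, clients, send_notice, request_recovery):
    """Notify every client, then request recovery if any client allows it."""
    auto = False
    for client in clients:
        send_notice(client["name"], record)        # step 16-3
        if client.get("auto_recovery", False):     # step 16-4
            auto = True
    if auto:
        request_recovery(record.process_name)      # step 16-5 (sent once)
    return auto
```

The restart of the dependent group itself (step 16-6) stays with the monitoring process, which is why this function only forwards a request instead of restarting anything directly.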

[0077] FIG. 17 shows an embodiment of failure notification according to the present invention for the case where the root cause lies in a lower layer. In this embodiment, the server includes a monitoring process 17-1, a monitoring/failure information database 17-2, and a failure notification process 17-5, and the plurality of processes monitored by the monitoring process 17-1 are distributed in the server or in clients or other systems connected to the server.

[0078] Each client is provided with a "client monitoring process" 17-4, a resident communication process capable of communicating with the monitoring process 17-1, and one of the clients is a system that can serve as the "system administrator client (Administrator Client)" logged in with system administrator authority.

[0079] After the root-cause determination means described above has classified the root cause of a failure, if the root cause lies in a lower layer (physical layer, TCP/IP layer), the operating system, or the like, failure information is reported by the following procedure, whose flow is shown in FIG. 18.

[0080] First, the monitoring process accesses the database and, if a notification message and handling information for recovery are held for the detected failure, obtains them (step 18-1). The obtained information has the form "failure result, monitored process name, monitored process IP address, handling information (recovery method)".

[0081] Next, the monitoring process passes the information obtained above to the failure notification process (step 18-2). The data passed has the form "failure result, monitored process name, monitored process IP address, handling information (recovery method)". The failure notification process then sends a failure notification to each client based on the notification data received from the monitoring process (step 18-3).

[0082] FIG. 19 shows an embodiment of the present invention in which a client issues a handling request for a failure. In this embodiment, the server includes a monitoring process 19-1 and a monitoring/failure information database 19-2, and the plurality of processes 19-3 monitored by the monitoring process 19-1 are distributed in the server or in clients or other systems connected to the server.

[0083] Each client is provided with a "client monitoring process" 19-4, a resident communication process capable of communicating with the monitoring process 19-1, and one of the clients is a system that can serve as the "system administrator client (Administrator Client)" logged in with system administrator authority.

[0084] A handling request for a failure can be issued from a client in the following two ways.
a. Log in to the client monitoring process 19-4 with a system administrator account and send a handling request for the root cause of the failure (A1).
b. Send a handling request for the root cause of the failure on a trigger from a user with system administrator authority (A2).

[0085] In the embodiments described above, after the root cause of the failure has been reported to the process-executing devices such as clients, when the monitoring process 19-1 receives a handling request by (a) or (b) above and the root cause lies in the uppermost layer (middleware layer/application layer or the like), it refers to the process dependency data (B) held in the database 19-2 as described above, sends a restart command to the group of processes that have a dependency relationship, including the monitored process, and thereby handles the failure (C).

[0086] FIG. 20 shows the processing flow of the present invention when a handling request is sent from a client. When processing starts on a trigger from another client (step 20-1), the user logs in to the client monitoring process with a system administrator account (step 20-2), and the client monitoring process sends a handling request to the monitoring process (step 20-3).

[0087] When processing starts on a trigger from a user with system administrator authority (step 20-4), that user sends the handling request (step 20-5). In response to the handling request sent in step 20-3 or step 20-5, the monitoring process refers to the process dependency data held in the database, sends a restart command to the group of processes that have a dependency relationship, including the monitored process, and handles the failure (step 20-6).
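Step 20-6 depends on finding the group of processes to restart from the dependency data. The sketch below assumes that the data is stored as a mapping from each process to the processes that depend on it, and that indirect dependents are restarted as well; both the storage format and the transitive expansion are assumptions, since the specification only says that dependency relationships of various patterns are held in the database.

```python
from collections import deque


def restart_group(failed, depends_on_me):
    """Return the failed process plus its direct and indirect dependents."""
    group = [failed]
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in depends_on_me.get(current, []):
            if dependent not in group:      # also guards against cycles
                group.append(dependent)
                queue.append(dependent)
    return group


def handle_request(is_admin, failed, depends_on_me, send_restart):
    """Serve a handling request: administrators only, then restart the group."""
    if not is_admin:
        return []
    group = restart_group(failed, depends_on_me)
    for name in group:
        send_restart(name)
    return group
```

The breadth-first order restarts the failed process first and then its dependents level by level, which is one reasonable ordering when dependents need the failed process to be available again before they reconnect.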

[0088] (Supplementary Note 1) A process abnormality detection and recovery system for a process system composed of a plurality of processes, comprising: a monitoring process that monitors the operating state of each process distributed on its own device or on other devices; and a database that holds dependency relationships concerning restart processing among the processes, wherein, upon detecting a failure in any of the monitored processes, the monitoring process refers to the database and causes the group of processes having a dependency relationship with that process to restart.

(Supplementary Note 2) The process abnormality detection and recovery system according to Supplementary Note 1, wherein the database holds, for each process type, an upper limit on the number of process restarts per unit time, and, when the number of restarts of a monitored process exceeds the upper limit, the monitoring process causes the group of processes having a dependency relationship with that monitored process to restart.

(Supplementary Note 3) The process abnormality detection and recovery system according to Supplementary Note 1, wherein the database holds, for each process type, an upper limit on process size, and, when the process size of a monitored process exceeds the upper limit, the monitoring process causes the group of processes having a dependency relationship with that process to restart.

(Supplementary Note 4) The process abnormality detection and recovery system according to Supplementary Note 1, wherein the database holds, for each process type, an upper limit on the memory area used by a process, and, when the memory area used by a monitored process exceeds the upper limit, the monitoring process causes the group of processes having a dependency relationship with that process to restart.

(Supplementary Note 5) The process abnormality detection and recovery system according to Supplementary Note 1, wherein the database holds a lower limit on the working memory area for each device that executes a process, and, when the working memory area of the device executing a monitored process falls below the lower limit, the monitoring process causes the group of processes having a dependency relationship with that process to restart.

(Supplementary Note 6) The process abnormality detection and recovery system according to any one of Supplementary Notes 1 to 5, wherein the monitoring process monitors the operating state of each process periodically, and also monitors the operating state of each process when a monitoring request is sent to the monitoring process through a communication process following a login by a system administrator.

(Supplementary Note 7) The process abnormality detection and recovery system according to any one of Supplementary Notes 1 to 6, wherein the group of processes monitored by the monitoring process has communication means via distributed processing middleware and is distributed over a plurality of processors or operating systems; the monitoring process has means for detecting failures that span the plurality of protocol stacks of the communication means; the database holds failure classification information and failure-cause isolation information; and the monitoring process identifies failures across the protocol layers, has means for extracting, based on the information held in the database and from the many pieces of alarm information raised when a failure occurs, the alarm information of the protocol layer that is the root cause of the failure, and performs recovery processing and failure notification based on that root-cause alarm information.

(Supplementary Note 8) The process abnormality detection and recovery system according to Supplementary Note 7, wherein the database holds handling information corresponding to failure causes; after the monitoring process obtains the handling information for the cause of a failure, a failure notification process notifies the devices executing the processes of the failure classification information, the failure cause, and its handling information; when the root cause of the failure lies in a process such as an application or middleware in an upper protocol layer, the group of processes having a dependency relationship with that process is caused to restart; and when the root cause of the failure lies in a lower protocol layer or the operating system, failure information including the handling information is displayed on the device executing the process.

(Supplementary Note 9) The process abnormality detection and recovery system according to Supplementary Note 8, wherein, when a restart request is sent to the monitoring process through a communication process following a login with a system administrator account, the monitoring process causes the group of processes having a dependency relationship with that process to restart.
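The threshold in Supplementary Note 2, an upper limit on restarts per unit time, can be tracked with a sliding window. This is a minimal sketch, not the method of the specification: the `RestartCounter` class, the use of seconds, and the sliding-window interpretation of "per unit time" are all assumptions.

```python
from collections import deque


class RestartCounter:
    """Counts restarts of one process inside a sliding time window."""

    def __init__(self, limit, window_s):
        self.limit = limit          # assumed upper limit from the database
        self.window_s = window_s    # assumed length of the unit time, seconds
        self.times = deque()

    def record(self, now):
        """Record one restart at time `now`; True if the limit is exceeded."""
        self.times.append(now)
        # Forget restarts that have fallen out of the window.
        while self.times and now - self.times[0] >= self.window_s:
            self.times.popleft()
        return len(self.times) > self.limit
```

When `record` returns true, the monitoring process would escalate from restarting the single process to restarting its whole dependent group, as Supplementary Note 2 describes.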

【００８９】[0089]

[Effects of the Invention] As described above, according to the present invention, a monitoring process refers to a database holding the dependency relationships among processes and causes the group of processes that have a dependency relationship with a monitored process to restart, so that complex recovery processing that would otherwise be performed manually can be carried out automatically and quickly.

[0090] Furthermore, by identifying failures across the plurality of protocol layers used in inter-process communication, only the information on the root cause of a failure is reported to the monitoring operator, and recovery processing is performed based on that root-cause information, so the system recovery time can be shortened.

[Brief description of the drawings]

【図１】プロセス間の再開に関する依存関係を保持する
データベースを備えた本発明の実施形態を示す図であ
る。FIG. 1 is a diagram illustrating an embodiment of the present invention including a database that holds a dependency relationship regarding restart between processes.

【図２】本発明による依存関係を有するプロセス群の再
開処理のフロー図である。FIG. 2 is a flowchart of a restart process of a process group having a dependency according to the present invention.

【図３】単位時間当たりのプロセス再開回数の上限値を
保持するデータベースを備えた本発明の実施形態を示す
図である。FIG. 3 is a diagram showing an embodiment of the present invention including a database holding an upper limit value of the number of process restarts per unit time.

【図４】本発明によるプロセス再開回数の上限値を越え
た再開処理のフロー図である。FIG. 4 is a flowchart of a restart process in which the number of process restarts exceeds an upper limit value according to the present invention.

【図５】プロセスサイズの上限値を保持するデータベー
スを備えた本発明の実施形態を示す図である。FIG. 5 is a diagram showing an embodiment of the present invention including a database holding an upper limit value of a process size.

【図６】本発明によるプロセスサイズの上限値を越えた
再開処理のフロー図である。FIG. 6 is a flowchart of a restart process that exceeds the upper limit of the process size according to the present invention.

【図７】プロセス使用メモリ領域の上限値を保持するデ
ータベースを備えた本発明の実施形態を示す図である。FIG. 7 is a diagram showing an embodiment of the present invention including a database holding an upper limit value of a process use memory area.

【図８】本発明によるプロセス使用メモリ領域の上限値
を越えた再開処理のフロー図である。FIG. 8 is a flowchart of a restart process in which a process used memory area exceeds an upper limit value according to the present invention.

【図９】メモリ作業領域（スワップ領域）の下限値を保
持するデータベースを備えた本発明の実施形態を示す図
である。FIG. 9 is a diagram showing an embodiment of the present invention including a database holding a lower limit value of a memory work area (swap area).

【図１０】本発明によるメモリ作業領域（スワップ領
域）の下限値を下回る再開処理のフロー図である。FIG. 10 is a flow chart of a resuming process in which the memory work area (swap area) falls below a lower limit according to the present invention.

【図１１】定期監視とシステム管理者による不定期監視
を行う本発明の実施形態を示す図である。FIG. 11 is a diagram showing an embodiment of the present invention that performs regular monitoring and irregular monitoring by a system administrator.

【図１２】障害の根底原因を検出する本発明の実施形態
を示す図である。FIG. 12 illustrates an embodiment of the present invention for detecting the underlying cause of a failure.

【図１３】本発明の根底原因を検出する監視実行のフロ
ーを示す図である。FIG. 13 is a diagram showing a flow of monitoring execution for detecting an underlying cause of the present invention.

【図１４】本発明の根底原因の判別ロジックを示す図で
ある。FIG. 14 is a diagram showing an underlying cause determination logic of the present invention.

【図１５】本発明による障害情報通知及び復旧処理の実
施形態を示す図である。FIG. 15 is a diagram showing an embodiment of a failure information notification and recovery process according to the present invention.

【図１６】本発明による障害情報通知及び復旧処理のフ
ローを示す図である。FIG. 16 is a diagram showing a flow of failure information notification and recovery processing according to the present invention.

【図１７】根底原因が下位層の場合の本発明による障害
通知の実施形態を示す図である。FIG. 17 is a diagram showing an embodiment of a failure notification according to the present invention when the underlying cause is a lower layer.

【図１８】根底原因が下位層の場合の本発明による障害
情報通知のフローを示す図である。FIG. 18 is a diagram showing a flow of failure information notification according to the present invention when the underlying cause is a lower layer.

FIG. 19 is a diagram showing an embodiment of the present invention in which a client requests that a failure be handled.

FIG. 20 is a processing flowchart of the present invention for the case where a handling request is sent from a client.

FIG. 21 is a diagram showing an example of a database that holds the upper limit on the number of process restarts per unit time.

FIG. 22 is a diagram showing an example of a database that holds the upper limits on process size, used memory area, and the like.

FIG. 23 is a diagram showing an example of a database used for detecting the root cause of a failure and holding the corresponding notification messages.

[Explanation of symbols]

1-1 monitoring process
1-2 monitoring / failure information database
1-3 monitoring target process
1-4 client monitoring process

Continued from the front page. F-terms (reference): 5B027 AA04 BB01 CC04; 5B045 DD18 GG06 JJ02 JJ14 JJ16 JJ42; 5B098 AA10 GA04 JJ02 JJ08

Claims

[Claims]

1. A process abnormality detection and recovery system for a process system composed of a plurality of processes, comprising: a monitoring process that monitors the operating state of each process distributed on its own device or on another device; and a database that holds the dependency relationships concerning restart processing among the processes, wherein the monitoring process, upon detecting a failure in any of the monitored processes, refers to the database and causes the plurality of processes having a dependency relationship with that process to execute a process restart.
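The database lookup recited in claim 1 can be illustrated with a minimal Python sketch. The process names, the dictionary layout of the dependency table, and the choice to follow dependencies transitively are all assumptions made for illustration; the claim itself only requires that processes having a dependency relationship with the failed process be restarted.

```python
from collections import deque

# Hypothetical dependency table (stands in for the database of claim 1):
# process name -> processes that must be restarted together with it.
RESTART_DEPENDENCIES = {
    "order_server": ["order_client", "billing"],
    "billing": ["report"],
    "order_client": [],
    "report": [],
}


def restart_group(failed, dependencies):
    """Return the failed process plus every process reachable from it
    through the dependency table, breadth-first and without duplicates.
    Following dependencies transitively is an assumption of this sketch."""
    group, queue, seen = [], deque([failed]), {failed}
    while queue:
        name = queue.popleft()
        group.append(name)
        for dep in dependencies.get(name, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return group


def on_failure(failed, dependencies, restart):
    """Look up the restart group for a failed process and call the
    caller-supplied restart() function on each member."""
    group = restart_group(failed, dependencies)
    for name in group:
        restart(name)
    return group
```

In use, the monitoring process would call `on_failure` with whatever restart mechanism the platform provides; passing `restart` as a parameter keeps the lookup logic independent of that mechanism.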

2. The process abnormality detection and recovery system according to claim 1, wherein the database holds, for each process type, an upper limit on the number of process restarts per unit time, and the monitoring process, when the number of restarts of a monitored process exceeds the upper limit, causes the plurality of processes having a dependency relationship with the monitored process to execute a process restart.
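The restart-count check of claim 2 can be sketched as a sliding-window counter. This is only one possible reading: the class name `RestartCounter`, the use of a sliding window rather than fixed intervals, and the example limit values are assumptions, not details taken from the specification.

```python
from collections import defaultdict, deque


class RestartCounter:
    """Counts restarts per process type inside a sliding time window."""

    def __init__(self, limits, window_seconds):
        self.limits = limits            # process type -> max restarts per window
        self.window = window_seconds    # length of the "unit time" in seconds
        self.history = defaultdict(deque)

    def record_restart(self, process_type, now):
        """Record a restart at time `now` (seconds). Return True when the
        number of restarts inside the window exceeds the upper limit,
        i.e. when the dependent process group should also be restarted."""
        events = self.history[process_type]
        events.append(now)
        # Drop restarts that have fallen out of the window.
        while events and now - events[0] >= self.window:
            events.popleft()
        return len(events) > self.limits[process_type]
```

For example, with a hypothetical limit of two restarts per 60 seconds for a `"worker"` type, the third restart inside the same minute makes `record_restart` return `True`, which is the point at which the group restart of claim 2 would be triggered.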

3. The process abnormality detection and recovery system according to claim 1, wherein the database holds, for each process type, an upper limit on process size, and the monitoring process, when the process size of a monitored process exceeds the upper limit, causes the plurality of processes having a dependency relationship with that process to execute a process restart.

4. The process abnormality detection and recovery system according to claim 1, wherein the database holds, for each process type, an upper limit on the memory area used by a process, and the monitoring process, when the memory area used by a monitored process exceeds the upper limit, causes the plurality of processes having a dependency relationship with that process to execute a process restart.

5. The process abnormality detection and recovery system according to claim 1, wherein the database holds, for each device that executes the processes, a lower limit on the work memory area, and the monitoring process, when the work memory area of the device executing a monitored process falls below the lower limit, causes the plurality of processes having a dependency relationship with that process to execute a process restart.
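The threshold checks of claims 3 to 5 share one shape: compare a measured value with a limit held in the database and trigger the group restart when the limit is crossed. A minimal sketch follows, assuming hypothetical field names (`process_size`, `memory_used`, `work_area_free`) and plain dictionaries in place of the database.

```python
def limit_violations(sample, type_limits, device_work_area_min):
    """Return the list of limits that a monitored process violates.

    sample: measured values for one process (hypothetical field names).
    type_limits: upper limits for the process type (claims 3 and 4).
    device_work_area_min: lower limit for the free work memory area of
        the device running the process (claim 5).
    """
    violations = []
    if sample["process_size"] > type_limits["max_process_size"]:
        violations.append("process_size")
    if sample["memory_used"] > type_limits["max_memory_used"]:
        violations.append("memory_used")
    if sample["work_area_free"] < device_work_area_min:
        violations.append("work_area_free")
    return violations


def should_restart_group(sample, type_limits, device_work_area_min):
    """True when any limit is crossed, i.e. when the dependent process
    group would be restarted under claims 3 to 5."""
    return bool(limit_violations(sample, type_limits, device_work_area_min))
```

Returning the list of violated limits, rather than only a boolean, is a design choice of this sketch so that a caller could log which threshold caused the restart.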