JP2001056772A

JP2001056772A - Fault monitoring system

Info

Publication number: JP2001056772A
Application number: JP23197399A
Authority: JP
Inventors: Masato Yasuda; 真人保田
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 1999-08-18
Filing date: 1999-08-18
Publication date: 2001-02-27

Abstract

PROBLEM TO BE SOLVED: To suppress the occurrence of faults in a customer-operated unattended terminal system that operates 24 hours a day, by comparing acquired monitoring information with preset monitoring information, judging that an abnormality exists when the two differ, and restarting the process. SOLUTION: The memory usage, the number of handles in use, and the CPU utilization of a monitored process are acquired (S1). The acquired values are compared with the values acquired up to the previous time to judge whether resources are being consumed abnormally (S2). When the abnormality is a hang or a loop, it is judged that service can no longer be provided to the customer, and restart processing is started immediately (S5). When the abnormality is a resource leak, it is judged that service to the customer can continue, and any service in progress is allowed to finish (S6). All processes under management are then terminated (S7), and Windows NT is rebooted to recover from the abnormality (S8).

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical field of the invention] The present invention relates to a fault monitoring system. It is particularly useful for operating systems that must run continuously 24 hours a day, such as customer-operated unattended terminal systems in which customers directly operate a computer system in order to use a service.

[0002]

[Prior art] Conventional computer systems have been able to detect failures that immediately stop service provision and to restart the system. However, they could not detect in advance abnormalities, such as memory leaks caused by application bugs, that do not cause an immediate failure but will eventually lead to one if operation continues.

[0003] For this reason, such problems were either handled after a failure had actually occurred, or an operations administrator monitored memory usage and similar indicators and intervened manually.

[0004]

[Problems to be solved by the invention] However, as operating hours have lengthened and the number of systems providing continuous 24-hour service has grown, manned response has reached its limits, and it has become necessary to predict failures and perform preventive maintenance in order to improve availability. In addition, current terminal systems are increasingly built on a Web basis, in which case abnormalities specific to the Web browser must be detected and handled. The conventional methods described above have the following problems.

[0005] (1) An abnormality that does not cause an immediate failure, but will cause one if operation continues, cannot be dealt with until the failure occurs.
(2) Web browser abnormalities include not only ordinary resource problems, hangs, and loops but also service stoppage caused by an error dialog displayed by the Web browser, and that state cannot be detected as an abnormality.

[0006] (3) Among these error dialogs, some should trigger a restart as errors and others should not be treated as errors, but the two cannot be told apart.
(4) When a restart is performed, restarting everything from the OS (operating system) as before lengthens the service downtime.
(5) A restart occurs unconditionally regardless of business hours, even during periods when customers are likely to be using the terminal, so customers are kept waiting to use it.

[0007] Because of the problems listed above, the conventional approach was not technically satisfactory.

[0008]

[Means for solving the problems] The present invention provides a fault monitoring system characterized in that monitoring information for the normal state of a process to be monitored is set in advance, monitoring information of the process is acquired, the acquired monitoring information is compared with the set monitoring information to determine whether an abnormality exists, an abnormality is judged to exist when the two differ, and the process is restarted. This makes it possible to detect, based on the monitoring information, abnormalities such as resource leaks, hangs, and loops in the monitored process that do not cause an immediate failure but would cause one if operation continued, and to restart automatically when an abnormality is detected so that the failure is prevented in advance. As a result, failures can be suppressed in customer-operated unattended terminal systems that run 24 hours a day.
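The comparison of acquired monitoring information against preset normal-state information can be illustrated with a minimal Python sketch. The specification does not say how "differ" is evaluated, so this sketch assumes that the normal state is stored as an allowed range per metric; the metric names, the range values, and the function name `find_deviations` are all hypothetical.

```python
# Hypothetical normal-state monitoring information: allowed (low, high) range per metric.
NORMAL_STATE = {
    "memory_kb": (0, 50_000),
    "handles": (0, 400),
    "cpu_percent": (0.0, 80.0),
}


def find_deviations(normal_state, observed):
    """Return the names of metrics whose observed value lies outside the preset range."""
    return sorted(
        name
        for name, (low, high) in normal_state.items()
        if not (low <= observed[name] <= high)
    )


def is_abnormal(normal_state, observed):
    """An abnormality is judged to exist when any metric differs from the normal state."""
    return bool(find_deviations(normal_state, observed))


if __name__ == "__main__":
    sample = {"memory_kb": 42_000, "handles": 512, "cpu_percent": 12.5}
    print(find_deviations(NORMAL_STATE, sample))
    print(is_abnormal(NORMAL_STATE, sample))
```

A range per metric is only one possible reading of "normal-state monitoring information"; a real deployment would have to derive the ranges from observation of the specific processes being monitored.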

[0009] A Web browser that displays an error dialog when an error exists may be made a monitoring target; if an error dialog is displayed when the monitoring information is acquired, an abnormality is judged to exist and the Web browser is restarted. This makes it possible to detect an application stoppage caused by the Web browser itself detecting a fault and displaying an error dialog, and to recover automatically by restarting.

[0010] A list of abnormalities that require a restart may also be set in advance; a restart is performed if the recognized abnormality is on the list and is not performed if it is not. This makes it possible to distinguish automatically between abnormalities that require a restart and those that do not, and to recover automatically by restarting when an abnormality that requires a restart is detected.

[0011] Furthermore, a list of abnormalities that require an operating system restart may be set in advance; if the recognized abnormality is on the list, the operating system is restarted, and if it is not, only processes are restarted. As a result, for an abnormality that can be recovered by restarting processes alone, only the related processes are restarted and processes that do not need it are left running, which shortens the time required for the restart.

[0012] Furthermore, time periods in which restart is permitted and time periods in which it is not may be set in advance; the time period in which an abnormality is recognized is determined, and the restart is performed only in a permitted period. As a result, for a fault that does not require an immediate restart, the restart can be performed according to how the terminal is being used. If the current period does not permit restart, execution of the restart is held back, which reduces the time during which users cannot operate the terminal because of a restart.

[0013]

[Embodiments of the invention] Embodiments of the present invention are described below with reference to the drawings. The invention is not limited to these embodiments.

First embodiment

FIG. 1 is a functional block diagram of the present invention. In the figure, reference numeral 1 denotes a monitoring function, which monitors the Web browser, the various processes that make up the platform, and the application processes, and detects abnormalities. Reference numeral 2 denotes an operation management function, which manages the dependencies among the Web browser, the platform processes, and the application processes. Reference numeral 3 denotes a Web browser, which normally displays business screens using HTML and scripts. Reference numeral 4 denotes the various platform processes and application processes, which provide platform functions and business functions. Shown here are PF process 4(1), PF process 4(2), AP process 4(3), and AP process 4(4). Because processes usually depend on one another, dependencies are assumed here as well. Reference numeral 5 denotes an operation management information table, which describes the dependencies among the Web browser, the platform processes, and the application processes. Reference numeral 6 denotes the operating system, assumed here to be Windows NT (Microsoft Corporation, USA). Each function can be realized by a CPU executing a program stored in a storage medium such as a ROM or RAM; a detailed description is omitted.

[0014] Next, the flow of smart restart processing, which prevents system stoppages caused by memory leaks, system resource leaks, and the like before they occur, is described. FIG. 2 shows the flowchart.
S1: The monitoring function 1 acquires the memory usage, the number of handles in use, and the CPU utilization of the monitored process. It also sends a Windows message to the target process and checks whether a response is returned.

[0015] S2: The monitoring function 1 compares the acquired memory usage, number of handles, and CPU utilization with the values acquired up to the previous time, and judges whether resources are being consumed abnormally.
S3: The monitoring function 1 repeats S1 and S2 until all monitored processes have been checked.
S4: When all monitored processes have been checked, the monitoring function 1 sleeps until the next monitoring cycle.
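Steps S1 to S4 can be sketched in Python as follows. The specification does not define what counts as abnormal consumption, so this sketch assumes that a strictly rising memory or handle count over a fixed number of consecutive samples indicates a leak, and that sustained high CPU utilization indicates a loop. The class names, the window size, and the CPU limit are hypothetical, and the samples are supplied by the caller rather than read from the operating system.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    memory_kb: int
    handles: int
    cpu_percent: float
    responded: bool = True  # whether the process answered the probe message (S1)


@dataclass
class History:
    samples: list = field(default_factory=list)


def strictly_rising(values, window):
    """True if the last `window` values each exceed the one before."""
    if len(values) < window:
        return False
    recent = values[-window:]
    return all(a < b for a, b in zip(recent, recent[1:]))


def check_process(history, sample, window=4, cpu_limit=95.0):
    """S2: compare the new sample with earlier ones and classify the process state."""
    history.samples.append(sample)
    if not sample.responded:
        return "hang"
    recent = history.samples[-window:]
    if len(recent) == window and all(s.cpu_percent >= cpu_limit for s in recent):
        return "loop"
    if strictly_rising([s.memory_kb for s in history.samples], window):
        return "resource_leak"
    if strictly_rising([s.handles for s in history.samples], window):
        return "resource_leak"
    return "ok"


def monitoring_pass(histories, read_sample):
    """S3: check every monitored process once; the caller sleeps afterwards (S4)."""
    return {name: check_process(h, read_sample(name)) for name, h in histories.items()}
```

In use, `monitoring_pass` would be called in a loop with a sleep between passes, with `read_sample` replaced by whatever platform call actually reads the counters.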

[0016] S5: If the abnormality is a hang or a loop, the operation management function 2 judges that service can already no longer be provided to the customer and starts restart processing immediately.
S6: If the abnormality is not a hang or a loop but a resource leak, the operation management function 2 judges that service to the customer can continue. If a service is being provided to a customer, it waits until that service ends so that the service is not affected.

[0017] S7: The operation management function 2 terminates all processes under management.
S8: The operation management function 2 reboots Windows NT 6 to recover from the abnormality. All processes under management start automatically when Windows NT boots, so service to customers can resume.
According to the first embodiment, appropriate action can be taken automatically according to the severity of the fault (hang, loop, or resource leak). In addition, abnormalities such as resource leaks, which do not stop service to the customer at present but could cause a serious failure if left alone, can be detected and dealt with in advance. Moreover, instead of restarting as soon as a resource leak is detected, the system checks the service status (whether a service is being provided to a customer) and can perform the restart without affecting the customer using the terminal.
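The branching in S5 to S8 can be summarized in a short Python sketch that returns the ordered list of actions for a detected abnormality. The action names and the function `plan_restart` are hypothetical labels used for illustration; the sketch only models the decision and performs no real process termination or reboot.

```python
from enum import Enum


class Abnormality(Enum):
    HANG = "hang"
    LOOP = "loop"
    RESOURCE_LEAK = "resource_leak"


def plan_restart(abnormality, customer_in_session):
    """Return the ordered actions corresponding to S5-S8 for one detected abnormality."""
    actions = []
    if abnormality in (Abnormality.HANG, Abnormality.LOOP):
        # S5: service is already unavailable, so nothing is gained by waiting.
        pass
    elif customer_in_session:
        # S6: a leak does not stop service yet, so let the current customer finish.
        actions.append("wait_for_service_end")
    actions.append("terminate_managed_processes")  # S7
    actions.append("reboot_os")                    # S8
    return actions
```

Separating the decision from the side effects in this way keeps the policy easy to test; the caller would map each action name to the real operation.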

[0018] Second embodiment

The functional configuration of the second embodiment is the same as that of the first embodiment, so its description is omitted and only the processing that differs from the first embodiment is described. The second embodiment describes the operation for achieving smart restart when a Web browser is a monitoring target, by detecting states in which service can no longer be provided to the customer because of behavior specific to the Web browser. FIG. 3 shows the flowchart. Steps identical to those in FIG. 2 are given the same reference signs, and their description is omitted.

[0019] S9: The monitoring function 1 determines whether the monitoring target is the Web browser 3 and, if it is, monitors information specific to the Web browser 3. Whether the target is the Web browser 3 can be determined by the operation management function 2 referring to the operation management information table 5. The operation management information table 5 describes the types of all processes managed by this system (platform processes, application processes, third-party processes such as the Web browser, and so on) and the dependencies between them. Whether a process is a monitoring target can also be managed in the operation management information table 5.

[0020] S10: The monitoring function 1 acquires information from the Web browser and determines whether the Web browser 3 is displaying an error dialog. If no error dialog is displayed, the state is judged normal and the next monitored process is checked. If an error dialog is displayed, it is judged that the dialog has interrupted the screen transition flow of the business application, and restart processing is executed.

[0021] According to the second embodiment, when a customer-operated system whose screens are controlled by a Web browser is provided, a function detects error dialogs that the Web browser displays automatically. Faults that an application running in the Web browser cannot detect, and that halt screen transitions, can therefore be detected. Faults that previously required manned response can now be handled automatically, and a cost reduction can be expected.

[0022] Third embodiment

The functional configuration of the third embodiment is the same as that of the first embodiment, so its description is omitted and only the processing that differs from the second embodiment is described. The third embodiment describes the operation when the processing of the second embodiment, which detects service stoppage caused by Web browser-specific problems and restarts, is extended so that it can respond differently according to the cause of the fault detected by the Web browser. FIG. 4 shows the flowchart. Steps identical to those in FIG. 2 or FIG. 3 are given the same reference signs, and their description is omitted.

[0023] S11: To find out which fault caused the dialog to be displayed, the monitoring function 1 acquires from the Web browser 3 the title of the error dialog that the Web browser 3 is currently displaying.
S12: Based on the title of the displayed error dialog, the monitoring function 1 executes the smart restart processing that corresponds to the error dialog title registered in the operation management information table 5. If the displayed error dialog is not subject to restart, the customer is allowed to continue processing as is. If the displayed error dialog is subject to restart, restart processing is executed.
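The title-based decision in S11 and S12 amounts to a lookup against registered titles, which the following Python sketch illustrates. The dialog titles are invented examples, not titles taken from any real browser, and obtaining the title from the browser window is outside the scope of the sketch (the caller passes it in, or `None` when no dialog is shown).

```python
# Hypothetical titles registered as "restart required" in the operation management table.
RESTART_DIALOG_TITLES = frozenset({"Script Error", "Out of Memory"})


def handle_dialog(title, restart_titles=RESTART_DIALOG_TITLES):
    """Decide what to do for the currently displayed error dialog (S12)."""
    if title is None:
        return "ok"        # no dialog: move on to the next monitored process
    if title in restart_titles:
        return "restart"   # registered title: run restart processing
    return "continue"      # unregistered title: let the customer carry on
```

Exact string matching is the simplest choice; if dialog titles vary (for example by including a URL), prefix or pattern matching would be a natural variation.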

[0024] According to the third embodiment, when a customer-operated system whose screens are controlled by a Web browser is provided, the fault detected by the Web browser can be recognized from the error dialog that the Web browser displays automatically, and whether a restart is necessary can be determined. This reduces the system stoppages that would result from the Web browser abnormality detection of the second embodiment. A reduction in the number of service stoppages caused by system restarts following abnormality detection can therefore be expected.

[0025] Fourth embodiment

The functional configuration of the fourth embodiment is the same as that of the first embodiment, so its description is omitted and only the processing that differs from the third embodiment is described. The fourth embodiment describes restart processing that adds two functions: one that determines whether an abnormality requires a Windows NT reboot, and one that, for an abnormality that does not require a reboot, automatically determines which processes need to be restarted and restarts them. FIG. 5 shows the flowchart. Steps identical to those in FIG. 2, FIG. 3, or FIG. 4 are given the same reference signs, and their description is omitted.

[0026] S13: The monitoring function 1 determines whether the detected abnormality requires Windows NT 6 to be rebooted. If a reboot of Windows NT 6 is required, the processing of S7 is executed. For a fault that does not require Windows NT 6 to be rebooted, only the processes related to the monitored process in which the abnormality was detected are restarted. The association between fault causes and whether NT is to be rebooted is held in advance inside the monitoring function.

[0027] S14: The operation management function 2 reads information from the operation management information table 5 in order to determine the processes related to the monitored process.
S15: Because the operation management information also describes the dependencies between processes, the operation management function 2 terminates the processes to be restarted in order according to those dependencies. The dependencies between processes describe the startup order and the termination order for processes whose startup and termination depend on order.

[0028] S16: After termination of the processes to be restarted is complete, the operation management function 2 restarts the target processes in order according to the dependencies in the operation management information.
According to the fourth embodiment, when restarting after an abnormality is detected, whether to reboot NT can be decided automatically according to the cause of the abnormality. In addition, only the processes related to the process in which the abnormality was detected are restarted and unrelated processes are left as they are, so the time required for the restart is shortened. A further reduction in system restart time following abnormality detection can therefore be expected.
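Steps S13 to S16 can be sketched in Python as a reboot decision followed by a dependency-ordered stop and start plan. Several points are assumptions rather than statements from the specification: "related processes" is taken to mean the failed process plus everything that depends on it, directly or indirectly; the dependency table is assumed to be acyclic; and the process names and cause names are hypothetical.

```python
def choose_recovery(cause, reboot_causes):
    """S13: reboot the OS only for causes registered as requiring it."""
    return "reboot_os" if cause in reboot_causes else "restart_processes"


def related_processes(failed, depends_on):
    """S14: the failed process plus every process that depends on it, transitively."""
    affected = {failed}
    changed = True
    while changed:
        changed = False
        for proc, deps in depends_on.items():
            if proc not in affected and affected & set(deps):
                affected.add(proc)
                changed = True
    return affected


def start_order(procs, depends_on):
    """S16: order the given processes so that each starts after its dependencies."""
    order, visited = [], set()

    def visit(proc):
        if proc in visited:
            return
        visited.add(proc)
        for dep in depends_on.get(proc, []):
            if dep in procs:
                visit(dep)
        order.append(proc)

    for proc in sorted(procs):
        visit(proc)
    return order


def restart_plan(failed, depends_on):
    """Return (stop_order, start_order); stopping (S15) is the reverse of starting."""
    start = start_order(related_processes(failed, depends_on), depends_on)
    return list(reversed(start)), start


# Hypothetical dependency table: each process maps to the processes it depends on.
DEPENDS_ON = {
    "browser": ["ap1"],
    "ap1": ["pf1"],
    "ap2": ["pf2"],
    "pf1": [],
    "pf2": [],
}
```

With this table, a fault in `ap1` leads to stopping `browser` and then `ap1`, and starting them again in the opposite order, while `pf1`, `pf2`, and `ap2` keep running.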

[0029] Fifth embodiment

The functional configuration of the fifth embodiment is the same as that of the first embodiment, so its description is omitted and only the processing that differs from the fourth embodiment is described. The fifth embodiment describes the operation for deciding whether to execute a restart according to business hours. FIG. 6 shows the flowchart. Steps identical to those in FIG. 2, FIG. 3, FIG. 4, or FIG. 5 are given the same reference signs, and their description is omitted.

[0030] S17: For a fault that does not require an immediate restart, the monitoring function 1 reads the business hours information and checks which time period the current time falls into, so that whether to execute the restart can be decided automatically according to business hours. Business hours must be registered in advance, divided into periods in which an immediate restart is permitted and periods in which restart is not permitted (periods in which the terminal is expected to be heavily used).

[0031] S18: If the time at which the abnormality was detected is not in a period that permits restart, the monitoring function 1 holds back execution of the restart until such a period. When a period that permits restart begins, restart processing is executed.
According to the fifth embodiment, restart processing triggered by an abnormality that does not require an immediate restart, such as a resource leak, can be executed according to the preset usage periods of the terminal. An improvement in convenience can be expected, because service stoppage caused by a restart no longer affects users.
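The time-period check in S17 and S18 can be sketched in Python as follows. The registered window below (01:00 to 05:00) is an invented example, and the function names are hypothetical. The sketch also handles a window that crosses midnight, which the specification does not discuss but which a registered night-time period could easily require.

```python
from datetime import time


def in_window(now, start, end):
    """True if `now` lies in [start, end); supports windows that cross midnight."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end


def decide(now, immediate_required, restart_windows):
    """S17/S18: restart at once for urgent faults, otherwise only inside a permitted window."""
    if immediate_required:
        return "restart_now"
    if any(in_window(now, start, end) for start, end in restart_windows):
        return "restart_now"
    return "defer"


# Hypothetical registration: restarts are permitted between 01:00 and 05:00.
RESTART_WINDOWS = [(time(1, 0), time(5, 0))]
```

A deferred restart would be retried on each monitoring cycle until `decide` returns `"restart_now"`.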

[0032]

[Effects of the invention] As described above, the fault monitoring system of the present invention detects, based on monitoring information, abnormalities such as resource leaks, hangs, and loops in a monitored process that do not cause an immediate failure but would cause one if operation continued, and restarts automatically when an abnormality is detected so that the failure is prevented in advance. This has the effect of suppressing failures in customer-operated unattended terminal systems that run 24 hours a day.

[0033] When a Web browser that displays an error dialog when an error exists is made a monitoring target, and an abnormality is judged to exist and the Web browser is restarted if an error dialog is displayed when the monitoring information is acquired, an application stoppage caused by the Web browser itself detecting a fault and displaying an error dialog can be detected and recovered from automatically by restarting.

[0034] When a list of abnormalities that require a restart is set in advance, and a restart is performed if the recognized abnormality is on the list and not performed if it is not, abnormalities that require a restart are distinguished automatically from those that do not, and recovery by restart is performed automatically when an abnormality that requires a restart is detected.

[0035] Furthermore, when a list of abnormalities that require an operating system restart is set in advance, and the operating system is restarted if the recognized abnormality is on the list while only processes are restarted if it is not, an abnormality that can be recovered by restarting processes alone is handled by restarting only the related processes and leaving the other processes running, which shortens the time required for the restart.

[0036] Furthermore, when time periods in which restart is permitted and time periods in which it is not are set in advance, the time period in which an abnormality is recognized is determined, and the restart is performed only in a permitted period, a fault that does not require an immediate restart can be handled by restarting according to how the terminal is being used. If the current period does not permit restart, execution of the restart is held back, which reduces the time during which users cannot operate the terminal because of a restart.

[Brief description of the drawings]

FIG. 1 is a functional block diagram of the present invention.

FIG. 2 is a flowchart of the first embodiment.

FIG. 3 is a flowchart of the second embodiment.

FIG. 4 is a flowchart of the third embodiment.

FIG. 5 is a flowchart of the fourth embodiment.

FIG. 6 is a flowchart of the fifth embodiment.

[Explanation of symbols]

1: monitoring function
2: operation management function
3: Web browser
4: processes
5: operation management information table

Claims

1. A fault monitoring system characterized in that monitoring information for the normal state of a process to be monitored is set in advance, monitoring information of the process is acquired, the acquired monitoring information is compared with the set monitoring information to determine whether an abnormality exists, an abnormality is judged to exist when the two pieces of monitoring information differ, and the process is restarted.

2. The fault monitoring system according to claim 1, characterized in that a Web browser that displays an error dialog when an error exists is made a monitoring target, an abnormality is judged to exist if an error dialog is displayed when the monitoring information is acquired, and the Web browser is restarted.

3. The fault monitoring system according to claim 1, characterized in that a list of abnormalities requiring restart is set in advance, and a restart is performed if the recognized abnormality is on the list and is not performed if it is not on the list.

4. The fault monitoring system according to claim 1, characterized in that a list of abnormalities requiring restart of the operating system is set in advance, the operating system is restarted if the recognized abnormality is on the list, and a process is restarted if it is not on the list.

5. The fault monitoring system according to claim 1, 2, 3, or 4, characterized in that time periods in which restart is permitted and time periods in which it is not are set in advance, the time period in which an abnormality is recognized is determined, and a restart is performed only in a time period in which restart is permitted.