JP6368842B2 - Process monitoring program and process monitoring system

Info

Publication number: JP6368842B2
Application number: JP2017219896A
Authority: JP
Inventors: 真之山岡
Original assignee: MUFG Bank Ltd
Current assignee: MUFG Bank Ltd
Priority date: 2017-11-15
Filing date: 2017-11-15
Publication date: 2018-08-01
Anticipated expiration: 2033-03-15
Also published as: JP2018028944A

Description

The present invention relates to process monitoring in an information processing system.

In recent years, client-server systems have come to provide a wide range of services from servers to clients. To provide these services properly, the processes running on a server are monitored by a process monitoring system or the like. For example, to determine whether each process on the server is running properly, the process monitoring system periodically performs a liveness determination (alive/dead check) on each process. Typically, the process monitoring system periodically sends a message to the monitored server and judges liveness by whether a normal response (OK response) to that message is received within a timeout period. When the liveness determination detects that a process is not operating normally (for example, when an NG response is received, or when no response is received within the timeout period), the process monitoring system switches the process over to another server (takeover).

JP 2006-146319 A
JP H10-154085 A

However, as client-server systems grow more complex, liveness determinations are sometimes wrong. For example, a process that is operating normally may be judged as not operating normally because an NG response is returned for it, or because the OK response does not reach the process monitoring system within the timeout period.

Such erroneous determinations are typically caused by bugs in individual programs, and the program vendor fixes those bugs. However, as the amount of middleware running on servers increases, erroneous determinations have begun to occur sporadically, albeit with low probability, because of bugs that depend on a system environment in which many middleware products coexist. Bugs of this type are difficult to reproduce deliberately, and a fix from the program vendor may not be forthcoming. When a takeover is executed on the basis of such an erroneous determination, various problems arise, such as wasted time and labor for the takeover and interruption of service while the takeover is in progress.

In view of the above problems, an object of the present invention is to provide a process monitoring technique that reduces erroneous process liveness determinations caused by sporadically occurring bugs.

To solve the above problem, one aspect of the present invention relates to a process monitoring program that causes a computer to execute: a step of transmitting, to a process to be monitored, each process monitoring message of a set of process monitoring messages at a different transmission timing within a first time; and a step of determining whether the process is in a failed state based on the reception state of responses to the set of process monitoring messages.

Another aspect of the present invention relates to a process monitoring system comprising: a transmission unit that transmits, to a process to be monitored, each process monitoring message of a set of process monitoring messages at a different transmission timing within a first time; a reception unit that receives responses to the set of process monitoring messages; and a determination unit that determines whether the process is in a failed state based on the reception state of the responses.

According to the present invention, erroneous process liveness determinations caused by sporadically occurring bugs can be reduced, and a failure can be detected without waiting for a timeout so that takeover starts promptly.

FIG. 1 is a schematic diagram showing an information processing system according to an embodiment of the present invention.

FIG. 2 is a block diagram showing the hardware configuration of a process monitoring system according to an embodiment of the present invention.

FIG. 3 is a block diagram showing the functional configuration of a process monitoring system according to an embodiment of the present invention.

FIG. 4 is a flowchart showing processing in the process monitoring system according to an embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings.

The embodiments described below disclose a process monitoring system for reducing erroneous process liveness determinations caused by sporadically occurring bugs. The process monitoring system transmits, to a monitored process running on a server, each process monitoring message of a set of process monitoring messages at a different transmission timing within a first time. In one embodiment, two process monitoring messages are transmitted to each process on the server with a very small time difference (for example, a few hundred milliseconds). On receiving a process monitoring message, a process that is operating normally returns a normal response (OK response) to the process monitoring system; a process that is not operating normally either returns an NG response or returns no response at all within the timeout period. Based on the reception state of the responses to the transmitted set of process monitoring messages, the process monitoring system determines whether the process is in a failed state and, depending on the result, takes over to another server.

Transmitting multiple process monitoring messages at different timings and judging liveness from the combined responses makes it possible to avoid erroneous determinations caused by bugs that occur sporadically with low probability. Such bugs usually arise when various factors happen to coincide at one particular moment, so an erroneous determination is extremely unlikely at any other moment. This avoids takeovers based on erroneous liveness determinations, along with the momentary network disconnections and service outages that accompany them, and when the process really is in a failed state, the failure can be detected without waiting for a timeout and takeover can start promptly.
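
The effect of judging on a set of messages rather than a single one can be made concrete with a rough calculation. The following is an illustrative model only and is not part of the patent text: it assumes that each message independently draws a spurious NG response or timeout with a small probability p, an assumption that messages sent only a few hundred milliseconds apart may not satisfy exactly. The value of p is likewise hypothetical.

```latex
\begin{align*}
P(\text{false alarm} \mid \text{single message}) &= p \\
P(\text{false alarm} \mid \text{both of two responses abnormal}) &= p^{2} \\
P(\text{false alarm} \mid \text{either of two responses abnormal}) &= 1-(1-p)^{2} = 2p - p^{2} \\
\text{e.g. } p = 10^{-4}: \quad p^{2} &= 10^{-8}, \qquad 2p - p^{2} \approx 2 \times 10^{-4}
\end{align*}
```

Under this model, requiring both responses to be abnormal suppresses false alarms sharply, while treating either abnormal response as a failure roughly doubles them in exchange for reacting to the first sign of trouble.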

First, an information processing system according to an embodiment of the present invention is described with reference to FIG. 1. The information processing system is, for example, a client-server system in which a server provides various services in response to requests from clients. In this embodiment, the information processing system is suited to systems that require high availability. It is designed to minimize both switching to a spare server (takeover or failover) in response to a failure on the active server and the resulting service interruption, while starting takeover promptly when a failure has actually occurred.

FIG. 1 is a schematic diagram showing an information processing system according to an embodiment of the present invention. As shown in FIG. 1, the information processing system 10 includes a process monitoring system 100, servers 201 and 202, a database (DB) 250, and a terminal device 300. In the illustrated embodiment, the server 201 is the active server and the server 202 is its spare.

The process monitoring system 100 monitors each process running on the active server 201. As described in detail below, the process monitoring system 100 transmits multiple liveness monitoring messages to each process running on the server 201 at different transmission timings and, based on the responses received from the process, determines whether the process is operating normally or has failed. When it determines that a failure has occurred, the process monitoring system 100 switches that process from the active server 201 to the spare server 202 (takeover or failover).

The server 201 is the active server and provides various services to the terminal device 300, which is a client. In one embodiment, on receiving a processing request from the terminal device 300, the server 201 acquires the data relevant to the request from the database 250, performs the requested processing on that data, and returns the result to the terminal device 300. To carry out this processing, various programs such as an operating system (OS), middleware, and applications are started on the server 201, and various processes run for these programs. In response to a liveness monitoring message from the process monitoring system 100, each process returns a normal response (OK response) if it is operating normally; if it is not operating normally, it either returns an NG response or, because of the failure, returns no response at all.

The server 202 is a spare for the active server 201. When a failure occurs on the server 201, the server 202 starts the corresponding process in place of the server 201 in response to a takeover instruction from the process monitoring system 100. A takeover can typically take on the order of 10 to 15 minutes, and service to the terminal device 300 may be temporarily interrupted while it is in progress. For this reason, takeovers should be kept to a minimum, particularly in an information processing system 10 that requires high availability.

The database 250 stores the data needed to execute the processing requests from the terminal device 300. For example, this data is acquired and manipulated through database middleware started on the servers 201 and 202.

The terminal device 300 is a client device in the information processing system 10 and is typically implemented by an information terminal such as a desktop computer or a notebook computer. The terminal device 300 transmits processing requests to the active server 201 and receives the processing results from the server 201. When a failure is detected in a process on the server 201 and a takeover to the server 202 is executed, the terminal device 300 begins communicating with the server 202.

Next, a process monitoring system according to an embodiment of the present invention is described with reference to FIGS. 2 and 3. As described above, the process monitoring system 100 transmits, to a monitored process running on the server 201, each process monitoring message of a set of process monitoring messages at a different transmission timing within a predetermined transmission time. It then determines whether the process is in a failed state based on the reception state of the responses to the transmitted set (for example, whether a response was received within a predetermined reception time, and whether a received response is an OK response). Depending on the result, the process monitoring system 100 executes a takeover of the process to the server 202.

FIG. 2 is a block diagram showing the hardware configuration of a process monitoring system according to an embodiment of the present invention. As shown in FIG. 2, the process monitoring system 100 includes a drive device 101, an auxiliary storage device 102, a memory device 103, a CPU (Central Processing Unit) 104, an interface device 105, and a timer 106, which are interconnected via a bus B.

The programs that implement the functions and processing of the process monitoring system 100 described below, including the process monitoring program, may be provided on a recording medium 107 such as a CD-ROM (Compact Disc Read-Only Memory). When the recording medium 107 storing a program is set in the drive device 101, the program is installed from the recording medium 107 into the auxiliary storage device 102 via the drive device 101. The program need not be installed from the recording medium 107, however, and may instead be downloaded from an external device via a network (not shown). The auxiliary storage device 102 stores the installed programs along with the necessary files and data.

When instructed to start a program, the memory device 103 reads the program and its data from the auxiliary storage device 102 and stores them. The CPU 104 executes the functions and processing of the process monitoring system 100 described below in accordance with the programs stored in the memory device 103 and with data such as the parameters needed to run them. The interface device 105 is used as a communication interface for connecting to a network or an external device. The timer 106 is provided as a timekeeping means.

However, the process monitoring system 100 is not limited to the hardware configuration described above and may be implemented by any suitable information processing device, such as a server, a personal computer, or a mobile device.

FIG. 3 is a block diagram showing the functional configuration of a process monitoring system according to an embodiment of the present invention. As shown in FIG. 3, the process monitoring system 100 includes a transmission unit 110, a reception unit 120, a determination unit 130, and a migration instruction unit 140.

The transmission unit 110 transmits, to a monitored process on the server 201, each process monitoring message of a set of process monitoring messages at a different transmission timing within a predetermined transmission time. In one embodiment, the transmission unit 110 sends each process two process monitoring messages, duplicated with a very small time difference (for example, a few hundred milliseconds). The predetermined transmission time is not limited to any particular value, but is set to be at least shorter than the timeout period. Each set of process monitoring messages is typically transmitted periodically at an interval longer than the transmission time.
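
The timing constraints described for the transmission unit 110 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `plan_probe_offsets` and all of its parameters are hypothetical, times are integer milliseconds, and the check against the monitoring interval reflects the relation the text calls typical rather than a stated requirement.

```python
def plan_probe_offsets(count: int, gap_ms: int, send_window_ms: int,
                       timeout_ms: int, interval_ms: int) -> list[int]:
    """Return send offsets (ms from the start of a monitoring cycle) for one set.

    Hypothetical helper: every message in the set gets a distinct send time,
    all inside the send window, and the window is shorter than the timeout.
    """
    if count < 2 or gap_ms <= 0:
        raise ValueError("a set needs at least two messages with a positive gap")
    if send_window_ms >= timeout_ms:
        raise ValueError("send window must be shorter than the timeout")
    if interval_ms <= send_window_ms:
        raise ValueError("sets should be spaced further apart than the send window")
    offsets = [i * gap_ms for i in range(count)]
    if offsets[-1] > send_window_ms:
        raise ValueError("messages do not fit inside the send window")
    return offsets


# Two messages 300 ms apart inside a 1 s window, with a 5 s timeout,
# repeated every 60 s (all values are illustrative assumptions).
print(plan_probe_offsets(2, 300, 1000, 5000, 60000))  # [0, 300]
```

Validating the configuration up front keeps the later determination logic simple, since it can rely on the send window being shorter than the timeout.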

The reception unit 120 receives the responses to the set of process monitoring messages transmitted by the transmission unit 110. If the process targeted by a process monitoring message is operating normally, a normal response (OK response) is returned to the reception unit 120. If the targeted process is not operating normally, an NG response is returned to the reception unit 120, or possibly no response is returned at all.

The determination unit 130 determines whether the monitored process is in a failed state based on the reception state of the response to each process monitoring message in the set. When it determines that the monitored process is in a failed state, the determination unit 130 notifies the migration instruction unit 140 of that fact.

In one embodiment, the determination unit 130 determines whether each response to the transmitted set of process monitoring messages was received within a predetermined reception time (for example, the timeout period), and determines that the process is in a failed state when a predetermined number or more of the responses were not received within that time. As one example, the reception time may be compared against the time from the transmission of each process monitoring message to the reception of its response. That is, when, for a predetermined number or more of the transmitted process monitoring messages, the time from transmission to reception exceeds the predetermined reception time such as a timeout value, the determination unit 130 determines that the process is in a failed state. The predetermined reception time is typically set to the timeout value and is longer than the predetermined transmission time (the very small time difference) described above.

As one example, when the transmission unit 110 has transmitted two process monitoring messages as a set at different transmission timings, the determination unit 130 may determine that the process is in a failed state when neither of the two responses is received within the predetermined reception time (timeout period). That is, the process may be determined to be in a failed state when the response to the first process monitoring message is not received within the predetermined reception time from the transmission of the first message, and the response to the second process monitoring message is not received within the predetermined reception time from the transmission of the second message.

As another example, when the transmission unit 110 has transmitted two process monitoring messages as a set at different transmission timings, the determination unit 130 may determine that the process is in a failed state when either one of the two responses is not received within the predetermined reception time. That is, the process may be determined to be in a failed state when the response to the first process monitoring message is not received within the predetermined reception time from the transmission of the first message, or the response to the second process monitoring message is not received within the predetermined reception time from the transmission of the second message.
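
The timeout-based criteria above differ only in how many late or missing responses are needed before the process is judged failed. The following sketch, under stated assumptions, expresses both with one threshold: `count_missed`, `failed_by_timeout`, and `min_missed` are hypothetical names, times are integer milliseconds on a common clock, and `None` stands for a response that never arrived.

```python
from typing import Optional, Sequence


def count_missed(sent_ms: Sequence[int],
                 received_ms: Sequence[Optional[int]],
                 timeout_ms: int) -> int:
    """Count messages whose response is absent or later than the timeout.

    The elapsed time is measured per message, from that message's own
    send time to the arrival of its response.
    """
    missed = 0
    for sent, received in zip(sent_ms, received_ms):
        if received is None or received - sent > timeout_ms:
            missed += 1
    return missed


def failed_by_timeout(sent_ms: Sequence[int],
                      received_ms: Sequence[Optional[int]],
                      timeout_ms: int,
                      min_missed: int) -> bool:
    """Judge the process failed when at least min_missed responses were missed."""
    return count_missed(sent_ms, received_ms, timeout_ms) >= min_missed


sent = [0, 300]
received = [None, 420]  # first message unanswered, second answered after 120 ms
print(failed_by_timeout(sent, received, 5000, min_missed=2))  # False ("both" criterion)
print(failed_by_timeout(sent, received, 5000, min_missed=1))  # True ("either" criterion)
```

With a set of two messages, `min_missed=2` corresponds to the "both" example and `min_missed=1` to the "either" example.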

Which criterion to use may be decided according to the embodiment of the information processing system 10. In one embodiment, the criterion may be decided based on one or more of the characteristics of the process, the cost of a takeover, and the frequency of bugs that cause erroneous determinations.

For example, when another process that can substitute for the process is operating normally, the determination unit 130 may determine that the process is in a failed state only when neither of the two responses is received within the predetermined reception time. The stricter criterion is acceptable here because, even if the process is quite likely to have failed, the substitute process carries out the processing in its place and an immediate takeover is unnecessary. When no substitute process exists, on the other hand, the determination unit 130 may determine that the process is in a failed state when either one of the two responses is not received within the predetermined reception time.

Likewise, when the cost of a takeover (for example, the time the takeover requires or the availability of the server 201) is high, the determination unit 130 may determine that the process is in a failed state only when neither of the two responses is received within the predetermined reception time. When the cost of a takeover is not so high, the determination unit 130 may determine that the process is in a failed state when either one of the two responses is not received within the predetermined reception time. This makes it possible to decide whether a takeover is needed with its cost taken into account.

Similarly, when bugs that cause erroneous determinations are estimated to occur frequently, the determination unit 130 may determine that the process is in a failed state only when neither of the two responses is received within the predetermined reception time. When such bugs are estimated to occur rarely, the determination unit 130 may determine that the process is in a failed state when either one of the two responses is not received within the predetermined reception time. This reduces the number of takeovers executed because of erroneous determinations.
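
The choice between the two criteria can be sketched as a small policy function. The text presents each factor as a separate example and says that one or more of them may be used; combining all three with a logical OR, as below, is an assumption made for illustration, and the function and parameter names are hypothetical.

```python
def choose_min_missed(has_healthy_substitute: bool,
                      takeover_cost_high: bool,
                      false_alarm_bugs_frequent: bool) -> int:
    """Pick how many of the two responses must be missed before judging failure.

    Returns 2 (the stricter "both" criterion) when any factor argues against
    a hasty takeover, otherwise 1 (the "either" criterion).
    Combining the factors with OR is an illustrative assumption.
    """
    if has_healthy_substitute or takeover_cost_high or false_alarm_bugs_frequent:
        return 2
    return 1


# A process with a healthy substitute tolerates the stricter criterion.
print(choose_min_missed(True, False, False))   # 2
# No substitute, cheap takeover, rare spurious bugs: react to a single miss.
print(choose_min_missed(False, False, False))  # 1
```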

When the transmission unit 110 has transmitted two process monitoring messages as a set at different transmission timings and either one of the two responses is not received within the predetermined reception time, the determination unit 130 may have the transmission unit 110 retry before determining that the process is in a failed state. For example, the determination unit 130 may instruct the transmission unit 110 to retransmit two process monitoring messages to the process at different transmission timings.
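
The retry variation can be sketched as a loop around one send-and-collect cycle. The text does not say how many retries are allowed or what to conclude if the result stays ambiguous, so the retry limit and the final fallback below are assumptions; `judge_with_retry` and `probe_once` are hypothetical names.

```python
from typing import Callable


def judge_with_retry(probe_once: Callable[[], int], max_retries: int = 1) -> bool:
    """Return True if the process is judged failed.

    probe_once sends one set of two messages and returns how many of the
    two responses were missed (0, 1, or 2).
    """
    for _ in range(max_retries + 1):
        missed = probe_once()
        if missed == 0:
            return False  # both responses arrived in time
        if missed == 2:
            return True   # neither response arrived in time
        # Exactly one response missed: ambiguous, so send another set
        # if any attempts remain.
    # Assumption: still ambiguous after all retries counts as failed.
    return True


# Simulated outcomes: the first set misses one response, the retry misses none.
outcomes = iter([1, 0])
print(judge_with_retry(lambda: next(outcomes), max_retries=1))  # False
```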

In another embodiment, the determination unit 130 may determine whether each response to the transmitted set of process monitoring messages is an NG response, and may determine that the process is in a failed state when a predetermined number or more of the responses are NG responses.

As one example, when the transmission unit 110 has transmitted two process monitoring messages as a set at different transmission timings, the determination unit 130 may determine that the process is in a failed state when both of the two responses are NG responses. As another example, under the same conditions, the determination unit 130 may determine that the process is in a failed state when either one of the two responses is an NG response.

Which criterion to use may be decided according to the embodiment of the information processing system 10 and the accuracy with which failures can be estimated. In one embodiment, the criterion may be decided based on one or more of the characteristics of the process, the cost of a takeover, and the frequency of bugs that cause erroneous determinations.

For example, when another process that can substitute for the process is operating normally, the determination unit 130 may determine that the process is in a failure state in response to both of the two responses being NG responses. On the other hand, when no alternative process exists for the process, the determination unit 130 may determine that the process is in a failure state in response to either one of the two responses being an NG response. Likewise, when the cost of a takeover (for example, the time required for the takeover or the availability of the server 201) is large, the determination unit 130 may determine that the process is in a failure state in response to both of the two responses being NG responses. On the other hand, when the cost of a takeover is not so large, the determination unit 130 may determine that the process is in a failure state in response to either one of the two responses being an NG response. Further, when bugs that cause erroneous determinations are estimated to occur frequently, the determination unit 130 may determine that the process is in a failure state in response to both of the two responses being NG responses. On the other hand, when such bugs are estimated to occur infrequently, the determination unit 130 may determine that the process is in a failure state in response to either one of the two responses being an NG response.
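
The choice between the "both NG" rule and the "either NG" rule described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function names `required_ng_count` and `judge_pair` are hypothetical, and the text does not say how the three factors combine, so the sketch assumes that any one of them is enough to select the stricter "both NG" rule.

```python
from typing import Sequence


def required_ng_count(has_healthy_alternative: bool,
                      takeover_cost_high: bool,
                      false_ng_likely: bool) -> int:
    """Return how many of the two responses must be NG to declare a failure.

    Assumption (not stated in the text): any single factor is sufficient
    to require that both responses be NG.
    """
    if has_healthy_alternative or takeover_cost_high or false_ng_likely:
        return 2  # both responses must be NG
    return 1      # either response being NG is enough


def judge_pair(responses: Sequence[str], required: int) -> bool:
    """Return True when the pair of responses indicates a failure state."""
    ng_count = sum(1 for r in responses if r == "NG")
    return ng_count >= required
```

For instance, with a high takeover cost a single NG response out of two would not trigger a failure determination, whereas with none of the three factors present it would.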

When the transmission unit 110 has transmitted two process monitoring messages as one set at different transmission timings, the determination unit 130 may, in response to either one of the two responses to those messages being an NG response, have the transmission unit 110 retry before determining that the process is in a failure state. For example, the determination unit 130 may instruct the transmission unit 110 to retransmit two process monitoring messages to the process at different transmission timings.
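
The retry behavior can be sketched as below. The function name `judge_with_retry`, the `send_pair` callable, and the `max_retries` parameter are all hypothetical. The sketch also makes two assumptions that the text leaves open: two NG responses are treated as an immediate failure, and a mixed result that persists after the retries are used up is treated as a failure.

```python
from typing import Callable, List


def judge_with_retry(send_pair: Callable[[], List[str]],
                     max_retries: int = 1) -> bool:
    """Return True when the process is judged to be in a failure state.

    send_pair sends one set of two monitoring messages at different
    timings and returns the two responses ("OK" or "NG").
    """
    retries_used = 0
    while True:
        responses = send_pair()
        ng_count = sum(1 for r in responses if r == "NG")
        if ng_count == 0:
            return False              # all OK: healthy
        if ng_count == len(responses):
            return True               # assumption: all NG fails at once
        if retries_used >= max_retries:
            return True               # assumption: mixed result persists
        retries_used += 1             # mixed result: resend the whole set
```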

In yet another embodiment, the embodiments described above are combined. The determination unit 130 determines whether each response to the transmitted set of process monitoring messages has been received within a predetermined reception time (for example, a timeout time) and also whether each response is an NG response. When a predetermined number or more of the responses are not received within the predetermined reception time, and/or when a predetermined number or more of the responses are NG responses, the determination unit 130 may determine that the process is in a failure state.
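
A minimal sketch of this combined check follows. The `Response` record, the function name `is_failed_combined`, and the two threshold parameters are hypothetical. The "and/or" in the text is modeled with a `require_both` flag, and counting every received NG response regardless of its arrival time is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Response:
    status: Optional[str]       # "OK", "NG", or None if nothing arrived
    elapsed_s: Optional[float]  # seconds from send to receipt, or None


def is_failed_combined(responses: List[Response], timeout_s: float,
                       late_threshold: int, ng_threshold: int,
                       require_both: bool = False) -> bool:
    """Combine the timeout criterion and the NG criterion."""
    # A response is late if it never arrived or arrived after the timeout.
    late = sum(1 for r in responses
               if r.elapsed_s is None or r.elapsed_s > timeout_s)
    ng = sum(1 for r in responses if r.status == "NG")
    late_hit = late >= late_threshold
    ng_hit = ng >= ng_threshold
    if require_both:
        return late_hit and ng_hit   # the "and" reading
    return late_hit or ng_hit        # the "or" reading
```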

In yet another embodiment, the determination unit 130 may compare the time taken to transmit the set of process monitoring messages with the time taken to receive the responses to those messages, and may determine that the process is in a failure state when the time taken for reception exceeds the time taken for transmission by a predetermined threshold or more. In general, the time taken to transmit a set of process monitoring messages, that is, the time from transmission of the first process monitoring message to transmission of the last, is expected to be roughly equal to the time from reception of the first response to reception of the last response. Accordingly, when the time from reception of the first response to reception of the last response greatly exceeds the time from transmission of the first process monitoring message to transmission of the last, that is, when it exceeds that time by a predetermined threshold or more, the determination unit 130 may determine that the process is in a failure state. Here, when there is a process monitoring message for which no response was received within the timeout time, the time taken to receive the responses may be calculated from only the responses that were received, and the failure determination may be made on that basis. In addition, the number of process monitoring messages for which no response was received within the timeout time may be counted, and when that number is equal to or greater than a predetermined number, the process may be determined to be in a failure state regardless of the time taken to receive the responses.
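
The comparison of the transmission span with the reception span can be sketched as follows. The function name `is_failed_by_span` and its parameters are hypothetical, timestamps are assumed to be in seconds on a common clock, and treating the reception span as zero when fewer than two responses arrive is an assumption made only for this illustration.

```python
from typing import List, Optional


def is_failed_by_span(send_times: List[float],
                      recv_times: List[Optional[float]],
                      span_threshold_s: float,
                      max_missing: int) -> bool:
    """Judge failure from how much the responses spread out in time.

    recv_times holds None for messages with no response within the timeout.
    """
    received = [t for t in recv_times if t is not None]
    missing = len(recv_times) - len(received)
    if missing >= max_missing:
        return True  # too many timeouts, regardless of the reception span

    # Time from the first transmission to the last transmission.
    send_span = max(send_times) - min(send_times)
    # Time from the first response to the last, using received ones only.
    recv_span = max(received) - min(received) if len(received) >= 2 else 0.0
    return recv_span - send_span >= span_threshold_s
```

With messages sent 0.2 s apart and responses arriving 0.2 s apart, the two spans match and no failure is reported; a last response delayed by more than the threshold widens the reception span and triggers the determination.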

When the monitored process is determined to be in a failure state, the migration instruction unit 140 stops processing by that process and has an alternative process continue the processing. In one embodiment, the migration instruction unit 140 switches the process on the server 201 that was determined to be in a failure state over to an alternative process on the standby server 202 (takeover) and has the alternative process continue the processing.

Next, processing in a process monitoring system according to an embodiment of the present invention is described with reference to FIG. 4. FIG. 4 is a flowchart showing processing in a process monitoring system according to an embodiment of the present invention.

As shown in FIG. 4, in step S101, the process monitoring system 100 transmits the process monitoring messages of one set at different transmission timings within a predetermined transmission time. For example, the process monitoring messages may be transmitted with a predetermined time difference between them (for example, several hundred milliseconds). Each set of process monitoring messages is transmitted periodically at a predetermined interval.
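
Step S101 can be sketched as a small scheduler that spaces the messages of one set by a fixed gap and checks that the whole set fits within the transmission window. The names `send_offsets` and `send_set`, the `send` callable, and the default values are hypothetical; the injectable `sleep` parameter is there only so the sketch can be exercised without real delays.

```python
import time
from typing import Callable, List


def send_offsets(count: int, gap_s: float, window_s: float) -> List[float]:
    """Return the send offsets (seconds) for one set of messages."""
    offsets = [i * gap_s for i in range(count)]
    if offsets and offsets[-1] > window_s:
        raise ValueError("set does not fit in the transmission window")
    return offsets


def send_set(send: Callable[[int], None], count: int = 2,
             gap_s: float = 0.3, window_s: float = 1.0,
             sleep: Callable[[float], None] = time.sleep) -> None:
    """Send `count` monitoring messages, waiting gap_s between them."""
    send_offsets(count, gap_s, window_s)  # validates the schedule
    for index in range(count):
        if index > 0:
            sleep(gap_s)  # stagger the transmissions
        send(index)       # hypothetical transport call
```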

In step S102, the process monitoring system 100 executes a takeover necessity determination. That is, based on the reception state of each response to the plurality of process monitoring messages transmitted in step S101, the process monitoring system 100 executes a takeover necessity determination that judges, according to a predetermined criterion, whether the process is in a failure state. In one embodiment, the takeover necessity determination may consist of determining whether each response to the plurality of process monitoring messages transmitted in step S101 has been received within a predetermined reception time (for example, a timeout time). In another embodiment, the takeover necessity determination may consist of determining whether each response to the plurality of process monitoring messages transmitted in step S101 is an NG response. In yet another embodiment, the takeover necessity determination may consist of determining whether each response to the plurality of process monitoring messages transmitted in step S101 has been received within a predetermined reception time (for example, a timeout time) and also whether each received response is an NG response.

In step S103, the process monitoring system 100 determines whether a takeover is necessary, in accordance with the takeover necessity determination executed in step S102. For example, when the takeover necessity determination consists of determining whether each response to the plurality of process monitoring messages has been received within a predetermined reception time (for example, a timeout time), the process monitoring system 100 may determine that a takeover needs to be executed in response to a predetermined number or more of the responses not being received within the predetermined time. When the takeover necessity determination consists of determining whether each response to the plurality of process monitoring messages is an NG response, the process monitoring system 100 may determine that a takeover needs to be executed in response to a predetermined number or more of the responses being NG responses. Further, when the takeover necessity determination consists of determining whether each response to the plurality of process monitoring messages has been received within a predetermined reception time and also whether each received response is an NG response, the process monitoring system 100 may determine that a takeover needs to be executed in response to a predetermined number or more of the responses not being received within the predetermined reception time and a predetermined number or more of the received responses being NG responses.

When it is determined that a takeover is necessary (step S103: Y), in step S104 the process monitoring system 100 executes a takeover for the process on the server 201 and continues the processing of that process by starting an alternative process on the server 202. On the other hand, when it is determined that a takeover is not necessary (step S103: N), the process monitoring system 100 resumes from step S101 after the predetermined interval has elapsed.
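
The overall flow of steps S101 to S104 can be sketched as a simple loop. The function name `monitor` and the callables `probe`, `needs_takeover`, and `take_over` are hypothetical stand-ins for the transmission, determination, and migration steps, and `max_cycles` is added only so that the sketch terminates when exercised.

```python
import time
from typing import Callable, List, Optional


def monitor(probe: Callable[[], List[str]],
            needs_takeover: Callable[[List[str]], bool],
            take_over: Callable[[], None],
            interval_s: float = 60.0,
            sleep: Callable[[float], None] = time.sleep,
            max_cycles: Optional[int] = None) -> bool:
    """Run the monitoring loop; return True if a takeover was executed."""
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        responses = probe()            # S101: send one set, collect responses
        if needs_takeover(responses):  # S102 and S103: judge the responses
            take_over()                # S104: switch to the alternative process
            return True
        sleep(interval_s)              # S103: N, wait before the next set
        cycles += 1
    return False
```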

Although embodiments of the present invention have been described in detail above, the present invention is not limited to the specific embodiments described, and various modifications and changes are possible within the scope of the gist of the present invention as set forth in the claims.

Description of Symbols
10 Information processing system
100 Process monitoring system
110 Transmission unit
120 Reception unit
130 Determination unit
140 Migration instruction unit
201, 202 Servers
250 Database
300 Terminal device

Claims

A process monitoring program for causing a computer to execute:
a step of transmitting, to a process to be monitored, the process monitoring messages of a set of process monitoring messages with a minute time difference between them; and
a step of determining whether the process is in a failure state based on the reception state of responses to the set of process monitoring messages,
wherein the determining step determines whether each response to the set of process monitoring messages has been received within a timeout time that is longer than the minute time difference, and determines that the process is in a failure state when a predetermined number or more of the responses to the set of process monitoring messages are not received within the timeout time.