JP2010287142A

JP2010287142A - Fault tolerant computer system and method in fault tolerant computer system

Info

Publication number: JP2010287142A
Application number: JP2009141803A
Authority: JP
Inventors: Shusuke Yamamoto; 秀典山本; Hiromitsu Kato; 博光加藤; Masanori Yoshida; 雅徳吉田; Yoshiaki Adachi; 芳昭足達
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2009-06-15
Filing date: 2009-06-15
Publication date: 2010-12-24
Anticipated expiration: 2029-06-15
Also published as: JP5331585B2

Abstract

PROBLEM TO BE SOLVED: To enable a fault-tolerant computer system to present log data in a form that lets a user easily grasp operations, abnormal states, and the like within the system.

SOLUTION: The method includes: a step in which a plurality of processing nodes execute the same processing in parallel; a step in which the processing nodes transmit logs of that processing to a maintenance node through a network; a step in which the maintenance node receives, through the network, the logs of the same processing executed in the processing nodes; and a step in which the maintenance node links the logs of the same processing across the processing nodes. When the maintenance node receives a request for logs from a user terminal, it transmits the logs linked as the same processing to the user terminal.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a fault-tolerant computer system and to a method in a fault-tolerant computer system.

Patent Document 1 discloses a technique that converts log information recorded in various data formats into an intermediate format called a general-purpose log format, and thereby integrates log information scattered across log files on the basis of time information.

JP 2001-356939 A

In a fault-tolerant computer system composed of a plurality of processing nodes connected via a network, a user who consults logs for tasks such as system maintenance or failure cause analysis must trace a series of processing steps that spans several nodes. To do so, the user has to extract the relevant records from the logs stored separately on each node and view them side by side.

In addition, for a plurality of nodes that execute the same processing in parallel, log records relating to the same event must be extracted from each node and compared. Parallel execution of the same processing on these nodes is essential to maintaining the reliability of a fault-tolerant computer system, so a state in which parallel execution is no longer possible is among the most serious failures the system can suffer. Easy cause analysis and quick recovery are required for such a failure.

However, the technique described in Patent Document 1 cannot solve the problems described above. In a fault-tolerant computer system composed of a plurality of processing nodes connected via a network, the clocks of the individual nodes do not agree exactly. Time information therefore cannot be used to order records, nor to extract the records that relate to the same event processed in parallel on different nodes.

Moreover, the technique described in Patent Document 1 assumes either a single computer or a computer system whose constituent nodes have matching clocks; it does not assume a fault-tolerant computer system in the first place.

To solve the above problems, the present invention has the following configuration. It is a method in a fault-tolerant computer system comprising a plurality of processing nodes connected via a network and a maintenance node that acquires the logs of the processing nodes, the method including: a step in which the processing nodes execute the same processing in parallel; a step in which the processing nodes transmit logs of that processing to the maintenance node via the network; a step in which the maintenance node receives, via the network, the logs of the same processing executed on the processing nodes; a step in which the maintenance node links the logs of the same processing across the processing nodes; and a step in which the maintenance node, on receiving a log request from a user terminal, transmits the logs linked as the same processing to the user terminal.

The invention is also a fault-tolerant computer system comprising a plurality of processing nodes connected via a network and a maintenance node that acquires the logs of the processing nodes, in which: the processing nodes execute the same processing in parallel; the processing nodes transmit logs of that processing to the maintenance node via the network; the maintenance node receives, via the network, the logs of the same processing executed on the processing nodes; the maintenance node links the logs of the same processing across the processing nodes; and the maintenance node, on receiving a log request from a user terminal, transmits the logs linked as the same processing to the user terminal.

According to the present invention, work performed by the user, such as maintenance and failure analysis, can be made more efficient.

FIG. 1 is a diagram showing an outline of the distributed log integration method in the fault-tolerant computer system.

FIG. 2 is a diagram showing an outline of the fault-tolerant computer system.

FIG. 3 is a diagram showing an embodiment of the distributed log integration method in the fault-tolerant computer system.

FIG. 4 is a diagram showing the overall processing sequence among the nodes constituting the fault-tolerant computer system when the distributed log integration method is carried out.

FIG. 5 is a diagram showing the module configuration of the maintenance node, which collects and accumulates logs and presents them to the user.

FIG. 6 is a diagram showing the module configuration of the processing node and the gateway, which execute processing for requests from outside.

FIG. 7 is a diagram showing the configuration of the accumulated log data management table managed in the maintenance node.

FIG. 8 is a diagram showing an outline of the linking method by which the maintenance node arranges the logs collected from each node in the order in which processing occurred across the nodes of the fault-tolerant computer system when an input from an external system is accepted.

FIG. 9 is a diagram showing an outline of the method by which the maintenance node links, within the logs collected from each node, the log records relating to the same event among processing nodes that execute the same processing in parallel.

FIG. 10 is a flowchart showing the flow of processing in which a processing node acquires its accumulated logs and transmits them to the maintenance node.

FIG. 11 is a flowchart showing the flow of processing in which the maintenance node collects logs from each processing node, integrates and processes them, and presents them to the user.

FIG. 12 is a diagram showing an example screen for presenting to the user logs arranged across nodes in the order in which processing occurred.

FIG. 13 is a diagram showing an example screen for presenting to the user log records relating to the same event, linked among processing nodes that execute the same processing in parallel, arranged side by side for each processing node.

FIG. 14 is a diagram showing an example screen for presenting to the user the result of comparing the data contents of log records relating to the same event, linked among processing nodes that execute the same processing in parallel.

The embodiment of the present invention is described using, as an example, a fault-tolerant computer system composed of a plurality of independent nodes interconnected via a network, in which each constituent node executes the same processing in parallel. The basic idea described here is that, when a user performs tasks such as system maintenance or failure cause analysis, the various logs generated on each node as processing executes are collected and integrated into the logs presented to the user.

A log is data that, for example, the OS, middleware, and user programs output sequentially at each processing step according to the processing result, state, and so on; it is a collection of one or more records, each relating to one event. Each node stores its logs in its own format in memory, on a hard disk, or the like.

FIG. 1 is a diagram showing an outline of the method for integrating distributed logs in a fault-tolerant computer system according to an embodiment of the present invention.

The main components are a gateway 0213, which forwards input messages from an external system into the fault-tolerant computer system and aggregates messages from inside the system for forwarding to the external system, and a plurality of processing nodes 0212a, 0212b, and 0212c, which execute the same processing in parallel. At the gateway 0213, logs 0110, 0120, 0180, 0190, and 0100 are generated as processing executes. At the processing node 0212a, logs 0130, 0140, 0150, 0160, and 0170 are generated. At the processing node 0212b, logs 0131, 0141, 0151, 0161, and 0171 are generated. At the processing node 0212c, logs 0132, 0142, 0152, 0162, and 0172 are generated. In the fault-tolerant computer system, the clocks of the gateway 0213 and the processing nodes 0212a, 0212b, and 0212c do not agree exactly, and even the processing nodes 0212a, 0212b, and 0212c that execute the same processing in parallel cannot align the execution timing of every operation perfectly.

According to this embodiment, logs are linked and arranged in the order in which processing occurred, across the nodes of the fault-tolerant computer system. For example, as processing progresses, the logs 0110, 0120, 0130, 0140, 0150, 0160, 0170, 0180, 0190, and 0100 from the gateway 0213 and the processing node 0212a are linked and arranged in that order of occurrence. The details of this linking method are described later with reference to FIG. 8.

Also according to this embodiment, logs relating to the same event are linked to one another among the processing nodes 0212a, 0212b, and 0212c that execute the same processing in parallel, even when their time information does not match. For example, among the logs from the processing nodes 0212a, 0212b, and 0212c, logs 0130, 0131, and 0132 are linked as logs relating to the same event, as are logs 0140, 0141, and 0142, logs 0150, 0151, and 0152, logs 0160, 0161, and 0162, and logs 0170, 0171, and 0172. In other words, the logs relating to the same event on each processing node are linked together. The details of this linking method are described later with reference to FIG. 9.
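The grouping described above can be sketched in Python as follows. This is only an illustration under a stated assumption: each record is assumed to carry some node-independent key (the hypothetical field `event_key`), which is not taken from this text. The actual linking method is the one described with reference to FIG. 9. The point of the sketch is that grouping uses the key and never the local timestamps, which differ between nodes.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class LogRecord(NamedTuple):
    node: str         # e.g. "0212a"
    event_key: str    # hypothetical node-independent key (an assumption)
    timestamp: float  # local clock value, not comparable across nodes
    text: str


def link_same_events(records: List[LogRecord]) -> Dict[str, Dict[str, LogRecord]]:
    """Group records by event key, then by node, ignoring timestamps."""
    linked: Dict[str, Dict[str, LogRecord]] = defaultdict(dict)
    for rec in records:
        linked[rec.event_key][rec.node] = rec
    return dict(linked)


records = [
    LogRecord("0212a", "evt-1", 10.00, "request accepted"),
    LogRecord("0212b", "evt-1", 10.37, "request accepted"),
    LogRecord("0212c", "evt-1", 9.81, "request accepted"),
    LogRecord("0212a", "evt-2", 10.05, "reply sent"),
]
linked = link_same_events(records)
print(sorted(linked["evt-1"]))
```

A missing node under a given key (here, `evt-2` has no record from 0212b or 0212c) is exactly the kind of discrepancy a side-by-side view would make visible.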

FIG. 2 is a diagram showing an outline of a fault-tolerant computer system that is composed of a plurality of independent nodes interconnected via a network (communication medium) and that achieves fault tolerance by having each constituent node execute the same processing in parallel.

The main components of the fault-tolerant computer system 0201, interconnected via a LAN 0214, are a maintenance node 0211 that collects and accumulates logs and presents them to the user, two or more processing nodes 0212, a gateway server 0213 that connects to a wide area network 0203 and relays communication with external systems, and so on.

The fault-tolerant computer system 0201 provides a service by accepting a request from an external system 0204 reachable via the wide area network 0203, processing the request, and returning the processing result to the external system 0204 as a response. Here it receives an input message 0231 as the request from the external system 0204 and transmits to the external system 0204 an output message 0241 containing the processing result for that request.

Inside the fault-tolerant computer system 0201, the gateway server 0213, having received the input message 0231 as a request from the external system 0204 via the wide area network 0203, forwards it as an input message 0232 to all processing nodes 0212 in the system via the LAN 0214. The input message 0232 is broadcast so that all processing nodes 0212 receive it at almost the same time and can begin processing it. Each processing node 0212 that receives the input message 0232 processes it and transmits an output message 0242 containing the processing result to the gateway server 0213 via the LAN 0214. The gateway server 0213, having received the output messages 0242 from the processing nodes 0212, creates an output message 0241 as the response to the requesting external system 0204 and transmits it to the external system 0204. In doing so, the gateway server 0213 compares and collates the data contents of the one or more output messages 0242 received from the processing nodes 0212, judges which are correct, and transmits the correct message data to the requesting external system 0204 as the output message 0241. In this comparison and judgment, the output messages 0242 whose data contents agree with one another in the greatest number are regarded as the correct message data, and one of those correct output messages 0242 becomes the output message 0241 sent to the external system 0204.
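The majority judgment described above can be sketched as follows. This is a minimal illustration, not the gateway's actual implementation: the function name `select_output` and the use of raw `bytes` payloads are assumptions made for the example. The text does not say how ties are handled; in this sketch, `Counter.most_common` returns the payload that was encountered first among equally frequent ones.

```python
from collections import Counter
from typing import Optional, Sequence


def select_output(messages: Sequence[bytes]) -> Optional[bytes]:
    """Return the payload reported by the largest number of processing nodes."""
    if not messages:
        return None
    # Count identical payloads and take the most frequent one.
    payload, _count = Counter(messages).most_common(1)[0]
    return payload


# Three replies, one of which disagrees with the other two.
replies = [b"balance=100", b"balance=100", b"balance=999"]
print(select_output(replies))
```

Comparing whole payloads keeps the sketch short; a real gateway might compare only selected fields or a digest of the message.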

The maintenance node 0211 performs none of the online processing executed in response to requests from the external system 0204. Instead, the various logs generated at each processing node 0212 and at the gateway 0213 in the course of that processing are collected and accumulated in the maintenance node 0211. The processing involved in log collection is executed at low load so as not to affect the online processing on each processing node 0212 and the gateway 0213. The logs collected and accumulated in the maintenance node 0211 can be viewed by a user who has logged in to the maintenance node 0211 from a user terminal 0202.

The main hardware of the maintenance node 0211 consists of a processing device (CPU) 0221, a storage device (memory, hard disk) 0222, and a communication device 0223. The storage device 0222 stores a software program for collecting logs from the processing nodes 0212 and the gateway 0213 and storing them in a designated area of the storage device 0222, a software program for integrating and processing the collected logs and displaying screens to present them to the user, a software program for communication via the LAN 0214 between the maintenance node 0211 and the processing nodes 0212 or between the maintenance node 0211 and the gateway 0213, and so on; these are executed by the processing device 0221. The communication device 0223 performs the communication processing for receiving logs transmitted from the processing nodes 0212 or the gateway 0213, and for screen input and output in response to logins from the user terminal 0202.

The processing device (CPU) 0221 reads the programs needed for processing from the storage device 0222, such as a hard disk, and executes each process. It also controls the operation of the storage device 0222 and the communication device 0223. Unless otherwise noted, the processing of the maintenance node 0211 in the flowcharts and sequence diagrams described with the later drawings is executed by the processing device (CPU) 0221.

The main hardware of a processing node 0212 consists of a processing device (CPU) 0224, a storage device (memory, hard disk) 0225, and a communication device 0226. The storage device 0225 stores the data needed to run the service that the fault-tolerant computer system 0201 provides to the external system 0204, a user program that processes requests from the external system 0204 relating to that service, a software program for synchronizing the processing nodes 0212 with one another during operation, a software program for acquiring the logs generated as each software program executes and transmitting them to the maintenance node 0211, a software program for communication via the LAN 0214 between the processing node 0212 and the gateway server 0213 or between the processing node 0212 and the maintenance node 0211, and so on; these are executed by the processing device 0224. The communication device 0226 performs the communication processing for receiving the input message 0232 from the gateway server 0213 and transmitting the output message 0242 to the gateway server 0213.

The processing device (CPU) 0224 reads the programs needed for processing from the storage device 0225, such as a hard disk, and executes each process. It also controls the operation of the storage device 0225 and the communication device 0226. Unless otherwise noted, the processing of the processing node 0212 in the flowcharts and sequence diagrams described with the later drawings is executed by the processing device (CPU) 0224.

The main hardware of the user terminal 0202 consists of a processing device (CPU), a storage device (hard disk), and a communication device. The storage device stores a software program for logging in to the maintenance node 0211, entering commands, and displaying the results on screen (for example, integrated and processed log data), a software program for communication between the user terminal 0202 and the maintenance node 0211, and so on; these are executed by the processing device. The communication device performs the communication processing for screen input and output associated with logging in to the maintenance node 0211.

The processing device reads the programs needed for processing from the storage device, such as a hard disk, and executes each process. It also controls the operation of the storage device and the communication device. Unless otherwise noted, the processing of the user terminal 0202 in the flowcharts and sequence diagrams described with the later drawings is executed by the processing device (CPU).

FIG. 3 is a diagram showing an embodiment of the distributed log integration method in the fault-tolerant computer system.

When the fault-tolerant computer system 0201 accepts a request from the external system 0204, a series of steps follows: processing at the gateway 0213, communication from the gateway 0213 to the processing nodes 0212, processing at the processing nodes 0212, communication from each processing node 0212 to the gateway 0213, processing at the gateway 0213, and communication to the external system 0204. These are steps (1) to (9) in the figure; at the processing nodes 0212, steps (5), (5'), (5''), and (5''') execute in parallel.

The various logs generated by the processing executed on each node (0212, 0213) of the fault-tolerant computer system 0201 are collected via the LAN 0214 and accumulated in the maintenance node 0211 (0301). The maintenance node 0211 integrates and processes the collected logs (0302) and presents the result to the user via the user terminal 0202 (0303). Here the logs are either listed in the order in which processing executed across the nodes of the fault-tolerant computer system 0201 (0311), or the logs of the processing nodes 0212 that execute the same processing in parallel are displayed side by side (0312).

FIG. 4 is a diagram showing the overall processing sequence among the nodes constituting the fault-tolerant computer system when the distributed log integration method is carried out.

The main components are the processing node 1 (0212a), the processing node 2 (0212b), the gateway 0213, and the maintenance node 0211, which constitute the fault-tolerant computer system 0201, together with the user terminal 0202.

FIG. 4 shows the case in which a failure occurs at the processing node 1 (0212a). When the processing node 1 (0212a) detects a failure at 0401, it notifies all other nodes (the processing node 2 (0212b), the gateway 0213, and the maintenance node 0211) at once, at 0402, that a failure has occurred at the processing node 1 (0212a). At 0403, the processing node 1 (0212a) acquires its own log data accumulated locally. The processing node 2 (0212b) and the gateway 0213, having received the failure notification of 0402, likewise acquire their own accumulated log data (0411, 0421). The processing node 1 (0212a), the processing node 2 (0212b), and the gateway 0213 each transmit the acquired log data to the maintenance node 0211 (0404, 0412, 0422). The maintenance node 0211 receives the log data transmitted from each node at 0431 and accumulates it on a disk or the like at 0432.
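The sequence above can be sketched as a small in-process simulation. This is an illustration only: the classes `Node` and `MaintenanceNode`, their method names, and the use of direct method calls in place of LAN messages are all assumptions made for the example. The step numbers in the comments refer to the sequence described above.

```python
from typing import Dict, List


class MaintenanceNode:
    """Receives log data from other nodes and accumulates it (0431, 0432)."""

    def __init__(self) -> None:
        self.store: Dict[str, List[str]] = {}

    def receive(self, node_name: str, log_data: List[str]) -> None:
        self.store.setdefault(node_name, []).extend(log_data)


class Node:
    """A processing node or gateway that keeps its own local log."""

    def __init__(self, name: str, maintenance: MaintenanceNode) -> None:
        self.name = name
        self.maintenance = maintenance
        self.local_log: List[str] = []
        self.peers: List["Node"] = []

    def detect_failure(self) -> None:
        # 0401/0402: on detecting a failure, notify every other node.
        for peer in self.peers:
            peer.on_failure_notice(self.name)
        # 0403/0404: acquire this node's own log and send it.
        self.send_logs()

    def on_failure_notice(self, failed_node: str) -> None:
        # 0411/0421 and 0412/0422: notified nodes acquire and send their logs.
        self.send_logs()

    def send_logs(self) -> None:
        self.maintenance.receive(self.name, list(self.local_log))


maintenance = MaintenanceNode()
node1 = Node("processing-node-1", maintenance)
node2 = Node("processing-node-2", maintenance)
gateway = Node("gateway", maintenance)
node1.peers = [node2, gateway]
node1.local_log = ["request received", "error: disk fault"]
node2.local_log = ["request received", "reply sent"]
gateway.local_log = ["input forwarded"]

node1.detect_failure()
print(sorted(maintenance.store))
```

The sketch omits the notification sent to the maintenance node itself, since in this simplified model the maintenance node only needs to receive the log data.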

At the user terminal 0202, the user logs in to the maintenance node 0211 at 0441 and executes a log display command at 0442. The maintenance node 0211 accepts the command input, retrieves the relevant data from the accumulated log data at 0433, integrates and processes the log data according to the command input at 0434, and outputs the result to the screen at 0435.

FIG. 5 is a diagram showing the module configuration of the maintenance node, which is included in the fault-tolerant computer system and which collects and accumulates logs and presents them to the user.

The maintenance node 0211 is provided with an integrated log management unit 0501, which collects, accumulates, integrates, and processes logs and presents them to the user, and a hard disk 0502 for accumulating the collected logs. The integrated log management unit 0501 is a software program executed by the processing device (CPU) 0221.

The main components of the integrated log management unit 0501 are as follows. A log collection and accumulation unit 0511 receives, via a data communication unit 0515, the logs transmitted from the processing nodes 0212 and the gateway 0213 and stores them on the hard disk 0502. An integration and processing unit 0512 retrieves the relevant log data from the hard disk 0502 in response to a user request and integrates and processes it. A screen display unit 0513 provides, for a remote connection from the user terminal 0202, the command input screen, the output screen for the results of the integration and processing unit 0512, and so on. A remote access reception unit 0514 accepts remote connections from the user terminal 0202 via the data communication unit 0515 and connects them to the screen display unit 0513. The data communication unit 0515 communicates with the processing nodes 0212, the gateway 0213, and other nodes in the system via the communication medium 0214. The log collection and accumulation unit 0511 updates the accumulated log data management table 0521 each time it collects and accumulates log data, and the integration and processing unit 0512 refers to the accumulated log data management table 0521 to extract the relevant log data in response to a user request. If the relevant data is not on the hard disk 0502, it is obtained from the processing node 0212 or the gateway 0213 by a query and response through the log collection and accumulation unit 0511. The accumulated log data management table 0521 is described later with reference to FIG. 7.

FIG. 6 is a diagram showing the module configuration of the processing nodes and the gateway, which make up the fault-tolerant computer system and execute processing in response to external requests.

The processing node 0212 contains a log data collection unit 0601, which acquires log data and sends it to the maintenance node 0211; an OS 0602, middleware 0603, and a user program 0604, which perform the various kinds of processing requested by the external system 0204; a synchronization processing control unit 0605, which synchronizes the processing of the OS 0602, the middleware 0603, and the user program 0604 among the processing nodes 0212; and a memory 0606 and a hard disk 0607, to which the OS 0602, the middleware 0603, and the user program 0604 write their log data. The log data collection unit 0601 and the synchronization processing control unit 0605 are software programs executed by the processing device (CPU) 0224.

The log data collection unit 0601 consists mainly of the following: a failure occurrence detection unit 0611, which detects a failure of the own node by monitoring the own node's OS 0602, middleware 0603, and user program 0604, or detects a failure of another node by receiving a failure notification from another processing node 0212 via the data communication unit 0614; a log data acquisition unit 0612, which, on an instruction from the failure occurrence detection unit 0611 or the like, acquires from the memory 0606 and the hard disk 0607 the logs that the OS 0602, the middleware 0603, and the user program 0604 have output and accumulated; a log data transfer unit 0613, which transfers the log data acquired by the log data acquisition unit 0612 to the maintenance node 0211 via the data communication unit 0614; and the data communication unit 0614, which communicates with the other processing nodes 0212, the gateway 0213, the maintenance node 0211, and other nodes in the system via the communication medium 0214. The log data transfer unit 0613 may also respond to a log data request that arrives from the maintenance node 0211 as an inquiry via the data communication unit 0614, sending back through the data communication unit 0614 the log data it obtains through the log data acquisition unit 0612.

For the gateway 0213, the module configuration is that of FIG. 6 without the user program 0604 and the synchronization processing control unit 0605.

FIG. 7 is a diagram showing the structure of the accumulated log data management table managed in the maintenance node, which is included in the fault-tolerant computer system and performs processing such as log collection, accumulation, and presentation to the user.

The main fields of the accumulated log data management table are log type 0701, source node 0702, collection sequence number 0703, collection time 0704, oldest record time 0705, latest record time 0706, oldest synchronization sequence number 0707, latest synchronization sequence number 0708, number of records 0709, and file storage location 0710.

The log type 0701 stores information indicating the type of the log collected from the processing node 0212 or the gateway 0213. The source node 0702 stores information identifying which processing node 0212 or gateway 0213 generated the collected log. The collection sequence number 0703 stores a sequence number that is incremented each time collection of logs from the processing nodes 0212 and the gateway 0213 to the maintenance node 0211 is started. The collection time 0704 stores the time at which the maintenance node 0211 received the log sent from the processing node 0212 or the gateway 0213, measured with the maintenance node's own clock. The oldest record time 0705 stores the time that the source processing node 0212 or gateway 0213 stamped on the oldest of the one or more log records contained in the collected log. The latest record time 0706 stores the time that the source processing node 0212 or gateway 0213 stamped on the newest of those log records. The oldest synchronization sequence number 0707 stores the synchronization sequence number that the source processing node 0212 assigned to the oldest of the log records contained in the collected log. The latest synchronization sequence number 0708 stores the synchronization sequence number that the source processing node 0212 assigned to the newest of those log records. The number of records 0709 stores the number of log records contained in the collected log. The file storage location 0710 stores the file path at which the collected log is stored.

The accumulated log data management table is updated each time logs are collected from the processing nodes 0212 and the gateway 0213 and stored on the hard disk of the maintenance node 0211, and it is consulted when logs are presented in response to a user request.
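The table of FIG. 7 can be pictured as a list of rows, one per collected log file. The following is a minimal sketch under stated assumptions, not the implementation described in this document: the class names, the field names, the `None` value for gateway logs (which carry no synchronization sequence number), and the overlap-based lookup are all illustrative choices.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class AccumulatedLogEntry:
    """One row of the table; comments give the reference numerals."""
    log_type: str                   # 0701
    source_node: str                # 0702
    collection_seq: int             # 0703
    collected_at: datetime          # 0704, maintenance-node clock
    oldest_record_time: datetime    # 0705, source-node clock
    latest_record_time: datetime    # 0706, source-node clock
    oldest_sync_seq: Optional[int]  # 0707, None for gateway logs (assumption)
    latest_sync_seq: Optional[int]  # 0708, None for gateway logs (assumption)
    record_count: int               # 0709
    file_path: str                  # 0710


class AccumulatedLogTable:
    def __init__(self) -> None:
        self.rows: List[AccumulatedLogEntry] = []

    def register(self, entry: AccumulatedLogEntry) -> None:
        """Add a row; meant to be called each time a collected log is stored."""
        self.rows.append(entry)

    def find(self, source_node: str, start: datetime,
             end: datetime) -> List[AccumulatedLogEntry]:
        """Rows of one node whose record-time span overlaps [start, end]."""
        return [
            r for r in self.rows
            if r.source_node == source_node
            and r.oldest_record_time <= end
            and r.latest_record_time >= start
        ]
```

An empty result from `find` corresponds to the case in which the maintenance node has to query the processing node or gateway for the missing data.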

FIG. 8 is a diagram outlining how the maintenance node, which is included in the fault-tolerant computer system and performs processing such as log collection, accumulation, and presentation to the user, links the logs collected from each node so that they can be arranged in the order in which processing occurred across the nodes of the fault-tolerant computer system when an input from the external system is accepted.

The main components are the gateway 0213 and the processing node 0212, which make up the fault-tolerant computer system 0201. The gateway 0213 and the processing node 0212 each keep their own time, and the two clocks are not synchronized (0801, 0802). No single piece of information exists that links records uniformly across nodes. The maintenance node 0211 therefore combines partial links, within one node and between nodes, to arrange the multiple heterogeneous logs it collects and accumulates in the order in which processing occurred across the nodes of the fault-tolerant computer system 0201.

For the logs of processing in the gateway 0213 after an input message is received from the external system 0204, the maintenance node 0211 uses the PID (process ID), which the various logs (communication log, log A) generally have in common. Starting from the log of message reception from the external system 0204, it links the logs that share the same PID in order of occurrence, according to the time 0801 stamped on each log (0811, 0812, 0813).

Communication from the gateway 0213 to the processing node 0212 is logged separately on each side. To link the send log and the receive log of the same message, the maintenance node 0211 extracts the logs whose message identification information and sequence number match (0831).

For the logs of processing in the processing node 0212 after a message is received from the gateway 0213, the maintenance node 0211 proceeds as for the internal processing logs of the gateway 0213. It looks at the PID in the various logs (communication log, log B, log C) and, starting from the log of message reception from the gateway 0213, links the logs that share the same PID in order of occurrence, according to the time 0802 stamped on each log (0821, 0822, 0823, 0824, 0825, 0826). If the executing process is switched or a new process is started in the course of processing, the maintenance node refers to the logs of those events to track the change in the PID of interest.

For the logs of communication from the processing node 0212 to the gateway 0213, the maintenance node 0211 links the send log and the receive log of the same message (0832) in the same way as for communication from the gateway 0213 to the processing node 0212.

For the logs of processing in the gateway 0213 after a message is received from the processing node 0212, the maintenance node 0211 proceeds as for the internal processing logs of the gateway 0213 or of the processing node 0212 described above. It looks at the PID in the various logs (communication log, log A) and, starting from the log of message reception from the processing node 0212, links the logs that share the same PID in order of occurrence, according to the time 0801 stamped on each log (0814, 0815, 0816).

By combining these partial links within and between nodes, the maintenance node 0211 can link the whole series of processing that spans the nodes of the fault-tolerant computer system 0201, from the reception of an input message from the external system 0204 to the transmission of the response message to the external system 0204.
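The combination of partial links described above can be sketched as a short tracing loop. This is an illustration under assumptions, not the method as implemented: the `Rec` type, its field names, and the `trace` function are hypothetical, each send or receive record is assumed to carry a (message identification, sequence number) pair, and the PID change tracking mentioned above is left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Rec:
    node: str
    time: float   # node-local clock; only comparable within one node
    pid: int
    kind: str     # "recv", "send", or "other"
    msg_key: Optional[Tuple[str, int]] = None  # (message id, sequence number)


def trace(records: List[Rec], first_recv: Rec) -> List[Rec]:
    """Follow one request across nodes, starting from its first receive log."""
    chain: List[Rec] = []
    recv: Optional[Rec] = first_recv
    while recv is not None:
        # Within a node: same PID, ordered by that node's own timestamps.
        local = sorted(
            (r for r in records
             if r.node == recv.node and r.pid == recv.pid and r.time >= recv.time),
            key=lambda r: r.time,
        )
        send: Optional[Rec] = None
        for r in local:
            chain.append(r)
            if r.kind == "send":
                send = r
                break
        # Between nodes: pair the send log with the receive log of the same message.
        recv = None
        if send is not None and send.msg_key is not None:
            recv = next(
                (r for r in records
                 if r.kind == "recv" and r.node != send.node
                 and r.msg_key == send.msg_key),
                None,
            )
    return chain
```

Timestamps are compared only among records of the same node, so the unsynchronized clocks of the gateway and the processing node never need to be reconciled.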

FIG. 9 is a diagram outlining how the maintenance node, which is included in the fault-tolerant computer system and performs processing such as log collection, accumulation, and presentation to the user, links log records of the same event across multiple processing nodes that execute the same processing in parallel, using the logs collected from each node.

The main components are the processing nodes 0212a, 0212b, and 0212c, which execute the same processing (processes 1 to 4) in parallel. The processing nodes 0212a, 0212b, and 0212c each keep their own time, and their clocks are not synchronized (0901, 0902, 0903). Even if log records on different processing nodes carry the same time, the events they record may not actually have occurred simultaneously.

The processing nodes 0212a, 0212b, and 0212c receive the message 0911 broadcast from the gateway 0213 at the same time and start processing together. In addition, when a particular process is about to start, the processing nodes exchange a synchronization signal 0912 so that they can wait for one another. These mechanisms only keep the execution timing of the processing nodes from drifting far apart; they do not make the nodes execute in exact lockstep.

Given the behavior of the processing nodes 0212a, 0212b, and 0212c described above, the course of execution contains synchronization points 0921 at which the processing nodes wait for one another before starting processing, such as when a message or a synchronization signal is received. A synchronization point 0921 aligns the start timing of a particular process among the processing nodes. It is placed in the code of the user program running on the processing node 0212, immediately before I/O access to the network, shared memory, hard disk, or the like, or before the start of a new process. On reaching a synchronization point, each processing node 0212 pauses execution and broadcasts the synchronization signal 0912. Each processing node 0212 resumes processing after confirming that it has received the synchronization signal 0912 from all the other processing nodes 0212.

A synchronization sequence number 0922, which is incremented by 1 each time a synchronization point 0921 is passed, is introduced and associated with the logs of each processing node. Either a field for the synchronization sequence number 0922 is added to each log record (one record per event in the various logs) and the value at the time the record is generated is written into it, or a separate correspondence table is kept that records the identification code of each log record together with the value of the synchronization sequence number 0922 at the time the record was generated. The processing node 0212 sends the maintenance node 0211 either the log records containing the synchronization sequence number 0922 or the correspondence table together with the log records.
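The following sketch illustrates the first of the two options, in which the sequence number is written into each log record. It is a simulation under assumptions rather than the described implementation: threads stand in for processing nodes, `threading.Barrier` stands in for the exchange of synchronization signals 0912, and `NodeLogger`, `run_demo`, and the record codes are hypothetical names.

```python
import threading
from typing import Dict, List


class NodeLogger:
    """Per-node logger that tags every record with the current sync sequence number."""

    def __init__(self, name: str, barrier: threading.Barrier) -> None:
        self.name = name
        self.sync_seq = 0
        self.records: List[Dict] = []
        self._barrier = barrier

    def sync_point(self) -> None:
        # Pause until every node has arrived, then count the passage.
        self._barrier.wait()
        self.sync_seq += 1

    def log(self, code: str, message: str) -> None:
        self.records.append(
            {"node": self.name, "code": code, "message": message,
             "sync_seq": self.sync_seq}
        )


def run_demo(n_nodes: int = 3) -> List[NodeLogger]:
    barrier = threading.Barrier(n_nodes)
    nodes = [NodeLogger(f"node{i + 1}", barrier) for i in range(n_nodes)]

    def work(node: NodeLogger) -> None:
        node.log("MSG_RECV", "message received")
        node.sync_point()  # placed just before an I/O access
        node.log("DISK_WRITE", "write started")

    threads = [threading.Thread(target=work, args=(n,)) for n in nodes]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return nodes
```

Because every node increments the counter at the same points in the program, records with the same sequence number belong to the same stretch of processing on every node, whatever the local clocks say.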

Within a range delimited by the synchronization sequence number 0922, the logs of the processing nodes contain records of the same processes or events. Within such a range, log records of the same event are therefore linked between nodes based on the identification code of each record and on the order in which records of the same kind occurred within the range. For example, in the range where the synchronization sequence number 0922 is "n+1", processing node 1 (0212a) has logs 0932 to 0935 and processing node 2 (0212b) has logs 0942 to 0948. Based on the identification codes and on the order of occurrence among records with the same identification code, log 0932 is linked with 0942, log 0933 with 0943, log 0934 with 0946, and log 0935 with 0948. A log that occurs on only some of the processing nodes and has no counterpart on the others (for example, an error log) is treated as unlinked. Logs 0944, 0945, and 0947 are unlinked logs that exist only on processing node 2 (0212b).

FIG. 10 is a flowchart showing how a processing node, which is part of the fault-tolerant computer system and executes processing in response to external requests, acquires the logs it has accumulated and sends them to the maintenance node.

In 1001, the node detects that a failure has occurred. If the detected failure occurred in the own node, the node sends a failure notification to the other nodes in 1002. If the failure was detected in 1001 by receiving a failure notification from another node, 1002 is skipped. In 1003, the node acquires the log data accumulated in the own node. The log data acquired here is all of the accumulated log data that had not yet been sent to the maintenance node 0211 when the failure was detected. In 1004, the node sends the log data acquired in 1003 to the maintenance node. In 1005, if the failure occurred in the own node, termination processing is performed in 1006 so that the node shuts down cleanly. If the failure did not occur in the own node, the processing from 1001 to 1005 is repeated.
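One pass through steps 1001 to 1005 can be sketched as a single function. This is a minimal illustration under assumptions: `handle_failure` and its parameters are hypothetical, and the `notify` and `send` callbacks stand in for the data communication unit.

```python
from typing import Callable, Iterable, List


def handle_failure(
    own_node: str,
    failed_node: str,
    unsent_logs: Iterable[str],
    notify: Callable[[str], None],
    send: Callable[[List[str]], None],
) -> bool:
    """Return True if the caller should run termination processing (1006)."""
    own_failure = failed_node == own_node
    if own_failure:
        notify(failed_node)      # 1002: tell the other nodes
    send(list(unsent_logs))      # 1003-1004: acquire and send unsent logs
    return own_failure           # 1005: terminate only on an own-node failure
```

A caller would loop on this function, returning to failure detection whenever it yields `False`.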

In 1004, the processing node 0212 assigns a priority to each log according to how long the node retains that log's data, the size of the log data, the load of the online processing in progress, and similar factors, and sends the logs to the maintenance node 0211 accordingly. For example, logs whose retention period on the processing node 0212 ends soonest are sent first. Logs whose retention period ends later are not all sent immediately after the log data is acquired in 1003; they are sent with a delay, or divided and sent at fixed intervals. In every case, transmission of each log is guaranteed to complete before its retention period ends. Small logs may be sent in parallel with online processing. Large logs can affect online processing by increasing the processing load or occupying communication bandwidth, so they are sent while no online processing is running. The CPU load factor of the processing node 0212 is also measured, and, particularly while online processing is running, transmission is suspended when the CPU load factor exceeds a set threshold. These measures minimize the effect on the processing of inputs from the external system 0204.
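The prioritization in 1004 can be illustrated with a small planning function. This is a sketch under assumptions: `PendingLog`, `plan_transmission`, and the default values of `cpu_threshold` and `large_bytes` are hypothetical, and the guarantee that every log is sent before its retention period ends is not enforced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PendingLog:
    name: str
    size_bytes: int
    seconds_until_expiry: float  # time left in the node's retention period


def plan_transmission(
    logs: List[PendingLog],
    online_busy: bool,
    cpu_load: float,
    cpu_threshold: float = 0.8,      # hypothetical threshold
    large_bytes: int = 10_000_000,   # hypothetical "large log" boundary
) -> List[str]:
    """Return the names of the logs to send now, most urgent first."""
    if online_busy and cpu_load > cpu_threshold:
        return []  # suspend sending while online processing is under load
    ordered = sorted(logs, key=lambda log: log.seconds_until_expiry)
    if online_busy:
        # Large logs wait until no online processing is running.
        ordered = [log for log in ordered if log.size_bytes < large_bytes]
    return [log.name for log in ordered]
```

Calling the function again once online processing has stopped releases the large logs that were held back.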

FIG. 11 is a flowchart showing how the maintenance node, which is included in the fault-tolerant computer system and performs processing such as log collection, accumulation, and presentation to the user, collects logs from each processing node, integrates and processes them, and presents them to the user.

In 1101, the maintenance node accepts a command execution request from the user via the user terminal 0202. The command input includes the processing requested, the range of data to be covered, and so on. In 1102, the maintenance node searches the log data stored on the hard disk 0502 of the maintenance node 0211 for data within the range specified by the command input of 1101. In 1103, if the search of 1102 found no matching data, the maintenance node 0211 queries the processing node 0212 and the gateway 0213 in 1104 and obtains the data. If the search of 1102 found matching data, 1104 is skipped. In 1105, if the processing requested by the command input of 1101 is "processing order display", then in 1106 the maintenance node extracts, from the log records obtained in 1102 or 1104, the receive log of the destination node (initially the gateway 0213; the first receive log is that of the message from the external system 0204). In 1107, it extracts a log record whose PID matches that of the log record extracted in 1106. In 1108, if the log record extracted in 1107 is not a send log of that node (initially, the log of message transmission from the gateway 0213 to the processing node 0212), 1107 is repeated. If the log record extracted in 1107 is a send log of that node, the extracted log records are arranged for display in 1109. In 1110, if extraction and arrangement of the log records of all nodes (in order of execution: gateway 0213, processing node 0212, gateway 0213) have not finished, 1106 to 1109 are repeated. If extraction and arrangement for all nodes have finished, the results of the processing up to 1110 are displayed on the screen in 1111, and the processing ends.

In 1105, if the processing requested by the command input of 1101 is "parallel display" or "comparison display", then in 1112 the log records obtained in 1102 or 1104 are divided into ranges by the value of the synchronization sequence number. In 1113, the maintenance node refers to the linking information (see FIG. 9) and extracts, from the log records of one of the ranges produced in 1112, the records of the same event on the different processing nodes. In 1114, if the log records of all processing nodes for one event have not yet been extracted, 1113 is repeated. If they have all been extracted, the maintenance node checks the requested processing in 1115. If it is "comparison display", the data contents of the log records of the same event extracted in 1113 for all processing nodes are compared in 1116. If it is "parallel display", 1116 is skipped. In 1117, if not all log records within the range of one synchronization sequence number value have been processed, 1112 to 1116 are repeated. If all of them have been processed, the extracted log records are arranged for display in 1118. In 1119, if not all of the ranges produced in 1112 have been processed, 1112 to 1118 are repeated. If all of the ranges have been processed, the results of the processing up to 1119 are displayed on the screen in 1111, and the processing ends.
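The range division and same-event extraction of 1112 and 1113 can be sketched as follows. The function name, the record keys, and the identification codes "A", "B", "C", and "E" used in the usage note are assumptions made for illustration; this document does not state the actual codes. As a simplification, an event that is missing on any node is treated as unlinked on every node.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def link_same_events(
    records: List[dict], nodes: List[str]
) -> Tuple[List[Dict[str, str]], List[str]]:
    """Link records by (sync range, identification code, k-th occurrence).

    Each record is assumed to have the keys "node", "sync_seq", "code", and
    "id", and records of one node are assumed to be in order of occurrence.
    """
    groups: Dict[tuple, Dict[str, str]] = defaultdict(dict)
    counters: Dict[tuple, int] = defaultdict(int)
    for r in records:
        per_node = (r["node"], r["sync_seq"], r["code"])
        k = counters[per_node]  # occurrence index among same-code records
        counters[per_node] += 1
        groups[(r["sync_seq"], r["code"], k)][r["node"]] = r["id"]
    linked = [g for g in groups.values() if len(g) == len(nodes)]
    unlinked = [rid for g in groups.values() if len(g) < len(nodes)
                for rid in g.values()]
    return linked, unlinked
```

With hypothetical codes A, B, A, C for logs 0932 to 0935 on node 1 and A, B, E, E, A, E, C for logs 0942 to 0948 on node 2, the function pairs 0932 with 0942, 0933 with 0943, 0934 with 0946, and 0935 with 0948, and reports 0944, 0945, and 0947 as unlinked, which matches the example of FIG. 9.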

In 1107 or 1113, if the number of log records collected from one or more processing nodes 0212 or the gateway 0213 greatly exceeds the number collected from the other processing nodes 0212, the log records collected from that processing node 0212 or gateway 0213 are thinned out. Candidates for thinning include, in particular, consecutive log records with identical content, records other than the latest value when the data content of the same log record is updated, logs that exist only on that processing node 0212, and logs that the user rarely refers to. Thinned-out log records are only excluded from processing, screen display, and the like; they are not deleted from the hard disk on which they are stored.
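One of the thinning rules, hiding consecutive records with identical content, can be sketched as below. The function names and the `factor` used to decide that one node's record count "greatly exceeds" the others are hypothetical; the document gives no numeric criterion.

```python
from typing import Dict, List, Tuple


def needs_thinning(counts: Dict[str, int], node: str, factor: float = 3.0) -> bool:
    """True if `node` has far more records than every other node (assumed rule)."""
    others = [c for n, c in counts.items() if n != node]
    return bool(others) and counts[node] > factor * max(others)


def thin_consecutive_duplicates(records: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Split records into (shown, hidden); hidden records are not deleted."""
    shown: List[dict] = []
    hidden: List[dict] = []
    for r in records:
        if shown and shown[-1]["message"] == r["message"]:
            hidden.append(r)  # same content as the previous shown record
        else:
            shown.append(r)
    return shown, hidden
```

Returning the hidden records instead of discarding them reflects the point that thinning only affects processing and display, not storage.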

FIG. 12 is a diagram showing an example screen that presents to the user the logs arranged across nodes in the order in which processing occurred.

The screen display 1201 is the result of the linking method shown in FIG. 8, which arranges logs in the order in which processing occurred across the nodes of the fault-tolerant computer system when an input from the external system is accepted. It is displayed as the output of the processing shown in FIG. 11 in response to a command operation by the user. It lists, in order of occurrence, the logs generated by the processing that each node of the fault-tolerant computer system executed for inputs from the external system between the start time and end time shown in 1211. Each displayed line corresponds to one log record. For each log record, the display shows the source component (node) 1212, the time 1213 at which the log was generated on that component, the log type 1214, and the message 1215 contained in the log record.

FIG. 13 is a diagram showing an example screen that presents to the user the log records of the same event, linked across multiple processing nodes that execute the same processing in parallel, arranged side by side for each of the processing nodes.

The screen display 1202 shows log records of the same event side by side, based on the method shown in FIG. 9 for linking log records of the same event across multiple processing nodes that execute the same processing in parallel. It is displayed as the output of the processing shown in FIG. 11 in response to a command operation by the user. It lists, in order of occurrence, the logs generated by the processing that those processing nodes executed between the start time and end time shown in 1221. Each displayed line corresponds to one log record of the same event on the multiple processing nodes. For each log record, the display shows the log type 1222, the message 1223 contained in the log record, the time at which the record was generated on each processing node (1224, 1225, 1226), and the time difference 1227, which is the result of comparing those generation times. Here, "match" is displayed if the difference between the generation times on the processing nodes is within a set threshold, and "unmatch" is displayed if it is not.
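A row of the parallel display can be sketched as below. The function name and the dictionary layout are hypothetical, and the time difference is assumed to be the spread between the earliest and latest generation times, since the document does not define it for three or more nodes.

```python
from typing import Dict


def parallel_row(log_type: str, message: str,
                 times_by_node: Dict[str, float], threshold: float) -> dict:
    """Build one display row for a linked event (times in seconds)."""
    times = list(times_by_node.values())
    spread = max(times) - min(times)  # assumed definition of the time difference
    return {
        "type": log_type,
        "message": message,
        "times": dict(times_by_node),
        "diff": spread,
        "verdict": "match" if spread <= threshold else "unmatch",
    }
```

Because the node clocks are not synchronized, the threshold would need to be chosen with the expected clock offsets in mind.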

FIG. 14 shows an example screen display that presents to the user the result of comparing the data contents of log records for the same event, linked across the plurality of processing nodes that execute the same processing in parallel.

Screen display 1203 shows the result of comparing the data contents of log records for the same event, based on the method shown in FIG. 9 for linking log records for the same event across the plurality of processing nodes that execute the same processing in parallel. It is displayed as the output of the processing shown in FIG. 11, executed in response to a command operation by the user. Screen display 1203 shows only those log records for the same event whose data contents differ between processing nodes (differences in occurrence time are excluded). At 1231, identification information that identifies the log record whose data contents differ between the processing nodes is displayed, and at 1232, the result of comparing the data contents between the two differing processing nodes is displayed. In the figure, “TARGET” and “ORIGINAL” indicate the comparison target and the reference source of the comparison, respectively. The user can easily confirm that the final portion of “TARGET”, “000000000000”, does not match the final portion of “ORIGINAL”, “0fff000c0001”.
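The comparison behind screen display 1203 can be illustrated with a short sketch. It is an assumption-laden example rather than the method of the embodiment. The function name `differing_records`, the node names, the record identifiers, and the field names `time` and `data` are hypothetical. The sketch assumes that one node is chosen as the reference (“ORIGINAL”) and that every other node is compared against it as a “TARGET”, with the occurrence time excluded from the comparison.

```python
def differing_records(linked_records, original_node, ignore_fields=("time",)):
    """Return only the linked records whose data contents differ between nodes.

    linked_records maps a record identifier to {node name: {field: value}}.
    Fields listed in ignore_fields (the occurrence time) are not compared.
    """
    results = []
    for record_id, per_node in linked_records.items():
        original = {k: v for k, v in per_node[original_node].items()
                    if k not in ignore_fields}
        for node, fields in per_node.items():
            if node == original_node:
                continue
            target = {k: v for k, v in fields.items()
                      if k not in ignore_fields}
            if target != original:
                results.append({
                    "id": record_id,                        # shown at 1231
                    "ORIGINAL": (original_node, original),  # shown at 1232
                    "TARGET": (node, target),               # shown at 1232
                })
    return results


# Hypothetical linked records for two events on two processing nodes.
SAMPLE = {
    "sync-0001": {
        "node1": {"time": "10:00:00.000", "data": "0fff000c0001"},
        "node2": {"time": "10:00:00.004", "data": "0fff000c0001"},
    },
    "sync-0002": {
        "node1": {"time": "10:00:01.000", "data": "0fff000c0001"},
        "node2": {"time": "10:00:01.003", "data": "000000000000"},
    },
}

DIFFS = differing_records(SAMPLE, original_node="node1")
for diff in DIFFS:
    print(diff["id"], "ORIGINAL", diff["ORIGINAL"], "TARGET", diff["TARGET"])
```

In this sample, only the second record is reported, because the first differs solely in occurrence time, which the comparison ignores.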

As described above, according to the embodiment of the present invention, regardless of the complexity of the configuration of the fault tolerant computer system or of the software and other components installed on each node, the user can readily grasp the operating state of the running system and can promptly analyze the cause when a failure occurs. Even without detailed knowledge of the system configuration, the user can carry out maintenance, failure analysis, and similar work more efficiently.

Although the present invention has been described specifically on the basis of its embodiment, it is not limited to that embodiment and can be modified in various ways without departing from its gist.

201 ... fault tolerant computer system, 0202 ... user terminal, 0203 ... wide area network, 0204 ... LAN, 0211 ... maintenance node, 0212 ... processing node, 0213 ... gateway server, 0221 ... processing device, 0222 ... storage device, 0223 ... communication device, 0224 ... processing device, 0225 ... storage device, 0226 ... communication device.

Claims

1. A method in a fault tolerant computer system comprising a plurality of processing nodes connected via a network and a maintenance node that acquires logs of the plurality of processing nodes, the method comprising:
the plurality of processing nodes executing the same processing in parallel;
the plurality of processing nodes transmitting logs of the same processing to the maintenance node via the network;
the maintenance node receiving, via the network, the logs of the same processing executed on the plurality of processing nodes;
the maintenance node associating with one another the logs of the same processing on the plurality of processing nodes; and
the maintenance node, upon receiving a request for the logs from a user terminal, transmitting the logs associated as the same processing to the user terminal.

2. The method of claim 1, wherein
each of the plurality of processing nodes records in its log a synchronization number corresponding to each processing, and
the maintenance node uses the synchronization number to associate the logs of the same processing on the processing nodes.

3. The method of claim 1 or 2, wherein
each of the plurality of processing nodes transmits a synchronization signal to the other processing nodes upon reaching a synchronization point of each processing, and
each of the plurality of processing nodes suspends processing until it receives the synchronization signal from the other processing nodes.

4. The method of any one of claims 1 to 3, wherein
the maintenance node does not associate a log that is recorded on only one processing node and not on the other processing nodes.

5. The method of any one of claims 1 to 4, wherein
each of the plurality of processing nodes transmits the log to the maintenance node based on a log retention period in that processing node and a data size of the log.

6. The method of any one of claims 1 to 5, wherein
each of the plurality of processing nodes transmits the log to the maintenance node during a period in which online processing is not being executed.

7. The method of any one of claims 1 to 6, wherein,
when a failure occurs in at least one of the plurality of processing nodes,
the processing node in which the failure has occurred notifies the other processing nodes of the occurrence of the failure and transmits its own log to the maintenance node, and
the other processing nodes transmit their respective logs to the maintenance node.

8. The method of any one of claims 1 to 7, wherein
the fault tolerant computer system includes a gateway device that receives a processing request from an external device via a network or transmits a processing result for the processing request to the external device via the network,
the gateway device receives the processing request from the external device via the network and transmits the processing request to the plurality of processing nodes,
the plurality of processing nodes each execute processing for the processing request in parallel and transmit the results of the executed processing to the gateway device, and
the gateway device collates the processing results received from the plurality of processing nodes and transmits a processing result regarded as normal to the external device via the network.

9. The method of claim 8, wherein
the gateway device transmits the processing request received from the external device to the processing nodes together with identification information corresponding to the processing request, and transmits a log of the processing executed by the gateway device and a log of transmission and reception with the processing nodes and the external device to the maintenance node, together with identification information corresponding to the processing in the gateway device and execution time information of that processing,
each processing node executes processing for the processing request received from the gateway device, and transmits a log of the processing executed by the processing node and a log of transmission and reception with the gateway device to the maintenance node, together with identification information corresponding to the processing executed by the processing node and execution time information of that processing, and
the maintenance node arranges the processing logs received from the gateway device and the processing logs received from the processing nodes in order of occurrence of the processing, according to the identification information and the execution time information in the gateway device or the processing nodes.

10. The method of claim 8 or 9, wherein
the maintenance node arranges the processing logs received from the gateway device and the processing logs received from the processing nodes that have the same identification information in order of occurrence of the processing, according to the identification information and the execution time information in the gateway device or the processing nodes.

11. The method of any one of claims 1 to 10, wherein
the user terminal displays the logs associated as the same processing side by side on a screen.

12. The method of any one of claims 1 to 11, wherein
the user terminal displays on a screen a result of comparing the logs associated as the same processing.

13. A fault tolerant computer system comprising a plurality of processing nodes connected via a network and a maintenance node that acquires logs of the plurality of processing nodes, wherein
the plurality of processing nodes execute the same processing in parallel,
the plurality of processing nodes transmit logs of the same processing to the maintenance node via the network,
the maintenance node receives, via the network, the logs of the same processing executed on the plurality of processing nodes,
the maintenance node associates with one another the logs of the same processing on the plurality of processing nodes, and
the maintenance node, upon receiving a request for the logs from a user terminal, transmits the logs associated as the same processing to the user terminal.