JP2013003950A

JP2013003950A - Decentralized processing system, log collection server, log collection method, and program

Info

Publication number: JP2013003950A
Application number: JP2011136126A
Authority: JP
Inventors: Takeshi Kaji; 武志鍜治; Akihiro Yamanaka; 章裕山中; Hideji Nakamura; 英児中村
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2011-06-20
Filing date: 2011-06-20
Publication date: 2013-01-07
Anticipated expiration: 2031-06-20
Also published as: JP5566956B2

Abstract

PROBLEM TO BE SOLVED: To shorten the time an analyst waits for logs to arrive when a failure occurs in a program. SOLUTION: A distributed processing system of the present invention includes a plurality of service providing servers 10 and a log collection server 30 that collects logs from the service providing servers 10. The log collection server 30 includes: a log collection order list 33 that gives, for each program, the order in which logs are collected when a failure occurs in that program, the order being determined according to the dependencies between programs; and a log collection part 31 that, when a failure occurs in a program on a service providing server 10, collects logs from the service providing servers 10 in the order given in the log collection order list 33 for the failed program.

Description

The present invention relates to a technique for collecting logs from servers, and more particularly to a technique for collecting large-volume logs from a plurality of servers.

A distributed processing system is a system that provides a service through a plurality of servers operating in cooperation, with a plurality of programs running on each server.

In preparation for failures, each program outputs a log for identifying the cause of a failure (an analysis log) to its server. Because the logs are large, they are normally stored on each server; when a failure occurs, they are collected from each server and used by an analyst to identify the cause.

As a technique for collecting logs from servers, Non-Patent Document 1, for example, describes a technique that provides an integrated log management environment operable with simple operations, collects the various logs needed for system audits and failure investigations, and reduces management costs.

As another technique for collecting logs from servers, Non-Patent Document 2, for example, discloses a technique that focuses on the dependencies among server components and various services and automatically extracts, from all the logs of the entire system, the logs related to the event the administrator is currently focusing on.

Non-Patent Document 1: NEC, "WebSAM LogCollector", [retrieved June 7, 2011], Internet <URL: http://www.nec.co.jp/middle/WebSAM/products/LogCollector/>
Non-Patent Document 2: Mikifumi Shikita, Hiroshi Goto, "Log Management Support Method Based on Component Dependency Between Large Servers", March 15, 2008, Information Processing Society of Japan, IPSJ Transactions, Vol. 49, No. 3, pp. 1081-1089, [retrieved June 7, 2011], Internet <URL: http://hdl.handle.net/10119/7763>

However, with the techniques described in Non-Patent Documents 1 and 2, analysis starts only after the logs have been collected in a batch, so analysis cannot begin until all the logs have been retrieved.

Collecting large-volume logs from the plurality of servers that make up a distributed processing system takes a long time (several days in some cases). If analysis begins only after all the logs have arrived following a failure, identification of the cause of the failure is delayed.

Accordingly, an object of the present invention is to provide a distributed processing system, a log collection server, a log collection method, and a program that can shorten the time an analyst waits for logs to arrive when a failure occurs in a program.

The distributed processing system of the present invention is a distributed processing system comprising a plurality of service providing servers and a log collection server that collects logs from the service providing servers, wherein the log collection server includes:
a first list that represents, for each program, the order in which logs are collected when a failure occurs in that program, the order being predetermined according to the dependencies between programs; and
a log collection unit that, when a failure occurs in a program on a service providing server, collects logs from the service providing servers in the order represented in the first list for the failed program.

The log collection server of the present invention is a log collection server that collects logs from service providing servers, and includes:
a first list that represents, for each program, the order in which logs are collected when a failure occurs in that program, the order being predetermined according to the dependencies between programs; and
a log collection unit that, when a failure occurs in a program on a service providing server, collects logs from the service providing servers in the order represented in the first list for the failed program.

The log collection method of the present invention is a log collection method performed by a log collection server that collects logs from service providing servers, and includes:
registering a first list that represents, for each program, the order in which logs are collected when a failure occurs in that program, the order being predetermined according to the dependencies between programs; and
when a failure occurs in a program on a service providing server, collecting logs from the service providing servers in the order represented in the first list for the failed program.

The program of the present invention causes the log collection server to execute the log collection method.

According to the present invention, the order in which logs are collected when a failure occurs in a program is determined in advance for each program according to the dependencies between programs, and when a failure occurs the logs are collected in that order.

As a result, when a failure occurs, logs are collected automatically in the order in which the analyst needs them, which shortens the time the analyst waits for log collection.

FIG. 1 is a diagram showing an example of dependencies between programs.
FIG. 2 is a diagram showing an example in which the location where a failure occurs differs from the location of its cause.
FIG. 3 is a diagram showing an example of a log collection order.
FIG. 4 is a diagram showing an operation example of servers in a distributed processing system.
FIG. 5 is a diagram showing a configuration example of the distributed processing system of the first and second embodiments of the present invention.
FIG. 6 is a diagram showing an example of the log collection order list.
FIG. 7 is a diagram showing an example of the server list.
FIG. 8 is a sequence chart illustrating an operation example of the distributed processing system of the first embodiment of the present invention when a failure occurs.
FIG. 9 is a flowchart illustrating an example of the log collection operation of the log collection server of the first embodiment of the present invention.
FIG. 10 is a flowchart illustrating an example of the log collection operation of the log collection server of the second embodiment of the present invention.
FIG. 11 is a diagram showing a configuration example of the distributed processing system of the third embodiment of the present invention.
FIG. 12 is a sequence chart illustrating an operation example of the distributed processing system of the third embodiment of the present invention when a failure occurs.

(1) Outline of the Present Invention
First, an outline of the present invention is described.

Consider a case in which three programs, P1, P2, and P3, run on a single server. In this case, as shown in FIG. 1, the dependencies are such that P1 uses P2 and P2 uses P3. As shown in FIG. 2, assume that the program in which the failure (alarm) occurred is P1 and that the program containing the defect that is the true cause of the failure is P3.

In this case, to identify the cause of the failure in P1, the analyst first analyzes the log of P1. If that analysis shows that P1 is not the cause, the analyst next analyzes the log of P2, which P1 uses. If that analysis shows that P2 is not the cause, the analyst then analyzes the log of P3, which P2 uses. Analysis of the P3 log finally identifies the defect in P3 as the true cause of the failure.

In the above case, the analyst needs the logs in the order P1 → P2 → P3. Collecting the logs in that order therefore shortens the time the analyst waits for logs to arrive.

Accordingly, in the present invention, as shown in FIG. 3, the order in which logs are collected when a failure occurs in a program is determined in advance for each program according to the dependencies between programs, and when a failure occurs the logs are collected in that order.

As a result, when a failure occurs, logs are collected automatically in the order P1 → P2 → P3 that the analyst needs, which shortens the time the analyst waits for log collection.
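As an illustration of how a collection order could be derived from the dependencies, the following Python sketch walks the "uses" relationships outward from the failed program. The dictionary `DEPENDS_ON`, the function name `collection_order`, and the breadth-first traversal are all assumptions made for this example; the text itself only states that the order is determined in advance according to the dependencies.

```python
from collections import deque

# Hypothetical dependency map: each program lists the programs it uses.
# It mirrors the case described above: P1 uses P2, and P2 uses P3.
DEPENDS_ON = {"P1": ["P2"], "P2": ["P3"], "P3": []}


def collection_order(failed_program, depends_on):
    """Return programs in the order their logs would be collected.

    Breadth-first walk from the failed program, so logs of programs closer
    to the alarm come first. The traversal strategy is an assumption.
    """
    order, seen = [], {failed_program}
    queue = deque([failed_program])
    while queue:
        program = queue.popleft()
        order.append(program)
        for used in depends_on.get(program, []):
            if used not in seen:
                seen.add(used)
                queue.append(used)
    return order


print(collection_order("P1", DEPENDS_ON))  # ['P1', 'P2', 'P3']
```

Because the order depends only on the failed program, it could be computed once and stored, which fits the idea of a predetermined list.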

In contrast, consider a case in which logs are collected without regard to the dependencies between programs, for example, the P1 log first, the P3 log next, and the P2 log last.

In this case, after analyzing the P1 log the analyst wants to analyze the P2 log, but at that point only the P3 log may have been collected. The analyst then has to wait for the P2 log to arrive, which holds up the analysis work.
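The effect of the collection order on waiting time can be checked with a small calculation. The sketch below assumes that logs are transferred one after another, that every log takes the same time to transfer and to analyze, and that the analyst reads the logs in the order P1, P2, P3. The function name and the figures (4 hours per transfer, 3 hours per analysis) are hypothetical and chosen only for illustration.

```python
def analyst_idle_time(collection_order, analysis_order, transfer_hours, analysis_hours):
    """Total hours the analyst spends idle, waiting for logs to arrive.

    Logs are transferred sequentially in collection_order. The analyst reads
    them in analysis_order and can start a log only after it has arrived.
    """
    arrival, clock = {}, 0.0
    for program in collection_order:
        clock += transfer_hours
        arrival[program] = clock

    now, idle = 0.0, 0.0
    for program in analysis_order:
        if arrival[program] > now:
            idle += arrival[program] - now   # wait for the log to arrive
            now = arrival[program]
        now += analysis_hours                # analyze the log
    return idle


needed = ["P1", "P2", "P3"]
print(analyst_idle_time(["P1", "P2", "P3"], needed, 4, 3))  # 6.0
print(analyst_idle_time(["P1", "P3", "P2"], needed, 4, 3))  # 9.0
```

Under these assumptions the dependency-ordered collection leaves the analyst idle for 6 hours, against 9 hours when the P2 log arrives last.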

The above case is an example with a single server. In a distributed processing system, however, as shown in FIG. 4, each program runs on a plurality of servers, and those servers operate in cooperation. Because the logs of each program are large and are stored across a plurality of servers, collecting all the logs when a failure occurs takes a very long time.

When a failure occurs, the cause must be identified as quickly as possible, so the shorter the time spent on log collection the better. The analyst, however, cannot analyze all of the large-volume logs at once. If the logs are provided in the order in which the analyst examines them, the time the analyst waits for logs to arrive is shortened and the analysis work is not held up.

Accordingly, in the present invention, the logs are collected and provided to the analyst in accordance with the log collection order of FIG. 3: first the P1 logs of servers 1 to N, next the P2 logs of servers 1 to N, and finally the P3 logs of servers 1 to N.

This shortens the time the analyst waits for logs to arrive.
(2) Embodiments of the Present Invention
Next, embodiments of the present invention are described.
(2-1) First Embodiment
FIG. 5 shows a configuration example of the distributed processing system of this embodiment.

As shown in FIG. 5, the distributed processing system of this embodiment includes a plurality of (more specifically, several hundred or more) service providing servers 10 (hereinafter, "server" means a service providing server), an alarm monitoring server 20, and a log collection server 30.

The servers 10 are classified into groups, each group providing the same service.

Each server 10 stores the logs needed for analysis in case a failure occurs.

In FIG. 5, Pzi denotes the i-th program installed on each server 10 of group z (z = A, B, ...; i = 1, 2, ...), and Lzi denotes the log of program Pzi (the same applies to the subsequent drawings).

Each server 10 has an alarm transmission unit 11 that transmits an alarm when a failure occurs in a program on that server. The alarm contains information identifying the server 10 itself and information identifying the program in which the failure occurred.
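The alarm contents described above can be modeled as a small record. The following Python sketch is only an illustration: the class name `Alarm`, the field names, and the helper `make_alarm` are hypothetical, and the text does not specify any wire format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    """Alarm sent by a service providing server (field names are hypothetical)."""
    server_id: str   # identifies the server that raised the alarm, e.g. "10-A1"
    program_id: str  # identifies the program in which the failure occurred, e.g. "PA1"


def make_alarm(server_id: str, program_id: str) -> Alarm:
    """Build the alarm that the alarm transmission unit 11 would send."""
    return Alarm(server_id=server_id, program_id=program_id)


alarm = make_alarm("10-A1", "PA1")
print(alarm.server_id, alarm.program_id)  # 10-A1 PA1
```

Carrying both identifiers lets the receiver determine which group to collect from and which collection order to apply.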

The alarm monitoring server 20 has an alarm receiving unit 21 that receives alarms from a server 10 in which a failure has occurred, and an alarm display unit 22 that displays the alarms received by the alarm receiving unit 21.

The log collection server 30 has a log collection unit 31 that collects logs from the servers 10 when a failure occurs, a log storage unit 32 that stores the logs collected from the servers 10, a log collection order list 33, and a server list 34.

FIG. 6 shows an example of the log collection order list 33.

As shown in FIG. 6, the log collection order list 33 is a first list that represents, for each program, the order in which logs are collected when a failure occurs in that program, the order being predetermined according to the dependencies between programs.
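One possible in-memory form of such a list is a mapping from each program to its ordered logs, as in the Python sketch below. The entries are invented for illustration and are not taken from FIG. 6; the names `LOG_COLLECTION_ORDER_LIST` and `target_log` are likewise assumptions.

```python
# Hypothetical contents of the log collection order list 33: for each program,
# the logs to collect, in order, when that program fails.
LOG_COLLECTION_ORDER_LIST = {
    "PA1": ["LA1", "LA2", "LA3"],
    "PA2": ["LA2", "LA3"],
    "PA3": ["LA3"],
}


def target_log(failed_program, x):
    """Return the X-th log (1-based) to collect for a failed program, or None."""
    order = LOG_COLLECTION_ORDER_LIST.get(failed_program, [])
    return order[x - 1] if 1 <= x <= len(order) else None


print(target_log("PA1", 1))  # LA1
print(target_log("PA1", 4))  # None
```

Returning `None` once X runs past the end of the list gives a caller a simple way to detect that no uncollected logs remain.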

FIG. 7 shows an example of the server list 34.

As shown in FIG. 7, the server list 34 is a second list that represents, for each group, information (server name, IP address, and so on) about the servers 10 belonging to that group.

In the distributed processing system, services are basically provided on a per-group basis, and the servers 10 belonging to the same group operate in cooperation.

Therefore, when a failure occurs in a server 10, the log collection unit 31 collects logs from all the servers 10 that belong to the same group as that server 10.
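The group lookup described above could be sketched as follows. The structure of `SERVER_LIST`, the server names, and the IP addresses (taken from the 192.0.2.0/24 documentation range) are placeholders, not the contents of FIG. 7, and the function name is hypothetical.

```python
# Hypothetical contents of the server list 34: group -> servers in that group.
SERVER_LIST = {
    "A": [{"name": "10-A1", "ip": "192.0.2.11"},
          {"name": "10-A2", "ip": "192.0.2.12"}],
    "B": [{"name": "10-B1", "ip": "192.0.2.21"}],
}


def servers_in_same_group(failed_server):
    """Return every server in the group that contains failed_server."""
    for servers in SERVER_LIST.values():
        if any(server["name"] == failed_server for server in servers):
            return servers
    raise KeyError(f"unknown server: {failed_server}")


print([s["name"] for s in servers_in_same_group("10-A1")])  # ['10-A1', '10-A2']
```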

The operation of the distributed processing system of this embodiment is described below.

FIG. 8 is a sequence chart illustrating an operation example of the distributed processing system of this embodiment when a failure occurs in program PA1 running on server 10-A1.

As shown in FIG. 8, each program Pzi constantly outputs its log Lzi to its server 10 (step A1).

Suppose that a failure occurs in program PA1 on server 10-A1 (step A2).

The alarm transmission unit 11 of server 10-A1 then transmits an alarm to the alarm monitoring server 20 (step A3).

In the alarm monitoring server 20, the alarm receiving unit 21 receives the alarm and displays it on the alarm display unit 22 (step A4).

By monitoring the alarm display unit 22 of the alarm monitoring server 20, the operator 20A confirms that a failure has occurred in program PA1 on server 10-A1 (step A5).

Next, the operator 20A instructs the log collection unit 31 of the log collection server 30 to collect the logs related to program PA1 of server 10-A1 (step A6).

The log collection unit 31 then collects logs from all the servers 10 belonging to the same group as server 10-A1, following the flow shown in FIG. 9 and described later, and stores the collected logs in the log storage unit 32 (step A7).

The analyst 30A starts analysis as soon as log LA1 has been collected (step A8).

Here, the log collection unit 31 collects the logs in the order LA1 → LA2 → LA3 ... → LAX. Therefore, if the true cause of the failure is not PA1, the analyst 30A goes on to analyze the logs in the order LA2 → LA3 ....

FIG. 9 is a flowchart illustrating an example of the log collection operation of the log collection unit 31.

As shown in FIG. 9, the log collection unit 31 first initializes N and X to 1 (step B1).

Next, the log collection unit 31 identifies, from the log collection order list 33, the X-th target log to be collected when a failure occurs in program PA1 (step B2).

Next, the log collection unit 31 identifies, from the server list 34, the N-th target server in group A, the group to which the failed server 10-A1 belongs (step B3).

The N-th server is determined, for example, by the order of the numeric part of the server names in the server list 34 or by the order in which the servers are listed in the server list 34.

Next, the log collection unit 31 collects the target log from the target server (step B4).

Next, the log collection unit 31 increments N (step B5) and determines whether group A contains any server from which the target log has not yet been collected (step B6). If such a server remains (YES in step B6), the process returns to step B3.

If no such server remains in group A (NO in step B6), the log collection unit 31 increments X and resets N to 1 (step B7), and then determines whether any of the logs to be collected for a failure of program PA1 remain uncollected (step B8). If an uncollected log remains (YES in step B8), the process returns to step B2; otherwise (NO in step B8), the process ends.
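The flow of steps B1 to B8 can be sketched in Python as two nested loops, with X indexing the target log and N indexing the target server. This is a minimal sketch under stated assumptions: the function name, the argument shapes, and the `fetch` callable (which stands in for the actual log transfer) are hypothetical. The loops test their conditions before each pass, whereas the flowchart tests after collecting, so the two differ only when a list is empty.

```python
def collect_logs(order_list, group_servers, failed_program, fetch):
    """Collect logs log by log, visiting every server in the group for each log.

    order_list:    failed program -> ordered list of log names (list 33)
    group_servers: servers of the failed server's group, in visiting order (list 34)
    fetch:         hypothetical callable (server, log_name) -> log data
    """
    collected = []
    logs = order_list[failed_program]
    x = 1                                   # step B1: X = 1
    while x <= len(logs):                   # step B8: any uncollected log?
        log_name = logs[x - 1]              # step B2: X-th target log
        n = 1                               # steps B1 and B7: N = 1
        while n <= len(group_servers):      # step B6: any uncollected server?
            server = group_servers[n - 1]   # step B3: N-th target server
            data = fetch(server, log_name)  # step B4: collect the target log
            collected.append((server, log_name, data))
            n += 1                          # step B5
        x += 1                              # step B7
    return collected


order = {"PA1": ["LA1", "LA2"]}
servers = ["10-A1", "10-A2"]
result = collect_logs(order, servers, "PA1", lambda server, log: f"{log}@{server}")
print([(server, log) for server, log, _ in result])
# [('10-A1', 'LA1'), ('10-A2', 'LA1'), ('10-A1', 'LA2'), ('10-A2', 'LA2')]
```

Every server's LA1 arrives before any LA2, which matches the order LA1 → LA2 → ... in which the analyst is expected to work.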

As described above, according to this embodiment, the order in which logs are collected when a failure occurs in a program is determined in advance for each program according to the dependencies between programs, and when a failure occurs the logs are collected in that order.

As a result, when a failure occurs, logs are collected automatically in the order in which the analyst needs them, which shortens the time the analyst waits for log collection.
(2-2) Second Embodiment
The distributed processing system of this embodiment has the same configuration as that of the first embodiment but operates differently.

Specifically, in the first embodiment, when logs are collected from the servers 10 belonging to the same group as the failed server 10, they are collected in the order of the numeric part of the server names or in the order listed in the server list 34.

In this embodiment, by contrast, logs are first collected from the failed server 10 and then from the other servers 10 belonging to the same group as the failed server 10.

The operation of the distributed processing system of this embodiment is described below.

This embodiment differs from the first embodiment only in the log collection operation of the log collection unit 31; the other operations are the same. Therefore, only an example of the log collection operation of this embodiment is described below, with reference to FIG. 10.

As shown in FIG. 10, the log collection unit 31 first determines the failed server 10-A1 to be the server from which logs are collected first (step C1).

Next, the log collection unit 31 initializes X to 1 (step C2).

Next, the log collection unit 31 identifies, from the log collection order list 33, the X-th target log to be collected when a failure occurs in program PA1 (step C3).

Next, the log collection unit 31 collects the target log from server 10-A1 (step C4).

Next, the log collection unit 31 increments X (step C5) and determines whether any of the logs to be collected for a failure of program PA1 remain uncollected (step C6). If an uncollected log remains (YES in step C6), the process returns to step C3; otherwise (NO in step C6), the process proceeds to step C7.

Thereafter, steps C7 to C14, which are the same as steps B1 to B8 shown in FIG. 9, are performed.
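A sketch of this two-phase collection in Python follows. The function name and the `fetch` callable are hypothetical. Skipping the failed server in the second phase is an assumption made here to avoid fetching the same log twice; it follows the description that logs are subsequently collected from the other servers in the group.

```python
def collect_logs_failed_server_first(order_list, group_servers, failed_server,
                                     failed_program, fetch):
    """Collect all target logs from the failed server, then from the rest of the group."""
    logs = order_list[failed_program]
    collected = []

    # Steps C1 to C6: every target log from the failed server, in list order.
    for log_name in logs:
        collected.append((failed_server, log_name, fetch(failed_server, log_name)))

    # Steps C7 to C14: the same log-by-log pass as FIG. 9 over the group.
    # The failed server is skipped (assumption) because its logs are already held.
    for log_name in logs:
        for server in group_servers:
            if server != failed_server:
                collected.append((server, log_name, fetch(server, log_name)))
    return collected


order = {"PA1": ["LA1", "LA2"]}
servers = ["10-A1", "10-A2", "10-A3"]
result = collect_logs_failed_server_first(
    order, servers, "10-A2", "PA1", lambda server, log: f"{log}@{server}")
print([(server, log) for server, log, _ in result])
# [('10-A2', 'LA1'), ('10-A2', 'LA2'), ('10-A1', 'LA1'), ('10-A3', 'LA1'),
#  ('10-A1', 'LA2'), ('10-A3', 'LA2')]
```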

As described above, according to this embodiment, logs are collected from the failed server 10 first, so the logs of the failed server 10 can be analyzed with priority.

The other effects are the same as those of the first embodiment.
(2-3) Third Embodiment
FIG. 11 shows a configuration example of the distributed processing system of this embodiment.

As shown in FIG. 11, the distributed processing system of this embodiment differs from the first embodiment shown in FIG. 5 in that the alarm monitoring server 20 is removed and an alarm receiving unit 35 is added to the log collection server 30.

In the first embodiment, the operator 20A monitors the alarm monitoring server 20 and, on confirming an alarm, manually instructs the log collection server 30 to collect logs; the log collection server 30 collects the logs with the operator 20A's instruction as the trigger.

In this embodiment, by contrast, each server 10 transmits alarms directly to the log collection server 30, and the log collection server 30 automatically collects the logs related to an alarm with reception of that alarm as the trigger.

FIG. 12 is a sequence chart illustrating an operation example of the distributed processing system of this embodiment when a failure occurs in program PA1 running on server 10-A1.

As shown in FIG. 12, each program Pzi constantly outputs its log Lzi to its server 10 (step D1).

Suppose that a failure occurs in program PA1 on server 10-A1 (step D2).

The alarm transmission unit 11 of server 10-A1 then transmits an alarm to the log collection server 30 (step D3).

In the log collection server 30, the alarm receiving unit 35 receives the alarm and instructs the log collection unit 31 to collect the logs related to that alarm (step D4).

The log collection unit 31 then collects logs from all the servers 10 belonging to the same group as server 10-A1, following the flow shown in FIG. 9 or FIG. 10 described above, and stores the collected logs in the log storage unit 32 (step D5).
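Steps D3 to D5 can be modeled as an alarm handler that triggers collection without operator input. The Python class below is a toy model under stated assumptions: the class and method names, the shape of the two lists, and the `fetch` callable are hypothetical, and the collection pass follows the FIG. 9 ordering.

```python
class LogCollectionServer:
    """Toy model of the log collection server 30 in the third embodiment."""

    def __init__(self, order_list, server_list, fetch):
        self.order_list = order_list    # list 33: program -> ordered log names
        self.server_list = server_list  # list 34: group -> server names
        self.fetch = fetch              # hypothetical (server, log_name) -> data
        self.log_store = []             # stands in for the log storage unit 32

    def on_alarm(self, server_id, program_id):
        """Alarm receiving unit 35: a received alarm triggers collection (step D4)."""
        self._collect(server_id, program_id)

    def _collect(self, server_id, program_id):
        """Log collection unit 31: collect from the whole group and store (step D5)."""
        group = next(servers for servers in self.server_list.values()
                     if server_id in servers)
        for log_name in self.order_list[program_id]:
            for server in group:
                self.log_store.append(
                    (server, log_name, self.fetch(server, log_name)))


collector = LogCollectionServer(
    order_list={"PA1": ["LA1", "LA2"]},
    server_list={"A": ["10-A1", "10-A2"], "B": ["10-B1"]},
    fetch=lambda server, log: f"{log}@{server}",
)
collector.on_alarm("10-A1", "PA1")
print([(server, log) for server, log, _ in collector.log_store])
# [('10-A1', 'LA1'), ('10-A2', 'LA1'), ('10-A1', 'LA2'), ('10-A2', 'LA2')]
```

Only servers in group A are contacted, because the alarm identifies a server in that group.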

The analyst 30A starts analysis as soon as log LA1 has been collected (step D6).

As described above, according to this embodiment, each server 10 transmits alarms directly to the log collection server 30, so the logs related to an alarm can be collected automatically without a manual instruction from the operator.

The other effects are the same as those of the first or second embodiment.

Description of Symbols
10 Service providing server
20 Alarm monitoring server
21 Alarm receiving unit
22 Alarm display unit
30 Log collection server
31 Log collection unit
32 Log storage unit
33 Log collection order list
34 Server list
35 Alarm receiving unit

Claims

A distributed processing system comprising a plurality of service providing servers and a log collection server that collects logs from the service providing servers, wherein
the log collection server includes:
a first list that represents, for each program, the order in which logs are collected when a failure occurs in that program, the order being predetermined according to the dependencies between programs; and
a log collection unit that, when a failure occurs in a program on a service providing server, collects logs from the service providing servers in the order represented in the first list for the failed program.

The distributed processing system according to claim 1, wherein
the log collection server further includes a second list that represents, for each group, the service providing servers belonging to that group, and
the log collection unit, when a failure occurs in a program on a service providing server, collects logs from all the service providing servers belonging to the same group as the service providing server in which the failure occurred.

The distributed processing system according to claim 2, wherein
the log collection unit, when a failure occurs in a program on a service providing server, first collects logs from the service providing server in which the failure occurred and then collects logs from the other service providing servers belonging to the same group as that service providing server.

The distributed processing system according to claim 1, wherein
the service providing server includes an alarm transmission unit that transmits an alarm indicating that a failure has occurred in a program on the service providing server,
the distributed processing system further includes an alarm monitoring server that receives and displays the alarm from the service providing server, and
the log collection unit collects logs from the service providing servers, triggered by input of an instruction to collect the logs related to the program on the service providing server in which the failure occurred.

The distributed processing system according to claim 1, wherein
the service providing server includes an alarm transmission unit that transmits an alarm indicating that a failure has occurred in a program on the service providing server,
the log collection server further includes an alarm receiving unit that receives the alarm from the service providing server, and
the log collection unit collects logs from the service providing servers, triggered by reception of the alarm.

A log collection server that collects logs from service providing servers, comprising:
a first list that represents, for each program, the order in which logs are collected when a failure occurs in that program, the order being predetermined according to the dependencies between programs; and
a log collection unit that, when a failure occurs in a program on a service providing server, collects logs from the service providing servers in the order represented in the first list for the failed program.

A log collection method performed by a log collection server that collects logs from service providing servers, the method comprising:
registering a first list that represents, for each program, the order in which logs are collected when a failure occurs in that program, the order being predetermined according to the dependencies between programs; and
when a failure occurs in a program on a service providing server, collecting logs from the service providing servers in the order represented in the first list for the failed program.

A program for causing the log collection server to execute the log collection method according to claim 7.