JP2014096031A

JP2014096031A - Failure information management device, information processing device, distributed parallel processing system, failure information management method and computer program

Info

Publication number: JP2014096031A
Application number: JP2012247274A
Authority: JP
Inventors: Emiko Miyazaki; 恵美子宮崎
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2012-11-09
Filing date: 2012-11-09
Publication date: 2014-05-22

Abstract

PROBLEM TO BE SOLVED: To provide a technique for improving the efficiency of failure analysis work by reducing the amount of failure analysis information generated when a failure occurs in a distributed parallel processing system. SOLUTION: A failure information management device includes: a failure information storage unit 11; a failure information receiving unit 12 that receives failure information from a node 20; a similarity determination unit 13 that determines whether the failure information storage unit 11 stores failure information satisfying a predetermined similarity condition with respect to the received failure information; an output permission information transmitting unit 14 that, according to the determination result of the similarity determination unit 13, transmits output permission information indicating whether failure analysis information, which includes the contents of the main storage used by the process in which the failure occurred at the node 20, may be output; and a failure information registration unit 15 that registers the received failure information in the failure information storage unit 11 when it is determined that no failure information satisfying the predetermined similarity condition is stored there.

Description

The present invention relates to a technique for managing information about failures that occur during execution of distributed parallel processing.

In general, when a failure occurs while a process is running, an information processing device outputs the memory image of the running process (context, stack, and data) as a core file. Such a core file is used to analyze the cause of the failure. Hereinafter, such a core file is referred to as failure analysis information. Failure analysis information is generally created for each process.

In recent years, distributed parallel processing systems, in which processing is executed by a plurality of information processing devices (nodes), have become well known. Such a system divides the overall processing into a plurality of pieces that can be executed in parallel and runs them as separate processes distributed over a plurality of nodes. When a failure occurs during distributed parallel processing in such a system, a plurality of nodes output failure analysis information for each process they were executing. Analyzing the cause of the failure therefore requires analyzing a plurality of pieces of failure analysis information.

Meanwhile, the number of nodes in distributed parallel processing systems is increasing year by year. As a result, the total volume of failure analysis information that is output is also growing.

When a distributed parallel processing system executes, for example, a distributed parallel program based on MPI (Message Passing Interface), in most cases the nodes execute similar processing in synchronization. In this case, the processes on the nodes often encounter the same failure at the same location in the distributed parallel program, so a plurality of similar pieces of failure analysis information are generated, one per process. For example, when an out-of-bounds array reference occurs in a 10-process distributed parallel program, nine of those processes have often raised the same exception at the same location. Even then, nine similar pieces of failure analysis information are generated for the nine processes. In practice, however, analyzing just one of them is almost always sufficient.

One technique related to this problem is described in Patent Document 1. When a failure occurs in a distributed parallel processing system, this related technique identifies the node where the failure occurred and collects failure analysis data from that node. It further identifies other nodes considered to be related to the failure at that node (for example, the communication partners of the failed node) and collects failure analysis data from those nodes as well.

Another technique related to this problem is described in Patent Document 2. This related technique reduces the size of failure analysis information when a plurality of processes use shared memory. Specifically, of the memory image to be included in the failure analysis information for each process, it outputs the contents of the shared memory as a separate file and includes in the failure analysis information the information that identifies that separate file. If a separate file representing the contents of the shared memory associated with the target process has already been created, the technique does not output the file again and instead outputs failure analysis information containing the information that identifies the existing file. In this way, this related technique reduces the size of the failure analysis information generated for each process.

Patent Document 1: Japanese Patent Laying-Open No. 2005-4513
Patent Document 2: Japanese Patent Laying-Open No. 2008-197980

However, the techniques described in Patent Document 1 and Patent Document 2 have the following problems.

The related technique described in Patent Document 1 identifies the failed node and related nodes and acquires failure analysis data from them. However, when this technique is applied to a distributed parallel processing system that executes a distributed parallel program such as the one described above, the same failure is likely to occur at the same location on all nodes performing similar processing. The technique therefore collects similar failure analysis data from all of those nodes. Consequently, it cannot mitigate the growth in the total volume of failure analysis information or the resulting inefficiency of analysis work.

The related technique described in Patent Document 2 can reduce the size of the failure analysis information generated for a plurality of processes that use shared memory. However, in a distributed parallel processing system that executes a distributed parallel program such as the one described above on a plurality of nodes that share no memory with one another, this technique cannot reduce the size of the failure analysis information. Consequently, it too cannot mitigate the growth in the total volume of failure analysis information or the resulting inefficiency of analysis work.

One conceivable way to keep the growing volume of failure analysis information from exhausting disk space is to use the resource limit function provided by a typical operating system. In that case, however, once the resource limit is exceeded, no further failure analysis information is output. Information necessary for analysis may therefore fail to be collected.

The present invention has been made to solve the above problems. Its object is to provide a technique that, when a failure occurs in a distributed parallel processing system, suppresses growth in the total volume of failure analysis information and thereby makes analysis work more efficient.

The failure information management device of the present invention includes: a failure information storage unit that stores failure information about failures that occurred during process execution on the nodes that execute a distributed parallel program in a distributed manner; a failure information receiving unit that receives the failure information transmitted by a node in response to the occurrence of a failure; a similarity determination unit that determines whether failure information satisfying a predetermined similarity condition with respect to the failure information received from the node is stored in the failure information storage unit; an output permission information transmitting unit that, according to the determination result of the similarity determination unit, transmits to the node output permission information indicating whether failure analysis information, which includes the contents of the main storage device that was being used by the process when the failure occurred on the node, may be output; and a failure information registration unit that registers the failure information received from the node in the failure information storage unit when it is determined that no failure information satisfying the predetermined similarity condition is stored there.

The information processing device of the present invention operates as a node capable of executing a distributed parallel program in a distributed manner, and includes: a process execution unit that executes processing based on the distributed parallel program; a failure information collection unit that, when a failure occurs during process execution by the process execution unit, collects failure information about the failure; a failure information transmission unit that transmits the failure information collected by the failure information collection unit to the failure information management device described above; and a failure analysis information output unit that, when the output permission information received from the failure information management device indicates that output is permitted, outputs failure analysis information including the contents of the main storage device that was being used by the process.

The distributed parallel processing system of the present invention includes the failure information management device described above and the information processing device described above.

In the failure information management method of the present invention, when a failure occurs during process execution on a node that executes a distributed parallel processing program in a distributed manner, the node where the failure occurred collects failure information about the failure and transmits it to a failure information management device. The failure information management device determines whether failure information satisfying a predetermined similarity condition with respect to the received failure information is stored in a failure information storage unit. If no such failure information is stored, the device transmits to the node output permission information indicating that output of failure analysis information, which includes the contents of the main storage device that was being used by the process when the failure occurred on the node, is permitted, and registers the failure information received from the node in the failure information storage unit. If such failure information is stored, the device transmits to the node output permission information indicating that output of the failure analysis information is not permitted. The node where the failure occurred outputs the failure analysis information when the output permission information received from the failure information management device indicates that output is permitted.

The computer program of the present invention uses a failure information storage unit, which stores failure information about failures that occurred during process execution on the nodes that execute a distributed parallel program in a distributed manner, and causes a computer device to execute: a failure information receiving step of receiving the failure information transmitted by a node in response to the occurrence of a failure; a similarity determination step of determining whether failure information satisfying a predetermined similarity condition with respect to the failure information received from the node is stored in the failure information storage unit; an output permission information transmitting step of transmitting to the node, according to the determination result of the similarity determination step, output permission information indicating whether failure analysis information, which includes the contents of the main storage device that was being used by the process when the failure occurred on the node, may be output; and a failure information registration step of registering the failure information received from the node in the failure information storage unit when it is determined that no failure information satisfying the predetermined similarity condition is stored there.

Another computer program of the present invention causes a computer device to execute: a process execution step of executing processing based on a distributed parallel program; a failure information collection step of collecting, when a failure occurs during process execution in the process execution step, failure information about the failure; a failure information transmission step of transmitting the failure information collected in the failure information collection step to a device that executes the computer program described above; and a failure analysis information output step of outputting, when the output permission information received from that device indicates that output is permitted, failure analysis information including the contents of the main storage device that was being used by the process.

The present invention can provide a technique that, when a failure occurs in a distributed parallel processing system, suppresses growth in the total volume of failure analysis information and makes analysis work more efficient.

A block diagram showing the configuration of a distributed parallel processing system as a first embodiment of the present invention.
A functional block diagram of the devices constituting the distributed parallel processing system as the first embodiment of the present invention.
A flowchart explaining the operation of the distributed parallel processing system as the first embodiment of the present invention.
A block diagram showing the configuration of a distributed parallel processing system as a second embodiment of the present invention.
A functional block diagram of the devices constituting the distributed parallel processing system as the second embodiment of the present invention.
A flowchart explaining the operation of the distributed parallel processing system as the second embodiment of the present invention.
A schematic diagram explaining a specific example of the operation of the distributed parallel processing system as the second embodiment of the present invention.
Another schematic diagram explaining the specific example of the operation of the distributed parallel processing system as the second embodiment of the present invention.
Another schematic diagram explaining the specific example of the operation of the distributed parallel processing system as the second embodiment of the present invention.
Another schematic diagram explaining the specific example of the operation of the distributed parallel processing system as the second embodiment of the present invention.
A schematic diagram explaining the summary information output in the specific example of the operation of the distributed parallel processing system as the second embodiment of the present invention.

Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.

(First Embodiment)

FIG. 1 shows the configuration of a distributed parallel processing system 1 as a first embodiment of the present invention.

In FIG. 1, the distributed parallel processing system 1 includes a failure information management device 10 and one or more information processing devices 20. The failure information management device 10 and the information processing devices 20 are connected via a network so that they can communicate with one another. Each information processing device 20 can execute a distributed parallel program in a distributed manner. Hereinafter, the information processing device 20 is also referred to as a node 20. Although FIG. 1 shows one failure information management device 10 and four nodes 20, this does not limit the number of devices included in the distributed parallel processing system of the present invention.

Here, the failure information management device 10 and each node 20 are each implemented by a computer device that includes a CPU (Central Processing Unit), a RAM (Random Access Memory), a ROM (Read Only Memory), a storage device such as a hard disk, and a network interface. The failure information management device 10 may be implemented by the same computer device as one of the nodes 20 included in the distributed parallel processing system 1.

Next, FIG. 2 shows the functional block configuration of the devices constituting the distributed parallel processing system 1.

In FIG. 2, the failure information management device 10 includes a failure information storage unit 11, a failure information receiving unit 12, a similarity determination unit 13, an output permission information transmitting unit 14, and a failure information registration unit 15. The node 20 includes a process execution unit 21, a failure information collection unit 22, a failure information transmission unit 23, and a failure analysis information output unit 24. Here, the failure information storage unit 11 is implemented by the storage device. The failure information receiving unit 12, the output permission information transmitting unit 14, the failure information transmission unit 23, and the failure analysis information output unit 24 are each implemented by the network interface and by the CPU, which loads a computer program stored in the ROM and the storage device into the RAM and executes it. The similarity determination unit 13, the failure information registration unit 15, the process execution unit 21, and the failure information collection unit 22 are each implemented by the CPU, which loads a computer program stored in the ROM and the storage device into the RAM and executes it. The hardware configuration of the failure information management device 10, of each node 20, and of the functional blocks of each device is not limited to the configuration described above.

First, each functional block of the node 20 will be described.

The process execution unit 21 executes processing based on the distributed parallel program.

When a failure occurs during process execution by the process execution unit 21, the failure information collection unit 22 collects failure information about the failure. The failure information may include, for example, failure cause information indicating the cause of the failure. It may also include traceback information representing the history of instructions executed up to the failure. In addition, the failure information may include any other failure-related information that can be collected on the node 20 when the failure occurs.
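As a minimal sketch of what such failure information could look like, the following Python example builds a record from a caught exception. The `FailureInfo` class, its field names, and `collect_failure_info` are hypothetical and not part of the original text; the call-stack frames used here are only an approximation of the "traceback information" the description mentions.

```python
import traceback
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FailureInfo:
    """Hypothetical container for the failure information described above."""
    cause: str              # failure cause information (assumed: exception type name)
    trace: Tuple[str, ...]  # traceback information (assumed: "function:line" per frame)


def collect_failure_info(exc: BaseException) -> FailureInfo:
    """Build failure information from a caught exception."""
    frames = traceback.extract_tb(exc.__traceback__)
    return FailureInfo(
        cause=type(exc).__name__,
        trace=tuple(f"{frame.name}:{frame.lineno}" for frame in frames),
    )


def read_element(data, index):
    # Deliberately faulty step standing in for part of a distributed parallel program.
    return data[index]


try:
    read_element([1, 2, 3], 10)  # out-of-bounds reference
except IndexError as exc:
    info = collect_failure_info(exc)
    print(info.cause)      # exception type name
    print(info.trace[-1])  # innermost frame, where the failure was raised
```

The record is small compared with a memory image, which is what makes it cheap to send to the failure information management device before deciding whether a full dump is needed.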

The failure information transmission unit 23 transmits the failure information collected by the failure information collection unit 22 to the failure information management device 10.

The failure analysis information output unit 24 receives output permission information from the failure information management device 10. When the received output permission information indicates that output is permitted, the failure analysis information output unit 24 outputs failure analysis information. Here, failure analysis information is information that includes the contents (memory image) of the main storage device (RAM) that was being used by the process when the failure occurred. For example, the failure analysis information output unit 24 may output the failure analysis information as a file to the storage device of its own device.
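The node-side decision can be sketched as follows. This is an illustration under assumptions: the function name, the `core.<pid>` file naming, and passing the memory image as a `bytes` value are all hypothetical, since the description only states that the output may be written as a file to the node's own storage device.

```python
import os
import tempfile
from typing import Optional


def output_failure_analysis_info(output_permitted: bool,
                                 memory_image: bytes,
                                 directory: str,
                                 pid: int) -> Optional[str]:
    """Write the memory image to a file only when output is permitted."""
    if not output_permitted:
        return None  # output not permitted: nothing is written
    path = os.path.join(directory, f"core.{pid}")  # hypothetical naming scheme
    with open(path, "wb") as f:
        f.write(memory_image)
    return path


with tempfile.TemporaryDirectory() as tmp:
    written = output_failure_analysis_info(True, b"\x00" * 16, tmp, pid=1234)
    skipped = output_failure_analysis_info(False, b"\x00" * 16, tmp, pid=1235)
    print(os.path.basename(written), skipped)
```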

Next, each functional block of the failure information management device 10 will be described.

The failure information storage unit 11 stores failure information about failures that occurred on the nodes 20 during process execution of the distributed parallel program.

The failure information receiving unit 12 receives the failure information transmitted by a node 20 in response to the occurrence of a failure.

The similarity determination unit 13 determines whether failure information satisfying a predetermined similarity condition with respect to the failure information received from the node 20 is stored in the failure information storage unit 11. For example, the similarity condition may be that the failure cause information included in the two pieces of failure information is identical. Alternatively, the similarity condition may be that the traceback information included in the two pieces of failure information is identical.
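The two example similarity conditions can be expressed as interchangeable predicates. The sketch below is illustrative only: `FailureInfo`, `same_cause`, `same_traceback`, and `has_similar` are hypothetical names, and the sample values are made up.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple


@dataclass(frozen=True)
class FailureInfo:
    cause: str              # failure cause information
    trace: Tuple[str, ...]  # traceback information


SimilarityCondition = Callable[[FailureInfo, FailureInfo], bool]


def same_cause(a: FailureInfo, b: FailureInfo) -> bool:
    """Similarity condition: identical failure cause information."""
    return a.cause == b.cause


def same_traceback(a: FailureInfo, b: FailureInfo) -> bool:
    """Similarity condition: identical traceback information."""
    return a.trace == b.trace


def has_similar(received: FailureInfo,
                stored: Iterable[FailureInfo],
                condition: SimilarityCondition) -> bool:
    """Return True if any stored entry satisfies the condition against `received`."""
    return any(condition(received, s) for s in stored)


stored = [FailureInfo("IndexError", ("main:10", "kernel:42"))]
received = FailureInfo("IndexError", ("main:10", "kernel:57"))
print(has_similar(received, stored, same_cause))      # True
print(has_similar(received, stored, same_traceback))  # False
```

The choice of condition controls how aggressively dumps are suppressed: matching on cause alone suppresses more output than matching on the full traceback.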

The output permission information transmitting unit 14 transmits output permission information, which indicates whether failure analysis information may be output, to the relevant node 20 according to the determination result of the similarity determination unit 13. Specifically, when failure information satisfying the predetermined similarity condition with respect to the failure information received from the node 20 is stored in the failure information storage unit 11, the output permission information transmitting unit 14 transmits output permission information indicating that output is not permitted. When no such failure information is stored in the failure information storage unit 11, it transmits output permission information indicating that output is permitted.

When no failure information satisfying the predetermined similarity condition with respect to the failure information received from the node 20 is stored in the failure information storage unit 11, the failure information registration unit 15 registers the received failure information in the failure information storage unit 11.

The operation of the distributed parallel processing system 1 configured as described above will be described with reference to FIG. 3. It is assumed that, on the node 20, the process execution unit 21 is already executing a process of the distributed parallel program. In FIG. 3, the left side shows the operation of the failure information management device 10, the right side shows the operation of the node 20, and the dashed arrows connecting the two sides indicate the flow of data.

First, the failure information collection unit 22 of the node 20 collects failure information in response to the occurrence of a failure (step S1).

Next, the failure information transmission unit 23 transmits the failure information collected in step S1 to the failure information management device 10 (step S2).

Next, the failure information receiving unit 12 of the failure information management device 10 receives the failure information transmitted in step S2 (step S3).

Next, the similarity determination unit 13 determines whether failure information satisfying the predetermined similarity condition with respect to the failure information received in step S3 is stored in the failure information storage unit 11 (step S4).

If it is determined that no failure information satisfying the predetermined similarity condition has been stored yet, the output permission information transmitting unit 14 transmits output permission information indicating that output is permitted to the node 20 (step S5).

Next, the failure information registration unit 15 registers the failure information received in step S3 in the failure information storage unit 11 (step S6).

On the other hand, if it is determined in step S4 that failure information satisfying the predetermined similarity condition is already stored, the output permission information transmitting unit 14 transmits output permission information indicating that output is not permitted to the node 20 (step S7).

次に、ノード２０の障害解析情報出力部２４は、ステップＳ５またはステップＳ７で送信された出力可否情報を受信し、受信した出力可否情報が出力可を示すか否かを判断する（ステップＳ８）。 Next, the failure analysis information output unit 24 of the node 20 receives the output permission information transmitted in step S5 or step S7, and determines whether the received output permission information indicates that output is permitted (step S8).

ここで、出力可否情報が出力可を示すと判断した場合、障害解析情報出力部２４は、障害解析情報を出力する（ステップＳ９）。 Here, when it is determined that the output permission information indicates that output is possible, the failure analysis information output unit 24 outputs the failure analysis information (step S9).

一方、出力可否情報が出力否を示すと判断した場合、障害解析情報出力部２４は、障害解析情報を出力しない。 On the other hand, when it is determined that the output permission information indicates that output is not permitted, the failure analysis information output unit 24 does not output the failure analysis information.

以上で、分散並列処理システム１は、動作を終了する。 As described above, the distributed parallel processing system 1 ends the operation.
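
The flow of steps S1 to S9 can be sketched in a few lines of Python. This is only an illustrative sketch, not the patented implementation: the class and function names are hypothetical, the exchange between node and device is reduced to a direct method call, and the similarity condition is assumed to be equality of failure cause and traceback.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FailureInfo:
    cause: str                  # failure cause information
    traceback: Tuple[str, ...]  # traceback information


class FailureInfoManager:
    """Hypothetical stand-in for the failure information management device 10."""

    def __init__(self) -> None:
        self.store: List[FailureInfo] = []  # failure information storage unit 11

    def handle(self, received: FailureInfo) -> bool:
        """Steps S3 to S7. Returns True for "output permitted", False otherwise."""
        # S4: assumed similarity condition is equality of cause and traceback.
        if any(stored == received for stored in self.store):
            return False             # S7: output not permitted
        self.store.append(received)  # S6: register the new failure information
        return True                  # S5: output permitted


def on_failure(manager: FailureInfoManager, info: FailureInfo, dumps: List[str]) -> None:
    """Node side (S1, S2, S8, S9). `dumps` stands in for the dump destination."""
    permitted = manager.handle(info)        # S2: send, S8: evaluate the reply
    if permitted:
        dumps.append(f"dump:{info.cause}")  # S9: output failure analysis information
```

In this sketch, calling `on_failure` twice with the same `FailureInfo` appends only one entry to `dumps`, which mirrors the suppression of duplicate failure analysis information.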

次に、本発明の第１の実施の形態の効果について述べる。 Next, effects of the first exemplary embodiment of the present invention will be described.

本発明の第１の実施の形態としての分散並列処理システムは、分散並列処理システムにおける障害発生時に、障害解析情報の全体容量の肥大化を抑止して、解析作業をより効率化することができる。 When a failure occurs in a distributed parallel processing system, the distributed parallel processing system according to the first exemplary embodiment of the present invention can suppress growth in the total volume of failure analysis information and make the analysis work more efficient.

その理由は、分散並列プログラムを実行する各ノードは、障害発生時に障害情報を採取して障害情報管理装置に送信し、障害情報管理装置の類似性判断部は、受信した障害情報に類似する障害情報が既に障害情報格納部に格納されているか否かを判断し、類似する障害情報がまだ格納されていない場合には、その障害情報を障害情報格納部に登録するとともに、障害解析情報の出力可を示す出力可否情報をノードに送信し、類似する障害情報が既に格納されている場合には、障害解析情報の出力否を示す出力可否情報をノードに送信するからである。そして、各ノードは、障害情報管理装置から出力可を示す出力可否情報を受信した場合に、障害解析情報を出力するからである。 The reason is as follows. Each node that executes the distributed parallel program collects failure information when a failure occurs and transmits it to the failure information management device. The similarity determination unit of the failure information management device determines whether failure information similar to the received failure information is already stored in the failure information storage unit. If similar failure information is not yet stored, the device registers the received failure information in the failure information storage unit and transmits, to the node, output permission information indicating that output of the failure analysis information is permitted. If similar failure information is already stored, the device transmits, to the node, output permission information indicating that output of the failure analysis information is not permitted. Each node then outputs the failure analysis information when it receives, from the failure information management device, output permission information indicating that output is permitted.

これにより、本実施の形態としての分散並列処理システムでは、類似する障害が複数のノードで発生した場合には、いずれかのノードが障害解析情報を出力した後は、他のノードが類似する障害解析情報を重複して出力することがない。特に、障害情報に含まれる障害原因情報およびトレースバック情報が同一であることを類似条件とする場合、本実施の形態としての分散並列処理システムは、同じ分散並列処理プログラムにおける同じ箇所で同じ原因により発生した障害について、重複して障害解析情報を出力することがない。したがって、本実施の形態は、障害解析情報の肥大化を抑止する。そして、本実施の形態を利用することにより、解析作業者は、分散並列プログラムの同一の箇所において発生した同一の原因による同一の障害については、１つの障害解析情報を解析すればよいことになる。したがって、本実施の形態は、解析作業をより効率化する。 As a result, in the distributed parallel processing system according to the present embodiment, when similar failures occur in a plurality of nodes, once any one of the nodes has output failure analysis information, the other nodes do not output similar failure analysis information redundantly. In particular, when the similarity condition is that the failure cause information and the traceback information included in the failure information are identical, the distributed parallel processing system according to the present embodiment does not output failure analysis information redundantly for failures that occurred at the same location in the same distributed parallel program due to the same cause. The present embodiment therefore suppresses growth of the failure analysis information. Moreover, with the present embodiment, the analyst needs to analyze only one piece of failure analysis information for identical failures that occurred at the same location in the distributed parallel program due to the same cause. The present embodiment therefore makes the analysis work more efficient.
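
When the similarity condition is "identical failure cause information and identical traceback information", the check reduces to comparing a composite key. The following Python sketch illustrates that idea; the function names and the use of a set are assumptions made for illustration, and a different similarity condition would need a different key function or a pairwise comparison.

```python
from typing import Sequence, Set, Tuple

Key = Tuple[str, Tuple[str, ...]]


def similarity_key(cause: str, traceback: Sequence[str]) -> Key:
    """Hypothetical helper: two failures count as similar when their keys are equal."""
    return (cause, tuple(traceback))


def is_first_of_its_kind(cause: str, traceback: Sequence[str], seen: Set[Key]) -> bool:
    """Return True (and remember the key) only for the first failure with this key."""
    key = similarity_key(cause, traceback)
    if key in seen:
        return False  # a similar failure is already known
    seen.add(key)
    return True
```

Because the key is hashable, membership can be tested without scanning every stored entry, which may matter when many nodes report failures at once.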

（第２の実施の形態） (Second Embodiment)
次に、本発明の第２の実施の形態について図面を参照して詳細に説明する。なお、本実施の形態の説明において参照する各図面において、本発明の第１の実施の形態と同一の構成および同様に動作するステップには同一の符号を付して本実施の形態における詳細な説明を省略する。 Next, a second exemplary embodiment of the present invention will be described in detail with reference to the drawings. In the drawings referred to in the description of the present embodiment, the same components as in the first exemplary embodiment of the present invention, and steps that operate in the same manner, are given the same reference numerals, and their detailed description is omitted in the present embodiment.

本発明の第２の実施の形態としての分散並列処理システム２の構成を、図４に示す。図４において、分散並列処理システム２は、障害情報管理装置３０と、１つ以上のノード（情報処理装置）４０と、共有情報蓄積装置５０とを備える。障害情報管理装置３０、ノード４０および共有情報蓄積装置５０は、ノード間接続装置９０によって互いに通信可能に接続されている。ここで、ノード間接続装置９０は、複数の装置間をクロスバースイッチ等で接続することにより装置間の高速データ転送を実現する装置である。なお、障害情報管理装置３０および共有情報蓄積装置５０は、それぞれ、いずれかのノード４０を構成するコンピュータ装置によって構成されていてもよい。つまり、障害情報管理装置３０および共有情報蓄積装置５０は、分散並列プログラムを分散して実行可能ないずれかのノード４０上にそれぞれ実現されていてもよい。また、障害情報管理装置３０、共有情報蓄積装置５０およびノード４０をそれぞれ構成する各コンピュータ装置は、ノード間接続装置９０を介して他の装置とデータ転送を行うためのデータ転送モジュールを有しているものとする。 The configuration of a distributed parallel processing system 2 as a second exemplary embodiment of the present invention is shown in FIG. 4. In FIG. 4, the distributed parallel processing system 2 includes a failure information management device 30, one or more nodes (information processing devices) 40, and a shared information storage device 50. The failure information management device 30, the nodes 40, and the shared information storage device 50 are communicably connected to one another by an inter-node connection device 90. Here, the inter-node connection device 90 is a device that realizes high-speed data transfer between devices by connecting a plurality of devices with a crossbar switch or the like. The failure information management device 30 and the shared information storage device 50 may each be configured by a computer device that constitutes one of the nodes 40. That is, the failure information management device 30 and the shared information storage device 50 may each be realized on any of the nodes 40 capable of executing the distributed parallel program in a distributed manner. It is also assumed that each of the computer devices constituting the failure information management device 30, the shared information storage device 50, and the nodes 40 has a data transfer module for transferring data to and from other devices via the inter-node connection device 90.

次に、分散並列処理システム２を構成する各装置の機能ブロック構成を図５に示す。図５において、障害情報管理装置３０は、本発明の第１の実施の形態としての障害情報管理装置１０に対して、障害情報登録部１５に替えて障害情報登録部３５を備え、さらに、サマリ情報出力部３６を備える点が異なる。また、ノード４０は、本発明の第１の実施の形態としてのノード２０に対して、障害情報採取部２２に替えて障害情報採取部４２と、障害解析情報出力部２４に替えて障害解析情報出力部４４とを備える点が異なる。 Next, the functional block configuration of each device constituting the distributed parallel processing system 2 is shown in FIG. 5. In FIG. 5, the failure information management device 30 differs from the failure information management device 10 of the first exemplary embodiment of the present invention in that it includes a failure information registration unit 35 in place of the failure information registration unit 15 and further includes a summary information output unit 36. The node 40 differs from the node 20 of the first exemplary embodiment of the present invention in that it includes a failure information collection unit 42 in place of the failure information collection unit 22 and a failure analysis information output unit 44 in place of the failure analysis information output unit 24.

まず、障害情報管理装置３０の各機能ブロックについて説明する。 First, each functional block of the failure information management apparatus 30 will be described.

障害情報登録部３５は、本発明の第１の実施の形態における障害情報登録部１５と同様に、ノード４０から受信された障害情報に対して所定の類似条件を満たす障害情報が、障害情報格納部１１に格納されていないと判断された場合には、受信された障害情報を障害情報格納部１１に登録する。 Like the failure information registration unit 15 in the first exemplary embodiment of the present invention, the failure information registration unit 35 registers the received failure information in the failure information storage unit 11 when it is determined that no failure information satisfying the predetermined similarity condition with respect to the failure information received from the node 40 is stored in the failure information storage unit 11.

さらに、障害情報登録部３５は、ノード４０から受信された障害情報に対して所定の類似条件を満たす障害情報が、障害情報格納部１１に格納されていると判断された場合、既に格納されているその障害情報に、新たに受信された障害情報に含まれるノード４０の識別情報を追加して含めるよう更新する。さらに、障害情報に、障害が発生したプロセスの識別情報が含まれる場合には、障害情報登録部３５は、既に格納されている、類似条件を満たす障害情報に、新たに受信された障害情報に含まれるプロセスの識別情報を追加して含めるよう更新してもよい。 Further, when it is determined that failure information satisfying the predetermined similarity condition with respect to the failure information received from the node 40 is stored in the failure information storage unit 11, the failure information registration unit 35 updates the already stored failure information so that it additionally includes the identification information of the node 40 contained in the newly received failure information. Furthermore, when the failure information includes identification information of the process in which the failure occurred, the failure information registration unit 35 may update the already stored failure information that satisfies the similarity condition so that it additionally includes the process identification information contained in the newly received failure information.
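
The register-or-update behaviour of the failure information registration unit 35 can be sketched as follows. This is a minimal Python illustration under stated assumptions: failure information is modelled as a dictionary with the hypothetical keys `cause`, `traceback`, `node_id`, and `process_id`, and the similarity condition is assumed to be equality of cause and traceback.

```python
from typing import Dict, Tuple

Key = Tuple[str, Tuple[str, ...]]


def register_or_update(store: Dict[Key, dict], info: dict) -> bool:
    """Return True if `info` was newly registered, False if an entry was updated."""
    key = (info["cause"], tuple(info["traceback"]))
    occurrence = (info["node_id"], info["process_id"])
    if key in store:
        # A similar failure is already stored: only record where it happened again.
        store[key]["occurrences"].append(occurrence)
        return False
    # No similar failure stored yet: register the received failure information.
    store[key] = {
        "cause": info["cause"],
        "traceback": list(info["traceback"]),
        "occurrences": [occurrence],
    }
    return True
```

The return value doubles as the output permission decision in this sketch: `True` corresponds to "output permitted" and `False` to "output not permitted".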

サマリ情報出力部３６は、障害情報格納部１１に格納されている障害情報を、サマリ情報として出力する。例えば、サマリ情報出力部３６は、分散並列プログラムの全てのプロセスが各ノード４０上で終了した後、サマリ情報を出力するようにしてもよい。また、サマリ情報出力部３６は、共有情報蓄積装置５０に対して、サマリ情報を出力してもよい。 The summary information output unit 36 outputs the failure information stored in the failure information storage unit 11 as summary information. For example, the summary information output unit 36 may output the summary information after all processes of the distributed parallel program are completed on each node 40. The summary information output unit 36 may output summary information to the shared information storage device 50.

次に、ノード４０の各機能ブロックについて説明する。 Next, each functional block of the node 40 will be described.

障害情報採取部４２は、プロセス実行部２１によるプロセス実行中に障害が発生すると、該障害に関して、障害原因情報およびトレースバック情報を採取する。そして、障害情報採取部４２は、採取した障害原因情報およびトレースバック情報と、自装置の識別情報と、プロセスの識別情報とを含む障害情報を生成する。 When a failure occurs during process execution by the process execution unit 21, the failure information collection unit 42 collects failure cause information and traceback information regarding the failure. Then, the failure information collection unit 42 generates failure information including the collected failure cause information and traceback information, the identification information of the own device, and the identification information of the process.
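
As a rough analogue of the failure information collection unit 42, the following Python sketch builds a failure information record from a caught exception. It is an assumption-laden illustration only: the source describes failures such as signals in a distributed parallel program, whereas this sketch uses the exception type name as the failure cause, the host name as the node identification information, and function names from the Python traceback as the traceback information.

```python
import os
import socket
import traceback


def collect_failure_info(exc: BaseException) -> dict:
    """Hypothetical collector: gather cause, traceback, node ID, and process ID."""
    frames = traceback.extract_tb(exc.__traceback__)
    return {
        "cause": type(exc).__name__,                            # failure cause information
        "traceback": [f"{f.name}:{f.lineno}" for f in frames],  # traceback information
        "node_id": socket.gethostname(),                        # identification of this node
        "process_id": os.getpid(),                              # identification of the process
    }
```

Line numbers are included in the traceback entries here so that failures at different locations in the same function produce different records; whether that granularity is appropriate depends on the chosen similarity condition.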

障害解析情報出力部４４は、本発明の第１の実施の形態における障害解析情報出力部２４に対して、障害解析情報を、共有情報蓄積装置５０に対して出力する点が異なる。また、障害解析情報出力部４４は、障害解析情報を、共有情報蓄積装置５０の共有ファイル上に出力するようにしてもよい。 The failure analysis information output unit 44 is different from the failure analysis information output unit 24 according to the first embodiment of the present invention in that failure analysis information is output to the shared information storage device 50. The failure analysis information output unit 44 may output the failure analysis information on a shared file of the shared information storage device 50.

共有情報蓄積装置５０は、障害情報管理装置３０から出力されるサマリ情報および各ノード４０から出力される障害解析情報を蓄積する。また、共有情報蓄積装置５０は、各ノード４０から出力される障害解析情報を、共有ファイルに蓄積してもよい。 The shared information storage device 50 stores summary information output from the failure information management device 30 and failure analysis information output from each node 40. Further, the shared information storage device 50 may store the failure analysis information output from each node 40 in a shared file.

以上のように構成された分散並列処理システム２の動作を、図６を参照して説明する。図６において、分散並列処理システム２は、図３を参照して説明した本発明の第１の実施の形態としての分散並列処理システム１と略同様にステップＳ１〜Ｓ９まで実行する。 The operation of the distributed parallel processing system 2 configured as described above will be described with reference to FIG. 6. In FIG. 6, the distributed parallel processing system 2 executes steps S1 to S9 in substantially the same manner as the distributed parallel processing system 1 of the first exemplary embodiment of the present invention described with reference to FIG. 3.

ただし、本実施の形態では、ノード４０において、ステップＳ９での障害解析情報の出力先が異なる。障害解析情報出力部４４は、障害解析情報を、共有情報蓄積装置５０に対して出力する（ステップＳ９）。 However, in the present embodiment, the output destination of the failure analysis information in step S9 is different in the node 40. The failure analysis information output unit 44 outputs failure analysis information to the shared information storage device 50 (step S9).

また、本実施の形態では、障害情報管理装置３０において、ステップＳ４において所定の類似条件を満たす障害情報が既に格納されていると判断された場合の動作が異なる。 In the present embodiment, the failure information management apparatus 30 operates differently when it is determined in step S4 that failure information satisfying a predetermined similarity condition has already been stored.

この場合、まず、出力可否情報送信部１４は、出力否を示す出力可否情報をノード４０に送信する（ステップＳ７）。 In this case, the output permission information transmitting unit 14 first transmits output permission information indicating that output is not permitted to the node 40 (step S7).

次に、障害情報登録部３５は、既に障害情報格納部１１に格納されている、類似条件を満たす障害情報に、新たに受信された障害情報に含まれるノード４０の識別情報およびプロセスの識別情報を追加して含める（ステップＳ１１）。 Next, the failure information registration unit 35 adds the identification information of the node 40 and the identification information of the process included in the newly received failure information to the failure information satisfying the similarity condition already stored in the failure information storage unit 11. Is added and included (step S11).

また、本実施の形態では、ステップＳ１〜Ｓ９、Ｓ１１を実行後、障害情報管理装置３０は、サマリ情報を出力する。 Moreover, in this Embodiment, after performing step S1-S9 and S11, the failure information management apparatus 30 outputs summary information.

具体的には、サマリ情報出力部３６は、対象の分散並列プログラムの各プロセスが各ノード４０において終了したと判断すると（ステップＳ１２でＹｅｓ）、それまでに障害情報格納部１１に格納された障害情報を、サマリ情報として共有情報蓄積装置５０に対して出力する（ステップＳ１３）。 Specifically, when the summary information output unit 36 determines that each process of the target distributed parallel program has ended on each node 40 (Yes in step S12), it outputs the failure information stored in the failure information storage unit 11 up to that point to the shared information storage device 50 as summary information (step S13).

以上で、分散並列処理システム２は、動作を終了する。 As described above, the distributed parallel processing system 2 ends the operation.
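
Steps S12 and S13 can be sketched as a function that emits the summary only after every node has reported completion. The Python below is a hypothetical illustration: how process termination is detected and the format of the summary are not specified in the text, so the set comparison and the JSON serialization are assumptions.

```python
import json
from typing import Iterable, List, Optional


def output_summary(stored: List[dict],
                   all_nodes: Iterable[str],
                   finished_nodes: Iterable[str]) -> Optional[str]:
    """Return the summary text once all nodes have finished, otherwise None."""
    # S12: have the processes ended on every node?
    if set(all_nodes) - set(finished_nodes):
        return None  # some node is still running, so no summary yet
    # S13: output everything accumulated in the failure information storage.
    return json.dumps(stored, ensure_ascii=False, indent=2)
```

A caller would write the returned text to the shared information storage device; that step is omitted here because the storage interface is not described.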

次に、分散並列処理システム２における動作の具体例について、図７〜図１１を参照して説明する。図７〜図１１において、分散並列処理システム２は、障害情報管理装置３０と、４つのノード４０ａ〜４０ｄと、共有情報蓄積装置５０とを含む。なお、４つのノード４０ａ〜４０ｄは、それぞれ、識別情報（ノードＩＤ）として、４０ａ〜４０ｄを有しているものとする。 Next, a specific example of the operation of the distributed parallel processing system 2 will be described with reference to FIGS. 7 to 11. In FIGS. 7 to 11, the distributed parallel processing system 2 includes a failure information management device 30, four nodes 40a to 40d, and a shared information storage device 50. The four nodes 40a to 40d are assumed to have 40a to 40d, respectively, as identification information (node IDs).

また、この例では、障害情報には、障害原因情報、トレースバック情報、ノードＩＤおよびプロセスＩＤが含まれるものする。なお、図７〜図１１では、説明のため、障害原因情報およびトレースバック情報を文字列で示しているが、障害原因情報は、その種別を表す数値で表されていてもよく、トレースバック情報は、命令のアドレス値等によって表されていてもよい。 In this example, the failure information is assumed to include failure cause information, traceback information, a node ID, and a process ID. In FIGS. 7 to 11, the failure cause information and the traceback information are shown as character strings for the sake of explanation; however, the failure cause information may be represented by a numerical value indicating its type, and the traceback information may be represented by instruction address values or the like.

このとき、まず、ノード４０ａにおいて障害が発生したときの分散並列処理システム２の動作について、図７を参照して説明する。 At this time, first, the operation of the distributed parallel processing system 2 when a failure occurs in the node 40a will be described with reference to FIG.

ここでは、まず、ノード４０ａの障害情報採取部４２は、図７に示す障害情報１１０ａを採取し（ステップＳ１）、障害情報管理装置３０に送信する（ステップＳ２）。 Here, first, the fault information collection unit 42 of the node 40a collects the fault information 110a shown in FIG. 7 (step S1) and transmits it to the fault information management apparatus 30 (step S2).

次に、障害情報管理装置３０において、障害情報受信部１２は、障害情報１１０ａを受信する（ステップＳ３）。 Next, in the failure information management apparatus 30, the failure information receiving unit 12 receives the failure information 110a (step S3).

ここでは、まだ、障害情報格納部１１は、いずれの障害情報も格納していない。 Here, the failure information storage unit 11 has not yet stored any failure information.

そこで、類似性判断部１３は、受信された障害情報１１０ａに対して所定の類似条件を満たす障害情報が、障害情報格納部１１に格納されていないと判断する（ステップＳ４でＮｏ）。 Therefore, the similarity determination unit 13 determines that failure information satisfying a predetermined similarity condition with respect to the received failure information 110a is not stored in the failure information storage unit 11 (No in step S4).

次に、出力可否情報送信部１４は、受信された障害情報１１０ａに類似した障害情報が未だ登録されていないので、出力可を示す出力可否情報を、ノード４０ａに対して送信する（ステップＳ５）。 Next, since no failure information similar to the received failure information 110a has been registered yet, the output permission information transmitting unit 14 transmits output permission information indicating that output is permitted to the node 40a (step S5).

次に、障害情報登録部３５は、障害情報１１０ａを、障害情報格納部１１に登録する（ステップＳ６）。 Next, the failure information registration unit 35 registers the failure information 110a in the failure information storage unit 11 (step S6).

次に、ノード４０ａにおいて、障害解析情報出力部４４は、出力可否情報を受信し、出力可を示すと判断する（ステップＳ８でＹｅｓ）。 Next, in the node 40a, the failure analysis information output unit 44 receives the output enable / disable information and determines that the output is enabled (Yes in step S8).

そこで、障害解析情報出力部４４は、図７に示す障害解析情報４４０ａを、共有情報蓄積装置５０に対して出力する（ステップＳ９）。 Therefore, the failure analysis information output unit 44 outputs the failure analysis information 440a shown in FIG. 7 to the shared information storage device 50 (step S9).

続いて、ノード４０ｂにおいて障害が発生したものとする。このときの分散並列処理システム２の動作について、図８を参照して説明する。 Subsequently, it is assumed that a failure has occurred in the node 40b. The operation of the distributed parallel processing system 2 at this time will be described with reference to FIG.

ここでは、まず、ノード４０ｂの障害情報採取部４２は、図８に示す障害情報１１０ｂを採取し（ステップＳ１）、障害情報管理装置３０に送信する（ステップＳ２）。 Here, first, the fault information collection unit 42 of the node 40b collects the fault information 110b shown in FIG. 8 (step S1) and transmits it to the fault information management apparatus 30 (step S2).

次に、障害情報管理装置３０では、障害情報受信部１２は、障害情報１１０ｂを受信する（ステップＳ３）。 Next, in the failure information management device 30, the failure information receiving unit 12 receives the failure information 110b (step S3).

次に、類似性判断部１３は、受信された障害情報１１０ｂに対して所定の類似条件を満たす障害情報が、障害情報格納部１１に格納されていないと判断する（ステップＳ４でＮｏ）。具体的には、類似性判断部１３は、受信された障害情報１１０ｂと、既に格納されている障害情報１１０ａとは、障害原因情報およびトレースバック情報が異なるので、所定の類似条件を満たさないと判断している。 Next, the similarity determination unit 13 determines that no failure information satisfying the predetermined similarity condition with respect to the received failure information 110b is stored in the failure information storage unit 11 (No in step S4). Specifically, because the received failure information 110b and the already stored failure information 110a differ in failure cause information and traceback information, the similarity determination unit 13 determines that the predetermined similarity condition is not satisfied.

次に、出力可否情報送信部１４は、受信された障害情報１１０ｂに類似した障害情報が未だ登録されていないので、出力可を示す出力可否情報を、ノード４０ｂに対して送信する（ステップＳ５）。 Next, since no failure information similar to the received failure information 110b has been registered yet, the output permission information transmitting unit 14 transmits output permission information indicating that output is permitted to the node 40b (step S5).

次に、障害情報登録部３５は、障害情報１１０ｂを、障害情報格納部１１に登録する（ステップＳ６）。 Next, the failure information registration unit 35 registers the failure information 110b in the failure information storage unit 11 (step S6).

次に、ノード４０ｂにおいて、障害解析情報出力部４４は、出力可否情報を受信し、出力可を示すと判断する（ステップＳ８でＹｅｓ）。 Next, in the node 40b, the failure analysis information output unit 44 receives the output enable / disable information and determines that the output is enabled (Yes in step S8).

そこで、障害解析情報出力部４４は、図８に示す障害解析情報４４０ｂを、共有情報蓄積装置５０に対して出力する（ステップＳ９）。 Therefore, the failure analysis information output unit 44 outputs the failure analysis information 440b shown in FIG. 8 to the shared information storage device 50 (step S9).

続いて、ノード４０ｃにおいて障害が発生したものとする。このときの分散並列処理システム２の動作について、図９を参照して説明する。 Subsequently, it is assumed that a failure has occurred in the node 40c. The operation of the distributed parallel processing system 2 at this time will be described with reference to FIG.

ここでは、まず、ノード４０ｃの障害情報採取部４２は、図９に示す障害情報１１０ｃを採取し（ステップＳ１）、障害情報管理装置３０に送信する（ステップＳ２）。 Here, first, the fault information collection unit 42 of the node 40c collects the fault information 110c shown in FIG. 9 (step S1) and transmits it to the fault information management apparatus 30 (step S2).

次に、障害情報管理装置３０では、障害情報受信部１２は、障害情報１１０ｃを受信する（ステップＳ３）。 Next, in the failure information management apparatus 30, the failure information receiving unit 12 receives the failure information 110c (step S3).

次に、類似性判断部１３は、受信された障害情報１１０ｃに対して所定の類似条件を満たす障害情報が、障害情報格納部１１に格納されていると判断する（ステップＳ４でＹｅｓ）。具体的には、類似性判断部１３は、既に格納されている障害情報１１０ｂと、新たに受信された障害情報１１０ｃとは、同一の障害原因情報「ＳＩＧＳＥＧＶ」および同一のトレースバック情報「関数Ｃ（）＋ＺＺＺ、関数ＡＡ（）＋ＸＸＸ、ｍａｉｎ（）」を含むので、所定の類似条件を満たすと判断している。 Next, the similarity determination unit 13 determines that failure information satisfying the predetermined similarity condition with respect to the received failure information 110c is stored in the failure information storage unit 11 (Yes in step S4). Specifically, because the already stored failure information 110b and the newly received failure information 110c contain the same failure cause information "SIGSEGV" and the same traceback information "function C() + ZZZ, function AA() + XXX, main()", the similarity determination unit 13 determines that the predetermined similarity condition is satisfied.

次に、出力可否情報送信部１４は、受信された障害情報１１０ｃに類似した障害情報１１０ｂが既に登録されていると判断されたので、出力否を示す出力可否情報を、ノード４０ｃに対して送信する（ステップＳ７）。 Next, since it is determined that the fault information 110b similar to the received fault information 110c has already been registered, the output propriety information transmitting unit 14 transmits output propriety information indicating output prohibition to the node 40c. (Step S7).

次に、障害情報登録部３５は、受信された障害情報１１０ｃに含まれるノードＩＤ「４０ｃ」およびプロセスＩＤ「３０００」を、障害情報１１０ｂに追加して障害情報格納部１１を更新する（ステップＳ１１）。 Next, the failure information registration unit 35 updates the failure information storage unit 11 by adding the node ID "40c" and the process ID "3000" included in the received failure information 110c to the failure information 110b (step S11).

次に、ノード４０ｃでは、障害解析情報出力部４４は、出力可否情報を受信し、出力否を示すと判断する（ステップＳ８でＮｏ）。 Next, in the node 40c, the failure analysis information output unit 44 receives the output propriety information and determines that it indicates output failure (No in step S8).

そこで、ノード４０ｃの障害解析情報出力部４４は、障害解析情報を出力しない。 Therefore, the failure analysis information output unit 44 of the node 40c does not output failure analysis information.

続いて、ノード４０ｄにおいて障害が発生したものとする。このときの分散並列処理システム２の動作について、図１０を参照して説明する。 Subsequently, it is assumed that a failure has occurred in the node 40d. The operation of the distributed parallel processing system 2 at this time will be described with reference to FIG.

ここでは、まず、ノード４０ｄの障害情報採取部４２は、図１０に示す障害情報１１０ｄを採取し（ステップＳ１）、障害情報管理装置３０に送信する（ステップＳ２）。 Here, first, the fault information collection unit 42 of the node 40d collects the fault information 110d shown in FIG. 10 (step S1) and transmits it to the fault information management apparatus 30 (step S2).

ここでは、障害情報１１０ｄの内容は、障害情報１１０ｂに対して、所定の類似条件を満たすものであったとする。 Here, it is assumed that the content of the failure information 110d satisfies a predetermined similarity condition with respect to the failure information 110b.

したがって、障害情報管理装置３０およびノード４０ｄは、図９を参照して説明した障害情報管理装置３０およびノード４０ｃと同様に動作する。すなわち、ノード４０ｄの障害解析情報出力部４４は、障害解析情報を出力しない。また、障害情報管理装置３０は、新たに受信された障害情報１１０ｄに含まれるノードＩＤ「４０ｄ」およびプロセスＩＤ「４０００」を、障害情報１１０ｂに追加する。 Therefore, the failure information management device 30 and the node 40d operate in the same manner as the failure information management device 30 and the node 40c described with reference to FIG. That is, the failure analysis information output unit 44 of the node 40d does not output failure analysis information. Further, the failure information management apparatus 30 adds the node ID “40d” and the process ID “4000” included in the newly received failure information 110d to the failure information 110b.

次に、障害情報管理装置３０のサマリ出力動作について、図１１を参照して説明する。 Next, the summary output operation of the failure information management apparatus 30 will be described with reference to FIG.

ここでは、まず、障害情報管理装置３０のサマリ情報出力部３６は、ノード４０ａ〜４０ｄにおいて各プロセスが終了したと判断する（ステップＳ１２でＹｅｓ）。 Here, first, the summary information output unit 36 of the failure information management apparatus 30 determines that each process has ended in the nodes 40a to 40d (Yes in step S12).

そこで、サマリ出力部３６は、障害情報格納部１１に格納された障害情報１１０ａおよび１１０ｂを含むサマリ情報３６０を、共有情報蓄積装置５０に出力する。 The summary information output unit 36 therefore outputs summary information 360, which includes the failure information 110a and 110b stored in the failure information storage unit 11, to the shared information storage device 50.

このように、分散並列処理システム２の４つのノード４０ａ〜４０ｄで発生した障害のうち、３つの障害が同じ箇所および同じ障害原因によるものであった場合、分散並列処理システム２は、２つの障害解析情報４４０ａおよび４４０ｂ、ならびに、サマリ情報３６０を出力することになる。したがって、分散並列処理システム２は、４つのプロセスで発生した障害についてそれぞれ障害解析情報を出力する場合に比べて、障害解析情報の全体容量の肥大化を抑止している。そして、解析作業者は、４つの障害解析情報を解析する必要はなく、サマリ情報３６０を参照しながら２つの解析情報４４０ａおよび４４０ｂを解析するだけでよい。また、共有情報蓄積装置５０において共有ファイル上に障害解析情報を出力する場合、分散並列処理システム２は、４つの障害解析情報を共有ファイル上に出力する場合に比べて、障害解析情報の出力時間の増大も抑止できる。 In this way, when three of the failures that occurred in the four nodes 40a to 40d of the distributed parallel processing system 2 occurred at the same location due to the same cause, the distributed parallel processing system 2 outputs two pieces of failure analysis information 440a and 440b together with the summary information 360. The distributed parallel processing system 2 thus suppresses growth in the total volume of failure analysis information compared with outputting failure analysis information for each of the failures that occurred in the four processes. The analyst does not need to analyze four pieces of failure analysis information, and only needs to analyze the two pieces of failure analysis information 440a and 440b while referring to the summary information 360. In addition, when failure analysis information is output to a shared file in the shared information storage device 50, the distributed parallel processing system 2 can also suppress an increase in the output time of the failure analysis information compared with outputting four pieces of failure analysis information to the shared file.
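
The four-node example can be replayed with a short Python simulation. The sketch below is illustrative only. The cause and traceback for nodes 40b to 40d and the process IDs 3000 and 4000 follow the text; the cause, traceback, and process ID used for node 40a and the process ID used for node 40b are hypothetical placeholders, since the text does not state them.

```python
from typing import Dict, List, Tuple

Traceback = Tuple[str, ...]
Report = Tuple[str, int, str, Traceback]  # (node ID, process ID, cause, traceback)

SEGV_TB: Traceback = ("C()+ZZZ", "AA()+XXX", "main()")

reports: List[Report] = [
    ("40a", 1000, "SIGFPE", ("B()+YYY", "main()")),  # placeholder values for 110a
    ("40b", 2000, "SIGSEGV", SEGV_TB),               # process ID is a placeholder
    ("40c", 3000, "SIGSEGV", SEGV_TB),
    ("40d", 4000, "SIGSEGV", SEGV_TB),
]


def run(reports: List[Report]):
    """Replay the reports; return the nodes that dump and the resulting summary."""
    summary: Dict[Tuple[str, Traceback], List[Tuple[str, int]]] = {}
    dumping_nodes: List[str] = []
    for node_id, pid, cause, tb in reports:
        key = (cause, tb)
        if key in summary:
            summary[key].append((node_id, pid))  # S7, S11: no dump, record occurrence
        else:
            summary[key] = [(node_id, pid)]      # S5, S6: register
            dumping_nodes.append(node_id)        # S9: this node outputs its dump
    return dumping_nodes, summary


dumping_nodes, summary = run(reports)
print(dumping_nodes)  # ['40a', '40b']
```

Only nodes 40a and 40b produce failure analysis information, and the summary holds two entries, the second of which lists nodes 40b, 40c, and 40d.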

次に、本発明の第２の実施の形態の効果について述べる。 Next, the effect of the second exemplary embodiment of the present invention will be described.

本発明の第２の実施の形態としての分散並列処理システムは、分散並列処理システムにおける障害解析作業をさらに効率化することができる。 The distributed parallel processing system as the second exemplary embodiment of the present invention can further improve the efficiency of failure analysis work in the distributed parallel processing system.

その理由は、各ノードにおいて障害発生時に採取される障害情報に類似する障害情報が、障害情報管理装置の障害情報格納部に未だ格納されていない場合に、障害情報管理装置は該障害情報を障害格納部に格納し、該ノードは、障害解析情報を共有情報蓄積装置に出力するからである。また、障害情報管理装置は、全てのノードにおいて分散並列プログラムのプロセス終了後、障害情報格納部に蓄積した障害情報をサマリ情報として、共有情報蓄積装置に出力するからである。 The reason is as follows. When failure information similar to the failure information collected at a node upon occurrence of a failure is not yet stored in the failure information storage unit of the failure information management device, the failure information management device stores that failure information in the failure information storage unit, and the node outputs failure analysis information to the shared information storage device. In addition, after the processes of the distributed parallel program have ended on all nodes, the failure information management device outputs the failure information accumulated in the failure information storage unit to the shared information storage device as summary information.

これにより、解析作業者は、本実施の形態としての分散並列処理システムにおける障害発生時には、共有情報蓄積装置に出力されたサマリ情報にしたがって、共有情報蓄積装置に出力された障害解析情報を参照することにより、解析作業を行えばよい。サマリ情報には、障害原因やトレースバック情報が重複しない障害のサマリが含まれる。また、そのサマリが示す障害解析情報は、障害原因やトレースバック情報が重複しないものが出力されている。したがって、本実施の形態は、解析作業をさらに効率化することができる。 As a result, when a failure occurs in the distributed parallel processing system according to the present embodiment, the analyst can carry out the analysis work by referring to the failure analysis information output to the shared information storage device in accordance with the summary information output to the shared information storage device. The summary information contains summaries of failures whose failure causes and traceback information do not duplicate one another. Likewise, the failure analysis information indicated by those summaries is output without duplication of failure cause or traceback information. The present embodiment can therefore make the analysis work even more efficient.

また、本発明の第２の実施の形態としての分散並列処理システムは、共有情報蓄積装置の共有ファイル上に障害解析情報を出力する場合、障害解析情報の出力時間の増大を抑止することができる。 In addition, when failure analysis information is output to a shared file of the shared information storage device, the distributed parallel processing system according to the second exemplary embodiment of the present invention can suppress an increase in the output time of the failure analysis information.

その理由は、本実施の形態の分散並列処理システムは、障害原因やトレースバック情報が重複する障害解析情報を、共有ファイル上に出力しないからである。これにより、本実施の形態は、障害が発生したノード毎に出力される障害解析情報を全て共有ファイル上に出力する場合に比べて、障害解析情報の出力時間を軽減することになる。 The reason is that the distributed parallel processing system according to the present embodiment does not output the failure analysis information in which the cause of the failure or the traceback information overlaps on the shared file. As a result, the present embodiment reduces the output time of failure analysis information compared to the case where all failure analysis information output for each node in which a failure has occurred is output on a shared file.

なお、上述した本発明の各実施の形態において、障害情報管理装置は、分散並列プログラムを実行するいずれかのノードと同一のコンピュータ装置によって構成されていてもよい。 In each embodiment of the present invention described above, the failure information management apparatus may be configured by the same computer apparatus as any node that executes the distributed parallel program.

また、上述した本発明の各実施の形態において、障害情報管理装置および各ノードは、ノード間接続装置を介して接続されている例を中心に説明したが、障害情報管理装置およびノードは、互いに通信可能に接続されていれば、その他の技術によって接続されていてもよい。なお、その場合も、障害情報管理装置および各ノードは、分散並列処理に適した通信速度で通信可能に接続されていることが望ましい。 In each embodiment of the present invention described above, the failure information management device and each node have been described mainly with respect to an example in which the failure information management device and the node are connected via an inter-node connection device. As long as it is communicably connected, it may be connected by other techniques. In this case also, it is desirable that the failure information management device and each node are connected so as to be communicable at a communication speed suitable for distributed parallel processing.

また、上述した本発明の各実施の形態において、各ノードの障害情報採取部は、障害原因情報およびトレースバック情報を障害情報として採取する例を中心に説明したが、障害情報採取部は、障害発生時に採取可能なその他の情報を採取して障害情報に含め、障害情報管理装置に送信するようにしてもよい。 Further, in each of the embodiments of the present invention described above, the failure information collection unit of each node has been described mainly with respect to an example of collecting failure cause information and traceback information as failure information. Other information that can be collected at the time of occurrence may be collected and included in the failure information, and transmitted to the failure information management apparatus.

また、上述した本発明の各実施の形態において、障害情報管理装置の類似性判断部が用いる類似条件は、障害原因情報およびトレースバック情報が同一であるという条件である例を中心に説明したが、所定の類似条件は、その他の条件であってもよい。 Further, in each of the embodiments of the present invention described above, the similarity condition used by the similarity determination unit of the failure information management apparatus has been mainly described as an example where the failure cause information and the traceback information are the same. The predetermined similarity condition may be other conditions.

In each of the embodiments described above, the operations of the failure information management device and of each node, described with reference to the flowcharts, may be stored as the computer program of the present invention in a storage device (storage medium) of a computer device, and the CPU of that computer device may read and execute the computer program. In such a case, the present invention is constituted by the code of the computer program or by the storage medium.

The embodiments described above may be combined as appropriate.

The present invention is not limited to the embodiments described above and can be implemented in various forms.

Some or all of the embodiments described above can also be described as in the following supplementary notes, but are not limited to them.
(Appendix 1)
A failure information management device comprising:
a failure information storage unit that stores failure information on failures that occur during process execution at the nodes that execute a distributed parallel program in a distributed manner;
a failure information reception unit that receives the failure information transmitted by a node in response to the occurrence of a failure;
a similarity determination unit that determines whether failure information satisfying a predetermined similarity condition with respect to the failure information received from the node is stored in the failure information storage unit;
an output permission information transmission unit that, according to the determination result of the similarity determination unit, transmits to the node output permission information indicating whether failure analysis information, which includes the contents of the main storage device used by the process when the failure occurred at the node, may be output; and
a failure information registration unit that registers the failure information received from the node in the failure information storage unit when it is determined that no failure information satisfying the predetermined similarity condition with respect to the received failure information is stored in the failure information storage unit.
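The interplay of the units in Appendix 1 can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the implementation of this publication: the class `FailureInfoManager`, the method `handle`, and the `FailureInfo` fields are hypothetical names, the storage unit is modeled as an in-memory list, the similarity condition is the "identical cause and traceback" example from the embodiments, and the network transport between node and device is replaced by a plain method call.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FailureInfo:
    """Hypothetical failure information: cause plus traceback."""
    cause: str
    traceback: Tuple[str, ...]


class FailureInfoManager:
    """Sketch of the failure information management device."""

    def __init__(self) -> None:
        # Failure information storage unit (assumed to be an in-memory list).
        self._store: List[FailureInfo] = []

    def _has_similar(self, info: FailureInfo) -> bool:
        # Similarity determination unit: identical cause and traceback.
        return any(
            s.cause == info.cause and s.traceback == info.traceback
            for s in self._store
        )

    def handle(self, info: FailureInfo) -> bool:
        """Receive failure information and return the output permission.

        True means the node may output its failure analysis information;
        False means similar failure information is already stored.
        """
        if self._has_similar(info):
            return False
        # Registration unit: only previously unseen failures are stored.
        self._store.append(info)
        return True
```

With this shape, only the first process that reports a given failure is told to output its memory contents; later reports of the same failure receive a refusal, which is what limits the volume of failure analysis information.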
(Appendix 2)
The failure information management device according to appendix 1, wherein, when the failure information includes traceback information representing the history of instructions leading up to the failure, the similarity determination unit includes in the predetermined similarity condition the condition that the traceback information is identical.
(Appendix 3)
The failure information management device according to appendix 1 or appendix 2, wherein, when the failure information includes failure cause information indicating the cause of the failure, the similarity determination unit includes in the predetermined similarity condition the condition that the failure cause information is identical.
(Appendix 4)
The failure information management device according to any one of appendices 1 to 3, wherein, when the failure information includes identification information of the node at which the failure occurred, and it is determined that failure information satisfying the predetermined similarity condition with respect to the failure information received from the node is stored in the failure information storage unit, the failure information registration unit updates the stored failure information so that it additionally includes the node identification information contained in the failure information received from the node.
(Appendix 5)
The failure information management device according to any one of appendices 1 to 4, wherein, when the failure information includes identification information of the process in which the failure occurred, and it is determined that failure information satisfying the predetermined similarity condition with respect to the failure information received from the node is stored in the failure information storage unit, the failure information registration unit updates the stored failure information so that it additionally includes the process identification information contained in the failure information received from the node.
(Appendix 6)
The failure information management device according to any one of appendices 1 to 5, further comprising a summary information output unit that outputs the failure information stored in the failure information storage unit as summary information.
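The record updates of Appendices 4 and 5 and the summary output of Appendix 6 could look roughly like the sketch below. Everything named here is an assumption made for illustration: `StoredFailure`, `FailureStore`, `register_or_update`, and the summary line format do not come from this publication. The sketch keys records by the pair (cause, traceback), which implements only the "identical cause and traceback" similarity condition.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple

Key = Tuple[str, Tuple[str, ...]]


@dataclass
class StoredFailure:
    """One stored failure plus every node and process that reported it."""
    cause: str
    traceback: Tuple[str, ...]
    node_ids: List[str] = field(default_factory=list)
    process_ids: List[int] = field(default_factory=list)


class FailureStore:
    """Hypothetical storage unit keyed by (cause, traceback)."""

    def __init__(self) -> None:
        self._records: Dict[Key, StoredFailure] = {}

    def register_or_update(self, cause: str, traceback: Sequence[str],
                           node_id: str, process_id: int) -> bool:
        """Return True if the failure was new, False if a similar one existed."""
        key = (cause, tuple(traceback))
        record = self._records.get(key)
        is_new = record is None
        if is_new:
            record = StoredFailure(cause, tuple(traceback))
            self._records[key] = record
        # For new and existing records alike, remember who reported the failure.
        record.node_ids.append(node_id)
        record.process_ids.append(process_id)
        return is_new

    def summary(self) -> str:
        """Summary information: one line per distinct failure."""
        lines = []
        for r in self._records.values():
            nodes = ", ".join(sorted(set(r.node_ids)))
            lines.append(
                f"{r.cause} at {' > '.join(r.traceback)}: "
                f"{len(r.process_ids)} process(es) on {nodes}"
            )
        return "\n".join(lines)
```

Because identifiers are appended even when output is refused, the summary shows how widely each failure spread across nodes and processes, although only one set of failure analysis information was written for it.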
(Appendix 7)
An information processing device that operates as a node capable of executing a distributed parallel program in a distributed manner, comprising:
a process execution unit that executes processing based on the distributed parallel program;
a failure information collection unit that, when a failure occurs during process execution by the process execution unit, collects failure information on the failure;
a failure information transmission unit that transmits the failure information collected by the failure information collection unit to the failure information management device according to any one of appendices 1 to 6; and
a failure analysis information output unit that, when the output permission information received from the failure information management device indicates that output is permitted, outputs failure analysis information including the contents of the main storage device used by the process.
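The node-side behavior of Appendix 7 can be sketched as follows. This is a simplified illustration under several assumptions: a Python exception stands in for the failure, a dictionary named `memory` stands in for the contents of the main storage device, the list `dump_sink` stands in for wherever failure analysis information is written, and `ask_manager` is a hypothetical callable that replaces the exchange with the failure information management device. None of these names come from this publication.

```python
import traceback as tb
from typing import Callable, Dict, List, Optional


def collect_failure_info(exc: BaseException) -> Dict[str, object]:
    """Failure information collection unit: cause plus traceback."""
    frames = tuple(frame.name for frame in tb.extract_tb(exc.__traceback__))
    return {"cause": type(exc).__name__, "traceback": frames}


def run_process(
    work: Callable[[dict], None],
    ask_manager: Callable[[Dict[str, object]], bool],
    dump_sink: List[dict],
    memory: dict,
) -> Optional[Dict[str, object]]:
    """Run `work`; on failure, report it and dump memory only if permitted.

    Returns the collected failure information, or None if no failure occurred.
    """
    try:
        work(memory)  # process execution unit
        return None
    except Exception as exc:
        info = collect_failure_info(exc)
        # Transmission unit: send the small failure information first.
        if ask_manager(info):
            # Output unit: write the large failure analysis information
            # only when the management side permits it.
            dump_sink.append({"info": info, "memory": dict(memory)})
        return info
```

The point of this ordering is that the small failure information always travels to the management side, while the large memory image is written only after permission arrives.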
(Appendix 8)
A distributed parallel processing system comprising:
the failure information management device according to any one of appendices 1 to 6; and
the information processing device according to appendix 7.
(Appendix 9)
The distributed parallel processing system according to appendix 8, further comprising a shared information storage device that stores the failure analysis information output from the information processing device.
(Appendix 10)
A failure information management method, wherein, when a failure occurs during process execution at a node that executes a distributed parallel processing program in a distributed manner:
the node at which the failure occurred
collects failure information on the failure, and
transmits the failure information to a failure information management device;
the failure information management device
determines whether failure information satisfying a predetermined similarity condition with respect to the received failure information is stored in a failure information storage unit,
when no failure information satisfying the predetermined similarity condition is stored, transmits to the node output permission information indicating that output of failure analysis information, which includes the contents of the main storage device used by the process when the failure occurred at the node, is permitted, and registers the failure information received from the node in the failure information storage unit, and
when failure information satisfying the predetermined similarity condition is stored, transmits to the node output permission information indicating that output of the failure analysis information is not permitted; and
the node at which the failure occurred outputs the failure analysis information when the output permission information received from the failure information management device indicates that output is permitted.
(Appendix 11)
A computer program that, using a failure information storage unit that stores failure information on failures that occur during process execution at the nodes that execute a distributed parallel program in a distributed manner, causes a computer device to execute:
a failure information reception step of receiving the failure information transmitted by a node in response to the occurrence of a failure;
a similarity determination step of determining whether failure information satisfying a predetermined similarity condition with respect to the failure information received from the node is stored in the failure information storage unit;
an output permission information transmission step of transmitting to the node, according to the determination result of the similarity determination step, output permission information indicating whether failure analysis information, which includes the contents of the main storage device used by the process when the failure occurred at the node, may be output; and
a failure information registration step of registering the failure information received from the node in the failure information storage unit when it is determined that no failure information satisfying the predetermined similarity condition with respect to the received failure information is stored in the failure information storage unit.
(Appendix 12)
A computer program that causes a computer device to execute:
a process execution step of executing processing based on a distributed parallel program;
a failure information collection step of collecting, when a failure occurs during process execution in the process execution step, failure information on the failure;
a failure information transmission step of transmitting the failure information collected in the failure information collection step to a device that executes the computer program according to appendix 11; and
a failure analysis information output step of outputting failure analysis information including the contents of the main storage device used by the process when the output permission information received from the device indicates that output is permitted.
(Appendix 13)
A failure information management method comprising:
receiving, from a node that executes a distributed parallel program in a distributed manner, failure information on a failure that occurred during process execution at the node;
determining whether failure information satisfying a predetermined similarity condition with respect to the failure information received from the node is stored in a failure information storage unit;
transmitting to the node, according to the determination result, output permission information indicating whether failure analysis information, which includes the contents of the main storage device used by the process when the failure occurred at the node, may be output; and
registering the failure information received from the node in the failure information storage unit when it is determined that no failure information satisfying the predetermined similarity condition with respect to the received failure information is stored in the failure information storage unit.
(Appendix 14)
A failure information management method comprising:
collecting, when a failure occurs during process execution based on a distributed parallel program, failure information on the failure;
transmitting the collected failure information to a device that executes the failure information management method according to appendix 13; and
outputting failure analysis information including the contents of the main storage device used by the process when the output permission information received from the device indicates that output is permitted.

DESCRIPTION OF SYMBOLS
1, 2 Distributed parallel processing system
10, 30 Failure information management device
11 Failure information storage unit
12 Failure information reception unit
13 Similarity determination unit
14 Output permission information transmission unit
15, 35 Failure information registration unit
36 Summary information output unit
20, 40 Node
21 Process execution unit
22, 42 Failure information collection unit
23 Failure information transmission unit
24, 44 Failure analysis information output unit
50 Shared information storage device
90 Inter-node connection device

Claims

A failure information management device comprising:
a failure information storage unit that stores failure information on failures that occur during process execution at the nodes that execute a distributed parallel program in a distributed manner;
a failure information reception unit that receives the failure information transmitted by a node in response to the occurrence of a failure;
a similarity determination unit that determines whether failure information satisfying a predetermined similarity condition with respect to the failure information received from the node is stored in the failure information storage unit;
an output permission information transmission unit that, according to the determination result of the similarity determination unit, transmits to the node output permission information indicating whether failure analysis information, which includes the contents of the main storage device used by the process when the failure occurred at the node, may be output; and
a failure information registration unit that registers the failure information received from the node in the failure information storage unit when it is determined that no failure information satisfying the predetermined similarity condition with respect to the received failure information is stored in the failure information storage unit.

The failure information management device according to claim 1, wherein, when the failure information includes traceback information representing the history of instructions leading up to the failure, the similarity determination unit includes in the predetermined similarity condition the condition that the traceback information is identical.

The failure information management device according to claim 1, wherein, when the failure information includes failure cause information indicating the cause of the failure, the similarity determination unit includes in the predetermined similarity condition the condition that the failure cause information is identical.

The failure information management device according to claim 1, wherein, when the failure information includes identification information of the node at which the failure occurred, and it is determined that failure information satisfying the predetermined similarity condition with respect to the failure information received from the node is stored in the failure information storage unit, the failure information registration unit updates the stored failure information so that it additionally includes the node identification information contained in the failure information received from the node.

The failure information management device according to any one of claims 1 to 4, further comprising a summary information output unit that outputs the failure information stored in the failure information storage unit as summary information.

An information processing device that operates as a node capable of executing a distributed parallel program in a distributed manner, comprising:
a process execution unit that executes processing based on the distributed parallel program;
a failure information collection unit that, when a failure occurs during process execution by the process execution unit, collects failure information on the failure;
a failure information transmission unit that transmits the failure information collected by the failure information collection unit to the failure information management device according to any one of claims 1 to 5; and
a failure analysis information output unit that, when the output permission information received from the failure information management device indicates that output is permitted, outputs failure analysis information including the contents of the main storage device used by the process.

A distributed parallel processing system comprising:
the failure information management device according to any one of claims 1 to 5; and
the information processing device according to claim 6.

A failure information management method, wherein, when a failure occurs during process execution at a node that executes a distributed parallel processing program in a distributed manner:
the node at which the failure occurred
collects failure information on the failure, and
transmits the failure information to a failure information management device;
the failure information management device
determines whether failure information satisfying a predetermined similarity condition with respect to the received failure information is stored in a failure information storage unit,
when no failure information satisfying the predetermined similarity condition is stored, transmits to the node output permission information indicating that output of failure analysis information, which includes the contents of the main storage device used by the process when the failure occurred at the node, is permitted, and registers the failure information received from the node in the failure information storage unit, and
when failure information satisfying the predetermined similarity condition is stored, transmits to the node output permission information indicating that output of the failure analysis information is not permitted; and
the node at which the failure occurred outputs the failure analysis information when the output permission information received from the failure information management device indicates that output is permitted.

A computer program that, using a failure information storage unit that stores failure information on failures that occur during process execution at the nodes that execute a distributed parallel program in a distributed manner, causes a computer device to execute:
a failure information reception step of receiving the failure information transmitted by a node in response to the occurrence of a failure;
a similarity determination step of determining whether failure information satisfying a predetermined similarity condition with respect to the failure information received from the node is stored in the failure information storage unit;
an output permission information transmission step of transmitting to the node, according to the determination result of the similarity determination step, output permission information indicating whether failure analysis information, which includes the contents of the main storage device used by the process when the failure occurred at the node, may be output; and
a failure information registration step of registering the failure information received from the node in the failure information storage unit when it is determined that no failure information satisfying the predetermined similarity condition with respect to the received failure information is stored in the failure information storage unit.

A computer program that causes a computer device to execute:
a process execution step of executing processing based on a distributed parallel program;
a failure information collection step of collecting, when a failure occurs during process execution in the process execution step, failure information on the failure;
a failure information transmission step of transmitting the failure information collected in the failure information collection step to a device that executes the computer program according to claim 9; and
a failure analysis information output step of outputting failure analysis information including the contents of the main storage device used by the process when the output permission information received from the device indicates that output is permitted.