JP2022144118A

JP2022144118A - Computer system and restart program

Info

Publication number: JP2022144118A
Application number: JP2021044991A
Authority: JP
Inventors: 拓也近藤; Takuya Kondo; 雅彦齊藤; Masahiko Saito
Original assignee: Panasonic Intellectual Property Management Co Ltd
Current assignee: Panasonic Intellectual Property Management Co Ltd
Priority date: 2021-03-18
Filing date: 2021-03-18
Publication date: 2022-10-03

Abstract

PROBLEM TO BE SOLVED: To enable autonomous recovery from a failure. SOLUTION: A computer system includes: a first failure information storage unit that stores first failure management information including information on a first failure that occurs in a first process group operating in a first group logically separated from a second group; a second failure information storage unit that stores second failure management information in which second external failure information corresponding to the information on the first failure is registered in association with information on a target in the second group affected by the first failure; a first process management unit that, upon detecting the first failure, causes a first failure information communication unit to transmit the information on the first failure; and a second process management unit that, when a second failure information communication unit receives the information on the first failure, refers to the second failure management information and restarts the target affected by the first failure that corresponds to the second external failure information. SELECTED DRAWING: Figure 1

Description

The present disclosure relates to a computer system and a restart program.

Conventionally, there are techniques that detect a failure in a process executed by a computer, or in a virtual machine running on a computer under the control of a hypervisor, and restart the failed process or virtual machine.

For example, Patent Literature 1 discloses an electronic device that detects failures occurring in a plurality of processes that provide different functions and restarts only the failed process, without restarting processes that are running normally, thereby preventing the functions provided by the normally running processes from being interrupted.

また、特許文献２には、リセット処理を行う仮想マシンの範囲を表すリセットレベルの入力をユーザから受け付け、そのリセットレベルに対応する仮想マシンをリセットする計算機装置が開示されている。 Further, Patent Literature 2 discloses a computer device that accepts input from a user of a reset level representing a range of virtual machines to be reset, and resets the virtual machines corresponding to the reset level.

JP 2016-207124 A
JP 2012-18452 A

However, the conventional technique disclosed in Patent Literature 1 does not take dependencies between processes into account. Consequently, when a failure occurs in one process, it cannot meet the need to restart not only that process but also other processes that depend on it.

The conventional technique disclosed in Patent Literature 2 can determine, for each reset level, the range of virtual machines to be reset in consideration of the dependencies between the virtual machines. However, the user must decide which reset level to apply, so recovery from a failure cannot be performed autonomously.

An object of the present disclosure is to provide a computer system and a restart program that enable autonomous recovery from a failure.

A computer system according to the present disclosure includes: a first failure information storage unit that stores first failure management information including information on a first failure occurring in a first process group that operates in a first group logically separated from a second group; a second failure information storage unit that stores second failure management information in which second external failure information corresponding to the information on the first failure is registered in association with information on a target in the second group affected by the first failure; a first process management unit that, when the first failure is detected, causes a first failure information communication unit to transmit the information on the first failure; and a second process management unit that, when a second failure information communication unit receives the information on the first failure, refers to the second failure management information and restarts the target affected by the first failure that corresponds to the second external failure information.

A restart program according to the present disclosure causes a computer to execute: a procedure of, when a first failure occurring in a first process group that operates in a first group logically separated from a second group is detected, reading information on the first failure from a first failure information storage unit that stores first failure management information including the information on the first failure, and causing a first failure information communication unit to transmit that information; and a procedure of acquiring the information on the first failure from a second failure information communication unit that has received it, reading second failure management information from a second failure information storage unit that stores the second failure management information, in which second external failure information corresponding to the information on the first failure is registered in association with information on a target in the second group affected by the first failure, identifying the second external failure information corresponding to the information on the first failure, and restarting the target affected by the first failure that corresponds to the identified second external failure information.

According to the present disclosure, autonomous recovery from a failure becomes possible.

Diagram showing an example of the configuration of a computer system according to Embodiment 1
Diagram showing an example of failure management information 23a
Diagram showing an example of failure management information 33a
Flowchart showing an example of the procedure of restart processing performed by the first VM 20
Flowchart showing an example of the procedure of restart processing performed by the second VM 30
Diagram showing an example of the configuration of a computer system according to Embodiment 2
Diagram showing an example of failure management information 43a
Diagram showing an example of failure management information 53a
Diagram showing an example of failure management information 63a
Diagram showing an example of the configuration of a computer system according to Embodiment 3
Diagram showing an example of failure management information 72a
Diagram showing an example of failure management information 84a
Diagram showing an example of failure management information 94a
Diagram showing an example of failure management information 105a
Diagram showing an example of failure management information 113a
Flowchart showing an example of the procedure of restart processing performed by the hypervisor 70

以下、本開示の実施の形態を図面に基づいて詳細に説明する。 Hereinafter, embodiments of the present disclosure will be described in detail based on the drawings.

(Embodiment 1)
FIG. 1 is a diagram showing an example of the configuration of a computer system according to Embodiment 1.

As shown in FIG. 1, the computer system includes a hypervisor 10, a first VM (virtual machine) 20, and a second VM 30.

ハイパバイザ１０は、１つの計算機を第１のＶＭ２０と第２のＶＭ３０とに論理的に分割し、２つの独立した仮想マシンとして動作させる制御部である。第１のＶＭ２０および第２のＶＭ３０は、このようにして生成された仮想マシンである。 The hypervisor 10 is a control unit that logically divides one computer into a first VM 20 and a second VM 30 and operates them as two independent virtual machines. The first VM 20 and the second VM 30 are virtual machines generated in this way.

なお、ここではハイパバイザ１０上で動作する仮想マシンの数が２であることとしたが、２以上であってもよい。 Although the number of virtual machines operating on the hypervisor 10 is assumed to be two here, the number may be two or more.

The first VM 20, on which a process group including processes 21a to 21n is executed, includes a process monitoring unit 22, a failure information storage unit 23, a failure information communication unit 24, and a process management unit 25. The process monitoring unit 22, the failure information communication unit 24, and the process management unit 25 may each be implemented as a separate process, or may be implemented as a single process that fulfills all of these roles.

プロセス監視部２２、障害情報通信部２４、および、プロセス管理部２５の機能は、プロセッサにより実現される。また、障害情報記憶部２３の機能は、メモリなどの記憶装置により実現される。 Functions of the process monitoring unit 22, the fault information communication unit 24, and the process management unit 25 are realized by the processor. Also, the function of the fault information storage unit 23 is implemented by a storage device such as a memory.

プロセス監視部２２は、プロセス２１ａ～２１ｎを含むプロセス群で発生する障害を監視する。例えば、プロセス監視部２２は、プロセス２１ａ～２１ｎに対してハートビートメッセージを送信し、応答がないプロセスを、後述するプロセス管理部２５に通知する。これにより、動作が停止したり、無限ループに陥っていたりするプロセスの検出が可能となる。 The process monitoring unit 22 monitors faults occurring in a process group including the processes 21a to 21n. For example, the process monitoring unit 22 transmits heartbeat messages to the processes 21a to 21n, and notifies the process management unit 25, which will be described later, of the processes that do not respond. This makes it possible to detect processes that have stopped working or are stuck in an infinite loop.
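The heartbeat check described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `probes` mapping and the function name `find_unresponsive` are hypothetical, and each probe is assumed to be a callable that returns `True` when the process answers the heartbeat in time.

```python
from typing import Callable, Dict, List


def find_unresponsive(probes: Dict[str, Callable[[], bool]]) -> List[str]:
    """Send one heartbeat per process and return the names that did not answer."""
    unresponsive = []
    for name, probe in probes.items():
        try:
            alive = probe()
        except Exception:
            # A probe that raises is treated the same as a missing response.
            alive = False
        if not alive:
            unresponsive.append(name)
    return unresponsive


# Hypothetical probes: PROCESS12 simulates a stopped or looping process.
probes = {
    "PROCESS12": lambda: False,
    "PROCESS13": lambda: True,
}
unresponsive = find_unresponsive(probes)
```

In a sketch like this, the returned names would be what the process monitoring unit 22 reports to the process management unit 25.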

障害情報記憶部２３は、プロセス２１ａ～２１ｎを含むプロセス群において発生する障害の情報と、当該障害の影響対象の情報とを対応付けて登録した障害管理情報２３ａを記憶する。 The failure information storage unit 23 stores failure management information 23a in which information about failures that occur in a process group including the processes 21a to 21n and information about targets affected by the failures are associated and registered.

図２は、障害管理情報２３ａの一例を示す図である。障害管理情報２３ａは、プロセス情報、障害情報、影響対象情報を含む。 FIG. 2 is a diagram showing an example of the fault management information 23a. The fault management information 23a includes process information, fault information, and affected target information.

プロセス情報は、第１のＶＭ２０において実行されるプロセスの識別情報、および、第１のＶＭ２０の外部において実行されるプロセスであることを示す情報を含む。なお、図２の例では、後者の情報は登録されていない。障害情報は、それらのプロセスの障害の情報である。影響対象情報は、当該障害が発生した場合に影響を受ける対象の情報である。 The process information includes identification information of the process executed in the first VM 20 and information indicating that the process is executed outside the first VM 20 . In addition, in the example of FIG. 2, the latter information is not registered. The failure information is information about failures of those processes. The affected target information is information about the target that will be affected when the failure occurs.

障害情報における「ＴＡＲＧＥＴ」は、障害が仮想マシン（ＶＭ）、コンテナ、プロセスのどれに発生したかを示す情報である。「ＶＭ」は、障害は発生した仮想マシンを示す識別情報である。「ＰＲＯＣ」は、障害が発生したプロセスを示す識別情報である。 "TARGET" in the fault information is information indicating whether the fault occurred in a virtual machine (VM), container, or process. "VM" is identification information indicating the virtual machine in which the failure occurred. "PROC" is identification information indicating the process in which the failure occurred.

例えば、「ＰＲＯＣＥＳＳ１２」というプロセス情報には、「ＴＡＲＧＥＴ：ＰＲＯＣ，ＶＭ：１，ＰＲＯＣ：ＰＲＯＣＥＳＳ１２」という障害情報、および、「ＰＲＯＣＥＳＳ１３」という影響対象情報が対応付けて登録されている。 For example, process information "PROCESS12" is registered in association with failure information "TARGET:PROC, VM:1, PROC:PROCESS12" and affected target information "PROCESS13".

これは、第１のＶＭ２０における「ＰＲＯＣＥＳＳ１２」というプロセスに障害が発生した場合、その障害の発生により「ＰＲＯＣＥＳＳ１３」というプロセスを再起動する必要があることを示している。 This indicates that when a failure occurs in the process "PROCESS12" in the first VM 20, the process "PROCESS13" must be restarted due to the occurrence of the failure.

また、「ＰＲＯＣＥＳＳ１３」というプロセス情報には、「ＴＡＲＧＥＴ：ＰＲＯＣ，ＶＭ：１，ＰＲＯＣ：ＰＲＯＣＥＳＳ１３」という障害情報が対応付けて登録されている。 Further, process information "PROCESS13" is registered in association with failure information "TARGET: PROC, VM: 1, PROC: PROCESS13".

この障害情報は、第１のＶＭ２０において障害が発生するプロセスが「ＰＲＯＣＥＳＳ１３」であることを示している。なお、「ＰＲＯＣＥＳＳ１３」というプロセスには、影響対象情報が登録されていないので、「ＰＲＯＣＥＳＳ１３」というプロセスに障害が発生した場合に再起動する必要がある対象はない。 This fault information indicates that the process in which the fault occurs in the first VM 20 is "PROCESS13". It should be noted that no affected target information is registered for the process "PROCESS13", so there is no target that needs to be restarted when a failure occurs in the process "PROCESS13".
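One way to picture the failure management information 23a is as a lookup table keyed by process information. The sketch below is an assumption-laden illustration: the class names `FailureInfo` and `Entry`, the function `parse_failure_info`, and the comma-and-colon text format are hypothetical and are modeled only on the strings quoted in the description of FIG. 2.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class FailureInfo:
    target: str                 # kind of failed entity, e.g. "PROC"
    vm: int                     # identifier of the VM where the failure occurred
    proc: Optional[str] = None  # identifier of the failed process, if any


def parse_failure_info(text: str) -> FailureInfo:
    """Parse a string such as 'TARGET:PROC,VM:1,PROC:PROCESS12' (assumed format)."""
    fields = {}
    for part in text.split(","):
        key, _, value = part.partition(":")
        fields[key.strip()] = value.strip()
    return FailureInfo(fields["TARGET"], int(fields["VM"]), fields.get("PROC"))


@dataclass(frozen=True)
class Entry:
    failure: Optional[FailureInfo]  # None when no failure information is registered
    affected: Tuple[str, ...] = ()  # targets to restart when the failure occurs


# The two rows described for FIG. 2.
TABLE_23A: Dict[str, Entry] = {
    "PROCESS12": Entry(parse_failure_info("TARGET:PROC,VM:1,PROC:PROCESS12"), ("PROCESS13",)),
    "PROCESS13": Entry(parse_failure_info("TARGET:PROC,VM:1,PROC:PROCESS13")),
}
```

With this layout, an empty `affected` tuple expresses the case of "PROCESS13", for which no target needs to be restarted.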

図１の説明に戻ると、障害情報通信部２４は、第２のＶＭ３０の障害情報通信部３４と通信を行う。 Returning to the description of FIG. 1 , the failure information communication unit 24 communicates with the failure information communication unit 34 of the second VM 30 .

For example, when a failure occurs in any of the processes 21a to 21n in the first VM 20, the failure information communication unit 24 transmits the failure information shown in FIG. 2 that corresponds to the failed process to the second VM 30. The failure information communication unit 24 also receives failure information transmitted from the failure information communication unit 34.

The process management unit 25 manages the processes 21a to 21n in the first VM 20. For example, the process management unit 25 detects that a failure has occurred in any of the processes 21a to 21n from a notification from the process monitoring unit 22. The process management unit 25 also detects that a failure has occurred in any of the processes 21a to 21n by receiving a failure message from that process.

そして、プロセス管理部２５は、障害管理情報２３ａを参照し、障害が発生したプロセスに対応付けて登録されている障害情報および影響対象情報を取得する。 Then, the process management unit 25 refers to the failure management information 23a and acquires the failure information and the affected target information registered in association with the process in which the failure occurred.

For example, in the example of FIG. 2, if the process in which the failure was detected is "PROCESS12", the process management unit 25 acquires the failure information "TARGET:PROC, VM:1, PROC:PROCESS12" and the affected target information "PROCESS13".

Thereafter, when failure information is registered, as for the process "PROCESS12", the process management unit 25 instructs the failure information communication unit 24 to transmit that failure information to the second VM 30, which is the other VM.

また、プロセス管理部２５は、「ＰＲＯＣＥＳＳ１２」のプロセスのように、影響対象情報が登録されている場合、登録されている影響対象を再起動する。さらに、プロセス管理部２５は、障害が発生したプロセスを再起動する。 In addition, the process management unit 25 restarts the registered affected object when the affected object information is registered like the process of "PROCESS12". Furthermore, the process management unit 25 restarts the failed process.

Further, when "PROCESS13" is registered as affected target information, as for the process "PROCESS12" shown in FIG. 2, the process management unit 25 also instructs the failure information communication unit 24 to transmit the failure information corresponding to the process "PROCESS13" to the second VM 30, which is the other VM.

Note that, since no affected target information is registered for the process "PROCESS13", the process management unit 25 does not restart any further process on its account.

さらに、プロセス管理部２５は、障害情報通信部２４が他のＶＭである第２のＶＭ３０から障害情報を受信した場合に、その障害情報を取得し、その障害情報が障害管理情報２３ａに含まれる障害情報に対応するものであるか否かを判定する。 Furthermore, when the failure information communication unit 24 receives failure information from the second VM 30, which is another VM, the process management unit 25 acquires the failure information, and the failure information is included in the failure management information 23a. It is determined whether or not it corresponds to the failure information.

そして、プロセス管理部２５は、その障害情報が障害管理情報２３ａに含まれる障害情報に対応するものである場合、その障害情報に対応する影響対象情報が登録されているか否かを判定する。 Then, when the failure information corresponds to the failure information included in the failure management information 23a, the process management unit 25 determines whether or not the affected information corresponding to the failure information is registered.

その障害情報に対応する影響対象情報が登録されている場合、プロセス管理部２５は、登録されている影響対象を再起動する。 If affected target information corresponding to the fault information is registered, the process management unit 25 restarts the registered affected target.

The second VM 30, on which a process group including processes 31a to 31m is executed, includes a process monitoring unit 32, a failure information storage unit 33, a failure information communication unit 34, and a process management unit 35. The process monitoring unit 32, the failure information communication unit 34, and the process management unit 35 may each be implemented as a separate process, or may be implemented as a single process that fulfills all of these roles.

プロセス監視部３２、障害情報通信部３４、および、プロセス管理部３５の機能は、プロセッサにより実現される。また、障害情報記憶部３３の機能は、メモリなどの記憶装置により実現される。 Functions of the process monitoring unit 32, the fault information communication unit 34, and the process management unit 35 are realized by the processor. Also, the function of the failure information storage unit 33 is implemented by a storage device such as a memory.

プロセス監視部３２は、プロセス３１ａ～３１ｍを含むプロセス群で発生する障害を監視する。例えば、プロセス監視部３２は、プロセス３１ａ～３１ｍに対してハートビートメッセージを送信し、応答がないプロセスを、後述するプロセス管理部３５に通知する。 The process monitoring unit 32 monitors failures occurring in a process group including the processes 31a to 31m. For example, the process monitoring unit 32 sends heartbeat messages to the processes 31a to 31m, and notifies the process management unit 35, which will be described later, of the processes that do not respond.

障害情報記憶部３３は、プロセス３１ａ～３１ｍを含むプロセス群において発生する障害の情報と、当該障害の影響対象の情報とを対応付けて登録した障害管理情報３３ａを記憶する。 The failure information storage unit 33 stores failure management information 33a in which information about failures that occur in a process group including the processes 31a to 31m and information about targets affected by the failures are associated and registered.

図３は、障害管理情報３３ａの一例を示す図である。障害管理情報３３ａは、プロセス情報、障害情報、影響対象情報を含む。 FIG. 3 is a diagram showing an example of the fault management information 33a. The fault management information 33a includes process information, fault information, and affected target information.

プロセス情報は、第２のＶＭ３０において実行されるプロセスの識別情報、および、第２のＶＭ３０の外部において実行されるプロセスであることを示す情報を含む。障害情報は、それらのプロセスの障害の情報ある。影響対象情報は、当該障害が発生した場合に影響を受ける対象の情報である。 The process information includes identification information of the process executed in the second VM 30 and information indicating that the process is executed outside the second VM 30 . The failure information is information on failures of those processes. The affected target information is information about the target that will be affected when the failure occurs.

"TARGET", "VM", and "PROC" in the failure information have the same meaning as "TARGET", "VM", and "PROC" in the failure information of the failure management information 23a shown in FIG. 2.

Here, the process information "EXTERNAL_ERROR" is registered in association with the failure information (external failure information) "TARGET:PROC, VM:1, PROC:PROCESS12" and the affected target information "PROCESS22, PROCESS23".

これは、第１のＶＭ２０における「ＰＲＯＣＥＳＳ１２」というプロセスに障害が発生した場合、その障害の発生により「ＰＲＯＣＥＳＳ２２」、「ＰＲＯＣＥＳＳ２３」というプロセスを再起動する必要があることを示している。 This indicates that when a failure occurs in the process "PROCESS12" in the first VM 20, it is necessary to restart the processes "PROCESS22" and "PROCESS23" due to the occurrence of the failure.

図１の説明に戻ると、障害情報通信部３４は、第１のＶＭ２０の障害情報通信部２４と通信を行う。 Returning to the description of FIG. 1 , the failure information communication unit 34 communicates with the failure information communication unit 24 of the first VM 20 .

For example, when a failure occurs in any of the processes 31a to 31m in the second VM 30, the failure information communication unit 34 transmits the failure information shown in FIG. 3 that corresponds to the failed process to the first VM 20. The failure information communication unit 34 also receives failure information transmitted from the failure information communication unit 24.

プロセス管理部３５は、第２のＶＭ３０におけるプロセス３１ａ～３１ｍを管理する。例えば、プロセス管理部３５は、プロセス監視部３２からの通知により、プロセス３１ａ～３１ｍに障害が発生したことを検出する。また、プロセス管理部３５は、プロセス３１ａ～３１ｍからの障害メッセージを受信することによりプロセス３１ａ～３１ｍに障害が発生したことを検出する。 The process management unit 35 manages the processes 31a-31m in the second VM30. For example, the process management unit 35 detects from the notification from the process monitoring unit 32 that a failure has occurred in the processes 31a to 31m. Also, the process management unit 35 detects that a failure has occurred in the processes 31a to 31m by receiving failure messages from the processes 31a to 31m.

そして、プロセス管理部３５は、障害管理情報３３ａを参照し、障害が発生したプロセスに対応付けて登録されている障害情報および影響対象情報を取得する。 Then, the process management unit 35 refers to the failure management information 33a, and acquires failure information and affected target information registered in association with the process in which the failure occurred.

For example, in the example of FIG. 3, if the process in which the failure occurred is "PROCESS22", the process management unit 35 acquires the failure information "TARGET:PROC, VM:2, PROC:PROCESS22" and the affected target information "PROCESS23".

Thereafter, when failure information is registered, as for the process "PROCESS22", the process management unit 35 instructs the failure information communication unit 34 to transmit that failure information to the first VM 20, which is the other VM.

Further, when affected target information is registered, as for the process "PROCESS22", the process management unit 35 restarts the registered affected targets. Furthermore, the process management unit 35 restarts the failed process.

なお、図３の例では、「ＰＲＯＣＥＳＳ２３」のプロセスに対応する障害情報は登録されていないため、障害情報の第１のＶＭ２０への送信処理は行われない。障害が発生しても他へ影響しないプロセスの場合、障害情報を登録しないことで不必要な障害情報の送信処理の抑止が可能となる。また、「ＰＲＯＣＥＳＳ２３」のプロセスに対応する影響対象情報も登録されていないため、プロセス管理部３５はプロセスの再起動を行わない。 In the example of FIG. 3, since the failure information corresponding to the process "PROCESS23" is not registered, the process of transmitting the failure information to the first VM 20 is not performed. In the case of a process that does not affect others even if a failure occurs, it is possible to suppress unnecessary failure information transmission processing by not registering failure information. In addition, since the affected object information corresponding to the process "PROCESS23" is not registered, the process management unit 35 does not restart the process.

さらに、プロセス管理部３５は、障害情報通信部３４が他のＶＭである第１のＶＭ２０から障害情報を受信した場合に、その障害情報を取得し、その障害情報が障害管理情報３３ａに含まれる障害情報に対応するものであるか否かを判定する。 Further, when the failure information communication unit 34 receives failure information from the first VM 20, which is another VM, the process management unit 35 acquires the failure information, and the failure information is included in the failure management information 33a. It is determined whether or not it corresponds to the fault information.

そして、プロセス管理部３５は、その障害情報が障害管理情報３３ａに含まれる障害情報に対応するものである場合、その障害情報に対応する影響対象情報が登録されているか否かを判定する。 Then, when the failure information corresponds to the failure information included in the failure management information 33a, the process management unit 35 determines whether or not the affected information corresponding to the failure information is registered.

その障害情報に対応する影響対象情報が登録されている場合、プロセス管理部３５は、登録されている影響対象を再起動する。 If affected target information corresponding to the fault information is registered, the process management unit 35 restarts the registered affected target.

For example, assume that the failure information communication unit 34 has received the information "TARGET:PROC, VM:1, PROC:PROCESS12" from the first VM 20 as failure information.

In this case, the received failure information corresponds to the failure information of "EXTERNAL_ERROR" in the failure management information 33a shown in FIG. 3, so the process management unit 35 restarts the processes "PROCESS22" and "PROCESS23", which are the registered affected targets.

Here, failure information and affected target information are registered for "PROCESS22", so the process management unit 35 instructs the failure information communication unit 34 to transmit that failure information to the first VM 20 and restarts the process "PROCESS23", which is registered as the affected target information.

なお、「ＰＲＯＣＥＳＳ２３」のプロセスに対応する障害情報は登録されていないため、障害情報の第１のＶＭ２０への送信処理は行われない。また、「ＰＲＯＣＥＳＳ２３」のプロセスに対応する影響対象情報も登録されていないため、プロセス管理部３５はプロセスの再起動は行わない。 Since the failure information corresponding to the process of "PROCESS23" is not registered, the process of transmitting the failure information to the first VM 20 is not performed. In addition, since the affected object information corresponding to the process "PROCESS23" is not registered, the process management unit 35 does not restart the process.
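The receive-side behaviour of the second VM 30 in this example can be sketched as below. This is an illustration under stated assumptions, not the patent's implementation: the table layout, the name `handle_received`, and the single `EXTERNAL_ERROR` key are hypothetical simplifications, and restarting each process only once is a choice made here for the sketch (the description mentions "PROCESS23" as an affected target twice).

```python
from typing import Dict, List, Optional, Tuple

# Process information -> (failure information or None, affected targets),
# following the rows described for FIG. 3.
Row = Tuple[Optional[str], Tuple[str, ...]]
TABLE_33A: Dict[str, Row] = {
    "EXTERNAL_ERROR": ("TARGET:PROC,VM:1,PROC:PROCESS12", ("PROCESS22", "PROCESS23")),
    "PROCESS22": ("TARGET:PROC,VM:2,PROC:PROCESS22", ("PROCESS23",)),
    "PROCESS23": (None, ()),
}


def handle_received(table: Dict[str, Row], received: str) -> Tuple[List[str], List[str]]:
    """Return (processes to restart, failure information to send to the other VM)."""
    restarts: List[str] = []
    outgoing: List[str] = []
    external = table.get("EXTERNAL_ERROR")
    if external is None or external[0] != received:
        return restarts, outgoing  # no matching external failure information
    pending = list(external[1])
    while pending:
        name = pending.pop(0)
        if name in restarts:
            continue  # assumption of this sketch: restart each process once
        restarts.append(name)
        info, affected = table.get(name, (None, ()))
        if info is not None:
            outgoing.append(info)  # to be sent back to the first VM 20
        pending.extend(affected)
    return restarts, outgoing


restarts, outgoing = handle_received(TABLE_33A, "TARGET:PROC,VM:1,PROC:PROCESS12")
```

For the message received in the example above, the sketch yields restarts of "PROCESS22" and "PROCESS23" and one outgoing message carrying the failure information of "PROCESS22".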

つぎに、第１のＶＭ２０が行う再起動処理の処理手順の一例について説明する。図４は、第１のＶＭ２０が行う再起動処理の処理手順の一例を示すフローチャートである。 Next, an example of the procedure of restart processing performed by the first VM 20 will be described. FIG. 4 is a flowchart illustrating an example of a procedure of restart processing performed by the first VM 20 .

図４に示すように、プロセス管理部２５は、プロセス監視部２２からの通知により、または、プロセス２１ａ～２１ｎから受信する障害メッセージにより、プロセス２１ａ～２１ｎに障害が発生したことを検出する（ステップＳ１０１）。 As shown in FIG. 4, the process management unit 25 detects that a failure has occurred in the processes 21a to 21n from a notification from the process monitoring unit 22 or from a failure message received from the processes 21a to 21n (step S101).

続いて、プロセス管理部２５は、障害管理情報２３ａを参照し、障害が発生したプロセスの情報をもとに、そのプロセスに対応する障害情報と影響対象情報とを障害管理情報２３ａから取得する処理を行う（ステップＳ１０２）。 Next, the process management unit 25 refers to the fault management information 23a, and based on the information of the process in which the fault occurred, acquires the fault information and the affected object information corresponding to the process from the fault management information 23a. (step S102).

そして、プロセス管理部２５は、障害管理情報２３ａにそのプロセスに対応する障害情報が登録されていたか否かを判定する（ステップＳ１０３）。 Then, the process management unit 25 determines whether fault information corresponding to the process is registered in the fault management information 23a (step S103).

障害管理情報２３ａにそのプロセスに対応する障害情報が登録されていた場合（ステップＳ１０３においてＹｅｓの場合）、プロセス管理部２５は、障害情報通信部２４に指示して、その障害情報を他のＶＭである第２のＶＭ３０に送信させる（ステップＳ１０４）。 If failure information corresponding to the process is registered in the failure management information 23a (Yes in step S103), the process management unit 25 instructs the failure information communication unit 24 to transmit the failure information to another VM. is transmitted to the second VM 30 (step S104).

After that, the process management unit 25 determines whether affected target information corresponding to the process is registered in the failure management information 23a (step S105).

障害管理情報２３ａにそのプロセスに対応する影響対象情報が登録されていた場合（ステップＳ１０５においてＹｅｓの場合）、プロセス管理部２５は、その影響対象情報に登録されている影響対象を再起動する（ステップＳ１０６）。 If the affected object information corresponding to the process is registered in the fault management information 23a (Yes in step S105), the process management unit 25 restarts the affected object registered in the affected object information ( step S106).

Note that, when failure information corresponding to an affected process that is restarted is registered in the failure management information 23a, the process management unit 25 instructs the failure information communication unit 24 to transmit that failure information to the second VM 30. When affected target information corresponding to that process is also registered, the process management unit 25 restarts the affected targets registered in it.

さらに、プロセス管理部２５は、障害の発生が検出されたプロセスを再起動し（ステップＳ１０７）、この再起動処理を終了する。 Furthermore, the process management unit 25 restarts the process in which the occurrence of the failure has been detected (step S107), and terminates this restart processing.

また、ステップＳ１０３において、障害管理情報２３ａにそのプロセスに対応する障害情報が登録されていなかった場合（ステップＳ１０３においてＮｏの場合）、または、ステップＳ１０５において、障害管理情報２３ａにそのプロセスに対応する影響対象情報が登録されていなかった場合（ステップＳ１０５においてＮｏの場合）、プロセス管理部２５は、障害の発生が検出されたプロセスを再起動し（ステップＳ１０７）、この再起動処理を終了する。 In step S103, if the failure information corresponding to the process is not registered in the failure management information 23a (No in step S103), or if the failure management information 23a corresponds to the process in step S105, If the affected target information is not registered (No in step S105), the process management unit 25 restarts the process in which the occurrence of the failure is detected (step S107), and terminates this restart processing.
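Steps S101 to S107 of FIG. 4 can be summarized in a short sketch. It is only an illustration under assumptions: the function `on_failure`, the table layout, and the `send` and `restart` callbacks are hypothetical names, and only one level of affected targets is followed, so deeper cascades are omitted.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Process information -> (failure information or None, affected targets),
# following the rows described for FIG. 2.
Row = Tuple[Optional[str], Tuple[str, ...]]
TABLE_23A: Dict[str, Row] = {
    "PROCESS12": ("TARGET:PROC,VM:1,PROC:PROCESS12", ("PROCESS13",)),
    "PROCESS13": ("TARGET:PROC,VM:1,PROC:PROCESS13", ()),
}


def on_failure(
    table: Dict[str, Row],
    failed: str,                     # S101: process in which a failure was detected
    send: Callable[[str], None],     # stands in for the failure information communication unit
    restart: Callable[[str], None],  # stands in for the actual restart operation
) -> None:
    info, affected = table.get(failed, (None, ()))  # S102: look up the entry
    if info is not None:                            # S103: failure information registered?
        send(info)                                  # S104: notify the other VM
        for target in affected:                     # S105: affected targets registered?
            target_info, _ = table.get(target, (None, ()))
            if target_info is not None:
                send(target_info)                   # also report the restarted target
            restart(target)                         # S106: restart the affected target
    restart(failed)                                 # S107: restart the failed process


sent: List[str] = []
restarted: List[str] = []
on_failure(TABLE_23A, "PROCESS12", sent.append, restarted.append)
```

Passing `send` and `restart` as callbacks keeps the sketch independent of any particular inter-VM channel or process launcher, neither of which is specified in this part of the description.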

つぎに、第２のＶＭ３０が行う再起動処理の処理手順の一例について説明する。図５は、第２のＶＭ３０が行う再起動処理の処理手順の一例を示すフローチャートである。 Next, an example of the restart processing procedure performed by the second VM 30 will be described. FIG. 5 is a flowchart showing an example of the restart processing procedure performed by the second VM 30.

図５に示すように、第２のＶＭ３０の障害情報通信部３４は、第１のＶＭ２０の障害情報通信部２４により送信された障害情報を受信する（ステップＳ２０１）。 As shown in FIG. 5, the failure information communication unit 34 of the second VM 30 receives failure information transmitted by the failure information communication unit 24 of the first VM 20 (step S201).

そして、プロセス管理部３５は、障害情報通信部３４から障害情報を取得するとともに、障害管理情報３３ａを参照し、その障害情報に対応する障害情報と影響対象情報とを障害管理情報３３ａから取得する処理を行う（ステップＳ２０２）。 Then, the process management unit 35 acquires the failure information from the failure information communication unit 34, refers to the failure management information 33a, and attempts to acquire from it the failure information and the affected target information that correspond to the received failure information (step S202).

そして、プロセス管理部３５は、障害管理情報３３ａに、障害情報通信部３４から取得した障害情報に対応する障害情報が登録されていたか否かを判定する（ステップＳ２０３）。 Then, the process management unit 35 determines whether fault information corresponding to the fault information acquired from the fault information communication unit 34 is registered in the fault management information 33a (step S203).

障害管理情報３３ａに障害情報通信部３４から取得した障害情報に対応する障害情報が登録されていた場合（ステップＳ２０３においてＹｅｓの場合）、プロセス管理部３５は、障害情報通信部３４に指示して、その障害情報を他のＶＭである第１のＶＭ２０に送信させる（ステップＳ２０４）。 If failure information corresponding to the failure information acquired from the failure information communication unit 34 is registered in the failure management information 33a (Yes in step S203), the process management unit 35 instructs the failure information communication unit 34 to transmit that failure information to the first VM 20, which is the other VM (step S204).

その後、プロセス管理部３５は、障害管理情報３３ａに障害情報通信部３４から取得した障害情報に対応する影響対象情報が登録されていたか否かを判定する（ステップＳ２０５）。 After that, the process management unit 35 determines whether or not affected information corresponding to the failure information acquired from the failure information communication unit 34 is registered in the failure management information 33a (step S205).

障害管理情報３３ａに障害情報通信部３４から取得した障害情報に対応する影響対象情報が登録されていた場合（ステップＳ２０５においてＹｅｓの場合）、プロセス管理部３５は、その影響対象情報に登録されている影響対象を再起動し、この再起動処理を終了する（ステップＳ２０６）。 If affected target information corresponding to the failure information acquired from the failure information communication unit 34 is registered in the failure management information 33a (Yes in step S205), the process management unit 35 restarts the affected target registered in that affected target information and ends this restart processing (step S206).

なお、プロセス管理部３５は、障害管理情報３３ａにおいて、再起動される影響対象のプロセスに対応する障害情報が登録されている場合には、障害情報通信部３４に指示して、その障害情報を第１のＶＭ２０に送信させ、そのプロセスに対応する影響対象情報が登録されていた場合には、その影響対象情報に登録されている影響対象を再起動する。 Note that, if failure information corresponding to the affected process being restarted is registered in the failure management information 33a, the process management unit 35 instructs the failure information communication unit 34 to transmit that failure information to the first VM 20. If affected target information corresponding to that process is also registered, the process management unit 35 restarts the affected target registered in that affected target information.

また、ステップＳ２０３において、障害管理情報３３ａに障害情報通信部３４から取得した障害情報に対応する障害情報が登録されていなかった場合（ステップＳ２０３においてＮｏの場合）、または、ステップＳ２０５において、障害管理情報３３ａに障害情報通信部３４から取得した障害情報に対応する影響対象情報が登録されていなかった場合（ステップＳ２０５においてＮｏの場合）、そのままこの再起動処理は終了する。 If, in step S203, failure information corresponding to the failure information acquired from the failure information communication unit 34 is not registered in the failure management information 33a (No in step S203), or if, in step S205, affected target information corresponding to that failure information is not registered in the failure management information 33a (No in step S205), this restart processing ends without further action.
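The receiving side of the procedure (steps S202 to S206) can be sketched in the same spirit. The row layout, the function name `handle_received_failure`, and the callbacks are assumptions made for illustration. Unlike the local case, nothing is restarted when no matching row exists, because the failed process lives in the other VM.

```python
from typing import Callable, Dict, List


def handle_received_failure(
    table: Dict[str, dict],
    received: str,
    send: Callable[[str], None],
    restart: Callable[[str], None],
) -> List[str]:
    """Sketch of steps S202 to S206; returns the targets that were restarted."""
    # S202: look for a row whose registered failure information matches.
    row = next((r for r in table.values() if r.get("failure") == received), None)
    # S203 (No): nothing is registered for this failure, so processing ends.
    if row is None:
        return []
    # S204: have the communication unit transmit the failure information.
    send(received)
    # S205/S206: restart the registered affected targets, if any.
    affected = list(row.get("affected") or [])
    for target in affected:
        restart(target)
    return affected
```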

このように、実施の形態１では、第１のＶＭ２０の障害情報記憶部２３が、論理的に第２のＶＭ３０と区分された第１のＶＭ２０で動作するプロセス２１ａ～２１ｎにおいて発生する第１の障害の情報を含む障害管理情報２３ａを記憶し、第２のＶＭ３０の障害情報記憶部３３が、第１の障害の情報に対応する第２の障害情報（外部障害情報）と、第２のＶＭ３０における上記第１の障害の影響対象の情報とを対応付けて登録した障害管理情報３３ａを記憶し、第１のＶＭ２０のプロセス管理部２５が、上記第１の障害を検出した場合に、第１の障害の情報を障害情報通信部２４に送信させ、第１の障害の情報を第２のＶＭ３０の障害情報通信部３４が受信した場合に、プロセス管理部３５が、障害管理情報３３ａを参照し、上記第２の障害情報（外部障害情報）に対応する第１の障害の影響対象を再起動することとした。 As described above, in Embodiment 1, the failure information storage unit 23 of the first VM 20 stores the failure management information 23a, which includes information on a first failure occurring in the processes 21a to 21n operating in the first VM 20, which is logically separated from the second VM 30. The failure information storage unit 33 of the second VM 30 stores the failure management information 33a, in which second failure information (external failure information) corresponding to the information on the first failure is registered in association with information on the targets in the second VM 30 affected by the first failure. When the process management unit 25 of the first VM 20 detects the first failure, it causes the failure information communication unit 24 to transmit the information on the first failure. When the failure information communication unit 34 of the second VM 30 receives that information, the process management unit 35 refers to the failure management information 33a and restarts the targets affected by the first failure that correspond to the second failure information (external failure information).

これにより、プロセス間の依存関係を考慮して、障害からの復旧を自律的に行うことができる。 As a result, it is possible to autonomously recover from failures by considering inter-process dependencies.

また、実施の形態１では、第１のＶＭ２０のプロセス監視部２２が、プロセス２１ａ～２１ｎのハートビートを監視し、ハートビートの監視結果の情報をプロセス管理部２５に通知し、プロセス管理部２５が、プロセス監視部２２による通知に基づいて、プロセス２１ａ～２１ｎの障害を検出することとした。 Also, in Embodiment 1, the process monitoring unit 22 of the first VM 20 monitors the heartbeats of the processes 21a to 21n and notifies the process management unit 25 of the monitoring results, and the process management unit 25 detects failures of the processes 21a to 21n based on the notification from the process monitoring unit 22.

これにより、動作が停止したり、無限ループに陥っていたりするプロセスの検出が可能となる。 This makes it possible to detect processes that have stopped working or are stuck in an infinite loop.
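One simple way to realize such heartbeat monitoring is to record when each process last reported and to flag those whose last report is older than a timeout. The sketch below assumes that scheme; the function name `find_unresponsive`, the timestamp dictionary, and the timeout value are illustrative choices that the source does not specify.

```python
from typing import Dict, List


def find_unresponsive(
    last_heartbeat: Dict[str, float], now: float, timeout: float
) -> List[str]:
    """Return the processes whose last heartbeat is older than `timeout` seconds.

    A process that has stopped, or that is stuck in a loop that never reaches
    its heartbeat call, stops updating its timestamp and is eventually listed.
    """
    return sorted(
        name for name, seen in last_heartbeat.items() if now - seen > timeout
    )
```

The returned names would then be passed to the process management unit as the monitoring result.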

また、実施の形態１では、第１のＶＭ２０の障害管理情報２３ａに、さらに第２のＶＭ３０で動作するプロセス３１ａ～３１ｍにおいて発生する第２の障害に対応する第１の障害情報（外部障害情報）と、第２の障害の影響対象の情報とが対応付けて登録され、第２のＶＭ３０の障害管理情報３３ａには、第２の障害の情報がさらに登録され、第２のＶＭ３０のプロセス管理部３５は、第２の障害を検出した場合に、第２の障害の情報を障害情報通信部３４に送信させ、第１のＶＭ２０のプロセス管理部２５は、障害情報通信部２４が第２の障害の情報を受信した場合に、障害管理情報２３ａを参照し、上記第１の障害情報（外部障害情報）に対応する第２の障害の影響対象を再起動することとした。 Also, in Embodiment 1, the failure management information 23a of the first VM 20 further registers first failure information (external failure information), which corresponds to a second failure occurring in the processes 31a to 31m operating in the second VM 30, in association with information on the targets affected by the second failure, and the failure management information 33a of the second VM 30 further registers information on the second failure. When the process management unit 35 of the second VM 30 detects the second failure, it causes the failure information communication unit 34 to transmit the information on the second failure. When the failure information communication unit 24 receives that information, the process management unit 25 of the first VM 20 refers to the failure management information 23a and restarts the targets affected by the second failure that correspond to the first failure information (external failure information).

これにより、第１のＶＭ２０において発生した障害からの復旧だけでなく、第２のＶＭ３０において発生した障害からの復旧も自律的に行うことができる。 As a result, recovery from a failure occurring in the first VM 20 as well as recovery from a failure occurring in the second VM 30 can be autonomously performed.

また、実施の形態１では、論理的に区分された２つのグループが、ハイパバイザ上で動作する第１のＶＭ２０および第２のＶＭ３０であることとした。 Moreover, in Embodiment 1, the logically divided two groups are the first VM 20 and the second VM 30 operating on the hypervisor.

これにより、第１のＶＭ２０のプロセスに第２のＶＭ３０のプロセスが依存する場合に、障害からの復旧を自律的に行うことができる。 Thereby, when the process of the second VM 30 depends on the process of the first VM 20, it is possible to autonomously recover from the failure.

（実施の形態２） (Embodiment 2)
図６は、実施の形態２に係る計算機システムの構成の一例を示す図である。図６に示すように、この計算機システムでは、プロセス４０ａ～４０ｌ、および、コンテナ管理プロセス４１ａ、４１ｂが実行される。コンテナ管理プロセス４１ａ、４１ｂは、それぞれ第１のコンテナ５０、および、第２のコンテナ６０の管理を行うプロセスである。 FIG. 6 is a diagram showing an example of the configuration of a computer system according to Embodiment 2. As shown in FIG. 6, this computer system executes processes 40a to 40l and container management processes 41a and 41b. The container management processes 41a and 41b are processes that manage the first container 50 and the second container 60, respectively.

この計算機システムは、プロセス監視部４２、障害情報記憶部４３、障害情報通信部４４、プロセス管理部４５、第１のコンテナ５０、および、第２のコンテナ６０を備える。第１のコンテナ５０および第２のコンテナ６０は、１つの計算機において互いに論理的に分割されてアプリケーションを実行するコンテナである。 This computer system comprises a process monitoring unit 42, a failure information storage unit 43, a failure information communication unit 44, a process management unit 45, a first container 50, and a second container 60. The first container 50 and the second container 60 are containers that run applications while logically separated from each other on one computer.

なお、ここではコンテナの数が２であることとしたが、２以上であってもよい。また、プロセス監視部４２、障害情報通信部４４、および、プロセス管理部４５は、それぞれ個別のプロセスとして実装されても良いが、それぞれの役割を備えた単一のプロセスとして実装されても良い。 Although the number of containers is two here, it may be two or more. The process monitoring unit 42, the fault information communication unit 44, and the process management unit 45 may be implemented as separate processes, or may be implemented as a single process having their respective roles.

プロセス監視部４２は、プロセス４０ａ～４０ｌおよびコンテナ管理プロセス４１ａ、４１ｂを含むプロセス群で発生する障害を監視する。例えば、プロセス監視部４２は、プロセス４０ａ～４０ｌおよびコンテナ管理プロセス４１ａ、４１ｂに対してハートビートメッセージを送信し、応答がないプロセスを、後述するプロセス管理部４５に通知する。 The process monitoring unit 42 monitors failures occurring in a process group including the processes 40a to 40l and the container management processes 41a and 41b. For example, the process monitoring unit 42 transmits heartbeat messages to the processes 40a to 40l and the container management processes 41a and 41b, and notifies the process management unit 45 (described later) of any process that does not respond.

障害情報記憶部４３は、プロセス４０ａ～４０ｌおよびコンテナ管理プロセス４１ａ、４１ｂを含むプロセス群において発生する障害の情報と、当該障害の影響対象の情報とを対応付けて登録した障害管理情報４３ａを記憶する。 The failure information storage unit 43 stores failure management information 43a, in which information on failures occurring in the process group including the processes 40a to 40l and the container management processes 41a and 41b is registered in association with information on the targets affected by those failures.

図７は、障害管理情報４３ａの一例を示す図である。障害管理情報４３ａは、プロセス情報、障害情報、影響対象情報を含む。 FIG. 7 is a diagram showing an example of the fault management information 43a. The fault management information 43a includes process information, fault information, and affected target information.

プロセス情報は、第１のＶＭ２０において実行されるプロセスの識別情報、および、第１のＶＭ２０の外部において実行されるプロセスであることを示す情報を含む。障害情報は、これらのプロセスの障害の情報ある。影響対象情報は、当該障害が発生した場合に影響を受ける対象の情報である。 The process information includes identification information of the process executed in the first VM 20 and information indicating that the process is executed outside the first VM 20 . The failure information is information on failures of these processes. The affected target information is information about the target that will be affected when the failure occurs.

障害情報における「ＴＡＲＧＥＴ」は、障害が仮想マシン（ＶＭ）、コンテナ、プロセスのどれに発生したかを示す情報である。「ＣＯＮＴＡＩＮＥＲ」は、障害が発生したコンテナを示す識別情報である。「ＰＲＯＣ」は、障害が発生したプロセスを示す識別情報である。 "TARGET" in the fault information is information indicating whether the fault occurred in a virtual machine (VM), container, or process. "CONTAINER" is identification information indicating a failed container. "PROC" is identification information indicating the process in which the failure occurred.

例えば、「ＣＯＮＴＡＩＮＥＲ２＿ＭＮＧ＿ＰＲＯＣＥＳＳ」というプロセス情報には、「ＴＡＲＧＥＴ：ＣＯＮＴＡＩＮＥＲ，ＣＯＮＴＡＩＮＥＲ：２」という障害情報、および、「ＰＲＯＣＥＳＳ１３」という影響対象情報が対応付けて登録されている。 For example, process information "CONTAINER2_MNG_PROCESS" is registered in association with failure information "TARGET:CONTAINER, CONTAINER:2" and affected object information "PROCESS13".

これは、「ＣＯＮＴＡＩＮＥＲ２＿ＭＮＧ＿ＰＲＯＣＥＳＳ」というコンテナ管理プロセス４１ｂに障害が発生した場合、その障害の発生により「ＰＲＯＣＥＳＳ１３」というプロセスを再起動する必要があることを示している。 This indicates that when a failure occurs in the container management process 41b called "CONTAINER2_MNG_PROCESS", the process called "PROCESS13" must be restarted due to the occurrence of the failure.

また、「ＰＲＯＣＥＳＳ１３」というプロセス情報には、「ＴＡＲＧＥＴ：ＰＲＯＣ，ＣＯＮＴＡＩＮＥＲ：－，ＰＲＯＣ：ＰＲＯＣＥＳＳ１３」という障害情報が対応付けて登録されている。 Also, the process information "PROCESS13" is registered in association with the failure information "TARGET:PROC, CONTAINER:-, PROC:PROCESS13".

この障害情報は、障害が発生するプロセスが「ＰＲＯＣＥＳＳ１３」であることを示している。なお、「ＰＲＯＣＥＳＳ１３」というプロセスには、影響対象情報が登録されていないので、「ＰＲＯＣＥＳＳ１３」というプロセスに障害が発生した場合に再起動する必要がある対象はない。 This fault information indicates that the process in which the fault occurs is "PROCESS13". It should be noted that no affected target information is registered for the process "PROCESS13", so there is no target that needs to be restarted when a failure occurs in the process "PROCESS13".

また、「ＥＸＴＥＲＮＡＬ＿ＥＲＲＯＲ」というプロセス情報には、「ＴＡＲＧＥＴ：ＰＲＯＣ，ＣＯＮＴＡＩＮＥＲ：２，ＰＲＯＣ：ＰＲＯＣＥＳＳ３２」という障害情報（外部障害情報）、および、「ＰＲＯＣＥＳＳ１３」という影響対象情報が対応付けて登録されている。 Further, the process information "EXTERNAL_ERROR" is registered in association with the failure information (external failure information) "TARGET: PROC, CONTAINER: 2, PROC: PROCESS32" and the affected information "PROCESS13". .

これは、「ＰＲＯＣＥＳＳ３２」という第２のコンテナ６０のプロセスに障害が発生した場合、その障害の発生により「ＰＲＯＣＥＳＳ１３」というプロセスを再起動する必要があることを示している。 This indicates that when a failure occurs in the process of the second container 60 called "PROCESS32", it is necessary to restart the process called "PROCESS13" due to the occurrence of the failure.

前述のように、「ＰＲＯＣＥＳＳ１３」というプロセスには、影響対象情報が登録されていないので、「ＰＲＯＣＥＳＳ１３」というプロセスに障害が発生した場合に再起動する必要がある対象はない。 As described above, no affected target information is registered for the process "PROCESS13", so there is no target that needs to be restarted when a failure occurs in the process "PROCESS13".
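The rows of the failure management information 43a described above can be written down as plain data, and the failure information strings split into fields. This is a sketch only: the list-of-dictionaries layout, the names `FAILURE_TABLE_43A`, `parse_failure_info`, and `affected_targets`, and the assumption that fields are separated by commas and colons exactly as printed in the text are all introduced here for illustration. Only the three rows mentioned in the description are included; FIG. 7 itself may contain others.

```python
from typing import Dict, List

# Rows of the failure management information 43a that the text describes.
FAILURE_TABLE_43A: List[dict] = [
    {"process": "CONTAINER2_MNG_PROCESS",
     "failure": "TARGET:CONTAINER,CONTAINER:2",
     "affected": ["PROCESS13"]},
    {"process": "PROCESS13",
     "failure": "TARGET:PROC,CONTAINER:-,PROC:PROCESS13",
     "affected": []},
    {"process": "EXTERNAL_ERROR",
     "failure": "TARGET:PROC,CONTAINER:2,PROC:PROCESS32",
     "affected": ["PROCESS13"]},
]


def parse_failure_info(text: str) -> Dict[str, str]:
    """Split 'KEY:VALUE,KEY:VALUE' failure information into a dictionary."""
    fields: Dict[str, str] = {}
    for part in text.split(","):
        key, _, value = part.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def affected_targets(rows: List[dict], process: str) -> List[str]:
    """Return the affected targets registered for a process information entry."""
    for row in rows:
        if row["process"] == process:
            return list(row["affected"])
    return []
```

For example, `parse_failure_info("TARGET:PROC,CONTAINER:2,PROC:PROCESS32")` yields the three fields `TARGET`, `CONTAINER`, and `PROC`, and `affected_targets(FAILURE_TABLE_43A, "PROCESS13")` is empty, matching the statement that nothing needs to be restarted when "PROCESS13" fails.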

図６の説明に戻ると、障害情報通信部４４は、第１のコンテナ５０の障害情報通信部５４、および、第２のコンテナ６０の障害情報通信部６４と通信を行う。 Returning to the description of FIG. 6, the failure information communication unit 44 communicates with the failure information communication unit 54 of the first container 50 and the failure information communication unit 64 of the second container 60.

例えば、障害情報通信部４４は、プロセス４０ａ～４０ｌ、または、コンテナ管理プロセス４１ａ、４１ｂに障害が発生した場合に、障害が発生したプロセスに対応する図７に示した障害情報を障害情報通信部５４および障害情報通信部６４に送信する。 For example, when a failure occurs in the processes 40a to 40l or the container management processes 41a and 41b, the failure information communication unit 44 transmits the failure information shown in FIG. 7 that corresponds to the failed process to the failure information communication unit 54 and the failure information communication unit 64.

また、障害情報通信部４４は、障害情報通信部５４、および、障害情報通信部６４から送信される障害情報を受信する。 The failure information communication unit 44 also receives failure information transmitted from the failure information communication unit 54 and the failure information communication unit 64.
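The exchange among the three failure information communication units amounts to each unit delivering its failure information to the other two. The sketch below models that with an in-memory stand-in. The class `FailureBus`, its methods, and the use of reference numerals as unit names are hypothetical, and the source does not say which transport the units actually use.

```python
from typing import Callable, Dict


class FailureBus:
    """Hypothetical in-memory stand-in for the links between communication units."""

    def __init__(self) -> None:
        # Maps a unit name to the handler called when that unit receives data.
        self._handlers: Dict[str, Callable[[str, str], None]] = {}

    def register(self, unit: str, handler: Callable[[str, str], None]) -> None:
        """Attach a unit; `handler(sender, failure_info)` runs on reception."""
        self._handlers[unit] = handler

    def broadcast(self, sender: str, failure_info: str) -> None:
        """Deliver failure information to every unit except the sender."""
        for unit, handler in self._handlers.items():
            if unit != sender:
                handler(sender, failure_info)
```

With units "44", "54", and "64" registered, a broadcast from "44" reaches only "54" and "64", which mirrors the transmission described above.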

プロセス管理部４５は、プロセス４０ａ～４０ｌ、および、コンテナ管理プロセス４１ａ、４１ｂを管理する。例えば、プロセス管理部４５は、プロセス監視部４２からの通知により、プロセス４０ａ～４０ｌ、または、コンテナ管理プロセス４１ａ、４１ｂに障害が発生したことを検出する。 The process management unit 45 manages the processes 40a-40l and the container management processes 41a and 41b. For example, the process management unit 45 detects from the notification from the process monitoring unit 42 that a failure has occurred in the processes 40a to 40l or the container management processes 41a and 41b.

また、プロセス管理部４５は、プロセス４０ａ～４０ｌ、または、コンテナ管理プロセス４１ａ、４１ｂからの障害メッセージを受信することによりプロセス４０ａ～４０ｌ、または、コンテナ管理プロセス４１ａ、４１ｂに障害が発生したことを検出する。 The process management unit 45 also detects that a failure has occurred in the processes 40a to 40l or the container management processes 41a and 41b by receiving a failure message from those processes.

そして、プロセス管理部４５は、障害管理情報４３ａを参照し、障害が発生したプロセスに対応付けて登録されている障害情報および影響対象情報を取得する。 Then, the process management unit 45 refers to the failure management information 43a and acquires the failure information and the affected target information registered in association with the process in which the failure occurred.

例えば、図７の例において、障害が発生したプロセスが「ＣＯＮＴＡＩＮＥＲ２＿ＭＮＧ＿ＰＲＯＣＥＳＳ」である場合、プロセス管理部４５は、「ＴＡＲＧＥＴ：ＣＯＮＴＡＩＮＥＲ，ＣＯＮＴＡＩＮＥＲ：２」という障害情報、および、「ＰＲＯＣＥＳＳ１３」という影響対象情報を取得する。 For example, in the example of FIG. 7, if the failed process is "CONTAINER2_MNG_PROCESS", the process management unit 45 acquires the failure information "TARGET:CONTAINER, CONTAINER:2" and the affected target information "PROCESS13".

その後、プロセス管理部４５は、「ＣＯＮＴＡＩＮＥＲ２＿ＭＮＧ＿ＰＲＯＣＥＳＳ」のプロセスのように、障害情報が登録されている場合、その障害情報を第１のコンテナ５０、および、第２のコンテナ６０に送信するよう障害情報通信部４４に指示する。 After that, when failure information is registered, as it is for the "CONTAINER2_MNG_PROCESS" process, the process management unit 45 instructs the failure information communication unit 44 to transmit that failure information to the first container 50 and the second container 60.

また、プロセス管理部４５は、「ＣＯＮＴＡＩＮＥＲ２＿ＭＮＧ＿ＰＲＯＣＥＳＳ」のプロセスのように、影響対象情報が登録されている場合、登録されている影響対象を再起動する。さらに、プロセス管理部４５は、障害が発生したプロセスを再起動する。 In addition, the process management unit 45 restarts the registered affected object when the affected object information is registered like the process of "CONTAINER2_MNG_PROCESS". Furthermore, the process management unit 45 restarts the failed process.

ここで、図７に示すように、影響対象情報として「ＰＲＯＣＥＳＳ１３」が登録されている場合、プロセス管理部４５は、「ＰＲＯＣＥＳＳ１３」のプロセスに対応する障害情報も第１のコンテナ５０、および、第２のコンテナ６０に送信するよう障害情報通信部４４に指示する。 Here, when "PROCESS13" is registered as the affected target information, as shown in FIG. 7, the process management unit 45 also instructs the failure information communication unit 44 to transmit the failure information corresponding to the "PROCESS13" process to the first container 50 and the second container 60.

なお、「ＰＲＯＣＥＳＳ１３」のプロセスに対応する影響対象情報は登録されていないため、プロセス管理部４５はプロセスの再起動を行わない。 Note that, because no affected target information is registered for the "PROCESS13" process, the process management unit 45 does not restart any further process.

さらに、プロセス管理部４５は、障害情報通信部４４が第１のコンテナ５０、または、第２のコンテナ６０から障害情報を受信した場合に、その障害情報を取得し、その障害情報が障害管理情報４３ａに含まれる障害情報に対応するものであるか否かを判定する。 Furthermore, when the failure information communication unit 44 receives failure information from the first container 50 or the second container 60, the process management unit 45 acquires that failure information and determines whether it corresponds to failure information included in the failure management information 43a.

そして、プロセス管理部４５は、その障害情報が障害管理情報４３ａに含まれる障害情報に対応するものである場合、その障害情報に対応する影響対象情報が登録されているか否かを判定する。 Then, when the failure information corresponds to the failure information included in the failure management information 43a, the process management unit 45 determines whether or not the affected information corresponding to the failure information is registered.

その障害情報に対応する影響対象情報が登録されている場合、プロセス管理部４５は、登録されている影響対象を再起動する。 If affected target information corresponding to the failure information is registered, the process management unit 45 restarts the registered affected target.

例えば、障害情報通信部４４が第２のコンテナ６０から、障害情報として「ＴＡＲＧＥＴ：ＰＲＯＣ，ＣＯＮＴＡＩＮＥＲ：２，ＰＲＯＣ：ＰＲＯＣＥＳＳ３２」という情報を受信したものとする。 For example, it is assumed that the failure information communication unit 44 has received information "TARGET: PROC, CONTAINER: 2, PROC: PROCESS 32" from the second container 60 as failure information.

この場合、この障害情報は、図７に示した障害管理情報４３ａの「ＥＸＴＥＲＮＡＬ＿ＥＲＲＯＲ」の障害情報（外部障害情報）に対応するため、プロセス管理部４５は、登録されている影響対象である「ＰＲＯＣＥＳＳ１３」のプロセスを再起動する。 In this case, this failure information corresponds to the failure information (external failure information) of "EXTERNAL_ERROR" in the failure management information 43a shown in FIG. 7, so the process management unit 45 restarts the "PROCESS13" process, which is the registered affected target.

また、「ＰＲＯＣＥＳＳ１３」には、障害情報が登録されているので、プロセス管理部４５は、その障害情報を第１のコンテナ５０に送信するよう障害情報通信部４４に指示するとともに、影響対象情報として登録されている「ＰＲＯＣＥＳＳ１３」のプロセスを再起動する。 Also, since failure information is registered for "PROCESS13", the process management unit 45 instructs the failure information communication unit 44 to transmit that failure information to the first container 50, and restarts the "PROCESS13" process registered as the affected target information.

なお、「ＰＲＯＣＥＳＳ１３」のプロセスに対応する影響対象情報は登録されていないため、プロセス管理部４５はプロセスの再起動は行わない。 Note that, because no affected target information is registered for the "PROCESS13" process, the process management unit 45 does not restart any further process.

プロセス５１ａ～５１ｎを含むプロセス群が実行される第１のコンテナ５０は、プロセス監視部５２、障害情報記憶部５３、障害情報通信部５４、および、プロセス管理部５５を備える。プロセス監視部５２、障害情報通信部５４、および、プロセス管理部５５は、それぞれ個別のプロセスとして実装されても良いが、それぞれの役割を備えた単一のプロセスとして実装されても良い。 The first container 50, in which a process group including the processes 51a to 51n is executed, includes a process monitoring unit 52, a failure information storage unit 53, a failure information communication unit 54, and a process management unit 55. The process monitoring unit 52, the failure information communication unit 54, and the process management unit 55 may each be implemented as a separate process, or may be implemented as a single process that fulfills all of their roles.

プロセス監視部５２、障害情報通信部５４、および、プロセス管理部５５の機能は、プロセッサにより実現される。また、障害情報記憶部５３の機能は、メモリなどの記憶装置により実現される。 Functions of the process monitoring unit 52, the fault information communication unit 54, and the process management unit 55 are realized by the processor. Also, the function of the failure information storage unit 53 is implemented by a storage device such as a memory.

プロセス監視部５２は、プロセス５１ａ～５１ｎを含むプロセス群で発生する障害を監視する。例えば、プロセス監視部５２は、プロセス５１ａ～５１ｎに対してハートビートメッセージを送信し、応答がないプロセスを、後述するプロセス管理部５５に通知する。 The process monitoring unit 52 monitors failures occurring in a process group including the processes 51a to 51n. For example, the process monitoring unit 52 transmits heartbeat messages to the processes 51a to 51n, and notifies the process management unit 55 (described later) of any process that does not respond.

障害情報記憶部５３は、プロセス５１ａ～５１ｎを含むプロセス群において発生する障害の情報と、当該障害の影響対象の情報とを対応付けて登録した障害管理情報５３ａを記憶する。 The failure information storage unit 53 stores failure management information 53a in which information about failures that occur in a process group including the processes 51a to 51n and information about targets affected by the failures are associated and registered.

図８は、障害管理情報５３ａの一例を示す図である。障害管理情報５３ａは、プロセス情報、障害情報、影響対象情報を含む。 FIG. 8 is a diagram showing an example of the failure management information 53a. The fault management information 53a includes process information, fault information, and affected target information.

プロセス情報は、第１のコンテナ５０において実行されるプロセスの識別情報、および、第１のコンテナ５０の外部において実行されるプロセスであることを示す情報を含む。なお、図８の例では、後者の情報は登録されていない。障害情報は、それらのプロセスの障害の情報ある。影響対象情報は、当該障害が発生した場合に影響を受ける対象の情報である。 The process information includes identification information of the process executed in the first container 50 and information indicating that the process is executed outside the first container 50 . In addition, in the example of FIG. 8, the latter information is not registered. The failure information is information on failures of those processes. The affected target information is information about the target that will be affected when the failure occurs.

障害情報における「ＴＡＲＧＥＴ」、「ＣＯＮＴＡＩＮＥＲ」、および、「ＰＲＯＣ」は、図７に示した障害管理情報４３ａの障害情報における「ＴＡＲＧＥＴ」、「ＣＯＮＴＡＩＮＥＲ」、および、「ＰＲＯＣ」と同様の情報である。 "TARGET", "CONTAINER" and "PROC" in the fault information are information similar to "TARGET", "CONTAINER" and "PROC" in the fault information of the fault management information 43a shown in FIG. .

例えば、「ＰＲＯＣＥＳＳ２１」というプロセス情報には、「ＴＡＲＧＥＴ：ＰＲＯＣ，ＣＯＮＴＡＩＮＥＲ：１，ＰＲＯＣ：ＰＲＯＣＥＳＳ２１」という障害情報、および、「ＰＲＯＣＥＳＳ２２」という影響対象情報が対応付けて登録されている。 For example, process information "PROCESS21" is registered in association with failure information "TARGET: PROC, CONTAINER: 1, PROC: PROCESS21" and affected target information "PROCESS22".

これは、第１のコンテナ５０における「ＰＲＯＣＥＳＳ２１」というプロセスに障害が発生した場合、その障害の発生により「ＰＲＯＣＥＳＳ２２」というプロセスを再起動する必要があることを示している。 This indicates that when a failure occurs in the process "PROCESS21" in the first container 50, the process "PROCESS22" must be restarted due to the occurrence of the failure.

また、「ＰＲＯＣＥＳＳ２２」というプロセス情報には、「ＴＡＲＧＥＴ：ＰＲＯＣ，ＣＯＮＴＡＩＮＥＲ：１，ＰＲＯＣ：ＰＲＯＣＥＳＳ２２」という障害情報が対応付けて登録されている。 Further, process information "PROCESS22" is registered in association with failure information "TARGET: PROC, CONTAINER: 1, PROC: PROCESS22".

この障害情報は、第１のコンテナ５０において障害が発生するプロセスが「ＰＲＯＣＥＳＳ２２」であることを示している。なお、「ＰＲＯＣＥＳＳ２２」というプロセスには、影響対象情報が登録されていないので、「ＰＲＯＣＥＳＳ２２」というプロセスに障害が発生した場合に再起動する必要がある対象はない。 This fault information indicates that the process in which the fault occurs in the first container 50 is "PROCESS22". It should be noted that no affected target information is registered for the process "PROCESS22", so there is no target that needs to be restarted when a failure occurs in the process "PROCESS22".

図６の説明に戻ると、障害情報通信部５４は、障害情報通信部４４、および、障害情報通信部６４と通信を行う。 Returning to the description of FIG. 6, the failure information communication unit 54 communicates with the failure information communication unit 44 and the failure information communication unit 64.

例えば、障害情報通信部５４は、第１のコンテナ５０におけるプロセス５１ａ～５１ｎに障害が発生した場合に、障害が発生したプロセスに対応する図８に示した障害情報を障害情報通信部４４および障害情報通信部６４に送信する。また、障害情報通信部５４は、障害情報通信部４４および障害情報通信部６４から送信される障害情報を受信する。 For example, when a failure occurs in the processes 51a to 51n in the first container 50, the failure information communication unit 54 transmits the failure information shown in FIG. 8 that corresponds to the failed process to the failure information communication unit 44 and the failure information communication unit 64. The failure information communication unit 54 also receives failure information transmitted from the failure information communication unit 44 and the failure information communication unit 64.

プロセス管理部５５は、第１のコンテナ５０におけるプロセス５１ａ～５１ｎを管理する。例えば、プロセス管理部５５は、プロセス監視部５２からの通知により、プロセス５１ａ～５１ｎに障害が発生したことを検出する。また、プロセス管理部５５は、プロセス５１ａ～５１ｎからの障害メッセージを受信することによりプロセス５１ａ～５１ｎに障害が発生したことを検出する。 The process management unit 55 manages the processes 51a to 51n in the first container 50. For example, the process management unit 55 detects from a notification from the process monitoring unit 52 that a failure has occurred in the processes 51a to 51n. The process management unit 55 also detects that a failure has occurred in the processes 51a to 51n by receiving failure messages from the processes 51a to 51n.

そして、プロセス管理部５５は、障害管理情報５３ａを参照し、障害が発生したプロセスに対応付けて登録されている障害情報および影響対象情報を取得する。 Then, the process management unit 55 refers to the failure management information 53a and acquires the failure information and the affected target information registered in association with the process in which the failure occurred.

例えば、図８の例において、障害が発生したプロセスが「ＰＲＯＣＥＳＳ２１」である場合、プロセス管理部５５は、「ＴＡＲＧＥＴ：ＰＲＯＣ，ＣＯＮＴＡＩＮＥＲ：１，ＰＲＯＣ：ＰＲＯＣＥＳＳ２１」という障害情報、および、「ＰＲＯＣＥＳＳ２２」という影響対象情報を取得する。 For example, in the example of FIG. 8, if the process in which the fault has occurred is "PROCESS21", the process management unit 55 outputs the fault information "TARGET: PROC, CONTAINER: 1, PROC: PROCESS21" and the fault information "PROCESS22". Get affected information.

その後、プロセス管理部５５は、「ＰＲＯＣＥＳＳ２１」のプロセスのように、障害情報が登録されている場合、その障害情報を障害情報通信部４４および障害情報通信部６４に送信するよう障害情報通信部５４に指示する。 After that, when failure information is registered, as it is for the "PROCESS21" process, the process management unit 55 instructs the failure information communication unit 54 to transmit that failure information to the failure information communication unit 44 and the failure information communication unit 64.

また、プロセス管理部５５は、「ＰＲＯＣＥＳＳ２１」のプロセスのように、影響対象情報が登録されている場合、登録されている影響対象を再起動する。さらに、プロセス管理部５５は、障害が発生したプロセスを再起動する。 In addition, the process management unit 55 restarts the registered affected object when the affected object information is registered like the process of "PROCESS21". Furthermore, the process management unit 55 restarts the failed process.

ここで、図８に示すように、影響対象情報として「ＰＲＯＣＥＳＳ２２」が登録されている場合、プロセス管理部５５は、「ＰＲＯＣＥＳＳ２２」のプロセスに対応する障害情報も障害情報通信部４４および障害情報通信部６４に送信するよう障害情報通信部５４に指示する。 Here, when "PROCESS22" is registered as the affected target information, as shown in FIG. 8, the process management unit 55 also instructs the failure information communication unit 54 to transmit the failure information corresponding to the "PROCESS22" process to the failure information communication unit 44 and the failure information communication unit 64.

なお、「ＰＲＯＣＥＳＳ２２」のプロセスに対応する影響対象情報は登録されていないため、プロセス管理部５５はプロセスの再起動を行わない。 Note that, because no affected target information is registered for the "PROCESS22" process, the process management unit 55 does not restart any further process.
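The "PROCESS21" example shows that handling propagates: the failure information of each restarted affected target is transmitted as well, and its own affected targets would be restarted in turn. A minimal sketch of that propagation follows. The function name `recover`, the table layout, and the callbacks are assumptions, and the guards that keep a process from being notified or restarted twice are an addition made here for safety against cyclic registrations, not something the source describes.

```python
from typing import Callable, Dict, Set


def recover(
    table: Dict[str, dict],
    failed: str,
    send: Callable[[str], None],
    restart: Callable[[str], None],
) -> None:
    """Transmit failure information and restart affected targets transitively."""
    notified: Set[str] = set()
    restarted: Set[str] = set()

    def restart_once(name: str) -> None:
        # Added guard (assumption): never restart the same process twice.
        if name not in restarted:
            restarted.add(name)
            restart(name)

    def propagate(name: str) -> None:
        if name in notified:
            return
        notified.add(name)
        row = table.get(name)
        if row is None:
            return
        # Transmit the failure information registered for this process.
        if row.get("failure"):
            send(row["failure"])
        # Restart each affected target, then handle its own registrations.
        for target in row.get("affected", []):
            restart_once(target)
            propagate(target)

    propagate(failed)
    # Finally, restart the process in which the failure occurred.
    restart_once(failed)
```

With the two rows of FIG. 8 described in the text, a failure of "PROCESS21" transmits the failure information of "PROCESS21" and then of "PROCESS22", restarts "PROCESS22", and finally restarts "PROCESS21".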

さらに、プロセス管理部５５は、障害情報通信部５４が障害情報通信部４４または障害情報通信部６４から障害情報を受信した場合に、その障害情報を取得し、その障害情報が障害管理情報５３ａに含まれる障害情報に対応するものであるか否かを判定する。 Furthermore, when the fault information communication unit 54 receives fault information from the fault information communication unit 44 or the fault information communication unit 64, the process management unit 55 acquires the fault information, and stores the fault information in the fault management information 53a. It is determined whether or not it corresponds to the included fault information.

そして、プロセス管理部５５は、その障害情報が障害管理情報５３ａに含まれる障害情報に対応するものである場合、その障害情報に対応する影響対象情報が登録されているか否かを判定する。 Then, when the fault information corresponds to the fault information included in the fault management information 53a, the process management unit 55 determines whether or not the affected information corresponding to the fault information is registered.

その障害情報に対応する影響対象情報が登録されている場合、プロセス管理部５５は、登録されている影響対象を再起動する。 If affected target information corresponding to the failure information is registered, the process management unit 55 restarts the registered affected target.

また、プロセス６１ａ～６１ｍを含むプロセス群が実行される第２のコンテナ６０は、プロセス監視部６２、障害情報記憶部６３、障害情報通信部６４、および、プロセス管理部６５を備える。プロセス監視部６２、障害情報通信部６４、および、プロセス管理部６５は、それぞれ個別のプロセスとして実装されても良いが、それぞれの役割を備えた単一のプロセスとして実装されても良い。 The second container 60, in which a process group including the processes 61a to 61m is executed, includes a process monitoring unit 62, a failure information storage unit 63, a failure information communication unit 64, and a process management unit 65. The process monitoring unit 62, the failure information communication unit 64, and the process management unit 65 may each be implemented as a separate process, or may be implemented as a single process that fulfills all of their roles.

プロセス監視部６２、障害情報通信部６４、および、プロセス管理部６５の機能は、プロセッサにより実現される。また、障害情報記憶部６３の機能は、メモリなどの記憶装置により実現される。 Functions of the process monitoring unit 62, the fault information communication unit 64, and the process management unit 65 are realized by the processor. Also, the function of the fault information storage unit 63 is implemented by a storage device such as a memory.

プロセス監視部６２は、プロセス６１ａ～６１ｍを含むプロセス群で発生する障害を監視する。例えば、プロセス監視部６２は、プロセス６１ａ～６１ｍのハートビートメッセージを送信し、応答がないプロセスを、後述するプロセス管理部６５に通知する。 The process monitoring unit 62 monitors failures occurring in the process group including the processes 61a to 61m. For example, the process monitoring unit 62 transmits heartbeat messages to the processes 61a to 61m, and notifies the process management unit 65 (described later) of any process that does not respond.

障害情報記憶部６３は、プロセス６１ａ～６１ｍを含むプロセス群において発生する障害の情報と、当該障害の影響対象の情報とを対応付けて登録した障害管理情報６３ａを記憶する。 The failure information storage unit 63 stores failure management information 63a in which information about failures that occur in a process group including the processes 61a to 61m and information about targets affected by the failures are associated and registered.

図９は、障害管理情報６３ａの一例を示す図である。障害管理情報６３ａは、プロセス情報、障害情報、影響対象情報を含む。 FIG. 9 is a diagram showing an example of the fault management information 63a. The fault management information 63a includes process information, fault information, and affected target information.

プロセス情報は、第２のコンテナ６０において実行されるプロセスの識別情報、および、第２のコンテナ６０の外部において実行されるプロセスであることを示す情報を含む。障害情報は、それらのプロセスの障害の情報ある。影響対象情報は、当該障害が発生した場合に影響を受ける対象の情報である。 The process information includes identification information of the process executed in the second container 60 and information indicating that the process is executed outside the second container 60 . The failure information is information on failures of those processes. The affected target information is information about the target that will be affected when the failure occurs.

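ここで、図９の上から４番目にある「ＥＸＴＥＲＮＡＬ＿ＥＲＲＯＲ」というプロセス情報には、「ＴＡＲＧＥＴ：ＣＯＮＴＡＩＮＥＲ，ＣＯＮＴＡＩＮＥＲ：１」という障害情報（外部障害情報）、および、「ＰＲＯＣＥＳＳ３１，ＰＲＯＣＥＳＳ３２」という影響対象情報が対応付けて登録されている。 Here, the process information "EXTERNAL_ERROR", fourth from the top in FIG. 9, is registered in association with the failure information (external failure information) "TARGET: CONTAINER, CONTAINER: 1" and the affected target information "PROCESS31, PROCESS32".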

これは、第２のコンテナ６０の外部のコンテナ管理プロセス４１ａに障害が発生した場合、その障害の発生により「ＰＲＯＣＥＳＳ３１」、「ＰＲＯＣＥＳＳ３２」というプロセスを再起動する必要があることを示している。 This indicates that when a failure occurs in the container management process 41a outside the second container 60, the processes "PROCESS31" and "PROCESS32" must be restarted due to the occurrence of the failure.
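As an illustration of how one row of the failure management information 63a could be held in memory, the sketch below models the "EXTERNAL_ERROR" row just described and looks up the targets to restart when matching external failure information arrives. The class `FailureEntry`, the function `targets_to_restart`, and the comma-and-colon string encoding without spaces are assumptions for this example; the patent does not specify a storage format.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class FailureEntry:
    """One row: process information, failure information, affected targets."""
    process_info: str
    failure_info: Optional[str]
    affected_targets: Tuple[str, ...] = ()


# The only row whose contents are stated at this point in the description.
EXTERNAL_ROW = FailureEntry(
    process_info="EXTERNAL_ERROR",
    failure_info="TARGET:CONTAINER,CONTAINER:1",
    affected_targets=("PROCESS31", "PROCESS32"),
)


def targets_to_restart(table: List[FailureEntry], received: str) -> List[str]:
    """Return the affected targets registered for the received failure information."""
    for entry in table:
        if entry.failure_info == received:
            return list(entry.affected_targets)
    return []  # no matching row: nothing to restart


restart_list = targets_to_restart([EXTERNAL_ROW], "TARGET:CONTAINER,CONTAINER:1")
```

An exact string match is the simplest reading of "corresponds to"; a real table might compare parsed fields instead.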

図６の説明に戻ると、障害情報通信部６４は、障害情報通信部４４、および、障害情報通信部５４と通信を行う。 Returning to the explanation of FIG. 6 , the fault information communication section 64 communicates with the fault information communication section 44 and the fault information communication section 54 .

例えば、障害情報通信部６４は、第２のコンテナ６０におけるプロセス６１ａ～６１ｍに障害が発生した場合に、障害が発生したプロセスに対応する図９に示した障害情報を障害情報通信部４４および障害情報通信部５４に送信する。また、障害情報通信部６４は、障害情報通信部４４または障害情報通信部５４から送信される障害情報を受信する。 For example, when a failure occurs in any of the processes 61a to 61m in the second container 60, the failure information communication unit 64 sends the failure information shown in FIG. 9 that corresponds to the failed process to the failure information communication unit 44 and the failure information communication unit 54. The failure information communication unit 64 also receives failure information sent from the failure information communication unit 44 or the failure information communication unit 54.

プロセス管理部６５は、第２のコンテナ６０におけるプロセス６１ａ～６１ｍを管理する。 The process management unit 65 manages the processes 61a to 61m in the second container 60.

例えば、プロセス管理部６５は、プロセス監視部６２からの通知により、プロセス６１ａ～６１ｍに障害が発生したことを検出する。また、プロセス管理部６５は、プロセス６１ａ～６１ｍからの障害メッセージを受信することによりプロセス６１ａ～６１ｍに障害が発生したことを検出する。 For example, the process management unit 65 detects from the notification from the process monitoring unit 62 that a failure has occurred in the processes 61a to 61m. Also, the process management unit 65 detects that a failure has occurred in the processes 61a to 61m by receiving failure messages from the processes 61a to 61m.

そして、プロセス管理部６５は、障害管理情報６３ａを参照し、障害が発生したプロセスに対応付けて登録されている障害情報および影響対象情報を取得する。 Then, the process management unit 65 refers to the failure management information 63a and acquires the failure information and the affected target information registered in association with the process in which the failure occurred.

例えば、図９の例において、障害が発生したプロセスが「ＰＲＯＣＥＳＳ３２」である場合、プロセス管理部６５は、「ＴＡＲＧＥＴ：ＰＲＯＣ，ＣＯＮＴＡＩＮＥＲ：２，ＰＲＯＣ：ＰＲＯＣＥＳＳ３２」という障害情報を取得する。なお、図９の例では、「ＰＲＯＣＥＳＳ３２」のプロセスには、影響対象情報は登録されていない。 For example, in the example of FIG. 9, if the failed process is "PROCESS32", the process management unit 65 acquires failure information "TARGET: PROC, CONTAINER: 2, PROC: PROCESS32". In the example of FIG. 9, influence target information is not registered in the process "PROCESS32".

その後、プロセス管理部６５は、「ＰＲＯＣＥＳＳ３２」のプロセスのように、障害情報が登録されている場合、その障害情報を障害情報通信部４４および障害情報通信部５４に送信するよう障害情報通信部６４に指示する。 After that, when failure information is registered, as it is for the process "PROCESS32", the process management unit 65 instructs the failure information communication unit 64 to send that failure information to the failure information communication unit 44 and the failure information communication unit 54.

また、プロセス管理部６５は、影響対象情報が登録されている場合、登録されている影響対象を再起動する。さらに、プロセス管理部６５は、障害が発生したプロセスを再起動する。 In addition, when affected target information is registered, the process management unit 65 restarts the registered affected targets. Furthermore, the process management unit 65 restarts the failed process.

ここで、図９に示すように、影響対象情報として「ＰＲＯＣＥＳＳ３１」が登録されている場合、プロセス管理部６５は、「ＰＲＯＣＥＳＳ３１」のプロセスに対応する障害情報も障害情報通信部４４および障害情報通信部５４に送信するよう障害情報通信部６４に指示する。 Here, when "PROCESS31" is registered as affected target information, as shown in FIG. 9, the process management unit 65 also instructs the failure information communication unit 64 to send the failure information corresponding to the process "PROCESS31" to the failure information communication unit 44 and the failure information communication unit 54.

なお、図９の例では、「ＰＲＯＣＥＳＳ３１」のプロセスに対応する障害情報は登録されていないため、障害情報の送信処理は行われない。障害が発生しても他へ影響しないプロセスの場合、障害情報を登録しないことで不必要な障害情報の送信処理の抑止が可能となる。また、「ＰＲＯＣＥＳＳ３１」のプロセスに対応する影響対象情報も登録されていないため、プロセス管理部６５はプロセスの再起動を行わない。 In the example of FIG. 9, since the failure information corresponding to the process "PROCESS31" is not registered, the failure information transmission process is not performed. In the case of a process that does not affect others even if a failure occurs, it is possible to suppress unnecessary failure information transmission processing by not registering failure information. In addition, since the affected object information corresponding to the process "PROCESS31" is not registered, the process management unit 65 does not restart the process.

さらに、プロセス管理部６５は、障害情報通信部６４が障害情報通信部４４または障害情報通信部５４から障害情報を受信した場合に、その障害情報を取得し、その障害情報が障害管理情報６３ａに含まれる障害情報に対応するものであるか否かを判定する。 Furthermore, when the failure information communication unit 64 receives failure information from the failure information communication unit 44 or the failure information communication unit 54, the process management unit 65 acquires that failure information and determines whether it corresponds to failure information included in the failure management information 63a.

そして、プロセス管理部６５は、その障害情報が障害管理情報６３ａに含まれる障害情報に対応するものである場合、その障害情報に対応する影響対象情報が登録されているか否かを判定する。 Then, when the failure information corresponds to the failure information included in the failure management information 63a, the process management unit 65 determines whether or not the affected information corresponding to the failure information is registered.

その障害情報に対応する影響対象情報が登録されている場合、プロセス管理部６５は、登録されている影響対象を再起動する。 If affected target information corresponding to the fault information is registered, the process management unit 65 restarts the registered affected target.

例えば、障害情報通信部６４が、障害情報通信部４４から、障害情報として「ＴＡＲＧＥＴ：ＣＯＮＴＡＩＮＥＲ，ＣＯＮＴＡＩＮＥＲ：１」という情報を受信したものとする。 For example, it is assumed that the failure information communication unit 64 receives information "TARGET: CONTAINER, CONTAINER: 1" from the failure information communication unit 44 as failure information.

この場合、この障害情報は、図９に示した障害管理情報６３ａの上から４番目に示された「ＥＸＴＥＲＮＡＬ＿ＥＲＲＯＲ」の障害情報（外部障害情報）に対応するため、プロセス管理部６５は、登録されている影響対象である「ＰＲＯＣＥＳＳ３１」、「ＰＲＯＣＥＳＳ３２」のプロセスを再起動する。 In this case, the received failure information corresponds to the failure information (external failure information) of "EXTERNAL_ERROR", shown fourth from the top of the failure management information 63a in FIG. 9, so the process management unit 65 restarts the processes "PROCESS31" and "PROCESS32", which are the registered affected targets.

ここで、「ＰＲＯＣＥＳＳ３２」には、障害情報が登録されているので、プロセス管理部６５は、その障害情報を障害情報通信部４４および障害情報通信部５４に送信するよう障害情報通信部６４に指示するとともに、影響対象情報として登録されている「ＰＲＯＣＥＳＳ３１」のプロセスを再起動する。 Since the failure information is registered in "PROCESS32", the process management unit 65 instructs the failure information communication unit 64 to transmit the failure information to the failure information communication unit 44 and the failure information communication unit 54. At the same time, the process of "PROCESS31" registered as the affected information is restarted.

なお、図９の例では、「ＰＲＯＣＥＳＳ３１」のプロセスに対応する障害情報は登録されていないため、障害情報の送信処理は行われない。また、「ＰＲＯＣＥＳＳ３１」のプロセスに対応する影響対象情報も登録されていないため、プロセス管理部６５はプロセスの再起動を行わない。 In the example of FIG. 9, since the failure information corresponding to the process "PROCESS31" is not registered, the failure information transmission process is not performed. In addition, since the affected object information corresponding to the process "PROCESS31" is not registered, the process management unit 65 does not restart the process.

プロセス管理部４５、５５、６５がプロセスの障害を検出して行う再起動処理の処理手順は、図４で説明した処理手順と同様のものである。 The processing procedure of restart processing performed by the process management units 45, 55, and 65 upon detection of a process failure is the same as the processing procedure described with reference to FIG.

すなわち、図４に示すように、プロセス管理部４５、５５、６５は、プロセス監視部４２、５２、６２からの通知により、または、各プロセスから受信する障害メッセージにより、各プロセスに障害が発生したことを検出する（ステップＳ１０１）。 That is, as shown in FIG. 4, the process management units 45, 55, and 65 detect that a failure has occurred in a process, either through a notification from the process monitoring units 42, 52, and 62 or through a failure message received from the process itself (step S101).

続いて、プロセス管理部４５、５５、６５は、それぞれの障害管理情報４３ａ、５３ａ、６３ａを参照し、障害が発生したプロセスの情報をもとに、そのプロセスに対応する障害情報と影響対象情報とを障害管理情報４３ａ、５３ａ、６３ａから取得する処理を行う（ステップＳ１０２）。 Next, the process management units 45, 55, and 65 refer to their respective failure management information 43a, 53a, and 63a and, based on the information on the failed process, acquire the failure information and the affected target information corresponding to that process from the failure management information 43a, 53a, and 63a (step S102).

そして、プロセス管理部４５、５５、６５は、障害管理情報４３ａ、５３a、６３ａにそのプロセスに対応する障害情報が登録されていたか否かを判定する（ステップＳ１０３）。 Then, the process management units 45, 55, 65 determine whether or not fault information corresponding to the process is registered in the fault management information 43a, 53a, 63a (step S103).

障害管理情報４３ａ、５３ａ、６３ａにそのプロセスに対応する障害情報が登録されていた場合（ステップＳ１０３においてＹｅｓの場合）、プロセス管理部４５、５５、６５は、障害情報通信部４４、５４、６４に指示して、その障害情報を他の障害情報通信部４４、５４、６４に送信させる（ステップＳ１０４）。 If failure information corresponding to the process is registered in the failure management information 43a, 53a, and 63a (Yes in step S103), the process management units 45, 55, and 65 instruct the failure information communication units 44, 54, and 64 to send that failure information to the other failure information communication units 44, 54, and 64 (step S104).

その後、プロセス管理部４５、５５、６５は、それぞれの障害管理情報４３ａ、５３ａ、６３ａにそのプロセスに対応する影響対象情報が登録されていたか否かを判定する（ステップＳ１０５）。 After that, the process management units 45, 55, 65 determine whether or not affected information corresponding to the process is registered in the failure management information 43a, 53a, 63a (step S105).

障害管理情報４３ａ、５３ａ、６３ａにそのプロセスに対応する影響対象情報が登録されていた場合（ステップＳ１０５においてＹｅｓの場合）、プロセス管理部４５、５５、６５は、その影響対象情報に登録されている影響対象を再起動する（ステップＳ１０６）。 If affected target information corresponding to the process is registered in the failure management information 43a, 53a, and 63a (Yes in step S105), the process management units 45, 55, and 65 restart the affected targets registered in that affected target information (step S106).

なお、プロセス管理部４５、５５、６５は、障害管理情報４３ａ、５３ａ、６３ａにおいて、再起動される影響対象のプロセスに対応する障害情報が登録されている場合には、障害情報通信部４４、５４、６４に指示して、その障害情報を他の障害情報通信部４４、５４、６４に送信させ、そのプロセスに対応する影響対象情報が登録されていた場合には、その影響対象情報に登録されている影響対象を再起動する。 Note that, when failure information corresponding to an affected process being restarted is registered in the failure management information 43a, 53a, and 63a, the process management units 45, 55, and 65 instruct the failure information communication units 44, 54, and 64 to send that failure information to the other failure information communication units 44, 54, and 64. When affected target information corresponding to that process is also registered, they restart the affected targets registered in it.

さらに、プロセス管理部４５、５５、６５は、障害の発生が検出されたプロセスを再起動し（ステップＳ１０７）、この再起動処理を終了する。 Furthermore, the process management units 45, 55, 65 restart the process in which the occurrence of the failure is detected (step S107), and terminate this restart processing.

また、ステップＳ１０３において、障害管理情報４３ａ、５３ａ、６３ａにそのプロセスに対応する障害情報が登録されていなかった場合（ステップＳ１０３においてＮｏの場合）、または、ステップＳ１０５において、障害管理情報４３ａ、５３ａ、６３ａにそのプロセスに対応する影響対象情報が登録されていなかった場合（ステップＳ１０５においてＮｏの場合）、プロセス管理部４５、５５、６５は、障害の発生が検出されたプロセスを再起動し（ステップＳ１０７）、この再起動処理を終了する。 If failure information corresponding to the process is not registered in the failure management information 43a, 53a, and 63a in step S103 (No in step S103), or if affected target information corresponding to the process is not registered in the failure management information 43a, 53a, and 63a in step S105 (No in step S105), the process management units 45, 55, and 65 restart the process in which the failure was detected (step S107) and end this restart processing.
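The flow of steps S101 to S107 can be summarized in a short sketch. It is only an illustration under stated assumptions: the function name `handle_detected_failure`, the dictionary layout, the `send` and `restart` callbacks, and the row "PROCESS_X" are hypothetical, and the cascading behavior for restarted affected targets is deliberately left out to keep the example small.

```python
from typing import Callable, Dict, List, Optional, Tuple

# process name -> (failure information or None, affected targets)
Row = Tuple[Optional[str], List[str]]


def handle_detected_failure(
    failed: str,                      # S101: the process found to have failed
    table: Dict[str, Row],
    send: Callable[[str], None],      # hands failure info to the communication unit
    restart: Callable[[str], None],   # restarts one process
) -> None:
    failure_info, affected = table.get(failed, (None, []))  # S102
    if failure_info is not None:                            # S103: Yes
        send(failure_info)                                  # S104
        if affected:                                        # S105: Yes
            for target in affected:
                restart(target)                             # S106
    restart(failed)                                         # S107, on every path


table: Dict[str, Row] = {
    "PROCESS31": (None, []),
    "PROCESS32": ("TARGET:PROC,CONTAINER:2,PROC:PROCESS32", []),
    # Hypothetical row, added only so that step S106 is exercised.
    "PROCESS_X": ("TARGET:PROC,CONTAINER:2,PROC:PROCESS_X", ["PROCESS31"]),
}

sent: List[str] = []
restarted: List[str] = []
handle_detected_failure("PROCESS_X", table, sent.append, restarted.append)
```

Passing `send` and `restart` as callbacks keeps the decision logic separate from the communication and process-control mechanisms, which the description assigns to different units.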

また、プロセス管理部４５、５５、６５が障害情報を受信して行う再起動処理の処理手順は、図５で説明した処理手順と同様のものである。 Also, the processing procedure of the restart processing performed by the process management units 45, 55, and 65 upon receiving the failure information is the same as the processing procedure described with reference to FIG.

すなわち、図５に示すように、障害情報通信部４４、５４、６４は、他の障害情報通信部４４、５４、６４により送信された障害情報を受信する（ステップＳ２０１）。 That is, as shown in FIG. 5, the fault information communication units 44, 54 and 64 receive fault information transmitted by the other fault information communication units 44, 54 and 64 (step S201).

そして、プロセス管理部４５、５５、６５は、障害情報通信部４４、５４、６４から障害情報を取得するとともに、それぞれの障害管理情報４３ａ、５３ａ、６３ａを参照し、その障害情報に対応する障害情報と影響対象情報とを障害管理情報４３ａ、５３ａ、６３ａから取得する処理を行う（ステップＳ２０２）。 Then, the process management units 45, 55, and 65 acquire the failure information from the failure information communication units 44, 54, and 64, refer to their respective failure management information 43a, 53a, and 63a, and acquire the failure information and the affected target information corresponding to the received failure information from the failure management information 43a, 53a, and 63a (step S202).

そして、プロセス管理部４５、５５、６５は、それぞれの障害管理情報４３ａ、５３ａ、６３ａに、障害情報通信部４４、５４、６４から取得した障害情報に対応する障害情報が登録されていたか否かを判定する（ステップＳ２０３）。 Then, the process management units 45, 55 and 65 determine whether fault information corresponding to the fault information acquired from the fault information communication units 44, 54 and 64 is registered in the respective fault management information 43a, 53a and 63a. is determined (step S203).

障害管理情報４３ａ、５３ａ、６３ａに障害情報通信部４４、５４、６４から取得した障害情報に対応する障害情報が登録されていた場合（ステップＳ２０３においてＹｅｓの場合）、プロセス管理部４５、５５、６５は、障害情報通信部４４、５４、６４に指示して、その障害情報を他の障害情報通信部４４、５４、６４に送信させる（ステップＳ２０４）。 If failure information corresponding to the failure information acquired from the failure information communication units 44, 54, and 64 is registered in the failure management information 43a, 53a, and 63a (Yes in step S203), the process management units 45, 55, 65 instructs the failure information communication units 44, 54 and 64 to transmit the failure information to the other failure information communication units 44, 54 and 64 (step S204).

その後、プロセス管理部４５、５５、６５は、それぞれの障害管理情報４３ａ、５３ａ、６３ａに、障害情報通信部４４、５４、６４から取得した障害情報に対応する影響対象情報が登録されていたか否かを判定する（ステップＳ２０５）。 After that, the process management units 45, 55 and 65 determine whether affected information corresponding to the fault information acquired from the fault information communication units 44, 54 and 64 is registered in the respective fault management information 43a, 53a and 63a. (step S205).

障害管理情報４３ａ、５３ａ、６３ａに障害情報通信部４４、５４、６４から取得した障害情報に対応する影響対象情報が登録されていた場合（ステップＳ２０５においてＹｅｓの場合）、プロセス管理部４５、５５、６５は、その影響対象情報に登録されている影響対象を再起動し（ステップＳ２０６）、この再起動処理を終了する。 If affected target information corresponding to the failure information acquired from the failure information communication units 44, 54, and 64 is registered in the failure management information 43a, 53a, and 63a (Yes in step S205), the process management units 45, 55, and 65 restart the affected targets registered in that affected target information (step S206) and end this restart processing.

また、ステップＳ２０３において、障害管理情報４３ａ、５３ａ、６３ａに障害情報通信部４４、５４、６４から取得した障害情報に対応する障害情報が登録されていなかった場合（ステップＳ２０３においてＮｏの場合）、または、ステップＳ２０５において、障害管理情報４３ａ、５３ａ、６３ａに障害情報通信部４４、５４、６４から取得した障害情報に対応する影響対象情報が登録されていなかった場合（ステップＳ２０５においてＮｏの場合）、そのままこの再起動処理は終了する。 If failure information corresponding to the failure information acquired from the failure information communication units 44, 54, and 64 is not registered in the failure management information 43a, 53a, and 63a in step S203 (No in step S203), or if affected target information corresponding to that failure information is not registered in the failure management information 43a, 53a, and 63a in step S205 (No in step S205), this restart processing ends without further action.
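The receive-side flow of steps S201 to S206 can be sketched in the same style. The function name `handle_received_failure`, the dictionary keyed by failure information, and the boolean return value are assumptions made for this illustration; the forwarding in step S204 simply follows the step as described.

```python
from typing import Callable, Dict, List


def handle_received_failure(
    received: str,                    # S201: failure info from another unit
    table: Dict[str, List[str]],      # failure information -> affected targets
    forward: Callable[[str], None],   # asks the communication unit to send it on
    restart: Callable[[str], None],   # restarts one affected target
) -> bool:
    """Return True when at least one affected target was restarted."""
    if received not in table:         # S202 lookup, S203: No
        return False
    forward(received)                 # S204
    affected = table[received]
    if not affected:                  # S205: No
        return False
    for target in affected:
        restart(target)               # S206
    return True


external_table = {"TARGET:CONTAINER,CONTAINER:1": ["PROCESS31", "PROCESS32"]}
forwarded: List[str] = []
restarted_targets: List[str] = []
handled = handle_received_failure(
    "TARGET:CONTAINER,CONTAINER:1",
    external_table,
    forwarded.append,
    restarted_targets.append,
)
```

Unlike the detection-side flow, nothing is restarted when the lookup fails, because the failed component lives in another group and is restarted by its own manager.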

このように、本実施の形態２では、障害情報記憶部４３が、第１のコンテナ５０における第１の障害の情報または第２のコンテナ６０における第２の障害の情報に対応する外部障害情報と、第１の障害の情報または第２の障害の情報の影響対象の情報とを含む障害管理情報４３ａを記憶し、第１の障害の情報または第２の障害の情報のいずれかの障害情報を障害情報通信部４４が受信した場合に、プロセス管理部４５が、障害管理情報４３ａを参照し、外部障害情報に対応する第１の障害の影響対象または第２の障害の影響対象を再起動することとした。 As described above, in the second embodiment, the failure information storage unit 43 stores the failure management information 43a, which includes external failure information corresponding to information on a first failure in the first container 50 or information on a second failure in the second container 60, together with information on the targets affected by the first failure or the second failure. When the failure information communication unit 44 receives either the first failure information or the second failure information, the process management unit 45 refers to the failure management information 43a and restarts the target affected by the first failure or the second failure that corresponds to the external failure information.

これにより、コンテナにおけるプロセス間の依存関係を考慮して、障害からの復旧を自律的に行うことができる。 As a result, it is possible to autonomously recover from a failure, taking into consideration the dependencies between processes in the container.

また、本実施の形態２では、論理的に区分された２つのグループが、コンテナであることとした。 Also, in the second embodiment, the logically divided two groups are containers.

これにより、コンテナにおけるプロセス間に依存関係がある場合でも、障害からの復旧を自律的に行うことができる。 As a result, recovery from failures can be performed autonomously even if there is a dependency between processes in the container.

（実施の形態３） (Embodiment 3)
図１０は、実施の形態３に係る計算機システムの構成の一例を示す図である。以下で説明する計算機システムは、例えば、車両に搭載されるシステムである。図１０に示すように、この計算機システムは、ハイパバイザ７０、管理ＶＭ８０、メータＶＭ９０、ＩＶＩ（Ｉｎ－ＶｅｈｉｃｌｅＩｎｆｏｔａｉｎｍｅｎｔ）ＶＭ１００を備える。 FIG. 10 is a diagram showing an example of the configuration of a computer system according to Embodiment 3. The computer system described below is, for example, a system mounted on a vehicle. As shown in FIG. 10, this computer system includes a hypervisor 70, a management VM 80, a meter VM 90, and an IVI (In-Vehicle Infotainment) VM 100.

ハイパバイザ７０は、１つの計算機を管理ＶＭ８０、メータＶＭ９０、および、ＩＶＩＶＭ１００に論理的に分割し、３つの独立した仮想マシンとして動作させる制御部である。管理ＶＭ８０と、メータＶＭ９０と、ＩＶＩＶＭ１００は、このようにして生成された仮想マシンである。 The hypervisor 70 is a control unit that logically divides one computer into a management VM 80, a meter VM 90, and an IVI VM 100 and operates them as three independent virtual machines. The management VM 80, the meter VM 90, and the IVI VM 100 are virtual machines generated in this manner.

なお、ここではハイパバイザ７０上で動作する仮想マシンの数が３であり、コンテナの数が１であることとしたが、少なくとも１以上であればよい。 Although three virtual machines run on the hypervisor 70 and there is one container in this example, any number of one or more is acceptable for each.

ハイパバイザ７０は、ＶＭ監視部７１、障害情報記憶部７２、障害情報通信部７３、および、ＶＭ管理部７４を備える。 The hypervisor 70 includes a VM monitoring unit 71 , a failure information storage unit 72 , a failure information communication unit 73 and a VM management unit 74 .

ＶＭ監視部７１、障害情報通信部７３、および、ＶＭ管理部７４の機能は、プロセッサにより実現される。また、障害情報記憶部７２の機能は、メモリなどの記憶装置により実現される。 Functions of the VM monitoring unit 71, the failure information communication unit 73, and the VM management unit 74 are realized by the processor. Also, the function of the fault information storage unit 72 is implemented by a storage device such as a memory.

ＶＭ監視部７１は、管理ＶＭ８０、メータＶＭ９０、および、ＩＶＩＶＭ１００のそれぞれで稼働する仮想マシンの障害を監視する。例えば、ＶＭ監視部７１は、管理ＶＭ８０のＶＭ監視応答部８３、メータＶＭ９０のＶＭ監視応答部９３、または、ＩＶＩＶＭ１００のＶＭ監視応答部１０４からＶＭに障害が発生したことを示す通知を受け付け、後述するＶＭ管理部７４に障害の発生を通知する。 The VM monitoring unit 71 monitors failures of virtual machines operating in each of the management VM 80 , meter VM 90 and IVI VM 100 . For example, the VM monitoring unit 71 receives a notification indicating that a failure has occurred in a VM from the VM monitoring response unit 83 of the management VM 80, the VM monitoring response unit 93 of the meter VM 90, or the VM monitoring response unit 104 of the IVI VM 100, The occurrence of the failure is notified to the VM management unit 74, which will be described later.

または、ＶＭ監視部７１は、ＶＭ監視応答部８３、９３、１０４に対してハートビートメッセージを送信し、応答がないＶＭ監視応答部８３、９３、１０４のＶＭに障害が発生したことをＶＭ管理部７４に通知してもよい。 Alternatively, the VM monitoring unit 71 may send heartbeat messages to the VM monitoring response units 83, 93, and 104 and notify the VM management unit 74 that a failure has occurred in the VM of any VM monitoring response unit 83, 93, or 104 that does not respond.

障害情報記憶部７２は、管理ＶＭ８０、メータＶＭ９０、および、ＩＶＩＶＭ１００における仮想マシンのプロセス群において発生する障害の情報と、当該障害の影響対象の情報とを対応付けて登録した障害管理情報７２ａを記憶する。 The failure information storage unit 72 stores failure management information 72a, in which information on failures occurring in the virtual machine process groups of the management VM 80, the meter VM 90, and the IVI VM 100 is registered in association with information on the targets affected by those failures.

図１１は、障害管理情報７２ａの一例を示す図である。障害管理情報７２ａは、ＶＭ情報、障害情報、影響対象情報を含む。 FIG. 11 is a diagram showing an example of the fault management information 72a. The failure management information 72a includes VM information, failure information, and affected target information.

ＶＭ情報は、計算機システムにおいて稼働する仮想マシンの識別情報である。障害情報は、それらの仮想マシンにおいて発生する障害の情報である。影響対象情報は、当該障害が発生した場合に影響を受ける対象の情報である。 The VM information is identification information of virtual machines running in the computer system. The failure information is information about failures occurring in those virtual machines. The affected target information is information about the target that will be affected when the failure occurs.

障害情報における「ＴＡＲＧＥＴ」は、障害が仮想マシン（ＶＭ）、コンテナ、プロセスのどれに発生したかを示す情報である。「ＶＭ」は、障害が発生した仮想マシンを示す識別情報である。 "TARGET" in the fault information is information indicating whether the fault occurred in a virtual machine (VM), container, or process. “VM” is identification information indicating a failed virtual machine.
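To make the key-and-value structure of the failure information concrete, the sketch below splits such a string into fields. The patent shows the strings only as they appear in the figures, so the comma-and-colon grammar, the function name `parse_failure_info`, and the error handling are assumptions for this example. The input is the process-level string quoted earlier in the description.

```python
from typing import Dict


def parse_failure_info(text: str) -> Dict[str, str]:
    """Split 'KEY: VALUE, KEY: VALUE' failure information into a dictionary."""
    fields: Dict[str, str] = {}
    for part in text.split(","):
        key, sep, value = part.partition(":")
        if not sep:  # a field without ':' does not fit the assumed grammar
            raise ValueError(f"missing ':' in field {part!r}")
        fields[key.strip()] = value.strip()
    return fields


parsed = parse_failure_info("TARGET: PROC, CONTAINER: 2, PROC: PROCESS32")
failure_level = parsed["TARGET"]  # which kind of component failed
```

With the fields separated, a receiver can branch on `TARGET` first and then read only the identifier that is relevant at that level.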

例えば、「管理ＶＭ」というＶＭ情報には、「ＴＡＲＧＥＴ：ＶＭ，ＶＭ：ＭＡＮＡＧＥＲ」という障害情報、および、「ＳＹＳ＿ＲＥＳＥＴ」という影響対象情報が対応付けて登録されている。 For example, VM information "management VM" is registered in association with failure information "TARGET: VM, VM: MANAGER" and affected target information "SYS_RESET".

これは、「管理ＶＭ」の仮想マシンのプロセス群において障害が発生した場合、その障害の発生により計算機システム全体を再起動する必要があることを示している。 This indicates that when a failure occurs in the virtual machine process group of the "management VM", the entire computer system must be restarted due to the occurrence of the failure.

また、「メータＶＭ」というＶＭ情報には、「ＴＡＲＧＥＴ：ＶＭ，ＶＭ：ＭＥＴＥＲ」という障害情報、および、「ＩＶＩＶＭ」という影響対象情報が対応付けて登録されている。 Further, the VM information "meter VM" is registered in association with the failure information "TARGET: VM, VM: METER" and the affected target information "IVI VM".

この障害情報および影響対象情報は、メータＶＭ９０における仮想マシンのプロセス群において障害が発生した場合に、ＩＶＩＶＭ１００における仮想マシンのプロセス群を再起動する必要があることを示している。 This failure information and affected information indicate that the virtual machine process group in the IVI VM 100 needs to be restarted when a failure occurs in the virtual machine process group in the meter VM 90 .
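The two rows of the failure management information 72a described above can be illustrated as a small lookup table. The English row keys, the action strings, and the function `plan_recovery` are hypothetical names chosen for this sketch, and treating "SYS_RESET" as a special marker is one possible reading of the description, not a confirmed detail.

```python
from typing import Dict, List, Tuple

# VM information -> (failure information, affected target information)
VM_FAILURE_TABLE: Dict[str, Tuple[str, List[str]]] = {
    "management VM": ("TARGET:VM,VM:MANAGER", ["SYS_RESET"]),
    "meter VM": ("TARGET:VM,VM:METER", ["IVI VM"]),
}


def plan_recovery(failed_vm: str) -> List[str]:
    """List the recovery actions implied by the affected target information."""
    _failure_info, affected = VM_FAILURE_TABLE[failed_vm]
    if "SYS_RESET" in affected:
        # Assumed meaning: the whole computer system has to be restarted.
        return ["reset whole system"]
    return [f"restart {target}" for target in affected]


meter_plan = plan_recovery("meter VM")
manager_plan = plan_recovery("management VM")
```

Keeping the dependency in data rather than in code means a different vehicle configuration would only need a different table.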

図１０の説明に戻ると、障害情報通信部７３は、管理ＶＭ８０の障害情報通信部８５、メータＶＭ９０の障害情報通信部９５、ＩＶＩＶＭ１００の障害情報通信部１０６、および、コンテナ１１０の障害情報通信部１１４と通信を行う。 Returning to the description of FIG. 10, the failure information communication unit 73 communicates with the failure information communication unit 85 of the management VM 80, the failure information communication unit 95 of the meter VM 90, the failure information communication unit 106 of the IVI VM 100, and the failure information communication unit 114 of the container 110.

例えば、障害情報通信部７３は、管理ＶＭ８０、メータＶＭ９０、または、ＩＶＩＶＭ１００における仮想マシンのプロセス群に障害が発生した場合に、障害が発生した仮想マシンに対応する図１１に示した障害情報を、障害が発生した仮想マシン以外の仮想マシンに送信する。 For example, when a failure occurs in a virtual machine process group in the management VM 80, the meter VM 90, or the IVI VM 100, the failure information communication unit 73 transmits the failure information shown in FIG. 11 corresponding to the failed virtual machine. , to virtual machines other than the failed virtual machine.

また、障害情報通信部７３は、障害情報通信部８５、障害情報通信部９５、障害情報通信部１０６、または、障害情報通信部１１４から送信される障害情報を受信する。 Failure information communication unit 73 also receives failure information transmitted from failure information communication unit 85 , failure information communication unit 95 , failure information communication unit 106 , or failure information communication unit 114 .

ＶＭ管理部７４は、管理ＶＭ８０、メータＶＭ９０、および、ＩＶＩＶＭ１００を管理する。例えば、ＶＭ管理部７４は、ＶＭ監視部７１からの通知により、管理ＶＭ８０、メータＶＭ９０、または、ＩＶＩＶＭ１００のプロセス群に障害が発生したことを検出する。 The VM management unit 74 manages the management VM 80, the meter VM 90, and the IVI VM 100. For example, the VM management unit 74 detects, through a notification from the VM monitoring unit 71, that a failure has occurred in the process group of the management VM 80, the meter VM 90, or the IVI VM 100.

また、ＶＭ管理部７４は、管理ＶＭ８０のプロセス管理部８６、メータＶＭ９０のプロセス管理部９６、または、ＩＶＩＶＭ１００のプロセス管理部１０７から再起動要求メッセージを受信することにより、再起動要求メッセージを送信したＶＭのプロセス群に障害が発生したことを検出する。 In addition, by receiving a restart request message from the process management unit 86 of the management VM 80, the process management unit 96 of the meter VM 90, or the process management unit 107 of the IVI VM 100, the VM management unit 74 detects that a failure has occurred in the process group of the VM that sent the restart request message.

そして、ＶＭ管理部７４は、障害管理情報７２ａを参照し、障害が発生したＶＭに対応付けて登録されている障害情報および影響対象情報を取得する。例えば、障害が発生したＶＭが「メータＶＭ」である場合、ＶＭ管理部７４は、「ＴＡＲＧＥＴ：ＶＭ，ＶＭ：ＭＥＴＥＲ」という障害情報、および、「ＩＶＩＶＭ」という影響対象情報を取得する。 Then, the VM management unit 74 refers to the failure management information 72a and acquires failure information and affected target information registered in association with the failed VM. For example, if the failed VM is a "meter VM", the VM management unit 74 acquires failure information "TARGET: VM, VM: METER" and affected target information "IVI VM".

その後、ＶＭ管理部７４は、「メータＶＭ」のＶＭのように、障害情報が登録されている場合、その障害情報を管理ＶＭ８０、メータＶＭ９０、ＩＶＩＶＭ１００、および、コンテナ１１０に送信するよう障害情報通信部７３に指示する。 After that, when failure information is registered, as it is for the "meter VM", the VM management unit 74 instructs the failure information communication unit 73 to send that failure information to the management VM 80, the meter VM 90, the IVI VM 100, and the container 110.

また、ＶＭ管理部７４は、「メータＶＭ」のＶＭのように、影響対象情報が登録されている場合、登録されている影響対象を再起動する。さらに、ＶＭ管理部７４は、障害が発生したＶＭのプロセス群を再起動する。 In addition, the VM management unit 74 restarts the registered affected target when the affected target information is registered, such as the VM of the “meter VM”. Furthermore, the VM management unit 74 restarts the process group of the failed VM.

ここで、図１１に示すように、影響対象情報として「ＩＶＩＶＭ」が登録されている場合、ＶＭ管理部７４は、「ＩＶＩＶＭ」のＶＭに対応する障害情報も管理ＶＭ８０、メータＶＭ９０、ＩＶＩＶＭ１００、および、コンテナ１１０に送信するよう障害情報通信部７３に指示する。 Here, when "IVI VM" is registered as affected target information, as shown in FIG. 11, the VM management unit 74 also instructs the failure information communication unit 73 to send the failure information corresponding to the "IVI VM" to the management VM 80, the meter VM 90, the IVI VM 100, and the container 110.

なお、「ＩＶＩＶＭ」のＶＭに対応する影響対象情報は登録されていないため、ＶＭ管理部７４はＶＭのプロセス群の再起動を行わない。 In addition, since the affected target information corresponding to the VM of "IVI VM" is not registered, the VM management unit 74 does not restart the process group of the VM.

さらに、ＶＭ管理部７４は、障害情報通信部７３が管理ＶＭ８０、メータＶＭ９０、ＩＶＩＶＭ１００、および、コンテナ１１０から障害情報を受信した場合に、その障害情報を取得し、その障害情報が障害管理情報７２ａに含まれる障害情報に対応するものであるか否かを判定する。 Furthermore, when the failure information communication unit 73 receives failure information from any of the management VM 80, the meter VM 90, the IVI VM 100, and the container 110, the VM management unit 74 acquires that failure information and determines whether it corresponds to failure information included in the failure management information 72a.

そして、ＶＭ管理部７４は、その障害情報が障害管理情報７２ａに含まれる障害情報に対応するものである場合、その障害情報に対応する影響対象情報が登録されているか否かを判定する。 Then, when the failure information corresponds to the failure information included in the failure management information 72a, the VM management unit 74 determines whether or not the affected information corresponding to the failure information is registered.

その障害情報に対応する影響対象情報が登録されている場合、ＶＭ管理部７４は、登録されている影響対象を再起動する。 If affected target information corresponding to the failure information is registered, the VM management unit 74 restarts the registered affected target.

プロセス８１ａ～８１ｎを含むプロセス群が実行される管理ＶＭ８０は、プロセス監視部８２、ＶＭ監視応答部８３、障害情報記憶部８４、障害情報通信部８５、および、プロセス管理部８６を備える。管理ＶＭ８０は、車両の管理を行う仮想マシンである。プロセス監視部８２、ＶＭ監視応答部８３、障害情報通信部８５、および、プロセス管理部８６は、それぞれ個別のプロセスとして実装されても良いが、それぞれの役割を備えた単一のプロセスとして実装されても良い。 The management VM 80, in which a process group including processes 81a to 81n is executed, includes a process monitoring unit 82, a VM monitoring response unit 83, a failure information storage unit 84, a failure information communication unit 85, and a process management unit 86. The management VM 80 is a virtual machine that manages the vehicle. The process monitoring unit 82, the VM monitoring response unit 83, the failure information communication unit 85, and the process management unit 86 may each be implemented as a separate process, or they may be implemented as a single process that fulfills all of their roles.

プロセス監視部８２、ＶＭ監視応答部８３、障害情報通信部８５、および、プロセス管理部８６の機能は、プロセッサにより実現される。また、障害情報記憶部８４の機能は、メモリなどの記憶装置により実現される。 Functions of the process monitoring unit 82, the VM monitoring response unit 83, the fault information communication unit 85, and the process management unit 86 are realized by the processor. Also, the function of the fault information storage unit 84 is implemented by a storage device such as a memory.

プロセス監視部８２は、プロセス８１ａ～８１ｎを含むプロセス群で発生する障害を監視する。例えば、プロセス監視部８２は、プロセス８１ａ～８１ｎのハートビートメッセージを送信し、応答がないプロセスを、後述するプロセス管理部８６に通知する。 The process monitoring unit 82 monitors failures occurring in the process group including the processes 81a to 81n. For example, the process monitoring unit 82 sends heartbeat messages to the processes 81a to 81n and reports any process that does not respond to the process management unit 86, which is described later.

ＶＭ監視応答部８３は、管理ＶＭ８０の仮想マシンに対してハートビートメッセージを送信し、応答がない場合に、仮想マシンのプロセス群に障害が発生したことをハイパバイザ７０のＶＭ監視部７１に通知する。 The VM monitoring response unit 83 transmits a heartbeat message to the virtual machine of the management VM 80, and if there is no response, notifies the VM monitoring unit 71 of the hypervisor 70 that a failure has occurred in the process group of the virtual machine. .

障害情報記憶部８４は、プロセス８１ａ～８１ｎを含むプロセス群において発生する障害の情報と、当該障害の影響対象の情報とを対応付けて登録した障害管理情報８４ａを記憶する。 The failure information storage unit 84 stores failure management information 84a in which information about failures that occur in a process group including the processes 81a to 81n and information about targets affected by the failures are associated and registered.

図１２は、障害管理情報８４ａの一例を示す図である。障害管理情報８４ａは、プロセス情報、障害情報、影響対象情報を含む。 FIG. 12 is a diagram showing an example of the failure management information 84a. The fault management information 84a includes process information, fault information, and affected target information.

プロセス情報は、管理ＶＭ８０において実行されるプロセス、および、管理ＶＭ８０の外部において実行されるプロセスであることを示す情報を含む。なお、図１２の例では、後者の情報は登録されていない。障害情報は、それらのプロセスの障害の情報である。影響対象情報は、当該障害が発生した場合に影響を受ける対象の情報である。 The process information includes information identifying the processes executed in the management VM 80 and information indicating that a process is executed outside the management VM 80. In the example of FIG. 12, the latter information is not registered. The failure information is information on failures of those processes. The affected target information is information on the targets that are affected when the failure occurs.

図１０の説明に戻ると、障害情報通信部８５は、障害情報通信部７３、障害情報通信部９５、障害情報通信部１０６、および、障害情報通信部１１４と通信を行う。 Returning to the explanation of FIG. 10 , the fault information communication section 85 communicates with the fault information communication section 73 , the fault information communication section 95 , the fault information communication section 106 and the fault information communication section 114 .

例えば、障害情報通信部８５は、管理ＶＭ８０におけるプロセス８１ａ～８１ｎに障害が発生した場合に、障害が発生したプロセスに対応する障害情報を障害情報通信部７３、障害情報通信部９５、障害情報通信部１０６、および、障害情報通信部１１４に送信する。 For example, when a failure occurs in any of the processes 81a to 81n in the management VM 80, the failure information communication unit 85 sends the failure information corresponding to the failed process to the failure information communication unit 73, the failure information communication unit 95, the failure information communication unit 106, and the failure information communication unit 114.

また、障害情報通信部８５は、障害情報通信部７３、障害情報通信部９５、障害情報通信部１０６、または、障害情報通信部１１４から送信される障害情報を受信する。 Further, failure information communication unit 85 receives failure information transmitted from failure information communication unit 73 , failure information communication unit 95 , failure information communication unit 106 , or failure information communication unit 114 .

プロセス管理部８６は、管理ＶＭ８０におけるプロセス８１ａ～８１ｎを管理する。例えば、プロセス管理部８６は、プロセス監視部８２からの通知により、プロセス８１ａ～８１ｎに障害が発生したことを検出する。また、プロセス管理部８６は、プロセス８１ａ～８１ｎからの障害メッセージを受信することによりプロセス８１ａ～８１ｎに障害が発生したことを検出する。 The process management unit 86 manages the processes 81a to 81n in the management VM 80. For example, the process management unit 86 detects, from a notification from the process monitoring unit 82, that a failure has occurred in the processes 81a to 81n. The process management unit 86 also detects that a failure has occurred in the processes 81a to 81n by receiving failure messages from the processes 81a to 81n.

そして、プロセス管理部８６は、障害管理情報８４ａを参照し、障害が発生したプロセスに対応付けて登録されている障害情報および影響対象情報を取得する。 Then, the process management unit 86 refers to the failure management information 84a and acquires the failure information and the affected target information registered in association with the process in which the failure occurred.

その後、プロセス管理部８６は、障害情報が登録されている場合、その障害情報を障害情報通信部７３、障害情報通信部９５、障害情報通信部１０６、および、障害情報通信部１１４に送信するよう障害情報通信部８５に指示する。 After that, if failure information is registered, the process management unit 86 instructs the failure information communication unit 85 to transmit that failure information to the failure information communication unit 73, the failure information communication unit 95, the failure information communication unit 106, and the failure information communication unit 114.

また、プロセス管理部８６は、影響対象情報が登録されている場合、登録されている影響対象を再起動する。さらに、プロセス管理部８６は、障害が発生したプロセスを再起動する。 Further, when the affected object information is registered, the process management unit 86 restarts the registered affected object. Furthermore, the process management unit 86 restarts the failed process.

さらに、プロセス管理部８６は、障害情報通信部８５が障害情報通信部７３、障害情報通信部９５、障害情報通信部１０６、または、障害情報通信部１１４から障害情報を受信した場合に、その障害情報を取得し、その障害情報が障害管理情報８４ａに含まれる障害情報に対応するものであるか否かを判定する。 Furthermore, when the failure information communication unit 85 receives failure information from the failure information communication unit 73, the failure information communication unit 95, the failure information communication unit 106, or the failure information communication unit 114, the process management unit 86 acquires that failure information and determines whether or not it corresponds to failure information included in the failure management information 84a.

そして、プロセス管理部８６は、その障害情報が障害管理情報８４ａに含まれる障害情報に対応するものである場合、その障害情報に対応する影響対象情報が登録されているか否かを判定する。 Then, when the fault information corresponds to the fault information included in the fault management information 84a, the process management unit 86 determines whether or not the affected information corresponding to the fault information is registered.

その障害情報に対応する影響対象情報が登録されている場合、プロセス管理部８６は、登録されている影響対象を再起動する。 If affected target information corresponding to the fault information is registered, the process management unit 86 restarts the registered affected target.

なお、図１２に示す障害管理情報８４ａの例では、障害情報および影響対象情報は登録されていないので、プロセス管理部８６は、障害が発生したプロセスのみを再起動する。 In the example of the fault management information 84a shown in FIG. 12, since fault information and affected target information are not registered, the process management unit 86 restarts only the faulty process.

また、プロセス９１ａ～９１ｍを含むプロセス群が実行されるメータＶＭ９０は、プロセス監視部９２、ＶＭ監視応答部９３、障害情報記憶部９４、障害情報通信部９５、および、プロセス管理部９６を備える。メータＶＭ９０は、運転席のデジタルメータ、ナビゲーション情報などの表示を担う仮想マシンである。プロセス監視部９２、ＶＭ監視応答部９３、障害情報通信部９５、および、プロセス管理部９６は、それぞれ個別のプロセスとして実装されても良いが、それぞれの役割を備えた単一のプロセスとして実装されても良い。 The meter VM 90, in which a process group including the processes 91a to 91m is executed, includes a process monitoring unit 92, a VM monitoring response unit 93, a failure information storage unit 94, a failure information communication unit 95, and a process management unit 96. The meter VM 90 is a virtual machine responsible for displaying the driver's seat digital meter, navigation information, and the like. The process monitoring unit 92, the VM monitoring response unit 93, the failure information communication unit 95, and the process management unit 96 may each be implemented as an individual process, or they may be implemented as a single process that performs all of their roles.

プロセス監視部９２、ＶＭ監視応答部９３、障害情報通信部９５、および、プロセス管理部９６の機能は、プロセッサにより実現される。また、障害情報記憶部９４の機能は、メモリなどの記憶装置により実現される。 The functions of the process monitoring unit 92, the VM monitoring response unit 93, the fault information communication unit 95, and the process management unit 96 are realized by the processor. Also, the function of the fault information storage unit 94 is implemented by a storage device such as a memory.

プロセス監視部９２は、プロセス９１ａ～９１ｍを含むプロセス群で発生する障害を監視する。例えば、プロセス監視部９２は、プロセス９１ａ～９１ｍに対してハートビートメッセージを送信し、応答がないプロセスを、後述するプロセス管理部９６に通知する。 The process monitoring unit 92 monitors failures occurring in the process group including the processes 91a to 91m. For example, the process monitoring unit 92 transmits heartbeat messages to the processes 91a to 91m, and notifies the process management unit 96, which will be described later, of the processes that do not respond.
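
The heartbeat check performed by the process monitoring unit 92 can be sketched as follows. This is only an illustration under stated assumptions, not the implementation disclosed here: the function `find_unresponsive`, the `send_heartbeat` callable, and the `notify_manager` callback are hypothetical names, and the message transport and timeout handling are deliberately left to the caller.

```python
from typing import Callable, Iterable, List


def find_unresponsive(
    process_names: Iterable[str],
    send_heartbeat: Callable[[str], bool],
    notify_manager: Callable[[str], None],
) -> List[str]:
    """Send one heartbeat to each process and report those that do not reply.

    send_heartbeat(name) is assumed to return True when the process answers
    within whatever timeout the caller chooses, and False otherwise.
    """
    unresponsive: List[str] = []
    for name in process_names:
        if not send_heartbeat(name):
            unresponsive.append(name)
            # Stands in for the notification to the process management unit.
            notify_manager(name)
    return unresponsive


# Example: PROCESS22 is the only process that fails to answer.
notified: List[str] = []
result = find_unresponsive(
    ["PROCESS21", "PROCESS22"],
    send_heartbeat=lambda name: name != "PROCESS22",
    notify_manager=notified.append,
)
```

Passing the heartbeat and notification functions in as parameters keeps the sketch independent of any particular inter-process communication mechanism, which the description does not specify.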

ＶＭ監視応答部９３は、メータＶＭ９０の仮想マシンに対してハートビートメッセージを送信し、応答がない場合に、仮想マシンのプロセス群に障害が発生したことをハイパバイザ７０のＶＭ監視部７１に通知する。 The VM monitoring response unit 93 transmits a heartbeat message to the virtual machine of the meter VM 90, and if there is no response, notifies the VM monitoring unit 71 of the hypervisor 70 that a failure has occurred in the process group of the virtual machine. .

障害情報記憶部９４は、プロセス９１ａ～９１ｍを含むプロセス群において発生する障害の情報と、当該障害の影響対象の情報と対応付けて登録した障害管理情報９４ａを記憶する。 The fault information storage unit 94 stores fault management information 94a registered in association with information on faults occurring in a process group including the processes 91a to 91m and information on targets affected by the faults.

図１３は、障害管理情報９４ａの一例を示す図である。障害管理情報９４ａは、プロセス情報、障害情報、影響対象情報を含む。 FIG. 13 is a diagram showing an example of the fault management information 94a. The fault management information 94a includes process information, fault information, and affected object information.

プロセス情報は、メータＶＭ９０において実行されるプロセス、および、メータＶＭ９０の外部において実行されるプロセスであることを示す情報を含む。障害情報は、それらのプロセスの障害の情報ある。影響対象情報は、当該障害が発生した場合に影響を受ける対象の情報である。 The process information includes information indicating a process executed in the meter VM 90 and a process executed outside the meter VM 90 . The failure information is information on failures of those processes. The affected target information is information about the target that will be affected when the failure occurs.

障害情報における「ＴＡＲＧＥＴ」、「ＶＭ」、「ＣＯＮＴＡＩＮＥＲ」、および、「ＰＲＯＣ」は、図２および図７に示した障害管理情報２３ａ、４３ａの障害情報における「ＴＡＲＧＥＴ」、「ＶＭ」、「ＣＯＮＴＡＩＮＥＲ」、および、「ＰＲＯＣ」と同様の情報である。 "TARGET", "VM", "CONTAINER", and "PROC" in the failure information are the same kinds of information as "TARGET", "VM", "CONTAINER", and "PROC" in the failure information of the failure management information 23a and 43a shown in FIGS. 2 and 7.

図１０の説明に戻ると、障害情報通信部９５は、障害情報通信部７３、障害情報通信部８５、障害情報通信部１０６、および、障害情報通信部１１４と通信を行う。 Returning to the explanation of FIG. 10 , the fault information communication section 95 communicates with the fault information communication section 73 , the fault information communication section 85 , the fault information communication section 106 and the fault information communication section 114 .

例えば、障害情報通信部９５は、メータＶＭ９０におけるプロセス９１ａ～９１ｍに障害が発生した場合に、障害が発生したプロセスに対応する障害情報を障害情報通信部７３、障害情報通信部８５、障害情報通信部１０６、および、障害情報通信部１１４に送信する。 For example, when a failure occurs in any of the processes 91a to 91m in the meter VM 90, the failure information communication unit 95 transmits the failure information corresponding to the failed process to the failure information communication unit 73, the failure information communication unit 85, the failure information communication unit 106, and the failure information communication unit 114.

また、障害情報通信部９５は、障害情報通信部７３、障害情報通信部８５、障害情報通信部１０６、または、障害情報通信部１１４から送信される障害情報を受信する。 Further, failure information communication unit 95 receives failure information transmitted from failure information communication unit 73 , failure information communication unit 85 , failure information communication unit 106 , or failure information communication unit 114 .

プロセス管理部９６は、メータＶＭ９０におけるプロセス９１ａ～９１ｍを管理する。例えば、プロセス管理部９６は、プロセス監視部９２からの通知により、プロセス９１ａ～９１ｍに障害が発生したことを検出する。また、プロセス管理部９６は、プロセス９１ａ～９１ｍからの障害メッセージを受信することによりプロセス９１ａ～９１ｍに障害が発生したことを検出する。 A process management unit 96 manages processes 91 a to 91 m in the meter VM 90 . For example, the process management unit 96 detects from the notification from the process monitoring unit 92 that a failure has occurred in the processes 91a to 91m. Also, the process management unit 96 detects that a failure has occurred in the processes 91a to 91m by receiving failure messages from the processes 91a to 91m.

そして、プロセス管理部９６は、障害管理情報９４ａを参照し、障害が発生したプロセスに対応付けて登録されている障害情報および影響対象情報を取得する。 Then, the process management unit 96 refers to the failure management information 94a and acquires the failure information and the affected target information registered in association with the process in which the failure occurred.

例えば、図１３の例において、障害が発生したプロセスが「ＰＲＯＣＥＳＳ２１」である場合、プロセス管理部は、「ＴＡＲＧＥＴ：ＰＲＯＣ，ＶＭ：ＭＥＴＥＲ，ＣＯＮＴＡＩＮＥＲ：－，ＰＲＯＣ：ＰＲＯＣＥＳＳ２１」という障害情報を取得する。 For example, in the example of FIG. 13, if the failed process is "PROCESS21", the process management unit acquires the failure information "TARGET: PROC, VM: METER, CONTAINER: -, PROC: PROCESS21".
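
To make the structure of such a failure information string concrete, the following sketch splits it into its fields. It is a hedged illustration: the function name `parse_failure_info` is hypothetical, the string is written with ASCII punctuation as in the English rendering, and treating "-" as "no value" is an assumption about how the hyphen is used in the figures.

```python
from typing import Dict, Optional


def parse_failure_info(text: str) -> Dict[str, Optional[str]]:
    """Split 'KEY:VALUE,KEY:VALUE,...' into a dict; a value of '-' becomes None."""
    fields: Dict[str, Optional[str]] = {}
    for part in text.split(","):
        key, _, value = part.partition(":")
        value = value.strip()
        # Assumption: '-' marks a field that does not apply (e.g. no container).
        fields[key.strip()] = None if value == "-" else value
    return fields


info = parse_failure_info("TARGET:PROC,VM:METER,CONTAINER:-,PROC:PROCESS21")
```

Here `info["VM"]` is `"METER"` and `info["CONTAINER"]` is `None`, which reflects that PROCESS21 runs directly in the meter VM rather than inside a container.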

その後、プロセス管理部９６は、「ＰＲＯＣＥＳＳ２１」のプロセスのように、障害情報が登録されている場合、その障害情報を障害情報通信部７３、障害情報通信部８５、障害情報通信部１０６、および、障害情報通信部１１４に送信するよう障害情報通信部９５に指示する。 After that, if failure information is registered, as it is for the process "PROCESS21", the process management unit 96 instructs the failure information communication unit 95 to transmit that failure information to the failure information communication unit 73, the failure information communication unit 85, the failure information communication unit 106, and the failure information communication unit 114.

また、プロセス管理部９６は、影響対象情報が登録されている場合、登録されている影響対象を再起動する。さらに、プロセス管理部９６は、障害が発生したプロセスを再起動する。 Further, when the affected object information is registered, the process management unit 96 restarts the registered affected object. Furthermore, the process management unit 96 restarts the failed process.

ここで、図１３に示すように、影響対象情報として「ＰＲＯＣＥＳＳ２２」が登録されている場合、プロセス管理部９６は、「ＰＲＯＣＥＳＳ２２」のプロセスに対応する障害情報も障害情報通信部７３、障害情報通信部８５、障害情報通信部１０６、および、障害情報通信部１１４に送信するよう障害情報通信部９５に指示する。 Here, when "PROCESS22" is registered as the affected target information, as shown in FIG. 13, the process management unit 96 also instructs the failure information communication unit 95 to transmit the failure information corresponding to the process "PROCESS22" to the failure information communication unit 73, the failure information communication unit 85, the failure information communication unit 106, and the failure information communication unit 114.

「ＰＲＯＣＥＳＳ２２」のプロセスに対応する影響対象情報は登録されていないため、プロセス管理部９６はプロセスの再起動を行わない。 Since no affected target information is registered for the process "PROCESS22", the process management unit 96 does not restart any further process.
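
The handling of a failure inside the meter VM 90 can be sketched as a table lookup followed by a short cascade. This is a minimal sketch under assumptions: the `Row` class, `METER_TABLE`, and `handle_local_failure` are hypothetical names, the table holds only the two rows that the text describes for FIG. 13, and the restart order and the guard against cyclic registrations are additions that the description does not specify.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Row:
    failure_info: Optional[str] = None  # None: no failure information registered
    affected: List[str] = field(default_factory=list)  # affected targets


# Rows reconstructed from the text describing FIG. 13; other rows are omitted.
METER_TABLE: Dict[str, Row] = {
    "PROCESS21": Row("TARGET:PROC,VM:METER,CONTAINER:-,PROC:PROCESS21", ["PROCESS22"]),
    "PROCESS22": Row("TARGET:PROC,VM:METER,CONTAINER:-,PROC:PROCESS22"),
}


def handle_local_failure(table: Dict[str, Row], failed: str) -> Tuple[List[str], List[str]]:
    """Return (failure information to send to peers, processes to restart)."""
    to_send: List[str] = []
    to_restart: List[str] = []
    pending = [failed]
    while pending:
        name = pending.pop(0)
        if name in to_restart:  # added safeguard against cyclic registrations
            continue
        row = table.get(name, Row())
        if row.failure_info is not None:
            to_send.append(row.failure_info)
        to_restart.append(name)
        pending.extend(row.affected)  # affected targets are handled in turn
    return to_send, to_restart


sent, restarted = handle_local_failure(METER_TABLE, "PROCESS21")
```

For a failure of PROCESS21 the sketch yields the failure information of both PROCESS21 and PROCESS22 for transmission and marks both processes for restart. A process with no registered row is simply restarted on its own, which matches the case where neither failure information nor affected target information is registered.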

さらに、プロセス管理部９６は、障害情報通信部９５が障害情報通信部７３、障害情報通信部８５、障害情報通信部１０６、または、障害情報通信部１１４から障害情報を受信した場合に、その障害情報を取得し、その障害情報が障害管理情報９４ａに含まれる障害情報に対応するものであるか否かを判定する。 Furthermore, when the failure information communication unit 95 receives failure information from the failure information communication unit 73, the failure information communication unit 85, the failure information communication unit 106, or the failure information communication unit 114, the process management unit 96 acquires that failure information and determines whether or not it corresponds to failure information included in the failure management information 94a.

そして、プロセス管理部９６は、その障害情報が障害管理情報９４ａに含まれる障害情報に対応するものである場合、その障害情報に対応する影響対象情報が登録されているか否かを判定する。 Then, when the failure information corresponds to the failure information included in the failure management information 94a, the process management unit 96 determines whether or not the affected information corresponding to the failure information is registered.

その障害情報に対応する影響対象情報が登録されている場合、プロセス管理部９６は、登録されている影響対象を再起動する。 If affected target information corresponding to the failure information is registered, the process management unit 96 restarts the registered affected target.

また、プロセス１０１ａ～１０１ｌ、および、コンテナ管理プロセス１０２を含むプロセス群が実行されるＩＶＩＶＭ１００は、プロセス監視部１０３、ＶＭ監視応答部１０４、障害情報記憶部１０５、障害情報通信部１０６、プロセス管理部１０７、および、コンテナ１１０を備える。プロセス監視部１０３、ＶＭ監視応答部１０４、障害情報通信部１０６、および、プロセス管理部１０７は、それぞれ個別のプロセスとして実装されても良いが、それぞれの役割を備えた単一のプロセスとして実装されても良い。 The IVI VM 100, in which a process group including the processes 101a to 101l and the container management process 102 is executed, includes a process monitoring unit 103, a VM monitoring response unit 104, a failure information storage unit 105, a failure information communication unit 106, a process management unit 107, and a container 110. The process monitoring unit 103, the VM monitoring response unit 104, the failure information communication unit 106, and the process management unit 107 may each be implemented as an individual process, or they may be implemented as a single process that performs all of their roles.

ＩＶＩＶＭ１００は、ナビゲーション、オーディオ、車両情報の表示のほか、スマートフォンなどとの連携機能を担う仮想マシンである。コンテナ管理プロセス１０２は、コンテナ１１０の管理を行うプロセスである。 The IVI VM 100 is a virtual machine that not only displays navigation, audio, and vehicle information, but also functions to link with smartphones and the like. A container management process 102 is a process for managing the container 110 .

プロセス監視部１０３、ＶＭ監視応答部１０４、障害情報通信部１０６、プロセス管理部１０７、および、コンテナ１１０の障害情報記憶部１１３以外の機能は、プロセッサにより実現される。また、障害情報記憶部１０５および障害情報記憶部１１３の機能は、メモリなどの記憶装置により実現される。 The functions of the process monitoring unit 103, the VM monitoring response unit 104, the failure information communication unit 106, the process management unit 107, and the container 110 (other than its failure information storage unit 113) are realized by the processor. The functions of the failure information storage unit 105 and the failure information storage unit 113 are implemented by a storage device such as a memory.

プロセス監視部１０３は、プロセス１０１ａ～１０１ｌ、および、コンテナ管理プロセス１０２を含むプロセス群で発生する障害を監視する。例えば、プロセス監視部１０３は、プロセス１０１ａ～１０１ｌ、および、コンテナ管理プロセス１０２に対してハートビートメッセージを送信し、応答がないプロセスを、後述するプロセス管理部１０７に通知する。 The process monitoring unit 103 monitors failures that occur in the process group including the processes 101a to 101l and the container management process 102. For example, the process monitoring unit 103 sends heartbeat messages to the processes 101a to 101l and the container management process 102, and notifies the process management unit 107, described later, of any process that does not respond.

ＶＭ監視応答部１０４は、ＩＶＩＶＭ１００の仮想マシンに対してハートビートメッセージを送信し、応答がない場合に、仮想マシンのプロセス群に障害が発生したことをハイパバイザ７０のＶＭ監視部７１に通知する。 The VM monitoring response unit 104 transmits a heartbeat message to the virtual machine of the IVI VM 100, and if there is no response, notifies the VM monitoring unit 71 of the hypervisor 70 that a failure has occurred in the process group of the virtual machine. .

障害情報記憶部１０５は、プロセス１０１ａ～１０１ｌ、および、コンテナ管理プロセス１０２を含むプロセス群において発生する障害の情報と、当該障害の影響対象の情報とを対応付けて登録した障害管理情報１０５ａを記憶する。 The failure information storage unit 105 stores failure management information 105a, in which information on failures that occur in the process group including the processes 101a to 101l and the container management process 102 is registered in association with information on the targets affected by those failures.

図１４は、障害管理情報１０５ａの一例を示す図である。障害管理情報１０５ａは、プロセス情報、障害情報、影響対象情報を含む。 FIG. 14 is a diagram showing an example of the failure management information 105a. The fault management information 105a includes process information, fault information, and affected target information.

プロセス情報は、ＩＶＩＶＭ１００において実行されるプロセスの識別情報、および、ＩＶＩＶＭ１００の外部において実行されるプロセスであることを示す情報を含む。障害情報は、それらのプロセスの障害の情報ある。影響対象情報は、当該障害が発生した場合に影響を受ける対象の情報である。 The process information includes identification information of processes executed in the IVI VM 100 and information indicating that the processes are executed outside the IVI VM 100 . The failure information is information on failures of those processes. The affected target information is information about the target that will be affected when the failure occurs.

図１０の説明に戻ると、障害情報通信部１０６は、障害情報通信部７３、障害情報通信部８５、障害情報通信部９５、および、障害情報通信部１１４と通信を行う。 Returning to the description of FIG. 10 , failure information communication section 106 communicates with failure information communication section 73 , failure information communication section 85 , failure information communication section 95 , and failure information communication section 114 .

例えば、障害情報通信部１０６は、ＩＶＩＶＭ１００におけるプロセス１０１ａ～１０１ｌ、または、コンテナ管理プロセス１０２に障害が発生した場合に、障害が発生したプロセスに対応する障害情報を障害情報通信部７３、障害情報通信部８５、障害情報通信部９５、および、障害情報通信部１１４に送信する。 For example, when a failure occurs in any of the processes 101a to 101l in the IVI VM 100 or in the container management process 102, the failure information communication unit 106 transmits the failure information corresponding to the failed process to the failure information communication unit 73, the failure information communication unit 85, the failure information communication unit 95, and the failure information communication unit 114.

また、障害情報通信部１０６は、障害情報通信部７３、障害情報通信部８５、障害情報通信部９５、または、障害情報通信部１１４から送信される障害情報を受信する。 Failure information communication unit 106 also receives failure information transmitted from failure information communication unit 73 , failure information communication unit 85 , failure information communication unit 95 , or failure information communication unit 114 .

プロセス管理部１０７は、ＩＶＩＶＭ１００におけるプロセス１０１ａ～１０１ｌ、および、コンテナ管理プロセス１０２を管理する。例えば、プロセス管理部１０７は、プロセス監視部１０３からの通知により、プロセス１０１ａ～１０１ｌ、または、コンテナ管理プロセス１０２に障害が発生したことを検出する。 The process management unit 107 manages the processes 101 a to 101 l in the IVI VM 100 and the container management process 102 . For example, the process management unit 107 detects from the notification from the process monitoring unit 103 that the processes 101a to 101l or the container management process 102 has failed.

また、プロセス管理部１０７は、プロセス１０１ａ～１０１ｌ、または、コンテナ管理プロセス１０２からの障害メッセージを受信することによりプロセス１０１ａ～１０１ｌ、または、コンテナ管理プロセス１０２に障害が発生したことを検出する。 Also, the process management unit 107 detects that a failure has occurred in the processes 101a to 101l or the container management process 102 by receiving a failure message from the processes 101a to 101l or the container management process 102 .

そして、プロセス管理部１０７は、障害管理情報１０５ａを参照し、障害が発生したプロセスに対応付けて登録されている障害情報および影響対象情報を取得する。 Then, the process management unit 107 refers to the failure management information 105a, and acquires failure information and affected target information registered in association with the process in which the failure occurred.

例えば、図１４の例において、障害が発生したプロセスが「ＣＯＮＴＡＩＮＥＲ１＿ＭＮＧ＿ＰＲＯＣＥＳＳ」というコンテナ管理プロセス１０２である場合、プロセス管理部１０７は、「ＴＡＲＧＥＴ：ＣＯＮＴＡＩＮＥＲ，ＶＭ：ＩＶＩ，ＣＯＮＴＡＩＮＥＲ：１」という障害情報を取得する。 For example, in the example of FIG. 14, if the failed process is the container management process 102 named "CONTAINER1_MNG_PROCESS", the process management unit 107 acquires the failure information "TARGET: CONTAINER, VM: IVI, CONTAINER: 1".

その後、プロセス管理部１０７は、「ＣＯＮＴＡＩＮＥＲ１＿ＭＮＧ＿ＰＲＯＣＥＳＳ」のプロセスのように、障害情報が登録されている場合、その障害情報を障害情報通信部７３、障害情報通信部８５、障害情報通信部９５、および、障害情報通信部１１４に送信するよう障害情報通信部１０６に指示する。 After that, if failure information is registered, as it is for the process "CONTAINER1_MNG_PROCESS", the process management unit 107 instructs the failure information communication unit 106 to transmit that failure information to the failure information communication unit 73, the failure information communication unit 85, the failure information communication unit 95, and the failure information communication unit 114.

また、プロセス管理部１０７は、影響対象情報が登録されている場合、登録されている影響対象を再起動する。さらに、プロセス管理部１０７は、障害が発生したプロセスを再起動する。 Further, when the influence target information is registered, the process management unit 107 restarts the registered influence target. Furthermore, the process management unit 107 restarts the failed process.

ここで、図１４に示すように、影響対象情報として「ＰＲＯＣＥＳＳ３２」が登録されている場合、プロセス管理部１０７は、「ＰＲＯＣＥＳＳ３２」のプロセスに対応する障害情報も障害情報通信部７３、障害情報通信部８５、障害情報通信部９５、および、障害情報通信部１１４に送信するよう障害情報通信部１０６に指示する。 Here, when "PROCESS32" is registered as the affected target information, as shown in FIG. 14, the process management unit 107 also instructs the failure information communication unit 106 to transmit the failure information corresponding to the process "PROCESS32" to the failure information communication unit 73, the failure information communication unit 85, the failure information communication unit 95, and the failure information communication unit 114.

「ＰＲＯＣＥＳＳ３２」のプロセスに対応する影響対象情報は登録されていないため、プロセス管理部１０７はプロセスの再起動を行わない。 Since no affected target information is registered for the process "PROCESS32", the process management unit 107 does not restart any further process.

さらに、プロセス管理部１０７は、障害情報通信部１０６が障害情報通信部７３、障害情報通信部８５、障害情報通信部９５、または、障害情報通信部１１４から障害情報を受信した場合に、その障害情報を取得し、その障害情報が障害管理情報１０５ａに含まれる障害情報に対応するものであるか否かを判定する。 Furthermore, when the failure information communication unit 106 receives failure information from the failure information communication unit 73, the failure information communication unit 85, the failure information communication unit 95, or the failure information communication unit 114, the process management unit 107 acquires that failure information and determines whether or not it corresponds to failure information included in the failure management information 105a.

そして、プロセス管理部１０７は、その障害情報が障害管理情報１０５ａに含まれる障害情報に対応するものである場合、その障害情報に対応する影響対象情報が登録されているか否かを判定する。 Then, when the failure information corresponds to the failure information included in the failure management information 105a, the process management unit 107 determines whether or not the affected information corresponding to the failure information is registered.

その障害情報に対応する影響対象情報が登録されている場合、プロセス管理部１０７は、登録されている影響対象を再起動する。 If affected target information corresponding to the fault information is registered, the process management unit 107 restarts the registered affected target.

例えば、障害情報通信部１０６が、障害情報通信部９５から、障害情報として「ＴＡＲＧＥＴ：ＰＲＯＣ，ＶＭ：ＭＥＴＥＲ，ＣＯＮＴＡＩＮＥＲ：－，ＰＲＯＣ：ＰＲＯＣＥＳＳ２１」という情報を受信したものとする。 For example, it is assumed that the failure information communication unit 106 receives information "TARGET: PROC, VM: METER, CONTAINER: -, PROC: PROCESS21" from the failure information communication unit 95 as failure information.

この場合、この障害情報は、図１４に示した障害管理情報１０５ａの上から４番目に示された「ＥＸＴＥＲＮＡＬ＿ＥＲＲＯＲ」の障害情報（外部障害情報）に対応するため、プロセス管理部１０７は、登録されている影響対象である「ＰＲＯＣＥＳＳ３１」のプロセスを再起動する。 In this case, the received failure information corresponds to the failure information (external failure information) of "EXTERNAL_ERROR", shown fourth from the top of the failure management information 105a in FIG. 14, so the process management unit 107 restarts the process "PROCESS31", which is the registered affected target.

ここで、「ＰＲＯＣＥＳＳ３１」には、障害情報が登録されていないので、障害情報は障害情報通信部７３、障害情報通信部８５、障害情報通信部９５、および、障害情報通信部１１４に送信されない。 Here, since no failure information is registered for "PROCESS31", no failure information is transmitted to the failure information communication unit 73, the failure information communication unit 85, the failure information communication unit 95, or the failure information communication unit 114.

また、「ＰＲＯＣＥＳＳ３１」のプロセスに対応する影響対象情報も登録されていないため、プロセス管理部１０７はプロセスの再起動を行わない。 In addition, since no affected target information is registered for the process "PROCESS31" either, the process management unit 107 does not restart any further process.
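
The check that the process management unit 107 applies to received failure information can be sketched as a search over the external failure rows of its table. This is an illustrative sketch only: `ExternalRow`, `IVI_EXTERNAL_ROWS`, and `targets_for_received` are hypothetical names, only the one external row described for FIG. 14 is included, and comparing the strings for exact equality is an assumption about what "corresponds to" means.

```python
from typing import List, NamedTuple


class ExternalRow(NamedTuple):
    failure_info: str    # external failure information to match against
    affected: List[str]  # local targets to restart when it matches


# The single "EXTERNAL_ERROR" row the text describes for FIG. 14.
IVI_EXTERNAL_ROWS = [
    ExternalRow("TARGET:PROC,VM:METER,CONTAINER:-,PROC:PROCESS21", ["PROCESS31"]),
]


def targets_for_received(rows: List[ExternalRow], received: str) -> List[str]:
    """Return the local targets registered for a received failure information string."""
    targets: List[str] = []
    for row in rows:
        if row.failure_info == received:  # assumption: exact string match
            targets.extend(row.affected)
    return targets


to_restart = targets_for_received(
    IVI_EXTERNAL_ROWS, "TARGET:PROC,VM:METER,CONTAINER:-,PROC:PROCESS21"
)
```

With the row above, a failure of PROCESS21 in the meter VM yields `["PROCESS31"]`, while failure information that matches no row yields an empty list, so nothing is restarted.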

プロセス１１１ａ～１１１ｋを含むプロセス群が実行されるコンテナ１１０は、プロセス監視部１１２、障害情報記憶部１１３、障害情報通信部１１４、および、プロセス管理部１１５を備える。プロセス監視部１１２、障害情報通信部１１４、および、プロセス管理部１１５は、それぞれ個別のプロセスとして実装されても良いが、それぞれの役割を備えた単一のプロセスとして実装されても良い。 The container 110, in which a process group including the processes 111a to 111k is executed, includes a process monitoring unit 112, a failure information storage unit 113, a failure information communication unit 114, and a process management unit 115. The process monitoring unit 112, the failure information communication unit 114, and the process management unit 115 may each be implemented as an individual process, or they may be implemented as a single process that performs all of their roles.

プロセス監視部１１２、障害情報通信部１１４、および、プロセス管理部１１５の機能は、プロセッサにより実現される。また、障害情報記憶部１１３の機能は、メモリなどの記憶装置により実現される。 Functions of the process monitoring unit 112, the fault information communication unit 114, and the process management unit 115 are realized by the processor. Also, the function of the failure information storage unit 113 is implemented by a storage device such as a memory.

プロセス監視部１１２は、プロセス１１１ａ～１１１ｋを含むプロセス群で発生する障害を監視する。例えば、プロセス監視部１１２は、プロセス１１１ａ～１１１ｋに対してハートビートメッセージを送信し、応答がないプロセスを、後述するプロセス管理部１１５に通知する。 The process monitoring unit 112 monitors failures occurring in a process group including the processes 111a to 111k. For example, the process monitoring unit 112 transmits heartbeat messages to the processes 111a to 111k, and notifies the process management unit 115, which will be described later, of processes that do not respond.

障害情報記憶部１１３は、プロセス１１１ａ～１１１ｋを含むプロセス群において発生する障害の情報と、当該障害の影響対象の情報とを対応付けて登録した障害管理情報１１３ａを記憶する。 The failure information storage unit 113 stores failure management information 113a in which information about failures that occur in a process group including the processes 111a to 111k and information about targets affected by the failures are associated and registered.

図１５は、障害管理情報１１３ａの一例を示す図である。障害管理情報１１３ａは、プロセス情報、障害情報、影響対象情報を含む。 FIG. 15 is a diagram showing an example of the failure management information 113a. The fault management information 113a includes process information, fault information, and affected target information.

プロセス情報は、コンテナ１１０において実行されるプロセスの識別情報、および、コンテナ１１０の外部において実行されるプロセスであることを示す情報を含む。障害情報は、それらのプロセスの障害の情報ある。影響対象情報は、当該障害が発生した場合に影響を受ける対象の情報である。 The process information includes identification information of the process executed in the container 110 and information indicating that the process is executed outside the container 110 . The failure information is information on failures of those processes. The affected target information is information about the target that will be affected when the failure occurs.

ここで、図１５の上から３番目にある「ＥＸＴＥＲＮＡＬ＿ＥＲＲＯＲ」というプロセス情報には、「ＴＡＲＧＥＴ：ＰＲＯＣ，ＶＭ：ＭＥＴＥＲ，ＣＯＮＴＡＩＮＥＲ：－，ＰＲＯＣ：ＰＲＯＣＥＳＳ２２」という障害情報（外部障害情報）、および、「ＰＲＯＣＥＳＳ４１」という影響対象情報が対応付けて登録されている。 Here, for the process information "EXTERNAL_ERROR", third from the top in FIG. 15, the failure information (external failure information) "TARGET: PROC, VM: METER, CONTAINER: -, PROC: PROCESS22" and the affected target information "PROCESS41" are registered in association with each other.

これは、メータＶＭ９０の「ＰＲＯＣＥＳＳ２２」というプロセスに障害が発生した場合、その障害の発生により「ＰＲＯＣＥＳＳ４１」というプロセスを再起動する必要があることを示している。 This indicates that when a failure occurs in the process "PROCESS22" of the meter VM 90, it is necessary to restart the process "PROCESS41" due to the occurrence of the failure.

図１０の説明に戻ると、障害情報通信部１１４は、障害情報通信部７３、障害情報通信部８５、障害情報通信部９５、および、障害情報通信部１０６と通信を行う。 Returning to the description of FIG. 10, the failure information communication unit 114 communicates with the failure information communication unit 73, the failure information communication unit 85, the failure information communication unit 95, and the failure information communication unit 106.

例えば、障害情報通信部１１４は、コンテナ１１０におけるプロセス１１１ａ～１１１ｋに障害が発生した場合に、障害が発生したプロセスに対応する図１５に示した障害情報を障害情報通信部７３、障害情報通信部８５、障害情報通信部９５、および、障害情報通信部１０６に送信する。 For example, when a failure occurs in any of the processes 111a to 111k in the container 110, the failure information communication unit 114 transmits the failure information shown in FIG. 15 that corresponds to the failed process to the failure information communication unit 73, the failure information communication unit 85, the failure information communication unit 95, and the failure information communication unit 106.

また、障害情報通信部１１４は、障害情報通信部７３、障害情報通信部８５、障害情報通信部９５、または、障害情報通信部１０６から送信される障害情報を受信する。 Further, failure information communication section 114 receives failure information transmitted from failure information communication section 73 , failure information communication section 85 , failure information communication section 95 , or failure information communication section 106 .
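
The communication pattern described for the failure information communication units 73, 85, 95, 106, and 114, where each unit sends to all of the others and receives from any of them, can be sketched with an in-memory stand-in. The class `FailureInfoBus` and its methods are hypothetical, and the actual transport between the hypervisor, the VMs, and the container is not specified in this description.

```python
from typing import Dict, List, Tuple


class FailureInfoBus:
    """In-memory stand-in for the links between failure information communication units."""

    def __init__(self) -> None:
        self._inboxes: Dict[int, List[Tuple[int, str]]] = {}

    def register(self, unit_id: int) -> None:
        self._inboxes[unit_id] = []

    def broadcast(self, sender: int, failure_info: str) -> None:
        # Deliver to every registered unit except the sender itself.
        for unit_id, inbox in self._inboxes.items():
            if unit_id != sender:
                inbox.append((sender, failure_info))

    def received(self, unit_id: int) -> List[Tuple[int, str]]:
        return list(self._inboxes[unit_id])


bus = FailureInfoBus()
for unit in (73, 85, 95, 106, 114):
    bus.register(unit)

# Unit 95 (meter VM) reports a failure of PROCESS22 to all other units.
bus.broadcast(95, "TARGET:PROC,VM:METER,CONTAINER:-,PROC:PROCESS22")
```

After the broadcast, each of the units 73, 85, 106, and 114 holds one message from unit 95, and unit 95 itself holds none. Each receiving side then decides locally, from its own failure management information, whether anything needs to be restarted.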

プロセス管理部１１５は、コンテナ１１０におけるプロセス１１１ａ～１１１ｋを管理する。例えば、プロセス管理部１１５は、プロセス監視部１１２からの通知により、プロセス１１１ａ～１１１ｋに障害が発生したことを検出する。 The process management unit 115 manages the processes 111a to 111k in the container 110. For example, the process management unit 115 detects, from a notification from the process monitoring unit 112, that a failure has occurred in the processes 111a to 111k.

また、プロセス管理部１１５は、プロセス１１１ａ～１１１ｋからの障害メッセージを受信することによりプロセス１１１ａ～１１１ｋに障害が発生したことを検出する。 Also, the process management unit 115 detects that a failure has occurred in the processes 111a to 111k by receiving failure messages from the processes 111a to 111k.

そして、プロセス管理部１１５は、障害管理情報１１３ａを参照し、障害が発生したプロセスに対応付けて登録されている障害情報および影響対象情報を取得する。 Then, the process management unit 115 refers to the failure management information 113a and acquires the failure information and the affected target information registered in association with the process in which the failure occurred.

例えば、図１５の例において、障害が発生したプロセスが「ＰＲＯＣＥＳＳ４１」である場合、プロセス管理部１１５は、「ＴＡＲＧＥＴ：ＰＲＯＣ，ＶＭ：ＩＶＩ，ＣＯＮＴＡＩＮＥＲ：１，ＰＲＯＣ：ＰＲＯＣＥＳＳ４１」という障害情報、および、「ＰＲＯＣＥＳＳ４２」という影響対象情報を取得する。 For example, in the example of FIG. 15, if the failed process is "PROCESS41", the process management unit 115 acquires the failure information "TARGET: PROC, VM: IVI, CONTAINER: 1, PROC: PROCESS41" and the affected target information "PROCESS42".

その後、プロセス管理部１１５は、「ＰＲＯＣＥＳＳ４１」のプロセスのように、障害情報が登録されている場合、その障害情報を障害情報通信部７３、障害情報通信部８５、障害情報通信部９５、および、障害情報通信部１０６に送信するよう障害情報通信部１１４に指示する。 After that, if failure information is registered, as it is for the process "PROCESS41", the process management unit 115 instructs the failure information communication unit 114 to transmit that failure information to the failure information communication unit 73, the failure information communication unit 85, the failure information communication unit 95, and the failure information communication unit 106.

また、プロセス管理部１１５は、「ＰＲＯＣＥＳＳ４２」のプロセスのように、影響対象情報が登録されている場合、登録されている影響対象を再起動する。さらに、プロセス管理部１１５は、障害が発生したプロセスを再起動する。 In addition, the process management unit 115 restarts the registered affected object when the affected object information is registered like the process of "PROCESS42". Furthermore, the process management unit 115 restarts the failed process.

また、影響対象情報としてプロセスが登録されている場合、プロセス管理部１１５は、そのプロセスに対応する障害情報も障害情報通信部７３、障害情報通信部８５、障害情報通信部９５、および、障害情報通信部１０６に送信するよう障害情報通信部１１４に指示する。 Further, when a process is registered as affected target information, the process management unit 115 also instructs the failure information communication unit 114 to transmit the failure information corresponding to that process to the failure information communication unit 73, the failure information communication unit 85, the failure information communication unit 95, and the failure information communication unit 106.

さらに、プロセス管理部１１５は、障害情報通信部１１４が障害情報通信部８５、障害情報通信部９５、または、障害情報通信部１０６から障害情報を受信した場合に、その障害情報を取得し、その障害情報が障害管理情報１１３ａに含まれる障害情報に対応するものであるか否かを判定する。 Furthermore, when the failure information communication unit 114 receives failure information from the failure information communication unit 85, the failure information communication unit 95, or the failure information communication unit 106, the process management unit 115 acquires that failure information and determines whether or not it corresponds to failure information included in the failure management information 113a.

そして、プロセス管理部１１５は、その障害情報が障害管理情報１１３ａに含まれる障害情報に対応するものである場合、その障害情報に対応する影響対象情報が登録されているか否かを判定する。 Then, when the failure information corresponds to the failure information included in the failure management information 113a, the process management unit 115 determines whether or not the affected information corresponding to the failure information is registered.

その障害情報に対応する影響対象情報が登録されている場合、プロセス管理部１１５は、登録されている影響対象を再起動する。 If affected target information corresponding to the fault information is registered, the process management unit 115 restarts the registered affected target.

例えば、障害情報通信部１１４が、障害情報通信部９５から、障害情報として「ＴＡＲＧＥＴ：ＰＲＯＣ，ＶＭ：ＭＥＴＥＲ，ＣＯＮＴＡＩＮＥＲ：－，ＰＲＯＣ：ＰＲＯＣＥＳＳ２２」という情報を受信したものとする。 For example, it is assumed that the failure information communication unit 114 receives the information "TARGET: PROC, VM: METER, CONTAINER: -, PROC: PROCESS22" from the failure information communication unit 95 as failure information.

この場合、この障害情報は、図１５に示した障害管理情報１１３ａの上から３番目に示された「ＥＸＴＥＲＮＡＬ＿ＥＲＲＯＲ」の障害情報（外部障害情報）に対応するため、プロセス管理部１１５は、登録されている影響対象である「ＰＲＯＣＥＳＳ４１」のプロセスを再起動する。 In this case, the received failure information corresponds to the failure information (external failure information) of "EXTERNAL_ERROR", shown third from the top of the failure management information 113a in FIG. 15, so the process management unit 115 restarts the process "PROCESS41", which is the registered affected target.

ここで、「ＰＲＯＣＥＳＳ４１」には、障害情報が登録されているので、プロセス管理部１１５は、その障害情報を障害情報通信部７３、障害情報通信部８５、障害情報通信部９５、および、障害情報通信部１０６に送信するよう障害情報通信部１１４に指示するとともに、影響対象情報として登録されている「ＰＲＯＣＥＳＳ４２」のプロセスを再起動する。 Here, since failure information is registered for "PROCESS41", the process management unit 115 instructs the failure information communication unit 114 to transmit that failure information to the failure information communication unit 73, the failure information communication unit 85, the failure information communication unit 95, and the failure information communication unit 106, and also restarts the process "PROCESS42", which is registered as the affected target information.

「ＰＲＯＣＥＳＳ４２」のプロセスに対応する障害情報、および、影響対象情報は登録されていないため、プロセス管理部１１５は、障害情報の送信指示、および、プロセスの再起動を行わない。 Since the failure information and affected target information corresponding to the process of "PROCESS42" are not registered, the process management unit 115 does not issue an instruction to send failure information or restart the process.
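The cascade in this example can be sketched as follows. This is only an illustrative sketch under stated assumptions: the table layout, the names `Entry`, `EXTERNAL`, `LOCAL`, and `handle_received`, and the placeholder string used as the failure information of "PROCESS41" are hypothetical, because the actual contents of FIG. 15 are not reproduced in this text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Entry:
    """One row of a failure management table (layout is an assumption)."""
    failure_info: Optional[str] = None                   # sent when the target is restarted
    affected: List[str] = field(default_factory=list)    # affected targets to restart


# Hypothetical excerpt of failure management information 113a.
# Keys of EXTERNAL are external failure information received from other groups.
EXTERNAL: Dict[str, Entry] = {
    "TARGET:PROC,VM:METER,CONTAINER:-,PROC:PROCESS22": Entry(affected=["PROCESS41"]),
}
# Keys of LOCAL are processes managed by this group.
LOCAL: Dict[str, Entry] = {
    # "PROCESS41_FAILED" is a placeholder, not the real registered string.
    "PROCESS41": Entry(failure_info="PROCESS41_FAILED", affected=["PROCESS42"]),
    # PROCESS42 has no row: nothing to send and nothing further to restart.
}


def handle_received(info: str, sent: List[str], restarted: List[str]) -> None:
    """Restart affected targets and follow their own registered rows."""
    entry = EXTERNAL.get(info)
    if entry is None:
        return                               # no corresponding external failure information
    pending = list(entry.affected)
    while pending:
        target = pending.pop(0)
        if target in restarted:
            continue                         # never restart the same target twice
        restarted.append(target)
        local = LOCAL.get(target)
        if local is None:
            continue
        if local.failure_info is not None:
            sent.append(local.failure_info)  # would be handed to the communication unit
        pending.extend(local.affected)
```

Calling `handle_received` with the METER failure string restarts "PROCESS41" and then "PROCESS42" and queues one outgoing failure information item, which mirrors the order described above.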

Next, an example of the procedure of the restart processing performed by the hypervisor 70 will be described. FIG. 16 is a flowchart illustrating an example of the procedure of the restart processing performed by the hypervisor 70.

As illustrated in FIG. 16, the VM management unit 74 detects that a failure has occurred in a VM, either through a notification from the VM monitoring unit 71 or through a restart request message received from a VM (step S301).

Next, the VM management unit 74 refers to the failure management information 72a and, based on the information on the failed VM acquired from the VM monitoring unit 71, performs processing to acquire the failure information and the affected target information corresponding to that VM from the failure management information 72a (step S302).

The VM management unit 74 then determines whether failure information corresponding to that VM is registered in the failure management information 72a (step S303).

If failure information corresponding to that VM is registered in the failure management information 72a (Yes in step S303), the VM management unit 74 instructs the failure information communication unit 73 to transmit the failure information to the failure information communication units 85, 95, 106, and 114 (step S304).

After that, the VM management unit 74 determines whether affected target information corresponding to that VM is registered in the failure management information 72a (step S305).

If affected target information corresponding to that VM is registered in the failure management information 72a (Yes in step S305), the VM management unit 74 restarts the affected target registered in the affected target information (step S306).

Note that, if failure information corresponding to the affected VM to be restarted is registered in the failure management information 72a, the VM management unit 74 instructs the failure information communication unit 73 to transmit that failure information to the failure information communication units 85, 95, 106, and 114, and, if affected target information corresponding to that VM is registered, restarts the affected target registered in that affected target information.

Furthermore, the VM management unit 74 restarts the VM in which the failure was detected (step S307) and ends this restart processing.

If no failure information corresponding to that VM is registered in the failure management information 72a (No in step S303), or if no affected target information corresponding to that VM is registered in the failure management information 72a (No in step S305), the VM management unit 74 restarts the VM in which the failure was detected (step S307) and ends this restart processing.
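The flow of steps S302 to S307 can be sketched as follows. This is a minimal sketch, not the hypervisor's actual implementation: the function name `restart_failed_vm`, the table layout (VM name mapped to a pair of failure information and affected targets), and the `send` and `restart` callbacks are assumptions made for illustration. The follow-up handling for affected targets that have their own registered rows is omitted for brevity.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical layout: VM name -> (failure information or None, affected targets).
Table = Dict[str, Tuple[Optional[str], List[str]]]


def restart_failed_vm(
    vm: str,
    table: Table,
    send: Callable[[str], None],
    restart: Callable[[str], None],
) -> None:
    """Handle a VM whose failure was already detected in step S301."""
    failure_info, affected = table.get(vm, (None, []))  # S302: look up the failed VM
    if failure_info is not None:                        # S303: failure information registered?
        send(failure_info)                              # S304: transmit to the other groups
        for target in affected:                         # S305: empty list means "No"
            restart(target)                             # S306: restart each affected target
    restart(vm)                                         # S307: always restart the failed VM
```

For example, with a hypothetical table `{"VM_A": ("TARGET:VM,VM:VM_A", ["VM_B"])}`, a failure of `VM_A` results in one transmission, a restart of `VM_B`, and finally a restart of `VM_A`, while a VM with no registered row is simply restarted.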

The procedure of the restart processing that the process management units 86, 96, 107, and 115 (hereinafter simply referred to as the management units) perform upon detecting a process failure is the same as the procedure described with reference to FIG. 4.

That is, as shown in FIG. 4, each of the process monitoring units 82, 92, 103, and 112 detects a failure of a process that it manages and notifies the corresponding management unit that the failure has been detected (step S101).

Next, the management unit refers to its own failure management information 84a, 94a, 105a, or 113a (hereinafter simply referred to as the failure management information) and, based on the information on the failed process, performs processing to acquire the failure information and the affected target information corresponding to that process from the failure management information (step S102).

The management unit then determines whether failure information corresponding to that process is registered in the failure management information (step S103).

If failure information corresponding to that process is registered in the failure management information (Yes in step S103), the management unit instructs its failure information communication unit 85, 95, 106, or 114 (hereinafter simply referred to as the failure information communication unit) to transmit the failure information to the other failure information communication units (step S104).

After that, the management unit determines whether affected target information corresponding to that process is registered in the failure management information (step S105).

If affected target information corresponding to that process is registered in the failure management information (Yes in step S105), the management unit restarts the affected target registered in the affected target information (step S106).

Note that, if failure information corresponding to the affected process to be restarted is registered in the failure management information, the management unit instructs the failure information communication unit to transmit that failure information to the other failure information communication units, and, if affected target information corresponding to that process is registered, restarts the affected target registered in that affected target information.

Furthermore, the management unit restarts the process in which the failure was detected (step S107) and ends this restart processing.

If no failure information corresponding to that process is registered in the failure management information (No in step S103), or if no affected target information corresponding to that process is registered in the failure management information (No in step S105), the management unit restarts the process in which the failure was detected (step S107) and ends this restart processing.

The procedure of the restart processing that the VM management unit 74 and the process management units 86, 96, 107, and 115 (hereinafter simply referred to as the management units) perform upon receiving failure information is the same as the procedure described with reference to FIG. 5.

That is, as shown in FIG. 5, each of the failure information communication units 73, 85, 95, 106, and 114 (hereinafter simply referred to as the failure information communication unit) receives failure information on a VM or process failure transmitted by another failure information communication unit (step S201).

The management unit then acquires the failure information from the failure information communication unit, refers to its own failure management information 72a, 84a, 94a, 105a, or 113a (hereinafter simply referred to as the failure management information), and performs processing to acquire, from the failure management information, the failure information and the affected target information corresponding to the received failure information (step S202).

The management unit then determines whether failure information corresponding to that VM or process is registered in the failure management information (step S203).

If failure information corresponding to that VM or process is registered in the failure management information (Yes in step S203), the management unit instructs the failure information communication unit to transmit the failure information to the other failure information communication units (step S204).

After that, the management unit determines whether affected target information corresponding to that VM or process is registered in the failure management information (step S205).

If affected target information corresponding to that VM or process is registered in the failure management information (Yes in step S205), the management unit restarts the affected target registered in the affected target information (step S206) and ends this restart processing.

Note that, if failure information corresponding to the affected VM or process to be restarted is registered in the failure management information, the management unit instructs the failure information communication unit to transmit that failure information to the other failure information communication units, and, if affected target information corresponding to that VM or process is registered, restarts the affected target registered in that affected target information.

If no failure information corresponding to that VM or process is registered in the failure management information (No in step S203), or if no affected target information corresponding to that VM or process is registered in the failure management information (No in step S205), this restart processing ends without further action.

As described above, in Embodiment 3, at least one of the plurality of logically divided groups is a virtual machine or a container operating on a hypervisor.

As a result, even when there is a dependency between a process of a virtual machine and a process of a container, recovery from a failure can be performed autonomously.

In the failure management information described in Embodiments 1 to 3, the information held as affected target information is preferably limited to targets managed by the own group, except for "SYS_RESET", which restarts the entire system.

That is, even when a failure affects processes or other targets in another group, the information on those affected targets is managed in the failure management information of that other group. This keeps the software loosely coupled.

The information registered in the failure management information may also be exchanged through communication between the logically divided groups when the computer system starts up.

For example, in the computer system of FIG. 1, when a failure occurs in the first VM 20, failure information may go back and forth as follows.

(1) Failure information A is transmitted from the first VM 20 to the second VM 30.
(2) Based on failure information A, a process is restarted in the second VM 30, and failure information B is transmitted to the first VM 20.
(3) Based on failure information B, a process is restarted in the first VM 20, and failure information C is transmitted to the second VM 30.

When the information registered in the failure management information is exchanged between the groups in advance, the affected targets for which failure information would go back and forth in this way can be detected beforehand. The back-and-forth exchange can then be prevented by restarting those affected targets when the first failure occurs and omitting the failure information transmission processing shown in step S204 of FIG. 5.
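One possible way to detect such affected targets in advance is to follow the exchanged tables transitively, as sketched below. The source does not specify the detection method, so this is an assumption: the function name `affected_closure`, the group names, and the table layout (received failure information mapped to pairs of restarted target and the failure information that target emits) are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical layout per group:
# received failure information -> [(target to restart, failure information it emits or None)]
GroupTable = Dict[str, List[Tuple[str, Optional[str]]]]


def affected_closure(
    first_info: str, tables: Dict[str, GroupTable]
) -> List[Tuple[str, str]]:
    """Collect every (group, target) reached by following failure information across groups."""
    seen_info = set()
    targets: List[Tuple[str, str]] = []
    pending = [first_info]
    while pending:
        info = pending.pop()
        if info in seen_info:
            continue                     # already followed; prevents endless round trips
        seen_info.add(info)
        for group, table in tables.items():
            for target, emitted in table.get(info, []):
                if (group, target) not in targets:
                    targets.append((group, target))
                if emitted is not None:
                    pending.append(emitted)
    return targets


# Tables exchanged at startup, mirroring steps (1) to (3) above.
tables = {
    "VM30": {"A": [("proc_x", "B")], "C": [("proc_z", None)]},
    "VM20": {"B": [("proc_y", "C")]},
}
chain = affected_closure("A", tables)
```

With the full list available when failure information A first arrives, each group could restart its own listed targets immediately and skip the transmissions of B and C.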

In the affected target information of the failure management information 72a shown in FIG. 11, the information "SYS_RESET" is registered in order to restart the entire computer system. Similarly, the information "VM_RESET" may be registered in the affected target information managed by each VM shown in FIGS. 2, 3, and 12 to 14.

"VM_RESET" is information that causes a VM to restart itself. For example, the VM restarts itself by transmitting a restart request message to the hypervisor 10 or 70.

Likewise, the information "CONTAINER_RESET" may be registered in the affected target information managed by each container shown in FIGS. 8, 9, and 15.

"CONTAINER_RESET" is information that causes a container to restart itself. For example, the container restarts itself by transmitting a restart request message to the container management process 41a, 41b, or 102.

Alternatively, the process management unit 55, 65, or 115 of each container may voluntarily terminate itself so that the container management process 41a, 41b, or 102 detects the failure of the container and restarts it.
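How a management unit might treat these special entries can be sketched as follows. This is an assumption-laden illustration: the function `dispatch_target` and its three callbacks are hypothetical names, and only the request-message variant is shown, not the voluntary-termination alternative.

```python
from typing import Callable


def dispatch_target(
    target: str,
    restart_process: Callable[[str], None],
    request_vm_restart: Callable[[], None],
    request_container_restart: Callable[[], None],
) -> None:
    """Route one affected target entry to the matching restart action."""
    if target == "VM_RESET":
        request_vm_restart()          # e.g. send a restart request message to the hypervisor
    elif target == "CONTAINER_RESET":
        request_container_restart()   # e.g. send a restart request to the container management process
    else:
        restart_process(target)       # ordinary entry: restart the named process
```

Keeping the special tokens in the same affected target field means the lookup logic in the flowcharts does not need a separate branch for self-restart.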

In Embodiments 1 to 3 above, the case where the two pieces of failure information match was described as an example of failure information received from the first group corresponding to failure information in the failure management information of the second group. However, correspondence between pieces of failure information is not necessarily limited to an exact match.

For example, assume that the second group receives the failure information "TARGET: VM, VM: IVI". This failure information indicates that a failure has occurred in the IVI VM 100 shown in FIG. 10.

In this case, the IVI VM 100 will be restarted, so in the failure management information of each group, every piece of failure information whose VM is "IVI" is determined to correspond to the failure information "TARGET: VM, VM: IVI".

For example, in the failure management information 113a shown in FIG. 15, the failure information "TARGET: PROC, VM: IVI, CONTAINER: -, PROC: PROCESS32", which belongs to the fourth "EXTERNAL_ERROR" process information from the top, is treated as corresponding to the failure information "TARGET: VM, VM: IVI", and the subsequent processing is executed accordingly.
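This broader correspondence rule can be sketched as a small matching function. The field syntax follows the failure information strings quoted above, while the function names `parse_failure_info` and `corresponds` are hypothetical and the rule is limited to the VM-level case described here.

```python
from typing import Dict


def parse_failure_info(info: str) -> Dict[str, str]:
    """Split 'KEY: VALUE, KEY: VALUE' into a dictionary, ignoring extra spaces."""
    fields: Dict[str, str] = {}
    for part in info.split(","):
        key, _, value = part.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def corresponds(received: str, registered: str) -> bool:
    """Return True if registered failure information corresponds to the received one."""
    r = parse_failure_info(received)
    e = parse_failure_info(registered)
    if r == e:
        return True                   # exact match, as in Embodiments 1 to 3
    # A failure of a whole VM covers every registered entry that names that VM.
    return r.get("TARGET") == "VM" and e.get("VM") == r.get("VM")


vm_failure = "TARGET: VM, VM: IVI"
entry = "TARGET: PROC, VM: IVI, CONTAINER: -, PROC: PROCESS32"
matched = corresponds(vm_failure, entry)   # True under the VM-level rule
```

An entry that names a different VM, such as one whose VM field is "METER", does not correspond to the IVI failure under this rule.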

In addition, each of the embodiments described above is merely an example of a specific implementation of the present disclosure, and the technical scope of the present disclosure must not be construed as being limited by them. That is, the present disclosure can be implemented in various forms without departing from its gist or its main features.

The technology of the present disclosure can be used for computer systems and restart programs.

10 Hypervisor
20 First VM
21a to 21n, 31a to 31m, 40a to 40l, 51a to 51n, 61a to 61m, 81a to 81n, 91a to 91m, 101a to 101l, 111a to 111k Process
22, 32, 42, 52, 62, 82, 92, 103, 112 Process monitoring unit
23, 33, 43, 53, 63, 72, 84, 94, 105, 113 Failure information storage unit
23a, 33a, 43a, 53a, 63a, 72a, 84a, 94a, 105a, 113a Failure management information
24, 34, 44, 54, 64, 73, 85, 95, 106, 114 Failure information communication unit
25, 35, 45, 55, 65, 86, 96, 107, 115 Process management unit
30 Second VM
41a, 41b, 102 Container management process
71 VM monitoring unit
74 VM management unit

Claims

1. A computer system comprising:
a first failure information storage unit that stores first failure management information including information on a first failure occurring in a first process group operating in a first group logically separated from a second group;
a second failure information storage unit that stores second failure management information in which second external failure information corresponding to the information on the first failure is registered in association with information on an affected target of the first failure in the second group;
a first process management unit that, when the first failure is detected, causes a first failure information communication unit to transmit the information on the first failure; and
a second process management unit that, when a second failure information communication unit receives the information on the first failure, refers to the second failure management information and restarts the affected target of the first failure corresponding to the second external failure information.

2. The computer system according to claim 1, further comprising a first process monitoring unit that monitors a heartbeat of the first process group, wherein the first process monitoring unit notifies the first process management unit of heartbeat monitoring result information, and the first process management unit detects the first failure based on the notification from the first process monitoring unit.

3. The computer system according to claim 1, wherein
in the first failure management information, first external failure information corresponding to a second failure occurring in a second process group operating in the second group is further registered in association with information on an affected target of the second failure,
the second failure management information further includes information on the second failure,
the second process management unit causes the second failure information communication unit to transmit the information on the second failure when the second failure is detected, and
the first process management unit, when the first failure information communication unit receives the information on the second failure, refers to the first failure management information and restarts the affected target of the second failure corresponding to the first external failure information.

4. The computer system according to claim 1, wherein at least one of said first group and said second group is a container.

5. The computer system according to claim 4, further comprising, when the first group and the second group are the containers:
a third failure information storage unit that stores third failure management information including third external failure information, which corresponds to the information on the first failure or to information on a second failure occurring in a second process group operating in the second group, and information on an affected target; and
a third process management unit that, when a third failure information communication unit receives either the information on the first failure or the information on the second failure, refers to the third failure management information and restarts the affected target of the first failure or the affected target of the second failure corresponding to the third external failure information.

6. The computer system according to claim 5, further comprising a third process monitoring unit that monitors heartbeats of a process managing the first group and a process managing the second group, wherein the third process monitoring unit notifies the third process management unit of heartbeat monitoring result information, and the third process management unit detects, based on the notification from the third process monitoring unit, a failure of the process managing the first group or a failure of the process managing the second group, refers to the third failure management information, and restarts the affected target of the detected failure of the process managing the first group or of the process managing the second group.

7. The computer system according to claim 1, wherein at least one of the first group and the second group is a virtual machine running on a hypervisor.

8. The computer system according to claim 7, further comprising, when the first group and the second group are the virtual machines:
a third failure information storage unit that stores third failure management information including third external failure information, which corresponds to the information on the first failure or to information on a second failure occurring in a second process group operating in the second group, and information on an affected target; and
a VM management unit that, when a third failure information communication unit receives either the information on the first failure or the information on the second failure, refers to the third failure management information and restarts the affected target of the first failure or the affected target of the second failure corresponding to the third external failure information.

9. The computer system according to claim 8, further comprising a VM monitoring unit that monitors the first group and the second group, wherein the VM monitoring unit notifies the VM management unit of information on the monitoring results of the first group and the second group, and the VM management unit detects a failure of the first group or the second group based on the notification from the VM monitoring unit, refers to the third failure management information, and restarts the affected target of the detected failure of the first group or of the second group.

10. The computer system according to claim 9, wherein the third failure information storage unit, the VM monitoring unit, and the VM management unit operate on a hypervisor.

11. A restart program that causes a computer to execute:
a procedure of, when a first failure occurring in a first process group operating in a first group logically separated from a second group is detected, reading information on the first failure from a first failure information storage unit that stores first failure management information including the information on the first failure, and transmitting the information to a first failure information communication unit; and
a procedure of acquiring the information on the first failure from a second failure information communication unit that has received the information on the first failure, reading second failure management information from a second failure information storage unit that stores the second failure management information, in which second external failure information corresponding to the information on the first failure is registered in association with information on an affected target of the first failure in the second group, identifying the second external failure information corresponding to the information on the first failure, and restarting the affected target of the first failure corresponding to the identified second external failure information.