JP4112319B2 - Process restart method, process restart device, and process restart program

Info

Publication number: JP4112319B2
Application number: JP2002261260A
Authority: JP
Inventors: 伸宏木村; 光瀬社家; 一樹渡辺; 隆弘宮崎
Original assignee: Fujitsu Ltd; Nippon Telegraph and Telephone Corp
Current assignee: Fujitsu Ltd; Nippon Telegraph and Telephone Corp
Priority date: 2002-09-06
Filing date: 2002-09-06
Publication date: 2008-07-02
Anticipated expiration: 2022-09-06
Also published as: JP2004102492A

Description

【０００１】
【発明の属する技術分野】
本発明は、複数サーバ上で複数のプロセスが連携して複数の業務を実行するシステムにおける、プロセスおよびサーバ故障の再開(および復旧)処理を行うプロセス再開方法、プロセス再開装置、プロセス再開プログラムに関する。
【０００２】
【従来の技術】
従来のサーバ再開機能は、単独サーバ再開の次のフェーズとしては、全サーバの再開になってしまい、その中断時間の影響からくるシステム全体への影響が懸念された。また、プロセスの実行管理についても、故障検出時に該当プロセスを再起動する、または、サーバ故障時に該当サーバ上のプロセスを他サーバで再起動するなどの手法が採られていた。これらの方式では、再起動したプロセスの初期化処理は実行されるが、関連プロセスの初期化処理が実行されず、また、他サーバヘの再起動を行うにしても、詳細な設定が行えないため、プロセス間の不整合が発生することが懸念される。その結果、システム全体の整合性が崩れ、事象を回避するためにシステム全体を再開させることになり、中断時間の増加に結びつくことが懸念される。
なお、従来技術において、直接的に本願発明を示す文献は、発見されなかったので明示することができない。
【０００３】
【発明が解決しようとする課題】
上述したように、従来のサーバ再開方法では、あるサーバの再開により他サーバの再開も余儀なくされている。その具体例を図１５に示す。ユーザが、特定のサーバＳＶ−１，ＳＶ−２を再開させたい場合、サーバＳＶ−１，ＳＶ−２に対する他のサーバＳＶ−３，ＳＶ−４，ＳＶ−５の運用上の関係を考慮すると、システム全体を再開し、サーバＳＶ−１，ＳＶ−２と他のサーバＳＶ−３，ＳＶ−４，ＳＶ−５との整合を行うしかない。
【０００４】
また、あるプロセス故障によって、システム全体の整合性が崩れ、その事象を回避するためにシステム全体の再開を余儀なくされている。その具体例を図１６ないし図１８に示す。まず、図１６において、プロセス（Ａ）とプロセス（Ｂ）とは、機能的に関連したプロセスであり、それぞれを運用するためには、互いの運用が必須条件であるとする。一方のプロセスが故障再起動してしまうと、様々な矛盾からくる障害が発生するため、他方のプロセスの再起動も必要となる。これを実現するためには、システム再開を行うしかない。
【０００５】
次に、図１７において、プロセス（Ｃ）は、他の市販アプリケーションと機能的に関連したプロセスであり、該市販アプリケーション１０の運用が、システムの運用を行う上での必須条件であるとする。市販アプリケーション１０に障害が発生すると、市販アプリケーション１０については実行管理を行うことができないので、システム再開を行うしかない。
【０００６】
次に、図１８において、プロセス（Ａ），プロセス（Ｂ），プロセス（Ｃ），プロセス（Ｄ）は、機能的に関連したプロセスであり、それぞれを運用するためには、互いの運用が必須条件であるとする。一方のサーバＳＶ−１の再開が発生した場合、プロセス（Ａ）、プロセス（Ｂ）の再開が行われるため、様々な矛盾からくる障害発生を考慮すると、サーバＳＶ−２側のプロセス（Ｃ），プロセス（Ｄ）の再起動も必要となる。これを実現するためには、サーバＳＶ−２も再開するしかない。
【０００７】
次に、サーバ故障時の起動プロセスの救済方式についてだが、詳細な排他制御設定が行えず、該当プロセスの救済起動が行えないため、これもまた、システム全体の再開を余儀なくさせている。その具体例を図１９に示す。図１９において、プロセス（Ａ），プロセス（Ｂ）は、それぞれの運用上、同一サーバでの起動が不可能な関係であるとする。サーバＳＶ−１の再開が発生した場合、同サーバＳＶ−１にて起動されていたプロセス（Ａ）について、サーバＳＶ−２への救済起動は、プロセス（Ｂ）の起動により、不可能であるが、サーバＳＶ−３への起動は、論理的には可能である。しかしながら、プロセスおよびサーバ単位での詳細な排他設定が行えないため、プロセス（Ａ）を救済起動できない。
【０００８】
従来のサーバ再開方法では、これら３つの事象は、いずれにおいても、システム運用中断時間を長引かせるという問題がある。
【０００９】
この発明は上述した事情に鑑みてなされたもので、一部の故障がシステム全体の再開（中断）に発展することを抑止することができ、システム全体としての中断時間を短縮することができるプロセス再開方法、プロセス再開装置、プロセス再開プログラムを提供することを目的とする。
【００１０】
【課題を解決するための手段】
上述した問題点を解決するために、本発明は、複数のサーバで起動される複数のプロセスが連携して構築されている管理対象システムのプロセス再開方法において、前記複数のプロセスの各々に対して、故障したプロセスとの整合性を保障するために当該故障したプロセスと同時に再起動される他のプロセスと当該故障したプロセスとを含む予め定められるグループの中で当該プロセスを含む再起動プロセス数が最小となるグループをプロセス名、グループ番号、プロセスが動作するサーバ名からなるプロセスグループ情報として定義し、いずれかのプロセスが障害で故障した際に、前記プロセスグループ情報に基づいて、再開するプロセス群を特定して前記複数のサーバのそれぞれに特定したプロセス群に含まれる該当プロセスのみの再開をさせること特徴とする。
【００１１】
また、本発明は、上記に記載の発明において、一方のプロセスグループ情報と他方のプロセスグループ情報について、前記一方のプロセスグループ情報が、前記他方のプロセスグループ情報に定義されるプロセスを全て包含する包含関係を有する場合、前記他方のプロセスグループ情報に定義されるプロセスが障害で故障した場合は、当該他方のプロセスグループ情報内のプロセスのみを再起動し、前記一方のプロセスグループ情報にのみ定義されるプロセスが障害で故障した場合は、当該一方のプロセスグループ情報に定義される全てのプロセスを再起動することを特徴とする。
【００１２】
また、本発明は、上記に記載の発明において、前記複数のプロセスのうち少なくとも１つ以上が市販アプリケーションであり、かつ少なくとも１つ以上が前記市販アプリケーションとの連携を前提として開発された独自プロセスである前記複数のプロセスの各々をプロセスグループ情報として定義することを特徴とする。
【００１７】
また、上述した問題点を解決するために、本発明は、複数のサーバで起動される複数のプロセスが連携して構築されている管理対象システムのプロセス再開装置において、前記複数のプロセスの各々に対して、故障したプロセスとの整合性を保障するために当該故障したプロセスと同時に再起動される他のプロセスと当該故障したプロセスとを含む予め定められるグループの中で当該プロセスを含む再開起動プロセス数が最小となるグループをプロセス名、グループ番号、プロセスが動作するサーバ名からなるプロセスグループ情報として記憶するプロセスグループ情報記憶手段と、いずれかのプロセスが障害で故障した際に、前記プロセスグループ情報記憶手段のプロセスグループ情報に基づいて、再開するプロセス群を特定して前記複数のサーバのそれぞれに特定したプロセス群に含まれる該当プロセスのみの再開をさせる再開プロセス特定手段とを具備することを特徴とする。
【００１８】
また、本発明は、上記に記載の発明において、一方のプロセスグループ情報と他方のプロセスグループ情報について、前記プロセスグループ情報記憶手段が記憶する前記一方のプロセスグループ情報が、前記他方のプロセスグループ情報に定義されるプロセスを全て包含する包含関係を有する場合、前記再開プロセス特定手段は、前記他方のプロセスグループ情報に定義されるプロセスが障害で故障した場合には、当該他方のプロセスグループ情報内のプロセスのみを再起動し、前記一方のプロセスグループ情報にのみ定義されるプロセスが障害で故障した場合には、当該一方のプロセスグループ情報に定義される全てのプロセスを再起動することを特徴とする。
【００１９】
また、本発明は、上記に記載の発明において、前記プロセスグループ情報記憶手段は、前記複数のプロセスのうち少なくとも１つ以上が市販アプリケーションであり、かつ少なくとも１つ以上が前記市販アプリケーションとの連携を前提として開発された独自プロセスである前記複数のプロセスの各々を含むことが定義された前記プロセスグループ情報を記憶することを特徴とする。
【００２３】
また、上述した問題点を解決するために、本発明は、複数のサーバで起動され、管理対象システムを構築している複数のプロセスの各々に対して、故障したプロセスとの整合性を保障するために当該故障したプロセスと同時に再起動される他のプロセスと当該故障したプロセスとを含む予め定められるグループの中で当該プロセスを含む再開起動プロセス数が最小となるグループをプロセス名、グループ番号、プロセスが動作するサーバ名からなるプロセスグループ情報として定義するステップと、いずれかのプロセスが障害で故障した際に、前記プロセスグループ情報に基づいて、再開するプロセス群を特定して前記複数のサーバのそれぞれに特定したプロセス群に含まれる該当プロセスのみの再開をさせるステップとをコンピュータに実行させることを特徴とする。
【００２４】
また、本発明は、上記に記載の発明において、一方のプロセスグループ情報と他方のプロセスグループ情報について、前記一方のプロセスグループ情報が、前記他方のプロセスグループ情報に定義されるプロセスを全て包含する包含関係を有するプロセスグループ情報を定義するステップと、前記他方のプロセスグループ情報に定義されるプロセスが障害で故障した場合には、当該他方のプロセスグループ情報内のプロセスのみを再起動し、前記一方のプロセスグループ情報にのみ定義されるプロセスが障害で故障した場合には、当該一方のプロセスグループ情報に定義される全てのプロセスを再起動するステップとをコンピュータに実行させることを特徴とする。
【００２８】
この発明では、複数のプロセスの各々に対して、故障したプロセスとの整合性を保障するために当該故障したプロセスと同時に再起動が必要な他のプロセスをプロセスグループ情報として定義し、いずれかのプロセスが障害で故障した際に、前記プロセスグループ情報に基づいて、再開するプロセス群を特定する。したがって、必要最低限範囲でのサーバ再開を実現、またプロセス故障発生時の影響範囲を局所化させ、グループ単位での整合性を担保すること、そして、プロセス間の関係を考慮した排他制御によるプロセス救済により、一部の故障がシステム全体の再開（中断）に発展することを抑止し、システム全体としての中断時間を短縮させることが可能となる。
【００２９】
【発明の実施の形態】
以下、図面を用いて本発明の実施の形態を説明する。
Ａ．実施形態の構成
図１は、本発明の実施形態によるサーバプロセス管理システムの構成を示すブロック図である。図１において、サーバプロセス管理システムは、サーバＳＶ−１，ＳＶ−２，ＳＶ−３を具備する。サーバＳＶ−１は、プロセスＡ，プロセスＢ（市販アプリケーション）、プロセスＣ（市販アプリケーション）および管理プロセスＸを起動する。サーバＳＶ−２は、プロセスＤ，プロセスＥおよび管理プロセスＹを起動する。サーバＳＶ−３は、プロセスＦ，プロセスＧおよび管理プロセスＺを起動する。
【００３０】
また、プロセスＡおよびプロセスＢは、プロセスグループＰＧ１を構成する。また、プロセスＣ，プロセスＤ，プロセスＥおよびプロセスＦは、プロセスグループＰＧ２を構成する。なお、プロセスグループとは、起動管理を行うプロセスにおいて、運用上、関係のあるプロセス群を、１つのグループとして見立てたものである。プロセスグループの適用範囲は、１つのサーバ内に閉じたものではなく、システムを構成する全てのサーバ間で有効である。また、本発明の機能は、自作プロセスに限ったものではなく、市販アプリケーションについても有効である。
【００３１】
本実施形態によるサーバプロセス管理システムは、図２ないし図４に示す構成の条件ファイルを具備する。図２は、起動プロセス管理ファイル２０の構成を示す概念図である。図３は、システム構成管理ファイル２１の構成を示す概念図である。さらに、図４は、排他制御管理ファイル２２の構成を示す概念図である。
【００３２】
図２、図３に示す、起動プロセス管理ファイル２０およびシステム構成管理ファイル２１において、サーバ／プロセスグループ設定を行うには、各々、管理ファイルに事前に登録する。運用上管理が必要なサーバ／プロセス分、設定を行うこととし、各レコードの項目の中にグループ番号を設定する項目を設けている。管理プロセスは、これら設定内容に従って、各プロセスのグルーピング判断を行う。なお、１つのグループの設定数には上限がない。グループ番号に「０」が設定されていた場合、そのプロセスおよびサーバについては、グルーピング未設定と判断する。
【００３３】
また、図４において、プロセスの救済起動に関する排他制御の設定を行うには、排他制御管理ファイル２２に事前に登録する。設定方法については次の４パターンがある。１．サーバグループ単位での排他設定、２．サーバ単位での排他設定、３．プロセスグループ単位での排他設定、４．プロセス単位での排他設定である。いずれの設定であるかの識別子を設定し、その上に、各々、排他制御を設定する対象名、被対象名を設定する項目を設けている。管理プロセスは、これらの設定内容に従って、排他制御の有無を判断する。なお、設定数に上限はない。また、排他設定対象名／被対象名については、同一名で複数項目の設定も可能である。
【００３４】
以下により詳細に説明する。本実施形態では、管理対象サーバ（群）において起動されているプロセスに関して、任意のプロセスが故障したときに当該プロセスとの整合性を保障するために当該プロセスと同時に再起動が必要な他のプロセスを管理対象サーバ（群）の運用前に抽出し、抽出した複数のプロセスをまとめてプロセスグループと定義する。なお、プロセスグループを定義する際、包含関係をもつプロセスグループは定義してよいが、一部のプロセスのみを共有するプロセスグループは定義しない。
【００３５】
さらに、起動プロセス、当該起動プロセスが属する上記プロセスグループ、当該起動プロセスが起動しているサーバの少なくとも３つの情報を構成要素とする起動プロセス管理ファイル２０も同様に、管理対象サーバ（群）の運用前に作成する。
【００３６】
なお、包含関係をもつプロセスグループが定義されている場合、起動プロセスの属するプロセスグループが複数定義され得る。例えば、プロセスグループ間の包含関係（例えば、「プロセスグループαはプロセスグループβとプロセスグループνを包含する」）の情報と、各起動プロセスに対応するプロセスグループとして当該起動プロセスを含む起動プロセス数最小のプロセスグループ情報との２種類の情報を定義する。あるいは、各起動プロセスに対応した情報として、当該起動プロセスを含む起動プロセス数最小のプロセスグループ情報と、当該起動プロセスが含まれる前記以外の全てのプロセスグループ情報との２種類の情報を定義する。
【００３７】
管理対象サーバを管理する管理プロセスは、任意のプロセスが故障した際に、上記起動プロセス管理ファイル２０を参照することにより、再開起動プロセス数が最小となるように、上述した情報を用いて再開範囲を特定し「関連プロセスを含むグループ再開」の処理を実現する。
なお、一方のプロセスグループ情報に他方のプロセスグループ情報を含むとき、他方のプロセスグループ情報に含まれるプロセスが障害で故障した場合は、他方のプロセスグループ情報内のプロセスのみを再起動し、一方のプロセスグループ情報にのみ含まれるプロセスが障害で故障した場合は、他方のプロセスグループ情報を含む一方のプロセスグループ情報に属する全てのプロセスを再起動する。
【００３８】
また、起動プロセスのみならず、任意の管理対象サーバそのものが再開した際にも、管理プロセスは、上記起動プロセス管理ファイル２０を参照することにより、再開起動プロセス数が最小となるように、上述した情報を用いて再開範囲を特定し「関連プロセスを含むグループ再開」の処理を実現する。
【００３９】
また、本実施形態では、管理対象サーバ（群）における起動プロセス群と同様に、管理対象サーバそのものに対してもシステム運用条件に合わせて、任意の管理対象サーバが故障したときに当該管理対象サーバとの整合性を保障するために当該管理対象サーバと同時に再起動が必要な他の管理対象サーバを管理対象サーバ（群）の運用前に抽出し、抽出した複数の管理対象サーバをまとめてサーバグループと定義する。なお、サーバグループを定義する際、包含関係をもつサーバグループは特定してよいが、一部の管理対象サーバのみを共有するサーバグループは定義しない。
【００４０】
さらに、管理対象サーバ、当該管理対象サーバが属する上記サーバグループの少なくとも２つの情報を構成要素とするシステム構成管理ファイル２１も同様に、管理対象サーバ（群）の運用前に作成する。
【００４１】
なお、包含関係をもつサーバグループが定義されている場合、管理対象サーバの属するサーバグループが複数定義され得る。例えば、サーバグループ間の包含関係（例えば、「サーバグループαはサーバグループβとサーバグループを包含する」）の情報と、各管理対象サーバに対応するサーバグループとして当該管理対象サーバを含む管理対象サーバ数最小のサーバグループ情報との２種類の情報を定義する。あるいは、各管理対象サーバに対応した情報として、当該管理対象サーバを含む管理対象サーバ数最小のサーバグループ情報と、当該管理対象サーバが含まれる前記以外の全てのサーバグループ情報との２種類の情報を定義する。
【００４２】
管理対象サーバを管理する管理プロセスは、任意の管理対象サーバが故障した際に、上記システム構成管理ファイル２１を参照することにより、再開管理対象サーバ数が最小となるように、上述した情報を用いて再開範囲を特定し「サーバグループ再開」の処理を実現する。
なお、一方のサーバグループ情報に他方のサーバグループ情報を含むとき、他方のサーバグループ情報に含まれるサーバが障害で故障した場合は、他方のサーバグループ情報内のサーバのみを再起動し、一方のサーバグループ情報にのみ含まれるサーバが障害で故障した場合は、他方のサーバグループ情報を含む一方のサーバグループ情報に属する全てのサーバを再起動する。
【００４３】
また、本実施形態では、任意の管理対象サーバもしくは起動プロセスの再開が行えない場合の、起動プロセス救済起動については、サーバグループ、サーバ、プロセスグループ、プロセスの少なくとも４つのパターンに関して、任意のサーバグループに属する管理対象サーバの起動プロセスは、同一サーバ上に起動されることが許容不可なプロセス、サーバ、プロセスグループ、サーバグループの関係を示す排他条件を、管理対象サーバ（群）の運用前に抽出し、排他制御管理ファイル２２として作成する。
【００４４】
排他条件としては、全て起動することができないサーバグループを特定するためのサーバグループの排他条件、任意の管理対象サーバの起動プロセスは全て起動することができない管理対象サーバを特定するためのサーバ排他条件、任意のプロセスグループに属する起動プロセスと同一管理対象サーバ上では起動することができないプロセスグループを特定するためのプロセスグループの排他条件、任意のプロセスと同一管理対象サーバ上では起動することができないプロセスを特定するためのプロセス排他条件がある。
【００４５】
運用継続中の管理プロセスは、管理対象サーバ故障発生時に、上記排他制御管理ファイル２２を参照することで、各管理対象サーバにて救済可能な起動プロセスを特定し、起動プロセスの救済起動処理を実現する。
【００４６】
Ｂ．実施形態の動作
次に、本実施形態によるサーバプロセス管理システムの動作について説明する。まず、図５および図６は、グループ再開方法（グループプロセス再開発生時）の具体的な動作原理を説明するための概念図である。図５および図６では、プロセスＤが故障した場合の各サーバの処理概要が示されている。
【００４７】
まず、始めにサーバＳＶ−２上の動作概要を、図５を参照して説明する。サーバＳＶ−２上の管理プロセスＹは、プロセスＤの故障を検出し、起動プロセス管理ファイル２０の参照を行う（Ｓａ１）。プロセスＤは、プロセスグループＰＧ２に属するため、プロセスグループＰＧ２に定義されているプロセスが起動するサーバであるサーバＳＶ−１とサーバＳＶ−３とにプロセスグループＰＧ２の再起動要求を行う（Ｓａ２）。そして、他サーバヘの再起動要求と同時に自サーバ内のプロセスグループＰＧ２に定義されているプロセスＥの再起動を行う（Ｓａ３）。
【００４８】
次に、サーバＳＶ−１の動作概要を、図６を参照して説明する。サーバＳＶ−１上の管理プロセスＸは、サーバＳＶ−２からのプロセスグループＰＧ２の再起動要求を受信すると、自サーバ上で動作するプロセスグループＰＧ２のプロセスを起動プロセス管理ファイル２０から抽出し（Ｓｂ１）、プロセスグループＰＧ２に属するプロセスＣの再起動を行う（Ｓｂ２）。同様に、サーバＳＶ−３においても、プロセスグループＰＧ２に該当するプロセスの再起動を行う。
【００４９】
次に、サーバＳＶ−１の再開時のサーバＳＶ−２における処理概要を説明する。ここで、図７および図８は、グループ再開方法（サーバ再開発生時）の具体的な動作原理を説明するための概念図である。
【００５０】
まず、サーバＳＶ−１の動作概要を、図７を参照して説明する。プロセスの故障等によりサーバＳＶ−１の再開を実施した場合、各サーバへサーバＳＶ−１の再開通知を行った後、自サーバの再開を行う。再起動時、起動プロセス管理ファイル２０の参照を行い（Ｓｃ１）、自サーバに起動するプロセスＡ、プロセスＢ、プロセスＣを起動する（Ｓｃ２）。
【００５１】
次に、サーバＳＶ−２の処理概要を、図８を参照して説明する。サーバＳＶ−１からサーバの再開通知を受信後、起動プロセス管理ファイル２０の参照を行い（Ｓｄ１）、サーバＳＶ−１上で動作しているプロセスグループＰＧ１、プロセスグループＰＧ２が自サーバ上で動作しているかチェックを行う。そこでグループＰＧ２が対象となるため、プロセスグループＰＧ２内のプロセスＤ、プロセスＥの再開を行う（Ｓｄ２）。同様に、サーバＳＶ−３においても、プロセスグループＰＧ１、プロセスグループＰＧ２に該当するプロセスの再起動を行う。
【００５２】
次に、図９は、サーバ種別（ＡＰＬサーバ、ＷＷＷサーバ、ＤＢサーバ等）とサーバグループとの構成例を示すブロック図である。図において、サーバグループとは、システムを構成する各サーバについて、運用上、関係のあるサーバ群を、１つのグループに見立てたものである。サーバグループＳＧ１は、ＡＰＬサーバＳＶ−１およびＷＷＷサーバＳＶ−２から構成されている。サーバグループＳＧ２は、ＡＰＬサーバＳＶ−３、ＡＰＬサーバＳＶ−４、ネーミングサーバＳＶ−５から構成されている。なお、種類（種別）の異なるサーバは、同一のグループへ設定することも有効である。
【００５３】
次に、図１０および図１１は、本実施形態によるサーバグループ再開方法の具体的な動作原理を説明するための概念図である。まず、保守者などからサーバグループＳＧ１の再開要求を受信したサーバＳＶ−１の処理概要を、図１０を参照して説明する。サーバＳＶ−１の管理プロセスは、サーバグループＳＧ１の再開要求を受信後、システム構成管理ファイル２１の参照を行う（Ｓｅ１）。サーバＳＶ−１は、他サーバＳＶ−２ヘのサーバグループＳＧ１の再開通知も行うとともに（Ｓｅ２）、自身もサーバグループＳＧ１であるので再開を実施する（Ｓｅ３）。
【００５４】
次に、サーバＳＶ−２のサーバグループ再開の処理概要を、図１１を参照して説明する。サーバＳＶ−１との処理の違いは、要求元が保守者か他サーバかの違いのみであり、サーバグループＳＧ１の再開要求を受信後、システム構成管理ファイルの参照を行い（Ｓｆ１）、サーバグループＳＧ１に属するサーバＳＶ−２も再開する（Ｓｆ２）。
【００５５】
次に、図１２は、プロセス排他制御導入時の救済起動方法において、排他制御管理ファイルに基づくサーバグループ・サーバ・プロセスグループ・プロセスの排他制御対象と排他制御被対象の関係を示すブロック図である。また、図１３および図１４は、サーバＳＶ−１が故障した場合の排他制御概要を説明するための概念図である。なお、排他制御対象／被対象とは、システムを構成する上で、様々な制約事項から、同一サーバ上に起動されることを、許容できないプロセスを起動制御するために、サーバ、サーバグループ、プロセスの範囲にて制御対象を設けたものである。
【００５６】
サーバＳＶ−３の管理プロセスでは、サーバＳＶ−１の故障を検出すると、自サーバ内でプロセスを救済可能かチェックするため、排他制御管理ファイル２２およびシステム構成管理ファイル２１の参照を行い、サーバグループ、サーバ、プロセスグループ、プロセスの排他情報を取得する。サーバグループＳＧ１（ＳＲＶｇｒｐ−１）については、システム構成管理ファイル２１からサーバグループを構成するサーバの情報を取得し、プロセスグループについては起動プロセス管理ファイル２０から取得する。そして、これらの情報に従って、自サーバが排他対象であるかを決定する（Ｓｇ１）。
【００５７】
この場合、サーバグループＳＧ２（ＳＲＶｇｒｐ−２）に属しているサーバＳＶ−３は、排他制御管理ファイルにおいてサーバグループＳＧ１（ＳＲＶｇｒｐ−１）に対する排他制御が記述されているため、プロセスの救済対象とならない（Ｓｇ２）。
【００５８】
また、グループ定義されていないサーバＳＶ−５の場合、サーバグループＳＧ１（ＳＲＶｇｒｐ−１）のサーバＳＶ−１の故障に対する排他定義は無いが、プロセスグループＰＧ１（ＰＲＣｇｒｐ−１）に対する排他定義がサーバＳＶ−５上のプロセスグループＰＧ２（ＰＲＣｇｒｐ−２）にあるため、プロセスの救済対象とならない。
【００５９】
また、サーバＳＶ−６の管理プロセスでは、サーバＳＶ−１の故障を検出すると、自サーバ内でプロセスを救済可能かチェックするため、排他制御管理ファイル２２およびシステム構成管理ファイル２１の参照を行い、サーバグループ、サーバ、プロセスグループ、プロセスの排他情報を取得する。そして、サーバグループおよびサーバの観点で、サーバＳＶ−１に対して自サーバが排他設定されているか確認する（Ｓｈ１）。この場合、サーバＳＶ−６でのプロセスＡの救済起動が可能であると判断される。但し、他に救済起動可能な他サーバ、この場合、サーバＳＶ−２があるため、内部管理情報によりプロセス総数の比較を行い、最も起動総数の少ないサーバＳＶ−２を救済起動先と決定する（Ｓｈ２）。
【００６０】
【発明の効果】
以上説明したように、本発明によれば、複数のプロセスの各々に対して、故障したプロセスとの整合性を保障するために当該故障したプロセスと同時に再起動が必要な他のプロセスをプロセスグループ情報として定義し、いずれかのプロセスが障害で故障した際に、前記プロセスグループ情報に基づいて、再開するプロセス群を特定するようにしたので、必要最低限範囲でのサーバ再開を実現、またプロセス故障発生時の影響範囲を局所化させ、グループ単位での整合性を担保すること、そして、プロセス間の関係を考慮した排他制御によるプロセス救済により、一部の故障がシステム全体の再開（中断）に発展することを抑止し、システム全体としての中断時間を短縮させることができるという利点が得られる。
【図面の簡単な説明】
【図１】本発明の実施形態によるサーバプロセス管理システムの構成を示すブロック図である。
【図２】起動プロセス管理ファイルの構成を示す概念図である。
【図３】システム構成管理ファイルの構成を示す概念図である。
【図４】排他制御管理ファイルの構成を示す概念図である。
【図５】グループ再開方法（グループプロセス再開発生時）の具体的な動作原理を説明するための概念図である。
【図６】グループ再開方法（グループプロセス再開発生時）の具体的な動作原理を説明するための概念図である。
【図７】グループ再開方法（サーバ再開発生時）の具体的な動作原理を説明するための概念図である。
【図８】グループ再開方法（サーバ再開発生時）の具体的な動作原理を説明するための概念図である。
【図９】サーバ種別（ＡＰＬサーバ、ＷＷＷサーバ、ＤＢサーバ等）とサーバグループとの構成例を示すブロック図である。
【図１０】本実施形態によるサーバグループ再開方法の具体的な動作原理を説明するための概念図である。
【図１１】本実施形態によるサーバグループ再開方法の具体的な動作原理を説明するための概念図である。
【図１２】プロセス排他制御導入時の救済起動方法において、排他制御管理ファイルに基づくサーバグループ・サーバ・プロセスグループ・プロセスの排他制御対象と排他制御被対象の関係を示すブロック図である。
【図１３】サーバＳＶ−１が故障した場合の排他制御概要を説明するための概念図である。
【図１４】サーバＳＶ−１が故障した場合の排他制御概要を説明するための概念図である。
【図１５】従来技術によるプロセス再開方法の問題点を説明するための概念図である。
【図１６】従来技術によるサーバ再開方法の問題点を説明するための概念図である。
【図１７】従来技術によるサーバ再開方法の問題点を説明するための概念図である。
【図１８】従来技術によるサーバ再開方法の問題点を説明するための概念図である。
【図１９】従来技術によるプロセス救済起動方式の問題点を説明するための概念図である。
【符号の説明】
２０ 起動プロセス管理ファイル（プロセスグループ情報、プロセスグループ情報記憶手段）
２１ システム構成管理ファイル（サーバグループ情報、サーバグループ情報記憶手段）
２２ 排他制御管理ファイル（排他条件情報、排他条件情報記憶手段）
ＳＶ−１〜ＳＶ−６ サーバ
ＰＧ１，ＰＧ２ プロセスグループ
ＳＧ１，ＳＧ２ サーバグループ
Ａ〜Ｊ プロセス
Ｘ，Ｙ，Ｚ 管理プロセス（再開プロセス特定手段、再開サーバ特定手段、サーバ特定手段）

[0001]
TECHNICAL FIELD OF THE INVENTION
The present invention relates to a process restart method, a process restart device, and a process restart program that perform restart (and recovery) processing for process and server failures in a system in which a plurality of processes cooperate on a plurality of servers to execute a plurality of tasks.
[0002]
[Prior art]
In the conventional server restart function, the next phase after restarting a single server is restarting all servers, and there has been concern about the effect of the resulting interruption time on the system as a whole. For process execution management, approaches such as restarting the affected process when a failure is detected, or restarting the processes of a failed server on another server, have been used. With these approaches, the initialization of the restarted process is executed but the initialization of related processes is not, and detailed settings cannot be made for restarting on another server, so inconsistencies between processes are a concern. As a result, the consistency of the entire system is lost and the entire system has to be restarted to escape the condition, which increases the interruption time.
No prior-art document directly describing the invention of the present application was found, so none can be cited.
[0003]
[Problems to be solved by the invention]
As described above, in the conventional server restart method, restarting one server forces other servers to be restarted as well. A specific example is shown in FIG. 15. When a user wants to restart specific servers SV-1 and SV-2, the operational relationship between SV-1, SV-2 and the other servers SV-3, SV-4, and SV-5 leaves no option but to restart the entire system and re-establish consistency between servers SV-1, SV-2 and the other servers SV-3, SV-4, and SV-5.
[0004]
In addition, the failure of a single process can destroy the consistency of the entire system, forcing a restart of the entire system to escape the condition. Specific examples are shown in FIGS. 16 to 18. First, in FIG. 16, process (A) and process (B) are functionally related, and each requires the other to be running in order to operate. If one of them fails and is restarted, faults arising from various inconsistencies occur, so the other must be restarted as well. The only way to achieve this is to restart the system.
[0005]
Next, in FIG. 17, process (C) is functionally related to a separate commercial application 10, and operation of the commercial application 10 is a prerequisite for operating the system. Because execution management cannot be performed for the commercial application 10, a failure in it leaves no option but to restart the system.
[0006]
Next, in FIG. 18, process (A), process (B), process (C), and process (D) are functionally related, and each requires the others to be running in order to operate. When server SV-1 is restarted, process (A) and process (B) are restarted, so, to avoid faults arising from various inconsistencies, process (C) and process (D) on server SV-2 must be restarted as well. The only way to achieve this is to restart server SV-2 too.
[0007]
Next, regarding the rescue of startup processes when a server fails: detailed exclusive control cannot be configured, so the affected process cannot be started on another server as a rescue, and this too forces a restart of the entire system. A specific example is shown in FIG. 19. In FIG. 19, process (A) and process (B) are, for operational reasons, unable to run on the same server. When server SV-1 restarts, process (A), which was running on SV-1, cannot be rescued onto server SV-2 because process (B) is running there, but starting it on server SV-3 is logically possible. However, because detailed exclusion cannot be configured per process and per server, process (A) cannot be rescued.
[0008]
In the conventional server restart method, each of these three situations prolongs the interruption of system operation.
[0009]
The present invention has been made in view of the above circumstances, and its object is to provide a process restart method, a process restart device, and a process restart program that can prevent a partial failure from escalating into a restart (interruption) of the entire system and can shorten the interruption time of the system as a whole.
[0010]
[Means for Solving the Problems]
In order to solve the above problems, the present invention provides a process restart method for a managed system constructed from a plurality of cooperating processes started on a plurality of servers, wherein, for each of the plurality of processes, among predetermined groups each containing a failed process and the other processes that are restarted simultaneously with the failed process in order to guarantee consistency with it, the group that contains the process and has the smallest number of processes to restart is defined as process group information consisting of a process name, a group number, and the name of the server on which the process runs; and, when any process fails due to a fault, the set of processes to restart is identified on the basis of the process group information, and each of the plurality of servers is made to restart only the processes in the identified set that run on it.
[0011]
Further, in the invention described above, where one piece of process group information and another piece of process group information have an inclusion relationship in which the one contains every process defined in the other: when a process defined in the other process group information fails due to a fault, only the processes in that other process group information are restarted; and when a process defined only in the one process group information fails due to a fault, all processes defined in that one process group information are restarted.
[0012]
Further, in the invention described above, the plurality of processes, of which at least one is a commercial application and at least one is a proprietary process developed on the premise of cooperation with the commercial application, are each defined as process group information.
[0017]
Further, in order to solve the above problems, the present invention provides a process restart device for a managed system constructed from a plurality of cooperating processes started on a plurality of servers, comprising: process group information storage means that stores, for each of the plurality of processes, as process group information consisting of a process name, a group number, and the name of the server on which the process runs, the group that contains the process and has the smallest number of processes to restart among predetermined groups each containing a failed process and the other processes that are restarted simultaneously with the failed process in order to guarantee consistency with it; and restart process specifying means that, when any process fails due to a fault, identifies the set of processes to restart on the basis of the process group information in the process group information storage means and makes each of the plurality of servers restart only the processes in the identified set that run on it.
[0018]
Further, in the invention described above, where the one piece of process group information stored by the process group information storage means has an inclusion relationship in which it contains every process defined in the other piece of process group information, the restart process specifying means restarts only the processes in the other process group information when a process defined in that other process group information fails due to a fault, and restarts all processes defined in the one process group information when a process defined only in the one process group information fails due to a fault.
[0019]
Further, in the invention described above, the process group information storage means stores process group information defined so as to include each of the plurality of processes, of which at least one is a commercial application and at least one is a proprietary process developed on the premise of cooperation with the commercial application.
[0023]
Further, in order to solve the above problems, the present invention causes a computer to execute: a step of defining, for each of a plurality of processes that are started on a plurality of servers and make up a managed system, as process group information consisting of a process name, a group number, and the name of the server on which the process runs, the group that contains the process and has the smallest number of processes to restart among predetermined groups each containing a failed process and the other processes that are restarted simultaneously with the failed process in order to guarantee consistency with it; and a step of, when any process fails due to a fault, identifying the set of processes to restart on the basis of the process group information and making each of the plurality of servers restart only the processes in the identified set that run on it.
[0024]
Further, in the invention described above, the present invention causes a computer to execute: a step of defining process group information having an inclusion relationship in which one piece of process group information contains every process defined in another piece of process group information; and a step of restarting only the processes in the other process group information when a process defined in that other process group information fails due to a fault, and restarting all processes defined in the one process group information when a process defined only in the one process group information fails due to a fault.
[0028]
In the present invention, for each of a plurality of processes, the other processes that must be restarted simultaneously with a failed process in order to guarantee consistency with it are defined as process group information, and when any process fails due to a fault, the set of processes to restart is identified on the basis of the process group information. This makes it possible to restart servers within the minimum necessary range, to localize the impact of a process failure while guaranteeing consistency within each group, and to rescue processes under exclusive control that takes the relationships between processes into account. As a result, a partial failure is prevented from escalating into a restart (interruption) of the entire system, and the interruption time of the system as a whole can be shortened.
[0029]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
A. Configuration of the embodiment
FIG. 1 is a block diagram showing a configuration of a server process management system according to an embodiment of the present invention. In FIG. 1, the server process management system includes servers SV-1, SV-2, and SV-3. Server SV-1 starts process A, process B (commercial application), process C (commercial application), and management process X. The server SV-2 starts process D, process E, and management process Y. The server SV-3 starts process F, process G, and management process Z.
[0030]
Process A and process B constitute process group PG1. Process C, process D, process E, and process F constitute process group PG2. A process group is a set of processes, among those under startup management, that are operationally related and are treated as a single group. A process group is not confined to a single server; it is valid across all servers that make up the system. The function of the present invention is not limited to in-house processes but also applies to commercial applications.
[0031]
The server process management system according to the present embodiment includes condition files having the configurations shown in FIGS. 2 to 4. FIG. 2 is a conceptual diagram showing the configuration of the startup process management file 20. FIG. 3 is a conceptual diagram showing the configuration of the system configuration management file 21. FIG. 4 is a conceptual diagram showing the configuration of the exclusive control management file 22.
[0032]
To set server and process groups in the startup process management file 20 and the system configuration management file 21 shown in FIGS. 2 and 3, entries are registered in the respective management file in advance. An entry is made for each server or process that needs to be managed in operation, and each record includes a field for the group number. The management process decides how processes are grouped according to these settings. There is no upper limit on the number of entries in one group. A group number of "0" means that no grouping has been set for that process or server.
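As an illustration only, the records of the startup process management file 20 could be modeled as follows. This is a minimal sketch: the class name `ProcessRecord`, the helper names, and the assignment of process G to group 0 are assumptions made for the example, and the actual file layout of FIG. 2 is not reproduced here.

```python
from dataclasses import dataclass

UNGROUPED = 0  # a group number of 0 means "no grouping set", as stated in the text


@dataclass(frozen=True)
class ProcessRecord:
    process: str  # process name
    group: int    # process group number, 0 if ungrouped
    server: str   # name of the server on which the process runs


# Hypothetical records mirroring FIG. 1 (PG1 = group 1, PG2 = group 2).
# Treating process G as ungrouped is an assumption of this sketch.
RECORDS = [
    ProcessRecord("A", 1, "SV-1"),
    ProcessRecord("B", 1, "SV-1"),
    ProcessRecord("C", 2, "SV-1"),
    ProcessRecord("D", 2, "SV-2"),
    ProcessRecord("E", 2, "SV-2"),
    ProcessRecord("F", 2, "SV-3"),
    ProcessRecord("G", UNGROUPED, "SV-3"),
]


def group_of(records, process):
    """Return the group number of a process, or None if it is ungrouped."""
    for rec in records:
        if rec.process == process:
            return None if rec.group == UNGROUPED else rec.group
    raise KeyError(process)


def members_by_server(records, group):
    """Map server name -> sorted process names for one group number."""
    result = {}
    for rec in records:
        if group != UNGROUPED and rec.group == group:
            result.setdefault(rec.server, []).append(rec.process)
    return {server: sorted(procs) for server, procs in result.items()}
```

With these records, `members_by_server(RECORDS, 2)` lists the PG2 processes per server, which is the information a management process would need in order to know which servers host members of a group.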
[0033]
In FIG. 4, exclusive control for rescue startup of processes is set by registering entries in the exclusive control management file 22 in advance. There are four setting patterns: (1) exclusion per server group, (2) exclusion per server, (3) exclusion per process group, and (4) exclusion per process. Each entry carries an identifier indicating which pattern it uses, together with fields for the name of the target of the exclusion and the name of its counterpart. The management process decides whether exclusive control applies according to these settings. There is no upper limit on the number of entries, and the same target or counterpart name may appear in more than one entry.
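The entry structure described above (an identifier plus a target name and a counterpart name) might be sketched as below. The names `Scope` and `ExclusionRule`, the sample entries, and the decision to treat an exclusion as symmetric are all assumptions of this example rather than details taken from FIG. 4.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    """Identifier saying which of the four setting patterns an entry uses."""
    SERVER_GROUP = "server_group"
    SERVER = "server"
    PROCESS_GROUP = "process_group"
    PROCESS = "process"


@dataclass(frozen=True)
class ExclusionRule:
    scope: Scope      # which pattern applies
    target: str       # name the exclusion is set for
    counterpart: str  # name it must not share a server with


# Hypothetical entries; the same target name may appear more than once.
RULES = [
    ExclusionRule(Scope.SERVER_GROUP, "SRVgrp-1", "SRVgrp-2"),
    ExclusionRule(Scope.PROCESS_GROUP, "PRCgrp-1", "PRCgrp-2"),
    ExclusionRule(Scope.PROCESS, "A", "B"),
    ExclusionRule(Scope.PROCESS, "A", "H"),
]


def excluded_counterparts(rules, scope, target):
    """Return every counterpart name listed for `target` at one scope."""
    return {r.counterpart for r in rules if r.scope == scope and r.target == target}


def is_excluded(rules, scope, first, second):
    """True if the pair is listed in either direction (symmetry is assumed)."""
    return any(
        r.scope == scope and {r.target, r.counterpart} == {first, second}
        for r in rules
    )
```

Keeping the scope as an explicit field lets one flat list hold all four patterns, which matches the description of a single file with an identifier per entry.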
[0034]
This is described in more detail below. In this embodiment, for the processes running on the managed server(s), the other processes that must be restarted simultaneously with any given process when it fails, in order to guarantee consistency with it, are extracted before the managed server(s) go into operation, and the extracted processes are collectively defined as a process group. When defining process groups, process groups with an inclusion relationship may be defined, but process groups that share only some of their processes are not defined.
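The rule that groups may nest but may not partially overlap can be checked mechanically before operation. The following is a small sketch of such a check; the function name and the dictionary representation of groups are assumptions made for illustration.

```python
from itertools import combinations


def partially_overlapping(groups):
    """Return pairs of group names that share some, but not all, members.

    `groups` maps a group name to a set of process names. Two groups are
    acceptable if they are disjoint or if one contains the other.
    """
    bad = []
    for name_a, name_b in combinations(sorted(groups), 2):
        a, b = groups[name_a], groups[name_b]
        if (a & b) and not (a <= b or b <= a):
            bad.append((name_a, name_b))
    return bad


# Nested groups are fine; groups sharing only process B are rejected.
nested = {"alpha": {"A", "B", "C", "D"}, "beta": {"A", "B"}, "nu": {"C", "D"}}
overlapping = {"PG1": {"A", "B"}, "PG2": {"B", "C"}}
```

An empty result means the definitions satisfy the constraint, so every process has a unique smallest enclosing group.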
[0035]
In addition, the startup process management file 20, whose records contain at least three pieces of information (the startup process, the process group to which it belongs, and the server on which it is running), is likewise created before the managed server(s) go into operation.
[0036]
When process groups with an inclusion relationship are defined, a startup process may belong to more than one process group. In that case, two kinds of information are defined, for example: information on the inclusion relationships between process groups (for example, "process group α includes process group β and process group ν"), and, for each startup process, the process group that contains it and has the smallest number of startup processes. Alternatively, the two kinds of information defined for each startup process are the process group that contains it and has the smallest number of startup processes, and all other process groups that contain it.
[0037]
When any process fails, the management process that manages the managed servers refers to the startup process management file 20 and uses the information described above to determine the restart range so that the number of processes restarted is minimized, thereby carrying out a "group restart including related processes".
When one piece of process group information contains another, the behavior is as follows. If a process in the contained process group information fails due to a fault, only the processes in the contained process group information are restarted. If a process that belongs only to the containing process group information fails due to a fault, all processes belonging to the containing process group information, including those in the contained one, are restarted.
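The selection of the restart range under an inclusion relationship can be illustrated with a short sketch. The function name and the dictionary form of the groups are assumptions, and restarting an ungrouped process on its own is an assumption of this example, not something the text states.

```python
def restart_scope(groups, failed):
    """Return the set of processes to restart when `failed` goes down.

    `groups` maps a group name to a set of process names. Among the groups
    that contain the failed process, the one with the fewest members is
    chosen. A process in no group is restarted on its own (an assumption).
    """
    containing = [members for members in groups.values() if failed in members]
    if not containing:
        return {failed}
    return set(min(containing, key=len))


# Hypothetical groups: alpha contains beta.
GROUPS = {"alpha": {"A", "B", "C"}, "beta": {"A", "B"}}
```

Here a failure of A restarts only A and B, while a failure of C, which belongs only to the containing group, restarts A, B, and C.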
[0038]
Not only when a startup process fails but also when a managed server itself restarts, the management process refers to the startup process management file 20 and uses the information described above to determine the restart range so that the number of processes restarted is minimized, carrying out the same "group restart including related processes".
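For the server-restart case, a peer server has to work out which of its own processes belong to groups that also run on the restarted server. The sketch below assumes records of the form (process, group number, server) and uses hypothetical function names; the sample data mirrors FIG. 1.

```python
# Hypothetical (process, group number, server) records mirroring FIG. 1.
RECORDS = [
    ("A", 1, "SV-1"), ("B", 1, "SV-1"), ("C", 2, "SV-1"),
    ("D", 2, "SV-2"), ("E", 2, "SV-2"),
    ("F", 2, "SV-3"), ("G", 0, "SV-3"),
]


def groups_on_server(records, server):
    """Group numbers with at least one process on `server` (0 is skipped)."""
    return {group for _, group, srv in records if srv == server and group != 0}


def restart_on_peer(records, restarted_server, peer):
    """Processes a peer restarts after `restarted_server` announces a restart."""
    affected = groups_on_server(records, restarted_server)
    return sorted(
        proc for proc, group, srv in records if srv == peer and group in affected
    )
```

When SV-1 restarts, SV-2 finds that PG2 runs on both servers and restarts D and E, while the ungrouped process on SV-3 is left alone.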
[0039]
In this embodiment, managed servers themselves are grouped in the same way as the startup processes on the managed server(s). In line with the system's operating conditions, the other managed servers that must be restarted simultaneously with a failed managed server, in order to guarantee consistency with it, are extracted before the managed server(s) go into operation, and the extracted managed servers are collectively defined as a server group. When defining server groups, server groups with an inclusion relationship may be defined, but server groups that share only some of their managed servers are not defined.
[0040]
In addition, the system configuration management file 21, whose records contain at least two pieces of information (the managed server and the server group to which it belongs), is likewise created before the managed server(s) go into operation.
[0041]
When server groups with an inclusion relationship are defined, a managed server may belong to more than one server group. In that case, two kinds of information are defined, for example: information on the inclusion relationships between server groups (for example, "server group α includes server group β and a further server group"), and, for each managed server, the server group that contains it and has the smallest number of managed servers. Alternatively, the two kinds of information defined for each managed server are the server group that contains it and has the smallest number of managed servers, and all other server groups that contain it.
[0042]
When any managed server fails, the management process that manages the managed servers refers to the system configuration management file 21 and uses the information described above to determine the restart range so that the number of restarted managed servers is minimized, thereby realizing “server group restart” processing.
When one server group information includes the other server group information and a server included in the other server group information fails, only the servers in the other server group information are restarted. When a server included only in the one server group information fails, all servers belonging to the one server group information, including those of the other server group information, are restarted.
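The inclusion rule above can be illustrated with a minimal Python sketch. It assumes the server groups are available as a mapping from group name to member servers. The group names `SG-alpha` and `SG-beta`, their memberships, and the function `restart_range` are hypothetical and do not appear in the specification.

```python
# Hypothetical server groups; names and memberships are illustrative only.
SERVER_GROUPS = {
    "SG-alpha": {"SV-1", "SV-2", "SV-3"},  # the including ("one") group
    "SG-beta": {"SV-2", "SV-3"},           # the included ("other") group
}


def restart_range(failed_server, groups=SERVER_GROUPS):
    """Return the servers to restart: the smallest group containing the failed server."""
    candidates = [members for members in groups.values() if failed_server in members]
    if not candidates:
        # Assumption of this sketch: a server outside every group restarts alone.
        return {failed_server}
    return set(min(candidates, key=len))
```

Under these assumptions, a failure of SV-2 restarts only the servers of the included group, while a failure of SV-1, which belongs only to the including group, restarts all three servers.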
[0043]
Also, in the present embodiment, to support rescue start of startup processes when a managed server or startup process cannot be restarted, exclusion conditions are extracted before the managed servers are put into operation and are created as the exclusive control management file 22. The exclusion conditions cover at least four patterns (server group, server, process group, and process) and indicate the relationships between processes, servers, process groups, and server groups whose startup processes cannot be started on the same server.
[0044]
The exclusion conditions are: a server group exclusion condition, which identifies server groups on which none of the startup processes of the managed servers belonging to a given server group can be started; a server exclusion condition, which identifies managed servers on which none of the startup processes of a given managed server can be started; a process group exclusion condition, which identifies process groups that cannot be started on the same managed server as the startup processes belonging to a given process group; and a process exclusion condition, which identifies processes that cannot be started on the same managed server as a given process.
[0045]
When a managed server fails, the management process on each server that continues operating refers to the exclusive control management file 22, identifies the startup processes that each managed server can rescue, and thereby realizes rescue start processing of those startup processes.
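As a rough illustration of how the four exclusion conditions might be evaluated together, the following Python sketch checks whether a candidate server can take over one process from a failed server. The dictionary layout, the helper names `excluded` and `can_rescue`, the treatment of exclusions as symmetric pairs, and the process `H` placed on SV-5 are all assumptions made for this sketch; the specification does not define the file format.

```python
# Hypothetical contents of the three management files, merged into one dict.
CONFIG = {
    "server_group_of": {"SV-1": "SG1", "SV-2": "SG1", "SV-3": "SG2", "SV-4": "SG2"},
    "process_group_of": {"A": "PG1", "H": "PG2"},
    "running_on": {"SV-3": [], "SV-5": ["H"], "SV-6": []},
    "exclusions": {
        "server_group": {frozenset(("SG1", "SG2"))},
        "server": set(),
        "process_group": {frozenset(("PG1", "PG2"))},
        "process": set(),
    },
}


def excluded(pairs, a, b):
    """True if the unordered pair (a, b) is listed as mutually exclusive."""
    return a is not None and b is not None and frozenset((a, b)) in pairs


def can_rescue(process, failed_server, candidate, cfg=CONFIG):
    """Check server group, server, process group, and process exclusions in turn."""
    sg, pg, ex = cfg["server_group_of"], cfg["process_group_of"], cfg["exclusions"]
    if excluded(ex["server_group"], sg.get(failed_server), sg.get(candidate)):
        return False
    if excluded(ex["server"], failed_server, candidate):
        return False
    for running in cfg["running_on"].get(candidate, ()):
        if excluded(ex["process_group"], pg.get(process), pg.get(running)):
            return False
        if excluded(ex["process"], process, running):
            return False
    return True
```

With this hypothetical configuration, SV-3 is rejected by the server group exclusion, SV-5 by the process group exclusion, and SV-6 remains a possible rescue destination for process A.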
[0046]
B. Operation of the embodiment
Next, the operation of the server process management system according to the present embodiment will be described. First, FIG. 5 and FIG. 6 are conceptual diagrams for explaining the specific operation principle of the group restart method (when a group process restart occurs). FIGS. 5 and 6 show an outline of the processing of each server when the process D fails.
[0047]
First, an outline of the operation on the server SV-2 will be described with reference to FIG. The management process Y on the server SV-2 detects the failure of the process D and refers to the startup process management file 20 (Sa1). Since the process D belongs to the process group PG2, it sends a restart request for the process group PG2 to the servers SV-1 and SV-3, on which processes defined in the process group PG2 are running (Sa2). At the same time as it sends the restart request to the other servers, it restarts the process E, which is defined in the process group PG2 on its own server (Sa3).
[0048]
Next, an outline of the operation of the server SV-1 will be described with reference to FIG. When the management process X on the server SV-1 receives the restart request for the process group PG2 from the server SV-2, it extracts the processes of the process group PG2 operating on its own server from the startup process management file 20 (Sb1) and restarts the process C, which belongs to the process group PG2 (Sb2). Similarly, on the server SV-3, the processes corresponding to the process group PG2 are restarted.
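The flow in steps Sa1 to Sa3 and Sb1 to Sb2 can be sketched in Python as a lookup over (process name, group number, server name) records, the three fields that the claims give for process group information. The record list below is an assumption: the text places C, D, and E in PG2, but the group of A and B and the processes `F` and `G` on SV-3 are hypothetical, as is the function name.

```python
from collections import defaultdict

# Hypothetical startup process management records:
# (process name, group number, server name).
STARTUP_PROCESSES = [
    ("A", "PG1", "SV-1"), ("B", "PG1", "SV-1"), ("C", "PG2", "SV-1"),
    ("D", "PG2", "SV-2"), ("E", "PG2", "SV-2"),
    ("F", "PG1", "SV-3"), ("G", "PG2", "SV-3"),
]


def restart_plan_for_process_failure(failed_process, records=STARTUP_PROCESSES):
    """Return the failed process's group and, per server, the processes to restart."""
    group = next(g for p, g, _ in records if p == failed_process)
    plan = defaultdict(list)
    for process, g, server in records:
        if g == group:
            plan[server].append(process)
    return group, dict(plan)
```

In this sketch, a failure of process D yields the group PG2 and a per-server plan, so each management process only has to restart its own entry and leaves the PG1 processes untouched.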
[0049]
Next, an outline of processing in the server SV-2 when the server SV-1 is restarted will be described. Here, FIG. 7 and FIG. 8 are conceptual diagrams for explaining the specific operation principle of the group restart method (when server restart occurs).
[0050]
First, an outline of the operation of the server SV-1 will be described with reference to FIG. When the server SV-1 is to be restarted because of a process failure or the like, it notifies each of the other servers of its restart and then restarts. At restart, it refers to the startup process management file 20 (Sc1) and starts the processes A, B, and C that are to be started on its own server (Sc2).
[0051]
Next, the processing outline of the server SV-2 will be described with reference to FIG. After receiving the server restart notification from the server SV-1, the server SV-2 refers to the startup process management file 20 (Sd1) and checks whether the process groups PG1 and PG2, which operate on the server SV-1, also operate on its own server. Since only the process group PG2 applies, the processes D and E in the process group PG2 are restarted (Sd2). Similarly, on the server SV-3, the processes corresponding to the process group PG1 and the process group PG2 are restarted.
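The server restart case (Sc1 to Sc2, Sd1 to Sd2) can be sketched in the same way: collect the process groups present on the restarted server, then restart every process of those groups on each server. The records repeat the same assumptions as before, with the group of A and B and the processes `F` and `G` on SV-3 being hypothetical; the function name is illustrative.

```python
# Hypothetical startup process management records:
# (process name, group number, server name).
RECORDS = [
    ("A", "PG1", "SV-1"), ("B", "PG1", "SV-1"), ("C", "PG2", "SV-1"),
    ("D", "PG2", "SV-2"), ("E", "PG2", "SV-2"),
    ("F", "PG1", "SV-3"), ("G", "PG2", "SV-3"),
]


def restart_plan_for_server_restart(restarted_server, records=RECORDS):
    """Return, per server, the processes whose groups also run on the restarted server."""
    groups = {g for _, g, s in records if s == restarted_server}
    plan = {}
    for process, g, server in records:
        if g in groups:
            plan.setdefault(server, []).append(process)
    return plan
```

For a restart of SV-1 under these assumptions, SV-2 restarts only D and E because it hosts no PG1 process, while SV-3 restarts its processes from both groups.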
[0052]
Next, FIG. 9 is a block diagram showing a configuration example of server types (APL server, WWW server, DB server, etc.) and server groups. In the figure, a server group is a set of operationally related servers among the servers that constitute the system. The server group SG1 is composed of an APL server SV-1 and a WWW server SV-2. The server group SG2 includes an APL server SV-3, an APL server SV-4, and a naming server SV-5. Servers of different types may also be placed in the same group.
[0053]
Next, FIG. 10 and FIG. 11 are conceptual diagrams for explaining the specific operation principle of the server group restart method according to the present embodiment. First, an overview of the processing of the server SV-1, which has received a restart request for the server group SG1 from a maintenance person or the like, will be described with reference to FIG. After receiving the restart request for the server group SG1, the management process of the server SV-1 refers to the system configuration management file 21 (Se1). The server SV-1 then notifies the other server SV-2 of the restart of the server group SG1 (Se2) and, because the server SV-1 itself belongs to the server group SG1, restarts itself (Se3).
[0054]
Next, an outline of the server group restart processing of the server SV-2 will be described with reference to FIG. The only difference from the server SV-1 is whether the request source is a maintenance person or another server. After receiving the restart request for the server group SG1, the server SV-2 refers to the system configuration management file 21 (Sf1). Because the server SV-2 belongs to the server group SG1, it also restarts (Sf2).
[0055]
Next, FIG. 12 is a block diagram showing, for the rescue start method when process exclusive control is introduced, the relationship between the exclusive control targets and their counterparts among server groups, servers, process groups, and processes, based on the exclusive control management file. FIGS. 13 and 14 are conceptual diagrams for explaining an outline of exclusive control when the server SV-1 fails. Note that, in order to control the startup of processes that cannot be allowed to start on the same server because of various restrictions in configuring the system, the exclusive control targets and their counterparts are specified in the range of server, server group, and process.
[0056]
When the management process of the server SV-3 detects a failure of the server SV-1, it refers to the exclusive control management file 22 and the system configuration management file 21 to check whether the processes can be rescued on its own server, and acquires the exclusion information for server groups, servers, process groups, and processes. For the server group SG1 (SRVgrp-1), information on the servers constituting the server group is acquired from the system configuration management file 21, and the process group is acquired from the startup process management file 20. Based on these pieces of information, it determines whether its own server is subject to exclusion (Sg1).
[0057]
In this case, because the exclusive control management file 22 describes exclusive control against the server group SG1 (SRVgrp-1), the server SV-3, which belongs to the server group SG2 (SRVgrp-2), is not a process rescue destination (Sg2).
[0058]
The server SV-5, which is not defined in any group, has no exclusion defined against the failed server SV-1 of the server group SG1 (SRVgrp-1). However, an exclusion against the process group PG1 (PRCgrp-1) is defined for the process group PG2 (PRCgrp-2) on the server SV-5, so the server SV-5 is not a process rescue destination either.
[0059]
Further, when the management process of the server SV-6 detects a failure of the server SV-1, it refers to the exclusive control management file 22 and the system configuration management file 21 to check whether the processes can be rescued on its own server, and acquires the exclusion information for server groups, servers, process groups, and processes. It then confirms, from the viewpoint of the server group and the server, whether its own server is set as exclusive with respect to the server SV-1 (Sh1). In this case, it is determined that rescue start of the process A on the server SV-6 is possible. However, since another server, in this case the server SV-2, can also perform the rescue start, the total numbers of processes are compared based on the internal management information, and the server SV-2, which has the smallest total number of started processes, is determined as the rescue start destination (Sh2).
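The final selection in step Sh2 amounts to choosing, among the servers that passed the exclusion checks, the one with the fewest started processes. A minimal Python sketch follows. The process counts are hypothetical, and breaking ties by server name is an assumption of this sketch, since the text does not say how ties are resolved.

```python
def choose_rescue_destination(eligible_servers, total_processes):
    """Pick the eligible server with the fewest started processes.

    Ties are broken by server name (an assumption of this sketch).
    Returns None when no server is eligible.
    """
    if not eligible_servers:
        return None
    return min(eligible_servers, key=lambda server: (total_processes[server], server))


# Hypothetical internal management information: started processes per server.
process_counts = {"SV-2": 3, "SV-6": 5}
destination = choose_rescue_destination(["SV-6", "SV-2"], process_counts)
```

With the hypothetical counts above, `destination` is `"SV-2"`, matching the outcome described for process A.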
[0060]
[Effect of the invention]
As described above, according to the present invention, for each of a plurality of processes, the other processes that must be restarted together with a failed process in order to ensure consistency with it are defined as process group information, and when any process fails, the process group to be restarted is specified based on the process group information. This localizes the range of influence when a failure occurs, guarantees consistency on a group basis, and allows processes to be rescued through exclusive control that takes the relationships between processes into account. As a result, a partial failure can be prevented from escalating into a restart (interruption) of the entire system, and the interruption time of the entire system can be shortened.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a configuration of a server process management system according to an embodiment of the present invention.
FIG. 2 is a conceptual diagram showing a configuration of a startup process management file.
FIG. 3 is a conceptual diagram showing a configuration of a system configuration management file.
FIG. 4 is a conceptual diagram showing a configuration of an exclusive control management file.
FIG. 5 is a conceptual diagram for explaining a specific operation principle of a group restart method (when a group process restart occurs).
FIG. 6 is a conceptual diagram for explaining a specific operation principle of a group restart method (when a group process restart occurs).
FIG. 7 is a conceptual diagram for explaining a specific operation principle of a group restart method (when a server restart occurs).
FIG. 8 is a conceptual diagram for explaining a specific operation principle of a group restart method (when a server restart occurs).
FIG. 9 is a block diagram showing a configuration example of server types (APL server, WWW server, DB server, etc.) and server groups.
FIG. 10 is a conceptual diagram for explaining a specific operation principle of the server group restart method according to the present embodiment.
FIG. 11 is a conceptual diagram for explaining a specific operation principle of the server group restart method according to the present embodiment.
FIG. 12 is a block diagram showing the relationship between exclusive control targets and their counterparts among server groups, servers, process groups, and processes, based on an exclusive control management file, in the rescue start method when process exclusive control is introduced.
FIG. 13 is a conceptual diagram for explaining an overview of exclusive control when a server SV-1 fails.
FIG. 14 is a conceptual diagram for explaining an overview of exclusive control when a server SV-1 fails.
FIG. 15 is a conceptual diagram for explaining a problem of a process resumption method according to a conventional technique.
FIG. 16 is a conceptual diagram for explaining a problem of a server restart method according to the prior art.
FIG. 17 is a conceptual diagram for explaining a problem of a server restart method according to a conventional technique.
FIG. 18 is a conceptual diagram for explaining a problem of a server restart method according to the prior art.
FIG. 19 is a conceptual diagram for explaining a problem of a process rescue activation method according to a conventional technique.
[Explanation of symbols]
20 Startup process management file (process group information, process group information storage means)
21 System configuration management file (server group information, server group information storage means)
22 Exclusive control management file (exclusive condition information, exclusive condition information storage means)
SV-1 to SV-6 server
PG1, PG2 process group
SG1, SG2 server group
A to J Process
X, Y, Z management process (resumption process identification means, resumption server identification means, server identification means)

Claims

1. A process restart method for a managed system constructed by a plurality of processes that are started on a plurality of servers and operate in cooperation, the method characterized by:
holding, for each of the plurality of processes, as process group information consisting of a process name, a group number, and the name of the server on which the process operates, the group that contains the process and has the minimum number of processes to be restarted among predetermined groups each consisting of a failed process and the other processes that are restarted simultaneously with the failed process to ensure consistency with it; and
when any process fails, specifying the process group to be restarted based on the process group information, and restarting, on each of the plurality of servers, only the corresponding processes included in the specified process group.

2. The process restart method according to claim 1, wherein,
when one process group information and the other process group information have an inclusion relationship in which the one process group information includes all the processes defined in the other process group information, only the processes in the other process group information are restarted if a process defined in the other process group information fails, and all the processes defined in the one process group information are restarted if a process defined only in the one process group information fails.

3. The process restart method according to claim 1, wherein at least one of the plurality of processes is a commercial application, at least one of the plurality of processes is a proprietary process developed on the premise of cooperation with the commercial application, and each of the plurality of processes is defined as process group information.

4. A process restart device for a managed system constructed by a plurality of processes that are started on a plurality of servers and operate in cooperation, the device characterized by comprising:
process group information storage means for storing, for each of the plurality of processes, as process group information consisting of a process name, a group number, and the name of the server on which the process operates, the group that contains the process and has the minimum number of processes to be restarted among predetermined groups each consisting of a failed process and the other processes that are restarted simultaneously with the failed process to ensure consistency with it; and
restart process specifying means for, when any process fails, specifying the process group to be restarted based on the process group information in the process group information storage means, and restarting, on each of the plurality of servers, only the corresponding processes included in the specified process group.

5. The process restart device according to claim 4, wherein
the process group information storage means stores one process group information and the other process group information that have an inclusion relationship in which the one process group information includes all the processes defined in the other process group information, and
the restart process specifying means restarts only the processes in the other process group information when a process defined in the other process group information fails, and restarts all the processes defined in the one process group information when a process defined only in the one process group information fails.

6. The process restart device according to claim 4, wherein
the process group information storage means stores process group information in which each of the plurality of processes is defined, at least one of the plurality of processes being a commercial application and at least one of the plurality of processes being a proprietary process developed on the premise of cooperation with the commercial application.

7. A process restart program that causes a computer to execute:
a step of defining, for each of a plurality of processes that are started on a plurality of servers and construct a managed system, as process group information consisting of a process name, a group number, and the name of the server on which the process operates, the group that contains the process and has the minimum number of processes to be restarted among predetermined groups each consisting of a failed process and the other processes that are restarted simultaneously with the failed process to ensure consistency with it; and
a step of, when any process fails, specifying the process group to be restarted based on the process group information, and restarting, on each of the plurality of servers, only the corresponding processes included in the specified process group.

8. The process restart program according to claim 7, further causing the computer to execute:
a step of defining one process group information and the other process group information that have an inclusion relationship in which the one process group information includes all the processes defined in the other process group information; and
a step of restarting only the processes in the other process group information when a process defined in the other process group information fails, and restarting all the processes defined in the one process group information when a process defined only in the one process group information fails.