JP2014078067A

JP2014078067A - Database system, database device, failure recovery method for database and program

Info

Publication number: JP2014078067A
Application number: JP2012224147A
Authority: JP
Inventors: Masanori Matsuda; 政宣松田
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2012-10-09
Filing date: 2012-10-09
Publication date: 2014-05-01
Anticipated expiration: 2032-10-09
Also published as: JP6070040B2

Abstract

PROBLEM TO BE SOLVED: To provide a database system capable of executing learned failure recovery processing even at the occurrence of failures in a plurality of servers.

SOLUTION: In a primary site and a secondary site each including an application server and a database server, each server includes a server state monitoring part. The server state monitoring part includes: a command learning function for storing learning data obtained by extracting a specific pattern from a command input by a maintenance person at the occurrence of a failure in association with an error code; a failure detection function for detecting that the failure has occurred in any server; an error pattern search function for searching whether or not the same error code as the detected failure is stored in the learning data; and a learned command execution function for executing the command of the pattern corresponding to the error code.

Description

The present invention relates to a database system, a database apparatus, a database failure recovery method, and a program, and more particularly to a database system and the like that learn and automate failure recovery processing.

In a computer system used by a company or similar organization, even a brief outage can cause enormous losses because business operations stop while the system is down. In particular, an enterprise system that runs a database storing and handling large volumes of data, together with application software that uses that data to perform processing central to the company's activities, must never stop and must never lose data.

For this reason, when a fault occurs in such a computer system, recovery is desired as quickly as possible. However, the scale of these systems and the volume of data they handle keep growing, and the time and effort needed for recovery keep growing with them.

In particular, since the Great East Japan Earthquake of March 11, 2011, there have been loud calls for so-called disaster recovery, which prevents data loss and allows processing to continue even when such a computer system is hit by a large-scale disaster such as an earthquake, storm, or flood.

Accordingly, techniques for reducing the load on maintenance workers, and in particular techniques for recovering from failures automatically without a maintenance worker's intervention, have been under active development in recent years. The following technical documents relate to this. Among them, Patent Document 1 describes a failure recovery apparatus that registers the recovery procedure performed on a computer system in preparation for the next occurrence of the same failure.

Patent Document 2 describes a management system that automatically executes a failure recovery program when a failure occurs in a job network. Patent Document 3 describes an automatic monitoring and recovery system in which a single monitoring device monitors and recovers the processing services provided by a plurality of servers.

Patent Document 1: Japanese Patent Laid-Open No. 10-091475 (特開平10-091475号公報)
Patent Document 2: JP 2004-318763 A (特開2004-318763号公報)
Patent Document 3: JP 2002-342180 A (特開2002-342180号公報)

FIG. 13 is an explanatory diagram showing the overall configuration of a database system 901 according to the existing technology. The database system 901 is configured by connecting a primary site 902 (active system) and a secondary site 903 (standby system) via a network 904.

The primary site 902 includes an application server 911, a database server 912, and a storage device 913. Similarly, the secondary site 903 includes an application server 921, a database server 922, and a storage device 923. The application servers 911 and 921 together form a server cluster, as do the database servers 912 and 922.

The application servers 911 and 921 monitor each other's operation, and if one of them stops abnormally, the remaining one can continue operation alone. Likewise, the database servers 912 and 922 monitor each other's operation, and if one of them stops abnormally, the remaining one can continue operation alone.

In recent years in particular, with the spread of e-commerce (electronic commerce), many web systems have come into operation that combine an application server, which runs application software such as a web server, with a database server, which handles large amounts of data about the products and services traded there. In such cases, systems like the database system 901 shown in FIG. 13 are used, in which the application server and the database server are each duplicated and monitor each other so that the standby system can take over processing whenever an abnormality occurs in the active system.

However, in such systems, switching from the active system to the standby system still requires manual operation by a maintenance worker. This operation is inevitably complicated and prone to human error.

In particular, because of its hardware and software configuration, any given system inevitably has specific failures that occur especially often. Although experience shows that the response to such failures can be patterned to some extent, no prior art turns that response into a pattern that can be executed automatically.

Patent Documents 1 to 3 mentioned above do describe learning a failure recovery operation and automatically performing the recovery operation for the same failure, but none of them describes anything applicable to the configuration of the database system 901 shown in FIG. 13. In particular, none of Patent Documents 1 to 3 describes anything that can detect and automatically recover from failures when one or more failures occur in both the application servers and the database servers.

An object of the present invention is to provide a database system, a database apparatus, a database failure recovery method, and a program that make it possible to execute learned failure recovery processing even when failures occur in one or more of both the application servers and the database servers that monitor one another.

To achieve the above object, a database system according to the present invention is a database system in which a primary site including an active application server and an active database server and a secondary site including a standby application server and a standby database server are connected via a network. The active application server, the active database server, the standby application server, and the standby database server each include a server state monitoring unit, and these units monitor one another. Each server state monitoring unit has: a command learning function that records, as a log, the commands entered by a maintenance person when a failure occurs in any of the servers constituting the database system, stores learning data obtained by extracting a specific pattern from this log in a storage means provided in advance, in association with an error code indicating the symptom of the failure that occurred, and also sends the learning data to each of the other servers for storage; a failure detection function that detects that a failure has occurred in any of the servers; an error pattern search function that searches whether the same error code as that of the detected failure is stored in the learning data; and a learned command execution function that, when the same error code as that of the detected failure is stored in the learning data, causes the active application server to execute the commands of the pattern corresponding to that error code.

To achieve the above object, a database apparatus according to the present invention is a database apparatus that can function as the active database server or the standby database server in a database system in which a primary site including an active application server and an active database server and a secondary site including a standby application server and a standby database server are connected via a network. The database apparatus includes a server state monitoring unit that monitors, and is monitored by, the other apparatuses. This server state monitoring unit has: a command learning function that records, as a log, the commands entered by a maintenance person when a failure occurs in any of the servers constituting the database system, stores learning data obtained by extracting a specific pattern from this log in a storage means provided in advance, in association with an error code indicating the symptom of the failure that occurred, and also sends the learning data to each of the other servers for storage; a failure detection function that detects that a failure has occurred in any of the servers; an error pattern search function that searches whether the same error code as that of the detected failure is stored in the learning data; and a learned command execution function that, when the same error code as that of the detected failure is stored in the learning data, causes the active application server to execute the commands of the pattern corresponding to that error code.

To achieve the above object, a failure recovery method according to the present invention is used in a database system in which a primary site including an active application server and an active database server and a secondary site including a standby application server and a standby database server are connected via a network. In this method: the command learning function of each server constituting the database system records, as a log, the commands entered by a maintenance person when a failure occurs in any of the servers; the command learning function of each server stores learning data obtained by extracting a specific pattern from the recorded log in a storage means provided in advance, in association with an error code indicating the symptom of the failure that occurred, and also sends the learning data to each of the other servers for storage; the failure detection function of each server detects that a failure has occurred in any of the servers; the error pattern search function of each server searches whether the same error code as that of the detected failure is stored in the learning data; and, when the same error code as that of the detected failure is stored in the learning data, the learned command execution function of each server causes the active application server to execute the commands of the pattern corresponding to that error code.

To achieve the above object, a failure recovery program according to the present invention is used in a database system in which a primary site including an active application server and an active database server and a secondary site including a standby application server and a standby database server are connected via a network. The program causes the processor of each server constituting the database system to execute: a procedure of recording, as a log, the commands entered by a maintenance person when a failure occurs in any of the servers constituting the database system; a procedure of storing learning data obtained by extracting a specific pattern from the recorded log in a storage means provided in advance, in association with an error code indicating the symptom of the failure that occurred, and also sending the learning data to each of the other servers for storage; a procedure of detecting that a failure has occurred in any of the servers; a procedure of searching whether the same error code as that of the detected failure is stored in the learning data; and a procedure of causing the active application server to execute the commands of the pattern corresponding to that error code when the same error code as that of the detected failure is stored in the learning data.

As described above, in the present invention the active and standby application servers and database servers all include server state monitoring units that monitor one another for failures, so that when a failure occurs the learned failure recovery operation can be executed from any of the apparatuses. This makes it possible to provide a database system, a database apparatus, a database failure recovery method, and a program with the advantageous feature that learned failure recovery processing can be executed even when failures occur in one or more of both the application servers and the database servers.

FIG. 1 is an explanatory diagram showing the overall configuration of a database system according to an embodiment of the present invention.
FIG. 2 is an explanatory diagram showing a more detailed configuration of the primary site shown in FIG. 1.
FIG. 3 is an explanatory diagram showing a more detailed configuration of the secondary site shown in FIG. 1.
FIG. 4 is a sequence diagram showing the flow of processing for learning the processing performed by each server when all components of the database system shown in FIG. 1 are operating normally.
FIG. 5 is an explanatory diagram showing an example of the error patterns presented to the maintenance person in steps S110 to S114 of FIG. 4 and of the learning data that is stored.
FIG. 6 is a sequence diagram showing the flow of operation in the database system shown in FIG. 1 when all server state monitoring units are operating normally but a failure occurs in a database server.
FIG. 7 is a sequence diagram showing the flow of operation in the database system shown in FIG. 1 when a failure occurs in a database server and an application server is also not operating normally.
FIG. 8 is a sequence diagram showing the flow of operation in the database system shown in FIG. 1 when the server state monitoring unit of the active database server has stopped.
FIG. 9 is a sequence diagram showing the flow of operation in the database system shown in FIG. 1 when the server state monitoring unit of the standby database server has stopped.
FIG. 10 is a sequence diagram showing the flow of processing for learning the operation performed in the database system shown in FIG. 1 when the server state monitoring unit of the active database server has stopped.
FIG. 11 is a sequence diagram showing the flow of processing for learning the operation performed in the database system shown in FIG. 1 when the server state monitoring unit of the standby database server has stopped.
FIG. 12 is an explanatory diagram showing the processing performed by the command learning function at the initial stage of system startup in the database system shown in FIG. 1.
FIG. 13 is an explanatory diagram showing the overall configuration of a database system according to the existing technology.

(Embodiment)
Hereinafter, the configuration of an embodiment of the present invention will be described with reference to the accompanying FIGS. 1 to 3.
First, the basic content of the present embodiment will be described, followed by more specific content.
The database system 1 according to the present embodiment is a database system in which a primary site 2 including an active application server 11 and an active database server 12 and a secondary site 3 including a standby application server 21 and a standby database server 22 are connected via a network 4. The active application server 11, the active database server 12, the standby application server 21, and the standby database server 22 include server state monitoring units 102, 112, 202, and 212, respectively, which monitor one another. Each server state monitoring unit has: a command learning function 102a that records, as a log, the commands entered by a maintenance person when a failure occurs in any of the servers constituting the database system, stores learning data obtained by extracting a specific pattern from this log in a storage means provided in advance, in association with an error code indicating the symptom of the failure that occurred, and also sends the learning data to each of the other servers for storage; a failure detection function 102b that detects that a failure has occurred in any of the servers; an error pattern search function 102c that searches whether the same error code as that of the detected failure is stored in the learning data; and a learned command execution function 102d that, when the same error code as that of the detected failure is stored in the learning data, causes the active application server to execute the commands of the pattern corresponding to that error code.
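The four functions described above can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the class `ServerStateMonitor`, its method names, the error codes, and the command strings are all hypothetical and do not appear in the specification, and sending learning data to peers is modeled as a direct in-memory write rather than network transfer.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class ServerStateMonitor:
    """Hypothetical model of one server state monitoring unit (e.g. 102)."""
    name: str
    learning_data: Dict[str, List[str]] = field(default_factory=dict)
    peers: List["ServerStateMonitor"] = field(default_factory=list)

    def learn(self, error_code: str, pattern: List[str]) -> None:
        """Command learning: store the pattern locally and send it to every peer."""
        self.learning_data[error_code] = list(pattern)
        for peer in self.peers:
            peer.learning_data[error_code] = list(pattern)

    def search(self, error_code: str) -> Optional[List[str]]:
        """Error pattern search: return the learned pattern, or None if unknown."""
        return self.learning_data.get(error_code)

    def on_failure(self, error_code: str, run: Callable[[str], None]) -> bool:
        """After a failure is detected, run the learned commands if a pattern exists."""
        pattern = self.search(error_code)
        if pattern is None:
            return False  # nothing learned yet; a maintenance person must act
        for command in pattern:
            run(command)
        return True


# Hypothetical error codes and commands, for illustration only.
app1 = ServerStateMonitor("102")
db1 = ServerStateMonitor("112")
app1.peers.append(db1)
app1.learn("E-DB-001", ["stop_db", "restore_backup", "start_db"])

executed: List[str] = []
handled = db1.on_failure("E-DB-001", executed.append)
unknown = db1.on_failure("E-XX-999", executed.append)
```

Because the learning data is copied to every peer, the unit that detects the failure need not be the unit that learned the pattern, which is the property the embodiment relies on.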

Here, the command learning function 102a presents the patterns extracted from the log to the user and stores the pattern selected by the user as learning data. In addition, when the failure detection function 102b detects that a failure has occurred in any of the servers, it has the command learning function 102a send the learning data only after detecting that the failed server has recovered.
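The deferred sending described above might be sketched as follows. This is a sketch under assumptions, not the specification's implementation: the class `LearningDataSender` and its methods are hypothetical, peers are identified by the reference numbers of their monitoring units, and the sketch assumes that all distribution is held back while any peer is down.

```python
from typing import Dict, List


class LearningDataSender:
    """Hypothetical helper that holds learning data while any peer is down."""

    def __init__(self, peers: List[str]) -> None:
        self.peer_up: Dict[str, bool] = {peer: True for peer in peers}
        self.delivered: Dict[str, Dict[str, List[str]]] = {peer: {} for peer in peers}
        self.pending: Dict[str, List[str]] = {}

    def mark_failed(self, peer: str) -> None:
        self.peer_up[peer] = False

    def mark_recovered(self, peer: str) -> None:
        self.peer_up[peer] = True
        self._flush()

    def send(self, error_code: str, pattern: List[str]) -> None:
        self.pending[error_code] = list(pattern)
        self._flush()

    def _flush(self) -> None:
        # Assumption: nothing is sent until every peer is known to be up.
        if not all(self.peer_up.values()):
            return
        for peer in self.delivered:
            for code, pattern in self.pending.items():
                self.delivered[peer][code] = list(pattern)
        self.pending.clear()


sender = LearningDataSender(["112", "202", "212"])
sender.mark_failed("212")
sender.send("E-DB-001", ["restart_db"])  # hypothetical code and command
held_while_down = dict(sender.pending)
delivered_while_down = dict(sender.delivered["112"])
sender.mark_recovered("212")
```

Holding the data until recovery keeps the copies on all servers identical, at the cost of delaying distribution to the servers that stayed up.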

Furthermore, when the active application server 11 cannot execute the command, the learned command execution function 102d causes the standby application server 21 to execute it.
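This fallback from the active to the standby application server can be illustrated with a short sketch. The function `execute_with_fallback`, the use of `ConnectionError` to signal that a server cannot execute commands, and the command strings are assumptions made for illustration; the specification does not say how the inability to execute is detected.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Executor = Callable[[str], None]


def execute_with_fallback(
    commands: Sequence[str],
    servers: Sequence[Tuple[str, Executor]],
) -> Optional[str]:
    """Run all commands on the first server that accepts them; return its name."""
    for name, run in servers:
        try:
            for command in commands:
                run(command)
        except ConnectionError:
            continue  # this server cannot execute; try the next one
        return name
    return None  # no server could execute the commands


def unreachable(command: str) -> None:
    # Stand-in for an active application server that cannot execute commands.
    raise ConnectionError("application server 11 is not responding")


ran_on_standby: List[str] = []
chosen = execute_with_fallback(
    ["stop_db", "start_db"],
    [
        ("application server 11", unreachable),
        ("application server 21", ran_on_standby.append),
    ],
)
```

The sketch reruns the whole pattern on the standby server; a real system would also have to consider commands that were partly executed before the active server failed.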

With the above configuration, the database system 1 can execute the learned failure recovery processing even when failures occur in one or more of both the application servers and the database servers.
This will be described in more detail below.

FIG. 1 is an explanatory diagram showing the overall configuration of the database system 1 according to the embodiment of the present invention. The database system 1 is configured by connecting a primary site 2 (active system) and a secondary site 3 (standby system) via a network 4. The network 4 is a network having the same subnet mask.

The primary site 2 includes an application server 11, a database server 12, and a storage device 13. Similarly, the secondary site 3 includes an application server 21, a database server 22, and a storage device 23. The application servers 11 and 21 together form a server cluster, as do the database servers 12 and 22.

The primary site 2 and the secondary site 3 include a large number of client devices 14a, 14b, 14c, ... and 24a, 24b, 24c, ..., respectively. Many peripheral devices 15a, 15b, 15c, ... and 25a, 25b, 25c, ... are connected to the application servers 11 and 21, respectively. Since these are not needed to explain the present embodiment, a detailed description of their configuration is omitted here.

FIG. 2 is an explanatory diagram showing a more detailed configuration of the primary site 2 shown in FIG. 1. The application server 11 and the database server 12 are both general computer devices (server machines). Although FIG. 2 shows them as physically separate computers, the application server 11 and the database server 12 may be implemented on the same computer, and either the application server 11 or the database server 12 may be configured virtually from a combination of multiple computers.

The application server 11 includes a processor 11a that executes computer programs, a storage means 11b that stores computer programs and operation data, and a communication means 11c that connects to the network 4 and performs data communication with other computers. Similarly, the database server 12 includes a processor 12a that executes computer programs, a storage means 12b that stores computer programs and operation data, and a communication means 12c that connects to the network 4 and performs data communication with other computers.

On the processor 11a of the application server 11, an application operation unit 101 runs; it operates applications (specifically a web server, a business system, and the like) in response to processing requests from the client devices. On the processor 12a of the database server 12, a database operation unit 111 runs; it stores in the storage device 13 the data processed by the applications operating in the application operation unit 101.

The storage device 13 includes a main storage unit 13a using ordinary magnetic disks or semiconductor disks, and a backup storage unit 13b using tape (QIC, DDS, DLT, etc.), optical disks, or the like.

Furthermore, on the processor 11a of the application server 11 and the processor 12a of the database server 12, server state monitoring units 102 and 112 run concurrently as server processes. They monitor each other's operation, also monitor the operation of the application server 21 and the database server 22 at the secondary site 3, and, on detecting an abnormality, recover the affected server from it.

The storage means 11b of the application server 11 and the storage means 12b of the database server 12 store learning data 103 and 113, written by the server state monitoring units 102 and 112, respectively.

FIG. 3 is an explanatory diagram showing a more detailed configuration of the secondary site 3 shown in FIG. 1. The application server 21 and the database server 22 at the secondary site 3 are both general computer devices (server machines).

The application server 21, the database server 22, and the storage device 23 have the same configuration as the application server 11, the database server 12, and the storage device 13 at the primary site 2. The only difference is in the reference numbers: those of hardware elements are increased by 10, and those of software elements are increased by 100. The names of the elements are all the same. A detailed description of their configuration is therefore omitted here.
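The numbering convention can be expressed as a small helper, shown here only as a reading aid. The function name `secondary_reference` is hypothetical, and the sketch assumes that hardware elements carry two-digit numbers (11, 12, 13) and software elements three-digit numbers (101, 102, 112, ...), which matches the numbers used in this description.

```python
import re


def secondary_reference(primary_ref: str) -> str:
    """Map a primary-site reference number to its secondary-site counterpart."""
    match = re.fullmatch(r"(\d+)([a-z]?)", primary_ref)
    if match is None:
        raise ValueError(f"not a reference number: {primary_ref!r}")
    number, suffix = int(match.group(1)), match.group(2)
    # Assumption: three-digit numbers denote software elements (+100),
    # two-digit numbers denote hardware elements (+10).
    offset = 100 if number >= 100 else 10
    return f"{number + offset}{suffix}"


pairs = {ref: secondary_reference(ref) for ref in ["11", "12a", "13", "102", "112a"]}
```

For example, the server state monitoring unit 102 of the application server 11 corresponds to the unit 202 of the application server 21.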

The server state monitoring unit 102 of the application server 11 has a command learning function 102a, a failure detection function 102b, an error pattern search function 102c, and a learned command execution function 102d. The server state monitoring units 112, 202, and 212 of the database server 12, the application server 21, and the database server 22 each have these same functions. In the rest of this specification, therefore, the command learning function of the server state monitoring unit 112 of the database server 12, for example, is referred to as the "command learning function 112a", and so on.

FIG. 4 is a sequence diagram showing the flow of processing for learning the processing performed by each server when all components of the database system 1 shown in FIG. 1 are operating normally. FIG. 4 shows an example in which the maintenance person issues a start request to the application server 11 at the primary site 2. This start request may be entered by operating the application server 11 directly, or by remotely accessing the application server 11 from any of the client devices 14 at the primary site 2. A similar start request can also make the application server 21 at the secondary site 3 perform the same operation as in FIG. 4.

Upon receiving the start request (step S101), the command learning function 102a of the server state monitoring unit 102 starts recording an operation log and, at the same time, sends a start request to the server state monitoring unit of each of the other servers (steps S102a to S102c): the server state monitoring unit 112 of the database server 12 on the primary site 2 side, and the server state monitoring unit 202 of the application server 21 and the server state monitoring unit 212 of the database server 22 at the secondary site 3.

The server state monitoring units 112, 202, and 212 that received the start request each start recording an operation log with their command learning functions 112a, 202a, and 212a, and return a normal response to the start request to the server state monitoring unit 102 (steps S103a to S103c). The command learning function 102a of the server state monitoring unit 102 then returns a normal response to the maintenance person (step S104).

From this point on, for every command that the user enters on the application server 11 and that is forwarded from there to the other servers as appropriate, each of the server state monitoring units 102, 112, 202, and 212 logs the command, the operation performed in response to it, and the response returned (steps S105 to S106). FIG. 4 omits the details of the commands forwarded from the application server 11 to the other servers and of the responses from those servers.

When the maintenance person enters an end processing request in the same way as the start request (step S107), the server state monitoring unit 102 stops recording its operation log and at the same time sends an end processing request to each of the other server state monitoring units 112, 202, and 212 (steps S108a to S108c). On receiving the end processing request, each of the server state monitoring units 112, 202, and 212 stops recording its operation log and sends the recorded log to the server state monitoring unit 102 (steps S109a to S109c).

The server state monitoring unit 102 extracts error patterns from the operation log it recorded itself and the operation logs received from the other units, and presents them to the maintenance person (step S110). The details of this operation are described later.

The maintenance person then selects and enters the error pattern to be learned from among those presented (step S111). The server state monitoring unit 102 stores the selected error pattern as learning data 103 and sends it to each of the other server state monitoring units 112, 202, and 212 (steps S112a to S112c). Each of the server state monitoring units 112, 202, and 212 receives the error pattern, stores it as learning data 113, 203, or 213, respectively (steps S113a to S113c), and returns a normal response (step S114).
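The learning sequence of FIG. 4 can be sketched as a single in-process simulation. This is only an illustration under stated assumptions: the class `MonitoringUnit`, the function `run_learning_session`, and the `select` callback that stands in for the maintenance person are all hypothetical, network transport is replaced by direct method calls, and every unit is assumed to log every command.

```python
class MonitoringUnit:
    """Hypothetical stand-in for one server state monitoring unit."""

    def __init__(self, name):
        self.name = name
        self.recording = False
        self.log = []
        self.learning_data = {}  # error code -> learned pattern

    def start(self):
        # S101 / S102-S103: begin recording an operation log and acknowledge.
        self.recording = True
        self.log = []
        return "OK"

    def record(self, entry):
        # S105-S106: log a command (operations and responses omitted here).
        if self.recording:
            self.log.append((self.name, entry))

    def end(self):
        # S107 / S108-S109: stop recording and hand back the recorded log.
        self.recording = False
        return list(self.log)

    def store(self, error_code, pattern):
        # S112-S114: keep the selected pattern as learning data.
        self.learning_data[error_code] = pattern
        return "OK"


def run_learning_session(coordinator, others, commands, select):
    """Run one learning session; `select` plays the maintenance person (S111)."""
    units = [coordinator, *others]
    for unit in units:
        unit.start()
    for command in commands:
        for unit in units:
            unit.record(command)
    # The coordinator gathers its own log and the logs returned by the others.
    collected = [entry for unit in units for entry in unit.end()]
    error_code, pattern = select(collected)  # S110-S111
    for unit in units:
        unit.store(error_code, pattern)
    return error_code, pattern
```

In this sketch the coordinator corresponds to the server state monitoring unit 102, and the same pattern ends up in the learning data of every unit, as in steps S112 to S114.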

FIG. 5 is an explanatory diagram showing an example of the error pattern presented to the maintenance person in steps S110 to S114 of FIG. 4 and of the stored learning data 103, 113, 203, and 213. In step S110, the maintenance person is presented with the target device 120a and the error code 120b of the error that occurred, together with the commands 120c that were actually entered on each device in response.

In step S111, the maintenance person selects, from among the commands 120c, those that actually need to be entered, and replaces elements such as a host name or a process name with variables denoting the "target device", the "target process", and so on. The data entered in this way becomes the learning pattern 120d, which is associated with the error code 120b and stored as the learning data 103, 113, 203, and 213 of the respective devices in steps S112 to S114.
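A minimal sketch of this generalization step follows. The command string, the host name `db12`, the process name `dbwriter`, and the functions `generalize` and `instantiate` are all invented for illustration; the patent does not specify a command syntax or a template format. The sketch assumes the commands contain no literal braces, since `str.format` would treat them as placeholders.

```python
def generalize(command: str, substitutions: dict[str, str]) -> str:
    """Replace concrete values (e.g. a host name) with named variables."""
    for concrete, variable in substitutions.items():
        command = command.replace(concrete, "{" + variable + "}")
    return command


def instantiate(pattern: str, values: dict[str, str]) -> str:
    """Fill the variables of a learned pattern with concrete values."""
    return pattern.format(**values)


# Hypothetical command as entered during recovery.
entered = "restart --host db12 --process dbwriter"
pattern = generalize(entered, {"db12": "target_device", "dbwriter": "target_process"})
print(pattern)  # restart --host {target_device} --process {target_process}
print(instantiate(pattern, {"target_device": "db22", "target_process": "dbwriter"}))
```

Storing the generalized form rather than the literal command is what lets one learned pattern be replayed against a different device or process.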

Note that the learning data actually created differs from device to device, because the application server and the database server should execute different commands, as should the primary site and the secondary site. These per-device differences in the commands to be executed are well known to those skilled in the art, so a detailed explanation is omitted here.

FIG. 6 is a sequence diagram showing the flow of operation when a failure occurs in the database server 12 while all of the server state monitoring units 102, 112, 202, and 212 of the database system 1 shown in FIG. 1 are operating normally. Because the server state monitoring unit 112 runs as a server process on the database server 12, its failure detection function 112b detects the failure (step S201).

When the failure detection function 112b detects the failure, the error pattern search function 112c responds by searching the learning data 113 for an entry with the same error code 120b (step S202). If no such entry is registered, it instructs the application server 11 to stop processing and notifies the maintenance person of the failure (step S203). In that case, the maintenance person performs the operations shown in FIG. 4, handling the failure on each server while having the system learn the recovery operations.

If an entry with the same error code 120b is registered in the learning data 113, the error pattern search function 112c instructs the learned command execution function 102d of the application server 11 to execute the commands of the learning pattern 120d corresponding to that error code 120b. In response, the learned command execution function 102d reads the learning pattern 120d from the learning data 113 and has the learned command execution functions 102d, 112d, 202d, and 212d of the respective servers execute the commands of that learning pattern 120d (steps S204 to S205).
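The branch between steps S203 and S204 amounts to a lookup keyed by error code. The sketch below assumes the learning data is a mapping from error code to a list of commands; the function name `handle_failure` and the callbacks `execute`, `stop_processing`, and `notify` are hypothetical placeholders for the corresponding actions in FIG. 6.

```python
def handle_failure(error_code, learning_data, execute, stop_processing, notify):
    """Replay the learned pattern for error_code, or fall back to the operator.

    Returns True if a learned pattern was found and executed.
    """
    pattern = learning_data.get(error_code)  # S202: search the learning data
    if pattern is None:
        # S203: nothing learned yet, so stop processing and notify.
        stop_processing()
        notify(error_code)
        return False
    for command in pattern:
        # S204-S205: execute the commands of the learned pattern in order.
        execute(command)
    return True
```

For an error code that has not been learned, the function only stops processing and notifies, leaving recovery (and learning) to the maintenance person.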

The database server 12 could itself initiate execution of the learned commands, but the application server 11 is used as the initiator here to allow for the case in which the database server 12 has stopped operating. In principle, any server may initiate execution.

Throughout this execution, the command learning functions 102a, 112a, 202a, and 212a record logs in the same way as in steps S105 to S106 of FIG. 4. Once the learned commands have all been executed, the learned command execution function 102d issues an end instruction to the command learning functions 102a, 112a, 202a, and 212a (steps S206a to S206c), and each command learning function returns its recorded log in response (steps S207a to S207c). The learned command execution function 102d then notifies the user of the processing result (step S208).

FIG. 7 is a sequence diagram showing the flow of operation when a failure occurs in the database server 12 and, in addition, the application server 11 is not operating normally in the database system 1 shown in FIG. 1. In this case the operation is the same as in FIG. 6 up to step S202. However, no normal response is returned when the error pattern search function 112c instructs the application server 11 to stop processing in step S203, or instructs the learned command execution function 102d to execute the commands of the learning pattern 120d corresponding to the error code 120b in step S204.

The error pattern search function 112c therefore turns to the application server 21 of the secondary site 3. If no learning pattern 120d corresponding to the error code 120b is registered in the learning data 113, it sends the same processing stop instruction as in step S203 and notifies the maintenance person of the failure (step S251). If such a learning pattern 120d is registered in the learning data 113, it sends the same command execution instruction as in step S204 and has the learned command execution function 202d execute the commands of that learning pattern 120d (steps S252 to S253).

From then on, the learned command execution function 202d of the application server 21 performs the same operations as from step S205 of FIG. 6 onward. Because the application server 11 is not operating normally, however, the log of step S207a may not be sent successfully and the request may time out. If so, that fact is also recorded in the log and reported to the maintenance person.
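The fallback of FIG. 7 can be sketched as trying a list of candidate executors in order. The function name and the use of `TimeoutError` to model "no normal response" are assumptions made for this example; the patent does not say how a missing response is detected.

```python
def dispatch_with_fallback(instruction, executors):
    """Send instruction to the first executor that responds normally.

    executors is an ordered list of (name, send) pairs; send raises
    TimeoutError when no normal response arrives. Returns the name of the
    executor that accepted the instruction, or None if none responded.
    """
    for name, send in executors:
        try:
            send(instruction)
            return name
        except TimeoutError:
            continue  # no normal response: try the next candidate
    return None


def unresponsive(_instruction):
    # Stands in for the application server 11 when it is not operating normally.
    raise TimeoutError("no normal response")


received = []
chosen = dispatch_with_fallback(
    "execute pattern for E001",
    [("application server 11", unresponsive),
     ("application server 21", received.append)],
)
print(chosen)  # application server 21
```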

FIG. 8 is a sequence diagram showing the flow of operation when the server state monitoring unit 112 of the active database server 12 has stopped in the database system 1 shown in FIG. 1. The failure detection function 102b of the application server 11 periodically sends a state monitoring signal to the server state monitoring units 112 and 212 of the database servers 12 and 22, and it detects that the server state monitoring unit 112 has stopped from the absence of a response to this signal (step S301).
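One polling round of this state monitoring can be sketched as follows. The function `poll_once` and the `send_signal` callback are hypothetical; in this sketch a target counts as stopped if the callback returns a falsy value or raises `TimeoutError`, which is one possible reading of "no response".

```python
def poll_once(targets, send_signal):
    """Send one state monitoring signal to each target; return those that
    did not answer normally."""
    stopped = []
    for target in targets:
        try:
            responded = bool(send_signal(target))
        except TimeoutError:
            responded = False
        if not responded:
            stopped.append(target)
    return stopped


# Hypothetical probe in which unit 112 is stopped and unit 212 is alive.
alive = {"server state monitoring unit 212"}
stopped = poll_once(
    ["server state monitoring unit 112", "server state monitoring unit 212"],
    lambda target: target in alive,
)
print(stopped)  # ['server state monitoring unit 112']
```

A real deployment would call such a function on a timer; the polling interval and any retry policy are not specified in the text.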

When the failure detection function 102b detects this failure, the error pattern search function 102c responds by searching the learning data 103 for an entry with the same error code 120b as this error (step S302). If none is registered, it stops its own processing and notifies the maintenance person of the failure, as in step S203 of FIG. 6 (step S303).

If the same error code 120b is registered in the learning data 103, the learned command execution function 102d of the application server 11 executes the commands of the learning pattern 120d corresponding to that error code 120b and, in the same way as from step S204 of FIG. 6 onward, has the other learned command execution functions 112d, 202d, and 212d execute the commands of that learning pattern 120d as well. The rest of the operation is the same as in FIG. 6.

FIG. 9 is a sequence diagram showing the flow of operation when the server state monitoring unit 212 of the standby database server 22 has stopped in the database system 1 shown in FIG. 1. The only difference from FIG. 8 is that the unit whose stop the failure detection function 102b of the application server 11 detects is the server state monitoring unit 212 (step S351); the operation is otherwise the same as in FIG. 8.

FIG. 10 is a sequence diagram showing the flow of the process for learning the operations to be performed when the server state monitoring unit 112 of the active database server 12 has stopped in the database system 1 shown in FIG. 1. More specifically, this learning is needed when the maintenance person is notified in step S303 of FIG. 8 that the error code 120b is not registered.

In this case too, the operations performed on each server are learned by the same procedure as in FIG. 4. After the learning operation is finished, the failure detection function 102b of the application server 11 starts monitoring whether the server state monitoring unit 112 has recovered; more specifically, it starts periodically sending the state monitoring signal (step S401).

When a normal response to this signal is returned, the server state monitoring unit 112 is judged to have recovered (step S402), and the command learning function 102a sends the error pattern that could not be delivered in step S112 to the database server 12 (step S403). The command learning function 112a receives the error pattern, stores it as learning data 113, and returns a normal response (step S404).
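The deferred delivery of FIG. 10 can be sketched as a small queue of undelivered error patterns that is flushed once the target answers the state monitoring signal again. The class `PendingDeliveries` and its method names are hypothetical; the patent describes only the behaviour, not a data structure.

```python
class PendingDeliveries:
    """Hypothetical queue of error patterns that could not be sent (S112)."""

    def __init__(self):
        self._pending = {}  # target -> list of (error_code, pattern)

    def defer(self, target, error_code, pattern):
        # Remember a pattern whose delivery failed because target was stopped.
        self._pending.setdefault(target, []).append((error_code, pattern))

    def on_monitoring_result(self, target, responded, deliver):
        """Call after each state monitoring signal (S401).

        If target responded normally (S402), deliver everything queued for
        it (S403) and return the number of patterns delivered.
        """
        if not responded:
            return 0
        items = self._pending.pop(target, [])
        for error_code, pattern in items:
            deliver(target, error_code, pattern)
        return len(items)
```

Because the queue entry is removed on delivery, a later normal response from the same unit does not resend the pattern.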

FIG. 11 is a sequence diagram showing the flow of the process for learning the operations to be performed when the server state monitoring unit 212 of the standby database server 22 has stopped in the database system 1 shown in FIG. 1. The only difference from FIG. 10 is that the unit whose recovery the failure detection function 102b of the application server 11 checks for is the server state monitoring unit 212 (steps S451 to S454); the operation is otherwise the same as in FIG. 10.

FIG. 12 is an explanatory diagram showing the processing that the command learning function 102a performs at the initial stage of system startup in the database system 1 shown in FIG. 1. First, the command learning function 102a of the application server 11 requests server-specific configuration information from each of the other command learning functions 112a, 202a, and 212a (steps S501a to S501c). Each of the command learning functions 112a, 202a, and 212a sends its configuration information in response (steps S502a to S502c).

The command learning function 102a consolidates the configuration information it receives into configuration information for the database system 1 as a whole, stores it as its own learning data 103, and sends it to the command learning functions 112a, 202a, and 212a (steps S503a to S503c). Each of the command learning functions 112a, 202a, and 212a stores it as learning data 113, 203, or 213, respectively, and returns a normal response (steps S504a to S504c). As noted above, the learning data actually created differs from device to device, but these differences in the commands to be executed are well known to those skilled in the art, so a detailed explanation is omitted here.
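The startup exchange of FIG. 12 is a gather-then-distribute step, sketched below. The function names, the callbacks `request_config` and `store`, and the dictionary layout of the configuration information are assumptions for illustration; the patent does not describe the content or format of the configuration information.

```python
def build_system_configuration(own_name, own_config, others, request_config):
    """S501-S502: collect each server's configuration into one mapping."""
    system = {own_name: own_config}
    for name in others:
        system[name] = request_config(name)
    return system


def distribute_configuration(system, others, store):
    """S503-S504: send the consolidated configuration to every other server."""
    for name in others:
        store(name, dict(system))  # each server keeps its own copy


# Hypothetical per-server configuration information.
configs = {
    "database server 12": {"role": "active database"},
    "application server 21": {"role": "standby application"},
    "database server 22": {"role": "standby database"},
}
others = list(configs)
system = build_system_configuration(
    "application server 11", {"role": "active application"}, others, configs.__getitem__
)
stored = {}
distribute_configuration(system, others, stored.__setitem__)
print(sorted(system))
```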

(Overall operation of the embodiment)
Next, the overall operation of the above embodiment is described.
The failure recovery method according to this embodiment applies to the database system 1, in which the primary site 2, including the active application server 11 and the active database server 12, and the secondary site 3, including the standby application server 21 and the standby database server 22, are connected via the network 4. When a failure occurs in any of the servers constituting the database system, the command learning function of each server records the commands entered by the maintenance person as a log (FIG. 4, steps S101 to S109). The command learning function of each server stores learning data, obtained by extracting a specific pattern from the recorded log, in a storage means provided in advance, in association with an error code indicating the symptom of the failure that occurred, and also sends the learning data to each of the other servers to be stored there (FIG. 4, steps S110 to S114). The failure detection function of each server detects that a failure has occurred in any of the servers (FIG. 6, step S201), and the error pattern search function of each server searches whether the same error code as that of the detected failure is stored in the learning data (FIG. 6, step S202). If it is, the learned command execution function of each server causes the active application server to execute the commands of the pattern corresponding to that error code (FIG. 6, steps S204 to S205).

Each of the above operation steps may be implemented as a computer-executable program and executed by the processor of each server that directly performs the step. The program may be recorded on a non-transitory recording medium such as a DVD, a CD, or a flash memory. In that case, the program is read from the recording medium and executed by a computer.
Through this operation, the embodiment provides the following effects.

According to this embodiment, if a pattern corresponding to the error code of a failure that has occurred is registered, the pattern is executed automatically, reducing the effort required for recovery. Moreover, recovery using the pattern can be initiated from any of the active application server, the standby application server, and the active database server, so the pattern can be executed even when failures occur in two or more servers at the same time.

Consequently, recovery work that with existing technology required, for example, the execution of several tens of commands and several hours can be completed in this embodiment in a few minutes by executing at most a few commands. Furthermore, symptoms that occur particularly often can be learned and automated promptly, so that the system recovers automatically without any operation by the maintenance person. In other words, this embodiment greatly reduces the tedious work that gives rise to human error and allows failures to be handled smoothly.

(Extension of the embodiment)
The above embodiment can be extended in various ways without departing from the gist of the invention described above. These extensions are described below.
First, the above embodiment was described for a configuration in which the primary site (active system) and the secondary site (standby system) each include an application server and a database server, but the actual division into devices need not follow this example. For instance, the application server and the database server may be implemented on the same computer or, conversely, may be implemented virtually across a plurality of physical computers. A plurality of secondary sites may also be provided.

The present invention has been described with reference to the specific embodiment shown in the drawings, but it is not limited to that embodiment; any previously known configuration may be adopted as long as the effects of the present invention are achieved.

The present invention is applicable to web systems structured as a combination of an application server and a database server. It is particularly suited to improving availability in, for example, electronic commerce systems and business systems.

Description of Symbols
1 Database system
2 Primary site
3 Secondary site
4 Network
11, 21 Application server
11a, 12a, 21a, 22a Processor
11b, 12b, 21b, 22b Storage means
11c, 12c, 21c, 22c Communication means
12, 22 Database server
13, 23 Storage device
13a, 23a Main storage unit
13b, 23b Backup storage unit
14a, 24a Client device
15a, 25a Peripheral device
101, 201 Application operation unit
102, 112, 202, 212 Server state monitoring unit
102a, 112a, 202a, 212a Command learning function
102b, 112b, 202b, 212b Failure detection function
102c, 112c, 202c, 212c Error pattern search function
102d, 112d, 202d, 212d Learned command execution function
111, 211 Database operation unit
103, 113, 203, 213 Learning data
120a Target device
120b Error code
120c Command
120d Learning pattern

Claims

A database system in which a primary site including an active application server and an active database server and a secondary site including a standby application server and a standby database server are connected via a network, wherein
the active application server, the active database server, the standby application server, and the standby database server each include a server state monitoring unit, the server state monitoring units monitoring one another, and
each of the server state monitoring units has:
a command learning function that records, as a log, commands input by a maintenance person when a failure occurs in any of the servers constituting the database system, stores learning data obtained by extracting a specific pattern from the log in a storage means provided in advance, in association with an error code indicating the symptom of the failure that occurred, and sends the learning data to each of the other servers to be stored there;
a failure detection function that detects that a failure has occurred in any of the servers;
an error pattern search function that searches whether the same error code as that of the detected failure is stored in the learning data; and
a learned command execution function that, when the same error code as that of the detected failure is stored in the learning data, causes the active application server to execute the command of the pattern corresponding to the error code.

The database system according to claim 1, wherein the command learning function presents the pattern extracted from the log to a user and stores the pattern selected by the user as the learning data.

The database system according to claim 1, wherein, when the failure detection function detects that a failure has occurred in any of the servers, the command learning function is made to send the learning data after the server in which the failure occurred is detected to have recovered.

The database system according to claim 1, wherein the learned command execution function causes the standby application server to execute the command when the active application server cannot execute the command.

A database device capable of functioning as the active database server or the standby database server in a database system in which a primary site including an active application server and an active database server and a secondary site including a standby application server and a standby database server are connected via a network, the database device comprising
a server state monitoring unit that performs mutual monitoring with the other devices, wherein
the server state monitoring unit has:
a command learning function that records, as a log, commands input by a maintenance person when a failure occurs in any of the servers constituting the database system, stores learning data obtained by extracting a specific pattern from the log in a storage means provided in advance, in association with an error code indicating the symptom of the failure that occurred, and sends the learning data to each of the other servers to be stored there;
a failure detection function that detects that a failure has occurred in any of the servers;
an error pattern search function that searches whether the same error code as that of the detected failure is stored in the learning data; and
a learned command execution function that, when the same error code as that of the detected failure is stored in the learning data, causes the active application server to execute the command of the pattern corresponding to the error code.

A failure recovery method for a database system in which a primary site including an active application server and an active database server and a secondary site including a standby application server and a standby database server are connected via a network, wherein
the command learning function of each server constituting the database system records, as a log, commands input by a maintenance person when a failure occurs in any of the servers constituting the database system,
the command learning function of each server stores learning data obtained by extracting a specific pattern from the recorded log in a storage means provided in advance, in association with an error code indicating the symptom of the failure that occurred, and sends the learning data to each of the other servers to be stored there,
the failure detection function of each server detects that a failure has occurred in any of the servers,
the error pattern search function of each server searches whether the same error code as that of the detected failure is stored in the learning data, and
when the same error code as that of the detected failure is stored in the learning data, the learned command execution function of each server causes the active application server to execute the command of the pattern corresponding to the error code.

A failure recovery program for a database system in which a primary site including an active application server and an active database server and a secondary site including a standby application server and a standby database server are connected via a network,
the program causing a processor provided in each server constituting the database system to execute:
a procedure of recording, as a log, commands input by a maintenance person when a failure occurs in any of the servers constituting the database system;
a procedure of storing learning data obtained by extracting a specific pattern from the recorded log in a storage means provided in advance, in association with an error code indicating the symptom of the failure that occurred, and sending the learning data to each of the other servers to be stored there;
a procedure of detecting that a failure has occurred in any of the servers;
a procedure of searching whether the same error code as that of the detected failure is stored in the learning data; and
a procedure of causing the active application server to execute the command of the pattern corresponding to the error code when the same error code as that of the detected failure is stored in the learning data.