JP5307223B2

JP5307223B2 - Disaster recovery architecture

Info

Publication number: JP5307223B2
Application number: JP2011281959A
Authority: JP
Inventors: Luca Casale; Filippo Farina; Eugenio Maria Maffione
Original assignee: Telecom Italia S.p.A.
Priority date: 2011-12-22
Filing date: 2011-12-22
Publication date: 2013-10-02
Anticipated expiration: 2025-03-10
Also published as: JP2012100313A

Abstract

PROBLEM TO BE SOLVED: To provide a method and system for disaster recovery in a packet-based network.

SOLUTION: A network (50) includes a production site (52) and a recovery site (54) coupled together by a packet-based network (56). Mirroring software (68) on the production site (52) keeps the recovery site (54) up to date to the last transaction occurring on the production site. A recovery control server (84) polls the production site in order to detect a failure condition or other fault. Upon detection of a problem at the production site (52), the recovery control server (84) reconfigures the network (56) so that attempts to access the production site (52) are routed to the recovery site (54).

COPYRIGHT: (C)2012, JPO&INPIT

Description

The present invention relates generally to communication networks, and more particularly to disaster recovery techniques used in communication networks.

Owing to the popularity and convenience of networked computer systems, sharing data among users through databases has become common in many business environments. Providing centralized access to information through a database requires careful attention to database maintenance and management. In addition, recovery techniques are essential to guarantee database consistency after a hardware or device failure or an application logic failure.

In general, recovery techniques reset a system, or the data stored in it, to an operational state after damage has occurred, and provide a way to rebuild the database by restoring a backup copy.

There are two points of interest in any data recovery system:
• The first is the recovery point objective (RPO), which defines the maximum planned difference between the original data and the backup copy.
• The second is the recovery time objective (RTO), which defines the maximum time allowed for resuming the service.
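As a rough illustration of how the two objectives relate to a single failure event, the following Python sketch compares the data-loss window against the RPO and the downtime against the RTO. The class name, function name, and the example values are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Tuple


@dataclass(frozen=True)
class RecoveryObjectives:
    rpo: timedelta  # maximum tolerated gap between original data and backup copy
    rto: timedelta  # maximum tolerated time to resume the service


def meets_objectives(
    objectives: RecoveryObjectives,
    last_backup: datetime,
    failure: datetime,
    service_resumed: datetime,
) -> Tuple[bool, bool]:
    """Return (rpo_ok, rto_ok) for one failure event."""
    data_loss_window = failure - last_backup   # work not yet captured by the backup
    downtime = service_resumed - failure       # time during which the service was unavailable
    return data_loss_window <= objectives.rpo, downtime <= objectives.rto
```

With a nightly tape backup, for example, the data-loss window can approach 24 hours, which is why the synchronous mirroring described later in this document targets an RPO close to the last transaction.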

One of the simplest forms of system backup is to physically transport a copy of the data, made on magnetic tape, to a remote archiving site. This generally requires the user to stop all database activity while the backup tape is being made. Disaster recovery then consists of using the backup tape to recover the database.

A more recent form of system backup uses network interconnections to perform periodic backups of the production site. The times at which such backups are made are controlled by the network administrator. The method for restoring an application server involves installing a new system, including hardware with characteristics similar to those of the old system, and restoring the backed-up image of the system from the recovery site.

Another prior art system, supplied by Veritas (as of the filing date of this patent application, available for download from the Internet at URL: http://www.veritas.com/Products/www?c=product&refId=140), considers an architecture of software modules for controlling the various stages needed for the correct execution of the backup procedure and the subsequent stage of restoring the clients. In particular, the Veritas solution uses a different server for each individual functional aspect: a server for controlling and managing the backup operations, a server for controlling the client restore stage, a server that provides the clients with the programs and configurations needed for the restore, and finally a server for managing remote booting.

Another prior art solution is the Cisco Network Boot system described in the document titled “Cisco Network Boot Installation and Configuration Guide, Release 3.1” (as of the filing date of this patent application, available for download from the Internet at URL: http://www.cisco.com/en/US/products/hw/ps4159/ps2160/products_installation_and_configuration_guide_book09186a00801a45bO.html), which makes a copy of the entire system image, including the operating system, the applications on the server, and the data. Backups are performed manually by the network administrator. The Cisco solution offers the possibility of executing the boot procedure remotely over the network, provided that the recovery server has the same hardware characteristics as the primary server that was replicated. The recovery server can therefore restore the remote copy of the system image from the network and offer again the services previously provided by the primary server.

US Patent Publication US 2004/0153698 A1 provides a system and method for disaster preparedness and restoration of service for damaged or destroyed communication network elements. A computer-implemented method of disaster backup for network elements includes establishing connectivity to a plurality of network elements. A host computer can transmit one or more commands to the network elements to invoke a computer routine that creates a plurality of computer-readable service continuity data in the local memory of the network elements. An automated system of computer-executable components for disaster recovery of network elements includes a computer-executable controller component configured to select a plurality of network elements designated for a disaster backup operation. A computer-executable engine component is configured to establish connectivity to the plurality of network elements and to transmit one or more commands to the network elements to replicate the service continuity data for each of the network elements.

In US Patent Publication US 2004/0078397 A1, a file system disaster recovery technique provides automatic monitoring, failure detection, and multi-step failover from a first designated target to one of a designated group of second designated targets. The second designated targets may be prioritized so that failover occurs in a predetermined order. Replication of information between the first designated target and the second designated targets allows failover in a manner that maximizes continuity of operation. In addition, user-specified actions can be initiated immediately after failure detection and/or failover operations and/or failback operations.
US Patent Publication US 2004/0153698 A1
US Patent Publication US 2004/0078397 A1

Object and Summary of the Invention
The Applicant has observed that, when a system is restored after a disaster, there is a problem of ensuring that clients can access the service independently of the restoration of the network elements, without having to change their configuration manually in order to reach the recovery server at the recovery site, and preferably while maintaining good RPO and RTO values.

The Applicant has found that this problem can be solved by the method of performing disaster recovery according to claim 1.

In particular, the Applicant has found that this problem can be solved by providing an automatic rerouting mechanism that routes the clients to the recovery server. The Applicant has further found that the problem can be solved by providing an automatic control and management mechanism for the data replication stage, based on a mirroring procedure that keeps the server data and configuration aligned with the last transaction at all times.

Another aspect of the present invention relates to a system for performing disaster recovery according to claim 12.

Another aspect of the present invention relates to a computer program product that can be loaded into the memory of at least one computer and that includes software code portions for performing the steps of the method of the invention when the product is run on a computer. As used herein, reference to such a computer program product is intended to be equivalent to reference to a computer-readable medium containing instructions for controlling a computer system to coordinate the performance of the method of the invention. Reference to “at least one computer” is evidently intended to highlight the possibility of implementing the present invention in a distributed/modular fashion.

Further preferred aspects of the present invention are described in the independent claims and in the following description.

For a better understanding of the present invention, preferred embodiments, which are given purely by way of example and are not to be construed as limiting, are described below with reference to the accompanying drawings.

FIG. 1 is a system diagram for performing disaster recovery in accordance with the present invention.
FIG. 2 is a detailed system diagram of the production site of FIG. 1.
FIG. 3 is a detailed view of the wide area network.
FIG. 4 is a detailed view of the recovery control server.
FIG. 5 shows the flow of network traffic during normal operating conditions.
FIG. 6 shows the flow of network traffic in a disaster recovery situation.
FIG. 7 shows the flow of network traffic in a failback situation.
FIG. 8 is a flowchart of a method for implementing the present invention.

Detailed Description of Preferred Embodiments of the Invention
FIG. 1 is a diagram of a system 50 that includes a production site 52, a recovery site 54, a network 56 connected between the production site and the recovery site, and extranet clients 58. The production site can include storage 60 connected to one or more application servers 62. One or more intranet clients 64 access the application servers 62 via a network 66, which may include, for example, Ethernet switches and IP routers. Box 66 also represents security devices, which may include an authentication system, a firewall, or an intrusion detection system that blocks access to the application servers. A mirroring software module 68 performs synchronous replication of the local image of the application server onto a remote storage volume. This synchronous replication guarantees that the data on storage 60 matches the copy held at the recovery site 54 up to the last transaction. It is also desirable for the mirroring software module to save an image of the system corresponding to stable operating conditions, so that the system can return to a previously saved stable image if the last transaction has damaged the server configuration.

The recovery site 54 can include one or more recovery servers 78, a network security device 80, a storage area network (SAN) device 82, and a recovery control server 84. The recovery servers 78 are configured to mimic the application servers 62 in the event of a disaster. It is desirable to provide a pool of recovery servers with different hardware characteristics, so that the server in the pool that most closely matches the failed application server 62 can be used. The SAN device 82 stores the mirrored data supplied by the mirroring software module 68. The network security device 80 performs for the recovery site 54 the same functions that the network security device 66 performs for the production site. The recovery control server 84 issues periodic requests (keep-alives) to each managed application server 62 in order to monitor its accessibility. In this way, the recovery control server 84 can detect whether there is a problem at the production site 52. In addition, the recovery control server 84 may monitor the storage flow from one or more application servers 62, through the mirroring software 68, to the SAN storage 82 at the recovery site 54. Many techniques, such as polling, can be used to monitor the production site 52 from the recovery control server 84. As described further below, the recovery control server 84 also controls the automatic switchover from the production site 52 to the recovery site 54 when a problem is detected at the production site. This may require selecting, from the pool of available servers, the recovery server 78 that most closely matches the application server 62 on which the problem was detected. In addition, the recovery control server 84 automatically reconfigures the necessary networks 56, 66 so that the extranet clients 58 and intranet clients 64 can access the recovery servers 78 automatically and seamlessly. Finally, the recovery control server 84 can automatically manage the failback condition, in which the application server 62 has been restored and the data from the SAN device 82 must be copied back to the production site 52.
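The selection of the recovery server that most closely matches a failed application server can be pictured with a small Python sketch. The patent does not define a matching metric, so the one used here is an assumption: a candidate is usable only if it has the same CPU architecture and at least as many cores and as much RAM, and among usable candidates the one with the least surplus wins. All names and fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class HardwareProfile:
    name: str
    cpu_arch: str
    cpu_cores: int
    ram_gb: int


def mismatch(failed: HardwareProfile, candidate: HardwareProfile) -> Optional[int]:
    """Return a distance (lower is closer), or None if the candidate is unusable."""
    if candidate.cpu_arch != failed.cpu_arch:
        return None  # assumed rule: the system image cannot boot on another architecture
    if candidate.cpu_cores < failed.cpu_cores or candidate.ram_gb < failed.ram_gb:
        return None  # assumed rule: never pick a weaker machine
    return (candidate.cpu_cores - failed.cpu_cores) + (candidate.ram_gb - failed.ram_gb)


def select_recovery_server(
    failed: HardwareProfile, pool: Iterable[HardwareProfile]
) -> Optional[HardwareProfile]:
    """Pick the usable pool member with the smallest mismatch, if any."""
    best, best_score = None, None
    for candidate in pool:
        score = mismatch(failed, candidate)
        if score is not None and (best_score is None or score < best_score):
            best, best_score = candidate, score
    return best
```

Returning `None` when no candidate qualifies leaves the decision about what to do next (alert an operator, relax the rule) to the caller.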

FIG. 2 shows a more detailed example of a possible production site 52. The application server 62 includes a system image 100. The system image 100 includes an operating system 102, a set of applications 104, and the data 106 on which the operating system and applications operate. Mass storage 60 includes the local storage devices where the data 106 is saved. A storage initiator 110 is also present on the application server 62. The storage initiator 110 is a software module that allows data to be transferred to remote storage volumes accessible via a network infrastructure (e.g., LAN, WAN). The software mirror 68 is a software module that performs synchronous replication of the local image on the application server 62. The local image is then stored at the recovery site 54 via the storage initiator module 110. The software mirror module 68 can also take snapshots of the system image, so as to retain multiple system images taken at different times. Thus, in addition to a copy that is current to the last transaction, stable copies of the system taken at different times are available. Because the system can hold one or more stable copies acquired at different times, it can return to a known stable state. Because the remote copy of the system image is made using the software mirror 68, the architecture is freed from proprietary solutions tied to a particular manufacturer. A software mirror module of the above type is available for download, for example, at the Internet URL: http://www.veritas.com/Products/www?c=product&refId=3 (as of the filing date of this patent application).
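The idea of retaining several snapshots and returning to a known stable state can be sketched as follows. This is a minimal in-memory model, not the behavior of any particular mirroring product; the retention limit, the `stable` flag, and all names are assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional


@dataclass(frozen=True)
class Snapshot:
    taken_at: float  # timestamp of the snapshot, in seconds
    image_id: str    # identifier of the stored system image
    stable: bool     # True if taken under known-good operating conditions


class SnapshotHistory:
    """Keeps the most recent snapshots and finds a rollback target."""

    def __init__(self, max_snapshots: int = 5) -> None:
        # A bounded deque silently drops the oldest snapshot when full.
        self._snapshots: Deque[Snapshot] = deque(maxlen=max_snapshots)

    def record(self, snapshot: Snapshot) -> None:
        self._snapshots.append(snapshot)

    def rollback_target(self) -> Optional[Snapshot]:
        """Return the newest snapshot marked stable, or None if there is none."""
        for snapshot in reversed(self._snapshots):
            if snapshot.stable:
                return snapshot
        return None
```

If the last transaction damaged the server configuration, the newest image would be marked unstable and `rollback_target()` would fall back to the previous stable one.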

The intranet clients 64 can access the application servers 62 via network devices 112 (shown here as level 2 and level 3 devices). The network devices 112 are the devices used for the packet-based network of the production site, and they allow connection to a third-party packet-based network for access at the metropolitan, national, or international level. The network devices 112 can be LAN/MAN technologies, IP routers, and so on. Security devices 114 provide security against unauthorized access from extranet clients. Examples of security devices include firewalls and intrusion detection systems. The security devices can be monitored and configured through any desired standard (e.g., SNMP) or through a command-line interface.

FIG. 3 shows the WAN 56 in more detail. The WAN 56 provides the interconnection between the extranet 58, the production site 52, and the recovery site 54. Various protocols can be used. For example, the Multi-Protocol Label Switching (MPLS) protocol can be used to enable a virtual private network (VPN) service that interconnects the two sites. The WAN 56 includes a plurality of network switching devices, indicated generally at 120. Specifically, customer edge (CE) devices 122, 124 (e.g., network equipment such as routers or switches used to connect the network to client computers) are located at the production site 52 and the recovery site 54, respectively, and communicate with provider edge (PE) network devices 126, 128 (e.g., routers that are part of the service provider's network and allow connection to the customer edge devices) located at the provider's points of presence (PoP). Other provider network devices 130 (labeled simply P) allow communication between the provider edge devices 126, 128 and the extranet 58. To add a new site to an existing VPN, the provider can apply the correct configuration to the CE and PE devices, for example by means of a provisioning platform. An MPLS VPN provides IP-level connectivity to the sites belonging to the same VPN. More innovative solutions, such as Virtual Private LAN Service (VPLS), make it possible to set up Ethernet connectivity between sites belonging to the same VPN. As with the MPLS VPN solution, adding a new site to a VPLS requires the provider to act only on the CE and PE devices. The main difference between the two solutions is that, with a VPLS service, the provider does not manage the routing performed by the customer.

As described further below, the recovery control server 84 at the recovery site is able to reroute the network devices 120 so that, in the event of a disaster, the extranet clients 58 and intranet clients 64 can access the recovery site 54. The recovery control server 84 autonomously sets the operating rules for the systems belonging to its operating domain (the production site and the recovery site) and, where necessary, can interact with systems outside its direct control by interfacing with other control systems, which are typically operated by a third party such as a network operator.

FIG. 4 shows further details of the recovery control server 84. For purposes of illustration, a WAN with MPLS functionality is assumed, but other packet-based networks that allow virtual private network solutions of the kind described above to be configured can also be used. The customer information manager module (CIMM) 150 is a software module that handles the metadata inside the repository module 152 and characterizes the application servers 62 of the production site 52. The information stored in the repository module 152 can include the following:
• The routing plan of the application servers.
• The access rules of the application servers for intranet/extranet clients.
• Information about the network topology of the production site and the interconnection between the production site and the recovery site.
• The hardware characteristics of the application servers.
• The image characteristics, such as the operating system and the installed software packages.
• The service level agreement (SLA) agreed for the service.
• The availability of servers at the recovery site with characteristics compatible with the application servers of the production site.
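One possible shape for the metadata listed above is sketched below in Python. The patent does not specify a schema, so every field name, the dictionary-based layout, and the `CustomerRepository` class are assumptions chosen only to mirror the bullet points.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ApplicationServerRecord:
    server_id: str
    routing_plan: List[str]          # e.g. prefixes announced for this server
    access_rules: List[str]          # intranet/extranet access rules
    hardware: Dict[str, str]         # hardware characteristics
    image: Dict[str, str]            # operating system, installed packages, ...
    sla: Dict[str, float]            # agreed service levels, e.g. RPO/RTO in seconds
    compatible_recovery_servers: List[str] = field(default_factory=list)


class CustomerRepository:
    """In-memory stand-in for the repository module 152."""

    def __init__(self, topology: Dict[str, List[str]]) -> None:
        self.topology = topology     # site name -> attached network devices
        self._records: Dict[str, ApplicationServerRecord] = {}

    def put(self, record: ApplicationServerRecord) -> None:
        self._records[record.server_id] = record

    def get(self, server_id: str) -> ApplicationServerRecord:
        return self._records[server_id]

    def has_recovery_capacity(self, server_id: str) -> bool:
        """True if at least one compatible recovery server is registered."""
        return bool(self._records[server_id].compatible_recovery_servers)
```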

The application server control module (ASCM) 154 is a software module that checks the accessibility of the application servers at the production site 52. The check is performed by polling the server's IP address or by verifying that the applications installed on the server 62 are active. An additional level of control is made available by the software that performs the synchronous mirroring process between local and remote storage. If the application server 62 cannot be reached for a period exceeding a configurable threshold (e.g., 30 seconds, although this time can vary depending on the particular application), the ASCM module 154 issues a request to start the disaster recovery procedure.
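The threshold logic described above can be sketched in Python as a small state machine that is fed one probe result per polling cycle. The class and function names are hypothetical; the TCP connection probe is only one possible way to check reachability and is an assumption, as is the caller supplying the current time.

```python
import socket
from typing import Callable, Optional


def tcp_probe(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


class KeepAliveMonitor:
    """Requests disaster recovery once a server stays unreachable too long."""

    def __init__(self, probe: Callable[[], bool], threshold_s: float = 30.0) -> None:
        self._probe = probe
        self._threshold_s = threshold_s
        self._unreachable_since: Optional[float] = None

    def poll(self, now: float) -> bool:
        """Run one probe at time `now`; return True if recovery should start."""
        if self._probe():
            self._unreachable_since = None   # server answered: reset the timer
            return False
        if self._unreachable_since is None:
            self._unreachable_since = now    # first failed probe of this outage
        return now - self._unreachable_since > self._threshold_s
```

A caller might construct the monitor with `lambda: tcp_probe("192.0.2.10", 80)` (an example address) and call `poll(time.monotonic())` on a fixed interval. Measuring the outage from the first failed probe, rather than counting failures, keeps the threshold meaningful even if the polling interval changes.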

The storage gateway control module (SGCM) 156 issues requests to the storage gateway management system and can perform the following functions:
• Access by the application servers 62 to the storage devices at the recovery site 54. Storage access is managed through the configuration of access control lists (ACLs), which specify which servers are permitted to access a given storage device.
• Requests to release or allocate resources. This function makes it possible to request the release of previously allocated resources, for example because it has been decided to discontinue the disaster recovery service for a given application server, or, conversely, to allocate new storage resources. This function also updates the information about the SLA signed by the customer, which is held in the repository 152.
• Management of the replication process in the failback condition. After a disaster recovery procedure, this function allows the data used locally by the recovery servers 78 at the recovery site 54 to be copied onto the storage volumes at the production site 52. Once the data has been restored consistently at the production site, the system can return to the initial operating condition, in which the services accessed by the intranet and extranet clients are published by the application servers at the production site.
• Checking the usage status of the allocated resources. This function provides statistics on the effective utilization of the storage devices and makes it possible to evaluate in advance the acquisition of new devices (processing and storage resources for the recovery site pool).
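The ACL, allocation, and usage-status functions listed above can be modeled with a toy in-memory gateway. This is a sketch under stated assumptions: the `StorageGateway` class, its method names, and the capacity accounting in gigabytes are all hypothetical and do not correspond to any real storage gateway management API.

```python
from typing import Dict, Set


class StorageGateway:
    """Toy model of volume allocation and per-volume access control lists."""

    def __init__(self, capacity_gb: int) -> None:
        self.capacity_gb = capacity_gb
        self._allocated: Dict[str, int] = {}   # volume name -> size in GB
        self._acl: Dict[str, Set[str]] = {}    # volume name -> permitted servers

    def free_gb(self) -> int:
        """Usage status: capacity not yet allocated to any volume."""
        return self.capacity_gb - sum(self._allocated.values())

    def allocate(self, volume: str, size_gb: int) -> None:
        if volume in self._allocated:
            raise ValueError(f"volume {volume!r} already allocated")
        if size_gb > self.free_gb():
            raise ValueError("insufficient free capacity")
        self._allocated[volume] = size_gb
        self._acl[volume] = set()

    def release(self, volume: str) -> None:
        """Release a volume, e.g. when the recovery service is discontinued."""
        del self._allocated[volume]
        del self._acl[volume]

    def grant(self, volume: str, server: str) -> None:
        self._acl[volume].add(server)

    def can_access(self, volume: str, server: str) -> bool:
        return server in self._acl.get(volume, set())
```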

The provisioning platform control module (PPCM) 158 is a software module that handles requests to the provisioning platform. Network device vendors supply provisioning platforms that translate requests received in a programming meta-language into configurations to be applied to the network devices. The PPCM 158 issues these requests on the basis of the topology of the network interconnecting the production site 52 and the recovery site 54. Provisioning systems automatically generate the configuration commands to be applied to the devices, starting from a topological description of the network infrastructure they manage and a description of the desired final state of the network. These requests can be made, for example, in the following modes.

Static mode: the information needed to make a request to the provisioning system is pre-assigned inside the customer repository. When a failure occurs, the information is extracted from the database, formatted according to a scheme, and sent to the provisioning system.

Dynamic mode: the information needed to make a request to the provisioning system is obtained dynamically through the interaction between the provisioning system and the control module. In this case, the information does not necessarily have to be preconfigured in the database.
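The difference between the two modes lies only in where the request data comes from, which the following Python sketch tries to make concrete. The request fields, the dictionary-based repository, and the `query_provisioning` callable are hypothetical; a real provisioning platform would define its own meta-language and interface.

```python
from typing import Callable, Dict, List, Mapping


def build_request(server_id: str, routes: List[str], target_site: str) -> Dict[str, object]:
    """Format the data according to one fixed scheme (hypothetical field names)."""
    return {"server": server_id, "routes": list(routes), "target_site": target_site}


def static_mode_request(
    repository: Mapping[str, Mapping[str, object]], server_id: str
) -> Dict[str, object]:
    """Static mode: everything needed was pre-assigned in the customer repository."""
    record = repository[server_id]
    return build_request(server_id, record["routes"], record["target_site"])


def dynamic_mode_request(
    query_provisioning: Callable[[str], Mapping[str, object]], server_id: str
) -> Dict[str, object]:
    """Dynamic mode: ask the provisioning system for the data at failure time."""
    state = query_provisioning(server_id)
    return build_request(server_id, state["routes"], state["target_site"])
```

Static mode avoids a dependency on the provisioning system being able to answer queries during a failure, at the cost of keeping the repository up to date; dynamic mode trades that maintenance for a live lookup.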

The disaster recovery control module (DRCM) 160 is a software module that automates the disaster recovery process in response to a failure reported by the application server control module 154. Based on the information contained in the customer repository 152, this module can trigger the following procedures:
• Interaction with the customer information manager module 150 to collect information about the network topology of the production site 52 and the interconnection between the production site 52 and the recovery site 54.
• Sending a message to the provisioning platform control module 158 to migrate to the recovery site 54 the routing plan configured at the production site 52. This step involves changes to the configuration of the CE devices located at the customer site and the provider site, and to the configuration of the corresponding PE devices.
• Interaction with the storage gateway control module 156 to identify a recent system image saved on the SAN device 82 at the recovery site 54.
• Configuration of the DHCP (Dynamic Host Configuration Protocol) server at the recovery site so that, at diskless boot, the designated recovery server in the server pool of the recovery site 54 receives the same IP address as the application server 62 at the production site 52.
• Interaction with the customer information manager module 150 to identify a hardware system, belonging to the resource pool of the recovery site 54, with characteristics compatible with the application server 62.
• Enabling the diskless boot procedure on the recovery server 78. For example, a diskless boot procedure of the kind available for download from the Internet at URL: http://www.cisco.com/en/US/products/hw/ps4159/ps2160/products_installation_and_configuration_guide_book09186a00801a45b0.html (as of the filing date of this patent application) can be used.

Modules 150, 154, 156, 158, and 160 are executed by the CPU 172 in the recovery control server 84. In addition, these modules interact with the interface module 162 for communication. The interface module 162 contains various adapters, including a keepalive module 164, a storage gateway adapter 166, a provisioning platform adapter 168, and a storage platform adapter 170.

When the application server 62 has been restored at the production site 52, a failback procedure can be started manually or automatically to return the network configuration to its pre-failure state and release the allocated resources. Because the failback procedure is plainly symmetric to the recovery mode, it follows logic similar to that of the recovery procedure.

To set up the system initially, the software mirror 68 is installed on the application server 62 to perform synchronous or asynchronous mirroring, or periodic replication. The recovery control server 84 performs several configuration operations. For example, the SGCM 156 performs the configuration that associates the storage 60 of the production site 52 with the IP address of the application server 62. The PPCM 158 requests from the provisioning system the network configuration to be loaded into the repository module 152. The information loaded can include the following:
- The IDs of the CE-PE network devices used to ensure the connection between the production site 52 and the recovery site 54.
- The IDs of the CE-PE network devices used to ensure that the recovery site can be reached from all sites involved in disaster recovery.
- The routing plan used at the production site, which is to be moved to the recovery site in the event of disaster recovery.
- The access control lists configured on the CE devices of the production site, which define the access rules for the services made available by the application server over the extranet connection.

The CIMM 150 in the recovery control server 84 adds information about the application server 62 and the production site to the repository module 152. This information includes the hardware characteristics of the server (e.g., system image size, number of network interfaces), the software characteristics of the application server, and the information originating from the PPCM 158.

Finally, the ASCM 154 starts periodic polling to check the availability of the server. If the server does not respond, the ASCM 154 starts the disaster recovery procedure.
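A minimal Python sketch of this kind of availability check is given here. It assumes that "does not respond" is judged against a silence threshold and that timestamps are passed in by the caller; the class name `KeepaliveMonitor`, its methods, and the 30-second value are hypothetical and only illustrate the idea.

```python
from typing import Callable, Optional


class KeepaliveMonitor:
    """Starts recovery once no ACK has arrived for longer than a threshold."""

    def __init__(self, threshold_s: float, start_recovery: Callable[[], None]) -> None:
        self.threshold_s = threshold_s
        self.start_recovery = start_recovery
        self.last_ack: Optional[float] = None
        self.triggered = False

    def record_ack(self, now: float) -> None:
        """Call whenever the application server answers a poll."""
        self.last_ack = now

    def check(self, now: float) -> bool:
        """Return True only on the call that starts the recovery procedure."""
        if self.triggered:
            return False
        if self.last_ack is None:
            self.last_ack = now  # no ACK yet: start the clock at the first check
            return False
        if now - self.last_ack > self.threshold_s:
            self.triggered = True
            self.start_recovery()
            return True
        return False


events = []
monitor = KeepaliveMonitor(threshold_s=30.0, start_recovery=lambda: events.append("recover"))
monitor.record_ack(now=0.0)
first = monitor.check(now=10.0)    # still within the threshold
second = monitor.check(now=31.0)   # threshold exceeded, recovery starts
third = monitor.check(now=100.0)   # already triggered, nothing more to do
```

Passing the time in explicitly keeps the sketch deterministic; a real poller would presumably read a monotonic clock instead.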

FIG. 5 shows the system under normal operating conditions. The ASCM 154 checks that the application server is active, as indicated by arrow 180. In addition, the system administrator of the application server should inform the administrator of the disaster recovery service of any hardware changes made on the application server platform. The purpose is to keep the information held in the repository 152 up to date so that the correct recovery server can be selected if the disaster recovery procedure is started. As indicated by arrow 182, during normal operation the extranet client 58 accesses the application server 62 at the production site 52. As information is updated on the server 62, the software mirror 68 ensures that it is also stored at the recovery site 54, as indicated by arrow 180.
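How up-to-date hardware information could drive the choice of a recovery server is sketched below in Python. The description does not specify a matching rule, so the one used here (every requirement met, least spare capacity wins) is an assumption, as are the `HardwareProfile` fields and the sample pool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class HardwareProfile:
    """Hypothetical subset of the hardware characteristics kept in the repository."""
    name: str
    cpus: int
    memory_gb: int
    nics: int


def select_recovery_server(app: HardwareProfile, pool: List[HardwareProfile]) -> Optional[HardwareProfile]:
    """Pick the compatible pool member with the least spare capacity."""
    compatible = [
        s for s in pool
        if s.cpus >= app.cpus and s.memory_gb >= app.memory_gb and s.nics >= app.nics
    ]
    # Surplus tuple: smaller means a closer match to the application server.
    return min(
        compatible,
        key=lambda s: (s.cpus - app.cpus, s.memory_gb - app.memory_gb, s.nics - app.nics),
        default=None,
    )


app_server = HardwareProfile("app-62", cpus=4, memory_gb=16, nics=2)
pool = [
    HardwareProfile("pool-a", cpus=2, memory_gb=16, nics=2),   # too few CPUs
    HardwareProfile("pool-b", cpus=8, memory_gb=64, nics=4),   # compatible but oversized
    HardwareProfile("pool-c", cpus=4, memory_gb=32, nics=2),   # compatible, closest match
]
chosen = select_recovery_server(app_server, pool)
```

If the repository record is stale, for example after an unreported hardware upgrade, a rule like this would select against the wrong requirements, which is why the administrator notification matters.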

The disaster recovery procedure is started if the ASCM 154 receives no ACK message from the application server 62 for a time interval that exceeds a configurable threshold. Using the DRCM 160, the recovery control server 84 can start the following procedure:
1) Interact with the CIMM 150 to collect information about the network topology of the production site and the interconnection between the production site and the recovery site.
2) Send a message (MigrateNetwork) to the PPCM to move the routing plan configured at the production site to the recovery site. This stage involves changes to the configuration of the CE-PE devices at the customer site and the provider site.
3) Interact with the SGCM to identify the most recent system image saved on the storage system in the recovery site (if a replication mechanism is used, this may be the most recent consistent image).
4) Configure the DHCP server at the recovery site so that the recovery server, when it boots (diskless boot), receives the same IP address as the application server at the production site.
5) Interact with the CIMM to identify a hardware system in the resource pool of the recovery site whose characteristics are compatible with the application server.
6) Enable the diskless boot procedure: at this stage, the GUI informs the human operator that the recovery server selected from the pool of standby hardware resources can be started.
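The six steps above can be sketched as an ordered sequence in Python. The module names come from the description, but the action strings are paraphrases, the `execute` callback is hypothetical, and stopping at the first failed step is an assumption that the description neither states nor rules out.

```python
from typing import Callable, List, Tuple

# Module names follow the description; the action strings are paraphrases.
RECOVERY_STEPS: List[Tuple[str, str]] = [
    ("CIMM", "collect topology and interconnection information"),
    ("PPCM", "MigrateNetwork: move the routing plan to the recovery site"),
    ("SGCM", "identify the most recent system image"),
    ("DHCP", "reserve the production IP address for the recovery server"),
    ("CIMM", "identify compatible hardware in the resource pool"),
    ("GUI", "tell the operator that the recovery server can be started"),
]


def run_recovery(execute: Callable[[str, str], bool]) -> List[str]:
    """Run the steps in order and return the modules that completed."""
    completed = []
    for module, action in RECOVERY_STEPS:
        if not execute(module, action):
            # Assumption: stop at the first failed step rather than continue.
            raise RuntimeError(f"recovery step failed: {module}: {action}")
        completed.append(module)
    return completed


log = []


def fake_execute(module: str, action: str) -> bool:
    """Stand-in for the real module calls; it only records what was asked."""
    log.append(f"{module}: {action}")
    return True


order = run_recovery(fake_execute)
```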

A recovery server, which may have no internal storage (diskless), sends a request to the DHCP server to obtain an IP address and the information needed to access the storage system that holds the system image of the application server (IP address, volume name, LUN, etc.). Once this information has been received, the recovery server can perform a diskless boot over the network. When the boot finishes, the recovery server is consistent with the original application server up to the last transaction. Any intranet, extranet, or Internet client can access the restored services on the recovery server via TCP/IP, using the connectivity configured by the disaster recovery procedure.
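The information returned to the booting recovery server might look like the following Python sketch. It assumes, purely for illustration, that the image is reached over iSCSI and that the storage location is encoded as an RFC 4173 style root path; the `BootReservation` type, the MAC address, the documentation-range IP addresses, and the target name are all made up.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class BootReservation:
    """Hypothetical per-server entry prepared during the recovery procedure."""
    ip_address: str   # same address the application server used
    storage_ip: str   # storage system holding the system image
    volume: str       # volume (target) name
    lun: int


def iscsi_root_path(r: BootReservation, port: int = 3260) -> str:
    """RFC 4173 style root path: iscsi:<server>:<protocol>:<port>:<LUN>:<target>."""
    # Protocol 6 is TCP; RFC 4173 writes the LUN in hexadecimal.
    return f"iscsi:{r.storage_ip}:6:{port}:{r.lun:x}:{r.volume}"


def answer_boot_request(mac: str, reservations: Dict[str, BootReservation]) -> Optional[dict]:
    """Return the address and storage details for a known recovery server."""
    r = reservations.get(mac.lower())
    if r is None:
        return None
    return {"ip_address": r.ip_address, "root_path": iscsi_root_path(r)}


reservations = {
    "00:11:22:33:44:55": BootReservation(
        ip_address="192.0.2.62",
        storage_ip="198.51.100.10",
        volume="iqn.2011-12.example:app62",
        lun=0,
    )
}
reply = answer_boot_request("00:11:22:33:44:55", reservations)
```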

FIG. 6 shows the data flow after the disaster recovery procedure has been started. As indicated by arrow 188, when an extranet client attempts to access the production site 52, the request is automatically rerouted to the recovery site 54. This is transparent to the extranet user, who does not need to type a different network address for the recovery site. Thus, from the extranet client's point of view, the production site is still being accessed even though the recovery site is actually being accessed.
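The transparency described here comes from changing where a prefix is routed rather than changing the address the client uses. A minimal Python sketch of that idea follows; the `RoutingPlan` class, the site labels, and the documentation-range prefix are hypothetical, and real CE-PE reconfiguration is of course done on the network devices, not in a dictionary.

```python
import ipaddress
from typing import Dict, Optional, Union

Network = Union[ipaddress.IPv4Network, ipaddress.IPv6Network]


class RoutingPlan:
    """Toy routing table that maps prefixes to the site that serves them."""

    def __init__(self) -> None:
        self.routes: Dict[Network, str] = {}

    def set_route(self, prefix: str, site: str) -> None:
        self.routes[ipaddress.ip_network(prefix)] = site

    def lookup(self, address: str) -> Optional[str]:
        """Longest-prefix match for the destination address."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None
        return self.routes[max(matches, key=lambda net: net.prefixlen)]

    def migrate(self, from_site: str, to_site: str) -> int:
        """Point every route of from_site at to_site; return how many moved."""
        moved = 0
        for net in list(self.routes):
            if self.routes[net] == from_site:
                self.routes[net] = to_site
                moved += 1
        return moved


plan = RoutingPlan()
plan.set_route("192.0.2.0/24", "production")
before = plan.lookup("192.0.2.62")          # client request under normal operation
moved = plan.migrate("production", "recovery")
after = plan.lookup("192.0.2.62")           # same address, now served elsewhere
```

The client-side address is identical in both lookups; only the site that answers changes.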

FIG. 7 shows the failback condition. The failback procedure makes it possible to return to the initial state after the disaster recovery procedure. Even after the application server 62 at the production site 52 has been restored, there is still a period during which all services are provided by the recovery site.

The failback procedure can include the following steps to return to the normal operating conditions described above:
1) The SGCM 156 starts a reverse replication procedure to make a consistent copy of the recovery site data on the production site, as indicated by arrow 190.
2) The DRCM sends a message (MigrateNetwork) to the PPCM to move the routing plan configured at the recovery site to the production site. This stage involves changes to the configuration of the CE-PE devices at the customer site and the provider site.
3) Service at the production site resumes, and clients access the original application server 62.
4) The hardware resources used by the recovery server 78 at the recovery site 54 are released (returned to the free resource pool).
5) The synchronous/asynchronous mirroring (or replication) procedure is restarted.
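Taken together with the recovery procedure, the failback steps suggest a simple three-state lifecycle. The Python sketch below models it under stated assumptions: the state names, the `ServiceLifecycle` class, and the rule that only these three transitions are allowed are illustrative and are not taken from the description.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"        # production serves clients, mirroring active
    RECOVERY = "recovery"    # recovery site serves clients
    FAILBACK = "failback"    # reverse replication while recovery site still serves


# Assumption: these are the only legal transitions.
ALLOWED = {
    (Mode.NORMAL, Mode.RECOVERY),
    (Mode.RECOVERY, Mode.FAILBACK),
    (Mode.FAILBACK, Mode.NORMAL),
}


class ServiceLifecycle:
    def __init__(self) -> None:
        self.mode = Mode.NORMAL
        self.serving_site = "production"
        self.mirroring = True

    def transition(self, target: Mode) -> None:
        if (self.mode, target) not in ALLOWED:
            raise ValueError(f"cannot go from {self.mode.value} to {target.value}")
        self.mode = target
        if target is Mode.NORMAL:
            # Routing plan moved back and mirroring restarted.
            self.serving_site = "production"
            self.mirroring = True
        else:
            # During recovery and failback the recovery site provides all services.
            self.serving_site = "recovery"
            self.mirroring = False


service = ServiceLifecycle()
service.transition(Mode.RECOVERY)
service.transition(Mode.FAILBACK)
site_during_failback = service.serving_site
service.transition(Mode.NORMAL)
```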

FIG. 8 shows a flowchart of a method for carrying out the present invention. In process block 210, the recovery site detects a problem at the production site by polling. In process block 212, the recovery site automatically reconfigures the network so that attempts to access the production site are routed to the recovery site. Such requests may originate from the extranet or the intranet.

The advantages of the present invention are clear from the above description.

In particular, one advantage is that the RPO and RTO parameters are optimized by the replication performed by the mirroring process.

Another advantage is that the present invention does not depend on the software/hardware solutions adopted at the production or recovery sites.

Yet another advantage is the automatic rerouting that directs clients to the recovery server.

Finally, it is clear that many changes and modifications can be made to the present invention, all of which fall within the scope of the invention.

For example, the solution can be extended and modified by acting on the individual components that implement it, or by integrating existing components into the control architecture known in the art.

In particular, at the production site, the components that provide the synchronous/asynchronous mirroring software are not limited to a specific technology. They can be implemented as host-based, network-based, or array-based virtualization mechanisms, and as software modules or specific hardware components.

Further, the term "disaster" as used herein means that the production site is not functioning for whatever reason. It does not mean that an actual disaster had to occur.

Furthermore, in the interconnection network between the production site and the recovery site, the protocol used for the mirroring/replication flow to the remote site can be a standard or proprietary protocol (e.g., SCSI), provided that it reproduces on the storage of the recovery site the same writes that are performed at the production site.

In addition, at the recovery site, the mechanism for booting over the network can locally use a transport protocol (for example, Fibre Channel or iSCSI) that differs from the protocol used to access data at the production site or used in the interconnection between the two sites. Moreover, the recovery control server can be built entirely within a single device, or in a distributed manner that exploits the features or functionality of other devices providing the required basic functions. The control logic for these functions can be implemented in an independent system or integrated as an additional function in one of those devices. In particular, the network rerouting of the services provided after the application server has been restarted at the recovery site can be partly or fully controlled by another system, integrated with the intelligence module of the present system, and delegated to the dynamic management of the extranet/intranet VPN sites of the connectivity provider. This rerouting mechanism can use various alternatives (such as MPLS VPN or stackable VLAN/dot1q), depending on the specific connectivity used between the two sites and between the clients and the production site. Similarly, the storage gateway components inside the recovery control server can be implemented by integrating base modules already present in commercial products such as gateways or storage switches.

To further optimize the restoration of the primary site to normal conditions (failback), the recovery and restore mechanism of this solution can also include a specific QoS mechanism, dynamic or not. To speed up operations in the recovery and restore phases, this mechanism shortens the time window of the restore activity by providing an interconnection between the two sites with a wider transmission band than the one available under normal operating conditions.
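The effect of a wider band on the restore window is simple arithmetic, shown here as a short Python sketch. The data volume and link rates are hypothetical figures chosen only to make the calculation concrete, and protocol overhead is ignored.

```python
def transfer_hours(data_gb: float, bandwidth_mbps: float) -> float:
    """Hours needed to copy data_gb (decimal gigabytes) at bandwidth_mbps (megabits/s)."""
    megabits = data_gb * 8 * 1000      # 1 GB = 8000 megabits
    return megabits / bandwidth_mbps / 3600


# Hypothetical figures: 500 GB to copy back to the production site.
normal = transfer_hours(500, 100)      # link as sized for normal operation
boosted = transfer_hours(500, 1000)    # temporarily widened link during failback
```

With these assumed numbers the copy takes roughly 11.1 hours at 100 Mbit/s and roughly 1.1 hours at 1 Gbit/s, so a tenfold wider band shrinks the restore window tenfold.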

As anticipated, in order to optimize the processing hardware resources at the recovery site, in particular those of the individual recovery servers, specific software modules can be installed to virtualize the physical resources, so as to decouple the hardware characteristics of the application servers protected by this solution from those of the systems that make up the resource pool.

This makes it easier both to render such recovery servers compatible with the hardware of the primary server at the production site and to ensure a more efficient allocation of resources. In this way, thanks to the ability to virtualize the physical drivers of the system (1:1 virtualization), different hardware can be emulated on the same physical configuration, and a state-of-the-art disaster recovery service is freed from having to use servers with the same hardware configuration as the application server at the production site. In addition, virtualization software can be used to share the same hardware resources among more than one application server image at the same time (n:1 virtualization).

50 System of the present invention
52 Production site
54 Recovery site
56 Network connected between the production site and the recovery site
58 Extranet client

Claims

A disaster recovery system (50) comprising: a mirroring module (68) provided in the application server (62) of each production site (52) for mirroring at least a part of the production site (52) to a recovery site (54), wherein the recovery site (54) is intended to serve a plurality of different production sites (52) and is adapted to be coupled to the plurality of different production sites (52) via a packet-based network (56); and
a recovery control server (84) provided at the recovery site (54) to detect a problem at the production site (52) and, in response to detecting the problem at the production site (52), to automatically reconfigure the packet-based network (56) so that attempts to access the production site (52) are routed to the recovery site (54),
wherein the recovery control server (84) is further adapted to select a recovery server (78) at the recovery site (54) from a pool of recovery servers, the selecting including obtaining hardware features associated with the application server (62) at the production site (52) and matching those features as closely as possible to the hardware features of the recovery servers (78) in the pool,
wherein a computer sending a request destined for the production site (52) is connected to the packet-based network (56), the packet-based network (56) includes a device for routing the request, the device is configured so that the request from the computer destined for the production site (52) reaches the production site (52), and reconfiguring the packet-based network (56) includes reconfiguring the device so that the request from the computer destined for the production site (52) reaches the recovery site (54), and
wherein the recovery control server (84) is configured to store information about the network topology of the production site (52) and the interconnection between the production site (52) and the recovery site (54), and reconfiguring the device so that the request from the computer destined for the production site (52) reaches the recovery site (54) includes the recovery control server (84) collecting the information about the network topology and the interconnection between the production site (52) and the recovery site (54), and moving the routing plan configured at the production site (52) to the recovery site (54).

The disaster recovery system (50) according to claim 1, further comprising a database (152) stored at the recovery site (54) and adapted to store information about the application server (62) at the production site.

The disaster recovery system (50) of claim 2, wherein the information about the application server (62) comprises one or more of:
the routing plan of the application server,
the access rules of the application server for intranet and extranet clients,
the hardware features of the application server (62), and
image features.

Reconfiguring the packet-based network (56) comprises rerouting a request having the network address of the production site (52) to the recovery site (54). The fault recovery system (50) according to one aspect.

The disaster recovery system (50) of claim 4, wherein the recovery control server (84) is further adapted to allow intranet (64) and extranet (58) computers to access the selected recovery server (78) using the network address of the application server (62) of the production site (52).

The disaster recovery system (50) according to any one of the preceding claims, wherein detecting a problem at the production site (52) comprises:
checking the accessibility of the application server (62) at the production site (52); and
initiating a disaster recovery procedure if the application server (62) is inaccessible for a period of time beyond a configurable threshold.

The disaster recovery system (50) according to any one of the preceding claims, wherein detecting a problem at the production site (52) comprises:
polling the application server (62) of the production site (52);
waiting for a response from the application server (62) for a predetermined period; and
starting a disaster recovery procedure when the predetermined period ends.

The disaster recovery system (50) according to any one of claims 1 to 7, wherein the recovery control server (84) is further adapted to:
upon detecting that the problem at the production site (52) has been resolved, automatically restore the production site (52) by copying (190) recovery data from the recovery site (54) to the production site (52); and
automatically reconfigure the packet-based network (56) to allow access to the production site (52) after the production site (52) has been restored.