JP2009501382A

JP2009501382A - Maintaining write order fidelity in multi-writer systems

Info

Publication number: JP2009501382A
Application number: JP2008520988A
Authority: JP
Inventors: Steve Bromling; Dale Hagglund; Geoff Hayward; Roel Van der Goot; Wayne Karpoff
Original assignee: YottaYotta, Inc.
Priority date: 2005-07-14
Filing date: 2006-07-14
Publication date: 2009-01-15
Also published as: US7475207B2; EP1913482A2; EP1913482A4; US20070022264A1; WO2007074408A3; WO2007074408A2; CA2615324A1

Abstract

Write order fidelity (WOF) is maintained for fully active implementations, in which multiple access nodes at geographically separated sites concurrently read and/or write data of a distributed data system in a “fully active” manner. From the perspective of hosts at the various geographic locations, a synchronous, cache-coherent view of the data is provided. Data transfer is asynchronous. Time-ordered data images are created and maintained so that operation can be restarted after a partial system failure that causes the loss of data that had been acknowledged as written to the originating host but had not yet been asynchronously transferred over the network. The time-ordered asynchronous data transfer is implemented as a pipeline of changes that reflects contributions from all nodes. WOF also improves network performance and reduces bandwidth consumption.

Description

(Cross-Reference to Related Applications)
This application is a non-provisional patent application that claims priority to US Provisional Patent Application No. 60/699,935 (filed July 14, 2005; attorney docket no. 019417-008700US), the entire contents of which are incorporated herein by reference for all purposes.

(Copyright Notice)
A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the reproduction by anyone of the patent document or the patent disclosure as it appears in the US Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.

(Background of the Invention)
The present invention relates to systems and methods for providing recovery of the contents of a data storage system after a failure or other potential cause of data loss or corruption, and more particularly to systems and methods for providing write order fidelity for a storage system that has data writers operating concurrently at multiple locations across a distributed data storage network.

In current storage networks, and in particular storage networks that include geographically separated access nodes and storage resources interconnected by a network, write performance is severely impeded as the distance between nodes increases if writes must be replicated or transmitted synchronously. It is also highly desirable to minimize the bandwidth required between locations. Asynchronous data transmission is therefore used, in which a write is acknowledged before the data has been transmitted to the nodes at the remote site.

It is also desirable that data access be localized, in order to improve the speed of access to data blocks requested by host devices. Caching blocks at access nodes provides localization, but the cached data must be kept coherent with respect to modifications at other access nodes that may be caching the same data.

Furthermore, such complex storage applications need to withstand failures of their backing storage systems, of local storage networks, of the network interconnecting the nodes, and of the access nodes. In the event of a failure, asynchronous data transmission implies the potential loss of data held at the failed site. A data image that is consistent from the perspective of the application must then be constructed from the surviving storage contents. An application must make some assumptions about which writes, or pieces of data, written to the storage system have survived the storage system failure. Specifically, it assumes that, for all writes acknowledged as completed by the storage system, the ordering of writes is maintained: if a modification due to a write to a given block is lost, then all subsequent writes to blocks in that volume or in related volumes are also lost.

As used herein, write order fidelity (“WOF”) refers to a family of related properties, each of which describes the contents of a storage system after recovery from some type of failure; that is, properties that an application can assume about the contents of the storage system after it has recovered from a failure. Write order fidelity provides a guarantee that the surviving data is consistent after recovery from a failure. Complex applications such as file systems or databases rely on this consistency property to recover from storage system failures. Even simpler applications that are not explicitly written to recover from their own failure or from the failure of back-end storage should benefit from these post-failure guarantees.

Under WOF in the strict sense, an application generates a stream of writes {W_i | i ≥ 1} to the storage system that supports it. The underlying storage system exhibits strict write order fidelity if, after any failure of the storage system, the state of the storage system upon recovery reflects some prefix of the write sequence from the application. In other words, there exists some i ≥ 0 such that every write {W_j | j ≤ i} has been committed to storage and no write {W_j | j > i} has been committed to storage.
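The prefix condition can be illustrated with a small check. This is only a sketch: the function name `is_strict_wof_state` and the representation of committed writes as a set of indices are assumptions made for illustration and do not come from the patent.

```python
from typing import Set


def is_strict_wof_state(num_writes: int, committed: Set[int]) -> bool:
    """Return True if `committed` equals {1, ..., i} for some 0 <= i <= num_writes.

    `num_writes` is the length of the application's write stream W_1..W_n and
    `committed` holds the indices of the writes found on storage after recovery.
    """
    i = len(committed)
    if i > num_writes:
        return False
    # A prefix of length i contains exactly the indices 1..i, with no gaps.
    return committed == set(range(1, i + 1))
```

For example, with five writes issued, `{1, 2, 3}` is an acceptable recovered state (i = 3), whereas `{1, 3}` is not, because W_3 survived while the earlier W_2 did not.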

Strict WOF assumes that writes are totally ordered, which is straightforward for a single controller or for a set of tightly coupled storage controllers that communicate through shared memory. However, the cost of generating a total order on writes is significant for controllers that communicate via messages, even within a site. The cost of ordering becomes unacceptable once the latency between controllers reaches even a few milliseconds.

Conventionally, an “active-passive” approach is used for asynchronous transmission of data between sites, in which only one writer, or host processor, has read-write access to a given volume of blocks and the other processors have read access only. A “fully active” environment, in which reads and writes to a given volume of blocks can be made arbitrarily from any node, is highly desirable, but it requires changes to the approach to WOF and to the way WOF interacts with caching at all access nodes of the system.

(Summary of the Invention)
Embodiments in accordance with the present invention provide write order fidelity (WOF) for distributed storage systems in which storage access nodes, commonly referred to as storage controllers, and storage systems are interconnected by a network. As will be apparent, any network that allows communication between the nodes can be used. Various embodiments allow for fully active operation, in which writes and/or reads to any given data volume may be initiated from any node of the system, although the same system can clearly also be used in a conventional active-passive mode.

WOF is obtained, in some aspects, by using delta sets that are distributed, or fragmented, across the various sites and/or nodes, together with a cache coherency layer. Various embodiments provide WOF for fully active sites without unacceptable operating cost, using distributed cache coherency to ensure that the latest write at any given node is immediately reflected in subsequent reads by any site, and thus provide a coherent application view. Distributed cache coherency can also ensure that the data blocks written into any given delta set, which is distributed across the nodes, are coherent and reflect the latest writes, with the corresponding partial deltas stored at the various nodes of the system.

In one embodiment, one or more host-visible volumes of data are managed as a WOF group. Even when multiple data volumes make up a WOF group, WOF is guaranteed for all blocks from all data volumes in the WOF group, as if all of the blocks were in a single volume.

In one embodiment, a multi-stage pipeline of delta sets is used to collect newly written block images, exchange the block images between nodes, and commit the block images to back-end storage. A system-wide barrier mechanism ensures that the delta set pipeline advances simultaneously on all nodes of the system. As will be apparent, the barrier need not be system-wide if not all nodes participate in managing a given WOF group. New writes to any of the data volumes that make up a WOF group are stored in the cache in the “open delta” for that WOF group. Upon some trigger or other event, the current open delta is to be closed. A message to close the open delta is broadcast to the WOF group, so that any outstanding writes can be completed and the delta can be closed. A new delta is opened to receive new writes. The most recently closed delta, which can exist as unique fragments at different nodes of the WOF group, goes through an “exchange phase”, by which each site can obtain a complete copy of the closed delta. After all modified blocks have been exchanged between nodes, and the next barrier has occurred, demarcating the next advance of the pipeline, the complete closed delta enters a “commit phase”, in which it can be made persistent by being written to stable storage.
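The three pipeline stages can be pictured with a toy, single-process model. Everything here is an illustrative assumption rather than the patented implementation: the class name `DeltaPipeline`, the dictionaries standing in for caches and back-end storage, and the `advance` method standing in for a barrier across all nodes. Network messaging and failure handling are omitted.

```python
from typing import Dict, List

Block = int
Fragment = Dict[Block, bytes]  # block address -> latest image written at one node


class DeltaPipeline:
    """Toy model of the open -> exchange -> commit pipeline for one WOF group."""

    def __init__(self, nodes: List[str]) -> None:
        self.nodes = nodes
        self.open_delta: Dict[str, Fragment] = {n: {} for n in nodes}
        self.exchanging: Fragment = {}         # complete copy of the closed delta
        self.storage: Dict[Block, bytes] = {}  # stand-in for back-end storage

    def write(self, node: str, block: Block, data: bytes) -> None:
        # Assumed stand-in for cache coherency: the latest writer owns the
        # block, so it appears in exactly one node's fragment.
        for fragment in self.open_delta.values():
            fragment.pop(block, None)
        self.open_delta[node][block] = data

    def advance(self) -> None:
        """One barrier: every stage moves forward together on all nodes."""
        # Commit phase: the delta exchanged in the previous interval is persisted.
        self.storage.update(self.exchanging)
        # Exchange phase: merge the per-node fragments of the just-closed delta.
        merged: Fragment = {}
        for node in self.nodes:
            merged.update(self.open_delta[node])
        self.exchanging = merged
        # A fresh open delta starts accepting new writes.
        self.open_delta = {n: {} for n in self.nodes}
```

In this model a write reaches `storage` only after two calls to `advance`: the first closes its delta and exchanges it, the second commits it.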

These and other embodiments of the present invention, along with their benefits and features, are described in more detail in conjunction with the text below and the accompanying figures.

Various embodiments in accordance with the present invention are described with reference to the drawings.

(Detailed Description of the Invention)
Systems and methods in accordance with various embodiments overcome the aforementioned and other deficiencies in existing data storage systems by providing various fully active implementations, in which each of multiple sites can concurrently read and/or write data while maintaining write order fidelity (WOF). This contrasts with existing active/passive implementations, in which a single writer typically pushes data over the network to a passive site. These and other objectives can be achieved in one aspect by combining an improved delta set approach to write order fidelity with an approach for providing distributed cache coherency.

In one aspect, a time-ordered image of the data is created and maintained so that, after a storage system failure, operation can be restarted without a file system, database, or other application being corrupted by an inconsistent view of the data. Typically, existing systems transmit data in blocks, known in the art as the basic unit of data transfer, and must preserve the order of those blocks. This generates a significant number of short transactions, which can degrade network and/or system performance as latency increases. One way to reduce the number of transactions is to group blocks into deltas, or delta sets. Using delta sets increases the granularity of data transfers over the network, and thus the efficiency of network transfers. It also eliminates multiple transfers of individual blocks that are written more than once within the time interval captured by a delta set: if an individual block is written many times, it is written only once with the delta. The boundaries between deltas provide write-order-consistent images of the data.
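The saving from repeated writes to one block can be shown in a few lines. This is a minimal sketch; `build_delta` and the `(block, data)` tuple format are hypothetical names chosen for illustration.

```python
from typing import Dict, Iterable, Tuple


def build_delta(writes: Iterable[Tuple[int, bytes]]) -> Dict[int, bytes]:
    """Collapse a time-ordered run of (block, data) writes into one image per block."""
    delta: Dict[int, bytes] = {}
    for block, data in writes:
        # A later write to the same block replaces the earlier image, so the
        # block is transferred only once when the delta is sent.
        delta[block] = data
    return delta


# Four writes arrive during the interval, three of them to block 7.
interval = [(7, b"a"), (3, b"b"), (7, b"c"), (7, b"d")]
delta = build_delta(interval)
```

Here `delta` holds two block images instead of four writes, and block 7 carries only its final contents.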

Delta sets are described, for example, in US Pat. No. 6,823,336, issued November 23, 2004, which is incorporated herein by reference. Other descriptions of deltas and of methods for mirroring remote data can be found, for example, in US Pat. No. 7,055,059, issued May 30, 2006, and in Minwen Ji et al., “Seneca: remote mirroring done write” (Proceedings of the USENIX Technical Conference, pages 253-256, June 2003), each of which is incorporated herein by reference.

Prior to the present invention, delta-set-based WOF implementations supported only active-passive data access. In that scenario, implementing WOF is relatively simple, because the “active” site has complete control over the transitions between WOF images. A simple active-passive implementation of WOF involves maintaining only two delta sets, with a memory region allocated for each of the two partial deltas at the active and passive sites. After a decision is made at the active node to advance the delta image, new writes are simply accepted into the alternate buffer. A message is sent to the passive site, directing it to switch and transferring the data from the newly closed delta. Once the “push” of data to the passive site is complete, both sites can commit the just-exchanged data to disk. When this is complete, the active site can again toggle between the delta set buffers to close the current delta. Enhancements to conventional delta-set-based WOF implementations include maintaining a larger number of delta sets, typically kept as a set of rotating buffers, which allows the exchange of closed buffers to lag behind during short-term saturation of network bandwidth. Further details of the basic use of delta sets are described in US Pat. No. 6,823,336, incorporated by reference, and are not repeated here.

Introducing the possibility of multiple nodes, potentially at multiple sites, writing to a common data volume requires a more sophisticated approach. In one embodiment, a distributed cache coherency mechanism is used, such as that taught in US Patent Application No. 11/177,924, filed July 7, 2005, entitled “Systems and Methods for Providing Distributed Cache Coherence”, and in Provisional Application No. 60/586,364, filed July 7, 2004, all of which are incorporated herein by reference. These methods provide a mechanism by which caches maintained at multiple nodes, separated both locally and across networks, with the possibility of read/write access from all nodes, are kept coherent. In other words, they order the distribution of data images as in a system in which all hosts access a single disk drive. Using distributed cache coherency with delta sets allows for a synchronous image of the data even though the actual movement of data is asynchronous. The data movement maintains write order fidelity across geographies, so that the system can be restarted at any consistent point in time.

In one aspect, a directory manager module (“DMG”) can be used to provide a cache coherency mechanism for shared data across a set of distributed data access nodes. The set of nodes used to cache data from a shared data volume is called a shared group. In general, a DMG module includes software executing on a processor or other intelligence module (e.g., an ASIC) in a node. A DMG module may be implemented on a single node or distributed across multiple intercommunicating nodes. In some aspects, an access node is embodied as a controller device, or node, that is communicably coupled to a storage network such as a storage area network (SAN) and that enables access to data stored in the storage network. It will be apparent, however, that an access node can also be embodied as another network device, such as an intelligent fabric switch or a hub adapter. Moreover, any network node can be configured to operate as an access node with DMG functionality (e.g., a DMG can run on a desktop computer with a network connection). US Pat. No. 6,148,414 (incorporated herein by reference) discloses controller devices and nodes for which implementation of aspects of the present invention is particularly useful.

For one embodiment of the present invention, FIG. 1 shows a basic network configuration 100 that includes a plurality of network clients 102(a)-102(N) communicably coupled to a plurality of access node devices 104(a)-104(N). Each access node device includes a processor component 106, such as a microprocessor or other intelligence module, a cache 108 (e.g., a RAM cache) and/or other local storage, communication ports (not shown), and an instance of a DMG module 110. In general, “N” is used herein to indicate a plurality, so the number “N” when referring to one component does not necessarily equal the number “N” of a different component. Each client 102 can be communicably coupled to one or more of the access nodes 104 over a local network connection 112, for speed and other reasons, or can be communicably coupled to the nodes 104 in any of a number of connection schemes as required for the particular application and geographic location. These include, for example, a direct wired or wireless connection, an Internet connection, any local area network (LAN) type connection, any metropolitan area network (MAN) connection, any wide area network (WAN) type connection, a VLAN, and any proprietary network connection. Each node 104 also typically includes, or is communicably coupled to, one or more other nodes, and is communicably coupled to one or more storage resources 114, each including one or more disk drives, over one or more networks 116, such as a storage area network (SAN), LAN, WAN, MAN, or a high-speed network such as InfiniBand. In one aspect, a node 104 is preferably coupled to one or more storage resources 114 over a local network connection. Nodes may be located in close physical proximity to each other, or at least one may be located remotely (e.g., geographically remote) from the other nodes. Access nodes can also intercommunicate with other nodes over the network 116 and/or over other communication networks or media, such as a PCI bus or backbone, a Fibre Channel network, or the same network 112 that the client devices 102 use to communicate with the access nodes 104.

Distributed cache coherency helps reduce bandwidth requirements between geographically separated access nodes by allowing localized (cached) access to remote data. Data access generally cannot be localized unless the data can be cached locally, and it is unsafe to cache data locally unless the cached data can be kept coherent with respect to modifications at remote access nodes. DMG embodiments can satisfy the correctness requirements of cache coherency while keeping overhead low enough to make localized cache access practical and beneficial. Although the embodiment shown in FIG. 1 depicts a single DMG directory 110, the methods for distributed cache coherency taught in the patent applications referenced above manage coherency on a peer-to-peer basis and scale to large numbers of nodes and to large distances between nodes. Distributed cache coherency can be combined with the use of delta sets to provide write order fidelity (WOF).

For one embodiment, write order fidelity (WOF) and “dependent WOF” can be defined more formally as follows. An application may update its persistent state using an update operation that consists of two update writes: a metadata update W_1 (e.g., writing an entry to a database recovery log) and a data update W_2 (e.g., writing the modified data to the database). For proper recovery from a failure of either the application itself or the storage, the application waits until the metadata update W_1 has completed before issuing the data update W_2. The timing relationship between these two writes is denoted here as W_1 → W_2; W_2 is said to be “dependent” on W_1, and W_1 is said to be “required” by W_2. This approach allows W_1 to record information without which W_2 could not be correctly interpreted. In such a storage system, dependent writes can be observed by noting that the start time of W_2 is after the completion time of W_1.

Such a storage system is said to exhibit “dependent” write order fidelity if, after recovery from any failure, exactly one of the following cases holds for any two writes W_1 → W_2: (a) neither W_1 nor W_2 has been applied to storage; (b) both W_1 and W_2 have been applied to storage; or (c) only W_1 has been applied to storage.
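Cases (a) through (c) together exclude exactly one outcome: the dependent write being applied without the write it requires. A minimal check of that condition might look as follows; the function name and the use of string identifiers for writes are assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple


def exhibits_dependent_wof(
    applied: Set[str], dependencies: Iterable[Tuple[str, str]]
) -> bool:
    """Check a recovered state against (required, dependent) pairs W_1 -> W_2.

    The state is acceptable unless some dependent write is present on storage
    while the write it requires is missing.
    """
    return all(
        required in applied or dependent not in applied
        for required, dependent in dependencies
    )
```

With the single dependency `("W1", "W2")`, the states `set()`, `{"W1"}`, and `{"W1", "W2"}` pass, and `{"W2"}` fails.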

Dependent WOF defines a partial order, rather than a total order, on write operations. In some embodiments this can use a globally synchronized notion of time, although other embodiments avoid this. Moreover, the writes that are not ordered by dependent WOF are precisely those that an application issues concurrently. According to one aspect, dependent write order fidelity is the appropriate definition of consistency for network controllers such as the NetStorager product. US Pat. No. 6,148,414 (incorporated herein by reference) discloses aspects of useful network controllers.

In one aspect, the storage system determines which pairs of writes W_1 and W_2 are dependent from the application's point of view. It is not easy for a storage device to determine these dependencies, because the storage device can only observe that W_2 arrived after W_1 completed. There may be no dependency between the two writes, and the apparent ordering may be mere coincidence. Without assistance from the application (which is not assumed), a storage system generally cannot distinguish a coincidence from a genuine dependent write pair. To be safe, the storage system can assume that every case in which W_2 arrives after W_1 has completed represents a true write dependency. This creates a stronger partial order than is necessary, but the true write dependencies are contained within this stronger ordering.
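The conservative rule can be sketched as a comparison of observed timestamps. The function name `infer_dependencies` and the `(start, completion)` tuple per write are hypothetical, and the sketch assumes all timestamps come from a single comparable clock.

```python
from typing import Dict, List, Tuple


def infer_dependencies(times: Dict[str, Tuple[float, float]]) -> List[Tuple[str, str]]:
    """Return (required, dependent) pairs inferred from observed timing alone.

    `times` maps a write id to its (start, completion) timestamps. Any write
    that starts after another has completed is treated as dependent on it,
    whether or not the application intended a dependency.
    """
    pairs = []
    for a, (_, a_done) in times.items():
        for b, (b_start, _) in times.items():
            if a != b and b_start > a_done:
                pairs.append((a, b))
    return sorted(pairs)
```

For instance, if W1 runs over [0.0, 1.0], W2 over [2.0, 3.0], and W3 over [0.5, 2.5], only W1 → W2 is inferred. W3 overlaps both other writes, so it is treated as concurrent with them and left unordered.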

In one aspect, an environment with multiple active cache-coherent initiators uses a global notion of time to provide dependent WOF. The most direct interpretation of this implies an absolute global ordering of writes. The per-write overhead of that approach is very high and is not worth serious consideration, particularly in geographically separated settings. One alternative involves grouping batches of writes into deltas, such that the boundaries between deltas obey dependent write order fidelity as described above.

Within a delta, which is used as described above to represent changes to the system, writes may be reordered, which allows for some optimization. When a delta is closed, every write contained in the delta depends only on writes in the same delta or in earlier deltas. If a required write were deferred to a later delta, WOF would no longer be preserved across delta boundaries. The operation of closing a delta should therefore be coordinated across all participating nodes.
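The delta boundary condition reduces to a simple inequality: a required write must never land in a later delta than a write that depends on it. The sketch below assumes hypothetical names (`deltas_preserve_wof`, a `delta_of` mapping from write id to delta number) that are not taken from the patent.

```python
from typing import Dict, Iterable, Tuple


def deltas_preserve_wof(
    delta_of: Dict[str, int], dependencies: Iterable[Tuple[str, str]]
) -> bool:
    """Check an assignment of writes to numbered deltas.

    For every dependency (required, dependent), the required write must be in
    the same delta as the dependent write or in an earlier one.
    """
    return all(
        delta_of[required] <= delta_of[dependent]
        for required, dependent in dependencies
    )
```

Placing W_1 in delta 1 and W_2 in delta 0 violates the condition: committing delta 0 alone would apply W_2 without W_1.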

In one embodiment, a delta is global across all nodes at all sites participating in the WOF group. Each node collects the fragment of the delta that was written locally. Because each site needs a complete copy in order to commit to its local storage, the sites exchange fragments to assemble complete copies. In one aspect, these partial deltas are exchanged between sites before being applied locally. After the exchange, each site can commit the delta locally to the underlying storage. Local commits are performed atomically and in the correct order relative to other deltas. The atomicity requirement means that a delta is made persistent before it is applied. Varying the degree of persistence can affect the strength of the WOF consistency guarantee. For example, placing the delta on stable storage is safer than copying it to volatile memory on a redundant device.

In one aspect, a network administrator groups front-end volumes that must be mutually consistent into a WOF group. For example, all of the data and log volumes of a database can be placed in the same WOF group. In one embodiment, each WOF group is managed by some subset of the nodes and sites in the overall system. New writes to a WOF group are collected in the cache in the currently open delta Δ_i, and the back end of the cache produces the deltas.

With a WOF solution, the underlying storage can be seen to move atomically through a series of consistent states. If each site collects the entire global delta before applying it locally to storage, an inter-site link failure does not leave the storage in an inconsistent state: the delta is either applied as a whole or not applied at all. Thus, given the implicit time ordering provided by complete deltas, atomically writing each delta to the underlying storage preserves the WOF property whether or not the write of any given delta completes.

In one aspect, new writes to a WOF group are collected in the cache in the current WOF delta Δ_i, and the back end of the cache produces the deltas. A delta that is collecting new writes is considered "open". In one aspect, the decision to close the open delta, so that it stops accepting new writes and the delta pipeline advances, is made periodically based on some time or space constraint. The decision can also be made by an external trigger or as part of recovery from a system error condition. Closing a delta, opening a new one, and more generally advancing the delta pipeline is referred to as "delta rollover". In one aspect, the node that decides to close the delta sends or broadcasts a notification to the WOF group. As each node receives the notification, it begins delaying acknowledgments of new write operations. This ensures that every write order dependency either runs from Δ_i to Δ_{i+1} or is contained within Δ_i. Once all writes that were outstanding when the notification was received have completed, Δ_i is closed. Δ_{i+1} is then opened and begins collecting new writes while Δ_i proceeds to the exchange phase. To ensure that applications do not perceive a disruption in I/O, it is important to minimize the time during which writes to the WOF group are delayed. If a general ordered broadcast service imposes too much overhead on these broadcasts, a lighter-weight two-phase commit protocol can be used instead, which keeps the number of nodes involved in the commit protocol to a minimum.
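
The per-node side of a delta rollover can be sketched as follows. This is a single-node illustration under stated assumptions, not the implementation described here: the class `NodeRollover` and all of its fields are hypothetical, writes that arrive after the close notification are assumed to be placed in Δ_{i+1}, and held acknowledgments are released as soon as the local close completes, whereas a real system would wait for group-wide coordination.

```python
class NodeRollover:
    """Single-node view of a delta rollover (illustrative only)."""

    def __init__(self):
        self.open_delta = 0       # index i of the currently open delta
        self.closing = False      # True between close notification and close
        self.outstanding = set()  # writes in flight when the notification arrived
        self.in_flight = {}       # write id -> delta index assigned on arrival
        self.held_acks = []       # completed writes whose ack is being delayed
        self.acked = []           # writes acknowledged to the host, in order

    def start_write(self, wid):
        # Assumption: writes arriving during a close go to the next delta.
        delta = self.open_delta + 1 if self.closing else self.open_delta
        self.in_flight[wid] = delta
        return delta

    def complete_write(self, wid):
        delta = self.in_flight.pop(wid)
        if delta > self.open_delta:
            self.held_acks.append(wid)   # new write: delay the acknowledgment
        else:
            self.acked.append(wid)       # write belongs to the closing delta
            self.outstanding.discard(wid)
        self._maybe_close()

    def on_close_notification(self):
        if not self.closing:
            self.closing = True
            self.outstanding = set(self.in_flight)
            self._maybe_close()

    def _maybe_close(self):
        # The delta closes once every write outstanding at notification is done.
        if self.closing and not self.outstanding:
            self.open_delta += 1
            self.closing = False
            self.acked.extend(self.held_acks)
            self.held_acks.clear()


node = NodeRollover()
node.start_write("a")            # in flight in delta 0
node.on_close_notification()     # close of delta 0 requested
node.start_write("b")            # lands in delta 1
node.complete_write("b")         # finished, but the ack is held
acked_while_closing = list(node.acked)
node.complete_write("a")         # last outstanding write: delta 0 closes
```

Because the acknowledgment for `b` is held until `a` has completed, no host can issue a write that depends on `b` and have it land in the delta that is being closed.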

The rollover operation of the delta pipeline implies a periodic performance impact. Because the WOF consistency guarantee may be needed only in the event of a serious failure, an alternative implementation may choose to recover delta boundaries during error recovery, thereby removing the need for a global barrier during normal operation.

A recently closed delta can exist as fragments in the caches of the WOF group's access nodes. In one aspect, the cache coherency protocol guarantees that each fragment contains unique, non-overlapping "dirty" data. Each site assembles a complete copy of the delta by exchanging the blocks contained in these fragments, that is, the partial deltas. In one aspect, the atomic commit phase can be simplified if a site's copy of Δ_i resides on a single node. However, this places a strict limit on the global size of a delta and can complicate both the exchange phase and the decision to close a delta. In another aspect, the writes within a collected delta can be reordered to take advantage of overwrites or adjacent write segments.
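
As one way to picture the reordering of writes within a single delta, the sketch below keeps only the last write to each block and merges adjacent blocks into contiguous runs. The function name `coalesce` and the `(block_number, data)` representation are assumptions made for illustration; the description does not prescribe a particular algorithm.

```python
def coalesce(writes):
    """Reorder the writes of ONE delta.

    writes: list of (block_number, data) in arrival order.
    Returns a list of (start_block, [data, ...]) runs sorted by block number,
    keeping only the last write to each block.
    """
    latest = {}
    for block, data in writes:
        latest[block] = data  # a later write to the same block wins

    runs = []
    for block in sorted(latest):
        if runs and runs[-1][0] + len(runs[-1][1]) == block:
            runs[-1][1].append(latest[block])   # extend an adjacent run
        else:
            runs.append((block, [latest[block]]))
    return runs


runs = coalesce([(7, "a"), (3, "b"), (4, "c"), (7, "d")])
```

Reordering of this kind is safe only inside a delta, because the delta is applied to storage as a unit; it must never move a write across a delta boundary.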

In one aspect, a delta is made persistent before Δ_i can be committed. The persistence operation can overlap with the exchange phase. For a strong consistency guarantee, the data and associated metadata describing the delta can be written to stable storage. For a less strong guarantee, the data can be copied one or more times to other nodes at the same site, providing "protection copies". Such a protection approach can be faster but may impose additional memory constraints. A still weaker guarantee requires no persistence at all. Without persistence there is no protection against multiple failures, but data consistency can still be maintained in the event of a site disaster.

After Δ_i has been exchanged and made persistent, in one aspect the sites begin, in concert, to apply Δ_i to back-end storage. This cannot be done truly atomically and can be interrupted by failures. The persistent copy protects against these failures as long as some node is able to redo the commit. Once the commit of a delta has begun, an interruption of the inter-site link has no effect, because the operation is purely local to each site.

The introduction of WOF in one aspect affects only the order in which dirty data leaves the back end of the cache; it has no effect on the cache coherency protocol. A read request to any node always returns the most recent copy of the data. A cache write that hits a block in a closed delta cannot invalidate the earlier copy; instead, it creates a new copy in the open delta. For cache coherency purposes, this new copy shadows every older version of the block, including the persistent copy held in the underlying storage.

(Exemplary dependent WOF implementation)
In one aspect, an exemplary WOF implementation provides dependent write ordering and is fully active both across and within multiple sites. Such an implementation fully supports fully active WOF across sites and across the nodes within a site. The implementation can also be delta-based, with each delta containing a consistent batch of writes. The system collects a delta as applications write to the network controller nodes. At intervals on the order of about 5 to 30 seconds, for example, the system synchronizes the closing of the current delta both locally and globally. A new delta is opened immediately to begin collecting new writes. The closed delta is then written atomically to back-end storage. The most recently committed delta defines the recovery point in the event of a failure.

Such an implementation can also provide support for WOF consistency groups. Databases, journaling file systems, and similar applications often separate their data volumes from their metadata and log volumes. An administrator must be able to group these volumes so that WOF is provided across them as a set, not merely on each volume individually.

This section gives a more specific example of a WOF implementation. FIG. 2 shows an exemplary two-site system 200 comprising site A 202 and site B 204, with one leg 206 of a distributed RAID 1 (DR1) at site A and the other leg 208 at site B. Initially, each site collects writes into the currently open delta D1 210, shown schematically as the open box (210a, 210b) at the top of each site. The writes (W1, W2, W3, W4) are collected locally at each node of a site in accordance with the normal cache coherency protocol. U.S. patent application Ser. No. 11/177,924 (incorporated by reference above) discloses useful cache coherency systems and methods.

Eventually the system can decide to close the current delta, for any of several reasons. For example, a user-configurable timer may have expired; an administrator can use such a timer to bound the amount of data that could be lost if a failure occurs. Another possible reason is that an application calls an external API; the application can use this call to indicate points in time at which, from its point of view, its data is fully consistent. Yet another possible reason is that the system is running out of resources on one or more nodes, in which case one of the nodes collecting the delta may decide that the delta needs to be closed early.

In one aspect, the system synchronizes the closing of the current delta across sites so that the delta boundary respects the ordering consistency of dependent writes. In other words, the edge of each delta defines a consistent view of storage. As shown in FIG. 3, as soon as the current delta is closed, a new delta D2 300 can be opened to collect new writes from applications. The system then exchanges the partial deltas collected at each site over the inter-site link 302, so that sites A and B each have a complete copy of the closed delta (210c and 210d, respectively).

The complete closed deltas 210c, 210d can then be applied to the respective legs 206, 208 of the DR1 via appropriate links 400, as shown in FIG. 4. This application process is not necessarily an atomic operation; that is, it can be interrupted by various kinds of failure. Depending on the type of failure, the process of applying the delta may need to be restarted.

FIG. 5 is a flowchart showing the steps of an exemplary method 500 in accordance with such an approach. In this method, front-end volumes are grouped into WOF groups, each of which is implemented by a subset of the nodes and sites in the overall data system (502). New writes to any WOF group are stored in the cache in that group's open delta (504). On some trigger event, the current open delta is to be closed (506). A delta-close message is broadcast to the WOF group, so that any outstanding writes can be completed and the delta closed (508). A new delta is then opened to receive new writes (510). The recently closed delta, which exists as unique fragments on different nodes of the WOF group, goes through an exchange phase in which each site obtains a complete copy of the closed delta (512). The complete closed delta is made persistent by writing it to stable storage (514). After the closed delta has been made persistent, each site applies it to that site's back-end storage (516).
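
The lockstep progression of deltas through the stages of method 500 can be pictured with the following sketch. The class `DeltaPipeline`, its three slots, and the `rollover` method are hypothetical names; as a simplifying assumption, each rollover treats the exchange and commit work of the previous round as already finished.

```python
class DeltaPipeline:
    """Three-slot view of the pipeline: open, exchanging, committing."""

    def __init__(self):
        self.next_id = 1
        self.open = self._new_delta()  # collecting new writes (504)
        self.exchanging = None         # closed, being exchanged/persisted (512, 514)
        self.committing = None         # being applied to back-end storage (516)
        self.applied = []              # ids of deltas fully applied

    def _new_delta(self):
        delta = {"id": self.next_id, "writes": []}
        self.next_id += 1
        return delta

    def write(self, w):
        self.open["writes"].append(w)

    def rollover(self):
        # Assumption: the stages of the previous round have completed.
        if self.committing is not None:
            self.applied.append(self.committing["id"])
        self.committing = self.exchanging
        self.exchanging = self.open       # close the open delta (506, 508)
        self.open = self._new_delta()     # open a new one (510)


p = DeltaPipeline()
p.write("W1")
p.rollover()
p.write("W2")
p.rollover()
p.write("W3")
```

After the two rollovers above, delta 3 is open, delta 2 is in the exchange slot, and delta 1 is in the commit slot. A further rollover would record delta 1 as applied.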

Situation 600 of FIG. 6 shows the processes described above proceeding in parallel at the moment a link failure occurs. Writes are being collected for open delta D3, closed delta D2 is being exchanged, and committed delta D1 is being applied to disk, in this case to the legs of the distributed RAID 1 system at sites A and B respectively. When the link 602 used to exchange write data between the sites fails, front-end I/O is suspended, so no further writes are collected into D3. The fragments of D2 can no longer be exchanged, and each site continues to hold the "dirty" data for that delta until either the link recovers or an administrator declares that one site must resume operation before the link recovers. The committed delta D1, however, can continue to be applied to the legs, as shown in situation 700 of FIG. 7, because D1 no longer depends on an active link. Once this has completed, the legs of the DR1 are identical even though the link is down.

If the link failure is temporary, normal operation resumes when the link recovers. If dirty data has been lost because site B held the only copy of some writes that the system had acknowledged, sites A and B can discard the contents of deltas D2 and D3 and resume servicing application I/O. The storage contents seen from site A are then consistent with D1, the last delta that was successfully exchanged and committed. While writing only to its local leg of the DR1, site A can update a bitmap log so that the two legs of the DR1 can be resynchronized when the link is restored.

(Performance impact of WOF)
A WOF implementation can be a significant change to the flow of host writes from the front end to back-end physical storage. Several aspects of the implementation can affect the net performance of the system as perceived by hosts.

For example, in one aspect all nodes participating in a WOF group must close the current delta synchronously. This has an obvious performance impact, particularly in multi-site configurations. It should be noted, however, that during this synchronization interval writes proceed normally: incoming writes still receive their data from the host but are not acknowledged until the delta closure is complete. In addition, because changes are collected into deltas, they can be streamed over the inter-site link more efficiently than as smaller individual write operations.

As noted above, applying a committed delta to storage is not an atomic operation, so the operation is vulnerable to failures. A choice can be made as to which failures a committed delta that has not yet been applied to storage is protected against. This amounts to choosing how strong the consistency guarantee should be. There are several implementation options in this area, each with different performance trade-offs.

(Delta collection phase)
In one aspect, the first stage of the WOF pipeline is the collection of new writes into the open global delta. As a result of the coherency protocol at the front of the cache, no two dirty copies of the same block exist. Consequently, the intersection of the partial deltas from different nodes or different sites is the empty set, and the global delta as a whole is the union of all partial deltas.
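
The relationship between partial deltas and the global delta can be sketched as below. The function name `assemble_global_delta` and the dictionary representation of a fragment are assumptions for illustration only; the overlap check simply makes the disjointness property stated above explicit.

```python
def assemble_global_delta(partials):
    """Merge per-node fragments of one delta into the global delta.

    partials: mapping of node name -> {block_number: data}.
    Raises ValueError if two fragments hold the same block, which the
    cache coherency protocol is expected to rule out.
    """
    merged = {}
    for node, fragment in partials.items():
        overlap = merged.keys() & fragment.keys()
        if overlap:
            raise ValueError(
                f"blocks {sorted(overlap)} are dirty on more than one node"
            )
        merged.update(fragment)
    return merged


global_delta = assemble_global_delta({
    "A1": {1: "x"},
    "B1": {2: "y", 3: "z"},
})
```

Because the fragments are disjoint, the order in which they are merged does not affect the result.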

As dirty blocks are added to the open delta, they need to be linked to it so that they advance through the pipeline together with the rest of the delta in a consistent write order. To achieve this according to one embodiment, two items of metadata, used only for dirty data, can be added to the cache block data structure. The first is a delta identifier that stores the number of the delta to which the block belongs. This identifier is immutable. The delta number is a system-wide global number that is incremented each time a new delta is opened as the pipeline advances. The second is a delta list that keeps the cache block together with its peers from the same delta. A singly linked list is sufficient for this purpose.

Both metadata fields can be assigned as soon as a new write enters the open delta. In addition, to protect against node failure, when an incoming write is copied to another access node, the delta ID can be added to the metadata stored with the protection copy so that recovery is possible if the node that received the write fails.
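
The two metadata fields described above can be pictured with a small sketch. The names `DirtyBlock`, `DeltaList`, `delta_id`, and `next_in_delta` are hypothetical; the description states only that a delta identifier and a singly linked delta list are added to the cache block structure for dirty data.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DirtyBlock:
    block_no: int
    data: bytes
    delta_id: int                                  # assigned once, on entry
    next_in_delta: Optional["DirtyBlock"] = None   # singly linked delta list


class DeltaList:
    """All dirty blocks that belong to one delta on one node."""

    def __init__(self, delta_id: int):
        self.delta_id = delta_id
        self.head: Optional[DirtyBlock] = None

    def add(self, block_no: int, data: bytes) -> DirtyBlock:
        # Both fields are set as soon as the write enters the open delta.
        blk = DirtyBlock(block_no, data, self.delta_id, self.head)
        self.head = blk  # push on the front: constant time
        return blk

    def __iter__(self):
        blk = self.head
        while blk is not None:
            yield blk
            blk = blk.next_in_delta


delta5 = DeltaList(5)
delta5.add(10, b"a")
delta5.add(11, b"b")
```

Walking the list yields every dirty block of the delta, which is what the exchange and commit stages need; the order of traversal is not significant because writes may be reordered within a delta.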

When a new write arrives for a block that is already dirty in a cache somewhere in the system, one of at least two actions can be taken. If the dirty block is in the open delta, the dirty block can simply be invalidated. If the dirty block is in an older delta, however, invalidating it would violate dependent write consistency. In at least one embodiment, the new write must in this case be entered into the open delta, where it then shadows the old contents of the block for cache coherency purposes.
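
The two cases described above can be sketched with a toy cache that keeps, for each block, a list of versions tagged with their delta number. The functions `apply_write` and `read` and the list-of-versions layout are assumptions for illustration; a real cache would be distributed and would track far more state.

```python
def apply_write(cache, open_delta_id, block_no, data):
    """Apply a full-block write.

    cache: dict of block_no -> list of (delta_id, data), oldest first.
    """
    versions = cache.setdefault(block_no, [])
    if versions and versions[-1][0] == open_delta_id:
        versions.pop()  # dirty copy is in the open delta: invalidate it
    # Otherwise any older-delta copy is kept and is shadowed by the new one.
    versions.append((open_delta_id, data))


def read(cache, block_no):
    """Reads always see the newest version of the block."""
    return cache[block_no][-1][1]


cache = {}
apply_write(cache, 1, 9, "v1")   # first write, delta 1 open
apply_write(cache, 1, 9, "v2")   # same delta: v1 is invalidated
apply_write(cache, 2, 9, "v3")   # delta 1 now closed: v2 kept, v3 shadows it
```

After these three writes the cache holds `v2` for delta 1 and `v3` for delta 2, and a read returns `v3`. Keeping `v2` is what allows delta 1 to be committed with its own contents intact.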

When a new write arrives for a block that is already dirty in a cache somewhere in the system, and the write modifies only part of the block rather than overwriting it completely, special handling beyond that described above may be required. The data for the entire block being modified can be transferred to the node handling the write in such a way that the previous contents of the block cannot be lost to a node failure. One simple approach would be for the node holding the current dirty contents of the block to flush the block to storage before sending it to the node handling the new write, but this violates dependent write consistency and is therefore unsuitable for WOF. Instead, the cache protection mechanism described above can be extended, for example as follows. (1) The old contents of the block are transferred to the node handling the write, which creates a protection copy as described above. (2) The original holder of the block invalidates its copy of the block and the corresponding protection copy. (3) The node handling the write applies the data from the application to the block and updates its protection copy. This node must, however, observe the constraint described in the previous paragraph: an old write and its protection copy may simply be discarded only if they are in the open delta. (4) Finally, the application write is acknowledged.

In one aspect, every write request in the DMG needs to be sent together with the requester's delta ID. This can be achieved in the DMG/cache API, for example as in the following example.

In one aspect, the requester can protect the dirty copy of the block as soon as it is received from the directory, and then initiate distributed completion of the write before accepting the new data from the host. This approach has the advantage that the DMG lock and the remote invalidations at the sharers are completed as early as possible, making it less likely that they will hold up later distributed operations on the same page. Its drawback is that the writer must perform two protection operations before completing the write to the initiator: one before the host transfers the write data and one after. Alternatively, the requester can allow the write from the host to proceed and complete before initiating distributed completion of the write. Such an approach avoids the double protection operation.

(Collection during closure transition)
When the open delta is closed, a new delta is opened to collect the next batch of new writes. Because the pipeline transition at delta closure is not instantaneous, there is a transition period during which new writes are treated differently. Specifically, write data is accepted into the cache as normal, but the acknowledgment of write completion to the host is delayed until the transition period ends.

Delta closure is the state transition that moves a delta in the WOF pipeline from the open state to the exchange state. Events that can trigger a delta closure include a regular timer with a variable interval, node memory constraints, an external trigger API, and recovery from a system error condition. Any node can be the source of a closure trigger, although timer triggers should come from a single designated node only. Because multiple nodes decide independently and asynchronously to trigger a delta closure, the closure barrier mechanism should tolerate overlapping triggers. Delta closure triggers are described in more detail elsewhere herein.

In one aspect, delta closure can be coordinated using a distributed barrier mechanism such as a two-phase commit protocol. A barrier mechanism according to one embodiment comprises several stages. One such stage is a barrier start stage, in which a message is broadcast to all nodes of the WOF group. The message can be initiated at any node, which then becomes the leader for the rest of the barrier round. If there is a race and multiple nodes broadcast barrier start notifications, the first one wins (using an ordered broadcast service such as virtual synchrony).

Another stage is a barrier acknowledgment stage, in which each member of the WOF group sends a point-to-point message to the current round leader when that member has reached the barrier. A barrier acknowledgment message can carry a data payload, so information relevant to the barrier can be shared without additional communication overhead.

Yet another stage is a barrier end stage, in which the round leader sends a broadcast message once it has collected all outstanding barrier acknowledgment messages. The barrier end message includes, where applicable, the merged data from the barrier acknowledgment messages.
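
The three barrier stages can be sketched from the leader's point of view as follows. This is an in-process illustration only: the class `BarrierRound` and its methods are hypothetical, message delivery is replaced by direct method calls, and the ordered broadcast service is assumed to deliver start messages in a single agreed order.

```python
class BarrierRound:
    """One barrier round as seen by its leader (illustrative only)."""

    def __init__(self, members):
        self.members = set(members)
        self.leader = None
        self.acks = {}  # member -> payload, e.g. size of its partial delta

    def start(self, node):
        """Barrier start: the first start delivered wins; duplicates are ignored."""
        if self.leader is None:
            self.leader = node
        return self.leader

    def acknowledge(self, member, payload):
        """Barrier acknowledgment from one member.

        Returns the merged payloads to broadcast in the barrier end message
        once every member has reported, otherwise None.
        """
        if member not in self.members:
            raise KeyError(member)
        self.acks[member] = payload
        if set(self.acks) == self.members:
            return dict(self.acks)
        return None


rnd = BarrierRound(["n1", "n2", "n3"])
first = rnd.start("n2")    # n2 becomes the round leader
second = rnd.start("n1")   # overlapping trigger: n2 stays leader
r1 = rnd.acknowledge("n1", 4)
r2 = rnd.acknowledge("n2", 0)
r3 = rnd.acknowledge("n3", 7)  # last acknowledgment: barrier can end
```

Carrying a payload on each acknowledgment lets the barrier end message distribute merged information, such as the sizes of the partial deltas, without an extra round of messages.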

(Use of barriers)
In one aspect, the WOF implementation has a strict pipeline that advances in lockstep, so only a single barrier is needed to control pipeline advancement for each WOF group. A barrier can be started at any node by a delta closure trigger; that node becomes the barrier leader for the next round by broadcasting a barrier start message to all nodes. When a node receives this broadcast, it can increment the global delta ID so that all new write requests are pushed into the next delta. The node can also delay acknowledging the completion of new write requests to the host.

The node can wait until all in-progress write requests in the recently opened delta have completed and, if necessary, for the current exchange and commit stages to complete. The node can then send a message to the barrier leader to indicate that it is ready to proceed.

The exchange phase can be made more efficient by including information such as the size of the partial delta in the barrier acknowledgment message. Because the barrier can affect host applications by delaying write completions, however, its duration should be kept to a minimum. Once the barrier leader has collected all barrier acknowledgment messages, it can broadcast the barrier end message. As each participating node receives this notification, it can acknowledge to the host all writes into the new open delta that would otherwise already have completed, and allow further writes to complete normally. The node can then begin the exchange protocol for the previously open delta and the commit protocol for the previously exchanged delta, as described below.

(Exchange phase)
In one aspect, two things happen while a delta is in the exchange state. First, each site's partial delta is transferred to all other participating sites, so that each site has a complete copy of the delta. Second, each site makes its copy persistent, or safe, before beginning to commit it to back-end storage.

The degree of safety may range from an unprotected delta that is vulnerable to any failure to a mirrored on-disk journal. A journal not only protects against node and site failures but also means that data can be safely evicted from the cache. However, evicting safely exchanged blocks from the cache incurs the performance penalty of reading them back from the journal during the commit phase. Because of this penalty, an implementation that keeps the entire commit delta in memory until it has been committed to storage is preferred in at least one embodiment. In such an implementation, the journal is write-only unless it is needed for failure recovery.

One approach to the question of the degree of safety is to take a middle road with n-way protection. The problems with this approach include the high memory consumption of replication. Every site needs enough memory to store n replica copies of everything, not just the partial open delta and the two complete deltas in the exchange and commit states. Furthermore, in a system whose sites have disproportionate numbers of nodes, the smaller sites have smaller memory pools for the WOF pipeline, so the larger sites must leave some of the memory available for WOF unused. This problem occurs in any in-memory solution, but the additional memory used by the protection solution makes it worse.

In one aspect, if a node receiving a write cannot allocate protection space for the block being written, a simple way to handle the write is to push the written block to disk immediately, that is, to handle this individual write as a write-through. However, that approach violates dependent write ordering, so protection space must be reserved in advance. After a node failure, more protection space must be obtained to re-protect unsafe data, or the system must remain in degraded mode while the pipeline is flushed. The protection approach provides no safety guarantee for single-node sites, and it loses data integrity after a site failure or after certain n-node failures. With the journaling approach, data integrity is never lost.

For these and other reasons, journaling is preferred for delta safety in at least some embodiments. Every node can use a small amount of protection space to keep its partial open delta safe from node failure. Once a block has been made safe in the journals of all sites during the exchange phase, its protection copy is no longer needed and can be invalidated and reused. A simple node failure is handled by re-protecting unsafe data, reading the safe delta back from the journal, and restarting the exchange phase. A more serious failure that loses all copies of unjournaled blocks must be treated like a link failure: all sites continue to write out their commit deltas atomically, but open and exchange data are discarded and host applications may need to be restarted. Failure handling is described in more detail below.

(Memory provisioning)
In one aspect, each node at each site that has a leg of a distributed RAID 1 (“DR1”) implementation participating in a WOF group can have memory provisioned for the entire WOF pipeline. The cumulative intra-site memory usage for a WOF group is described here in terms of the total memory usage M and the memory W allocated to the WOF group.

In one aspect, when creating a WOF group, the administrator specifies a value of W that applies to all participating sites. If the requested value of W cannot be allocated, because other WOF groups are already consuming the available space or for some other reason, creation of the WOF group may fail while previously created WOF groups continue to operate normally. Alternatively, the WOF group may be created with insufficient resources and begin reclaiming resources from other uses. When sufficient resources have been acquired, the new WOF group can begin operation. The storage system may give the administrator the option of canceling WOF group creation if sufficient resources cannot be gathered after an extended time.

At an n-node site, M is distributed evenly across the participating nodes, and each node i is limited to p_i = P/n for its partial open delta. In addition, the exchange protocol distributes E/n of the complete delta to each node, and the RMG is most likely to place replicas on neighboring nodes, so that about R/n ends up on each node.

From the value provided for W, the maximum size of P can be calculated; it depends on the number of sites s in the WOF group. With more sites, fully exchanged deltas use a higher proportion of W. If the same upper bound is used for P at every site, the size of a fully exchanged delta can be defined as follows:

Accordingly, the maximum size of the open delta can be defined as follows:
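The two formulas referred to above are not reproduced in this text. As a hedged sketch only, suppose that W at each site must hold one partial open delta of size P plus two fully exchanged deltas (one in the exchange state and one in the commit state), and that a fully exchanged delta is bounded by s times P when every site uses the same bound. Under those assumptions W = P + 2sP, which gives the bound computed below. The function names are hypothetical, and the relationship itself is an assumption rather than a formula taken from the source.

```python
def exchanged_delta_size(p, sites):
    """Upper bound on a fully exchanged delta: one partial delta per site.

    Assumption: every site uses the same upper bound p for its partial delta.
    """
    return sites * p


def max_open_delta(w, sites):
    """Largest partial open delta P that fits in the WOF memory W.

    Assumption (not a formula from the source): W holds one partial open
    delta plus two fully exchanged deltas, so W = P + 2 * sites * P.
    """
    if sites < 1:
        raise ValueError("a WOF group needs at least one site")
    return w / (2 * sites + 1)
```

With two sites and W = 10 units, this model allows an open delta of 2 units, leaving 8 units for the two fully exchanged deltas of 4 units each. It also shows why more sites leave a smaller share of W for the open delta.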

This value is the theoretical maximum of P. Constraints imposed by replication may limit the open delta further. Each node i at a site is responsible for collecting p_i of the open delta. That data can be replicated until it has been made safe on the journal. The intra-site replication requirement for a WOF group is defined by the sum of the replication needs of each node at the site.

Each node replicates not only the contents of the current open delta but also the contents of the most recently closed delta, until that delta is made safe at the end of the exchange phase.

Replication space for a WOF group can be pre-allocated and readily available, so that replication requests do not fail. However, the ideal allocation is not always possible. Before I/O starts on a WOF partition, the RMG is asked to reserve r_i, and if the full request cannot be reserved, p_i is reduced accordingly. The revised value of p_i can then be used as the space constraint for the delta closure trigger.

(Memory reservation)
In one implementation, memory for the WOF pipeline, including memory for the associated protection copies, is reserved in advance, so that normal WOF operation never has to deal with temporary memory shortages. An alternative implementation could instead allocate memory to the WOF pipeline as needed.

(Exchange protocol)
The exchange protocol can ensure that every site has a local copy of the fully exchanged delta. During the delta closure protocol, each node can build a descriptor for the data in its partial delta and a second descriptor for the associated metadata, both of which are sent to the round leader in the barrier confirmation. The round leader then distributes these descriptors to all nodes participating in the WOF group in the barrier end broadcast. The nodes at each participating site then use these descriptors to fetch the delta fragments that their site is missing. In one implementation, these descriptors are realized as keys for Remote Direct Memory Access (RDMA) regions.

Two tokens, one for exchange and one for safety (for example, journaling), can be circulated among the nodes of a site. One at a time, each node can acquire the exchange token and use the previously distributed descriptors to fetch data from another site until its node exchange area is full. Because all nodes received the same descriptors in the barrier end broadcast, each node can simply start where the previous node left off and continue in order through the regions. Because the node exchange area may fill after only part of a remote region has been transferred, the fetching node can arrange for the transfer to stop on a block boundary. As data is received, it can be divided into blocks, which are assigned the appropriate delta id and added to the correct delta list as described above.

After finishing its part of the exchange, a node passes the exchange token to its neighbor. The safety token follows the exchange token. Nodes need not wait until the entire node exchange area is full; they may make their exchanged data (and their local partial deltas) safe in chunks. The safety token may be needed only for a safety protocol that must be serialized, such as a disk journal. After making its part of the exchange safe, a node passes the safety token to its neighbor. When the last node releases the safety token, the exchange phase is complete and the pipeline can advance. When the next barrier start broadcast is received, the descriptors from the previous exchange round can be discarded and new descriptors created for the next exchange.
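The way the exchange token divides remote regions among a site's nodes can be illustrated with a small planning function. This is a sketch under assumptions: `plan_exchange`, the 4096-byte block size, and the tuple layouts are hypothetical, region lengths are assumed to be whole blocks, and the token is modeled as making a single pass (a real system could circulate it again once nodes have made their data safe and freed exchange space).

```python
BLOCK_SIZE = 4096  # hypothetical block size in bytes


def plan_exchange(regions, capacities, block_size=BLOCK_SIZE):
    """Assign remote regions to nodes in exchange-token order.

    regions:    list of (region_id, length_bytes); lengths are whole blocks.
    capacities: list of (node_name, exchange_area_bytes) in token order.
    Returns {node_name: [(region_id, offset, length), ...]}.
    """
    plan = {name: [] for name, _ in capacities}
    region_idx, offset = 0, 0          # shared cursor: where the last node stopped
    for name, capacity in capacities:  # the token visits each node once
        # Round the usable space down so every transfer ends on a block boundary.
        room = (capacity // block_size) * block_size
        while room > 0 and region_idx < len(regions):
            region_id, length = regions[region_idx]
            take = min(room, length - offset)
            plan[name].append((region_id, offset, take))
            offset += take
            room -= take
            if offset == length:       # region finished, move to the next one
                region_idx, offset = region_idx + 1, 0
    if region_idx < len(regions):
        raise RuntimeError("exchange areas too small for the remote delta")
    return plan
```

Because every node sees the same descriptors in the same order, the only state that must travel with the token in this model is the cursor (`region_idx`, `offset`).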

(Safety protocol)
In one aspect, the safety protocol is responsible for ensuring that exchanged data is safe before the commit phase. If journaling is used for this purpose, the administrator can provide local disk space for journaling when the WOF group is created. Using mirrored disks is advisable to reduce the likelihood of journal failure. The journal requires very little space; for example, room for two fully exchanged deltas is sufficient. Once a delta has been committed, its journaled version is no longer needed.

The following diagram summarizes the format of the on-disk journal J.

Each named element of the journal starts on a block boundary and occupies an integral number of blocks. The contents of the named segments of a typical journal are as follows:
JM (journal metadata): identifies the associated WOF group and the start of the current commit delta, as follows:

Δ_i (delta): contains all the data and metadata needed to define a delta.
ΔS (delta start metadata): marks the start of a journaled delta and includes the following fields:

ΔE (delta end metadata): marks the end of an on-disk delta and includes the following fields:

C_i (delta chunk): represents a subset of the delta's data together with the metadata needed to recover the chunk from the journal after a failure.
CS (chunk metadata header): marks the start of a delta chunk and includes the following fields:

Data (chunk data): the block data for this chunk, in the same order as the chunkBlocks metadata.
CE (chunk metadata trailer): marks the end of a delta chunk and includes the following fields:
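The field tables for these journal elements are not reproduced in this text, so the following is only a rough sketch of how the block-aligned layout could be modeled. The block size, the header and trailer sizes, the one-block size assumed for the delta start and end markers, and all class and function names are assumptions made for illustration; only the rule that each named element starts on a block boundary and occupies a whole number of blocks comes from the description above.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 512  # hypothetical journal block size in bytes


def blocks_needed(nbytes, block_size=BLOCK_SIZE):
    """Whole blocks occupied by an element of nbytes, rounded up."""
    return -(-nbytes // block_size)


@dataclass
class Chunk:
    """A delta chunk: CS header, block data, CE trailer."""
    chunk_blocks: list   # block addresses, in the same order as the data
    data_blocks: int     # number of data blocks in this chunk

    def size_in_blocks(self, header_bytes=64, trailer_bytes=64):
        # CS and CE each start on a block boundary, so each is rounded up.
        return (blocks_needed(header_bytes) + self.data_blocks
                + blocks_needed(trailer_bytes))


@dataclass
class JournaledDelta:
    """A journaled delta: start marker, chunks, end marker."""
    delta_id: int
    chunks: list = field(default_factory=list)

    def size_in_blocks(self):
        markers = 2  # start and end markers, assumed to take one block each
        return markers + sum(c.size_in_blocks() for c in self.chunks)


def layout(deltas, jm_blocks=1):
    """Return {delta_id: starting block offset}, with JM placed at block 0."""
    offsets, cursor = {}, jm_blocks
    for d in deltas:
        offsets[d.delta_id] = cursor
        cursor += d.size_in_blocks()
    return offsets
```

An offset table of this kind is what a field such as commitDeltaOffset would need to point into, although the actual on-disk encoding is not given here.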

(Journal progression)
The journal can be written during the exchange phase. When the WOF pipeline advances, the journal also performs a delta switch. As each node receives the barrier end broadcast, it can update the commitDeltaOffset and commitDeltaId fields of JM to point to the most recently journaled delta, as a persistent indication that the global pipeline has advanced. Because every site writes this indication at the same time, every node at each site can perform the write as it leaves the barrier. With this requirement, the write can be interrupted only by the failure of an entire site. In that case, all surviving sites should protect all writes in their current commit delta by using a bitmap log.

Because the purpose of the exchange phase is to create replicas of the partial deltas, a simple failure handling mechanism is to restart the exchange phase. If a delta has been made safe but not fully committed, its data can be read back into memory from the journal.

Journal disk failures are rare because mirrored journal disks are supported. If the journal disk does fail, a temporary degraded mode results. The WOF guarantee is lost only if a node or site failure occurs before the situation is corrected. Degraded mode can end in at least two ways. First, if the administrator can correct the problem that caused the disk failure, journaling can resume. Alternatively, by reducing the maximum size of the open delta, the WOF pipeline is gradually flushed and the system switches to write-through (or write-back, at the administrator's preference) until the journal disk is healthy again.

(Flow control)
The network connecting the sites should obviously provide enough bandwidth to accommodate the average write throughput, but a system according to one embodiment is extended to tolerate bursts of I/O traffic. At least two strategies can be used, independently or together.

In a first exemplary strategy, the system can maintain a queue of deltas that have been closed but not yet exchanged, or “pre-exchange” deltas. If the exchange of blocks during the exchange phase does not complete before the next barrier triggers a pipeline advance, or if the memory allocated to the delta collection phase proves insufficient and forces a pipeline advance, a new open delta can be opened to accept new writes. The recently closed delta can be held in a first-in, first-out (FIFO) queue of “pre-exchange delta sets.” When the exchange phase completes, the oldest pre-exchange delta can be advanced into the exchange phase. The system can thus provide a buffer for short bursts of write traffic that exceed the capacity of the network.

In one aspect, this concept can be extended by combining two or more pre-exchange delta sets when the queue of pre-exchange deltas grows long. In this embodiment, pre-exchange deltas are combined in the same way at every node of the system, so that the resulting larger delta set represents a time range of writes that is consistent across the system. When two or more deltas are combined, the union of all written blocks can be used. If an individual block has been written more than once at a given node and therefore has incarnations in more than one delta set, the most recent incarnation (the block image from the newest pre-exchange delta set) can be used. Combining delta sets has several advantages: it reduces the number of times a commonly written block is sent, it creates a larger stream of transfers in the exchange phase, potentially yielding higher network efficiency, and it amortizes the cost of barrier operations over larger time intervals and data volumes. Potential drawbacks are the coordination cost of triggering and managing the combination of pre-exchange delta sets, and a slight increase in the amount of data that may be lost if a node is lost and operation must be restarted from the previous “post-exchange” delta image. This technique can therefore be used to advantage as a recovery mechanism when a long queue of pre-exchange delta sets shows that the system is falling behind.
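The FIFO queue of pre-exchange delta sets and the combining rule (union of blocks, newest incarnation wins) can be sketched as follows. The class name `PreExchangeQueue`, the `max_len` threshold, and the representation of a delta set as a dictionary from block address to block image are assumptions for illustration. The sketch shows only the local merge; in a real system every node would have to combine the same delta sets in the same way.

```python
from collections import deque


def combine_delta_sets(delta_sets):
    """Merge pre-exchange delta sets, given oldest first.

    Each delta set maps block address -> block image. The result is the
    union of all written blocks; where a block appears in several delta
    sets, the image from the newest delta set is kept.
    """
    combined = {}
    for delta in delta_sets:      # oldest to newest
        combined.update(delta)    # later incarnations overwrite earlier ones
    return combined


class PreExchangeQueue:
    """FIFO of closed but not yet exchanged delta sets (hypothetical)."""

    def __init__(self, max_len=3):
        self.max_len = max_len    # assumed threshold that triggers combining
        self.queue = deque()

    def close_delta(self, delta):
        """Queue a newly closed delta; combine everything if the queue is long."""
        self.queue.append(delta)
        if len(self.queue) > self.max_len:
            merged = combine_delta_sets(self.queue)
            self.queue.clear()
            self.queue.append(merged)

    def next_for_exchange(self):
        """Return the oldest pre-exchange delta, or None if the queue is empty."""
        return self.queue.popleft() if self.queue else None
```

Note that a block written in three queued delta sets is sent once after combining instead of three times, which is the bandwidth saving the text describes.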

In a second exemplary strategy, the rate at which write data is accepted from hosts into the WOF group is slowed, or “throttled.” In this method, a delay is inserted before a write is acknowledged to the host. The delay is increased or decreased to reflect how far the system has fallen behind in exchanging delta sets. Slowing individual writes tends to even out write performance, bringing the delivered performance back in line with the current network capacity.
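One simple way to derive such a delay is to scale it with the backlog of delta sets awaiting exchange and to cap it so that host write timeouts are not exceeded. The function below is a sketch of that idea only; its name and the `target`, `step_ms`, and `max_ms` parameters are illustrative tuning knobs, not values taken from the source.

```python
def ack_delay_ms(queued_deltas, target=1, step_ms=2.0, max_ms=50.0):
    """Delay, in milliseconds, to insert before acknowledging a host write.

    queued_deltas: number of closed delta sets still waiting to be exchanged.
    target:        backlog that is tolerated without any throttling.
    step_ms:       extra delay added per delta set beyond the target.
    max_ms:        cap that keeps the delay below host write timeouts.
    """
    backlog = max(0, queued_deltas - target)
    return min(max_ms, backlog * step_ms)
```

For example, with the defaults above a backlog of four queued delta sets yields a 6 ms delay, and the delay falls back to zero as soon as the backlog returns to the target.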

Using both methods provides, according to one embodiment, a solution that tolerates short bursts without loss of system performance while also giving the system a mechanism for handling a sustained excess of writes without causing applications with short write timeouts to fail.

Another solution treats the box essentially as a DR1 through every stage of the dependency WOF protocol except commit. In the commit phase, the node does nothing except refuse to allow the data to be evicted; the data is retained until it has been replicated to the passive site. If a site disaster then occurs at the active site, the host's view of the passive site's storage, as seen through the node front end, is consistent. At that point, the retained deltas can be written to back-end storage. One potential drawback of this approach is that dirty data may be transferred over the inter-site link twice: once to exchange the partial deltas, and once when the active SRDF site pushes the batch to the passive site.

(WOF subcomponents)
In one aspect, the WOF component consists of a number of separate subcomponents. One such subcomponent is referred to herein as the “wof server.” A simplified version of the wof server can be used, but the wof server can also be responsible for group changes and failure handling, or it can simply provide an NMG broadcast mechanism. Messages sent through this service can be processed by all nodes, while the “wof client” code determines group membership on the fly and acts accordingly. The wof client can provide a way for the local cache to register an AMF partition and obtain a WOF group handle. An API such as the following would suffice.

Another subcomponent is a separate general-purpose barrier mechanism that uses both COM and NMG, described below. This barrier is used only within the WOF component. Yet another subcomponent is the delta id generator. The global delta id generator can support rollover and delta generation comparison. It can export an API such as the following to the cache:
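The API listing itself is not reproduced in this text. As a hedged illustration of what rollover and generation comparison might involve, the sketch below uses a wrapping counter with serial-number-style comparison. The class name, method names, and the 16-bit default width are assumptions, not the patent's interface.

```python
class DeltaIdGenerator:
    """Wrapping delta id counter with generation comparison (hypothetical API)."""

    def __init__(self, bits=16, start=0):
        self.modulus = 1 << bits
        self.half = self.modulus >> 1
        self.current = start % self.modulus

    def advance(self):
        """Increment the global delta id, rolling over at the modulus."""
        self.current = (self.current + 1) % self.modulus
        return self.current

    def compare(self, a, b):
        """Return -1, 0, or 1 if delta a is older than, equal to, or newer than b.

        Only meaningful while the two ids are less than half the id space
        apart, which holds when just a few deltas are in the pipeline.
        """
        diff = (a - b) % self.modulus
        if diff == 0:
            return 0
        return 1 if diff < self.half else -1
```

Because only a handful of deltas (open, exchange, commit, and any queued pre-exchange deltas) are live at once, even a small id space is enough for the comparison to stay unambiguous across rollover.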

To keep the trigger mechanism separate from the cache, the WOF component can be responsible for all trigger decisions, whether they depend on a periodic timer, memory constraints, or a user command (WOF trigger). Because no space needs to be reserved for exchanged partial deltas at a single site, the memory constraint at the first milestone can be chosen arbitrarily. This means that an API such as the following can be used.

The last function is used in a strict pipeline, so that WOF can wait until the appropriate time to pass the closure barrier. At a later milestone, when the exchange protocol exists, a similar function call can be used to signal exchange completion.

(AMF abstraction)
An abstraction layer can be used on top of AMF for delta writes. It can allow WOF to distinguish between a local AMF and a DR1, and it can handle bitmap logging if necessary. An API addition such as the following can be used, to be called by the cache in place of amf_write:

(Extensions of the delta set concept)
The following sections extend technologies such as those disclosed in the following patents and patent applications on cluster controllers, geographic storage, cache coherency, and virtualization, each of which is incorporated herein by reference in its entirety:
6,148,414; 6,912,668; 6,857,059; 5,875,456; 60/586,364; US-2003-0188655-A1; US-2001-0049740-A1; and US2005-0071545A1.

(Active/passive support)
As described above, the dependency WOF implementation allows active access to data at one or more sites, so that any site can actively read or write data that is distributed asynchronously among the sites. One approach to extending the delta set concept is to recognize that, for any given volume at any given point in time, only one of the many possible writers may actually be writing actively to a given WOF group. Multi-node reading with single-node writing over some period is an equivalent case. If this condition can be detected dynamically, the storage system can be optimized to reduce the cost of broadcasting deltas and to minimize the amount of data lost when the system restarts at a previous delta.

If only one site is writing (the instantaneous primary site), the WOF solution can behave like previous WOF solutions. In one example, shown as situation 800 in FIG. 8, the instantaneous primary site A 802 survives and site B 804 fails. Site A can continue processing data without interruption or data loss. The main improvement over previous WOF solutions is that the other sites can continue to read active data with guaranteed data coherency. This concept incorporates and extends the concept of coherency among nodes within a site and across geographies.

Another advantage is that the definition of which site is primary can be highly dynamic. A site can be the instantaneous primary site if it is the only site that has written to any asynchronous delta set (open or closed). This implementation may require a site to broadcast, on its first write to a new delta set, that its partial delta set is dirty. The write that dirties the partial delta set can be held until all sites have acknowledged the notification. A surviving site therefore knows how many delta set levels, if any, must be rolled back to ensure data integrity. If none, processing can continue without an application restart or data loss. Even when the surviving site cannot be declared the instantaneous primary site, data loss is minimized by rolling back only to the last delta set for which the site can be considered the instantaneous primary site.
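The rollback decision can be sketched from the dirty notifications alone. The function below is one interpretation under assumptions, not the patent's algorithm: it assumes that each asynchronous delta set carries the set of sites that announced a dirtying write, and that any delta set dirtied by a lost site, together with every newer delta set, must be rolled back. The name `rollback_levels` and the data layout are hypothetical.

```python
def rollback_levels(dirty_by_delta, survivors):
    """Number of newest asynchronous delta sets the survivors must roll back.

    dirty_by_delta: list, oldest first, of the sets of sites that announced
                    a dirtying write to each asynchronous delta set.
    survivors:      sites that are still reachable after the failure.
    Returns 0 when no lost site wrote to any asynchronous delta set, in
    which case processing can continue without rolling back.
    """
    survivors = set(survivors)
    for index, writers in enumerate(dirty_by_delta):
        if writers - survivors:   # a lost site wrote into this delta set
            return len(dirty_by_delta) - index
    return 0
```

A result of zero corresponds to the case in which the surviving sites hold every write in the asynchronous delta sets and can act as instantaneous primary sites.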

(Participating nodes without a local leg of the DR1)
The embodiments above describe participating nodes at various sites, each of which has a local leg of the DR1. The fully active WOF concept does not actually require every node to have local back-end data storage (a local leg of the DR1). This is particularly useful when a satellite site wants read/write access to the data without the cost of maintaining a complete, locally mirrored copy of the data volume.

One embodiment allows such “satellite nodes” to participate in a WOF group. In this case, a satellite node creates open deltas and manages incoming writes in exactly the same way as a node at a site with a local DR1 leg. Satellite nodes participate in barrier operations in exactly the same way as other nodes. During the exchange phase, however, a satellite site needs to participate only halfway: changes from other participating nodes do not need to be copied to the satellite. Likewise, a satellite node does not need to participate in the commit phase.

Note that a given node can participate in satellite operation (without a DR1 leg) for some of the WOF groups it manages while still maintaining a DR1 for others.

(Explicit passive sites)
Another extension involves dynamically determining that a site is not writing and then treating that site as a passive site until the first write from it is detected. On a write from a site that has been set as “passive,” a message can be broadcast to all sites so that they know the formerly passive site is now an active participant. In one aspect, the system waits for a short interval, such as three to five delta rollovers, and determines that a site that has not written during that period should be considered passive until it later writes. The cost of this approach includes the need to broadcast when the site becomes active again, so it is desirable to set a site as passive only if it is likely to remain passive for at least some time.
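The dynamic passive-site rule can be sketched as a small tracker that counts delta rollovers since each site last wrote. This is an illustration only: the class `SiteActivity`, its method names, and the default of three idle rollovers are assumptions (the text suggests a range of three to five).

```python
class SiteActivity:
    """Tracks which sites are currently treated as passive (hypothetical)."""

    def __init__(self, sites, idle_rollovers=3):
        self.idle_rollovers = idle_rollovers   # rollovers without a write
        self.idle = {site: 0 for site in sites}
        self.passive = set()

    def on_write(self, site):
        """Record a write; return True if an 'active again' broadcast is needed."""
        self.idle[site] = 0
        if site in self.passive:
            self.passive.discard(site)
            return True
        return False

    def on_delta_rollover(self, wrote):
        """Advance one delta; `wrote` is the set of sites that wrote in it."""
        for site in self.idle:
            if site in wrote:
                self.idle[site] = 0
            else:
                self.idle[site] += 1
                if self.idle[site] >= self.idle_rollovers:
                    self.passive.add(site)
```

A larger `idle_rollovers` value reduces how often a site flips between passive and active, which matters because each return to active status costs a broadcast.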

Operationally, it can be useful to designate a site as passive explicitly, by configuration or by operator command. This can remove the need for that site to be included in the first-write broadcast. Besides the operational advantage, this shortens the latency of the first write in an interval.

Mixed scenarios exist in which some sites are explicitly declared passive while potentially several others are not. Notification among the implicitly passive sites is therefore still performed, as described in the preceding section, in order to determine the instantaneous primary sites. A site can also be declared "explicitly active", with an equivalent effect.
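The implicit and explicit passive-site rules described in this section can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the names `SiteState`, `PassiveTracker`, and `PASSIVE_AFTER_ROLLOVERS` are hypothetical, and the threshold of 3 rollovers is one value from the 3 to 5 range mentioned in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

# Assumed threshold; the description suggests waiting 3 to 5 delta rollovers.
PASSIVE_AFTER_ROLLOVERS = 3


@dataclass
class SiteState:
    explicit: Optional[str] = None   # "passive", "active", or None (implicit)
    idle_rollovers: int = 0          # consecutive rollovers without a write
    implicit_passive: bool = False   # set only for sites with no explicit role


@dataclass
class PassiveTracker:
    sites: Dict[str, SiteState] = field(default_factory=dict)
    broadcasts: List[str] = field(default_factory=list)  # stand-in for messages

    def on_rollover(self, writers: Set[str]) -> None:
        """Called at each delta rollover with the sites that wrote in it."""
        for name, state in self.sites.items():
            if state.explicit is not None:
                continue  # explicitly declared sites are never re-classified
            state.idle_rollovers = 0 if name in writers else state.idle_rollovers + 1
            if state.idle_rollovers >= PASSIVE_AFTER_ROLLOVERS:
                state.implicit_passive = True

    def on_write(self, site: str) -> None:
        """A write from an implicitly passive site triggers a broadcast."""
        state = self.sites[site]
        if state.explicit is None and state.implicit_passive:
            state.implicit_passive = False
            state.idle_rollovers = 0
            self.broadcasts.append(site)  # tell all sites it is active again
```

Explicitly declared sites bypass both the idle counter and the broadcast, which is the latency saving the text attributes to explicit declaration.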

(Partial delta set synchronization after failure)
One corollary of maintaining the "dirty delta set" bits described above is that it can be determined immediately whether synchronizing the partial delta sets (asynchronous delta sets) can minimize data loss and, in some cases, establish an instantaneous primary site situation.

As shown in situation 900 of FIG. 9, the loss of site C 906 implies neither a loss of data nor a loss of application backup, because synchronizing deltas D2 and D3 between sites A 902 and B 904 allows both A and B to become instantaneous primary sites. If a passive writer fails, no data is lost and the system can simply continue.

(Synchronous delta sets)
The WOF described here can handle asynchronous data transfer, synchronous data transfer, and a combination of the two. For example, when the distance to a data center is about 100 km, the speed-of-light latency is not very long. Because it is desirable to have two protected copies of the data image at any given time, a write taken from a host is immediately inserted into a synchronous delta set. Where two sites participate synchronously and a third site participates asynchronously from halfway across the continent, a host can issue a write and the block is immediately inserted not only into the local delta set but also into that of the synchronous partner. The system then returns and acknowledges the write, which signifies that the written data is safe. When data is replicated through a synchronous delta 100 km away, the acknowledgement issued once the contract is fulfilled indicates that the data exists in cache not only at the local site but also 100 km away. The asynchronous image crosses the country some time later. From that standpoint, the data is vulnerable if, for example, all data storage on the west coast is lost, but if only one area is lost, the copy in the other area is safe. Because every write is synchronously mirror-protected between those two sites, either site can be lost and the other, which still holds a copy of the data, can be used to push the data asynchronously across the country.

In some embodiments it is convenient to declare a group of synchronous delta sets, so that a write to any open partial delta set is synchronously written to the other partial delta sets in the same delta set synchronization group. In one aspect, write replication can be implemented in a manner consistent with the delta set concept. For the purposes of this description, write replication means placing write-dirty data in two or more independent pools of cache memory before returning "write complete" to the host, in order to protect that data from loss due to the failure of any given node. In another aspect, delta set synchronization groups implemented across geographic groupings can be a convenient way to improve tolerance of site failures. For example, two sites that are relatively close together can be declared members of one synchronous delta set group, while other sites in another geographic region have a group of their own. In this way any one site can be lost without data loss or interruption of operation, while the impact of inter-region latency is kept to a minimum.

In diagram 1000 of FIG. 10, the two west coast sites 1002, 1004 are declared members of a first delta set synchronization group 1010. Similarly, the two east coast sites 1006, 1008 are declared members of a second delta set synchronization group 1012. A write to any given partial delta set is replicated synchronously to the other partial delta set in its delta set synchronization group (the replication completes before the write returns as complete). Data is distributed across the continent using the normal post-close delta set data push operation. This can also benefit from the persistent view of delta sets described below. As with other delta set behaviors, delta set synchronization groups can be defined on a virtual-volume-by-virtual-volume basis.

(Cascading synchronous delta set replication)
Because the delta set that receives a write "pushes" the data to other delta sets, the relationship need not be symmetric. For example, box "A" may be required to replicate synchronously to box "B", while "B" has no requirement to replicate to "A" other than through the exchange of partial delta information as part of normal post-close operation.

This also permits cascaded operation. For example, a write to partial delta "A" results in synchronous writes to "B" and "C", which in turn result in synchronous writes from "B" to "D" and "E" and from "C" to "F" and "G". One case where this is useful is fanning a write out first to multiple nodes within a site and then across multiple sites.
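The asymmetric, cascaded push described above can be sketched as a walk over a directed replication topology. This is a minimal in-memory sketch under stated assumptions: `cascade_write`, the `topology` mapping, and the per-node `caches` dictionaries are hypothetical stand-ins for partial delta sets and their synchronous replication links.

```python
def cascade_write(topology, caches, node, block, data):
    """Place a write in `node`'s partial delta set, then in every downstream one.

    topology: dict mapping a node to the nodes it must replicate to synchronously.
    caches:   dict mapping a node to its partial delta set (block -> data).
    Returns only after every downstream copy exists, which models returning
    "write complete" to the host after replication has finished.
    """
    caches.setdefault(node, {})[block] = data
    for child in topology.get(node, ()):
        cascade_write(topology, caches, child, block, data)
    return True


# The example from the text: A pushes to B and C; B to D and E; C to F and G.
TOPOLOGY = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F", "G"]}
```

Because the topology is directed, a write received at "B" reaches "D" and "E" but is not pushed synchronously back to "A"; "A" would see it only through the normal post-close exchange.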

(Making closed delta sets cache-safe)
As described above, fulfilment of the contract for a write indicates that the data can be regarded as operationally safe. In a conventional single-site storage system, the administrator makes a conscious choice that balances performance against data safety. Specifically, the administrator chooses among write-through, write-back, and write-back with cache replication. Write-back with cache replication can be extended with the possibility of n-way replication of writes for additional protection.

The delta set architecture allows a similar level of operational flexibility in the trade-off of when a delta set is safe. The lineage includes the notion that a delta set is considered "safe" when any one of the following is satisfied:
• All dirty data has been written to all disk-resident mirror images.
• Dirty data has been written to local disk and to the cache of a remote site.
• Dirty data has been replicated among the partial delta sets in n caches at the local site and in m caches at each of p sites.
• Dirty data has been replicated to n caches at the local site.
• The delta set is considered safe as soon as it is closed.

In the above, replicating to n caches means exchanging among the partial delta sets located in each of the n caches. There is no requirement that the delta set be host-exported on all of these nodes. Some instantiations of a partial delta set can be used purely to protect dirty data from node failure.
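The five "safe" criteria listed above can be read as a configurable predicate over the replication state of a delta set. The following is a sketch under assumptions: the policy names, the `ReplicaState` fields, and the defaults for n, m, and p are all hypothetical and chosen only to mirror the list.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ReplicaState:
    closed: bool = False                 # delta set has been closed
    on_all_disk_mirrors: bool = False    # written to every disk-resident mirror
    on_local_disk: bool = False          # written to local disk
    local_caches: int = 0                # cache copies at the local site
    remote_caches: Dict[str, int] = field(default_factory=dict)  # site -> copies


def is_safe(policy: str, st: ReplicaState, n: int = 2, m: int = 1, p: int = 1) -> bool:
    """Return True when the delta set is 'safe' under the chosen policy."""
    if policy == "all_disks":
        return st.on_all_disk_mirrors
    if policy == "local_disk_remote_cache":
        return st.on_local_disk and any(c >= 1 for c in st.remote_caches.values())
    if policy == "n_local_m_remote":
        sites_ok = sum(1 for c in st.remote_caches.values() if c >= m)
        return st.local_caches >= n and sites_ok >= p
    if policy == "n_local":
        return st.local_caches >= n
    if policy == "on_close":
        return st.closed
    raise ValueError(f"unknown policy: {policy}")
```

Ordering the policies from `all_disks` down to `on_close` roughly trades decreasing protection for decreasing write latency, analogous to the write-through versus write-back choice in a single-site system.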

(Integrating snapshots with delta sets)
As described elsewhere in this document, a snapshot is a logical point-in-time image of a live data set. Snapshots can be used for functions such as preserving a backup window. When taking a backup, for example, it is undesirable to shut the storage system down for a long time in order to back up the data. What is wanted is a point-in-time image of all the data, so that the data is consistent across the entire backup tape.

A snapshot should be time-consistent, that is, it should reflect a single point in time. For some applications, such as databases, that point in time should also correspond to a point at which the application is in a safe state. For example, a database can perform a series of reads and writes and then do a commit, which flushes the database's data and commits it to the database. In this case the snapshot can correspond to the commit point. The commit can be taken when the acknowledgement comes back from the storage system for the last write made for that commit point. An agent or other triggering event can be used to trigger the snapshot, as is known in the art. Snapshots today are typically implemented at the storage system layer.

These snapshots can be combined advantageously with delta sets. Even when storage is distributed geographically, the point-in-time image of the data needs to be WOF-consistent. In one aspect, a snapshot can be triggered by an agent or by a timer, corresponding for example to a point in time. As soon as the snapshot request is received, a delta set rollover can be triggered, producing a point-in-time image within the domain that corresponds to the desired snapshot point. The system can start a new delta set for new I/O and allow the previous delta set to flow through to the commit point. Once the commit point is reached and the writes have been made to that open delta, the delta passes through the delta set pipeline. When all of the writes have completed, the system itself triggers the snapshot image. In one aspect, the snapshot can be triggered against the underlying storage devices, which take a physical snapshot based on what is actually on disk. In another aspect, the delta set can be retained in association with that point in time.

Each completed delta set represents the volume of data at a consistent point in time. A logical snapshot (a point-in-time logical image of a volume) can be implemented simply by providing a pointer to a prior delta set and allowing a volume to be exported based on the consistent data image as of the close of that delta set.

It is important to allow snapshots to be triggered by external sources such as applications. For this reason, an interface for closing a delta set, as described below, can be provided.

Snapshots can also be exported read/write. From the delta set point of view, this creates a family tree of delta sets, each branch representing a lineage that evolves from the originating delta set. From the user's point of view, the result looks just like a lineage of conventional logical snapshots.

It is also possible to provide an image that starts out as a logical snapshot but becomes a physical (rather than logical) copy of the data through background copying. This can be done by first creating a logical snapshot as described above and then, in the background, "cloning" the parent delta set onto separate media.

(Extending delta sets to disk images)
As described elsewhere in this document, the system in one aspect can have three delta sets open at any given time, each representing a different point in time. One delta set represents the point in time at which data is finally being committed to disk, one represents the point in time of data that is about to be committed to disk, and one represents the point in time at which the exchange begins. There can also be many open deltas across the system. Instead of creating a single large base image into which all of these deltas are collapsed, the system can retain many open deltas and thereby many views of the data at various points in time. In such a system, when it is desired to go back to a particular point in time, the system can simply roll back to the appropriate delta by specifying the point in time t and the appropriate image.

In one aspect, an implementation can use techniques such as continuous data protection (CDP). Many delta sets are created, each representing a point in time, so that reading any given delta together with the deltas behind it in time presents a picture of the data as it was at that point in time, thereby implementing CDP. Delta sets are associative, and as delta sets age the need for fine-grained points in time decreases. Delta sets can therefore be collapsed associatively, combining them more coarsely as time passes and reducing the total amount of storage used.

The snapshot concept can be extended to CDP, in which case delta sets can be merged into delta sets whose granularity changes over time. To this end, deltas can be taken out of cache and written to disk. This creates delta sets that are held both in cache and on disk, that is, it extends delta set storage onto disk. In doing so the system can implement both snapshot functionality and CDP-style functionality. Both delta sets and partial delta sets can reside in random access memory or on disk (or, for that matter, on any medium).
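The point-in-time read described in this section can be sketched as replaying closed delta sets over a base image. The sketch assumes that the deltas "behind" a given delta are the older ones, and it models a delta set as a plain mapping from block number to data; `view_at` and the `(close_time, changes)` tuples are hypothetical.

```python
def view_at(base, deltas, t):
    """Return the volume image as of time t.

    base:   dict block -> data, the image before any retained delta set.
    deltas: list of (close_time, changes) ordered oldest first, where
            changes is a dict block -> data written during that delta set.
    Only delta sets closed at or before t contribute to the view.
    """
    image = dict(base)
    for close_time, changes in deltas:
        if close_time > t:
            break
        image.update(changes)  # newer delta sets overwrite older blocks
    return image
```

Retaining more delta sets (in cache or on disk) gives more candidate values of t; collapsing older ones reduces storage at the cost of coarser recovery points.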

(Merging delta sets (using delta set associativity))
If delta sets are regarded as a series of point-in-time images (snapshots) of the data, then a lineage of delta sets is associative. Older delta sets can be combined without changing the view exported by newer delta sets.

This makes it possible to create a lineage of point-in-time images that represents relatively fine time increments for the more recent past. As delta sets "age", they can be merged together to form coarser time increments, reducing the overhead associated with maintaining the point-in-time images.

The process of "merging" delta sets will be apparent to those skilled in the art in light of the teachings and suggestions contained in this document. It is essentially a "union" operation, except that where changes to a given block exist in more than one delta set, the most recent change is used.
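The merge rule just stated (a union in which the most recent change to a block wins) can be sketched as follows, again modeling a delta set as a mapping from block number to data. The function names are hypothetical.

```python
def merge_delta_sets(*deltas):
    """Merge delta sets given oldest first; the latest change to a block wins."""
    merged = {}
    for delta in deltas:
        merged.update(delta)
    return merged


def exported_view(base, deltas):
    """Image seen after applying `deltas` (oldest first) on top of `base`."""
    return merge_delta_sets(base, *deltas)
```

Because the operation is associative, merging the two oldest delta sets first gives the same exported view as applying all three in order, which is what allows aged delta sets to be collapsed into coarser time increments without disturbing newer views.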

(Remote importers)
Sites that import WOF storage remotely but have no local DR1 leg can have special behavior. Like other full participants, these sites contribute to s for the computation of P. In one aspect, both P and R are computed as normal, but E = C = 0 because there is no local storage to write to. The nodes of the importing site take no part in the exchange or commit phases beyond supplying their RDMA keys in the barrier acknowledgement message during delta close.

(Integrating YottaDisk with delta sets)
As referred to herein, and as disclosed in US Pat. No. 6,857,059 (February 15, 2005), entitled "Storage virtualization system and methods", a YottaDisk is a demand-mapped virtual disk image of arbitrarily large size (for example, up to 10^24 bytes) that is presented to a host such as an end user. In one embodiment, for example, a mapping from the virtual disk image to back-end physical storage is created dynamically as a result of I/O operations, such as write operations, performed against the storage. Remapping of storage allows back-end storage to be managed without impact on users and allows multiple back-end partitions to be combined to present a single virtual image. The disk image presents a potentially very large image to the user, isolating the user from volume resizing issues and permitting generous usage. The image may be supported by a management system that controls usage and growth rates and provides the ability to maintain core system processes such as creating, deleting, and mounting other candidate disks.

A YottaDisk can be implemented with delta sets using a similar mechanism. Because storage may need to be allocated dynamically, a back-end storage allocator can be used for both processes, allowing YottaDisk and delta sets to be implemented at essentially the same time. Using the mechanisms described in US Pat. No. 6,857,059 (incorporated herein by reference), a delta set can represent a sparse disk image that is demand-mapped and released based, for example, on block references. The difference is the lineage of temporal representation that delta sets provide.

(Closing delta sets dynamically)
Several mechanisms can be provided for closing a delta set and thereby opening a new one. Such mechanisms include, for example:
• A set time interval
• An interval based on the number of transactions
• An interval based on the number of changes (writes) in the delta set or partial delta
• An application trigger
• An operator-induced trigger
• A trigger induced by another subsystem
• A trigger induced by an error condition
• A combination of the above

For a set time interval, the length of the interval can be adjusted as conditions change. For example, an overloaded network can be a reason to lengthen the lifetime of delta sets in order to reduce the impact on the network. A period of "high alert" can trigger finer-grained delta sets to reduce the potential for data loss during a period of high data dependence.
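The close mechanisms listed above can be combined into a single decision function. The sketch below is illustrative only: `ClosePolicy`, its default limits, and the scaling factors applied for an overloaded network (interval doubled) or a high-alert period (interval quartered) are assumptions, since the text gives no concrete values.

```python
from dataclasses import dataclass


@dataclass
class ClosePolicy:
    max_age_s: float = 30.0   # assumed base time interval
    max_writes: int = 10_000  # assumed change-count limit

    def should_close(self, opened_at: float, writes: int, now: float,
                     external_trigger: bool = False,
                     network_overloaded: bool = False,
                     high_alert: bool = False) -> bool:
        """Decide whether the open delta set should be closed now."""
        limit = self.max_age_s
        if network_overloaded:
            limit *= 2   # longer-lived delta sets reduce network impact
        if high_alert:
            limit /= 4   # finer delta sets reduce potential data loss
        return (external_trigger                 # application/operator/subsystem/error
                or writes >= self.max_writes     # count-based interval
                or (now - opened_at) >= limit)   # time-based interval
```

Passing `now` explicitly rather than reading a clock keeps the decision deterministic, so any node can evaluate it independently.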

If any node can potentially trigger a delta set turnover, these triggers need not be consistent across the various nodes.

(Time firewall)
Minwen Ji et al., "Seneca: remote mirroring done write" (Proceedings of the USENIX Technical Conference, pages 253-256, June 2003), presented the concept of a time firewall. One embodiment described herein extends that concept to include both a time-delayed, read-only view of a data volume and multi-writer, fully active, geographically distributed access to the current view of the data. In another aspect, the various WOF optimizations are integrated with the concept.

One of the concerns with interconnecting multiple sites is that logical errors (as opposed to physical failures) can propagate quickly between sites. For example, a virus introduced at one site can quickly infect every site that shares the data image. The point-in-time/delta set concept described above can provide an efficient mechanism for offering a "safe" window onto the evolution of the data.

Sites operating in an active/passive manner can provide a read-only portal onto the data based on a point-in-time image (delta set) that lags behind the active data. The read-only image can advance automatically, delta set by delta set, maintaining a predefined "safe" time interval. In this way normal processes continue to run on the "real" data, where logical failures or corruption of the data can be detected. If such an event is detected, the advance of the "safe" read-only image is suspended until the problem is corrected.

A "remote" site (or any site) can have, open at the same time, both a current delta set image that behaves like any other multi-writer delta set and a "safe" window onto an earlier delta set.

In a specific example, two entities may wish to collaborate and share data freely, yet want to avoid a situation in which a virus introduced into the data of one entity then spreads to the other. In this example it is therefore undesirable simply to have one large data repository. A solution according to one embodiment allows one of the entities to write only to a partition and then export the data to the other agency, so that each entity has a different view of the data. One view is a synchronous image of the data, just as if there were simply two normal WOF sites. At a time-delayed point there is a second volume that lags by some amount of time (such as 30 minutes) and is regarded as a safe view of that data. If something happens at one entity, the system can simply suspend the advance of the safe-image pointer until the problem is repaired and the virus removed. Both a real-time image of the data and an image delayed by at least 30 minutes can exist. The delayed image can update itself, for example at each delta set closure, to keep itself approximately 30 minutes behind, and can continue advancing in time unless something or someone indicates that there is a problem and that updating should stop. The delayed image can then remain at the safe point until instructed otherwise.
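The delayed "safe" image in this example can be sketched as a pointer into the list of closed delta sets that trails the current time by a fixed delay and can be frozen. The class name `TimeFirewall`, its methods, and the 1800-second default (the 30 minutes used in the text) are hypothetical.

```python
class TimeFirewall:
    """Tracks which closed delta set the read-only 'safe' image points at."""

    def __init__(self, delay_s: float = 1800.0):
        self.delay_s = delay_s
        self.closed = []        # (close_time, delta_id), appended in close order
        self.safe_index = -1    # index of the delta set the safe image exposes
        self.frozen = False     # set when a logical error is suspected

    def on_close(self, close_time: float, delta_id: str) -> None:
        self.closed.append((close_time, delta_id))

    def safe_id(self):
        return self.closed[self.safe_index][1] if self.safe_index >= 0 else None

    def advance(self, now: float):
        """Move the safe pointer forward, staying at least delay_s behind now."""
        if not self.frozen:
            cutoff = now - self.delay_s
            while (self.safe_index + 1 < len(self.closed)
                   and self.closed[self.safe_index + 1][0] <= cutoff):
                self.safe_index += 1
        return self.safe_id()
```

Setting `frozen = True` when a problem is reported holds the safe image at its current delta set, and clearing it lets the image catch up to the configured delay again.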

(Nested delta sets)
When two sites are relatively close together compared with the other sites, data can be transferred more tightly between those nearby sites, for example for synchronous replication. Delta sets can be nested so that a pair of nearby sites can exchange frequently, within the closure of a "meta" delta set. This exchange can be finer-grained than the normal delta set exchange. Because of the proximity of the sites, exchanging more often provides more frequent updates at relatively low cost.

Nesting delta sets allows a subset of the nodes to switch sub-delta sets, so that geographically close nodes can be kept in step by finer delta sets than those used over greater distances. For example, 1100 in FIG. 11 shows two geographically close sites grouped into a nested group 1102. This helps to minimize the potential for data loss and to maximize the chance of an instantaneous primary site after a failure.

(RAID across delta sets)
After a delta set is closed, a delta set implementation can exchange the elements of the partial delta sets among all nodes, effectively mirroring the changes across all of them. Making changes "safe" by replicating between nodes need not mean mirroring alone. For example, any of the RAID layout patterns could be used. In that case the "D" refers not to disk images but to node instances, in cache or on disk.

A "safe" layout should also take physical reality into account. For example, multiple nodes within a site may need only one copy among them, with mirrors or other redundant copies stored at a different site.

For example, where five geographically separated sites are exchanging delta sets, rather than performing a RAID 1 mirror across all five sites, a RAID 5 type of implementation can place the data on subsets of the sites in a way that allows any one site to be lost while access to the data is retained. As is known in the art, RAID 5 includes features such as data striping and parity, so that no single site needs to hold the full set of data. The exchange portion of a delta set can be implemented as a RAID write. Each block need not be sent to every other host but can instead be routed according to the RAID striping. At any given time the data may exist at only two or three sites, or at one site plus a checksum. This reduces the need for a full broadcast of all data.
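The parity idea behind the RAID 5 style exchange can be illustrated with XOR over equally sized blocks, one block per site. This is a simplified sketch for a single stripe: parity rotation across stripes (which distinguishes RAID 5 from a fixed parity site) is omitted, and the function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks plus one parity block, one block per site."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(stripe, lost_index):
    """Rebuild the block held by a lost site from the surviving sites."""
    survivors = [blk for i, blk in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)
```

With four data blocks and one parity block spread over five sites, each site receives one block per stripe instead of a full mirror, and any single site can be lost without losing data.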

(Redefine storage primitives using delta sets)
Conventional storage layering includes, in order, a cache, cache replication, virtualization, and conventional RAID. This can be replaced with a different layering that includes a coherency layer, a delta set layer, and then a physical resource allocator that dictates where everything is located. Combining the above, it is possible to redefine the storage archive in terms of primitives built around delta sets.
So instead of the following conventional layering:
・ Cache
・ Cache replication
・ Virtualization
・ Conventional RAID
these primitives can present a new layering such as the following:
・ Coherence layer
・ Delta sets with properties and origins
・ Physical resource allocator that maps delta sets to physical devices
The functionality of the various embodiments can be implemented by any suitable combination of hardware and software known in the art. For example, software and logic can be stored, as a plurality of instructions or program code, on an information storage medium contained inside or outside the various components, accessories, and/or devices. Storage media and computer-readable media for containing code, or portions of code, can include any appropriate media known or used in the art, including storage media and communication media such as, but not limited to, volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage and/or transmission of information such as computer-readable instructions, data structures, program modules, or other data, including EEPROM, flash memory or other memory technology, CD-ROM, ROM, RAM, digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, data signals, data transmissions, or any other medium that can be accessed by a computer. Based on the disclosure and teachings provided herein, a person of ordinary skill in the art will appreciate other ways and/or methods to implement aspects of the various embodiments.

Accordingly, the specification and drawings are to be regarded as illustrative rather than restrictive. It will be evident, however, that various modifications and changes may be made to them without departing from the broader spirit and scope of the invention as set forth in the claims.

FIG. 1 illustrates a distributed storage system that can be used in accordance with one embodiment of the present invention.
FIG. 2 illustrates process steps for use by a multi-site storage system in accordance with one embodiment of the present invention.
FIG. 3 illustrates process steps for use by a multi-site storage system in accordance with one embodiment of the present invention.
FIG. 4 illustrates process steps for use by a multi-site storage system in accordance with one embodiment of the present invention.
FIG. 5 illustrates the steps of a method for storing data in accordance with one embodiment of the present invention.
FIG. 6 illustrates process steps for use by a multi-site storage system in accordance with one embodiment of the present invention.
FIG. 7 illustrates process steps for use by a multi-site storage system in accordance with one embodiment of the present invention.
FIG. 8 illustrates process steps for use by a multi-site storage system in accordance with one embodiment of the present invention.
FIG. 9 illustrates process steps for use by a multi-site storage system in accordance with one embodiment of the present invention.
FIG. 10 illustrates a geographically separated distributed storage system that can be used in accordance with one embodiment of the present invention.
FIG. 11 illustrates a distributed storage system including nested groups that can be used in accordance with one embodiment of the present invention.

Claims

A method for providing write order fidelity to a set of distributed data access nodes in a network comprising:
Storing a write request in a first cache of open deltas, the first cache corresponding to a first node receiving the write request;
In response to a trigger event, sending a message to each node of the set of data access nodes to close the open delta;
For each node that has a write request for the open delta, completing any pending write requests for the open delta and closing the open delta;
Exchanging write request information between nodes, whereby each node is associated with a complete copy of the write request information for the closed delta;
Writing each complete copy to persistent storage; and
Storing each complete copy of the closed delta in the back-end storage of each node.

The method of claim 1, further comprising grouping front-end volumes into write order fidelity (WOF) groups, each WOF group including at least one site connected via the network.

The method of claim 2, wherein each site includes at least one of the nodes and all nodes are operable to read and write in parallel.

The method of claim 2, wherein a write for each WOF group is stored in the open delta cache for that WOF group.

The method of claim 1, further comprising executing the trigger event.

The method of claim 1, further comprising opening a new delta and receiving a new write when closing the open delta.

The method of claim 1, further comprising providing localized cache access to remote data of geographically distant nodes.

The method of claim 1, wherein writing each complete copy to persistent storage includes writing the metadata update entry to the recovery log when the metadata update entry has been written, and writing the data update to the database.

The method of claim 1, further comprising reordering write requests in an open delta before closing the delta.

The method of claim 1, further comprising triggering a data snapshot corresponding to closing the open delta.

The method of claim 1, further comprising generating the trigger event using a mechanism selected from the group consisting of: a set time interval, a time interval based on the number of transactions, a time interval based on the number of changes/writes in a delta set or partial delta, an application trigger, an operator-initiated trigger, a trigger initiated by another subsystem, a trigger initiated by an error condition, and combinations thereof.

The method of claim 1, wherein the network is selected from the group consisting of a storage area network (SAN), a local area network (LAN), a wide area network (WAN), and a metropolitan area network (MAN).

The method of claim 1, wherein sending a message comprises broadcasting the message to each node of the set of data access nodes.

A system for providing write order fidelity to a set of distributed data access nodes in a network comprising:
A storage system for storing data;
A plurality of access nodes configured to access data of the storage system;
Wherein each node of the plurality of access nodes is operable to store a write request in a cache for an open delta, the cache corresponding to the node receiving the write request; each node is further operable, in response to a trigger event, to send a message to the plurality of access nodes in the data storage network to complete all write requests and close the open delta; each node is further operable to exchange write request information, whereby each node is associated with a complete copy of the write request information for the closed delta; and each node is further operable to store its respective complete copy in persistent storage and then to apply each complete copy of the closed delta to the back-end storage for that node.

The system of claim 14, wherein at least one of the nodes is geographically remote from other nodes.

The system of claim 14, wherein at least one of the access nodes is operable to mirror data to a remote location.

The system of claim 14, further comprising a plurality of sites, each site including at least one of the nodes.

The system of claim 17, wherein the storage system is further operable to determine when only one site is writing to the storage system, so that exchange of write request information is deferred until multiple sites write to the storage system.

The system of claim 14, wherein the storage system is further operable to determine when one of the sites is not writing data to the storage system, and to set that site as a passive site until the site needs to write.

The system of claim 19, wherein the storage system is further operable to broadcast a message to other sites to indicate the status of the passive site.

The system of claim 14, wherein the storage system is operable to perform asynchronous and synchronous data transfer.

The system of claim 14, wherein each node is operable to exchange write request information by exchanging the write request information with a first subset of the plurality of access nodes, whereby the first subset of nodes exchanges the write request information with a second subset of access nodes.

The system of claim 14, wherein each node is operable to replicate writes to multiple caches.

The system of claim 14, wherein the data storage system maintains three deltas, each delta representing one of: the point in time at which data was last committed to disk, the point in time just before the data is committed to disk, and the point in time at the start of the exchange.

The system of claim 14, wherein the storage system is further operable to merge deltas over time.

The system of claim 14, wherein the storage system is further operable to create a snapshot of any open delta at any point in time.

The system of claim 14, wherein the storage system is further operable to close the delta set using a mechanism selected from the group consisting of: a set time interval, a time interval based on the number of transactions, a time interval based on the number of writes in a delta set or partial delta, an application trigger, an operator-initiated trigger, a trigger initiated by another subsystem, a trigger initiated by an error condition, and combinations thereof.

The system of claim 14, wherein the network is selected from the group consisting of a storage area network (SAN), a local area network (LAN), a wide area network (WAN), and a metropolitan area network (MAN).

The system of claim 14, wherein each node is further operable to send the message to the plurality of access nodes by broadcasting the message to each node of the set of data access nodes.

A computer program product embedded in a computer readable medium for providing write order fidelity to a set of distributed data access nodes in a network comprising:
Computer program code for storing a write request in a first cache of open deltas, the first cache corresponding to a first node that receives the write request;
Computer program code for sending a message to each node of the set of data access nodes to close the open delta in response to a trigger event;
Computer program code for completing, for each node that has a write request for the open delta, any pending write requests for the open delta and closing the open delta;
Computer program code for exchanging write request information between nodes, whereby each node is associated with a complete copy of the write request information for the closed delta;
Computer program code for writing each complete copy to persistent storage; and
Computer program code for storing each complete copy of the closed delta in the back-end storage of each node.

The computer program product of claim 30, further comprising computer program code for grouping front-end volumes into write order fidelity (WOF) groups, each WOF group including at least one site connected via the network.

The computer program product of claim 30, further comprising computer program code for storing a write to each WOF group in the open delta cache for that WOF group.

The computer program product of claim 30, further comprising computer program code for providing localized cache access to remote data of geographically distant nodes.

A method for providing write order fidelity to a set of distributed data access nodes in a network comprising:
Providing a plurality of write order fidelity (WOF) groups, each WOF group including at least one of the data access nodes;
Storing the write request in a first cache corresponding to a first node in a first WOF group that receives the write request;
In response to a trigger event, exchanging write request information between the nodes of the first WOF group, whereby each node is associated with a complete copy of the write request information;
Writing each complete copy to persistent storage; and
Storing each complete copy in the back-end storage of each node.

The method of claim 34, further comprising providing localized cache access to remote data of geographically distant nodes.

The method of claim 34, wherein writing each complete copy to persistent storage includes writing the metadata update entry to a recovery log when writing the metadata update entry is complete, and writing the data update to the database.

The method of claim 34, further comprising triggering a data snapshot corresponding to the state of the cached write request for the first WOF group.

A method for providing write order fidelity to a set of distributed data access nodes in a network comprising:
Providing a plurality of write order fidelity (WOF) groups, each WOF group including at least one of the data access nodes;
Storing write requests in a cache corresponding to one of the plurality of WOF groups, wherein each WOF group is associated with a cache and is operable to receive a write request from a writer having access to at least one node of the WOF group;
In response to a triggering event for a WOF group, exchanging write request information between the nodes of the WOF group, whereby each node is associated with a complete copy of the write request information;
Storing each complete copy in persistent storage.

A computer program product embedded in a computer readable medium for providing write order fidelity to a set of distributed data access nodes in a network comprising:
Computer program code providing a plurality of write order fidelity (WOF) groups, each WOF group including at least one of the data access nodes;
Computer program code for storing a write request in a cache corresponding to one of the plurality of WOF groups, each WOF group being associated with a cache and being operable to receive a write request from a writer having access to at least one node of the WOF group;
Computer program code for exchanging write request information between the nodes of the WOF group in response to a trigger event for the WOF group, whereby each node is associated with a complete copy of the write request information; and
Computer program code for storing each complete copy in persistent storage.