JP2011510405A - Scalable deduplication mechanism

Info

Publication number: JP2011510405A
Application number: JP2010543270A
Authority: JP
Inventors: サンドルフィ，ミクロス; ライター，ティミー・ジィ
Original assignee: セパトン，インコーポレイテッド
Priority date: 2008-01-16
Filing date: 2009-01-16
Publication date: 2011-03-31
Also published as: WO2009091957A3; AU2009206038A1; CA2711273A1; WO2009091957A2; CN101939737A; EP2235640A2

Abstract

A method for removing redundant data from a backup storage system is presented. In one example, the method may include receiving an application layer data object, selecting a deduplication domain from a plurality of deduplication domains based at least in part on a data object characteristic associated with the deduplication domain, determining that the application layer data object has the characteristic, and directing the application layer data object to the deduplication domain.

Description

Background
1. Field of the Invention
Aspects of the invention relate to data storage, and more particularly to apparatus and methods for providing a scalable data deduplication service.

2. Description of Related Art
Many computer systems include one or more host computers and one or more data storage systems that store data used by the host computers. These host computers and storage systems are typically networked together using a network such as a Fibre Channel network, an Ethernet network, or another type of communication network. Fibre Channel is a standard that combines the speed of channel-based transmission schemes with the flexibility of network-based transmission schemes, allowing multiple initiators to communicate with multiple targets over a network, where the initiators and targets may be any devices coupled to the network. Because Fibre Channel is typically implemented using a fast transmission medium such as optical fiber cable, it is a popular choice for storage system networks in which large amounts of data are transferred.

An example of a typical networked computing environment including several host computers and a backup storage system is shown in FIG. 1. One or more application servers 102 are coupled via a local area network (LAN) 103 to a plurality of user computers 104. Both the application servers 102 and the user computers 104 may be considered "host computers." The application server 102 is coupled to one or more primary storage devices 106 via a storage area network (SAN) 108. The primary storage devices 106 may be, for example, disk arrays such as those available from companies like EMC Corporation and IBM Corporation. Alternatively, a bus (not shown) or other network link may provide the interconnect between the application server and the primary storage system 106. The bus and/or Fibre Channel network connection may operate using a protocol, such as the Small Computer System Interface (SCSI) protocol, that dictates the format of packets transferred between the host computers (e.g., the application server 102) and the storage system 106.

It is to be appreciated that the networked computing environment illustrated in FIG. 1 is typical of large systems such as those used by major financial institutions or large corporations. It is to be understood that many networked computing environments need not include all of the elements illustrated in FIG. 1. For example, a smaller networked computing environment may simply include host computers connected to a storage system directly or via a LAN. In addition, although FIG. 1 illustrates separate user computers 104, application servers 102, and media servers 114, these functions may be combined into one or more computers.

In addition to the primary storage devices 106, many networked computing environments include at least one secondary or backup storage system 110. The backup storage system 110 may typically be a tape library, although other large-capacity, reliable secondary storage systems may be used. Typically, these secondary storage systems are slower than the primary storage devices, but they include some type of removable media (e.g., tapes, magnetic disks, or optical disks) that may be removed and stored off-site.

In the illustrated example, the application servers 102 may be able to communicate directly with the backup storage system 110 via, for example, an Ethernet or other communication link 112. However, such a connection may be relatively slow and may also consume resources such as processor time or network bandwidth. Therefore, a system such as the one illustrated may include one or more media servers 114 that may provide a communication link 115, using for example Fibre Channel, between the SAN 108 and the backup storage system 110.

The media servers 114 may run software that includes a backup/restore application, which controls the transfer of data between host computers (such as the user computers 104, the media servers 114, and/or the application servers 102), the primary storage devices 106, and the backup storage system 110. Examples of backup/restore applications are available from companies such as Veritas and Legato. For data protection, data from the various host computers and/or the primary storage devices in a networked computing environment may be periodically backed up onto the backup storage system 110 using a backup/restore application, as is known in the art.

Of course, as discussed above, many networked computing environments may be smaller and may include fewer components than the exemplary networked computing environment illustrated in FIG. 1. It is therefore also to be appreciated that the media servers 114 may in fact be combined with the application servers 102 in a single host computer, and that the backup/restore application may be executed on any host computer that is coupled (either directly or indirectly, such as through a network) to the backup storage system 110.

One example of a typical backup storage system is a tape library that includes a number of tape cartridges, at least one tape drive, and a robotic mechanism that controls loading and unloading of the cartridges into and out of the tape drives. The backup/restore application instructs the robotic mechanism to locate a particular tape cartridge, e.g., tape number 0001, and load it into a tape drive so that data may be written onto the tape. The backup/restore application also controls the format in which data is written onto the tapes. Typically, the backup/restore application may use SCSI commands, or other standardized commands, to instruct the robotic mechanism and to control the tape drives to write data onto the tapes and to recover previously written data from the tapes.

Conventional tape library backup systems suffer from a number of problems, including limited speed, limited reliability, and fixed capacity. Many large companies need to back up terabytes of data each week. However, even expensive, high-end tapes can usually read or write data only at about 30 to 40 megabytes per second (MB/s), which translates to about 50 gigabytes per hour (GB/hr). Thus, backing up one or two terabytes of data to a tape backup system may take at least 10 to 20 hours of continuous data transfer time.

In addition, many tape manufacturers will not guarantee that data can be stored onto (or restored from) a tape if the tape is dropped, or if the tape is exposed to non-ideal environmental conditions such as extremes of temperature or humidity. Drops may occur relatively frequently in a typical tape library, because either a human operator or the robotic mechanism may drop a tape during a move or load operation. Considerable care is therefore needed to store tapes in a controlled environment. Furthermore, the complex machinery of a tape library (including the robotic mechanism) is expensive to maintain, and individual tape cartridges are relatively expensive and have a limited service life.

Given the costs associated with conventional tape libraries and other kinds of backup storage media, vendors often incorporate deduplication processes into their product offerings to reduce overall backup media requirements. Deduplication is a process of identifying repeating sequences of data over time; that is, it is a manifestation of delta compression. Deduplication is typically implemented as a function of a target device, such as a backup storage device. Identifying redundant data within a backup data stream is complex, and in the current state of the art it is conventionally addressed through either hash fingerprinting or pattern recognition.

In hash fingerprinting, the incoming data stream first undergoes an alignment process, which attempts to predict good "breakpoints" (also known as edges) in the data stream that will provide the highest probability of subsequent matches. The stream is then subjected to a hashing process, usually SHA-1 in the current state of the art. The hashing process breaks the data stream into chunks, usually about 8 to 12 kilobytes in size, and each chunk is assigned its resulting hash value. This hash value is compared against a memory-resident table. If the hash entry is found, the data is assumed to be redundant and is replaced with a pointer to the existing block of data already stored in a disk storage system; the location of the existing data is given in the table. If the hash entry is not found, the data is stored in the disk storage system and its location is recorded in the memory-resident table along with its hash. Examples illustrating this mechanism can be found in U.S. Pat. No. 7,065,619, assigned to Data Domain, and U.S. Pat. No. 5,990,810, assigned to Quantum Corporation. Hash fingerprinting is typically executed inline; that is, data is processed in real time before being written to disk.
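The table lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of the cited patents: it replaces the alignment step with fixed-size 8 KB chunks for brevity, and the class name `HashFingerprintStore`, the list standing in for the disk storage system, and the dictionary standing in for the memory-resident table are all assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 8 * 1024  # 8 KB, the low end of the chunk sizes cited above


class HashFingerprintStore:
    """Hypothetical store: fixed-size chunks stand in for aligned chunks."""

    def __init__(self):
        self.blocks = []  # stands in for the disk storage system
        self.table = {}   # memory-resident table: SHA-1 digest -> block location

    def write(self, stream: bytes) -> list:
        """Store a stream and return one block location (pointer) per chunk."""
        pointers = []
        for offset in range(0, len(stream), CHUNK_SIZE):
            chunk = stream[offset:offset + CHUNK_SIZE]
            digest = hashlib.sha1(chunk).digest()
            location = self.table.get(digest)
            if location is None:
                # Hash entry not found: store the data and record its location.
                location = len(self.blocks)
                self.blocks.append(chunk)
                self.table[digest] = location
            # Hash entry found (or just added): keep only a pointer.
            pointers.append(location)
        return pointers

    def read(self, pointers) -> bytes:
        """Rebuild the original stream from its pointers."""
        return b"".join(self.blocks[p] for p in pointers)
```

Because the table maps every distinct chunk to a location, its size grows with the amount of unique data stored, which is the memory constraint the summary later attributes to hash fingerprinting.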

In pattern recognition, the incoming data stream is first "chunked," or segmented, into relatively large data blocks (about 32 MB). The data is then processed by a simple rolling hash method that assembles a list of hash values. A transformation is applied to the hash values, and the resulting small list of values represents the "fingerprint" of the data block. A search is then made against a table of hashes to look for at least a certain number of fingerprint hashes that are present in some other given stored block. If the minimum number of matches is not met, the block is considered unique and is stored directly to disk, and the corresponding fingerprint hashes are added to the memory-resident table. If the minimum number of matches is met, the current data block may match a previously stored data block. In this case, the block of disk storage identified by the matching fingerprint is read into memory and compared byte for byte against the candidate block that was just hashed. If the full sequence of data is equal, the data block is replaced with a pointer to the physically addressed block of storage. If the full block does not match, a delta differencing mechanism is employed to determine the minimal data set within the block that needs to be stored. The result is a combination of unique data and references to closely matching blocks of previously stored data. An example illustrating this mechanism can be found in U.S. Patent Application No. US 2006/0059207, assigned to Diligent Corporation. As with hash fingerprinting, this operation is typically executed inline.
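The candidate search and byte-for-byte confirmation described above might look roughly like the following Python sketch. It rests on several assumptions that are not in the source: CRC32 recomputed over each window replaces a true rolling hash, "keep the smallest hash values" is the transformation that produces the fingerprint, and the window size, fingerprint size, and match threshold are arbitrary. Delta differencing is omitted, so a near match is simply stored in full.

```python
import zlib

WINDOW = 16      # bytes hashed per window (assumed value)
FP_SIZE = 8      # hash values kept as a block's fingerprint (assumed value)
MIN_MATCHES = 4  # shared hashes needed to treat a stored block as a candidate


def fingerprint(block: bytes) -> frozenset:
    """Hash every window of the block and keep the FP_SIZE smallest values."""
    last_start = max(1, len(block) - WINDOW + 1)
    hashes = {zlib.crc32(block[i:i + WINDOW]) for i in range(last_start)}
    return frozenset(sorted(hashes)[:FP_SIZE])


class PatternStore:
    """Hypothetical store illustrating fingerprint search plus byte comparison."""

    def __init__(self):
        self.blocks = []  # stands in for disk storage
        self.index = {}   # memory-resident table: hash value -> set of block ids

    def put(self, block: bytes):
        fp = fingerprint(block)
        # Count how many fingerprint hashes each stored block shares with this one.
        counts = {}
        for h in fp:
            for block_id in self.index.get(h, ()):
                counts[block_id] = counts.get(block_id, 0) + 1
        # Candidates that meet the threshold are confirmed byte for byte.
        for block_id, shared in counts.items():
            if shared >= MIN_MATCHES and self.blocks[block_id] == block:
                return ("ref", block_id)
        # No confirmed match: store the block and index its fingerprint.
        block_id = len(self.blocks)
        self.blocks.append(block)
        for h in fp:
            self.index.setdefault(h, set()).add(block_id)
        return ("stored", block_id)
```

Each confirmation reads a stored block back for comparison, so on real disks every candidate costs a random read; that is the I/O workload the summary later identifies as the limiting factor for this approach.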

Summary of the Invention
Aspects and embodiments of the present invention provide a data storage system that overcomes or alleviates some or all of the problems of conventional data deduplication techniques, and that may provide greater efficacy and scalability than data storage systems that incorporate conventional deduplication techniques.

In broad overview, aspects and embodiments of the present invention provide a random-access-based storage system that emulates a conventional tape backup storage system, such that a backup/restore application sees the same view of devices and media as it would with a physical tape library. The storage system of the invention uses software and hardware to emulate physical tape media and replace them with one or more random-access disk arrays, translating tape-format, linear, sequential data into data that is suitable for storage on disk.

According to some aspects and embodiments of the present invention, there is provided a mechanism for decoding existing backup data sets and storing the metadata (i.e., data that represents information about user data) in a searchable metadata cache, a mechanism that allows the metadata cache to be searched and/or browsed for files or objects, and a mechanism for downloading those files or objects via a web connection from data stored through the existing backup policies and practices of typical backup software. Also included may be a mechanism for authenticating a user through existing authentication mechanisms and for limiting the view of the metadata cache based on the current user's credentials.

Aspects and embodiments of the present invention also provide for the removal of redundant data from backup data objects. This removal process, which may be termed "deduplication," decreases the storage capacity required to maintain copies of backup data and thus decreases the amount of electronic media required to store backup data. Embodiments of the deduplication process in accordance with at least some aspects of the present invention make efficient use of computing resources by using metadata to optimize deduplication processing.

As discussed further below, some embodiments are directed to intelligent direction of the overall deduplication process. In some of these embodiments, a data storage system uses software and hardware to direct data objects to one of several deduplication domains for deduplication and storage. In addition, applications implemented in hardware and/or software are provided for configuring deduplication domains that manage data deduplication within the constraints presented by a given data storage system. Some embodiments reflect an appreciation that conventional hash fingerprinting techniques are constrained by the amount of available memory. Other embodiments reflect an appreciation that random I/O workload is a substantial limitation of the pattern recognition approach. These embodiments thus recognize the limitations imposed by conventional hash fingerprinting and pattern recognition deduplication techniques.

According to other aspects and embodiments of the invention, there is provided a mechanism for performing a logical merge of multiple cartridge representations in a metadata cache, and a mechanism for appropriately labeling and barcoding a newly synthesized cartridge such that it is accepted by backup/restore software as a valid data set. Also, according to further aspects and embodiments of the invention, there is provided a mechanism for either storing multiple copies of the data elements that represent a synthetic cartridge, or storing only pointers to existing data represented in the metadata cache.

According to one embodiment, a method for directing deduplication of an application layer data object is provided. The method includes acts of receiving the application layer data object, selecting a deduplication domain from a plurality of deduplication domains based at least in part on a data object characteristic associated with the deduplication domain, determining that the application layer data object has the characteristic, and directing the application layer data object to the selected deduplication domain.
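One way to picture the four acts of this method is the Python sketch below. It is an illustration under stated assumptions rather than the claimed implementation: the `DedupDomain` class, the representation of characteristics as key/value pairs, the first-match selection policy, and the domain and metadata names in the test data are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DedupDomain:
    """Hypothetical deduplication domain with its associated characteristics."""
    name: str
    method: str            # e.g. "hash_fingerprint" or "pattern_recognition"
    characteristics: dict  # characteristics an object must have to belong here
    objects: list = field(default_factory=list)


def select_domain(metadata: dict, domains: list):
    """Return the first domain whose characteristics the object has, else None."""
    for domain in domains:
        if all(metadata.get(key) == value
               for key, value in domain.characteristics.items()):
            return domain
    return None


def direct(object_name: str, metadata: dict, domains: list) -> str:
    """Receive an object, select a matching domain, and direct the object to it."""
    domain = select_domain(metadata, domains)
    if domain is None:
        raise LookupError(f"no deduplication domain matches {object_name!r}")
    domain.objects.append(object_name)
    return domain.name
```

Grouping objects that share characteristics keeps each domain's working set smaller than one global index would be, which is how this kind of routing can stay within the memory and I/O limits noted earlier.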

In one example, the act of receiving the application layer data object may include acts of receiving a data stream and identifying the application layer data object using metadata included in the data stream. In another example, the act of receiving the data stream may include an act of receiving a multiplexed data stream. According to another example, the method may further include an act of extracting the metadata included in the data stream using the application layer data object. In yet another example, the act of selecting the deduplication domain from the plurality of deduplication domains may include an act of comparing the extracted metadata associated with the application layer data object to at least one characteristic associated with the deduplication domain. According to a further example, the act of extracting the metadata included in the data stream may include an act of extracting at least one of a backup policy name, a data source type, a data source name, a backup application name, an operating system type, a data type, a backup type, a file name, a directory structure, and chronological information.
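The idea of identifying application layer data objects inside a multiplexed stream can be sketched as follows. The source does not describe a record format, so the `(object_id, kind, value)` tuples, the `"meta"` and `"data"` record kinds, and the `demultiplex` function are purely hypothetical; real backup stream formats are vendor specific.

```python
from collections import defaultdict


def demultiplex(records):
    """Group interleaved records into one (metadata, payload) pair per object.

    Each record is a hypothetical (object_id, kind, value) tuple, where kind
    is "meta" (value is a dict of metadata fields) or "data" (value is bytes).
    """
    metadata = defaultdict(dict)
    payload = defaultdict(bytearray)
    for object_id, kind, value in records:
        if kind == "meta":
            metadata[object_id].update(value)
        elif kind == "data":
            payload[object_id].extend(value)
        else:
            raise ValueError(f"unknown record kind: {kind!r}")
    object_ids = metadata.keys() | payload.keys()
    return {oid: (dict(metadata[oid]), bytes(payload[oid])) for oid in object_ids}
```

The metadata dictionary recovered for each object is the kind of input that could then be compared against the characteristics of each deduplication domain.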

In another example, the method may further include an act of configuring each of the plurality of deduplication domains to use one of a plurality of deduplication methods. According to another example, the act of configuring each of the plurality of deduplication domains may include an act of configuring each domain to use a deduplication method selected from the group including hash fingerprinting, pattern recognition, and content-aware deduplication. In yet another example, the method may further include an act of associating each of the plurality of deduplication domains with at least one data object characteristic. According to an additional example, the method may further include acts of deduplicating the application layer data object within the selected deduplication domain and adjusting the data object characteristic associated with at least one of the plurality of deduplication domains based on a result of the act of deduplicating. In a further example, the act of adjusting the data object characteristic may include an act of storing data in a deduplication domain database.

According to another embodiment, a grid computing environment is provided that performs the acts of the method for directing deduplication of application layer data objects described above.

According to another embodiment, a backup storage system is provided that performs the acts of the method for directing deduplication of application layer data objects described above. In this embodiment, the method is performed while data is not being backed up to the backup storage system.

According to another embodiment, a backup storage system is provided that performs the acts of the method for directing deduplication of application layer data objects described above. In this embodiment, the method is performed while data is being backed up to the backup storage system.

According to another embodiment, a computer-readable medium is provided having stored thereon computer-readable signals that define instructions. When executed by a computer, the instructions direct the computer to perform acts of receiving the application layer data object, selecting a deduplication domain from a plurality of deduplication domains based at least in part on a data object characteristic associated with the deduplication domain, determining that the application layer data object has the characteristic, and directing the application layer data object to the selected deduplication domain.

According to another embodiment, a system for directing deduplication of an application layer data object is provided. The system includes a plurality of deduplication domains, each of which is associated with at least one characteristic common to a plurality of application layer data objects. The system further includes a controller coupled to the plurality of deduplication domains. The controller is configured to receive the application layer data object, determine that the application layer data object has the at least one characteristic associated with a deduplication domain, and direct the application layer data object to that deduplication domain.

In one example, the controller may be further configured to receive a data stream and identify the application layer data object using metadata included in the data stream. In another example, the data stream may be multiplexed. In another example, the controller may be further configured to extract the metadata included in the data stream using the application layer data object. In yet another example, the controller may be further configured to determine that the application layer data object has the at least one characteristic associated with the deduplication domain by comparing the extracted metadata associated with the application layer data object to the at least one characteristic associated with the deduplication domain. In a further example, the controller may be further configured to extract at least one of a backup policy name, a data source type, a data source name, a backup application name, an operating system type, a data type, a backup type, a file name, a directory structure, and chronological information. In an additional example, the controller may be further configured to configure each of the plurality of deduplication domains to use one of a plurality of deduplication methods. In yet another example, the controller may be further configured to configure each of the plurality of deduplication domains to use a deduplication method selected from the group including hash fingerprinting, pattern recognition, and content-aware deduplication.

According to another example, the controller may be further configured to associate each of the plurality of deduplication domains with at least one data object characteristic. In another example, the controller may be further configured to cause deduplication of the application layer data object within the selected deduplication domain and to adjust, based on a result of the act of deduplicating, the data object characteristic associated with at least one of the plurality of deduplication domains. In yet another example, the controller may be further configured to store data in a deduplication domain database. In another example, the system may be included in a grid computing environment. In yet another example, the controller may be further configured, while data is not being backed up to the system, to receive the application layer data object, determine that the application layer data object has the at least one characteristic associated with a deduplication domain, and direct the application layer data object to that deduplication domain. In addition, according to one example, the controller may be further configured, while data is being backed up to the system, to receive the application layer data object, determine that the application layer data object has the at least one characteristic associated with a deduplication domain, and direct the application layer data object to that deduplication domain.
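
The routing behavior described in these examples can be pictured with a short sketch. The Python below is illustrative only: the names `DedupDomain` and `direct`, the metadata keys, and the exact-match rule are assumptions made for the example and do not come from the specification, which leaves the characteristic test and the per-domain deduplication method open.

```python
from dataclasses import dataclass, field


@dataclass
class DedupDomain:
    """Hypothetical deduplication domain: a method label plus the characteristics it accepts."""
    name: str
    method: str                      # e.g. "hash_fingerprinting" (label only, never executed here)
    characteristics: dict            # assumed form: metadata key -> required value
    members: list = field(default_factory=list)

    def matches(self, metadata: dict) -> bool:
        # Assumed rule: every characteristic must equal the object's metadata value.
        return all(metadata.get(k) == v for k, v in self.characteristics.items())


def direct(metadata: dict, domains: list):
    """Return the first domain whose characteristics the object has, or None."""
    for domain in domains:
        if domain.matches(metadata):
            domain.members.append(metadata["name"])
            return domain
    return None


domains = [
    DedupDomain("db", "content_aware", {"data_type": "database"}),
    DedupDomain("files", "hash_fingerprinting", {"data_type": "file"}),
]
chosen = direct({"name": "payroll.db", "data_type": "database"}, domains)
print(chosen.name)
```

Because the routing decision depends only on metadata, the same function could in principle be called either while a backup is in progress or afterwards, which mirrors the two timing variants described above.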

Still other aspects, embodiments, and advantages of these exemplary aspects and embodiments are discussed in detail below. Moreover, both the foregoing information and the following detailed description are merely illustrative examples of various aspects and embodiments, and are intended to provide an overview or framework for understanding the nature and character of the claimed aspects and embodiments. The accompanying drawings are included to provide illustration and a further understanding of the various aspects and embodiments, and are incorporated in and constitute a part of this specification. The drawings, together with the remainder of the specification, serve to explain the principles and operations of the described and claimed aspects and embodiments.

Brief Description of the Drawings
Various aspects of at least one embodiment are discussed below with reference to the accompanying figures. In the figures, which are not intended to be drawn to scale, each identical or nearly identical component that is illustrated in various figures is represented by a like numeral. For purposes of clarity, not every component is labeled in every figure. The figures are provided for the purposes of illustration and explanation and are not intended as a definition of the limits of the invention.

A block diagram of an example of a large-scale networked computing environment that includes a backup storage system.
A block diagram of an example of a networked computing environment including a storage system according to aspects of the invention.
A block diagram of an example of a storage system according to aspects of the invention.
A block diagram illustrating a virtual layout of an example of a storage system according to aspects of the invention.
A schematic layout of an example of a system file according to aspects of the invention.
An example of a tape directory structure according to aspects of the invention.
A diagram illustrating an example of a method of creating a synthetic full backup according to aspects of the invention.
A schematic diagram of an example of a series of backup data sets including a synthetic full backup according to the invention.
A diagram of an example of a metadata cache structure.
A diagram of an example of a virtual cartridge storing a synthetic full backup data set.
A diagram of another example of a virtual cartridge storing a synthetic full backup data set.
A flowchart of a method of deduplicating data objects according to the invention.
A diagram of two backup data objects.
A diagram of a deduplicated copy of the backup data objects shown in FIG. 13A.
Another diagram of a deduplicated copy of the backup data objects shown in FIG. 13A.
A block diagram of an example of a deduplication director according to aspects of the invention.
A flowchart of a method of directing deduplication of data objects according to the invention.

Detailed Description
Various embodiments and aspects thereof will now be discussed in more detail with reference to the accompanying figures. It is to be appreciated that this invention is not limited in its application to the details of construction and the arrangement of components set forth in the following description or illustrated in the drawings. The invention is capable of other embodiments and of being practiced or carried out in various ways. Examples of specific implementations are provided herein for illustrative purposes only and are not intended to be limiting. In particular, acts, elements, and features discussed in connection with any one or more embodiments are not intended to be excluded from a similar role in any other embodiment. Also, the phraseology and terminology used herein are for the purpose of description and should not be regarded as limiting. The use herein of “including,” “comprising,” “having,” “containing,” “involving,” and variations thereof is meant to encompass the items listed thereafter and equivalents thereof as well as additional items.

Any embodiment disclosed herein may be combined with any other embodiment, and references to “an embodiment,” “some embodiments,” “an alternate embodiment,” “various embodiments,” “one embodiment,” “at least one embodiment,” “this and other embodiments,” or the like are not necessarily mutually exclusive and are intended to indicate that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment. Such terms as used herein do not necessarily all refer to the same embodiment. Any embodiment may be combined with any other embodiment in a manner consistent with the aspects disclosed herein. References to “or” may be construed as inclusive, so that any terms described using “or” may indicate any of a single, more than one, and all of the described terms.

As used herein, the term “host computer” refers to any computer that has at least one processor, such as a personal computer, a workstation, a mainframe, a networked client, a server, etc., and that is capable of communication with other devices, such as a storage system or other host computers. Host computers may include media servers and application servers (as described previously with reference to FIG. 1) as well as user computers (which may be user workstations, PCs, mainframes, etc.). In addition, within this disclosure, the term “networked computing environment” includes any computing environment in which a plurality of host computers are connected to one or more shared storage systems in such a manner that the storage system(s) can communicate with each of the host computers. Fibre Channel is one example of a communication network that may be used with embodiments of the present invention. However, it is to be appreciated that the networks described herein are not limited to Fibre Channel, and that the various network components may communicate with each other over any network connection, such as Token Ring or Ethernet, instead of or in addition to Fibre Channel, or over combinations of different network connections. Moreover, aspects of the present invention may also be used in bus topologies, such as SCSI or parallel SCSI.

According to various embodiments and aspects of the present invention, there is provided a virtual removable media library backup storage system that may use one or more disk arrays to emulate a removable media based storage system. Using embodiments of the invention, data may be backed up onto the disk array(s) using the same backup/restore application that would have been used to back up the data onto removable media (such as tapes, magnetic disks, optical disks, etc.), without the user having to modify or adjust existing backup procedures or purchase a new backup/restore application. In one embodiment, described in detail herein, the removable media that are emulated are tapes, and the backup storage system of the invention emulates a tape library system including the tapes and the robotic mechanism used to handle tapes in a conventional tape library system.

The data that may be backed up and restored using embodiments of the invention may be organized into various data objects. These data objects may include any structure in which data may be stored. A non-limiting list of exemplary data objects includes bits, bytes, data files, data blocks, data directories, backup data sets, and virtual cartridges, which are discussed further below. Although the bulk of this disclosure refers to backing up and restoring data files, it is to be appreciated that embodiments of the invention may manipulate any data object, and that the term “data file” is interchangeable with “data object.” In addition, as would be appreciated by one of ordinary skill in the art, the embodiments described herein operate at the application layer of the Open Systems Interconnection (OSI) model and are built in reliance on other software and/or hardware to provide the basic network services represented by the other OSI model layers.

In addition, embodiments may deduplicate backed-up data to utilize available computing resources more efficiently. According to some embodiments, data deduplication may be performed inline, i.e., while the data storage system is receiving the data to be deduplicated and stored. In other embodiments, data deduplication may be performed offline, i.e., after the data storage system has already stored the data to be deduplicated. As is discussed further below, embodiments may intelligently direct a variety of conventional and non-conventional deduplication techniques to provide highly scalable deduplication services.
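
The difference between inline and offline deduplication can be illustrated with a minimal sketch. It assumes fixed-size blocks and SHA-256 fingerprints, which is only one conventional technique and is not stated to be the method of the embodiments; the class `BlockStore`, its method names, and the tiny 4-byte block size are hypothetical and chosen for readability.

```python
import hashlib


class BlockStore:
    """Toy store that can deduplicate on write (inline) or in a later pass (offline)."""

    def __init__(self, block_size: int = 4):
        self.block_size = block_size
        self.blocks = {}    # fingerprint -> unique block bytes
        self.recipes = {}   # object name -> ordered list of fingerprints
        self.raw = {}       # object name -> bytes stored but not yet deduplicated

    def _dedup(self, name: str, data: bytes) -> None:
        recipe = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            fp = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(fp, block)   # keep only the first copy of each block
            recipe.append(fp)
        self.recipes[name] = recipe

    def write_inline(self, name: str, data: bytes) -> None:
        # Inline: deduplicate while the data is being received.
        self._dedup(name, data)

    def write_raw(self, name: str, data: bytes) -> None:
        # Offline case, step 1: store the data as-is.
        self.raw[name] = data

    def dedup_offline(self) -> None:
        # Offline case, step 2: deduplicate data that is already stored.
        for name in list(self.raw):
            self._dedup(name, self.raw.pop(name))

    def read(self, name: str) -> bytes:
        if name in self.raw:
            return self.raw[name]
        return b"".join(self.blocks[fp] for fp in self.recipes[name])


store = BlockStore()
store.write_inline("a", b"AAAABBBBAAAA")
store.write_raw("b", b"BBBBCCCC")
store.dedup_offline()
```

Both paths end in the same stored form; they differ only in when the fingerprinting work is done relative to ingest.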

A storage system according to aspects of the invention includes hardware and software that together interface with a host computer (running the backup/restore application) and a backup storage medium. The storage system may be designed to emulate tape, or other types of removable storage media, such that the backup/restore application sees the same view of devices and media as it would with a physical tape library, and to translate linear, sequential, tape-format data into data suitable for storage on random-access disks. In this manner, the storage system of the invention may provide enhanced functionality (such as allowing users to search for individual backed-up user files, as discussed below) without requiring new backup/restore application software or policies.

Referring to FIG. 2, there is illustrated in block diagram form an example of a networked computing environment that includes a backup storage system 170 according to aspects of the invention. As illustrated, a host computer 120 is coupled to the storage system 170 via a network connection 121. This network connection 121 may be, for example, a Fibre Channel connection that allows high-speed transfer of data between the host computer 120 and the storage system 170. It is to be appreciated that the host computer 120 may be, or may include, one or more application servers 102 (see FIG. 1) and/or media servers 114 (see FIG. 1), and may enable backup of data from any of the computers present in the networked computing environment or from a primary storage device 106 (see FIG. 1). In addition, one or more user computers 136 may also be coupled to the storage system 170 via another network connection 138, such as an Ethernet connection. As discussed in detail below, the storage system may enable users of the user computer 136 to view and optionally restore backed-up user files from the storage system.

The storage system includes a backup storage medium 126 that may be, for example, one or more disk arrays, as discussed in more detail below. The backup storage medium 126 provides the actual storage space for backed-up data from the host computer 120. However, the storage system 170 may also include software and additional hardware that emulates a removable media storage system, such as a tape library, so that, to the backup/restore application running on the host computer 120, it appears as though data is being backed up onto conventional removable storage media. Thus, as illustrated in FIG. 2, the storage system 170 may include “emulated media” 134, which represent, for example, virtual or emulated removable storage media such as tapes. These “emulated media” 134 are presented to the host computer by the storage system software and/or hardware and appear to the host computer 120 as physical storage media. Further interfacing between the emulated media 134 and the actual backup storage medium 126 may be a storage system controller (not shown) and a switching network 132 that accepts data from the host computer 120 and stores the data on the backup storage medium 126, as discussed in detail below. In this manner, the storage system “emulates” a conventional tape storage system to the host computer 120.

According to one embodiment, the storage system may include a “logical metadata cache” 242 that stores metadata relating to user data that is backed up from the host computer 120 onto the storage system 170. As used herein, the term “metadata” refers to data that represents information about user data and describes attributes of the actual user data. A non-limiting exemplary list of metadata regarding data objects may include the data object size, the logical and/or physical location of the data object in primary storage, the creation date of the data object, the date of the last modification of the data object, the name of the backup policy under which the data object was stored, an identifier of the data object (e.g., a name or watermark), and the data type of the data object (e.g., the software application associated with the data object). The logical metadata cache 242 represents a searchable collection of data that enables users and/or software applications to randomly locate backed-up user files, compare user files with one another, and otherwise access and manipulate backed-up user files. Two examples of software applications that may use the data stored in the logical metadata cache 242 are a synthetic full backup application 240 and an end-user restore application 300, which are discussed more fully below. In addition, a deduplication director, discussed in more detail below, may use the metadata to provide scalable deduplication services within the storage system.
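
To make the kinds of metadata listed above concrete, the following sketch models one record per data object and a searchable collection of such records. The field names, the `ObjectMetadata` and `LogicalMetadataCache` classes, and the exact-match `find` method are all assumptions for illustration; the specification does not prescribe a record layout or query interface.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ObjectMetadata:
    """Hypothetical metadata record covering the attributes listed in the text."""
    identifier: str          # e.g. the data object's name
    size_bytes: int
    primary_location: str    # logical or physical location in primary storage
    created: date
    last_modified: date
    backup_policy: str
    data_type: str           # e.g. the associated software application


class LogicalMetadataCache:
    """Toy searchable collection of metadata records (illustrative only)."""

    def __init__(self):
        self._records = []

    def add(self, record: ObjectMetadata) -> None:
        self._records.append(record)

    def find(self, **criteria):
        # Return every record whose named fields equal the given values.
        return [
            r for r in self._records
            if all(getattr(r, key) == value for key, value in criteria.items())
        ]


cache = LogicalMetadataCache()
cache.add(ObjectMetadata("report.doc", 2048, "/home/ann/report.doc",
                         date(2008, 1, 2), date(2008, 1, 9), "weekly", "word_processor"))
cache.add(ObjectMetadata("mail.pst", 9000, "/home/ann/mail.pst",
                         date(2008, 1, 3), date(2008, 1, 9), "daily", "mail_client"))
weekly = cache.find(backup_policy="weekly")
```

A query such as `cache.find(backup_policy="weekly")` locates records without scanning the backed-up data itself, which is the property that lets applications locate individual files at random.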

In overview, the synthetic full backup application 240 is capable of creating a synthetic full backup data set from one existing full backup data set and one or more incremental backup data sets. The synthetic full backup may obviate the need to perform periodic (e.g., weekly) full backups, thereby saving considerable time and network resources. Details of the synthetic full backup application 240 are described further below. The end-user restore application 300, also described more fully below, enables end users (e.g., operators of the user computers 136) to browse, locate, view, and/or restore previously backed-up user files from the storage system 170.
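
The idea of building a synthetic full backup from one full backup and later incrementals can be sketched as follows. This assumes, purely for illustration, that each backup data set is a mapping from file path to a stored file version, that incrementals are supplied oldest first, and that deletions are not tracked; the function name `synthesize_full` is hypothetical.

```python
def synthesize_full(full: dict, incrementals: list) -> dict:
    """Merge one full backup with ordered incrementals into a new full data set."""
    synthetic = dict(full)           # copy, so the existing full backup is untouched
    for incremental in incrementals:
        # A later version of a file replaces the earlier one; new files are added.
        synthetic.update(incremental)
    return synthetic


full_backup = {"/etc/hosts": "v1", "/home/ann/report.doc": "v1"}
monday = {"/home/ann/report.doc": "v2"}
tuesday = {"/home/ann/notes.txt": "v1"}
synthetic_full = synthesize_full(full_backup, [monday, tuesday])
```

No data is read from the host computers in this merge, which is why a synthetic full backup can save time and network resources compared with a fresh full backup.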

As discussed above, the storage system 170 includes hardware and software that interface with the host computer 120 and the backup storage medium 126. Together, the hardware and software of embodiments of the invention may emulate a conventional tape library backup system such that, from the point of view of the host computer 120, data appears to be backed up onto tape, but is in fact backed up onto another storage medium, such as a plurality of disk arrays.

Referring to FIG. 3, there is illustrated in block diagram form an example of a storage system 170 according to aspects of the invention. In one example, the hardware of the storage system 170 includes a storage system controller 122 and a switching network 132 that connects the storage system controller 122 to the backup storage medium 126. The storage system controller 122 includes a processor 127 (which may be a single processor or multiple processors) and a memory 129 (such as RAM, ROM, PROM, EEPROM, flash memory, etc., or combinations thereof) that may run all or some of the storage system software. The memory 129 may also be used to store metadata relating to the data stored on the backup storage medium 126. Software, including programming code that implements embodiments of the present invention, is generally stored on a computer-readable and/or writable nonvolatile recording medium, such as RAM, ROM, optical or magnetic disk, or tape, and then copied into the memory 129, where it may then be executed by the processor 127. Such programming code may be written in any of a plurality of programming languages, for example, Assembler, Java, Visual Basic, C, C#, C++, Fortran, Pascal, Eiffel, Basic, COBOL, or combinations thereof, as the present invention is not limited to a particular programming language. Typically, in operation, the processor 127 causes data, such as code that implements embodiments of the present invention, to be read from a nonvolatile recording medium into another form of memory, such as RAM, that allows the processor faster access to the information than does the nonvolatile recording medium.

As shown in FIG. 3, the controller 122 also includes a number of port adapters that connect the controller 122 to the host computer 120 and to the switching network 132. As illustrated, the host computer 120 is coupled to the storage system via a port adapter 124a, which may be, for example, a Fibre Channel port adapter. Via the storage system controller 122, the host computer 120 can back up data onto the backup storage medium 126 and recover data from the backup storage medium 126.

In the illustrated example, the switching network 132 may include one or more Fibre Channel switches 128a, 128b. The storage system controller 122 includes a plurality of Fibre Channel port adapters 124b and 124c that couple the storage system controller to the Fibre Channel switches 128a, 128b. Via the Fibre Channel switches 128a, 128b, the storage system controller 122 allows data to be backed up onto the backup storage medium 126. As illustrated in FIG. 3, the switching network 132 may further include one or more Ethernet switches 130a, 130b that are coupled to the storage system controller 122 via Ethernet port adapters 125a, 125b. In one example, the storage system controller 122 further includes another Ethernet port adapter 125c that may be coupled to, for example, a LAN 103 to enable the storage system 170 to communicate with host computers (e.g., user computers), as discussed below.

In the example illustrated in FIG. 3, the storage system controller 122 is coupled to the backup storage medium 126 via a switching network that includes two Fibre Channel switches and two Ethernet switches. Providing at least two of each type of switch within the storage system 170 eliminates single points of failure in the system. In other words, even if one switch (for example, Fibre Channel switch 128a) were to fail, the storage system controller 122 would still be able to communicate with the backup storage medium 126 via another switch. Such an arrangement may be advantageous in terms of both reliability and speed. For example, as discussed above, reliability is improved by providing redundant components and eliminating single points of failure. In addition, in some embodiments, the storage system controller can back up data onto the backup storage medium 126 using some or all of the Fibre Channel switches in parallel, thereby increasing the overall backup speed. However, it is to be appreciated that the system is not required to include two or more of each type of switch, nor is the switching network required to include both Fibre Channel and Ethernet switches. Furthermore, in examples in which the backup storage medium 126 comprises a single disk array, no switches at all may be necessary.
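
The reliability argument above, that a second switch removes the single point of failure, can be sketched as a simple failover loop. The `Switch` class, the `healthy` flag, and the `write_with_failover` function are hypothetical stand-ins for real switch hardware and path selection logic, which the specification does not detail.

```python
class Switch:
    """Stand-in for a switch on the path to the backup storage medium."""

    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy

    def send(self, data: bytes) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} stored {len(data)} bytes"


def write_with_failover(switches: list, data: bytes) -> str:
    """Try each redundant switch in turn; fail only if every path is down."""
    for switch in switches:
        try:
            return switch.send(data)
        except ConnectionError:
            continue   # fall through to the next redundant switch
    raise ConnectionError("no working path to the backup storage medium")


paths = [Switch("128a", healthy=False), Switch("128b")]
result = write_with_failover(paths, b"backup-data")
```

With a single-entry `paths` list the same call would raise as soon as that one switch failed, which is the single point of failure the redundant arrangement avoids.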

As discussed above, in one embodiment, the backup storage medium 126 may include one or more disk arrays. In one preferred embodiment, the backup storage medium 126 includes a plurality of ATA or SATA disks. Such disks are “off-the-shelf” products and may be relatively inexpensive compared with conventional storage array products from manufacturers such as EMC, IBM, etc. Moreover, when the cost of removable media (e.g., tapes) and the limited useful life of such media are taken into account, such disks are comparable in cost to conventional tape-based backup storage systems. In addition, such disks can read and write data substantially faster than tapes can. For example, over a single Fibre Channel connection, data can be backed up onto a disk at a speed of at least about 150 MB/s, which translates to about 540 GB/hr, significantly faster (e.g., by an order of magnitude) than tape backup speeds. In addition, several Fibre Channel connections may be implemented in parallel, thereby increasing the speed even further. According to one embodiment of the present invention, the backup storage medium may be organized to implement any one of a number of RAID (Redundant Array of Independent Disks) schemes. For example, in one embodiment, the backup storage medium may implement a RAID-5 scheme.
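
The conversion from about 150 MB/s to about 540 GB/hr follows from multiplying by 3600 seconds per hour and dividing by 1000 MB per GB (decimal units). The short helper below reproduces it; the function name is hypothetical, and the `connections` parameter assumes ideal linear scaling across parallel connections, which is an assumption of this sketch and not a figure given in the text.

```python
def gb_per_hour(mb_per_second: float, connections: int = 1) -> float:
    """Convert a sustained MB/s rate to GB/hr, using 1 GB = 1000 MB."""
    seconds_per_hour = 3600
    return mb_per_second * connections * seconds_per_hour / 1000


single_link = gb_per_hour(150)
two_links = gb_per_hour(150, connections=2)   # ideal scaling, an assumption
```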

As discussed above, embodiments of the invention emulate a conventional tape library backup system using disk arrays to replace tape cartridges as the physical backup storage medium, thereby providing a “virtual tape library.” The physical tape cartridges that would be present in a conventional tape library are replaced by what are termed herein “virtual cartridges.” For the purposes of this disclosure, it is to be appreciated that the term “virtual tape library” refers to an emulated tape library, which may be implemented in software and/or physical hardware as, for example, one or more disk arrays. It is further to be appreciated that, although this discussion refers primarily to emulated tapes, the storage system may also emulate other storage media, for example, a CD-ROM or DVD-ROM, and that the term “virtual cartridge” refers generally to emulated storage media, for example, an emulated tape or an emulated CD. In one embodiment, a virtual cartridge in fact corresponds to one or more hard disks.

Therefore, in one embodiment, a software interface is provided to emulate the tape library such that, to the backup/restore application, it appears that the data is being backed up onto tape. However, the actual data library is replaced by one or more disk arrays, such that the data is in fact being backed up onto these disk arrays. It is to be appreciated that other types of removable media storage systems may be emulated and that the invention is not limited to the emulation of tape library storage systems. The following discussion describes various aspects, features, and operations of the software included in the storage system 170.

Although the software is described as being “included” in the storage system 170, and may be executed by the processor 127 of the storage system controller 122 (see FIG. 3), it should be appreciated that there is no requirement that all of the software be executed on the storage system controller 122. Software programs such as the synthetic full backup application and the end-user restore application may be executed on the host computers and/or user computers, and portions of them may be distributed across all or some of the storage system controller, the host computers, and the user computers. It should therefore be appreciated that the storage system controller is not required to be a contained physical entity such as a computer. The storage system 170 may communicate with software that is resident on a host computer such as, for example, the media server 114 or the application server 102. In addition, the storage system may contain several software applications that may be run on, or be resident on, the same or different host computers. It should further be appreciated that the storage system 170 is not limited to a discrete piece of equipment, although in some embodiments the storage system 170 may be embodied as a discrete piece of equipment. In one example, the storage system 170 may be provided as a self-contained unit that acts as a “plug and play” replacement for a conventional tape library backup system (i.e., no modification of existing backup procedures and policies is needed). Such a storage system unit may also be used in a networked computing environment that includes a conventional backup system, to provide redundancy or additional storage capacity. In another embodiment, the storage system 116 may be implemented in a distributed computing environment, such as a clustered environment or a grid environment.

As described above, according to one embodiment, the host computer 120 (which may be, for example, the application server 102 or the media server 114; see FIG. 1) may back up data onto the backup storage media 126 via a network link (e.g., a Fibre Channel link) 121 that couples the host computer 120 to the storage system 170. Although the following description refers primarily to backing up data onto the emulated media, the principles apply equally to restoring backup data from the emulated media. As described above, the flow of data between the host computer 120 and the emulated media 134 may be controlled by the backup/restore application. From the point of view of the backup/restore application, the data may appear to be backed up onto a physical version of the emulated media.

Referring to FIG. 4, the storage system software 150 may include one or more logical abstraction layers that represent the emulated media and provide an interface between the backup/restore application 140 resident on the host computer 120 and the backup storage media 126. The software 150 accepts tape-formatted data from the backup/restore application 140 and translates that data into data suitable for storage on random-access disks (e.g., hard disks, optical disks, and the like). In one example, the software 150 is executed on the processor 127 of the storage system controller 122 and may be stored in the memory 129 (see FIG. 3).

According to one embodiment, the software 150 may include a layer, referred to herein as the virtual tape library (VTL) layer 142, that may provide SCSI emulation of tapes, tape drives, and the robotic mechanisms used to transfer tapes to and from the tape drives. The backup/restore application 140 may communicate with the VTL 142 (e.g., to back up or write data to the emulated media) using, for example, SCSI commands, represented by arrow 144. The VTL may thus form a software interface between the other storage system software and hardware and the backup/restore application, presenting the emulated storage media 134 (FIG. 2) to the backup/restore application so that the emulated media appears to the backup/restore application as conventional removable backup storage media.

A second software layer, referred to herein as the file system layer 146, may provide an interface between the emulated storage media (represented in the VTL) and the physical backup storage media 126. In one example, the file system 146 acts as a small operating system that communicates with the backup storage media 126 using, for example, SCSI commands, represented by arrow 148, in order to read data from and write data to the backup storage media 126.

In one embodiment, the VTL provides generic tape library support and may support any SCSI media changer. Emulated tape devices may include, but are not limited to, IBM LTO-1 and LTO-2 tape devices, the Quantum SuperDLT 320 tape device, the Quantum P3000 tape library system, or the STORAGETEK L180 tape library system. Within the VTL, each virtual cartridge is a file that may grow dynamically as data is stored. This is in contrast to conventional tape cartridges, which have a fixed size. One or more virtual cartridges may be stored in a system file, as described further below with respect to FIG. 5.

FIG. 5 illustrates one example of a data structure within the file system software 146, showing a system file 200 in accordance with an embodiment of the invention. In this embodiment, the system file 200 includes a header 202 and data 204. The header 202 may include information that identifies each of the virtual cartridges stored in that system file. The header may also contain information such as whether a virtual cartridge is write protected, the dates of creation and modification of the virtual cartridges, and so on. In one example, the header 202 includes information that uniquely identifies each virtual cartridge and distinguishes it from the other virtual cartridges stored in the storage system. For example, this information may include the name and identifying number of the virtual cartridge (corresponding to the barcode that would typically be present on a physical tape so that the tape can be identified by the robotic mechanism). The header 202 may also include additional information, such as the capacity of each virtual cartridge and its date of last modification.

According to one embodiment of the invention, the size of the header 202 may be optimized to reflect the type of data being stored (e.g., virtual cartridges representing data backups from one or more host computer systems) and the number of distinct sets of such data (e.g., virtual cartridges) that the system can track. For example, data that is typically backed up to a tape storage system is characterized by larger data sets that represent numerous system and user files. Because the data sets are so large, the number of discrete data files to be tracked is correspondingly small. Accordingly, in one embodiment, the size of the header 202 may be selected as a compromise between storing too much data to track efficiently (i.e., the header being too large) and not having space to store a sufficient number of cartridge identifiers (i.e., the header being too small). In one exemplary embodiment, the header 202 occupies the first 32 MB of the system file 200. However, it should be appreciated that a different size may be selected for the header 202 depending on system requirements, characteristics, and capacity.

It should be appreciated that, from the point of view of the backup/restore application, the virtual cartridges appear as physical tape cartridges with all of the same attributes and features. That is, to the backup/restore application, a virtual cartridge appears as a sequentially written tape. However, in one preferred embodiment, the data stored in the virtual cartridges is not stored in a sequential format on the backup storage media 126. Rather, the data that appears to be written to the virtual cartridges is in fact stored in the storage system's files as data in a randomly accessible disk format. Metadata is used to link the stored data to virtual cartridges so that the backup/restore application can read and write data in cartridge format.

Thus, in broad overview of one preferred embodiment, user and/or system data (referred to herein as “file data”) is received by the storage system 170 from the host computer 120 and is stored on the disk array or arrays that make up the backup storage media 126. The software 150 (see FIG. 4) and/or hardware of the storage system writes this file data to the backup storage media 126 in the form of system files, as described in more detail below. Metadata is extracted as the data files are backed up by the storage system controller, in order to keep track of the attributes of the user and/or system files that are backed up. For example, such metadata for each file may include the file name, the date of creation or last modification of the file, any encryption information relating to the file, and other information. In addition, for each file, metadata that links the file to a virtual cartridge may be created by the storage system. Using such metadata, the software provides the host computer with an emulation of tape cartridges. However, as described below, the file data is not actually stored in a tape format but rather in the system files. Storing the data in system files, rather than in a sequential cartridge format, may be advantageous in that it allows fast, efficient, random access to individual files without the need to scan through sequential data to find a particular file.

As described above, according to one embodiment, file data (i.e., user and/or system data) is stored on the backup storage media as system files, each system file including a header and data, the data being the actual user and/or system files. The header 202 of each system file 200 includes a tape directory 206 that contains the metadata linking the user and/or system files to virtual cartridges. The term “metadata” as used herein refers not to user or system file data, but to data that describes attributes of the actual user and/or system data. According to one example, the tape directory may define, down to the byte level, the layout of data on the virtual cartridges.

In one embodiment, the tape directory 206 has a table structure, as illustrated in FIG. 6. The table includes a column 220 for the type of information stored (e.g., data, file marker (FM), etc.), a column 222 for the size, in bytes, of the disk blocks used, and a column 224 that counts the number of disk blocks in which the file data is stored. The tape directory thus allows the controller to have random (as opposed to sequential) access to any data file stored on the backup storage media 126. For example, referring to FIG. 6, the data file 226 can be quickly located on the virtual tape because the tape directory indicates that the data of file 226 begins one block from the beginning of the system file 200. This one block has no size because it corresponds to a file marker (FM). File markers are not stored in the system file; that is, a file marker corresponds to zero data. The tape directory includes file markers because they are used by conventional tapes: the backup/restore application therefore writes file markers along with data files and expects to see file markers when viewing a virtual cartridge. File markers are accordingly tracked in the tape directory. However, file markers do not represent any data and are therefore not stored in the data section of the system file. Thus, the data of file 226 begins at the beginning of the data section of the system file, indicated by arrow 205 (see FIG. 5), and is 1024 bytes in length (i.e., one disk block of 1024 bytes in size). It should be appreciated that other file data may be stored with a block size other than 1024 bytes, depending on the amount of data, i.e., the size of the data file. For example, larger data files may be stored using larger data block sizes for efficiency.
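
The lookup that the tape directory makes possible can be illustrated with a minimal sketch. The names `DirEntry` and `locate`, and the in-memory list layout, are hypothetical and are not taken from this disclosure; the sketch only assumes that each directory row carries the three columns described above (type, block size in bytes, block count) and that file markers contribute zero bytes to the data section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirEntry:
    kind: str         # "DATA" or "FM" (file marker); hypothetical labels
    block_size: int   # bytes per disk block; 0 for a file marker
    block_count: int  # number of blocks of that size


def locate(directory, index):
    """Return (offset, length) in the data section for directory[index].

    The offset is the total size of all earlier entries. File markers
    have a block size of 0, so they are tracked but occupy no space.
    """
    offset = sum(e.block_size * e.block_count for e in directory[:index])
    entry = directory[index]
    return offset, entry.block_size * entry.block_count


tape_directory = [
    DirEntry("FM", 0, 1),       # leading file marker, zero data
    DirEntry("DATA", 1024, 1),  # one 1024-byte block, as in the example above
    DirEntry("FM", 0, 1),
    DirEntry("DATA", 4096, 3),  # a larger file stored with larger blocks
]

print(locate(tape_directory, 1))  # (0, 1024)
print(locate(tape_directory, 3))  # (1024, 12288)
```

Because the offset is computed from the directory alone, a file can be found without scanning the data that precedes it, which is the random-access property described above.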

In one example, the tape directory may be included in a “file descriptor” that is associated with each data file backed up onto the storage system. The file descriptor contains metadata relating to the data file 204 stored on the storage system. In one embodiment, the file descriptor may be implemented in accordance with a standardized format, such as the tape archive (tar) format used by many UNIX-based systems. Each file descriptor may include information such as the name of the corresponding user file, the date on which the user file was created or modified, the size of the user file, and any access restrictions on the user file. Additional information stored in the file descriptor may further include information describing the directory structure from which the data was copied. Thus, as described in more detail below, the file descriptor may contain searchable metadata about the corresponding data file.
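
As a rough illustration of the kind of per-file metadata a tar-style descriptor carries, the following sketch uses Python's standard `tarfile` module to write one member into an in-memory archive and read its header fields back. The file name, timestamp, and permission bits are invented for the example, and the helper names `make_archive` and `read_descriptors` are hypothetical; nothing here is claimed to be the descriptor layout actually used by the storage system.

```python
import io
import tarfile

PAYLOAD = b"quarterly report"


def make_archive():
    """Build a small in-memory tar archive with one member."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name="home/alice/report.txt")
        info.size = len(PAYLOAD)
        info.mtime = 1200441600   # modification time, seconds since epoch
        info.mode = 0o640         # access restrictions (permission bits)
        tar.addfile(info, io.BytesIO(PAYLOAD))
    return buf.getvalue()


def read_descriptors(archive_bytes):
    """Extract searchable metadata from each tar header."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as tar:
        return [
            {"name": m.name, "size": m.size, "mtime": m.mtime, "mode": m.mode}
            for m in tar.getmembers()
        ]


descriptors = read_descriptors(make_archive())
print(descriptors[0]["name"])  # home/alice/report.txt
```

The member name includes its directory path, which corresponds to the directory-structure information mentioned above.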

From the point of view of the backup/restore application, any virtual cartridge may contain a plurality of data files and their corresponding file descriptors. From the point of view of the storage system software, the data files are stored in system files that may be linked to, for example, a particular backup job. For example, a backup executed by one host computer at a particular time may generate one system file, which may correspond to one or more virtual cartridges. Virtual cartridges may thus be of any size and may grow dynamically as more user files are stored on them.

Referring again to FIG. 2, as described above, the storage system 170 may include a synthetic full backup software application 240. In one embodiment, the host computer 120 backs up data onto the emulated media 134, forming one or more virtual cartridges. In some computing environments, a “full backup,” i.e., a backup copy of all data stored on the primary storage system of the network (see FIG. 1), may be performed periodically (e.g., weekly). This process is typically very lengthy because of the large amount of data to be copied. Therefore, in many computing environments, additional backups, termed incremental backups, may be performed between consecutive full backups, for example daily. An incremental backup is a process in which only data that has changed since the last backup (whether incremental or full) is backed up. Typically, this changed data is backed up on a file basis, even though most of the data in a file has often not changed. Incremental backups are therefore usually much smaller, and thus much faster to perform, than full backups. It should be appreciated that, although many environments typically perform a full backup once a week and incremental backups daily during that week, there is no requirement that such time frames be used. Some environments may, for example, require incremental backups several times a day. The principles of the invention apply to any environment that uses full backups (and, optionally, incremental backups), regardless of how often they are performed. Performing full and/or incremental backups frequently may result in a large amount of redundant data being stored in the storage system 170. To alleviate the burden associated with this redundant data, the storage system 170 may employ the data deduplication systems and processes described further below.

During a full backup procedure, the host computer may create one or more virtual cartridges containing the backed-up data, which comprises a plurality of data files. For clarity, the following description assumes that a full backup generates only one virtual cartridge. However, it should be appreciated that a full backup may generate more than one virtual cartridge, and that the principles of the invention apply to any number of virtual cartridges.

According to one embodiment, a method is provided for creating a synthetic full backup data set from one existing full backup data set and one or more incremental backup data sets. This method may obviate the need to perform periodic (e.g., weekly) full backups, thereby saving the user considerable time and network resources. Furthermore, as is known to those skilled in the art, restoring data based on a full backup and one or more incremental backups can be a time-consuming process. For example, if the latest version of a file exists in an incremental backup, the backup/restore application will typically restore the file based on the last full backup and then apply the changes from the incremental backup. Providing a synthetic full backup may therefore have the additional advantage of allowing the backup/restore application to restore data files more quickly, based on the synthetic full backup alone, without the need to perform multiple restores from a full backup and one or more incremental backups. The phrase “latest version” as used herein refers generally to the most recent copy of a data file (i.e., the most recent time the data file was saved), whether or not the file has a new version number. The term “version” is used generally herein to refer to copies of the same file that may have been modified in some way or saved multiple times.

Referring to FIG. 7, a schematic representation of a synthetic full backup procedure is illustrated. The host computer 120 may execute a full backup 230 at a first point in time, for example on a weekend. The host computer 120 may then execute subsequent incremental backups 232a, 232b, 232c, 232d, and 232e, for example on each day during the week. The storage system 170 may then create a synthetic full backup data set 234, as described below.

According to one embodiment, the storage system 170 may include a software application referred to herein as the synthetic full backup application 240 (see FIG. 3). The synthetic full backup application 240 may be run on the storage system controller 122 (see FIG. 2) or on the host computer 120. The synthetic full backup application includes the software commands and interfaces necessary to create the synthetic full backup data set 234. In one example, the synthetic full backup application may perform a logical merge of the metadata representations of each of the full backup data set 230 and the incremental backup data sets 232 in order to generate a new virtual cartridge that contains the synthetic full backup data set 234.

For example, referring to FIG. 8, the existing full backup data set may include user files F1, F2, F3, and F4. A first incremental backup data set 232a may include user file F2′, a modified version of F2, and F3′, a modified version of F3. A second incremental backup data set 232b may include user file F1′, a modified version of F1; F2″, a further modified version of F2; and a new user file F5. The synthetic full backup data set 234, formed from a logical merge of the full backup data set 230 and the two incremental data sets 232a and 232b, therefore contains the latest version of each of the user files F1, F2, F3, F4, and F5. As shown in FIG. 8, the synthetic full backup data set thus contains user files F1′, F2″, F3′, F4, and F5.
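
The merge in this example can be sketched in a few lines. This is only an illustration under simplifying assumptions: each backup data set is modeled as a mapping from a file's identity to the version it holds, incremental sets are applied oldest first, and the function name `synthesize` is hypothetical. ASCII `'` and `''` stand in for the prime marks.

```python
def synthesize(full, incrementals):
    """Logically merge a full backup with incrementals, oldest first.

    Later data sets override earlier ones, so the result holds the
    latest version of every file seen in any data set.
    """
    latest = dict(full)
    for inc in incrementals:
        latest.update(inc)
    return latest


full_230 = {"F1": "F1", "F2": "F2", "F3": "F3", "F4": "F4"}
inc_232a = {"F2": "F2'", "F3": "F3'"}
inc_232b = {"F1": "F1'", "F2": "F2''", "F5": "F5"}

synthetic_234 = synthesize(full_230, [inc_232a, inc_232b])
print(sorted(synthetic_234.values()))
# ["F1'", "F2''", "F3'", 'F4', 'F5']
```

Only metadata is merged here; no file contents are copied, which is consistent with the description of a merge over metadata representations.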

Referring again to FIGS. 3 and 4, the file system software 146 may create a logical metadata cache 242 that stores metadata relating to each user file stored on the emulated media 134. It should be appreciated that the logical metadata cache is not required to be a physical data cache, but may instead be a searchable collection of data stored on the storage media 126. In another example, the logical metadata cache 242 may be implemented as a database. Where the metadata is stored in a database, conventional database commands (e.g., SQL commands) may be used to perform the logical merge of the full backup data set and the one or more incremental backup data sets to create the synthetic full backup data set.
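
A database-backed version of the logical merge might look like the following sketch, which uses Python's built-in `sqlite3` module. The table name `files`, its columns, and the use of an increasing `backup_seq` number to order data sets are all assumptions made for illustration; the disclosure does not specify a schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (backup_seq INTEGER, name TEXT, version TEXT)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?)",
    [
        # backup_seq 0: full backup; 1 and 2: incremental backups
        (0, "F1", "F1"), (0, "F2", "F2"), (0, "F3", "F3"), (0, "F4", "F4"),
        (1, "F2", "F2'"), (1, "F3", "F3'"),
        (2, "F1", "F1'"), (2, "F2", "F2''"), (2, "F5", "F5"),
    ],
)

# For each file name, keep the row from the most recent backup that has it.
LATEST_SQL = """
SELECT f.name, f.version
FROM files AS f
JOIN (SELECT name, MAX(backup_seq) AS seq FROM files GROUP BY name) AS m
  ON f.name = m.name AND f.backup_seq = m.seq
ORDER BY f.name
"""

synthetic_rows = conn.execute(LATEST_SQL).fetchall()
print(synthetic_rows)
# [('F1', "F1'"), ('F2', "F2''"), ('F3', "F3'"), ('F4', 'F4'), ('F5', 'F5')]
```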

In another embodiment, a portion of the metadata may be stored in a database and another portion may be stored in storage system files. For example, backup data set metadata, including the name of the backup data set and the data objects it comprises, may be held in a conventional database, while metadata specific to the data objects (for example, where the data objects are data files, the data file size, security information, and location in primary storage) may be held in storage system files. Storing metadata in this way allows frequently queried data to be retrieved flexibly from a conventional database, and promotes system scalability by allowing less frequently queried data to be stored more quickly in storage system files.

As described above, each data file stored on the emulated media 134 may include a file descriptor containing metadata relating to the data file, including the location of the file on the backup storage media 126. In one embodiment, the backup/restore application running on the host computer 120 stores data on the emulated media 134 in a streaming tape format. An example of a data structure 250 representing this tape format is illustrated in FIG. 9. As described above, the system file data structure includes headers, which may contain information about the data files, such as the file descriptor for the data file, the dates of creation and/or modification of the file, security information, the directory structure of the host system from which the file was obtained, and other information linking the file to a virtual cartridge. These headers are associated with the data 254, which is the actual user and system files that have been backed up (copied) from the host computer, the primary storage system, and so on. The system file data structure may also optionally include pads 256 that allow the next header to be properly aligned to a block boundary.
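
The size of such a pad follows from simple modular arithmetic. The sketch below assumes a fixed block size and a hypothetical helper name `pad_length`; the 512-byte block and the header and data sizes are example values, not figures from this disclosure.

```python
def pad_length(record_length, block_size):
    """Bytes of padding needed so the next header starts on a block boundary."""
    return (-record_length) % block_size


BLOCK = 512                      # assumed block size for the example
record = 512 + 1300              # a 512-byte header followed by 1300 bytes of data
pad = pad_length(record, BLOCK)

print(pad)                       # 236
print((record + pad) % BLOCK)    # 0, so the next header is block aligned
```

A record that already ends on a block boundary needs no pad, which matches the description of the pad as optional.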

As shown in FIG. 9, in one example, the header data is located in the logical metadata cache 242 to permit rapid searching of, and random access to, the otherwise sequential tape data format. Use of the logical metadata cache, implemented using the file system software 146 on the storage system controller 122, allows the linear, sequential tape data format stored on the emulated media 134 to be translated into the random-access data format stored on the physical disks that make up the backup storage media 126. The logical metadata cache 242 stores the headers 252, which include the file descriptors for the data files; security information that may be used to control access to the data files, as described in more detail below; and pointers 257 to the actual locations of the data files on the virtual cartridges and the backup storage media 126. In one embodiment, the logical metadata cache stores data relating to all of the data files backed up in the full backup data set 230 and each of the incremental data sets 232.

According to one embodiment, the synthetic full backup application software 240 uses the information stored in the logical metadata cache to create a synthetic full backup data set. This synthetic full backup data set is then linked to a synthetic virtual cartridge created by the synthetic full backup application 240. To the backup/restore application, the synthetic full backup data set appears to be stored on this synthetic virtual cartridge. As described above, the synthetic full backup data set may be created by performing a logical merge of the existing full backup data set and the incremental backup data sets. This logical merge may include comparing each of the data files contained in the existing full backup data set and in each of the incremental backup data sets, and creating a composite of the latest version of each user file, as described above with reference to FIG. 8.

According to one embodiment, the synthetic virtual cartridge 260 includes pointers that point to the locations of data files on other virtual cartridges, specifically the virtual cartridges that contain the existing full backup data set and the incremental backup data sets, as shown in FIG. 10. Considering the example given above with respect to FIG. 8, the synthetic virtual cartridge 260 includes pointers 266 that point (as indicated by arrows 268) to the location of user file F4 in the existing full backup data set on virtual cartridge 262 (because the existing full backup data set contained the latest version of F4) and to the location of, for example, user file F3′ in the incremental data set 232a on virtual cartridge 264.

The synthetic virtual cartridge also includes a list 270 containing the identification numbers (and optionally the names) of all the virtual cartridges that contain data to which the pointers 266 point. This dependent cartridge list 270 may be important for reference, such as keeping track of where the actual data is stored, and for preventing the dependent virtual cartridges from being erased. In this embodiment, the synthetic full backup data set does not contain any actual user files, but rather a set of pointers that indicate the locations of the user files on the backup storage medium 126. It may therefore be desirable to prevent the actual user files (stored on other virtual cartridges) from being deleted. This may be accomplished in part by keeping a record (the dependent cartridge list 270) of the virtual cartridges that contain the data and by protecting each of those virtual cartridges from being overwritten or deleted. The synthetic virtual cartridge may also include cartridge data 272, such as the size of the synthetic virtual cartridge and its location on the backup storage medium 126. In addition, the synthetic virtual cartridge may have an identification number and/or name 274.
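The relationship between the pointers 266 and the dependent cartridge list 270 can be illustrated with a minimal Python sketch. The class names, field names, and the `can_delete` helper are hypothetical and are not taken from the specification; the sketch only assumes that each pointer records which virtual cartridge holds the actual data.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FilePointer:
    """Hypothetical stand-in for a pointer 266."""
    file_name: str
    cartridge_id: int  # virtual cartridge that holds the actual file data
    offset: int        # assumed location of the file on that cartridge


@dataclass
class SyntheticCartridge:
    """Hypothetical stand-in for a synthetic virtual cartridge 260."""
    cartridge_id: int
    pointers: list = field(default_factory=list)

    def dependent_cartridges(self):
        # Derive the dependent cartridge list (list 270) from the pointers.
        return sorted({p.cartridge_id for p in self.pointers})


def can_delete(cartridge_id, synthetic_cartridges):
    # A cartridge is protected while any synthetic cartridge points into it.
    return all(
        cartridge_id not in s.dependent_cartridges()
        for s in synthetic_cartridges
    )
```

In this sketch the dependent list is derived from the pointers on demand; a real system might instead persist it alongside the cartridge data 272.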

According to another embodiment, the synthetic virtual cartridge may include a combination of pointers and actual stored user files. Referring to FIG. 11, in one example the synthetic virtual cartridge includes pointers 266 that point to the locations of data files (the latest versions, as described above with reference to FIG. 9) in the existing full backup data set 230 on virtual cartridge 262. The synthetic virtual cartridge may also include data 278 containing actual data files copied from the incremental data sets 232, as indicated by arrows 280. In this manner, the incremental backup data sets can be deleted after the synthetic full backup data set 276 has been created, thereby saving storage space. The synthetic virtual cartridge is relatively small because it contains pointers to all or some of the user files rather than copies of all the user files.

It should be appreciated that a synthetic full backup may include any combination of pointers and stored file data and is not limited to the examples given above. For example, a synthetic full backup may include pointers to data files for some files stored on certain incremental and/or full backups, and may include stored file data copied from other existing full and/or incremental backups. Alternatively, a synthetic full backup may be created, based on a previous full backup and its associated incremental backups, that contains no pointers but instead contains the latest version of the actual file data copied from the appropriate full and/or incremental backups.

In one embodiment, the synthetic full backup application software may include a differencing algorithm that enables it to compare the user and system file metadata for each of the existing full and incremental backup data sets in order to determine where the latest version of each data file is located. For example, the differencing algorithm may be used to compare the dates of creation and/or modification, the version numbers (if available), and so on, between different versions of the same data file in different backup sets, and to select the most recent version of the data file. However, users often open a user file and save it (thereby changing its modification date) without actually changing any of the data in the file. The system may therefore implement a more advanced differencing algorithm that analyzes the data inside the system or user files to determine whether the data has actually changed. Variations of such differencing algorithms, and other types of comparison algorithms, may be known to those skilled in the art. In addition, as described above, where the metadata is stored in a database format, database commands such as SQL commands may also be used to perform the logical merge. The invention may apply any such algorithm to ensure that the most recent or latest version of each user file is selected from all of the compared existing backup sets, so that the synthetic full backup data set is created properly.
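As a rough illustration of the two comparison strategies described above, the following Python sketch merges backup sets by modification date and then confirms, by comparing a content digest, that a newer-dated file actually differs. The `FileRecord` type, its fields, and the use of SHA-256 are assumptions made for this example; the specification does not prescribe a particular differencing algorithm.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    """Hypothetical per-file metadata plus content, for illustration only."""
    path: str
    modified: int     # modification time, e.g. seconds since the epoch
    backup_set: str   # name of the backup set holding this version
    content: bytes


def digest(record):
    # Assumed content check; any reliable content comparison would do.
    return hashlib.sha256(record.content).hexdigest()


def logical_merge(backup_sets):
    """Pick the latest genuinely changed version of each file.

    backup_sets is an iterable of lists of FileRecord, oldest set first.
    """
    latest = {}
    for records in backup_sets:
        for rec in records:
            current = latest.get(rec.path)
            if current is None:
                latest[rec.path] = rec
            elif rec.modified > current.modified and digest(rec) != digest(current):
                # Newer date AND different data: this is a real new version.
                latest[rec.path] = rec
    return latest
```

A file that was merely re-saved keeps its entry in the older backup set, so a synthetic full backup built from the result could keep pointing at the copy that already exists.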

As should be appreciated by those skilled in the art, the synthetic full backup application enables full backup data sets to be created and made available without requiring the host computer to perform a physical full backup. Not only does this avoid burdening the host computer with the processor overhead of transferring the data to the backup storage system, but in embodiments where the synthetic full backup application runs on the storage system, it also significantly reduces the use of network bandwidth. As shown in FIG. 7, further synthetic full backup data sets may be created using a first synthetic full backup data set 234 and subsequent incremental backup data sets 236. This may provide a significant time advantage, in that files or objects that are not frequently modified are not frequently copied. Instead, the synthetic full backup data sets may maintain pointers to files that were copied only once.

Some aspects in accordance with the invention are directed to a scalable deduplication system that removes redundant data from data objects. For example, according to some embodiments, the deduplication system is configured to manage data deduplication using pre-processed metadata included in the data. More particularly, embodiments may direct data to deduplication domains based on the presence or absence of particular metadata values within the data to be deduplicated. Each of these deduplication domains may employ a particular deduplication technique that is tailored to efficiently deduplicate a particular type of data.

For example, FIG. 14 presents a block diagram of a deduplication director 1400 that is specially configured to provide scalable deduplication services. The deduplication director 1400 may be implemented as software, hardware, or a combination thereof on a variety of computer systems. For example, according to one embodiment, the deduplication director 1400 is implemented as part of the storage system controller 122 described above with respect to FIG. 3. The particular configuration of the deduplication director 1400 shown in FIG. 14 is used for illustration purposes only and is not intended to be limiting, since embodiments of the invention may be architected in a variety of configurations without departing from the scope of the invention. Although some of the examples described herein focus on embodiments having a single deduplication director 1400, other embodiments may include two or more deduplication directors without departing from the scope of the invention.

Referring to FIG. 14, the deduplication director 1400 includes a data interface 1402, a director engine 1404, a deduplication domain database 1406, deduplication domains 1408, 1410 and 1412, and a deduplication database interface 1414. In the illustrated example, the data interface 1402 includes functionality, such as executable code, data, data structures or objects, configured to exchange (that is, provide and receive) information with one or more data sources. Also in the illustrated example, the data interface 1402 can communicate bidirectionally with the director engine 1404.

As shown, the director engine 1404 can exchange a variety of information with the data interface 1402, the deduplication domain database 1406, and the deduplication domains 1408, 1410 and 1412. The deduplication domain database 1406, in turn, may communicate data with both the director engine 1404 and the deduplication database interface 1414. The deduplication database interface 1414 has functionality configured to exchange information with various external entities. These external entities may include, among others, users and/or systems. In the illustrated example, the deduplication database interface 1414 can also exchange information with the deduplication domain database 1406. Each of the deduplication domains 1408, 1410 and 1412 includes functionality configured to exchange information with both the director engine 1404 and various external entities. For example, in one embodiment, the deduplication domains 1408, 1410 and 1412 can exchange information with a data storage medium, such as the backup storage medium 126 described with respect to FIG. 3.

Information may flow between the elements, components and subsystems described herein using any technique. Such techniques include, for example, passing the information over a network using a standard protocol such as TCP/IP, passing the information between modules in memory, and passing the information by writing to a file, a database, or some other non-volatile storage device. In addition, pointers or other references to information may be transmitted and received in place of, or in addition to, copies of the information. Conversely, the information may be exchanged in place of, or in addition to, pointers or other references to the information. Other techniques and protocols for communicating information may be used without departing from the scope of the invention.

In the example shown in FIG. 14, the deduplication domain database 1406 includes functionality configured to store and retrieve information describing the attributes of one or more deduplication domains. Examples of this information may include, for each deduplication domain, the amount of computing resources that belong to or are allocated to the deduplication domain, the particular deduplication method used by the deduplication domain, and one or more data object characteristics associated with the deduplication domain. The deduplication domain database 1406 may also hold artifacts associated with the deduplication methods used by the deduplication domains, such as hash tables.

The deduplication domain database 1406 may take the form of any logical construction capable of storing information on a computer readable medium, including flat files, indexed files, hierarchical databases, relational databases, or object oriented databases. The data may be modeled using unique and foreign key relationships and indexes. The unique and foreign key relationships and indexes may be established between the various fields and tables to ensure both data integrity and data interchange performance.
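One possible relational layout for such a database is sketched below using Python's standard `sqlite3` module. The table names, column names, and sample rows are hypothetical; the specification names the kinds of attributes stored (allocated resources, deduplication method, associated data object characteristics) but not a schema.

```python
import sqlite3

# In-memory database for illustration; a real system would persist this.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce foreign key relationships
conn.executescript("""
CREATE TABLE domain (
    id            INTEGER PRIMARY KEY,
    name          TEXT NOT NULL UNIQUE,   -- unique key
    method        TEXT NOT NULL,          -- deduplication method in use
    storage_bytes INTEGER NOT NULL        -- allocated computing resource
);
CREATE TABLE characteristic (
    domain_id INTEGER NOT NULL REFERENCES domain(id),  -- foreign key
    attr      TEXT NOT NULL,
    val       TEXT NOT NULL,
    UNIQUE (domain_id, attr, val)
);
CREATE INDEX idx_characteristic_attr_val ON characteristic(attr, val);
""")
conn.execute("INSERT INTO domain VALUES (1408, 'mail', 'hash-fingerprint', 1000000)")
conn.execute(
    "INSERT INTO characteristic VALUES (1408, 'application', 'Microsoft Outlook')"
)


def domains_for(attr, val):
    """Return the names of domains associated with a characteristic."""
    rows = conn.execute(
        "SELECT d.name FROM domain d "
        "JOIN characteristic c ON c.domain_id = d.id "
        "WHERE c.attr = ? AND c.val = ? ORDER BY d.name",
        (attr, val),
    )
    return [row[0] for row in rows]
```

The foreign key keeps characteristics from referring to a domain that does not exist (data integrity), while the index on `(attr, val)` supports fast lookup when matching incoming data objects.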

In the illustrated example, the data interface 1402 includes functionality configured to exchange information, in a variety of forms and formats, with various data sources. These data sources may include any provider of information that will be subject to deduplication processing, such as the primary storage device 106 described above with respect to FIG. 1. The data interface 1402 can receive, among other data formats, discrete blocks of data, continuous data streams, and data streams multiplexed from multiple storage locations. In addition, the data interface 1402 can receive data inline, that is, while a data storage system that includes the data interface 1402 is receiving the data to be deduplicated and stored, or offline, that is, after the data storage system has already stored the data to be deduplicated.

In the illustrated example, the deduplication domains 1408, 1410 and 1412 each include one or more individual deduplication domains. A deduplication domain may include software and/or hardware with functionality configured to perform deduplication processing on data objects. Each deduplication domain may include dedicated data storage. In the illustrated example, each deduplication domain may be associated with one or more characteristics common to several data objects. In addition, in this example, each deduplication domain can employ a particular deduplication method. These traits allow individual deduplication domains to provide a highly effective deduplication environment for their associated data objects.

For example, according to one embodiment, the deduplication domains 1408, 1410 and 1412 each employ a content-aware deduplication process, such as the process 1200 described below. In other embodiments, by contrast, the deduplication domain 1408 may use a hash fingerprinting process, the deduplication domain 1410 may use a pattern recognition process, and the deduplication domain 1412 may employ the process 1200. Thus, embodiments are not limited to a particular deduplication method or arrangement of deduplication methods.

In various embodiments, the director engine 1404 includes functionality configured to direct a data object to a deduplication domain that is associated with one or more characteristics of the data object. According to one embodiment, these characteristics include metadata associated with the data object. In the illustrated embodiment, the director engine 1404 can receive data objects from the data interface 1402. The director engine 1404 can select which of the deduplication domains 1408, 1410 and 1412 is suitable for deduplicating a received data object. As shown, the director engine 1404 can also direct the data object to the selected deduplication domain. The director engine 1404 also has functionality configured to evaluate the results of the deduplication activities performed by the deduplication domains 1408, 1410 and 1412 and, based on this evaluation, to consolidate redundant data that spans several deduplication domains into a single deduplication domain, thereby saving additional computing resources.

In various embodiments, the director engine 1404 includes functionality configured to receive data from the data interface 1402 in a variety of forms and formats, including discrete data blocks, data streams, and multiplexed data streams. In these embodiments, the director engine 1404 can extract pre-processed metadata from the received data. This metadata may include the types of information described above with respect to the logical metadata cache and thus, in some embodiments, may include, among other metadata, backup policy names, data source types, data source names, backup application names, operating system types, data types, backup types, file names, directory structure, and chronological information such as dates and times.

Further, in some embodiments, the director engine 1404 has functionality configured to identify alignment points within a data stream or multiplexed data stream based on the extracted metadata. In these embodiments, the director engine 1404 can partition the data stream or multiplexed data stream along these alignment points to create data objects. Also, in some embodiments, the director engine 1404 can associate metadata with the data objects. This associated metadata may include, among other things, the metadata used to create the data objects.

For example, according to one embodiment, the director engine 1404 can align a data stream into data objects based on the data objects being subsequent backups of a particular server. Similarly, in another embodiment, the director engine 1404 can align data objects that include files with the same file name and directory location. In yet another embodiment, the director engine 1404 can create data objects and associate metadata with them based on the policy that the backup/restore program executed to create the data objects, or based on what type of data, for example data created by an Oracle database, is included in the data objects.
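A minimal Python sketch of partitioning a stream along alignment points follows. It assumes, purely for illustration, that the stream has already been parsed into `(metadata, payload)` records and that an alignment point occurs wherever the value of a chosen metadata key (here the hypothetical key `"server"`) changes between consecutive records.

```python
from itertools import groupby


def partition_stream(records, key="server"):
    """Split a stream into data objects at alignment points.

    records: iterable of (metadata dict, payload bytes) in stream order.
    An alignment point is assumed wherever metadata[key] changes.
    """
    objects = []
    for value, group in groupby(records, key=lambda rec: rec[0][key]):
        payloads = [payload for _, payload in group]
        objects.append({
            # Associate the metadata used to create the object with it.
            "metadata": {key: value},
            "payload": b"".join(payloads),
        })
    return objects
```

Because `groupby` only groups consecutive records, interleaved (multiplexed) records from the same source would produce several data objects; demultiplexing first is left out of this sketch.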

According to one embodiment, the director engine 1404 has functionality configured to direct data objects by evaluating the metadata associated with the data objects. In this embodiment, the director engine 1404 can compare the metadata associated with a data object to the data object characteristics associated with the individual deduplication domains. If a match of sufficient quality is found, the director engine 1404 can forward the data object to the matching deduplication domain for further processing. According to one embodiment, the director engine 1404 can forward the data object either by providing a copy of the data object to the deduplication domain or by providing a reference to the data object, such as a pointer, to the deduplication domain. According to some embodiments, both the metadata associated with the data object and the data object characteristics associated with the deduplication domain are information regarding the content of the data object.

For example, in one embodiment, the metadata associated with the data object and the data object characteristic associated with the deduplication domain are the software application that created the data object, for example MICROSOFT OUTLOOK. In this example, when the director engine 1404 encounters data objects created by MICROSOFT OUTLOOK, it can direct those data objects to the deduplication domain associated with MICROSOFT OUTLOOK data objects. In other embodiments, the metadata and the data object characteristics may be other types of information.
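The matching step performed by engine 1404 might be sketched as follows. The scoring rule (fraction of a domain's characteristics found in the object's metadata), the `threshold` parameter standing in for "a match of sufficient quality", and the `default` fallback are all assumptions for this example rather than details from the specification.

```python
def match_score(metadata, characteristics):
    """Fraction of a domain's characteristics present in the metadata."""
    if not characteristics:
        return 0.0
    hits = sum(1 for k, v in characteristics.items() if metadata.get(k) == v)
    return hits / len(characteristics)


def select_domain(metadata, domains, threshold=1.0, default=None):
    """Pick the best-matching domain whose score reaches the threshold.

    domains: {domain name: {characteristic name: required value}}
    Returns `default` when no domain matches well enough.
    """
    best_name, best_score = default, 0.0
    for name, characteristics in domains.items():
        score = match_score(metadata, characteristics)
        if score >= threshold and score > best_score:
            best_name, best_score = name, score
    return best_name
```

For instance, with a domain keyed on `{"application": "Microsoft Outlook"}`, an object whose metadata carries that application name would be routed there, while an unmatched object would fall through to whatever default the caller supplies.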

According to some embodiments, the director engine 1404 includes functionality configured to further consolidate redundant data across multiple deduplication domains. In some embodiments, the director engine can evaluate the results of deduplication processing, and the artifacts associated with that processing, to determine redundant data that spans multiple deduplication domains. For example, in one embodiment, the director engine 1404 can periodically "scrub", or search, the hash tables associated with multiple deduplication domains that employ hash fingerprinting, looking for hash fingerprints that the deduplication domains may have in common. In this embodiment, the director engine 1404 can consolidate the storage of data that was processed by different deduplication domains but has these common fingerprints, by directing one or more of the deduplication domains to replace their copies of the redundant data with references to a single copy of the redundant data.
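The scrubbing step can be illustrated with a short Python sketch. It assumes each domain's hash table is a plain dictionary mapping a fingerprint to a storage location, and it represents a cross-domain reference as a `("ref", domain, location)` tuple; both representations are hypothetical.

```python
from collections import defaultdict


def find_shared_fingerprints(hash_tables):
    """Return {fingerprint: [domains]} for fingerprints in 2+ domains.

    hash_tables: {domain name: {fingerprint: storage location}}
    """
    owners = defaultdict(list)
    for domain, table in hash_tables.items():
        for fingerprint in table:
            owners[fingerprint].append(domain)
    return {fp: sorted(ds) for fp, ds in owners.items() if len(ds) > 1}


def consolidate(hash_tables):
    """Keep one copy per shared fingerprint; point the others at it."""
    for fingerprint, domains in find_shared_fingerprints(hash_tables).items():
        keeper = domains[0]  # arbitrary choice: first domain by name
        location = hash_tables[keeper][fingerprint]
        for other in domains[1:]:
            hash_tables[other][fingerprint] = ("ref", keeper, location)
```

A production system would also confirm that data with equal fingerprints is actually identical before discarding a copy, since distinct data can collide on a fingerprint; that check is omitted here.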

In other embodiments, the director engine 1404 can create new deduplication domains, or modify the configuration of existing deduplication domains, to consolidate the future processing of data related to the redundant data found by the scrubbing process described above. For example, in one embodiment, the director engine 1404 can shift the future processing of data objects that include data related to the redundant data from one deduplication domain to another by changing the data object characteristics associated with particular deduplication domains.

For example, in one embodiment, the director engine 1404 includes functionality configured to find metadata that is common to the data objects containing the redundant data found through scrubbing. Further, in this example, the director engine can determine, based on the common metadata, one or more data object characteristics that match the common metadata. The director engine 1404 can also associate these newly determined data object characteristics with a new or existing deduplication domain, namely the deduplication domain into which future processing is to be consolidated, by storing the association in the deduplication domain database 1406. Conversely, the director engine 1404 can interact with the deduplication domain database 1406 to disassociate one or more data object characteristics from existing deduplication domains, so that these deduplication domains do not receive data objects that include data related to the redundant data in the future. In this manner, the director engine 1404 can regulate the flow of data objects associated with the newly found common metadata into the deduplication domain associated with the newly determined data object characteristics.

In some embodiments, the director engine 1404 includes functionality configured to use additional information when directing data objects to a particular deduplication domain. For example, according to one embodiment, the director engine 1404 can detect that the storage dedicated to a particular deduplication domain has fallen below a threshold level of remaining capacity. In this case, the director engine 1404 can direct data objects to other deduplication domains, or can allocate additional storage capacity to the deduplication domain. In another embodiment, the director engine 1404 can direct a data object to a particular deduplication domain based on the amount of time remaining before the data object expires. For example, in this embodiment, the director engine 1404 can direct a data object with little remaining life to a deduplication domain with low processing overhead, regardless of how effective that deduplication domain is for the data object, because the data object will be erased from storage within a short period of time.
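The two kinds of additional information described above, remaining capacity and remaining lifetime, could be combined as in the following Python sketch. The parameter names, the default thresholds, and the per-domain `"overhead"` figure are assumptions for illustration, and the sketch implements only the redirect option, not the alternative of allocating more storage to the preferred domain.

```python
def choose_domain(preferred, domains, remaining_life_days,
                  min_free_fraction=0.1, short_life_days=7):
    """Adjust a routing decision using capacity and expiry information.

    domains: {name: {"free": bytes, "capacity": bytes, "overhead": cost}}
    """
    # Short-lived objects go to the cheapest domain, whatever its
    # deduplication effectiveness, since they will be erased soon anyway.
    if remaining_life_days <= short_life_days:
        return min(domains, key=lambda n: domains[n]["overhead"])

    def free_fraction(name):
        return domains[name]["free"] / domains[name]["capacity"]

    # Keep the preferred domain while it has enough remaining capacity.
    if free_fraction(preferred) >= min_free_fraction:
        return preferred

    # Otherwise redirect to the other domain with the most free space.
    others = [n for n in domains if n != preferred]
    return max(others, key=free_fraction) if others else preferred
```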

According to various embodiments, the deduplication database interface 1414 has functionality configured to exchange information with a variety of external entities. According to the illustrated embodiment, the deduplication database interface 1414 can present the user with various user interface metaphors that allow the user to create, modify, and delete deduplication domains such as deduplication domains 1408, 1410, and 1412. More particularly, when displaying a metaphor for creating a new deduplication domain, the deduplication database interface 1414 can present interface elements that allow the user to specify the characteristics of the data objects associated with the new deduplication domain. In addition, the deduplication database interface 1414 can provide interface elements that allow the user to specify the deduplication method to be employed by the new deduplication domain.

In other embodiments, the deduplication database interface 1414 has functionality configured to receive data from an external system, such as a backup/restore program, and to automatically configure deduplication domains, based on the received data, to process data arriving from that external system. For example, in some embodiments, the deduplication database interface 1414 can determine commonalities among the types of data objects that will be received in the future or are currently being received, and configure deduplication domains 1408, 1410, and 1412 to increase deduplication efficiency. In one embodiment, the deduplication database interface 1414 can determine, based on a backup policy supplied by the backup/restore program, the primary storage location of the data objects that will be received as a result of executing the backup policy. In this embodiment, the deduplication database interface 1414 can store a configuration of deduplication domains 1408, 1410, and 1412 based on this primary storage location information. In another embodiment, the configuration stored by the deduplication database interface 1414 can be based on the software application that created the data objects rather than on their storage location. Other embodiments may use other types of data to determine a suitable deduplication domain structure and configuration.

As described above, the deduplication director 1400 may remove redundant data from data objects using one of several deduplication methods. One particular deduplication technique that may be used by a deduplication domain is content-aware deduplication. FIG. 12 illustrates an exemplary content-aware process 1200 for deduplicating data from data objects according to one embodiment of the invention. FIG. 13 illustrates an enhanced referencing technique that yields additional processing efficiency when used with data deduplication. The deduplication process may be implemented using a single backup storage system or within a distributed storage system environment such as the grid environment described above.

In general, a system executing process 1200 may extract metadata associated with a set of data objects in order to identify the data objects that will undergo further deduplication processing steps, for example data objects that are likely to share duplicate data. The system may examine the data objects identified for additional processing to locate redundant data. Further, the system may construct copies of the identified data objects that point to a single copy of the redundant data and, optionally, verify the integrity of these copies. To reclaim the storage capacity occupied by the redundant data, the system may delete the originally identified data objects. Aspects and embodiments of the deduplication method and apparatus are described in more detail below.

With continued reference to FIG. 12, the data deduplication process 1200 begins at step 1202. At step 1204, the system identifies the data objects that will undergo further deduplication processing. In one embodiment, the system may identify data objects that are likely to contain redundant data. Various methods and metadata may be employed to make this identification. For example, in one embodiment, the physical location of a backup data object in primary storage may indicate that it is likely to share data with another backup data object. More particularly, if two backup data objects originate from the same primary storage device, for example a particular server, they may be identified as likely to contain copies of redundant data. Similarly, in another embodiment, two data objects may be identified as likely to contain redundant data if both were created by a particular software application. In yet another embodiment, whether the data objects were stored as part of a full or an incremental backup policy may indicate the likelihood of redundant data. Identifying the data objects that are likely to contain duplicate data increases the overall efficiency of process 1200, because scarce computing resources such as CPU cycles can be concentrated on the data objects that will benefit most from the removal of redundant data.

In another embodiment, the system may be configured to automatically include certain data objects in, or automatically exclude them from, further deduplication processing based on the metadata associated with those data objects. For example, the system may be configured to include data objects created by a particular software application in deduplication processing. Similarly, the system may be configured to include data objects backed up as part of a particular policy in further deduplication processing. Conversely, the system may be configured to exclude from further deduplication processing all data objects backed up by a particular policy and/or data objects with specific names. These configuration options allow system behavior to be tailored to the particular needs of any client environment, thereby increasing system efficiency, performance, and scalability.
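The include/exclude configuration described above can be sketched as a small rule set. The class and field names are hypothetical, and the choice that exclusions take precedence over inclusions is an assumption; the specification does not say how conflicting rules are resolved.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class ObjectMeta:
    name: str
    application: str
    policy: str


@dataclass(frozen=True)
class SelectionRules:
    include_applications: frozenset = frozenset()
    include_policies: frozenset = frozenset()
    exclude_policies: frozenset = frozenset()
    exclude_name_patterns: tuple = ()

    def selects(self, meta: ObjectMeta) -> bool:
        # Assumed precedence: an exclusion always wins over an inclusion.
        if meta.policy in self.exclude_policies:
            return False
        if any(fnmatchcase(meta.name, p) for p in self.exclude_name_patterns):
            return False
        return (meta.application in self.include_applications
                or meta.policy in self.include_policies)


rules = SelectionRules(
    include_applications=frozenset({"oracle"}),
    include_policies=frozenset({"weekly-full"}),
    exclude_policies=frozenset({"scratch"}),
    exclude_name_patterns=("*.tmp",),
)
```

Objects for which `selects` returns `False` would simply bypass the later steps of process 1200.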

At step 1206, the system executing process 1200 locates redundant data in the data objects identified for further deduplication processing. This analysis may be accomplished by using metadata and/or by inspecting the actual contents of the identified data objects. In one embodiment, data objects with similar metadata are assumed to contain the same data. For example, if two data objects are data files that share the same name, physical location in primary storage, and cyclic redundancy check (CRC), hash, or some other metadata generated during deduplication processing, the two data objects may be recorded as redundant. Using metadata to identify redundant data provides several advantages. In particular, it increases efficiency, because only the metadata of the data objects, rather than the data objects in their entirety, needs to be processed.
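As a rough sketch of the metadata-only comparison, objects can be grouped by a key built from their name, primary storage location, and CRC. The key layout and function names are illustrative assumptions; `zlib.crc32` stands in for whatever checksum the system computes during deduplication processing.

```python
import zlib
from collections import defaultdict


def metadata_key(name, primary_location, payload):
    """Build the comparison key once, when the object is ingested."""
    return (name, primary_location, zlib.crc32(payload))


def group_probable_duplicates(records):
    """records: iterable of (object_id, key). Returns groups of 2+ object ids.

    Only the keys are compared here; the payloads are never re-read.
    """
    groups = defaultdict(list)
    for object_id, key in records:
        groups[key].append(object_id)
    return [ids for ids in groups.values() if len(ids) > 1]


records = [
    ("a", metadata_key("report.doc", "srv1:/home", b"hello")),
    ("b", metadata_key("report.doc", "srv1:/home", b"hello")),
    ("c", metadata_key("report.doc", "srv1:/home", b"hello, world")),
    ("d", metadata_key("notes.txt", "srv2:/home", b"hello")),
]
duplicates = group_probable_duplicates(records)
```

Because a CRC can collide, a match here indicates probable rather than certain redundancy.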

In another embodiment, data objects may be compared bit by bit to guarantee redundancy before being recorded as redundant. Although this type of comparison may consume considerable computing resources, it also provides a strong guarantee that any data identified as redundant is in fact completely redundant. This approach to determining redundancy may be useful, for example, when handling data objects whose integrity is particularly important, such as financial information.
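A full comparison of this kind can be sketched as a chunked read over two binary streams. The function name and chunk size are arbitrary, and the sketch assumes file-like objects whose `read` returns a full chunk until end of data (regular files, `io.BytesIO`), which is not true of sockets or pipes.

```python
import io


def streams_identical(a, b, chunk_size=1 << 20):
    """Compare two binary streams chunk by chunk; stop at the first mismatch."""
    while True:
        chunk_a = a.read(chunk_size)
        chunk_b = b.read(chunk_size)
        if chunk_a != chunk_b:
            return False          # differing content or differing length
        if not chunk_a:
            return True           # both streams ended at the same point


same = streams_identical(io.BytesIO(b"abc" * 1000), io.BytesIO(b"abc" * 1000), 7)
```

Reading in chunks keeps memory use constant, but every byte of both objects is still read, which is the cost the text refers to.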

In yet another embodiment, a portion of the data contained in a data object is analyzed to establish the redundancy of the object as a whole. For example, certain software applications may confine modified data to particular locations within the data objects they modify, for example the beginning or the end of the object. Using this data distribution pattern, the system may therefore focus its deduplication processing on the portions of the data object that are more likely to be static, thereby increasing system efficiency.

Embodiments of the invention may employ a combination of these techniques to locate redundant data. More particularly, the system may apply a particular technique to a particular data object based on metadata such as that used, as described above, to identify data objects for further deduplication processing. This metadata may include, among other things, the location in primary storage, the policy under which the data object was backed up, and the software application associated with the data object. As with data object identification, the ability to tune how the system locates duplicate data increases system scalability and performance.

At step 1208, the system executing process 1200 may create deduplicated copies of the previously identified data objects that contain redundant data. These deduplicated copies may contain little or no redundant data. In one embodiment, the identified data objects may include, for example, virtual cartridges. In this case, the system may create one or more deduplicated virtual cartridges that, when fully resolved, contain all of the data in the identified virtual cartridges. Like the synthetic virtual cartridges described above, these deduplicated virtual cartridges may include both data objects and pointers to data objects.

During creation of these deduplicated copies, the system may store a copy of the duplicate data within one particular data object and create and/or modify pointers within other data objects so that they reference the stored duplicate data. The system may follow various methodologies when storing duplicate data and pointers. In one embodiment, the duplicate data is housed in the oldest data object, and pointers identifying the location of the duplicate data are stored in the younger data objects that contain the duplicate data. This technique, referred to in the art as backward referencing, is common when hash indexes are built to summarize data objects for deduplication processing.

In another embodiment, the duplicate data is housed in the youngest data object, and pointers identifying the location of the duplicate data are stored in the older data objects that contain the duplicate data. This technique may be termed forward referencing. Forward referencing improves restore performance when data is restored from the most recent backup, because fewer pointer dereferences are needed to resolve all of the data contained in the backup data object. This performance gain is particularly beneficial because the latest, that is, the youngest, backup is usually the one used when data must be restored to primary storage.

FIGS. 13A, 13B, and 13C illustrate forward and backward referencing as described above. FIG. 13A shows backup data objects 1302 and 1304 before deduplication processing. For purposes of this illustration, assume that backup data object 1302 was stored before backup data object 1304. Backup data object 1302 includes a unique data portion 1306 and a redundant data portion 1310A. Backup data object 1304 includes a unique data portion 1308 and a redundant data portion 1310B.

FIG. 13B shows deduplicated copies of data objects 1302 and 1304 under a forward referencing scheme. Data object 1304, the more recently stored of the two, includes a copy of the redundant data portion 1310B. Data object 1302, the earlier stored of the two, includes a pointer 1312 that points to the redundant data portion 1310B. Thus, after the deduplicated copies are created, the younger data object contains a copy of the redundant data, and the older data object contains a pointer to the redundant data in the younger data object.

FIG. 13C shows deduplicated copies of data objects 1302 and 1304 under a backward referencing scheme. Data object 1302, the earlier stored of the two, includes a copy of the redundant data portion 1310A. Data object 1304, the more recently stored of the two, includes a pointer 1312 that points to the redundant data portion 1310A. Thus, after the deduplicated copies are created, the older data object contains a copy of the redundant data, and the younger data object contains a pointer to the redundant data in the older data object.
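The two schemes of FIGS. 13B and 13C can be contrasted in a small sketch. Data objects are modeled here as lists of byte segments, and `Pointer`, `deduplicate_pair`, and `resolve` are hypothetical names; real systems would locate redundant regions rather than compare whole segments for equality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pointer:
    object_id: str   # object that holds the single stored copy
    segment: int     # index of the segment inside that object


def deduplicate_pair(older, newer, scheme):
    """older/newer: (object_id, [bytes, ...]). Returns {object_id: segments}."""
    old_id, old_segs = older
    new_id, new_segs = newer
    old_out, new_out = list(old_segs), list(new_segs)
    for i, seg in enumerate(old_segs):
        if seg in new_segs:
            j = new_segs.index(seg)
            if scheme == "forward":      # newest object keeps the data
                old_out[i] = Pointer(new_id, j)
            elif scheme == "backward":   # oldest object keeps the data
                new_out[j] = Pointer(old_id, i)
            else:
                raise ValueError(f"unknown scheme: {scheme}")
    return {old_id: old_out, new_id: new_out}


def resolve(store, object_id):
    """Rebuild an object's full content, dereferencing pointers as needed."""
    out = []
    for seg in store[object_id]:
        if isinstance(seg, Pointer):
            seg = store[seg.object_id][seg.segment]
        out.append(seg)
    return b"".join(out)


older = ("1302", [b"unique-1306;", b"redundant-1310;"])
newer = ("1304", [b"unique-1308;", b"redundant-1310;"])
forward = deduplicate_pair(older, newer, "forward")
backward = deduplicate_pair(older, newer, "backward")
```

Under the forward scheme, resolving the newer object `"1304"` touches no pointers at all, which is the restore-time advantage the text attributes to forward referencing.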

At step 1210, the system may compare the deduplicated copies with the previously identified data objects to ensure that data integrity has been preserved. This comparison may require dereferencing data object pointers and may include a bit-by-bit comparison of the data contained in the data objects. After performing this integrity check, in one embodiment, the system may swap the pointers that identify the deduplicated copies and their respective previously identified data objects, so that the deduplicated data objects become the primary data objects and the previously identified data objects can be deleted without disrupting the integrity of any data objects that reference them. The system may also make other adjustments to the metadata to ensure that it accurately reflects the characteristics of the deduplicated copies.
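The ordering that matters in this step, verify first and only then swap, can be sketched as below. The catalog layout, the `read_resolved` callback, and the function name are assumptions made for illustration.

```python
def verify_and_swap(catalog, name, dedup_id, read_resolved):
    """Make the deduplicated copy primary only if it resolves to the same bytes.

    catalog:       maps a logical object name to the id of its primary copy
    read_resolved: callable returning an object's content with all pointers
                   dereferenced (hypothetical helper)
    Returns the id of the superseded original, which is then safe to delete,
    or None if the integrity check failed and nothing was changed.
    """
    original_id = catalog[name]
    if read_resolved(dedup_id) != read_resolved(original_id):
        return None
    catalog[name] = dedup_id
    return original_id


store = {"orig": b"payload", "dedup": b"payload", "bad": b"paylaod"}
catalog = {"backup-1": "orig"}
reclaimable = verify_and_swap(catalog, "backup-1", "dedup", store.__getitem__)
```

Returning the superseded identifier lets a later step reclaim its storage without re-deriving which object was replaced.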

At step 1212, the storage capacity used by the previously identified data objects is reclaimed for use by other data objects. In one embodiment, this may be accomplished simply by deleting the previously identified data objects. At step 1214, process 1200 ends.

Process 1200 depicts one preferred sequence of events. Other actions may be added, or the order of actions in process 1200 may be changed, without departing from the spirit of the invention. In one embodiment, process 1200 may be executed for each data object contained in the backup storage system. In another embodiment, the system may execute process 1200 on a subset of the data objects in the backup storage system.

Process 1200 may be executed on demand or scheduled as a one-time or recurring process. A further subset of process 1200 may be executed if the space reclaimed by deduplication meets or exceeds a certain threshold. For example, in one embodiment, process 1200 may run only if deduplication will free at least a specified number of terabytes (for example, 50) or a specified percentage (for example, 25%) of the utilized backup storage capacity. When implemented as event-driven computing operations, the acts that make up process 1200 may be executed in a distributed computing environment such as a grid environment.
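The threshold test can be sketched as a single predicate. The 50-terabyte and 25% defaults come from the examples in the text; the function name, the decimal definition of a terabyte, and the assumption that an estimate of reclaimable space is available before the run are illustrative.

```python
TB = 10 ** 12  # decimal terabyte; the text does not say which definition applies


def should_run(estimated_reclaim_bytes, used_bytes,
               min_bytes=50 * TB, min_fraction=0.25):
    """Run deduplication only if it is expected to free enough space."""
    if used_bytes <= 0:
        return False
    return (estimated_reclaim_bytes >= min_bytes
            or estimated_reclaim_bytes / used_bytes >= min_fraction)
```

For instance, an estimated 10 TB reclaimed out of 30 TB used passes on the percentage criterion even though it is well below the absolute one.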

In summary, embodiments of the deduplication process 1200 reduce the storage capacity required to maintain copies of backup data and, in turn, the amount of electronic media required to store that data. Further, embodiments of the deduplication process 1200 may use computing resources efficiently by using metadata to optimize deduplication processing. Finally, by storing deduplicated data under a forward referencing scheme, deduplication can improve the performance of commonly used data restore functionality.

Various embodiments include processes for a computer system to provide a scalable deduplication service. FIG. 15 illustrates an example of one such process 1500, which includes the acts of receiving data, selecting a deduplication domain to process the data, and directing the data to the selected deduplication domain. Process 1500 begins at 1502.

At act 1504, the computer system receives data to be deduplicated. As described above, according to one embodiment, the data may take a variety of forms, including blocks of data, data streams, and multiplexed data streams. In the example shown in FIG. 14, the data is received by the data interface 1402 and provided to the director engine 1404 for further processing. According to this example, the director engine 1404 receives the data and partitions it into one or more data objects based on preprocessed metadata contained in the data stream. Further, in this example, the director engine 1404 associates the metadata with the data objects it creates.

At act 1506, the computer system selects a deduplication domain to process the received data. According to the example shown in FIG. 14, the director engine 1404 selects one of deduplication domains 1408, 1410, and 1412 to process a particular data object by comparing the metadata associated with that data object with the data object characteristics associated with the deduplication domains. In addition, the director engine 1404 may select, or decline to select, a particular deduplication domain based on other information, such as the amount of storage capacity remaining in that domain.

At act 1508, the computer system directs the received data to the selected deduplication domain. According to the illustrated example of FIG. 14, the director engine 1404 may provide a data object to the selected deduplication domain by passing that domain either a reference to the data object or a copy of the data object.
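Acts 1506 and 1508 can be sketched together as follows. The `DedupDomain` class, the exact-match comparison of metadata against characteristics, the most-specific-match tie-breaker, and the default domain are all assumptions; the text only states that metadata is compared with the characteristics associated with each domain.

```python
from dataclasses import dataclass, field


@dataclass
class DedupDomain:
    name: str
    characteristics: dict                      # e.g. {"application": "oracle"}
    received: list = field(default_factory=list)

    def matches(self, metadata: dict) -> bool:
        return all(metadata.get(k) == v for k, v in self.characteristics.items())


def select_domain(metadata, domains, default):
    """Act 1506: pick the matching domain with the most specific characteristics."""
    candidates = [d for d in domains if d.matches(metadata)]
    if not candidates:
        return default
    return max(candidates, key=lambda d: len(d.characteristics))


def direct(data_object, metadata, domains, default):
    """Act 1508: hand the selected domain a reference to the data object."""
    domain = select_domain(metadata, domains, default)
    domain.received.append(data_object)
    return domain


d1408 = DedupDomain("1408", {"application": "oracle"})
d1410 = DedupDomain("1410", {"application": "oracle", "policy": "full"})
d1412 = DedupDomain("1412", {})   # empty characteristics match everything
domains = [d1408, d1410]

obj = b"payload"
chosen = direct(obj, {"application": "oracle", "policy": "full"}, domains, d1412)
```

Appending `obj` itself corresponds to passing a reference; passing `bytes(obj)` or a deep copy would correspond to the copy variant mentioned in the text.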

Process 1500 ends at 1510.

Process 1500 illustrates one particular sequence of acts in a particular embodiment. The acts included in each of these processes may be performed by, or using, one or more computer systems specially configured as described herein. In addition, the order of the acts may be changed, and other acts may be added, without departing from the scope of the invention.

As described above with reference to FIG. 3, the storage system may also include a software application referred to as the end-user restore application 300. Thus, according to another embodiment, a method is provided for end users to locate and restore backup data without intervention by IT staff and without requiring any changes to existing backup/restore procedures and/or policies. In a typical backup storage system, the backup/restore application running on the host computer 120 is controlled by IT staff, and it may be impossible or very difficult for an end user to access backup data without IT staff intervention. According to aspects and embodiments of the invention, storage system software is provided that allows end users to locate and restore their own files, for example via a web-based or other interface to the backup storage medium 126.

It should be appreciated that, like the synthetic full backup application 240, the end-user restore application 300 may run on the storage system controller 122 (see FIG. 2) or on the host computer 120. The end-user restore application includes the software commands and interfaces necessary to allow an authenticated user to search the logical metadata cache to locate backup files and, optionally, to restore those files from the backup storage medium 126.

According to one embodiment, software is provided that includes a user interface installed and/or executed on the user computer 136. The user interface may be any type of interface that allows a user to locate files on the backup storage medium. For example, the user interface may be a graphical user interface, may be web-based, or may be a text interface. The user computer is coupled to the storage system 170 via a network connection 138, which may be, for example, an Ethernet connection. Through this network connection 138, an operator of the user computer 136 can access data stored on the storage system 170.

In one example, the end-user restore application 300 includes user authentication and/or authorization features. For example, a user may be asked to log in via the user interface on the user computer using a username and password. The user computer may communicate the username and password to the storage system (for example, to the end-user restore application), which may use an appropriate user authentication mechanism to determine whether the user has access to the storage system. Some examples of user authentication mechanisms that may be used include, but are not limited to, a Microsoft Active Directory server, a UNIX "Yellow Pages" server, or the Lightweight Directory Access Protocol (LDAP). The login/user authentication mechanism may communicate with the end-user restore application to exchange user privileges. For example, some users may be permitted to search only those files that they created, or for which they hold certain privileges or are identified as the owner. Other users, for example system operators or administrators, may be permitted access to all backup files, and so on.

According to one embodiment, the end-user restore application uses the logical metadata cache to obtain information about all of the data files backed up on the backup storage medium. Through the user interface, the end-user restore application presents the user with a hierarchical directory structure of the user's files, sorted by, for example, backup time/date, username, original user computer directory structure (which may have been captured when the files were backed up), or other file characteristics. In one example, the directory structure presented to a user may vary according to the privileges enabled for that user. The end-user restore application may accept browse requests (that is, the user may browse the directory structure through the user interface to locate a desired file), or the user may search for files by name, date, and so on.

According to one embodiment, the user may restore backup files from the storage system. For example, once the user has located a desired file as described above, the user may download the file from the storage system via the network connection 138. In one example, this download procedure may be implemented in a manner comparable to a web-based download, as known to those skilled in the art.

By allowing end users to access the files they are authorized to view and download, and by enabling such access through a user interface (for example, web-based technology), the end-user restore application allows users to find and restore their own files without any need to change backup policies or procedures.

Although aspects of the invention, such as the synthetic full backup application and the end-user restore application, are described herein primarily in terms of software, it should be appreciated that they may alternatively be implemented in software, hardware, or firmware, or any combination thereof.

Thus, for example, embodiments of the invention may comprise a computer-readable medium (e.g., computer memory, a floppy disk, a compact disk, a tape, etc.) encoded with a computer program (i.e., a plurality of instructions) that, when executed on a processor of the storage system, performs at least in part the functions of the synthetic full backup application and/or the end-user restore application described in detail above.

In general summary, embodiments and aspects of the invention thus include storage systems and methods that emulate a conventional tape backup system yet may provide enhanced functionality, such as the ability to create synthetic backups and to let end users view and restore backed-up files. It should be appreciated, however, that various aspects of the invention may be used for purposes other than backing up computer data. Because the storage system of the invention may be used to economically store vast amounts of data, and because that stored data can be accessed randomly rather than sequentially and at hard disk access times, embodiments of the invention may find use outside conventional backup storage systems. For example, embodiments of the invention may be used to store video or audio data representing a wide selection of movies and music and to enable video and/or audio on demand.

Having described several aspects of at least one embodiment of the invention, it should be appreciated that various alterations, modifications, and improvements will readily occur to those skilled in the art. Such alterations, modifications, and improvements are intended to be part of this disclosure and are intended to be within the scope of the invention. Accordingly, the foregoing description and drawings are by way of example only.

Claims

A method for directing deduplication of an application layer data object, the method comprising:
receiving the application layer data object;
selecting a deduplication domain from a plurality of deduplication domains based at least in part on a data object characteristic associated with the deduplication domain;
determining that the application layer data object has the characteristic; and
directing the application layer data object to the selected deduplication domain.
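
The steps recited in this claim can be pictured with a minimal routing sketch. It is an illustration under stated assumptions, not the claimed implementation: `DedupDomain`, `direct`, the dictionary representation of characteristics, and the first-match selection rule are all hypothetical choices.

```python
from dataclasses import dataclass, field


@dataclass
class DedupDomain:
    """A deduplication domain with the characteristics it is associated with."""
    name: str
    characteristics: dict            # e.g. {"data_type": "database"} (assumed form)
    objects: list = field(default_factory=list)

    def matches(self, metadata):
        """True if the object's metadata has every characteristic of this domain."""
        return all(metadata.get(k) == v for k, v in self.characteristics.items())


def direct(object_id, metadata, domains, default=None):
    """Select the first domain whose characteristics the object has and
    direct the object to it; fall back to `default` if none matches."""
    for domain in domains:
        if domain.matches(metadata):
            domain.objects.append(object_id)
            return domain
    if default is not None:
        default.objects.append(object_id)
    return default
```

A first-match rule is only one possible selection policy; a scoring rule that prefers the domain with the most matching characteristics would fit the same claim language.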

The method of claim 1, wherein receiving the application layer data object comprises:
receiving a data stream; and
identifying the application layer data object using metadata included in the data stream.

The method of claim 2, wherein receiving the data stream comprises receiving a multiplexed data stream.
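
Identifying data objects inside a multiplexed stream can be sketched as follows. The stream format is an assumption made purely for illustration: each block is modeled as an `(object_id, payload)` pair, where `object_id` stands in for whatever metadata the backup application embeds in the stream. The function name `demultiplex` is likewise hypothetical.

```python
from collections import defaultdict


def demultiplex(blocks):
    """Reassemble interleaved blocks into one byte string per data object.

    blocks: iterable of (object_id, payload_bytes) pairs in arrival order.
    """
    parts = defaultdict(list)
    for object_id, payload in blocks:
        parts[object_id].append(payload)
    return {oid: b"".join(chunks) for oid, chunks in parts.items()}
```

Each reassembled object can then be examined on its own, independent of how the backup application interleaved it with other objects.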

The method of claim 2, further comprising extracting metadata included in the data stream using the application layer data object.

The method of claim 4, wherein selecting the deduplication domain from the plurality of deduplication domains comprises comparing the extracted metadata associated with the application layer data object with at least one of the characteristics associated with the deduplication domain.

Extracting the metadata included in the data stream comprises extracting at least one of a backup policy name, a data source type, a data source name, a backup application name, an operating system type, a data type, a backup type, a file name, a directory structure, and time series information.

The method of claim 1, further comprising configuring each of the plurality of deduplication domains to use one of a plurality of deduplication methods.

The method of claim 7, wherein configuring each of the plurality of deduplication domains comprises configuring each deduplication domain to use one deduplication method selected from the group comprising hash fingerprinting, pattern recognition, and content-aware deduplication.
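
Of the methods named in this claim, hash fingerprinting is the simplest to sketch. The example below is a generic illustration, not the patented method: fixed-size chunking, SHA-256 as the fingerprint, and the `HashFingerprintStore` class are all assumptions.

```python
import hashlib


class HashFingerprintStore:
    """Stores each distinct chunk once, keyed by its SHA-256 fingerprint."""

    def __init__(self, chunk_size=4096):
        self.chunk_size = chunk_size
        self.chunks = {}   # fingerprint -> chunk bytes

    def store(self, data):
        """Split data into fixed-size chunks and return the list of
        fingerprints (the "recipe") needed to rebuild it."""
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            fingerprint = hashlib.sha256(chunk).hexdigest()
            # Only the first copy of a chunk is kept; repeats are dropped.
            self.chunks.setdefault(fingerprint, chunk)
            recipe.append(fingerprint)
        return recipe

    def restore(self, recipe):
        """Rebuild the original data from its recipe."""
        return b"".join(self.chunks[fp] for fp in recipe)
```

A domain configured for a different method would swap in a different `store` strategy while keeping the same routing logic in front of it.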

The method of claim 1, further comprising associating each of the plurality of deduplication domains with at least one data object characteristic.

The method of claim 1, further comprising:
deduplicating the application layer data object within the selected deduplication domain; and
adjusting the data object characteristic associated with at least one of the plurality of deduplication domains based on a result of the deduplicating.

The method of claim 10, wherein adjusting the data object characteristics comprises storing data in a deduplication domain database.
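
One way to picture adjusting characteristics based on a deduplication result is the feedback rule sketched below. The policy is an assumption invented for illustration and is not taken from the claims: the function `adjust_characteristics`, the `min_ratio` threshold, and the use of a plain dictionary as a stand-in for the deduplication domain database are all hypothetical.

```python
def adjust_characteristics(domain_db, domain, data_type,
                           bytes_in, bytes_stored, min_ratio=1.5):
    """Keep or drop a data type for a domain depending on how well it deduplicated.

    domain_db: dict mapping a domain name to the set of data types
               associated with it (stand-in for a domain database).
    Returns the observed deduplication ratio.
    """
    ratio = bytes_in / bytes_stored if bytes_stored else float("inf")
    types = domain_db.setdefault(domain, set())
    if ratio >= min_ratio:
        types.add(data_type)       # this domain handles the type well
    else:
        types.discard(data_type)   # stop routing this type to the domain
    return ratio
```

Under this assumed rule, a domain that stores nearly as many bytes as it receives for a given data type stops being associated with that type.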

A computer readable medium having stored thereon computer readable signals defining instructions that, when executed by a computer, instruct the computer to perform the method of claim 1.

The method of claim 1, wherein the method is executed in a grid computing environment.

The method of claim 1, wherein the method is performed on the backup storage system while data is not being backed up to the backup storage system.

The method of claim 1, wherein the method is performed on the backup storage system while data is being backed up to the backup storage system.

A system for directing deduplication of an application layer data object, the system comprising:
a plurality of deduplication domains, each deduplication domain of the plurality being associated with at least one characteristic common to a plurality of application layer data objects; and
a controller coupled to the plurality of deduplication domains, the controller being configured to:
receive the application layer data object;
determine that the application layer data object has the at least one characteristic associated with a deduplication domain; and
direct the application layer data object to the deduplication domain.

The system of claim 16, wherein the controller is further configured to:
receive a data stream; and
identify the application layer data object using metadata included in the data stream.

The system of claim 17, wherein the data stream is multiplexed.

The system of claim 17, wherein the controller is further configured to extract metadata contained in the data stream using the application layer data object.

The system of claim 19, wherein the controller is further configured to determine that the application layer data object has the at least one characteristic associated with the deduplication domain by comparing the extracted metadata associated with the application layer data object with the at least one characteristic associated with the deduplication domain.

The system of claim 19, wherein the controller is further configured to extract at least one of a backup policy name, a data source type, a data source name, a backup application name, an operating system type, a data type, a backup type, a file name, a directory structure, and time series information.

The system of claim 16, wherein the controller is further configured to configure each of the plurality of deduplication domains to use one of a plurality of deduplication methods.

The system of claim 22, wherein the controller is further configured to configure each of the plurality of deduplication domains to use one deduplication method selected from the group comprising hash fingerprinting, pattern recognition, and content-aware deduplication.

The system of claim 16, wherein the controller is further configured to associate each of the plurality of deduplication domains with at least one data object characteristic.

The system of claim 16, wherein the controller is further configured to:
cause deduplication of the application layer data object within the selected deduplication domain; and
adjust the data object characteristic associated with at least one of the plurality of deduplication domains based on a result of the deduplication.

The system of claim 25, wherein the controller is further configured to store data in a deduplication domain database.

The system of claim 16, wherein the system is included in a grid computing environment.

The system of claim 16, wherein the controller is further configured to receive the application layer data object, determine that the application layer data object has the at least one characteristic associated with a deduplication domain, and direct the application layer data object to the deduplication domain while data is not being backed up to the system.

The system of claim 16, wherein the controller is further configured to receive the application layer data object, determine that the application layer data object has the at least one characteristic associated with a deduplication domain, and direct the application layer data object to the deduplication domain while data is being backed up to the system.