WO2009031158A2

WO2009031158A2 - Method and apparatus for network based data recovery

Info

Publication number: WO2009031158A2
Application number: PCT/IL2008/001211
Authority: WO
Inventors: Leonid Remennik; Henry Broodney; Eli Bernstein
Original assignee: Ingrid Networks Ltd
Priority date: 2007-09-09
Filing date: 2008-09-09
Publication date: 2009-03-12
Also published as: WO2009031158A3; WO2009031157A3; WO2009031156A3; WO2009031157A2; WO2009031156A2

Abstract

The subject matter discloses a data structure for storing and retrieving data in a grid network. The data structure enables lazy loading of the data, by preserving the original order of the data blocks in the volume, and computing keys representing the location of data blocks in the grid. The lazy loading of backed up data enables optimized retrieval of multiple versions of a selected file, from various locations in which the file was backed up.

Description

METHOD AND APPARATUS FOR NETWORK BASED DATA RECOVERY

CROSS-REFERENCE TO RELATED APPLICATIONS

The present invention claims priority of the filing date of provisional patent application serial number 60/970,957 titled METHOD AND APPARATUS FOR GRID BASED DATA PROTECTION, filed September 9, 2007, the contents of which are hereby incorporated by reference herein.

BACKGROUND OF THE INVENTION

FIELD OF THE INVENTION

The present disclosure relates to retrieval of data from data protection systems in general, and to retrieval of data from peer-to-peer network based data protection systems, in particular.

DESCRIPTION OF RELATED ART

In the computerized environment surrounding us, data stored in computing devices becomes more and more vital to our everyday life, and the need to protect that data becomes ever more crucial. One aspect of such protection relates to creating a backup of the data should there be a need to restore it. Data may be lost due to machine or hardware failure, such as hard disk failure or other catastrophes. Other causes for loss of data may be malicious actions, such as theft or attacks by hackers and viruses, or through a user mistake, such as inadvertent deletion of a file, and even through intentional actions, such as modifying data or files and deleting the same. Some available solutions to this problem are completely manual, such as burning data onto CD or DVD, copying files to remote locations and using removable drives for backup. Other solutions discuss automatic or semiautomatic methods such as utilizing dedicated backup servers and software. Many technology innovations in the field of data protection focus on backup operations optimization for improving speed, reliability and automation.

Backup methods are typically divided into two groups: file level and volume level. A file level backup method allows a user to minimize the storage space required to contain the data to be backed up by filtering out files that need not be backed up. Such methods are limited to restoring files onto an existing file system, and are unable to restore the entire file system itself. A volume level backup method, on the other hand, is able to restore an entire volume and bring a computer back to its original state, from the software perspective. Most volume level backup methods store backup copies of all the data stored on a volume that is being backed up. Such methods read data from the volume and store a backup of every block, a block being a representation of the lowest level data unit stored on the volume. A block containing backed up data is referred to as a data block. When storing an entire volume, the backed up data is referred to as a snapshot or as a recovery point.

Most existing automatic and semi-automatic data protection systems use relatively expensive dedicated backup servers and/or removable media to store backed up data, thus putting a lot of strain on a small business's budget. An additional drawback of such a system is the need for skilled personnel to install the backup system.

Some recent backup systems offer a different solution that utilizes free storage space on other computers on a peer-to-peer computerized network, also referred to as a computerized grid network. Instead of storing the data related to the backup, referred to henceforth as backup data, on a dedicated storage device, some general purpose storage devices located on other networked computers are used to store backup data.

One of the main difficulties in a computerized grid network based data protection system is to determine which data blocks should be retrieved when an entire backed up volume is retrieved. An additional issue is the number of blocks which must be retrieved in order to allow the user to inspect the backed up data. Another difficulty is ensuring that a snapshot stored in a computerized grid network is either retrievable in its entirety, or cannot be retrieved at all. This property, also named atomicity, is desirable.

Another challenge related to backing up data over a computerized grid network is to retrieve multiple versions of the same file backed up on a computerized grid network or on a different storage device, such as a server. For example, retrieving multiple versions of a software code, or multiple versions of a word processor document, without saving any additional predefined data in addition to the data stored in the backup process.

In view of the foregoing there is a need for an apparatus and method capable of retrieving an entire backed up volume stored in a computerized grid network, while minimizing the number of retrieved blocks, and retrieving multiple versions of a file.

BRIEF SUMMARY OF THE INVENTION

It is a first object of the disclosed subject matter to provide an apparatus including a data structure stored therein, the data structure to allow for the retrieval of backed up data blocks, the data structure comprises: a first index block comprising a first representation of an at least one backed up data block; said first representation comprises a retrieval key for a data block or an at least one secondary index block; said at least one secondary index block comprises a second representation of a portion of said at least one backed up data block; said second representation comprises two or more retrieval keys of either at least two data blocks or at least two other secondary index blocks. In some exemplary embodiments of the subject matter said data structure further comprises a header block.

Optionally, the header block comprises data enabling to determine a route to a selected backed up data block in the apparatus. In other exemplary embodiments of the subject matter said header block comprises information used for lazy loading of a backed up snapshot.

In other exemplary embodiments of the subject matter said header block comprises a third representation of said at least one backed up data block; said third representation comprises a retrieval key of said first index block; said first index block is stored in said decentralized computer apparatus. Optionally, said header block is stored in said apparatus; a retrieval key of said header block is deterministically determined.

Additionally, said header block may further comprise a files filter. In yet other exemplary embodiments of the subject matter said header block is used as a start point in retrieving the backed up data blocks. Optionally, said header block is stored in said decentralized computer apparatus; wherein said data structure provides atomicity.

In another exemplary embodiment of the subject matter the data structure stores the previously backed up data blocks in a first order corresponding to a second order in which the previously backed up data blocks are stored in a volume. In yet another exemplary embodiment of the subject matter an at least one index block or data block contained in said data structure is also contained in another data structure of an equivalent type as the data structure.

Optionally, at least one index block or data block is cached.

It is a second object of the disclosed subject matter to provide a method of backing up at least one data block in a computerized network, the method comprising: (a) determining a retrieval key associated with each data block of the at least one data block in the computerized network, said retrieval key enables retrieval of the associated data block in the computerized network; (b) generating an at least one index block comprising an at least two retrieval keys; (c) determining a retrieval key associated with said at least one index block on the computerized network; and repeating steps (b) and (c) until only one index block is generated.

In some exemplary embodiments of the subject matter, the method further comprises a step of storing the at least one index block in the computerized network.

It is a third object of the disclosed subject matter to provide a method of retrieving previously backed up data blocks in a computerized network, the method comprising: (a) obtaining a first index block representing the previously backed up data blocks, said first index block comprises an at least one retrieval key associated with an at least one additional block; (b) retrieving said at least one additional block using said at least one retrieval key; in case the retrieved block is an index block comprising a second at least one retrieval key, retrieving a second at least one additional block associated with said second at least one retrieval key; and repeating step (b) at least once.

In some exemplary embodiments of the subject matter said at least one additional block comprises either a data block or an index block.

Optionally, the retrieval key of the first index block is deterministically determined.

Additionally, the method may further comprise a step of utilizing a files filter to filter out information regarding data blocks that were not backed up.

It is a fourth object of the disclosed subject matter to provide a method for retrieving a selected portion of previously backed up data blocks in a computerized network, the method comprising: (a) obtaining a header block representing the previously backed up data blocks, said header block contains an at least one retrieval key associated with a block; (b) for at least one data block of said previously backed up data blocks to be retrieved, determining a route from the header block to the data block; locating a retrieval key associated with each block in route; said locating retrieval key comprised by another block in said route; retrieving said each block of route using said retrieval key.

It is a fifth object of the disclosed subject matter to provide a method for retrieving multiple computerized versions of a previously backed up file, the method comprising: (a) determining a last in time recovery point comprising a version of said previously backed up file; (b) retrieving the last in time recovery point; (c) determining a modification time of said version of said previously backed up file; (d) determining an additional recovery point comprising another version of said previously backed up file, said additional recovery point was backed up before the modification time of said version of said previously backed up file; (e) retrieving said additional recovery point; and repeating steps (c)-(e) until no additional recovery points are retrieved or until the additional recovery point does not comprise a version of the previously backed up file.

In an exemplary embodiment of the disclosed subject matter, the method further comprises a step of computing a computerized representation of said versions.

Optionally, said recovery points are retrieved from a computerized grid network.

In yet another exemplary embodiment of the disclosed subject matter each of said recovery points is represented using an HLT data structure.
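The loop of steps (a)-(e) of the fifth object can be illustrated with a minimal Python sketch. It is not the claimed implementation: the `RecoveryPoint` record, its `backup_time` and `files` fields, and the choice of the latest recovery point backed up before the modification time as the "additional recovery point" are assumptions made for illustration only. Retrieval from a grid network is replaced by an in-memory list.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RecoveryPoint:
    # Hypothetical record: time the snapshot was backed up, and for each
    # file path the modification time of the version held in the snapshot.
    backup_time: int
    files: Dict[str, int]


def collect_versions(points: List[RecoveryPoint], path: str) -> List[int]:
    """Return modification times of the versions of `path`, newest first."""
    versions: List[int] = []
    newest_first = sorted(points, key=lambda p: p.backup_time, reverse=True)
    # Steps (a)-(b): last-in-time recovery point comprising a version of the file.
    current = next((p for p in newest_first if path in p.files), None)
    while current is not None:
        mtime = current.files[path]          # step (c)
        versions.append(mtime)
        # Step (d): a recovery point backed up before that modification time
        # (assumption: the latest such point is chosen).
        older = [p for p in newest_first if p.backup_time < mtime]
        # Stop when no additional recovery point exists or it lacks the file.
        if not older or path not in older[0].files:
            break
        current = older[0]                   # step (e)
    return versions
```

Because every recovery point taken after a modification time holds the same version, jumping directly to a point backed up before that time skips snapshots that cannot contain a different version.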

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will be understood and appreciated more fully from the following detailed description taken in conjunction with the drawings in which corresponding or like numerals or characters indicate corresponding or like components. Unless indicated otherwise, the drawings provide exemplary embodiments or aspects of the disclosure and do not limit the scope of the disclosure. In the drawings:

Fig. 1 shows main components in a typical decentralized computer apparatus 100, according to some exemplary embodiments of the disclosed subject matter;

Fig. 2 shows a block diagram of a Hierarchical Lookup Table (HLT) 200, a data structure stored in a computer apparatus for representing a plurality of data blocks, according to some exemplary embodiments of the disclosed subject matter;

Fig. 3 shows a block diagram of an HLT used for partially loading a snapshot, according to an exemplary embodiment of the disclosed subject matter;

Fig. 4 shows a flow diagram of a method of creating an HLT representation of a snapshot, according to an exemplary embodiment of the subject matter; and,

Fig. 5 shows a flow diagram of a method of retrieving multiple computerized versions of a backed up file, according to an exemplary embodiment of the subject matter.

DETAILED DESCRIPTION

One technical problem dealt with in the disclosed subject matter is to reduce the number of retrieval operations required to retrieve a specific data block backed up in a computerized grid network. Another technical problem is to retrieve only a subset of the data blocks of a volume, in accordance with a determination of relevant data blocks. Further, another technical problem is to efficiently represent a backed up volume and allow retrieval of a portion of the backup data. Yet another technical problem solved in the disclosed subject matter is to provide a mechanism and methods to retrieve multiple versions of a specific backed up file. Retrieving multiple versions of such a file is not limited to the field of backing up data over a computerized grid network, but also applies to backing up data on a single computer, or on a managed computer network.

One technical solution for the technical problems disclosed above is a data structure stored in a memory of a computer apparatus or stored on any other media or hardware device, the data structure being used to retrieve previously backed up data blocks in the computer apparatus. The computer apparatus of the disclosed subject matter is a computerized network, or a computerized node within the computerized network, for example a personal computer, a wireless device such as a wireless phone and the like. The data structure comprises a first index block comprising a first representation of at least one backed up data block, and at least one secondary index block comprising a second representation of a portion of said at least one backed up data block. Said first representation comprises a retrieval key for either a data block or at least one secondary index block. Said second representation comprises two or more retrieval keys of either at least two data blocks or at least two other secondary index blocks.

The data structure may be stored in a single computerized node in the computerized grid network or it may be stored in parts in several computerized nodes in the computerized grid network. In some exemplary embodiments of the subject matter, the first index block is a root of the data structure, and comprises a key related to the location of one or more data blocks in the computerized grid network. In many embodiments, the root comprises data related to the location of at least two data blocks on the computerized grid network. Such data may relate to data blocks or to secondary index blocks. The secondary index blocks contain keys which represent the location of two or more data blocks in the computerized grid network. As a result, a specific data block backed up in the computerized grid network can be retrieved without retrieving the entire backed up data from the computerized grid network. The data structure is further described below.

Another technical solution disclosed in the subject matter is a decentralized computer apparatus, comprising at least one computerized node having a memory, and storing a representation of at least one data block. Said representation comprises index blocks comprising retrieval keys of either other index blocks or data blocks. Another technical solution is determining the data blocks to be retrieved in accordance with said representation and with other information.

Though the disclosed subject matter may be most useful when used in a computerized grid network, it will be apparent to a person skilled in the art that the disclosed subject matter may be utilized in other types of computerized networks, such as but not limited to a centralized computerized network.

Fig. 1 shows main components in a typical decentralized computer apparatus 100, according to some exemplary embodiments of the disclosed subject matter. Several computerized nodes 110, 120, 130 are connected to each other via a computerized network 140. The computerized network 140 preferably operates in a peer-to-peer network constellation, or in a likewise decentralized computerized network in which every computerized node of the computerized nodes 110, 120, 130 is able to communicate with another computerized node of the computerized nodes 110, 120, 130 without requiring a centralized computerized node to supervise the communication. Communications, however, may be channeled via one or more computerized nodes before reaching their destination computerized node. A person skilled in the art will appreciate that any computerized node of the computerized nodes 110, 120, 130 may be a personal computer (PC), a laptop, a personal digital assistant (PDA), a wireless device such as a cellular phone or any other similar machinery. The computerized nodes 110, 120, 130 communicate using protocols such as L2TP, PPTP, PPP, TCP/IP, token ring, NDP or frame relay, over either a wired or a wireless infrastructure. The computerized nodes 110, 120, 130 may be informed in advance of computerized nodes connected to the computerized network 140, or alternatively may have to initiate a predetermined protocol to discover computerized nodes connected to the computerized network 140. Although it is generally assumed that any computerized node of the computerized nodes 110, 120, 130 is reachable from any other computerized node of the computerized nodes 110, 120, 130, directly or indirectly, a condition may exist in which some of the computerized nodes 110, 120, 130 are, individually or as a group, temporarily disconnected from the bulk of the network 140. This does not interfere with the application of any part of the disclosed subject matter.

While the present figure shows a limited number of computerized nodes, it will be appreciated that any additional number of computerized nodes can be present within the computerized network 140 and used in conjunction with the present subject matter.

Each computerized node of the computerized nodes 110, 120, 130 functions as a provider of backup services as well as a consumer of such services. In an exemplary embodiment of the subject matter, when a computerized agent residing on a computerized node, such as computerized node 110, determines that certain data is required to be backed up, the data to be backed up is sent over the computerized network 140 to other computerized nodes, such as the computerized nodes 120, 130, that store a duplicate of the data. A person skilled in the art will appreciate that in certain embodiments, some of the computerized nodes 110, 120, 130 may function only as consumers of the backup services and another subset of the computerized nodes 110, 120, 130 may function only as providers of the backup services.

A person skilled in the art will appreciate that blocks that are stored on computerized nodes of the computerized grid network may be stored in their original state. Alternatively, the blocks may be stored after undergoing processing and transformations such as compression, encryption and addition of forward error correction codes.

In an exemplary embodiment of the subject matter, in order for the backup to ensure sufficient data survivability in various foreseeable and non-foreseeable circumstances, all backed up data should be stored redundantly on several computerized nodes. According to one embodiment, a forward error correction code is utilized to cope with a maximal number of simultaneous failures of computerized nodes without requiring a substantial amount of storage space. A person skilled in the art will appreciate that a failure of a computerized node may have various causes, such as disconnection from the computerized network 140, hardware malfunction of storage devices residing in the computerized node and the like. Such forward error correction codes can be utilized to achieve different levels of redundancy.

As proposed by the present subject matter, utilizing a computerized grid network as a backup system presents an inexpensive and efficient backup solution, solving technical problems currently known in the art. As opposed to systems that depend on costly designated servers, the computerized grid network of the present subject matter utilizes multiple inexpensive storage devices used by general purpose computing devices. Most organizations own and maintain various such general purpose computing devices to allow their personnel to work and interact. As a result of the availability of low-priced high-volume storage devices that are installed on such general purpose computing devices, most organizations have a great deal of redundant storage space. That redundant storage space is utilized by the computerized grid network of the present subject matter for storing backup data. Another technical advantage of the computerized grid network of the present subject matter is that the backup data may be distributed over different computerized nodes that may be located in different geographical locations. By distributing the backup data, the risk of losing the entire backup data is reduced substantially.

Fig. 2 shows a block diagram of a Hierarchical Lookup Table (HLT) 200, a data structure stored in a computer apparatus for representing a plurality of data blocks, according to some exemplary embodiments of the disclosed subject matter. Data blocks 201, 202, 203, 204, 205, 206, 207, 208, 209 are data blocks that were determined to be backed up. In some exemplary embodiments of the subject matter, said data blocks 201, 202, 203, 204, 205, 206, 207, 208, 209 are all the data blocks in a volume backed up by a computerized device. In this exemplary embodiment said data blocks 201, 202, 203, 204, 205, 206, 207, 208, 209 represent the snapshot of the volume. It will be apparent to a person skilled in the art that a snapshot may comprise data blocks of a single partition, a boot sector, partition tables or a combination of the above. While the present figure shows a limited number of data blocks, it will be appreciated that any additional number of data blocks can be represented by the HLT 200 and used in conjunction with the present subject matter.

In some exemplary embodiments of the subject matter said data blocks 201, 202, 203, 204, 205, 206, 207, 208, 209 represent a snapshot and the order of the representation of said data blocks in the data structure of the disclosed subject matter remains as appearing on the volume. For example, in a volume composed of exactly four data blocks 201, 202, 203, 204, an address of data block 201 in the volume is 0, an address of data block 202 is 1, an address of data block 203 is 2 and an address of data block 204 is 3. In this example, the order of the data blocks 201, 202, 203, 204 is preserved in relation to their respective addresses. This characteristic, as will be explained below, enables lazy loading of the volume.

A header block 270 contained in the HLT 200 comprises a representation of a backed up snapshot. The header block 270 comprises a retrieval key of a root block 263 and it may also comprise snapshot properties such as identification of a computerized node from which the volume was stored, date, volume identification and volume size. The header block 270 may also comprise a files filter to be used when reading files from the backed up snapshot. The files filter may be used to filter out information, such as directory listings or information in the Master File Table (MFT), about files that were not backed up in the backed up snapshot. A person skilled in the art will appreciate that in case a file is not backed up, a files filter may be utilized to avoid a retrieval request for the content of a file that was not backed up. The retrieval key is used to retrieve a data block or an index block from a computerized grid network, using a distributed hash table (DHT). In some exemplary embodiments, a header block 270 may be stored in more than one block. In such case, several blocks are required to be retrieved and the retrieval key of the header block 270 comprises the retrieval keys of each of the several blocks.

The root block 263 is an index block. An index block of the disclosed subject matter comprises at least one retrieval key utilized for retrieving another index block or a data block. In Fig. 2 exemplary index blocks are shown in which each index block comprises three retrieval keys, for example retrieval keys 260, 261, 262. A person skilled in the art will appreciate that an index block may be stored in several blocks. In such case, a retrieval key of an index block is comprised of several retrieval keys of blocks containing the data of the index block. For simplicity the following description refers to an index block whose content is stored in one block.

Retrieval keys 260, 261, 262 are used to retrieve other index blocks 243, 247, 251 which comprise retrieval keys 240, 241, 242, 244, 245, 246, 248, 249, 250. Retrieval keys 240, 241, 250 are used to retrieve index blocks 224, 228, 233 which comprise retrieval keys 221, 222, 223, 225, 226, 227, 230, 231, 232. Said retrieval keys 221, 222, 223, 225, 226, 227, 230, 231, 232 are used to retrieve data blocks 201, 202, 203, 204, 205, 206, 207, 208, 209.

In an exemplary embodiment of the subject matter, a retrieval key, such as retrieval key 221, is computed using a hash function. In a content-based system the retrieval key is computed using the hash function activated on the content of the data block or index block represented by the value of the retrieval key. Such a hash function provides an output of enough bits, for example 160 bits, such that the space of the keys is sufficiently large. If the hash function provides a uniform distribution over said space of keys, the probability that two distinct data blocks are assigned the same key is negligibly low. Hence a hash value may be used as a practically unique identifier of a data block, as is well known in the art. An exemplary hash function that provides the above characteristics is SHA-1.

In an exemplary embodiment of the subject matter a hash function provides an output of 8 bytes, the size of a data block is 1024 bytes and an index block size is equal to the size of a data block. An index block, such as index block 224, in said exemplary embodiment may comprise up to 128 retrieval keys of other data blocks or index blocks. In other exemplary embodiments of the subject matter, an index block solely contains retrieval keys representing index blocks of a lower level in the data structure. In the exemplary embodiment of the subject matter presented in figure 2, an index block, such as index block 224, contains three retrieval keys 221, 222, 223.

The retrieval keys are hash values computed by the hash function, or the output of another mathematical or logical function desired by a person skilled in the art. According to the specific example, retrieval key 221 is the hash value representing data block 201. Retrieval keys 222, 223 represent data blocks 202, 203, respectively. The index block 224 may be a concatenation of the three retrieval keys 221, 222, 223. In case an index block should point to fewer than three blocks, either data blocks or index blocks, a dummy retrieval key, such as a key whose bits are all set to 0, may be used to indicate that the value represented by the dummy retrieval key should not be used to retrieve a block. As mentioned above, given a hash function with a sufficient number of output bits and which provides a uniform distribution, the probability that a hash value of a block, either a data block or an index block, will be equal to any specific key in general, and the dummy key in particular, is sufficiently low.
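The content-based keys, the concatenated index blocks and the dummy key described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the disclosed implementation: a plain dictionary named `store` stands in for the DHT, SHA-1 from the standard `hashlib` module is used as the exemplary hash function, and the function names and the `levels` parameter are hypothetical.

```python
import hashlib
from typing import Dict, List

KEY_SIZE = 20                  # SHA-1 output: 160 bits
DUMMY_KEY = bytes(KEY_SIZE)    # all-zero key, never used to retrieve a block


def retrieval_key(content: bytes) -> bytes:
    """Content-based retrieval key of a data block or an index block."""
    return hashlib.sha1(content).digest()


def make_index_block(keys: List[bytes], fanout: int) -> bytes:
    """Concatenate up to `fanout` keys, padding unused slots with the dummy key."""
    if not 0 < len(keys) <= fanout:
        raise ValueError("an index block holds between 1 and fanout keys")
    return b"".join(keys + [DUMMY_KEY] * (fanout - len(keys)))


def read_index_block(block: bytes) -> List[bytes]:
    """Split an index block into its keys, skipping dummy keys."""
    keys = [block[i:i + KEY_SIZE] for i in range(0, len(block), KEY_SIZE)]
    return [k for k in keys if k != DUMMY_KEY]


def build_hlt(data_blocks: List[bytes], fanout: int,
              store: Dict[bytes, bytes]) -> bytes:
    """Store all blocks in `store` and return the key of the root index block."""
    if not data_blocks or fanout < 2:
        raise ValueError("need at least one data block and fanout >= 2")
    keys = []
    for block in data_blocks:              # keys of the data blocks, in volume order
        key = retrieval_key(block)
        store[key] = block
        keys.append(key)
    while True:                            # build index levels until one block remains
        next_keys = []
        for i in range(0, len(keys), fanout):
            index_block = make_index_block(keys[i:i + fanout], fanout)
            key = retrieval_key(index_block)
            store[key] = index_block
            next_keys.append(key)
        keys = next_keys
        if len(keys) == 1:
            return keys[0]


def retrieve_all(root_key: bytes, levels: int,
                 store: Dict[bytes, bytes]) -> List[bytes]:
    """Walk `levels` index levels down from the root and return the data blocks."""
    keys = [root_key]
    for _ in range(levels):
        keys = [k for key in keys for k in read_index_block(store[key])]
    return [store[k] for k in keys]
```

The number of index levels is passed explicitly because a content-addressed store does not itself say whether a block is an index block or a data block; in the disclosed structure that information can be derived from properties kept in the header block, such as the volume size.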

It will be apparent to a person skilled in the art that given the header block 270, the entire volume may be retrieved. If the header block 270 is also stored in the computerized grid network, a computerized node of the computerized grid network determining to retrieve the recovery point represented by the header block 270 should be able to reconstruct a retrieval key of the header block 270. In an exemplary embodiment of the subject matter, the retrieval key of the header block 270 is computed using a deterministic function that takes into account characteristics of a snapshot represented by the header block 270, such as but not limited to snapshot sequential number, identification of a computerized node of the computerized grid network that backed up the snapshot, volume sequential number and user credentials. In some exemplary embodiments of the subject matter, several hash functions are used to calculate a retrieval key. Each hash function is used to calculate a retrieval key for a different level in the HLT 200 and a different hash function is used to calculate a retrieval key for data blocks. As an example, a first hash function may be used to calculate a retrieval key of root block 263, a second hash function may be used to calculate a retrieval key of index blocks 243, 247 and 251, a third hash function may be used to calculate a retrieval key of index blocks 224, 228, 233, and a fourth hash function may be used to calculate a retrieval key of data blocks 201, 202, 203, 204, 205, 206, 207, 208, 209.
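A deterministic header key of this kind might be sketched as below. The choice of SHA-1, the field names `node_id`, `volume_number` and `snapshot_number`, the separator byte and the `header:` prefix are all assumptions for illustration; the disclosure does not specify the function, and user credentials are omitted from the sketch.

```python
import hashlib


def header_retrieval_key(node_id: str, volume_number: int,
                         snapshot_number: int) -> bytes:
    """Derive the header block key from snapshot characteristics only.

    Any node that knows these characteristics can recompute the key
    without having retrieved any block of the snapshot first.
    """
    # A NUL separator keeps ("ab", 1) and ("a", "b1")-style inputs distinct.
    material = "\x00".join([node_id, str(volume_number), str(snapshot_number)])
    return hashlib.sha1(b"header:" + material.encode("utf-8")).digest()
```

Unlike the content-based keys of data blocks and index blocks, this key does not depend on the content of the header block, which is what allows the header block to serve as the starting point of retrieval.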

In an exemplary embodiment of the subject matter an index block is of size M bytes and a retrieval key is of size K bytes. An index block may comprise M/K retrieval keys, and hence the number of blocks in a level of an HLT may be a portion of K/M of the number of blocks in the previous level of the HLT. If the level is the lowest level of the HLT, and a data block is also of size M bytes, the number of blocks in that level may be a portion of K/M of the number of data blocks in the volume. For example, consider a retrieval key of 20 bytes, a block of size 4 kibibytes, and a backed up volume of 100 gibibytes, which is stored in 26,214,400 data blocks. Level 1 of the HLT consists of 131072 index blocks comprising 200 retrieval keys each, 512 mebibytes in total. Level 2 of the HLT consists of 656 index blocks. Level 3 of the HLT holds approximately 13 kibibytes of retrieval keys in 4 index blocks. Level 4 consists of 1 index block, the root block.
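The level sizes in such an example follow from repeated ceiling division, as the short Python sketch below shows. The function name is hypothetical. With 26,214,400 data blocks (100 gibibytes in 4 kibibyte blocks) and 200 retrieval keys per index block, it yields 131072, 656, 4 and 1 index blocks for successive levels.

```python
from typing import List


def hlt_level_sizes(num_data_blocks: int, keys_per_index_block: int) -> List[int]:
    """Number of index blocks in each HLT level, lowest level first, root last."""
    if num_data_blocks < 1 or keys_per_index_block < 2:
        raise ValueError("need at least one data block and two keys per index block")
    sizes = []
    count = num_data_blocks
    while True:
        # Ceiling division: a partially filled index block still counts as one.
        count = -(-count // keys_per_index_block)
        sizes.append(count)
        if count == 1:
            return sizes


volume_bytes = 100 * 2**30      # 100 gibibytes
block_bytes = 4 * 2**10         # 4 kibibytes
print(hlt_level_sizes(volume_bytes // block_bytes, 200))
```

A 4 kibibyte block could hold up to 204 keys of 20 bytes; the figure of 200 keys per index block is the one used in the example.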

Fig. 3 shows a block diagram of an HLT used for partially loading a snapshot, according to an exemplary embodiment of the disclosed subject matter. Data blocks represented by nodes in an HLT 300 are retrieved using a header block 301. In some exemplary embodiments of the disclosed subject matter, as described above, a retrieval key of a header block such as the header block 301 is determined using a deterministic function used to retrieve the header block from a DHT. The HLT 300 may be used to load a full snapshot of a volume backed up from a computerized node in a computerized grid network, represented by data blocks 320, 308, 309, 330, 310. In an exemplary embodiment of the disclosed subject matter only a portion of said data blocks 320, 308, 309, 330, 310 are determined to be retrieved. For example, data blocks 308, 309, 310. As described above, said data blocks 320, 308, 309, 330, 310 preserve the respective order as appearing on the volume. When data blocks 308, 309, 310 are determined to be retrieved, their addresses in the volume are located using a file system (not shown), as in retrieval of data blocks from a volume stored in a storage device (not shown). Since the original order of data blocks is preserved, an address of a data block may be used to determine a route from the header block 301 to the data block. In figure 3 there is annotation of the address of a block in its level of the HLT 300. For example, the address of data block 308 is 8 and the address of data block 310 is 22. In case a number of blocks in the backed up volume is provided, for example, 24 in the HLT 300 of figure 3, the topology of the HLT 300 may be known. In some exemplary embodiments of the subject matter, the number of blocks in the backed up volume is provided by the header block 301. 
For example, given 24 data blocks comprising the volume, and in case an index block may comprise up to 4 retrieval keys, as described by the exemplary HLT 300, the first level of the HLT 300 contains 6 index blocks. In figure 3, the 6 index blocks are shown as index blocks 370, 360, 305, 350, 340, 306. If the ninth data block, tenth data block and twenty second data block are determined to be retrieved, i.e. data blocks 308, 309, 310, only the third index block and sixth index block, i.e. index blocks 305, 306, are required to be retrieved from the computerized grid network in order to retrieve said data blocks 308, 309, 310. In a similar manner, it can be determined that both index blocks 303, 304 of the second level of the HLT 300 are required to be retrieved in order to determine the retrieval keys of index blocks 305, 306 of the first level in the HLT 300. A person skilled in the art will appreciate that the route from the header block 301 to the data blocks to be retrieved 308, 309, 310 can be computed by a mathematical computation considering the structure of the HLT 300, the indexing of the data blocks determined to be retrieved and the number of retrieval keys in an index block. For example, the route from the header block 301 to the data block 308 comprises the index blocks 305, 303 and the root block 302. A person skilled in the art will further appreciate that by computing such routes, index blocks not required for the retrieval of data blocks determined to be retrieved are not retrieved. The usage of the system's resources, such as but not limited to RAM, cache memory and network bandwidth, is reduced by minimizing the number of retrieved index blocks. In some exemplary embodiments of the subject matter, a computerized node of a computerized grid network is able to lazy load a volume, as it is able to retrieve a partial snapshot and may load data blocks only on demand.
For example, the computerized node may determine to first load the lookup tables of a file system stored in the snapshot. The computerized node may then determine a plurality of files whose content is required. Only upon such determination, the data blocks representing the content of said selected files are retrieved along with index blocks that are required for the retrieval process. The description hereinafter refers to the operation of determining which data blocks represent the content of a file as associating a file with data blocks.
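As a non-limiting illustration, the route computation referred to above may be sketched in Python using integer division. The sketch is not part of the disclosed embodiments: the function names are hypothetical, addresses and positions are assumed to be zero-based, and level 1 is assumed to be the level holding the retrieval keys of the data blocks. Under this reading of figure 3, the pair (1, 2) would correspond to index block 305, the pair (2, 0) to index block 303, and the pair (3, 0) to the root block 302.

```python
def route_to_data_block(address, num_data_blocks, keys_per_index_block):
    """Return the (level, position) pairs of the index blocks on the route
    from the root block down to the data block at the given address.

    Hypothetical helper: addresses and positions are zero-based, and
    level 1 holds the retrieval keys of the data blocks.
    """
    if not 0 <= address < num_data_blocks:
        raise IndexError("address outside the volume")
    route = []
    position = address
    blocks_below = num_data_blocks
    level = 0
    while True:
        level += 1
        # The parent of the block at `position` sits at position // keys_per_index_block.
        position //= keys_per_index_block
        blocks_here = (blocks_below + keys_per_index_block - 1) // keys_per_index_block
        route.append((level, position))
        if blocks_here == 1:  # reached the root block
            break
        blocks_below = blocks_here
    route.reverse()  # list the root block first
    return route


def index_blocks_needed(addresses, num_data_blocks, keys_per_index_block):
    """Union of the routes to several data blocks; shared index blocks count once."""
    needed = set()
    for address in addresses:
        needed.update(route_to_data_block(address, num_data_blocks, keys_per_index_block))
    return needed


# The topology of figure 3: 24 data blocks, up to 4 retrieval keys per index block.
print(route_to_data_block(8, 24, 4))             # [(3, 0), (2, 0), (1, 2)]
print(len(index_blocks_needed([8, 9, 22], 24, 4)))  # 5
```

In this example, 5 of the 9 index blocks are needed to reach the three requested data blocks, and the remaining 4 are never fetched.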

In some exemplary embodiments of the subject matter, a caching mechanism is utilized. The caching mechanism may cache previously retrieved blocks to be used again should they be required in a retrieval process. For example, in case the data block 308 was retrieved, the index blocks 305, 303, 302 were also retrieved in the retrieval process. When the data block 309 is determined to be retrieved, index blocks 305, 303, 302 are required again since said index blocks 305, 303, 302 comprise the keys required to retrieve the data block 309. If the index blocks 305, 303, 302 are stored in the caching mechanism, the retrieval process of the data block 309 requires retrieving only one block from the DHT, namely the data block 309. If none of the index blocks 305, 303, 302 is cached, the retrieval process of the data block 309 requires retrieving 4 blocks from the DHT: the data block 309 and the index blocks 305, 303, 302. The caching mechanism is also useful when retrieving the content of a file when only a portion of the data blocks representing its content were modified. In such a scenario, data blocks previously retrieved and cached may be reused when retrieving the modified content of the file. For example, in case the content of a file is stored in data blocks 220, 208 and only data block 208 was modified, the retrieval process of the modified content requires only retrieving the modified data block 208 from the computerized grid network and combining it with the data block 220 that was not modified. A person skilled in the art will appreciate that by using said caching mechanism the use of the system's limited resources, such as network bandwidth and processing elements, is reduced. For simplicity, a caching mechanism may be implemented as an interface of a DHT, caching values from the actual DHT and performing a lookup in the cache before utilizing the actual DHT.
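As a non-limiting illustration, a caching mechanism implemented as an interface of a DHT may be sketched in Python as follows. The sketch is not part of the disclosed embodiments: the class name `CachingDHT` is hypothetical, a plain dictionary stands in for the actual DHT, and the keys are the reference numerals of figure 3 used as placeholder names rather than real retrieval keys.

```python
class CachingDHT:
    """Hypothetical cache in front of a DHT-like mapping of retrieval key to block."""

    def __init__(self, dht):
        self._dht = dht          # any mapping that supports dht[key]
        self._cache = {}
        self.remote_fetches = 0  # how many blocks were fetched from the actual DHT

    def get(self, key):
        # Look in the cache first; fall back to the actual DHT on a miss.
        if key not in self._cache:
            self._cache[key] = self._dht[key]
            self.remote_fetches += 1
        return self._cache[key]


# Stand-in for the grid: placeholder names instead of real retrieval keys.
fake_dht = {name: "contents of block " + name
            for name in ("302", "303", "305", "308", "309")}
cached = CachingDHT(fake_dht)

# Retrieving data block 308 walks the route 302 -> 303 -> 305 -> 308.
for key in ("302", "303", "305", "308"):
    cached.get(key)
fetches_for_308 = cached.remote_fetches

# Retrieving data block 309 reuses the three cached index blocks.
for key in ("302", "303", "305", "309"):
    cached.get(key)
fetches_for_309 = cached.remote_fetches - fetches_for_308

print(fetches_for_308, fetches_for_309)  # 4 1
```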

Lazy loading should be construed as a method that enables retrieval of a backed up data block upon demand. In an exemplary embodiment of the subject matter, lazy loading is enabled by the HLT data structure. A volume represented by the HLT is accessible in a manner similar to a volume stored in a storage device: by address. As disclosed above, in case the header block comprises information regarding the number of data blocks in the volume, the route to a data block may be computed and only the necessary index blocks are retrieved before retrieving the desired data block. Such a random access method is necessary to enable lazy loading.

Fig. 4 shows a flow diagram of a method of creating an HLT representation of a snapshot, according to an exemplary embodiment of the subject matter. In step 400, data blocks comprising a volume in a computerized grid network are read. In step 405, a retrieval key for each of said data blocks is computed. An exemplary computation method is utilizing a hash function as disclosed above. Next, in step 410, the first level of the HLT is created. The first level is comprised of index blocks. In some exemplary embodiments of the subject matter, the number of index blocks is a function of the number of retrieval keys that can be stored in a single index block. As noted above, the respective ordering of the data blocks is preserved. The respective ordering of the retrieval keys of the data blocks is also preserved. Each index block may be considered a representation of a set of data blocks whose retrieval keys it holds, either directly or indirectly. The respective ordering of the sets of data blocks may also be preserved. In an exemplary embodiment of the subject matter, in case the last index block of the first level of the HLT comprises fewer than the maximum number of retrieval keys stored in an index block, the last index block also comprises dummy retrieval keys. For example, in case an index block comprises 2 retrieval keys and a volume comprises 5 data blocks, three index blocks will be created: a first index block comprising the two retrieval keys of the first two data blocks; a second index block comprising the two retrieval keys of the third and fourth data blocks; and a third index block comprising the retrieval key of the last data block and an additional dummy retrieval key.
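As a non-limiting illustration, the creation of the first level, including the padding of the last index block with dummy retrieval keys, may be sketched in Python as follows. The sketch is not part of the disclosed embodiments: SHA-1 is chosen only because it yields 20-byte keys, the all-zero value of `DUMMY_KEY` is an assumption (the value of a dummy retrieval key is not specified here), and the function names are hypothetical.

```python
import hashlib

# Assumed placeholder; the value of a dummy retrieval key is not specified in the text.
DUMMY_KEY = b"\x00" * 20


def retrieval_key(block):
    """Content-based key; SHA-1 is an assumption, picked for its 20-byte digest."""
    return hashlib.sha1(block).digest()


def build_first_level(data_blocks, keys_per_index_block):
    """Group the retrieval keys of the data blocks, in volume order, into index blocks.

    Each index block is modelled as a list of keys; the last one is padded
    with DUMMY_KEY so that every index block holds the same number of keys.
    """
    keys = [retrieval_key(block) for block in data_blocks]
    level = []
    for start in range(0, len(keys), keys_per_index_block):
        group = keys[start:start + keys_per_index_block]
        group += [DUMMY_KEY] * (keys_per_index_block - len(group))
        level.append(group)
    return level


# The example above: 5 data blocks, 2 retrieval keys per index block.
volume = [b"block-%d" % i for i in range(5)]
first_level = build_first_level(volume, 2)
print(len(first_level))                  # 3
print(first_level[2][1] == DUMMY_KEY)    # True
```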

In step 415 the data blocks are sent to the computerized grid network to be backed up. In some exemplary embodiments of the disclosed subject matter, the data blocks are backed up on different computerized nodes of the computerized grid network. In other exemplary embodiments of the disclosed subject matter, the data blocks may be sent to the computerized grid network as soon as they are updated. The data blocks may be sent only after a portion of an HLT is constructed.

Next, in step 420, retrieval keys of the index blocks of the first level are calculated, for example by using a hash function. In step 425, a second level of the HLT is created in a manner similar to the creation of the first level of the HLT disclosed in step 410. The second level comprises new index blocks, which comprise retrieval keys of the index blocks of the first level of the HLT. In some exemplary embodiments of the subject matter, the index blocks preserve their respective order. In step 430, the index blocks of the first level are sent to the computerized grid network to be stored on at least one computerized node of the computerized grid network. In some exemplary embodiments of the subject matter, a variable quality of service is determined for every block stored in the computerized grid network. An index block may be assigned a quality of service level that is a function of the quality of service levels assigned to the blocks to which the index block points using retrieval keys. In step 440, the size of the most recently created level of the HLT is examined. In case that level comprises more than one index block, the flow loops back to step 420 to create an additional level in the HLT. In case that level comprises one index block, the one index block is considered the root block of the HLT and the flow continues to step 450. In step 450, a retrieval key of the root block is computed in a manner similar to the computations done in steps 405, 420, for example using a hash function. In step 455, a header block is created and used to store the retrieval key of the root block and additional information about the snapshot as disclosed above. In step 460, the root block is sent to the computerized grid network for storage. In step 465, which is optional, the header block is also sent to the computerized grid network using a retrieval key computed by a deterministic function as disclosed above.
By storing the header block in the computerized grid network only after the root block, index blocks and data blocks were stored, the HLT provides atomicity. Atomicity should be construed as providing a complete snapshot and not allowing access to a representation of a volume unless the volume itself is entirely accessible, either directly or indirectly. In some exemplary embodiments of the disclosed subject matter, a computerized node of the computerized grid network may retrieve a data block or an index block of the volume only after the header block is stored in the computerized grid network, since the header block is the starting point for the entire retrieval process and provides the representation of the entire backed up volume.
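As a non-limiting illustration, the flow of figure 4 may be sketched in Python with a dictionary standing in for the computerized grid network. The sketch is not part of the disclosed embodiments and makes several assumptions: SHA-1 serves as the hash function, the header layout (root key followed by an 8-byte block count) is invented for the example, the header key is derived by hashing a snapshot name, and dummy-key padding is omitted for brevity. The function and variable names are hypothetical.

```python
import hashlib


def key_of(content):
    """Content-based retrieval key; SHA-1 is an assumption."""
    return hashlib.sha1(content).digest()


def backup_volume(data_blocks, keys_per_index_block, store, snapshot_name):
    """Store data blocks, then index levels, then the root, and the header last."""
    # Steps 400-415: compute keys of the data blocks and store the blocks.
    keys = []
    for block in data_blocks:
        key = key_of(block)
        store[key] = block
        keys.append(key)

    # Steps 410-440: build index levels until a level has a single (root) block.
    while True:
        index_blocks = [b"".join(keys[i:i + keys_per_index_block])
                        for i in range(0, len(keys), keys_per_index_block)]
        keys = [key_of(block) for block in index_blocks]
        for key, block in zip(keys, index_blocks):
            store[key] = block  # steps 430 and 460
        if len(index_blocks) == 1:
            break

    # Steps 450-465: the header is written only after everything it points to.
    root_key = keys[0]
    header = root_key + len(data_blocks).to_bytes(8, "big")  # assumed layout
    header_key = key_of(snapshot_name.encode())              # assumed deterministic function
    store[header_key] = header
    return header_key


grid = {}  # stand-in for the computerized grid network
blocks = [bytes([i]) * 16 for i in range(5)]
header_key = backup_volume(blocks, 2, grid, "volume-C/snapshot-1")
print(len(grid))  # 12: 5 data blocks, 3 + 2 + 1 index blocks, 1 header
```

Because a reader can only start from the header block, a snapshot whose header has not yet been written is simply not visible, which is one way to read the atomicity property described above.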

In some exemplary embodiments of the subject matter, a retrieval key is computed using a hash function and based on the content of a block. Hence, a retrieval key of a data block remains unchanged if the data block is not modified. In case only a portion of the data blocks comprising a volume are modified after the volume is backed up using a first HLT representation, then, upon creation of a second HLT representation of the updated volume, the second HLT may reuse one or more index blocks of the first HLT. For example, consider the HLT of figure 3. If data block 310 is the only data block that was modified and the entire volume is determined to be backed up, then the second HLT may reuse index blocks 303, 370, 360, 305, 350, 340 of the first HLT. The blocks of the first HLT that cannot be reused in the new HLT after the data block 310 is modified are the index block 306, containing a retrieval key of the modified data block 310; the index block 304, containing a retrieval key of the modified index block 306; the root block 302, containing a retrieval key of the modified index block 304; and the header block 301, comprising a retrieval key of the modified root block 302 and properties of the modified volume. In some exemplary embodiments of the subject matter, a caching mechanism is utilized to assist in incrementally building an HLT.
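As a non-limiting illustration, the reuse of index blocks between two HLT representations may be demonstrated in Python by comparing the sets of content-based retrieval keys of both trees. The sketch is not part of the disclosed embodiments: SHA-1 is an assumed hash function, the function names are hypothetical, and the topology of figure 3 (24 data blocks, up to 4 keys per index block) is used with the block at zero-based address 22 modified.

```python
import hashlib


def key_of(content):
    """Content-based retrieval key; SHA-1 is an assumption."""
    return hashlib.sha1(content).digest()


def index_blocks_of(data_blocks, keys_per_index_block):
    """Return {retrieval key: index block} for every index block of an HLT."""
    keys = [key_of(block) for block in data_blocks]
    index = {}
    while True:
        level = [b"".join(keys[i:i + keys_per_index_block])
                 for i in range(0, len(keys), keys_per_index_block)]
        keys = [key_of(block) for block in level]
        index.update(zip(keys, level))
        if len(level) == 1:  # the root block has been created
            return index


old_volume = [("block %d" % i).encode() for i in range(24)]
new_volume = list(old_volume)
new_volume[22] = b"modified block"  # only one data block changes

old_index = index_blocks_of(old_volume, 4)
new_index = index_blocks_of(new_volume, 4)

reused = old_index.keys() & new_index.keys()   # identical content, identical key
created = new_index.keys() - old_index.keys()  # blocks on the route to the change
print(len(reused), len(created))  # 6 3
```

The 6 reused index blocks correspond to index blocks 303, 370, 360, 305, 350, 340 in the example above, and the 3 newly created ones replace index blocks 306, 304 and the root block 302.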

Fig. 5 shows a flow diagram of a method of retrieving multiple computerized versions of a backed up file, according to an exemplary embodiment of the subject matter. In step 500, a retrieval key of a header block of a last in time snapshot is calculated. The last in time snapshot is the most recent backup of the volume. Said retrieval key may be calculated using a deterministic function, or using a hash table stored in the computerized node performing the method, or stored in other computerized nodes connected to the computerized node performing the method. In some exemplary embodiments of the subject matter, the retrieval key provides a snapshot which is a last in time snapshot within a predefined time range. In step 510, using the retrieval key, a DHT is utilized to retrieve the header block of the snapshot. A person skilled in the art may use other methods of retrieving a block from a computerized grid network. In step 520, the retrieved last in time snapshot is examined and the file system of the volume represented by the retrieved last in time snapshot is used to search for said backed up file. In case the backed up file does not appear in the retrieved last in time snapshot, step 560 is performed and a retrieval key of a second snapshot is calculated. The second snapshot is characterized in that it was stored before the previously retrieved snapshot and that no third snapshot exists that was stored after the second snapshot and before the previously retrieved snapshot. If such a second snapshot does not exist, the retrieval key computed is a dummy retrieval key or any other indication that such a snapshot was not stored in the backup data. In case the file is stored within the retrieved last in time snapshot, step 530 is performed. In step 530, metadata of the backed up file is read. The metadata may comprise a modification time of the file, a date of creation of the file or other data describing when the file was last modified.
In step 540, the backed up file in the retrieved last in time snapshot is added to a storage containing a list of versions. Such storage may be in the computerized node performing the method, or at least partially in other computerized nodes connected to the computerized node performing the method. In some exemplary embodiments of the subject matter, a version of the file contained in the retrieved last in time snapshot is added to the list of versions. The version of the file is added as a version related to the retrieved last in time snapshot. It will be apparent to a person skilled in the art that a version of a file related to a specific snapshot is also related to a time at which the file or at least a portion of the volume has been backed up. In step 550, the metadata of the file is used to determine when the file was last modified. A retrieval key of a header block of an additional snapshot is calculated, in a manner similar to the calculation of the key performed in step 500. The additional snapshot is characterized in that it was backed up before the time at which the file was modified and that no other snapshot exists that was backed up after the backup time of the additional snapshot and before the file was modified. If such a snapshot does not exist, the retrieval key computed is a dummy retrieval key or any other indication that such a snapshot was not stored in the backup data. In step 570, in case the retrieval key that was computed either in step 560 or in step 550 is a dummy retrieval key, the control flows to step 580. The list of versions may then be sent to another process or method for further computations or presented to a user of a computerized node of the computerized grid network. In step 580, the method ends after optionally performing cleanup operations that delete temporary information, such as information regarding data blocks to be backed up by the HLT.
In case the retrieval key computed in step 560 or step 550 is not a dummy retrieval key, the required snapshot is available in the backup data. Hence, the last snapshot to be examined by the method has not yet been reached. The flow loops back to step 510 and the retrieval key is used to retrieve a header block of a snapshot.
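As a non-limiting illustration, the loop of figure 5 may be sketched in Python over an in-memory list of snapshots. The sketch is not part of the disclosed embodiments: each snapshot is modelled as a pair of a backup time and a mapping from file path to modification time, times are plain numbers, a negative index plays the role of the dummy retrieval key, and the function name `file_versions` is hypothetical.

```python
from bisect import bisect_left


def file_versions(snapshots, path):
    """List (backup_time, modification_time) for each distinct version of a file.

    snapshots: list of (backup_time, {path: modification_time}) pairs,
    sorted by backup_time in ascending order (an assumed in-memory model).
    """
    times = [backup_time for backup_time, _ in snapshots]
    versions = []
    i = len(snapshots) - 1  # step 500: start from the last in time snapshot
    while i >= 0:           # i < 0 stands for the dummy retrieval key (step 570)
        backup_time, files = snapshots[i]  # steps 510-520
        if path in files:
            modified = files[path]                    # step 530
            versions.append((backup_time, modified))  # step 540
            # Step 550: jump to the latest snapshot backed up before the
            # modification time; min() guarantees progress even if the
            # recorded modification time is later than the backup time.
            i = min(i, bisect_left(times, modified)) - 1
        else:
            i -= 1  # step 560: the snapshot immediately preceding this one
    return versions


snapshots = [
    (10, {"a.txt": 5, "b.txt": 7}),
    (20, {"a.txt": 5, "b.txt": 7}),
    (30, {"a.txt": 25}),
    (40, {"a.txt": 25}),
]
print(file_versions(snapshots, "a.txt"))  # [(40, 25), (20, 5)]
print(file_versions(snapshots, "b.txt"))  # [(20, 7)]
```

In this example the snapshots taken at times 30 and 10 are never examined for `a.txt`, since the modification times show that they hold the same versions as the snapshots taken at times 40 and 20.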

In some exemplary embodiments of the subject matter, a caching mechanism is used. Whenever a snapshot is retrieved, index blocks used to retrieve the desired data blocks are cached and may be used to retrieve a subsequent snapshot in the method described in figure 5. In other exemplary embodiments of the subject matter, snapshots are lazy loaded. In yet other exemplary embodiments of the subject matter, the file system of a volume has a known characteristic that data blocks that contain metadata are likely to preserve their location on the volume over time. Such a file system may be NTFS. In such a case, after loading a first snapshot in the method described in figure 5, the next snapshot is lazy loaded and the data blocks that contain the metadata retrieved in step 530 are retrieved using the addresses of the corresponding data blocks in the first retrieved snapshot. Such optimization significantly decreases the number of blocks to be retrieved from the computerized grid network, and thus reduces the usage of the system's resources. In an exemplary embodiment of the subject matter, a data block is verified to contain the metadata of the backed up file by comparing the metadata with the metadata of the backed up file in the first snapshot. As an example, parent directories, name of file, location on disk and other characteristics may be compared.

While the disclosure has been described with reference to exemplary embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings without departing from the essential scope thereof. Therefore, it is intended that the disclosed subject matter not be limited to the particular embodiment disclosed as the best mode contemplated for carrying out this invention, but only by the claims that follow.

Claims

1. An apparatus including a data structure stored therein, the data structure to allow for the retrieval of backed up data blocks, the data structure comprises: a first index block comprising a first representation of an at least one backed up data block; said first representation comprises a retrieval key for a data block or an at least one secondary index block; said at least one secondary index block comprises a second representation of a portion of said at least one backed up data block; said second representation comprises two or more retrieval keys of either at least two data blocks or at least two other secondary index blocks.

2. The apparatus of claim 1, wherein said data structure further comprises a header block.

3. The apparatus of claim 2, wherein the header block comprises data enabling to determine a route to a selected backed up data block in the apparatus.

4. The apparatus of claim 2, wherein said header block comprises information used for lazy loading of a backed up snapshot.

5. The apparatus of claim 2, wherein said header block comprises a third representation of said at least one backed up data block; said third representation comprises a retrieval key of said first index block; said first index block is stored in said apparatus.

6. The apparatus of claim 2, wherein said header block is stored in said apparatus; a retrieval key of said header block is deterministically determined.

7. The apparatus of claim 2, wherein said header block further comprises a files filter.

8. The apparatus of claim 2, wherein said header block is used as a start point in retrieving the backed up data blocks.

9. The apparatus of claim 8, wherein said header block is stored in said apparatus; wherein said data structure provides atomicity.

10. The apparatus of claim 1, wherein the data structure stores the previously backed up data blocks in a first order corresponding to a second order in which the previously backed up data blocks are stored in a volume.

11. The apparatus of claim 1, wherein an at least one index block or data block contained in said data structure is also contained in another data structure of an equivalent type as the data structure.

12. The apparatus of claim 1, wherein an at least one index block or data block is cached.

13. A method of backing up at least one data block in a computerized network, the method comprising: a. determining a retrieval key associated with each data block of the at least one data block in the computerized network, said retrieval key enables retrieval of the associated data block in the computerized network; b. generating an at least one index block comprising an at least two retrieval keys; c. determining a retrieval key associated with said at least one index block on the computerized network; repeating steps (b) and (c) until only one index block is generated.

14. The method according to claim 13, further comprising a step of storing the at least one index block in the computerized network.

15. A method of retrieving previously backed up data blocks in a computerized network, the method comprising: a. obtaining a first index block representing the previously backed up data blocks, said first index block comprises an at least one retrieval key associated with an at least one additional block; b. retrieving said at least one additional block using said at least one retrieval key; in case the retrieved block is an index block comprising a second at least one retrieval key, retrieving a second at least one additional block associated with said second at least one retrieval key; repeating step (b) at least once.

16. The method according to claim 15, wherein said at least one additional block comprises either a data block or an index block.

17. The method according to claim 15, the retrieval key of the first index block is deterministically determined.

18. The method according to claim 15, further comprising a step of utilizing a files filter to filter out information regarding data blocks that were not backed up.

19. A method for retrieving a selected portion of previously backed up data blocks in a computerized network, the method comprising: a. obtaining a header block representing the previously backed up data blocks, said header block contains an at least one retrieval key associated with a block; b. for at least one data block of said previously backed up data blocks to be retrieved, determining a route from the header block to the data block; locating a retrieval key associated with each block in route; said locating retrieval key comprised by another block in said route; retrieving said each block of route using said retrieval key.

20. A method for retrieving multiple computerized versions of a previously backed up file, the method comprising: a. determining a last in time recovery point comprising a version of said previously backed up file; b. retrieving the last in time recovery point; c. determining a modification time of said version of said previously backed up file; d. determining an additional recovery point comprising another version of said previously backed up file, said additional recovery point was backed up before the modification time of said version of said previously backed up file; e. retrieving said additional recovery point; repeating steps (c)-(e) until no additional recovery points are retrieved or until the additional recovery point does not comprise a version of the previously backed up file.

21. The method of claim 20, further comprising a step of computing a computerized representation of said versions.

22. The method of claim 20, wherein said recovery points are retrieved from a computerized grid network.

23. The method of claim 20, wherein each of said recovery points is represented using an HLT data structure.