WO2019037876A1 - Network system and method for deduplicating data - Google Patents

Network system and method for deduplicating data

Info

Publication number
WO2019037876A1
Authority
WO
WIPO (PCT)
Prior art keywords
container
segment
data
segment metadata
metadata
Prior art date
Application number
PCT/EP2017/071464
Other languages
French (fr)
Inventor
Michael Hirsch
Yair Toaff
Yehonatan DAVID
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Priority to PCT/EP2017/071464 (WO2019037876A1)
Priority to CN201780093463.9A (CN110945483B)
Publication of WO2019037876A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10: File systems; File servers
    • G06F 16/17: Details of further file system functions
    • G06F 16/174: Redundancy elimination performed by the file system
    • G06F 16/1748: De-duplication implemented within the file system, e.g. based on file segments
    • G06F 16/1752: De-duplication implemented within the file system, e.g. based on file segments, based on file chunks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00: Error detection; Error correction; Monitoring
    • G06F 11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/14: Error detection or correction of the data by redundancy in operation
    • G06F 11/1402: Saving, restoring, recovering or retrying
    • G06F 11/1446: Point-in-time backing up or restoration of persistent data
    • G06F 11/1448: Management of the data involved in backup or backup restore
    • G06F 11/1453: Management of the data involved in backup or backup restore using de-duplication of the data
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00: Error detection; Error correction; Monitoring
    • G06F 11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/16: Error detection or correction of the data by redundancy in hardware
    • G06F 11/20: Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F 11/2053: Error detection or correction of the data by redundancy in hardware using active fault-masking, where persistent mass storage functionality or persistent mass storage control functionality is redundant
    • G06F 11/2094: Redundant storage or storage space

Definitions

  • the present invention relates to a network system and to a method for deduplicating data, in particular for storing deduplicated data. This includes, for example, storing received data blocks as deduplicated data blocks. Particularly, the present invention relates to the area of data deduplication in a distributed environment like a distributed cluster of servers (distributed data deduplication).
  • the metadata for a stored data block is typically referred to as a deduplicated data block, and is a list of data segments of the stored data block.
  • the data segments of the stored data block are sequences of consecutive bytes, and upon receiving a data block to be stored, the block is typically chunked into these data segments (segmentation).
  • a typical data segment length varies from product to product, but may average about 4 kB.
  • a block may contain thousands of such data segments.
  • the data segments are stored in containers, and a container may contain thousands of data segments. Each data segment is stored in the container in association with segment metadata, and the totality of all segment metadata of the data segments in a container is referred to as container metadata.
  • the segment metadata of a data segment may include storage details of the data segment in the container and a strong hash of the data segment.
  • the segment metadata of a data segment may include details about a reference count indicating for the (unique) data segment in how many data blocks it was found.
  • the initial reference count of any new data segment is usually 1, reflecting its first use.
  • the reference count of an existing data segment is incremented when a new block refers to it.
  • the reference count of a data segment is decremented when a block referring to it is deleted. When the reference count of a data segment reaches 0, the space can be reclaimed.
  • a problem in a conventional data deduplication system is, however, that the segment metadata of a data segment including the reference count needs to be rewritten every time that the reference count is incremented or decremented. This leads to a large amount of Input / Output operations (I/Os), and thus impacts negatively on the overall performance of the conventional deduplication system.
  • a conventional data deduplication system is limited to an effective scalability of only 2 to 4 servers per cluster, i.e. it is limited in the number of servers that can work simultaneously on the same deduplication scope.
  • the conventional data deduplication system also has the problem that data cannot be restored if one node in a cluster is down. This is due to the fact that no replication of important data is performed, and accordingly the loss of even a single node may lead to complete data loss.
  • the invention aims to improve conventional data deduplication.
  • the object of the invention is to provide a network system and a method for storing deduplicated data, in particular for storing deduplicated data in a distributed manner, wherein the network system is not limited in its scalability. That means that more than 4 servers, and particularly up to 64 servers, should at least be possible per cluster. All servers in a cluster should thereby be able to work simultaneously on the same deduplication scope, i.e. on the same stored deduplicated data. Each server should also perform as well as or even better than a server in the conventional distributed data deduplication system. In particular, the number of necessary I/Os should be reduced. Additionally, a replication of necessary information should be enabled, in order to avoid data loss in case of a node loss.
  • the object of the present invention is achieved by the solution provided in the enclosed independent claims.
  • Advantageous implementations of the present invention are further defined in the dependent claims.
  • the present invention proposes splitting off certain parts of the segment metadata including liveness indicators (e.g. reference counts) from the rest of the segment metadata of a data segment, and storing it in a different physical location.
  • a first aspect of the invention provides a network system for storing deduplicated data, the network system comprising a common network accessible repository, the repository storing one or more containers, each container including one or more data segments and first segment metadata for each data segment, and a plurality of backup nodes, a backup node storing, for at least one container in the repository, second segment metadata for each data segment of the container, the second segment metadata including at least a liveness indicator for each data segment of the container.
  • the network system of the first aspect improves the conventional systems by splitting off, for each data segment in the container, the second segment metadata including the liveness indicator, from the first segment metadata.
  • Thus, the network system avoids rewriting the complete segment metadata of a data segment every time the corresponding liveness indicator is changed. Only the second segment metadata needs to be rewritten. This saves a significant amount of I/Os and increases the overall performance of the network system, particularly since the liveness indicator of a data segment is typically changed constantly during data deduplication.
  • the network system of the first aspect has no practical limit to its scalability, and at least up to 64 servers (e.g. agents) in a cluster may be configured to modify data segments and also modify the first and second segment metadata of the data segments in the container, respectively.
  • This possibility to scale the hardware also provides scalability of processing power and of the total data managed. All servers in a cluster can work simultaneously on the same deduplicated data. Each server performs as well as or better than its non-distributed predecessor.
  • the liveness indicator for a data segment is a reference count for the data segment, the reference count indicating a number of deduplicated data blocks referring to that data segment.
  • the network system of the first aspect enables a reliable use of reference counts in a distributed environment.
  • By separating at least the reference counts from the first segment metadata (which, e.g., includes hashes of the corresponding data segments), rewriting the first segment metadata every time the reference count needs to be incremented or decremented is not necessary. Therefore, the performance of the network system is significantly increased.
  • the plurality of backup nodes are configured to replicate among each other the second segment metadata of the at least one container, in particular the liveness indicator for each data segment of the container.
  • the network system of the first aspect thus enables a replication of necessary information, in order to ensure that complete data loss is avoided in case that a node goes down.
  • the network system is resilient to up to k node failures, when there are at least 2k+1 nodes in a cluster.
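  • The following short sketch merely illustrates the stated resilience condition; the helper name is illustrative and not part of the claimed system:

```python
def max_tolerated_node_failures(cluster_size: int) -> int:
    """With at least 2k+1 backup nodes, up to k node failures can be tolerated,
    i.e. a cluster of n nodes tolerates floor((n - 1) / 2) failures."""
    return (cluster_size - 1) // 2

# Examples: 3 nodes tolerate 1 failure (3 = 2*1 + 1), 7 nodes tolerate 3 (7 = 2*3 + 1).
assert max_tolerated_node_failures(3) == 1
assert max_tolerated_node_failures(7) == 3
```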
  • the second segment metadata of the at least one container further includes a state of the container, and the container state is "regular” if the container can currently be updated by any agent and the state is "locked” if the container cannot currently be updated by any agent other than the agent currently locking the container.
  • the second segment metadata of each container further includes a current version number of the second segment metadata of the container.
  • the second segment metadata of each container further includes a current version number of the first segment metadata of the container and a current version number of the data segments of the container. All version numbers may respectively be file suffixes.
  • the plurality of backup nodes are configured to cooperate to atomically and persistently update the second segment metadata of at least one container, wherein any subset of the container state, the version number of the first segment metadata, the version number of the data segments and/or a plurality of liveness indicators of the second segment metadata of the container is updated, if the version number of the second segment metadata has not changed.
  • an update to any set of data items stored in a backup node and replicated between backup nodes is called "atomic", if and only if the entire update in its entirety becomes visible to all backup nodes simultaneously.
  • any and all backup nodes retrieving the set of data items will receive a consistent set of the data items prior to the update (if queried before the point in time of the update) or the consistent updated set of data items (if queried after the point in time of the update).
  • the network system further comprises at least one agent configured to modify the data segments, the first segment metadata and/or the second segment metadata of a container, wherein the at least one agent is configured to recover the "regular" state of a container, which is in a "locked” state and for which the version number of the second segment metadata did not change for a time period larger than a determined threshold value, based on the current version numbers of the first segment metadata and the data segments, respectively.
  • the network system can perform the data deduplication more efficiently.
  • the at least one agent is configured to modify any of the version numbers of a container after modifying the second segment metadata, the first segment metadata and/or the data segments of the container, respectively.
  • any operation on the data segments and the first and second segment metadata is limited such that the "regular" state of the container can be recovered after a crash at any point in the operation.
  • the at least one agent is configured to, for recovering the "regular" state of a container, retrieve, from the second segment metadata of the container, the current version numbers of the first segment metadata and the data segments and a number of the data segments of the container, atomically and persistently increment the version number of the second segment metadata, if it has not changed since the retrieving, read the first segment metadata with the current version number of the first segment metadata as retrieved with the second segment metadata from the repository, determine, from the read first segment metadata, a size that the data segments of the container with the current version number of the data segments of the container should have, truncate the size of the data segments of the container according to the determined size, if the size of the data segments is larger than the determined size, and/or truncate the first segment metadata of the container according to the retrieved number of data segments, and atomically and persistently reset the state of the container to "regular" and increment the version number of the second metadata, if it has not changed in the meantime.
  • the at least one agent is configured to perform a recovery process.
  • this recovery process is a preferred example.
  • the at least one agent may, however, be configured to perform any other recovery process leading to the container recovery.
  • the "regular" state of the container can be effectively recovered in case an agent left it in a non-regular state after a failure. After the recovery, all three parts of the container are coherent without any loss of data or storage space.
  • data can be restored from a container that is in the "regular" state or in the "locked” state at any time.
  • the repository stores the data segments of each container in a first storage and stores the first segment metadata of each container in a second storage, which can be different from the first storage.
  • the first storage and the second storage may be physically separate devices. These can be the same (e.g. each storage may be a SSD), but are preferably different (in terms of e.g. read latency, capacity or cost).
  • Using different (types of) storages enables selecting them according to the type of the segment data (data segments, first segment metadata, or second segment metadata) that is stored therein, which can improve the performance and reliability of the network system.
  • the first storage further stores, for each container, also the first segment metadata of the container. Storing the first segment metadata again, improves the network system in case a recovery is needed.
  • the read latency of the second storage is not higher than the read latency of the first storage, and/or preferably the second storage is a solid state drive, SSD, or is a serial attached SCSI, SAS, storage and/or the first storage is a serial advanced technology attachment, SATA storage.
  • a second aspect of the invention provides a method for storing deduplicated data, the method comprising storing one or more containers in a common network accessible repository, each container including one or more data segments and first segment metadata for each data segment, and storing, in a plurality of backup nodes, second segment metadata for each data segment of at least one container in the repository, the second segment metadata including at least a liveness indicator for each data segment of the container.
  • the liveness indicator for a data segment is a reference count for the data segment, the reference count indicating a number of deduplicated data blocks referring to that data segment.
  • the plurality of backup nodes replicate among each other the second segment metadata of the at least one container, in particular the liveness indicator for each data segment of the container.
  • the second segment metadata of the at least one container further includes a state of the container, and the container state is "regular” if the container can currently be updated by any agent and the state is "locked” if the container cannot currently be updated by any agent other than the agent currently locking the container.
  • the second segment metadata of each container further includes a current version number of the second segment metadata of the container.
  • the second segment metadata of each container further includes a current version number of the first segment metadata of the container and a current version number of the data segments of the container.
  • the plurality of backup nodes cooperate to atomically and persistently update the second segment metadata of at least one container, wherein any subset of the container state, the version number of the first segment metadata, the version number of the data segments and/or a plurality of liveness indicators of the second segment metadata of the container is updated, if the version number of the second segment metadata has not changed.
  • At least one agent modifies the data segments, the first segment metadata and/or the second segment metadata of a container, wherein the at least one agent is configured to recover the "regular" state of a container, which is in a "locked” state and for which the version number of the second segment metadata did not change for a time period larger than a determined threshold value, based on the current version numbers of the first segment metadata and the data segments, respectively.
  • the at least one agent modifies any of the version numbers of a container after modifying the second segment metadata, the first segment metadata and/or the data segments of the container, respectively.
  • any operation on the data segments and the first and second segment metadata is limited such that the "regular" state of the container can be recovered after a crash at any point in the operation.
  • the at least one agent retrieves, for recovering the "regular" state of a container, from the second segment metadata of the container, the current version numbers of the first segment metadata and the data segments and a number of the data segments of the container, atomically and persistently increments the version number of the second segment metadata, if it has not changed since the retrieving, reads the first segment metadata with the current version number of the first segment metadata as retrieved with the second segment metadata from the repository, determines, from the read first segment metadata, a size that the data segments of the container with the current version number of the data segments of the container should have, truncates the size of the data segments of the container according to the determined size, if the size of the data segments is larger than the determined size, and/or truncates the first segment metadata of the container according to the retrieved number of data segments, and atomically and persistently resets the state of the container to "regular" and increments the version number of the second metadata, if it has not changed in the meantime.
  • data can be restored from a container that is in the "regular” state or in the "locked” state at any time.
  • the repository stores the data segments of each container in a first storage and stores the first segment metadata of each container in a second storage, which can be different from the first storage.
  • the first storage further stores, for each container, also the first segment metadata of the container.
  • the read latency of the second storage is not higher than the read latency of the first storage, and/or preferably the second storage is a solid state drive, SSD, or is a serial attached SCSI, SAS, storage and/or the first storage is a serial advanced technology attachment, SATA storage.
  • a third aspect of the invention provides a computer program product comprising a program code for controlling a network system according to the first aspect or one of its implementation forms, or for performing, when running on a computer, the method according to the second aspect or any of its implementation forms.
  • the computer program product of the third aspect achieves all benefits and effects of the network system of the first aspect, and the method of the second aspect, respectively. It has to be noted that all devices, elements, units and means described in the present application could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities.
  • FIG. 1 shows a network system according to an embodiment of the present invention.
  • FIG. 2 shows a network system according to an embodiment of the present invention.
  • FIG. 3 shows a network system according to an embodiment of the present invention.
  • FIG. 4 shows a container logically (b) and its contents for one data segment (a).
  • FIG. 5 shows a container as physically stored in the network system according to an embodiment of the present invention.
  • FIG. 6 shows a network system according to an embodiment of the present invention.
  • FIG. 7 shows a method according to an embodiment of the present invention.
  • Fig. 1 shows a network system 100 according to an embodiment of the present invention.
  • the network system 100 is configured to store deduplicated data, i.e. to store, for instance, received data blocks as deduplicated data blocks.
  • the network system 100 comprises a common network accessible repository 101 and a plurality of backup nodes 105.
  • the repository 101 may be a standard repository storage of a network system. However, the repository 101 may include at least a first storage and a second storage different from the first storage. That is, the storages differ, for example, in type, storage capacity, read latency, and/or read speed. Preferably, the read latency of the second storage is not higher than the read latency of the first storage.
  • the second storage is a solid state drive (SSD) or is a serial attached SCSI (SAS) storage.
  • the first storage is a serial advanced technology attachment, SATA storage.
  • the repository 101 stores one or more containers 102, each container 102 including one or more data segments 103 and first segment metadata 104 for each data segment 103.
  • the repository 101 stores the data segments 103 of a container 102 in the above-mentioned first storage, and stores the first segment metadata 104 of a container 102 in the above-mentioned second storage.
  • the plurality of backup nodes 105 may each maintain a database, in order to store information.
  • a backup node 105 stores, for at least one container 102 in the repository 101, second segment metadata 106 for each data segment 103 of the container 102.
  • the second segment metadata 106 includes at least a liveness indicator 107 for the data segment 103.
  • the liveness indicator 107 for a data segment 103 is a reference count for the data segment 103, wherein the reference count indicates a number of deduplicated data blocks referring to that data segment 103.
  • the plurality of backup nodes 105 preferably replicate among each other at least the second segment metadata 106, in particular at least the liveness indicator 107 for each data segment 103 of the container 102.
  • Fig. 2 shows a network system 100 according to an embodiment of the present invention, which builds on the network system of Fig. 1.
  • the second segment metadata 106 of the at least one container 102 further includes a state 108 of the container 102.
  • a container state 108 is "regular” if the associated container 102 can currently be updated by an agent, and is "locked” if the container 102 cannot currently be updated by any agent other than the agent currently locking the container 102.
  • the second segment metadata 106 of one or more containers 102, or of each container 102 may preferably further include a current version number 109 of the second segment metadata 106.
  • the second segment metadata 106 of one or more containers 102, or of each container 102 may further include a current version number 110 of the first segment metadata 104 and/or a current version number 111 of the data segments 103 of the container 102.
  • In Fig. 2, the most preferred solution of the second segment metadata 106, including all version numbers 109, 110 and 111 and the container state 108, is illustrated.
  • the plurality of backup nodes 105 may further cooperate to atomically and persistently update the second segment metadata 106 of at least one container 102, wherein any subset of the container state, the version number 110 of the first segment metadata 104, the version number 111 of the data segments 103 and/or a plurality of liveness indicators 107 of the second segment metadata 106 of the container 102 is updated, if the version number 109 of the second segment metadata 106 has not changed.
  • FIG. 3 shows a network system 100 according to an embodiment of the present invention, which builds on the network system 100 of Fig. 1 or Fig. 2.
  • the network system 100 accordingly includes again the repository 101, which may be a file server as shown in Fig. 3, and a plurality of backup nodes 105.
  • Each backup node 105 may store a database including the second segment metadata 106 of data segments 103 in one or more containers 102 stored in the repository 101.
  • the network system 100 may also include one or more remote backup nodes (not shown in Fig. 3) for remote replication of the backup nodes 105.
  • Fig. 3 shows an example of a relatively small backup node deployment. Larger deployments are possible, however, and may scale linearly to more backup nodes, for example, up to 64 backup nodes 105.
  • Fig. 3 shows a network system 100 including a hypervisor network, a distributed database network, a file server network, and an admin network.
  • In particular, the plurality of backup nodes 105 are located between the distributed database network and the file server network.
  • In the plurality of backup nodes 105, at least the second segment metadata 106 is stored.
  • In addition, a plurality of deduplicated data blocks may be stored in the backup nodes 105.
  • the plurality of backup nodes 105 may also include a deduplication index for each data segment 103, which is referenced by a deduplicated data block in the node.
  • FIG. 3 further shows a hypervisor network including at least one hypervisor that is connected to the distributed database network.
  • FIG. 3 also shows an admin network connected to the file server network, wherein the admin network includes at least one admin node.
  • Fig. 4 shows the different logical components of a container 102 stored in the network system 100 according to Fig. 1, Fig. 2 or Fig. 3 in a physically distributed manner.
  • the container 102 logically includes segment data for each of a plurality of data segments 103.
  • Fig. 5 shows the different read/write characteristics of the different components of one specific segment data in the container 102.
  • the segment data includes logically the data segment 103 itself, the first metadata 104 and the second metadata 106.
  • the data segments are, for instance, stored in the form of compressed data.
  • the first segment metadata 104 may contain a hash or strong hash of the related data segment 103, an index in the container 102, an offset in the container 102, a size of the data segment 103 in an uncompressed state, and/or a size of the data segment 103 in the compressed state.
  • the second segment metadata 106 includes at least the liveness indicator 107, here exemplarily a reference count, which according to the invention is physically split off from the data segments 103 and the first segment metadata 104, respectively.
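  • As an illustration of this split, the per-segment records might be modelled as follows; the field names follow the listing above, while the concrete types and class names are assumptions of this sketch:

```python
from dataclasses import dataclass

@dataclass
class FirstSegmentMetadata:
    """Per-segment metadata kept in the repository: written once, read often."""
    strong_hash: str        # strong hash of the data segment
    index: int              # index of the segment within the container
    offset: int             # byte offset of the segment within the container
    uncompressed_size: int  # size of the segment in the uncompressed state
    compressed_size: int    # size of the segment in the compressed state

@dataclass
class SecondSegmentMetadata:
    """Per-segment metadata kept on the backup nodes: updated frequently."""
    reference_count: int    # liveness indicator: number of deduplicated blocks referring to the segment
```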
  • the individual elements of the segment data in the container 102 are used differently in the data deduplication scheme. That is, the liveness indicator 107 and optionally the second segment metadata 106 are updated frequently.
  • the first segment metadata 104 is written once, but is read rather often.
  • the data segment 103 is written once, and is rarely read or updated. Accordingly, in the network system 100 of the invention, the elements of the segment data are preferably stored differently and separated from one another.
  • FIG. 5 shows how a container 102, particularly the plurality of segment data in the container 102, is actually stored physically in the network system 100 according to an embodiment of the present invention.
  • a logical container 102 is stored by storing aggregated component parts of the segment data in different storage Tiers.
  • the data segments 103, since they are written once and read and updated only rarely, are preferably stored in a first storage of the repository 101, which is preferably a storage with high capacity, low cost, and optimized for sequential I/O.
  • the first storage is a SATA storage.
  • the first segment metadata 104, since it is written rarely but read often, is stored in a second storage of the repository 101, which is preferably a storage with low read latency, like an SSD or SAS.
  • Each of the plurality of backup nodes 105 provides a distributed transactional, replicated, always-consistent key-value store.
  • FIG. 6 shows a network system 100 according to an embodiment of the present invention, which builds on the network system 100 shown in Fig. 3.
  • Fig. 6 particularly shows data placement in the physical cluster. Different segment data components are stored in different tiers, namely where the storage characteristics suit the respective component best. In this respect, Fig. 6 shows the following:
  • the data segments 103 are stored in the file server 101, advantageously in the first storage (also referred to as the Tier 2 level), which is e.g. SATA.
  • the first segment metadata 104 is stored in the file server 101, advantageously in the second storage (also referred to as the Tier 1 level), which is e.g. SSD or SAS.
  • the second segment metadata 106 is stored in the backup nodes 105 (also referred to as the Tier 0 level), which at least replicate among one another the liveness indicator 107, here the reference counts of the data segments 103 in the container 102.
  • For Tier 0, a cached and replicated local storage is preferably selected.
  • A key-value store is created, where the key is the container ID, and the value may be made up of three items: the liveness indicators 107, e.g. reference counts, of the data segments 103 of the container 102;
  • a state ("regular", "write" or "defrag") of the container 102; and
  • version numbers, preferably a current version number 109 of the second segment metadata 106 of the container 102 and/or a current version number 110 of the first segment metadata 104 of the container 102 and/or a current version number 111 of the data segments 103 of the container 102.
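  • An illustrative Tier 0 record for a single container, following the three-item value just described, could look as follows; the dictionary layout and the example values are assumptions of this sketch:

```python
# Key: container ID. Value: liveness indicators, container state and version numbers.
tier0_store = {
    "container-0042": {
        "liveness": [3, 1, 7],   # reference counts 107, one per data segment of the container
        "state": "regular",      # container state 108: "regular", "write" or "defrag"
        "t0v": 12,               # version number 109 of the second segment metadata
        "t1v": 5,                # version number 110 of the first segment metadata
        "t2v": 5,                # version number 111 of the data segments
    }
}
```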
  • For Tier 1, a storage with a high read-operation rate and low read latency, written rarely and read often (SSD or fast SAS), is preferably selected. It contains a file for each container 102 that stores all the first segment metadata 104. For instance, the file name is "container-ID.<version number>".
  • For Tier 2, a storage that is written rarely and read rarely (network storage, possibly even archive-like) is preferably selected. It contains a file for each container 102 that preferably stores all the first segment metadata 104 (again, for recovery) and all the data segments 103.
  • the file name is "container-ID.<version number>".
  • the consistency of a logical container 102 is guaranteed across the Tiers by preferably using a lease-like mechanism and by giving a version number 109-111 to each component of the segment data in each Tier. These version numbers 109-111 are referred to as t0v, t1v and t2v.
  • the use cases presented below guarantee that no data is lost in the event of a failure of an agent performing an action. To this end, preferably no data is deleted until it can be guaranteed that it will not be needed anymore. Further, preferably a recovery path is guaranteed that restores consistency.
  • a container may be in "regular", "write" or "defragment" states, wherein "write" and "defragment" states may be equivalent. Any container with a state that is not "regular" and whose t0v has not advanced in <n+m> time units is a candidate for recovery, where <n> is the time in which "most" operations will complete and <m> is a safety margin. Operations that attempt to complete after recovery will be failed with no damage.
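  • The following sketch illustrates this candidacy test; tracking the time at which t0v last advanced, the helper name and the concrete value chosen for <n+m> are all assumptions of the sketch:

```python
import time

RECOVERY_TIMEOUT_SECONDS = 30.0  # assumed <n+m>: typical operation time <n> plus safety margin <m>

def is_recovery_candidate(record: dict, now: float) -> bool:
    """A container is a recovery candidate if its state is not "regular" and its
    t0v has not advanced within the last <n+m> time units."""
    stale_for = now - record["t0v_last_advanced"]
    return record["state"] != "regular" and stale_for > RECOVERY_TIMEOUT_SECONDS

# Example: a container left in "defrag" whose t0v last advanced two minutes ago.
abandoned = {"state": "defrag", "t0v": 13, "t0v_last_advanced": time.time() - 120.0}
print(is_recovery_candidate(abandoned, time.time()))  # True
```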
  • Tier 0 maintains the versions of the rest of the components. Version numbers may actually be file suffixes in lower Tiers.
  • all operations described below can be restarted. This particularly applies to the recovery use cases.
  • Tier 1 and Tier 2 could be merged into one Tier. Some operations may have to be performed or restarted at some time point. However, these are usually not time-critical, and may be delayed by adding a reminder to a list to be executed later. Below, the term "delay" is used to denote this. Multiple delays of an operation on the same container 102 usually indicate that the container 102 needs recovery.
  • This use case updates only liveness indicators 107, e.g. reference counts, in Tier 0.
  • the use case preferably includes the following steps:
  • the use case updates only liveness indicators 107, e.g. reference counts in Tier 0.
  • the use case occurs when a data block is deleted.
  • the liveness indicators 107 of the data segments 103 of the block must be decremented.
  • the use case includes the following steps:
  • the use case preferably includes the following steps:
  • the use case preferably includes the following steps:
  • If the Tier 0 state is not "regular", restart this operation after a delay. 3. If t0v has not changed, atomically change the state to "defrag" and the version to (t0v+1) (test-and-set). If this fails, restart this operation after a delay.
  • the use case preferably includes the following steps.
  • a "primary recovery" use case is described as an exemplary recovery process. It can be executed on any container 102 whose state is not "regular" and whose t0v has not advanced in <n+m> time units. This is an indication that the agent performing a modification died in mid-process.
  • “recovery” (this use case, not the previous one): does not write new data, simply rolls back the above two use cases.
  • Tier 1 and Tier 2 files either still exist or can be recovered by truncating them.
  • the use case includes preferably the following steps:
  • a "secondary recovery" use case is described as an exemplary recovery process. It can be executed on any container 102 whose state is not "regular" and whose t0v has not advanced in <n+m> time units. This is an indication that a modification was abandoned mid-process for any reason. Notably, t0v is always guaranteed to be larger than both t1v and t2v.
  • the use case preferably includes the following steps:
  • An update to a set of data items stored in a backup node and replicated between backup nodes is atomic if and only if the entire update in its entirety becomes visible to all backup nodes simultaneously.
  • any and all backup nodes retrieving the set of data items will receive a consistent set of the data items prior to the update (if queried before the point in time of the update) or the consistent updated set of data items (if queried after the point in time of the update).
  • Fig. 7 shows a method 700 for storing deduplicated data, i.e. for storing received data blocks as deduplicated data blocks.
  • the method 700 may be carried out by or in the network system 100 shown in the previous Figs.
  • the method 700 comprises a method step 701 of storing one or more containers 102 in a common network accessible repository 101, each container 102 including one or more data segments 103 and first segment metadata 104 for each data segment 103.
  • the method 700 also comprises another method step 702 of storing, in a plurality of backup nodes 105, second segment metadata 106 for each data segment 103 of at least one container 102 in the repository 101, the second segment metadata 106 including at least a liveness indicator 107 for each data segment 103 of the container 102.
  • the repository 101 mentioned in the method 700 may be the repository 101 shown in Fig. 1 or Fig. 2 or the file server 101 of the network system 100 shown in the Figs. 3 and 6.
  • the plurality of backup nodes 105 mentioned in the method 700 may be the backup nodes 105 of the network system 100 shown in the Figs. 1, 2, 3 and 6.
  • this invention provides a storage architecture for reliably storing deduplicated data.
  • the invention splits the storage of the deduplicated data between storage Tiers. Updates to the segment liveness indicators 107 are fast and transactional. Performance is improved, because storage characteristics match usage patterns. Reads of the first segment metadata 104 (e.g. the hashes of data segments 103) are fast. Costs are kept down by using cheaper storage for the large bulk of the data segments 103. Restores are fast, because the bulk of the data segments 103 is read sequentially from storage optimized for this. Storage use cases are tied across storage Tiers by version numbers 109-111 and leases that guarantee no data loss, readability of all data at all times, and restartable recovery.
  • this invention shows how to scale and distribute global deduplication linearly to an unlimited number of servers in a cluster, with fault tolerance, while still maintaining performance and reliability without increasing cost. This has been done without changing the server hardware specification from conventional generations that did not support distributed global deduplication.

Abstract

The present invention provides a network system (100) for storing deduplicated data. The network system (100) comprises a common network accessible repository (101), the repository (101) storing one or more containers (102). Each container (102) includes one or more data segments (103) and first segment metadata (104) for each data segment (103). The network system (100) also includes a plurality of backup nodes (105). A backup node (105) stores, for at least one container (102) in the repository (101), second segment metadata (106) for each data segment (103) of the container (102), the second segment metadata (106) including at least a liveness indicator (107) for each data segment (103) of the container (102).

Description

NETWORK SYSTEM AND METHOD FOR DEDUPLICATING DATA
TECHNICAL FIELD
The present invention relates to a network system and to a method for deduplicating data, in particular for storing deduplicated data. This includes, for example, storing received data blocks as deduplicated data blocks. Particularly, the present invention relates to the area of data deduplication in a distributed environment like a distributed cluster of servers (distributed data deduplication).
BACKGROUND
It has become common practice to process backups by removing data that has already been stored using a process known as "deduplication". Instead of storing duplicates, the process stores some form of references to where duplicate data is already stored. These references and other items stored "about" this data are commonly known as metadata.
The metadata for a stored data block is typically referred to as a deduplicated data block, and is a list of data segments of the stored data block. The data segments of the stored data block are sequences of consecutive bytes, and upon receiving a data block to be stored, the block is typically chunked into these data segments (segmentation). A typical data segment length varies from product to product, but may average about 4 kB. A block may contain thousands of such data segments.
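For illustration only, a minimal segmentation routine might look as follows; it uses fixed-size chunks and SHA-256 hashes purely as assumptions of the sketch, whereas real products typically use variable-size, content-defined chunking with the average segment length mentioned above.

```python
import hashlib

SEGMENT_SIZE = 4 * 1024  # roughly the average segment length mentioned above

def segment_block(block: bytes, segment_size: int = SEGMENT_SIZE):
    """Chunk a data block into consecutive data segments and hash each segment.

    The returned list of (strong hash, segment) pairs is the basis for the
    deduplicated data block, which stores only the ordered list of references.
    """
    segments = []
    for offset in range(0, len(block), segment_size):
        segment = block[offset:offset + segment_size]
        segments.append((hashlib.sha256(segment).hexdigest(), segment))
    return segments
```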
The data segments are stored in containers, and a container may contain thousands of data segments. Each data segment is stored in the container in association with segment metadata, and the totality of all segment metadata of the data segments in a container is referred to as container metadata. The segment metadata of a data segment may include storage details of the data segment in the container and a strong hash of the data segment.
Further, the segment metadata of a data segment may include details about a reference count indicating for the (unique) data segment in how many data blocks it was found. The initial reference count of any new data segment is usually 1, reflecting its first use. The reference count of an existing data segment is incremented when a new block refers to it. The reference count of a data segment is decremented when a block referring to it is deleted. When the reference count of a data segment reaches 0, the space can be reclaimed.
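The reference-count lifecycle described above can be sketched as follows; the class and method names are illustrative and not taken from any particular product.

```python
class SegmentStore:
    """Illustrative reference-count bookkeeping for unique data segments."""

    def __init__(self):
        self.segments = {}          # strong hash -> stored segment bytes
        self.reference_counts = {}  # strong hash -> number of blocks referring to the segment

    def add_reference(self, strong_hash: str, segment: bytes) -> None:
        if strong_hash in self.reference_counts:
            # Existing segment: a new block refers to it, so increment the count.
            self.reference_counts[strong_hash] += 1
        else:
            # New segment: store it with an initial reference count of 1.
            self.segments[strong_hash] = segment
            self.reference_counts[strong_hash] = 1

    def remove_reference(self, strong_hash: str) -> None:
        # A block referring to the segment was deleted, so decrement the count.
        self.reference_counts[strong_hash] -= 1
        if self.reference_counts[strong_hash] == 0:
            # No block refers to the segment any more: the space can be reclaimed.
            del self.reference_counts[strong_hash]
            del self.segments[strong_hash]
```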
A problem in a conventional data deduplication system is, however, that the segment metadata of a data segment including the reference count needs to be rewritten every time that the reference count is incremented or decremented. This leads to a large amount of Input / Output operations (I/Os), and thus impacts negatively on the overall performance of the conventional deduplication system.
In order to address this problem, it was proposed to avoid reference counts, which led to complex mark and sweep algorithms of live systems to reclaim unused data. However, this mark and sweep of live systems is very complex and error-prone, can leave dead data, and requires multiple passes, thus draining performance. Also generally, data deduplication is rather compute intensive, which leads to the desire to execute it in a distributed manner, for instance, on one or more clusters of servers. The metadata is in this case stored in a distributed database per cluster, which can be accessed from several or all servers of the cluster. The distributed database typically handles all the consistency and fault tolerance needed by the regular operation of a system for deduplicating data (e.g. temporary down node, replacing node, failed disk etc.).
However, a conventional data deduplication system is limited to an effective scalability of only 2 to 4 servers per cluster, i.e. it is limited in the number of servers that can work simultaneously on the same deduplication scope. Moreover, the conventional data deduplication system also has the problem that data cannot be restored if one node in a cluster is down. This is due to the fact that no replication of important data is performed, and accordingly the loss of even a single node may lead to complete data loss.
However, a simple replication of the distributed database to a remote database introduces latency into every database operation, requires communication links with remote sites, and requires storage space at the remote sites for the remote database. For a full backup, the database further needs to be replicated, and these replicates even have to be synchronized. This requires a lot of computational resources, and introduces further latency.
SUMMARY
In view of the above-mentioned problems and disadvantages, the invention aims to improve conventional data deduplication. The object of the invention is to provide a network system and a method for storing deduplicated data, in particular for storing deduplicated data in a distributed manner, wherein the network system is not limited in its scalability. That means that more than 4 servers, and particularly up to 64 servers, should at least be possible per cluster. All servers in a cluster should thereby be able to work simultaneously on the same deduplication scope, i.e. on the same stored deduplicated data. Each server should also perform as well as or even better than a server in the conventional distributed data deduplication system. In particular, the number of necessary I/Os should be reduced. Additionally, a replication of necessary information should be enabled, in order to avoid data loss in case of a node loss.
The object of the present invention is achieved by the solution provided in the enclosed independent claims. Advantageous implementations of the present invention are further defined in the dependent claims. In particular, the present invention proposes splitting off certain parts of the segment metadata including liveness indicators (e.g. reference counts) from the rest of the segment metadata of a data segment, and storing it in a different physical location. The invention thus shows how liveness indicators can be used reliably and efficiently even in a distributed environment. A first aspect of the invention provides a network system for storing deduplicated data, the network system comprising a common network accessible repository, the repository storing one or more containers, each container including one or more data segments and first segment metadata for each data segment, and a plurality of backup nodes, a backup node storing, for at least one container in the repository, second segment metadata for each data segment of the container, the second segment metadata including at least a liveness indicator for each data segment of the container. The network system of the first aspect improves the conventional systems by splitting off, for each data segment in the container, the second segment metadata including the liveness indicator, from the first segment metadata. Thus, the network system can avoid that the complete segment metadata of a data segment is rewritten every time that the corresponding liveness indicator is changed. Only the second segment metadata needs to be rewritten. This saves a significant amount of I/Os and increases the overall performance of the network system, particularly since the liveness indicator of a data segment is typically changed constantly during data deduplication.
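A minimal sketch of this split is given below; the container layout, field names and store structure are assumptions used only to show that changing a liveness indicator rewrites a small record on the backup nodes, while the first segment metadata and the data segments in the repository remain untouched.

```python
# Repository (common, network accessible): bulky, rarely changing parts of a container.
repository = {
    "container-0042": {
        "data_segments": [b"...compressed segment 0...", b"...compressed segment 1..."],
        "first_segment_metadata": [
            {"hash": "ab12...", "offset": 0, "compressed_size": 1021},
            {"hash": "cd34...", "offset": 1021, "compressed_size": 997},
        ],
    }
}

# Backup nodes: only the small, frequently changing second segment metadata,
# here reduced to the liveness indicators (reference counts) per segment.
backup_node_store = {"container-0042": {"liveness": [3, 1]}}

def increment_reference(container_id: str, segment_index: int) -> None:
    # Only the record held by the backup nodes is rewritten; nothing in the repository changes.
    backup_node_store[container_id]["liveness"][segment_index] += 1

increment_reference("container-0042", 1)
```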
The network system of the first aspect has no practical limit to its scalability, and at least up to 64 servers (e.g. agents) in a cluster may be configured to modify data segments and also modify the first and second segment metadata of the data segments in the container, respectively. This possibility to scale the hardware also provides scalability of processing power and of the total data managed. All servers in a cluster can work simultaneously on the same deduplicated data. Each server performs as well as or better than its non-distributed predecessor.
In an implementation form of the first aspect, the liveness indicator for a data segment is a reference count for the data segment, the reference count indicating a number of deduplicated data blocks referring to that data segment.
Accordingly, unlike in the conventional systems, the network system of the first aspect enables a reliable use of reference counts in a distributed environment. By separating at least the reference counts from the first segment metadata (which e.g. includes hashes of the corresponding data segments), a rewriting of the first segment metadata every time that the reference count needs to be incremented or decremented is not necessary. Therefore, the performance of the network system is significantly increased. In a further implementation form of the first aspect, the plurality of backup nodes are configured to replicate among each other the second segment metadata of the at least one container, in particular the liveness indicator for each data segment of the container.
Unlike in the conventional systems, the network system of the first aspect thus enables a replication of necessary information, in order to ensure that complete data loss is avoided in case that a node goes down. In particular, the network system is resilient to up to k node failures, when there are at least 2k+1 nodes in a cluster. In a further implementation form of the first aspect, the second segment metadata of the at least one container further includes a state of the container, and the container state is "regular" if the container can currently be updated by any agent and the state is "locked" if the container cannot currently be updated by any agent other than the agent currently locking the container.
By means of these container states, only one agent performing a true updating process (i.e. an updater), like appending data segments to a container or defragmenting a container, is enabled at the same time. The update lock provided by the "locked" state prevents simultaneous updates. However, the most common and quick operations, like incrementing and decrementing reference counts, may not use this lock, but may rather use a test-and-set primitive. Further, a container can preferably still be read consistently, even if in the "locked" state. This provides the ability to read logical entities at any time, even if an updater has died in mid-operation and left a container "locked".
In a further implementation form of the first aspect, the second segment metadata of each container further includes a current version number of the second segment metadata of the container.
In a further implementation form of the first aspect, the second segment metadata of each container further includes a current version number of the first segment metadata of the container and a current version number of the data segments of the container. All version numbers may respectively be file suffixes. By providing these version numbers to the different components of segment data (data segments and first and second segment metadata), which are stored in the different physical storage locations, the consistency of a logical container is guaranteed across the network system. To this end, a lease-like mechanism may be used. In a further implementation form of the first aspect, the plurality of backup nodes are configured to cooperate to atomically and persistently update the second segment metadata of at least one container, wherein any subset of the container state, the version number of the first segment metadata, the version number of the data segments and/or a plurality of liveness indicators of the second segment metadata of the container is updated, if the version number of the second segment metadata has not changed. Here, an update to any set of data items stored in a backup node and replicated between backup nodes is called "atomic", if and only if the entire update in its entirety becomes visible to all backup nodes simultaneously. In other words, any and all backup nodes retrieving the set of data items will receive a consistent set of the data items prior to the update (if queried before the point in time of the update) or the consistent updated set of data items (if queried after the point in time of the update).
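A conditional update of this kind can be sketched as a compare-and-set on the version number of the second segment metadata; the in-process store below merely stands in for the replicated key-value store of the backup nodes, and all names are assumptions of the sketch.

```python
import copy
import threading

class ReplicatedSecondMetadataStore:
    """Stand-in for the replicated, always-consistent store of the backup nodes."""

    def __init__(self, records: dict):
        self._lock = threading.Lock()
        self._records = copy.deepcopy(records)  # container id -> second segment metadata

    def get(self, container_id: str) -> dict:
        with self._lock:
            return copy.deepcopy(self._records[container_id])

    def compare_and_set(self, container_id: str, expected_t0v: int, new_record: dict) -> bool:
        """Apply the update only if the version of the second segment metadata is unchanged."""
        with self._lock:
            if self._records[container_id]["t0v"] != expected_t0v:
                return False  # another agent updated the record in the meantime
            self._records[container_id] = copy.deepcopy(new_record)
            return True

def add_references(store: ReplicatedSecondMetadataStore, container_id: str, segment_indices) -> bool:
    """Increment a set of liveness indicators atomically with respect to t0v."""
    record = store.get(container_id)
    expected_t0v = record["t0v"]
    for index in segment_indices:
        record["liveness"][index] += 1
    record["t0v"] = expected_t0v + 1
    return store.compare_and_set(container_id, expected_t0v, record)

store = ReplicatedSecondMetadataStore(
    {"container-0042": {"liveness": [3, 1, 7], "state": "regular", "t0v": 12, "t1v": 5, "t2v": 5}}
)
print(add_references(store, "container-0042", [0, 2]))  # True on the first attempt
```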
In a further implementation form of the first aspect, the network system further comprises at least one agent configured to modify the data segments, the first segment metadata and/or the second segment metadata of a container, wherein the at least one agent is configured to recover the "regular" state of a container, which is in a "locked" state and for which the version number of the second segment metadata did not change for a time period larger than a determined threshold value, based on the current version numbers of the first segment metadata and the data segments, respectively.
Since the "regular" state can be recovered, the network system can perform the data deduplication more efficiently.
In a further implementation form of the first aspect, the at least one agent is configured to modify any of the version numbers of a container after modifying the second segment metadata, the first segment metadata and/or the data segments of the container, respectively. By keeping the version numbers up to date, the consistency of a logical container across the different storage locations of its segment data across the network system is guaranteed.
In a further implementation form of the first aspect, any operation on the data segments and the first and second segment metadata is limited such that the "regular" state of the container can be recovered after a crash at any point in the operation.
Accordingly, data and storage space loss can be effectively avoided.
In a further implementation form of the first aspect, the at least one agent is configured to, for recovering the "regular" state of a container, retrieve, from the second segment metadata of the container, the current version numbers of the first segment metadata and the data segments and a number of the data segments of the container, atomically and persistently increment the version number of the second segment metadata, if it has not changed since the retrieving, read the first segment metadata with the current version number of the first segment metadata as retrieved with the second segment metadata from the repository, determine, from the read first segment metadata, a size that the data segments of the container with the current version number of the data segments of the container should have, truncate the size of the data segments of the container according to the determined size, if the size of the data segments is larger than the determined size, and/or truncate the first segment metadata of the container according to the retrieved number of data segments, and atomically and persistently reset the state of the container to "regular" and increment the version number of the second metadata, if it has not changed in the meantime.
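The recovery flow just described might be sketched as follows. The file layout (Tier 1 and Tier 2 files named "container ID.version" with a JSON encoding of the first segment metadata), the "n_segments" field, the helper names and the reuse of the compare-and-set store sketched above are all assumptions of this example, not the claimed procedure itself.

```python
import json
import os

def read_first_segment_metadata(path: str):
    # Assumed encoding: a JSON list of per-segment records with "offset" and "compressed_size".
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def write_first_segment_metadata(path: str, entries) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(entries, f)

def recover_regular_state(store, container_id: str, tier1_dir: str, tier2_dir: str) -> bool:
    # 1. Retrieve the current version numbers and the number of segments from the second segment metadata.
    record = store.get(container_id)
    t0v, t1v, t2v = record["t0v"], record["t1v"], record["t2v"]
    number_of_segments = record["n_segments"]

    # 2. Atomically and persistently increment t0v, if it has not changed since the retrieval.
    record["t0v"] = t0v + 1
    if not store.compare_and_set(container_id, t0v, record):
        return False  # another agent intervened; retry later

    # 3. Read the first segment metadata file with version t1v and keep only the recorded segments.
    tier1_path = os.path.join(tier1_dir, f"{container_id}.{t1v}")
    first_metadata = read_first_segment_metadata(tier1_path)[:number_of_segments]

    # 4. Determine the size the data segment file with version t2v should have and truncate any excess.
    last = first_metadata[-1]
    expected_size = last["offset"] + last["compressed_size"]
    tier2_path = os.path.join(tier2_dir, f"{container_id}.{t2v}")
    if os.path.getsize(tier2_path) > expected_size:
        with open(tier2_path, "r+b") as f:
            f.truncate(expected_size)

    # 5. Truncate the first segment metadata to the retrieved number of segments.
    write_first_segment_metadata(tier1_path, first_metadata)

    # 6. Atomically reset the state to "regular" and increment t0v again, if it has not changed meanwhile.
    record = store.get(container_id)
    if record["t0v"] != t0v + 1:
        return False
    record["state"] = "regular"
    record["t0v"] = t0v + 2
    return store.compare_and_set(container_id, t0v + 1, record)
```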
According to this implementation form, the at least one agent is configured to perform a recovery process. Notably, this recovery process is a preferred example. The at least one agent may, however, be configured to perform any other recovery process leading to the container recovery.
In the above-described way, the "regular" state of the container can be effectively recovered in case an agent left it in a non-regular state after a failure. After the recovery, all three parts of the container are coherent without any loss of data or storage space. In a further implementation form of the first aspect, data can be restored from a container that is in the "regular" state or in the "locked" state at any time.
This enables restore operations, even if a true updater has left the container in the "locked" state for any reason.
In a further implementation form of the first aspect, the repository stores the data segments of each container in a first storage and stores the first segment metadata of each container in a second storage, which can be different from the first storage.
The first storage and the second storage may be physically separate devices. These can be the same (e.g. each storage may be a SSD), but are preferably different (in terms of e.g. read latency, capacity or cost). Using different (types of) storages enables selecting them according to the type of the segment data (data segments, first segment metadata, or second segment metadata) that is stored therein, which can improve the performance and reliability of the network system.
In a further implementation form of the first aspect, the first storage further stores, for each container, also the first segment metadata of the container. Storing the first segment metadata again, improves the network system in case a recovery is needed.
In a further implementation form of the first aspect, the read latency of the second storage is not higher than the read latency of the first storage, and/or preferably the second storage is a solid state drive, SSD, or is a serial attached SCSI, SAS, storage and/or the first storage is a serial advanced technology attachment, SATA storage.
Accordingly, the data segments that are written once and are updated and read rarely are stored in a storage with high capacity and low cost. The first segment metadata that is written rarely, but read often, is stored in a storage with low read latency. Thereby, the network system performs more efficiently and costs can be saved.

A second aspect of the invention provides a method for storing deduplicated data, the method comprising storing one or more containers in a common network accessible repository, each container including one or more data segments and first segment metadata for each data segment, and storing, in a plurality of backup nodes, second segment metadata for each data segment of at least one container in the repository, the second segment metadata including at least a liveness indicator for each data segment of the container.
In an implementation form of the second aspect, the liveness indicator for a data segment is a reference count for the data segment, the reference count indicating a number of deduplicated data blocks referring to that data segment.

In a further implementation form of the second aspect, the plurality of backup nodes replicate among each other the second segment metadata of the at least one container, in particular the liveness indicator for each data segment of the container.
In a further implementation form of the second aspect, the second segment metadata of the at least one container further includes a state of the container, and the container state is "regular" if the container can currently be updated by any agent and the state is "locked" if the container cannot currently be updated by any agent other than the agent currently locking the container.
In a further implementation form of the second aspect, the second segment metadata of each container further includes a current version number of the second segment metadata of the container.
In a further implementation form of the second aspect, the second segment metadata of each container further includes a current version number of the first segment metadata of the container and a current version number of the data segments of the container.

In a further implementation form of the second aspect, the plurality of backup nodes cooperate to atomically and persistently update the second segment metadata of at least one container, wherein any subset of the container state, the version number of the first segment metadata, the version number of the data segments and/or a plurality of liveness indicators of the second segment metadata of the container is updated, if the version number of the second segment metadata has not changed.
In a further implementation form of the second aspect, at least one agent modifies the data segments, the first segment metadata and/or the second segment metadata of a container, wherein the at least one agent is configured to recover the "regular" state of a container, which is in a "locked" state and for which the version number of the second segment metadata did not change for a time period larger than a determined threshold value, based on the current version numbers of the first segment metadata and the data segments, respectively.
In a further implementation form of the second aspect, the at least one agent modifies any of the version numbers of a container after modifying the second segment metadata, the first segment metadata and/or the data segments of the container, respectively.
In a further implementation form of the second aspect, any operation on the data segments and the first and second segment metadata is limited such that the "regular" state of the container can be recovered after a crash at any point in the operation.
In a further implementation form of the second aspect, the at least one agent retrieves, for recovering the "regular" state of a container, from the second segment metadata of the container, the current version numbers of the first segment metadata and the data segments and a number of the data segments of the container, atomically and persistently increments the version number of the second segment metadata, if it has not changed since the retrieving, reads the first segment metadata with the current version number of the first segment metadata as retrieved with the second segment metadata from the repository, determines, from the read first segment metadata, a size that the data segments of the container with the current version number of the data segments of the container should have, truncates the size of the data segments of the container according to the determined size, if the size of the data segments is larger than the determined size, and/or truncates the first segment metadata of the container according to the retrieved number of data segments, and atomically and persistently resets the state of the container to "regular" and increments the version number of the second segment metadata, if it has not changed in the meantime.
In a further implementation form of the second aspect, data can be restored from a container that is in the "regular" state or in the "locked" state at any time.
In a further implementation form of the second aspect, the repository stores the data segments of each container in a first storage and stores the first segment metadata of each container in a second storage, which can be different from the first storage.
In a further implementation form of the second aspect, the first storage further stores, for each container, also the first segment metadata of the container.
In a further implementation form of the second aspect, the read latency of the second storage is not higher than the read latency of the first storage, and/or preferably the second storage is a solid state drive, SSD, or is a serial attached SCSI, SAS, storage and/or the first storage is a serial advanced technology attachment, SATA, storage.

A third aspect of the invention provides a computer program product comprising a program code for controlling a network system according to the first aspect or one of its implementation forms, or for performing, when running on a computer, the method according to the second aspect or any of its implementation forms.
The computer program product of the third aspect achieves all benefits and effects of the network system of the first aspect and the method of the second aspect, respectively.

It has to be noted that all devices, elements, units and means described in the present application could be implemented in software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application, as well as the functionalities described to be performed by the various entities, are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear to a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof.
BRIEF DESCRIPTION OF DRAWINGS
The above-described aspects and implementation forms of the present invention will be explained in the following description of specific embodiments in relation to the enclosed drawings, in which
FIG. 1 shows a network system according to an embodiment of the present invention.
FIG. 2 shows a network system according to an embodiment of the present invention.
FIG. 3 shows a network system according to an embodiment of the present invention.

FIG. 4 shows a container logically (b) and shows its contents for one data segment (a).
FIG. 5 shows a container as physically stored in the network system according to an embodiment of the present invention.
FIG. 6 shows a network system according to an embodiment of the present invention.
FIG. 7 shows a method according to an embodiment of the present invention.

DETAILED DESCRIPTION OF EMBODIMENTS
Fig. 1 shows a network system 100 according to an embodiment of the present invention. The network system 100 is configured to store deduplicated data, i.e. to store, for instance, received data blocks as deduplicated data blocks. The network system 100 comprises a common network accessible repository 101 and a plurality of backup nodes 105.
The repository 101 may be a standard repository storage of a network system. However, the repository 101 may include at least a first storage and a second storage different from the first storage. That is, the storages differ, for example, in type, storage capacity, read latency, and/or read speed. Preferably, the read latency of the second storage is not higher than the read latency of the first storage. Preferably, the second storage is a solid state drive (SSD) or is a serial attached SCSI (SAS) storage. Preferably, the first storage is a serial advanced technology attachment, SATA storage.
The repository 101 stores one or more containers 102, each container 102 including one or more data segments 103 and first segment metadata 104 for each data segment 103. Preferably, the repository 101 stores the data segments 103 of a container 102 in the above-mentioned first storage, and stores the first segment metadata 104 of a container 102 in the above-mentioned second storage.
The plurality of backup nodes 105 may each maintain a database, in order to store information. In particular, a backup node 105 stores, for at least one container 102 in the repository 101, second segment metadata 106 for each data segment 103 of the container 102. The second segment metadata 106 includes at least a liveness indicator 107 for the data segment 103. Preferably, the liveness indicator 107 for a data segment 103 is a reference count for the data segment 103, wherein the reference count indicates a number of deduplicated data blocks referring to that data segment 103. The plurality of backup nodes 105 preferably replicate among each other at least the second segment metadata 106, in particular at least the liveness indicator 107 for each data segment 103 of the container 102.
Fig. 2 shows a network system 100 according to an embodiment of the present invention, which builds on the network system of Fig. 1. Preferably, as shown in Fig. 2, the second segment metadata 106 of the at least one container 102 further includes a state 108 of the container 102. A container state 108 is "regular" if the associated container 102 can currently be updated by any agent, and is "locked" if the container 102 cannot currently be updated by any agent other than the agent currently locking the container 102. The second segment metadata 106 of one or more containers 102, or of each container 102, may preferably further include a current version number 109 of the second segment metadata 106. Additionally, even more preferably, the second segment metadata 106 of one or more containers 102, or of each container 102, may further include a current version number 110 of the first segment metadata 104 and/or a current version number 111 of the data segments 103 of the container 102. Fig. 2 illustrates the most preferred solution, in which the second segment metadata 106 includes all version numbers 109, 110 and 111 as well as the container state 108.
The plurality of backup nodes 105 may further cooperate to atomically and persistently update the second segment metadata 106 of at least one container 102, wherein any subset of the container state 108, the version number 110 of the first segment metadata 104, the version number 111 of the data segments 103 and/or a plurality of liveness indicators 107 of the second segment metadata 106 of the container 102 is updated, if the version number 109 of the second segment metadata 106 has not changed.
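For illustration only, the following minimal sketch (in Python) models such a conditional, test-and-set-style update of the second segment metadata 106 keyed by the container ID. The names Tier0Entry, Tier0Store, put_if_version and refcounts are assumptions introduced for this example and are not taken from the description; a real deployment would use a replicated, transactional key-value store instead of an in-memory dictionary.

import threading
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Tier0Entry:                 # second segment metadata of one container
    state: str                    # "regular", or a locking state such as "write"/"defrag"
    t0v: int                      # version number of the second segment metadata
    t1v: int                      # version number of the first segment metadata
    t2v: int                      # version number of the data segments
    refcounts: List[int]          # liveness indicator (reference count) per data segment

class Tier0Store:
    """Stand-in for the replicated, always-consistent key-value store of the backup nodes."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tier0Entry] = {}
        self._lock = threading.Lock()      # models the atomicity guarantee of the store

    def get(self, container_id: str) -> Tier0Entry:
        with self._lock:
            e = self._entries[container_id]
            return Tier0Entry(e.state, e.t0v, e.t1v, e.t2v, list(e.refcounts))

    def put_if_version(self, container_id: str, expected_t0v: int,
                       new_entry: Tier0Entry) -> bool:
        """Install new_entry only if the stored version number has not changed (test-and-set)."""
        with self._lock:
            current = self._entries.get(container_id)
            if current is not None and current.t0v != expected_t0v:
                return False
            self._entries[container_id] = new_entry
            return True

An agent that wants to update liveness indicators would read the entry, modify an in-memory copy, and call put_if_version with the version number it originally read; a False result means another agent updated the entry in the meantime and the operation has to be retried.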
FIG. 3 shows a network system 100 according to an embodiment of the present invention, which builds on the network system 100 of Fig. 1 or Fig. 2. The network system 100 accordingly includes again the repository 101, which may be a file server as shown in Fig. 3, and a plurality of backup nodes 105. Each backup node 105 may store a database including the second segment metadata 106 of data segments 103 in one or more containers 102 stored in the repository 101. In order to provide additional resilience, the network system 100 may also include one or more remote backup nodes (not shown in Fig. 3) for remote replication of the backup nodes 105. Fig. 3 shows an example of a relatively small backup node deployment. Larger deployments are possible, however, and may scale linearly to more backup nodes, for example up to 64 backup nodes 105.
Fig. 3 shows a network system 100 including a hypervisor network, a distributed database network, a file server network, and an admin network. In particular, the plurality of backup nodes 105 are located between the distributed database network and the file server network. The plurality of backup nodes 105 store at least the second segment metadata 106, but may also store a plurality of deduplicated data blocks. Additionally, the plurality of backup nodes 105 may also include a deduplication index for each data segment 103 which is referenced by a deduplicated data block in the node.
FIG. 3 further shows a hypervisor network including at least one hypervisor that is connected to the distributed database network. FIG. 3 also shows an admin network connected to the file server network, wherein the admin network includes at least one admin node.
Fig. 4 shows the different logical components of a container 102 stored in the network system 100 according to Fig. 1, Fig. 2 or Fig. 3 in a physically distributed manner. The container 102 logically includes segment data for each of a plurality of data segments 103. In particular, Fig. 4 (a) shows the different read/write characteristics of the different components of one specific segment data in the container 102. The segment data logically includes the data segment 103 itself, the first segment metadata 104 and the second segment metadata 106. The data segments are, for instance, stored in the form of compressed data. The first segment metadata 104 may contain a hash or strong hash of the related data segment 103, an index in the container 102, an offset in the container 102, a size of the data segment 103 in the uncompressed state, and/or a size of the data segment 103 in the compressed state. The second segment metadata 106 includes at least the liveness indicator 107, here exemplarily a reference count, which is according to the invention physically split off from the data segments 103 and the first segment metadata 104, respectively. The individual elements of the segment data in the container 102 are used differently in the data deduplication scheme. That is, the liveness indicator 107 and, optionally, the rest of the second segment metadata 106 are updated frequently. The first segment metadata 104 is written once, but is read rather often. The data segment 103 is written once, and is rarely read or updated. Accordingly, in the network system 100 of the invention, the elements of the segment data are preferably stored differently and separately from one another.
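Purely as an illustration of this split, the per-segment records could be modelled as follows (Python; all field names are assumptions made for the example and are not part of the description):

from dataclasses import dataclass

@dataclass(frozen=True)
class FirstSegmentMetadata:       # written once, read often (stored in the repository)
    strong_hash: bytes            # hash or strong hash of the related data segment
    index: int                    # index of the segment in the container
    offset: int                   # offset of the segment in the container
    uncompressed_size: int        # size of the data segment in the uncompressed state
    compressed_size: int          # size of the data segment in the compressed state

@dataclass
class SecondSegmentMetadata:      # updated frequently (stored in the backup nodes)
    reference_count: int          # liveness indicator: number of deduplicated blocks referring to the segment

@dataclass(frozen=True)
class DataSegment:                # written once, rarely read or updated (stored in the repository)
    compressed_payload: bytes

The point of the split is that only the second segment metadata is hot, so only it needs to live in the fast, replicated store of the backup nodes 105.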
Fig. 5 shows how a container 102, particularly the plurality of segment data in the container 102, is actually stored physically in the network system 100 according to an embodiment of the present invention. A logical container 102 is stored by storing aggregated component parts of the segment data in different storage Tiers. Here, the data segments 103, since they are written, read and updated only rarely, are preferably stored in a first storage of the repository 101, which is preferably a storage with high capacity and low cost that is optimized for sequential I/O. For instance, the first storage is a SATA storage. The first segment metadata 104, since it is written rarely but read often, is stored in a second storage of the repository 101, which is preferably a storage with low read latency, like an SSD or SAS storage. The second segment metadata 106, here including the reference counts 107, since it is updated frequently, is stored in the plurality of backup nodes 105, i.e. it is not stored in the repository 101 together with the other components of the logical container 102. Each of the plurality of backup nodes 105 provides a distributed, transactional, replicated, always-consistent key-value store.

FIG. 6 shows a network system 100 according to an embodiment of the present invention, which builds on the network system 100 shown in Fig. 3. Fig. 6 particularly shows the data placement in the physical cluster. Different segment data components are stored in different Tiers, namely where the storage characteristics suit the respective component best. In this respect, Fig. 6 clearly shows the separation in storage location of the data segments 103, the first segment metadata 104, and the second segment metadata 106 including the liveness indicator 107, respectively. The data segments 103 are stored in the file server 101, advantageously in the first storage (also referred to as the Tier 2 level), which is e.g. SATA. The first segment metadata 104 is stored in the file server 101, advantageously in the second storage (also referred to as the Tier 1 level), which is e.g. SSD or SAS. The second segment metadata 106 is stored in the backup nodes 105 (also referred to as the Tier 0 level), which at least replicate among one another the liveness indicators 107, here the reference counts of the data segments 103 in the container 102. Due to this replication, the network system 100 is resilient and can tolerate up to k failures of nodes 105 when there are at least 2k+1 nodes 105 in a cluster. This failure tolerance can be tuned: if the user chooses to tolerate <k> failures (for k <= (n-1)/2), then the second segment metadata 106 in the key-value store of the backup nodes 105 must be replicated <2k+1> times to guarantee a quorum. This uses more local storage space in the nodes 105 and incurs a performance penalty, but it guarantees the "always-consistent" property of the key-value store mentioned above. Together with the "transactional" property of the key-value store, the consistency of at least the liveness indicators 107 of the data segments 103 is guaranteed.
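The replication arithmetic mentioned above can be illustrated with a small helper (Python; a sketch only, the function name is an assumption):

def replication_factor(k_failures_to_tolerate: int, n_nodes: int) -> int:
    """Replicas of the second segment metadata needed to tolerate k node failures with a quorum."""
    if k_failures_to_tolerate > (n_nodes - 1) // 2:
        raise ValueError("cluster too small: tolerating k failures needs at least 2k+1 nodes")
    return 2 * k_failures_to_tolerate + 1

# Example: a 7-node cluster can tolerate up to 3 failures with replication factor 7,
# and a 3-node cluster can tolerate 1 failure with replication factor 3.
assert replication_factor(3, 7) == 7
assert replication_factor(1, 3) == 3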
Accordingly, for the Tier 0 level a cached and replicated local storage (fastest) is preferably selected. A key-value store is created, where the key is the container ID, and the value may be made up of three items: the liveness indicators 107, e.g. reference counts, of the data segments 103 of the container 102; a state ("regular", "write" or "defrag") of the container 102; and version numbers, preferably a current version number 109 of the second segment metadata 106 of the container 102 and/or a current version number 110 of the first segment metadata 104 of the container 102 and/or a current version number 111 of the data segments 103 of the container 102.
For the Tier 1 level, a storage with high read throughput and low read latency, which is written rarely but read often (SSD or fast SAS), is preferably selected. It contains a file for each container 102 that stores all the first segment metadata 104. For instance, the file name is "container-ID.<version number>".
For the Tier 2 level, a storage that is written rarely and read rarely (network storage, possibly even archive-like) is preferably selected. It contains a file for each container 102 that preferably stores all the first segment metadata 104 (again, for recovery) and all the data segments 103. For instance, the file name is "container-ID.<version number>".

The consistency of a logical container 102 is guaranteed across the Tiers preferably by using a lease-like mechanism and by giving a version number 109-111 to each component of the segment data in each Tier. These version numbers 109-111 are referred to as t0v, t1v and t2v. The use cases presented below guarantee that no data is lost in the event of a failure of an agent performing an action. To this end, preferably no data is deleted until it can be guaranteed that it will not be needed anymore. Further, a recovery path is preferably guaranteed that restores consistency.
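As a small illustration of how the version numbers tie the per-Tier files of one logical container together, the file names could be derived as follows (Python; the helper names and the example container ID are assumptions):

def tier1_file_name(container_id: str, t1v: int) -> str:
    return f"{container_id}.{t1v}"      # Tier 1: first segment metadata only

def tier2_file_name(container_id: str, t2v: int) -> str:
    return f"{container_id}.{t2v}"      # Tier 2: first segment metadata and data segments

# Example: if the Tier 0 entry records t1v=4 and t2v=2, readers open these files:
print(tier1_file_name("container-0815", 4))   # container-0815.4
print(tier2_file_name("container-0815", 2))   # container-0815.2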
A container may be in the "regular", "write" or "defragment" state, wherein the "write" and "defragment" states may be equivalent. Any container whose state is not "regular" and whose t0v has not advanced in <n+m> time units is a candidate for recovery, where <n> is the time in which "most" operations will complete and <m> is a safety margin. Operations that attempt to complete after recovery will be failed with no damage.
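This candidate test can be sketched as follows (Python; the timestamp bookkeeping for the last change of t0v is an assumption made for the example):

import time
from typing import Optional

def is_recovery_candidate(state: str, last_t0v_change: float,
                          n_seconds: float, m_seconds: float,
                          now: Optional[float] = None) -> bool:
    """True if the container state is not "regular" and t0v has not advanced in n+m time units."""
    now = time.time() if now is None else now
    return state != "regular" and (now - last_t0v_change) > (n_seconds + m_seconds)

# Example: a container stuck in "write" whose t0v last advanced 120 s ago is a candidate
# for recovery when n = 60 s and m = 30 s.
assert is_recovery_candidate("write", last_t0v_change=time.time() - 120,
                             n_seconds=60, m_seconds=30)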
The entry in Tier 0 maintains the versions of the rest of the components. Version numbers may actually be file suffixes in lower Tiers. In general, all operations described below can be restarted. This particularly applies to the recovery use cases. Tier 1 and Tier 2 could be merged into one Tier. Some operations may have to be performed or restarted at some time point. However, these are usually not time-critical, and may be delayed by adding a reminder to a list to be executed later. Below, the term "delay" is used to denote this. Multiple delays of an operation on the same container 102 usually indicate that the container 102 needs recovery.
In the following, some specific use cases are described. Notably, in the normal flow of deduplication, both the "deduplicate" use case and the "write-new-data" use case are used. The described use cases are examples from a set of use cases which may be used in the deduplication process and which allow recovery from every step of the process that fails.
Now the "deduplicate" use case is described. This use case updates only liveness indicators 107, e.g. reference counts, in Tier 0. The use case preferably includes the following steps:
1. Retrieve the Tier 0 entry including the second segment metadata 106 for the container 102 from the key-value store.
2. If the container state is not "regular", a different set of containers 102 is chosen for deduplication.
3. Update the liveness indicators 107 in the in-memory copy of the entry.
4. If t0v has not changed, atomically update the liveness indicators 107 and the version to (t0v+1) (test-and-set) in the key-value store.
5. If this fails, start again from #1. If t1v has changed in the meantime, it is necessary to retrieve the new first segment metadata 104 in the container 102.
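A minimal sketch of this flow, building on the Tier0Store/Tier0Entry model introduced earlier (all names are assumptions made for illustration), could look as follows; segment_hits maps a segment index to the number of additional references gained by deduplication:

from typing import Dict

def deduplicate(store, container_id: str, segment_hits: Dict[int, int],
                max_retries: int = 3) -> bool:
    for _ in range(max_retries):
        entry = store.get(container_id)                    # step 1: read Tier 0 entry
        if entry.state != "regular":                       # step 2: caller picks other containers
            return False
        new_counts = list(entry.refcounts)
        for index, hits in segment_hits.items():           # step 3: update in-memory copy
            new_counts[index] += hits
        updated = type(entry)(state="regular", t0v=entry.t0v + 1,
                              t1v=entry.t1v, t2v=entry.t2v, refcounts=new_counts)
        if store.put_if_version(container_id, entry.t0v, updated):   # step 4: test-and-set
            return True
        # step 5: t0v changed underneath us, so retry from step 1
    return False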
Now the "delete" use case is described. The use case updates only liveness indicators 107, e.g. reference counts in Tier 0. The use case occurs when a data block is deleted. The liveness indicators 107 of the data segments 103 of the block must be decremented. The use case includes the following steps:
1. Retrieve the Tier 0 entry for the container 102 from the key-value store.
2. If the container state is not "regular", restart this operation after a delay.
3. Update the liveness indicators 107 in the in-memory copy of the entry.

4. If t0v has not changed, atomically update the liveness indicators 107 and the version to (t0v+1) (test-and-set) in the key-value store. If this fails, restart this operation after a delay.
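Under the same modelling assumptions as the "deduplicate" sketch (a Tier0Store-like object and hypothetical names), the delete flow differs only in decrementing the counts and in retrying after a delay:

from typing import Iterable, List, Tuple

def delete_block(store, container_id: str, segment_indices: Iterable[int],
                 delay_queue: List[Tuple]) -> None:
    indices = list(segment_indices)
    entry = store.get(container_id)                        # step 1: read Tier 0 entry
    if entry.state != "regular":                           # step 2: retry later
        delay_queue.append(("delete", container_id, indices))
        return
    new_counts = list(entry.refcounts)
    for index in indices:                                  # step 3: decrement liveness indicators
        new_counts[index] = max(0, new_counts[index] - 1)
    updated = type(entry)(state="regular", t0v=entry.t0v + 1,
                          t1v=entry.t1v, t2v=entry.t2v, refcounts=new_counts)
    if not store.put_if_version(container_id, entry.t0v, updated):   # step 4: test-and-set
        delay_queue.append(("delete", container_id, indices))        # lost the race, delay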
Now the "write-new-data" (append) use case is described. The use case preferably includes the following steps:
1. Choose a container 102 as the append target.
2. Retrieve the Tier 0 entry for the container 102 from the key-value store. If the Tier 0 state is not "regular", restart from #1, choosing a different container 102.
3. If tOv has not changed, atomically change the state to "write" and version to (tOv+1) (test-and-set). If this fails, restart from #1, chose a different container 102.
4. Append the new first segment metadata 104 and data segments 103 to the Tier 2 file "container-ID.<t2v>".
5. Write a new Tier 1 file with name "container-ID. <tlv+l>" containing the old contents and new first segment metadata 104 appended.
6. Build a new Tier 0 entry with the second segment metadata 106 including liveness indicators 107 extended to include counts for the appended segments, state "regular", and versions (t0v+2), (t1v+1), (t2v). Atomically store this entry over the Tier 0 entry if the version is still (t0v+1).
7. If this update fails, restart from #1, choosing a different container 102.
8. Remove the "container-ID. <tlv>" file.
Next the "defrag" use case is described. The use case preferably includes the following steps:
1. Retrieve the Tier 0 entry for the container 102 from the key-value store.
2. If the Tier 0 state is not "regular", restart this operation after a delay. 3. If tOv has not changed, atomically change the state to "defrag" and version to (tOv+1) (test- and- set). If this fails, restart this operation after a delay.
4. Write a new Tier 2 file with the name "container-ID.<t2v+1>" containing only the segment data (first segment metadata 104 and data segments 103) that have positive liveness indicators 107.
5. Write a new Tier 1 file with name "container- ID. <tlv+l>" containing only the first segments metadata 104 that have positive liveness indicators 107.
6. Build a new Tier 0 entry with only the non-zero liveness indicators 107, state "regular", and versions (t0v+2), (t1v+1), (t2v+1). Atomically store this entry over the Tier 0 entry if the version is still (t0v+1).
7. If this update fails, restart this operation after a delay.
8. Remove the "container-ID. <tlv>" and "container- ID. <t2v>" files. They are no longer needed.
Now the "read (restore)" use case is described. The use case preferably includes the following steps.
1. Retrieve the Tier 0 entry for the container 102 from the key-value store to find the current value of t2v.
2. Read the "container- ID. <t2v>" file from Tier 2 to memory.
3. This may fail in a race condition if the container file is being updated. In this case, restart from #1.
4. Extract the required data segments 103 from the memory copy of the "container-ID.<t2v>" file.
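A sketch of the restore path, under the same modelling assumptions (a Tier0Store-like store and Tier 2 files kept in a dictionary keyed by "container-ID.<t2v>" and indexable by segment index):

from typing import Iterable, List

def read_segments(store, tier2_files: dict, container_id: str,
                  wanted_indices: Iterable[int], max_retries: int = 3) -> List:
    for _ in range(max_retries):
        entry = store.get(container_id)                         # step 1: find the current t2v
        container = tier2_files.get(f"{container_id}.{entry.t2v}")        # step 2: read Tier 2 file
        if container is None:
            continue                     # step 3: lost a race with an update, retry from step 1
        return [container[i] for i in wanted_indices]           # step 4: extract required segments
    raise RuntimeError("could not obtain a consistent container snapshot")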
Now a "primary recovery" use case is described as an exemplary recovery process. It can be executed on any container 102 that has state not "regular" and whose tOv has not advanced in <n+m> time units. This is an indication that the agent performing a modification died in mid process. Notably, there are only three use cases that could leave a container in a state that is not regular if they fail: "Write-new-data": writes a new version of the Tier 1 file but appends to the Tier 2 file, "defrag": writes new versions of the files in Tiers 1 and 2. "recovery" (this use case, not the previous one): does not write new data, simply rolls back the above two use cases.
In all cases, the original Tier 1 and Tier 2 files either still exist or can be recovered by truncating them. The use case preferably includes the following steps (an illustrative sketch follows the steps):
1. Retrieve the Tier 0 entry for the container from the key-value store to find the current values of t1v and t2v and the number of data segments 103 (<q>) in the container 102.
2. If tOv has not changed, atomically change the state to "write" and version to (tOv+1) (test-and-set). If this update fails, restart this operation after a delay.
3. Read the contents of the "container-ID.<t1v>" file in Tier 1 to ascertain the size <s> that the "container-ID.<t2v>" file should have if it contained only the first <q> data segments 103.
4. If the "container-ID. <t2v>" file is larger than <s>, truncate it to size <s>.
5. If the "container- ID. <tlv>" file has more than <q> data segments 103, truncate it to contain only <q> data segments 103 (should not happen).
6. Build a new Tier 0 entry with state "regular", and versions (t0v+2), tlv and t2v.
Atomically store this entry over the Tier 0 entry if the version is still (t0v+1) (test-and-set). If this update fails, restart this operation after a delay.
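A hedged sketch of primary recovery under the earlier modelling assumptions: the number of segments <q> is taken from the length of the reference-count list, a Tier 1 file is a list of metadata entries, a Tier 2 file is a byte string, and sizes_from_metadata is a hypothetical helper computing the expected Tier 2 size <s> from the first <q> metadata entries:

def primary_recovery(store, tier1_files: dict, tier2_files: dict, container_id: str,
                     sizes_from_metadata) -> bool:
    entry = store.get(container_id)                                       # step 1
    q = len(entry.refcounts)                  # number of data segments per the Tier 0 entry
    locked = type(entry)(state="write", t0v=entry.t0v + 1,
                         t1v=entry.t1v, t2v=entry.t2v, refcounts=entry.refcounts)
    if not store.put_if_version(container_id, entry.t0v, locked):         # step 2
        return False                                   # delay and retry
    t1_name = f"{container_id}.{entry.t1v}"
    t2_name = f"{container_id}.{entry.t2v}"
    s = sizes_from_metadata(tier1_files[t1_name], q)   # step 3: expected Tier 2 size <s>
    if len(tier2_files[t2_name]) > s:                                     # step 4: truncate Tier 2
        tier2_files[t2_name] = tier2_files[t2_name][:s]
    if len(tier1_files[t1_name]) > q:                                     # step 5: truncate Tier 1
        tier1_files[t1_name] = tier1_files[t1_name][:q]
    final = type(entry)(state="regular", t0v=entry.t0v + 2,
                        t1v=entry.t1v, t2v=entry.t2v, refcounts=entry.refcounts)
    return store.put_if_version(container_id, entry.t0v + 1, final)       # step 6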
Now a "secondary recovery" use case is described as an exemplary recovery process. It can be executed on any container 102 that has state not "regular" and whose tOv has not advanced in <n+m> time units. This is an indication that a modification was abandoned mid process for any reason. Notably, tOv is always guaranteed to larger than both tlv and t2v. The use case preferably includes the following steps:
1. Retrieve the Tier 0 entry for the container 102 from the key-value store to find the current values of t1v and t2v and the number of data segments 103 (<q>) in the container 102.
2. If tOv has not changed, atomically change the state to "write" and version to (tOv+1) (test-and-set). 3. Use (tOv+1) for the new values tlv' and t2v' . Invariant: tlv < tlv' and t2v < t2v'
4. Read the "container- ID. <t2v>" file to memory. Write the first <q> data segments 103 to a new "container-ID. <t2v'>" file in Tier 2.
5. Reconstruct a new "container-ID.<t1v'>" file in Tier 1 from the first segment metadata 104 read in step #4.
6. Build a new Tier 0 entry with state "regular", versions (t0v+2), tlv' and t2v' .
Atomically store this entry over the Tier 0 entry if the version is still (t0v+1) (test-and-set).
7. If this update fails, restart this operation after a delay.

8. Remove all "container-ID.<k>" files in Tier 1 for k < t1v'.
9. Remove all "container-ID.<p>" files in Tier 2 for p < t2v'.
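A hedged sketch of secondary recovery, again under the earlier modelling assumptions (Tier 2 files as lists of (first_metadata, payload) tuples and file names of the form "container-ID.<version>"); instead of truncating in place, it rewrites fresh files under new version numbers derived from t0v:

def secondary_recovery(store, tier1_files: dict, tier2_files: dict, container_id: str) -> bool:
    entry = store.get(container_id)                                       # step 1
    q = len(entry.refcounts)                  # number of data segments per the Tier 0 entry
    locked = type(entry)(state="write", t0v=entry.t0v + 1,
                         t1v=entry.t1v, t2v=entry.t2v, refcounts=entry.refcounts)
    if not store.put_if_version(container_id, entry.t0v, locked):         # step 2
        return False
    new_v = entry.t0v + 1                     # step 3: t1v' = t2v' = t0v + 1 > t1v, t2v
    old_segments = tier2_files[f"{container_id}.{entry.t2v}"]             # step 4: read old Tier 2 file
    tier2_files[f"{container_id}.{new_v}"] = old_segments[:q]             #         write first <q> segments
    tier1_files[f"{container_id}.{new_v}"] = [md for md, _ in old_segments[:q]]   # step 5: rebuild Tier 1
    final = type(entry)(state="regular", t0v=entry.t0v + 2,
                        t1v=new_v, t2v=new_v, refcounts=entry.refcounts)
    if not store.put_if_version(container_id, entry.t0v + 1, final):      # steps 6-7
        return False                                   # delay and retry
    for name in [n for n in tier1_files                                   # step 8: drop stale Tier 1 files
                 if n.startswith(container_id + ".") and int(n.rsplit(".", 1)[1]) < new_v]:
        tier1_files.pop(name)
    for name in [n for n in tier2_files                                   # step 9: drop stale Tier 2 files
                 if n.startswith(container_id + ".") and int(n.rsplit(".", 1)[1]) < new_v]:
        tier2_files.pop(name)
    return True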
It is noted that in the present invention "atomically" is defined as follows: an update to a set of data items stored in a backup node and replicated between backup nodes is atomic if and only if the update in its entirety becomes visible to all backup nodes simultaneously. In other words, any and all backup nodes retrieving the set of data items will receive a consistent set of the data items prior to the update (if queried before the point in time of the update) or the consistent updated set of data items (if queried after the point in time of the update).
Fig. 7 shows a method 700 for storing deduplicated data, i.e. for storing received data blocks as deduplicated data blocks. The method 700 may be carried out by or in the network system 100 shown in the previous Figs. In particular, the method 700 comprises a method step 701 of storing one or more containers 102 in a common network accessible repository 101, each container 102 including one or more data segments 103 and first segment metadata 104 for each data segment 103. The method 700 also comprises another method step 702 of storing, in a plurality of backup nodes 105, second segment metadata 106 for each data segment 103 of at least one container 102 in the repository 101, the second segment metadata 106 including at least a liveness indicator 107 for each data segment 103 of the container 102. Obviously, the repository 101 mentioned in the method 700 may be the repository 101 shown in Fig. 1 or Fig. 2 or the file server 101 of the network system 100 shown in the Figs. 3 and 6. The plurality of backup nodes 105 mentioned in the method 700 may be the backup nodes 105 of the network system 100 shown in the Figs. 1, 2, 3 and 6.
In summary, this invention provides a storage architecture for reliably storing deduplicated data. To this end, the invention splits the storage of the deduplicated data between storage Tiers. Updates to the segment liveness indicators 107 are fast and transactional. Performance is improved, because storage characteristics match usage patterns. Reads of the first segment metadata 104 (e.g. the hashes of data segments 103) are fast. Costs are kept down by using cheaper storage for the large bulk of the data segments 103. Restores are fast, because the bulk of the data segments 103 is read sequentially from storage optimized for this. Storage use cases are tied across storage Tiers by version numbers 109-111 and leases that guarantee no data loss, readability of all data at all times, and restartable recovery.
Thus, this invention shows how to scale and distribute global deduplication linearly to an unlimited number of servers in a cluster, with fault tolerance, while still maintaining performance and reliability without increasing cost. This has been done without changing the server hardware specification from conventional generations that did not support distributed global deduplication.
The present invention has been described in conjunction with various embodiments as examples as well as implementations. However, other variations can be understood and effected by those persons skilled in the art and practicing the claimed invention, from a study of the drawings, this disclosure and the independent claims. In the claims as well as in the description, the word "comprising" does not exclude other elements or steps and the indefinite article "a" or "an" does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.

Claims

1. A network system (100) for storing deduplicated data, the network system (100) comprising
a common network accessible repository (101), the repository (101) storing one or more containers (102), each container (102) including one or more data segments (103) and first segment metadata (104) for each data segment (103), and
a plurality of backup nodes (105), a backup node (105) storing, for at least one container (102) in the repository (101), second segment metadata (106) for each data segment (103) of the container (102), the second segment metadata (106) including at least a liveness indicator (107) for each data segment (103) of the container (102).
2. The network system (100) according to claim 1, wherein
the liveness indicator (107) for a data segment (103) is a reference count for the data segment (103), the reference count indicating a number of deduplicated data blocks referring to that data segment (103).
3. The network system (100) according to claim 1 or 2, wherein
the plurality of backup nodes (105) are configured to replicate among each other the second segment metadata (106) of the at least one container (102), in particular the liveness indicator (107) for each data segment (103) of the container (102).
4. The network system (100) according to one of the claims 1 to 3, wherein
the second segment metadata (106) of the at least one container (102) further includes a state (108) of the container (102), and
the container state (108) is "regular" if the container (102) can currently be updated by any agent and the state (108) is "locked" if the container (102) cannot currently be updated by any agent other than the agent currently locking the container (102).
5. The network system (100) according to one of the claims 1 to 4, wherein
the second segment metadata (106) of each container (102) further includes a current version number (109) of the second segment metadata (106) of the container (102).
6. The network system (100) according to one of the claims 1 to 5, wherein the second segment metadata (106) of each container (102) further includes a current version number (110) of the first segment metadata (104) of the container (102) and a current version number (111) of the data segments (103) of the container (102).
7. The network system (100) according to the claims 5 and 6, wherein
the plurality of backup nodes (105) are configured to cooperate to atomically and persistently update the second segment metadata (106) of at least one container (102), wherein any subset of the container state (108), the version number (110) of the first segment metadata (104), the version number (111) of the data segments (103) and/or a plurality of liveness indicators (107) of the second segment metadata (106) of the container (102) is updated, if the version number (109) of the second segment metadata (106) has not changed.

8. The network system (100) according to claim 4 or 5 and according to claim 6 or 7, further comprising
at least one agent configured to modify the data segments (103), the first segment metadata (104) and/or the second segment metadata (106) of a container (102),
wherein the at least one agent is configured to recover the "regular" state (108) of a container (102), which is in a "locked" state (108) and for which the version number (109) of the second segment metadata (106) did not change for a time period larger than a determined threshold value, based on the current version numbers (110, 111) of the first segment metadata (104) and the data segments (103), respectively.

9. The network system (100) according to claim 8, wherein
the at least one agent is configured to modify any of the version numbers (109, 110, 111) of a container (102) after modifying the second segment metadata (106), the first segment metadata (104) and/or the data segments (103) of the container (102), respectively.
10. The network system (100) according to one of the claims 7 to 9, wherein
any operation on the data segments (103) and the first and second segment metadata (104, 106) is limited such that the "regular" state (108) of the container (102) can be recovered after a crash at any point in the operation.
11. The network system (100) according to one of the claims 8 to 10, wherein the at least one agent is configured to, for recovering the "regular" state (108) of a container (102),
retrieve, from the second segment metadata (106) of the container (102), the current version numbers (110, 111) of the first segment metadata (104) and the data segments (103) and a number of the data segments (103) of the container (102),
atomically and persistently increment the version number (109) of the second segment metadata (106), if it has not changed since the retrieving,
read the first segment metadata (104) with the current version number (110) of the first segment metadata (104) as retrieved with the second segment metadata (106) from the repository (101),
determine, from the read first segment metadata (104), a size that the data segments (103) of the container (102) with the current version number (111) of the data segments (103) of the container (102) should have,
truncate the size of the data segments (103) of the container (102) according to the determined size, if the size of the data segments (103) is larger than the determined size, and/or
truncate the first segment metadata (104) of the container (102) according to the retrieved number of data segments (103), and
atomically and persistently reset the state (108) of the container (102) to "regular" and increment the version number (109) of the second segment metadata (106), if it has not changed in the meantime.

12. The network system (100) according to one of the claims 4 to 11, wherein
data can be restored from a container (102) that is in the "regular" state (108) or in the "locked" state (108) at any time.
13. The network system (100) according to one of the claims 1 to 12, wherein
the repository (101) stores the data segments (103) of each container (102) in a first storage and stores the first segment metadata (104) of each container (102) in a second storage, which can be different from the first storage.
14. The network system (100) according to claim 13, wherein the first storage further stores, for each container (102), also the first segment metadata (104) of the container (102).
15. The network system (100) according to claim 13 or 14, wherein
the read latency of the second storage is not higher than the read latency of the first storage, and/or
preferably the second storage is a solid state drive, SSD, or is a serial attached SCSI, SAS, storage and/or the first storage is a serial advanced technology attachment, SATA storage.
16. Method (700) for storing deduplicated data, the method (700) comprising
storing (701) one or more containers (102) in a common network accessible repository (101), each container (102) including one or more data segments (103) and first segment metadata (104) for each data segment (103), and
storing (702), in a plurality of backup nodes (105), second segment metadata
(106) for each data segment (103) of at least one container (102) in the repository (101), the second segment metadata (106) including at least a liveness indicator (107) for each data segment (103) of the container (102).

17. Computer program product comprising a program code for controlling a network system (100) according to one of the claims 1 to 15 or for performing, when running on a computer, the method (700) according to claim 16.
PCT/EP2017/071464 2017-08-25 2017-08-25 Network system and method for deduplicating data WO2019037876A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/EP2017/071464 WO2019037876A1 (en) 2017-08-25 2017-08-25 Network system and method for deduplicating data
CN201780093463.9A CN110945483B (en) 2017-08-25 2017-08-25 Network system and method for data de-duplication

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2017/071464 WO2019037876A1 (en) 2017-08-25 2017-08-25 Network system and method for deduplicating data

Publications (1)

Publication Number Publication Date
WO2019037876A1 true WO2019037876A1 (en) 2019-02-28

Family

ID=59702730

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2017/071464 WO2019037876A1 (en) 2017-08-25 2017-08-25 Network system and method for deduplicating data

Country Status (2)

Country Link
CN (1) CN110945483B (en)
WO (1) WO2019037876A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110161291A1 (en) * 2009-12-28 2011-06-30 Riverbed Technology, Inc. Wan-optimized local and cloud spanning deduplicated storage system
US8954398B1 (en) * 2013-01-04 2015-02-10 Symantec Corporation Systems and methods for managing deduplication reference data

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8041907B1 (en) * 2008-06-30 2011-10-18 Symantec Operating Corporation Method and system for efficient space management for single-instance-storage volumes
CN102982180B (en) * 2012-12-18 2016-08-03 华为技术有限公司 Date storage method and equipment
WO2016091282A1 (en) * 2014-12-09 2016-06-16 Huawei Technologies Co., Ltd. Apparatus and method for de-duplication of data
CN106066896B (en) * 2016-07-15 2021-06-29 中国人民解放军理工大学 Application-aware big data deduplication storage system and method


Also Published As

Publication number Publication date
CN110945483A (en) 2020-03-31
CN110945483B (en) 2023-01-13


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17757765

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17757765

Country of ref document: EP

Kind code of ref document: A1