CN110945483B - Network system and method for data de-duplication - Google Patents


Info

Publication number
CN110945483B
Authority
CN
China
Prior art keywords
container
metadata
data
piece
segment
Legal status
Active
Application number
CN201780093463.9A
Other languages
Chinese (zh)
Other versions
CN110945483A
Inventor
Michael Hirsch
Yair Toaff
Yehonatan David
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Publication of CN110945483A
Application granted
Publication of CN110945483B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 - File systems; File servers
    • G06F 16/17 - Details of further file system functions
    • G06F 16/174 - Redundancy elimination performed by the file system
    • G06F 16/1748 - De-duplication implemented within the file system, e.g. based on file segments
    • G06F 16/1752 - De-duplication implemented within the file system based on file chunks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 - Error detection; Error correction; Monitoring
    • G06F 11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/14 - Error detection or correction of the data by redundancy in operation
    • G06F 11/1402 - Saving, restoring, recovering or retrying
    • G06F 11/1446 - Point-in-time backing up or restoration of persistent data
    • G06F 11/1448 - Management of the data involved in backup or backup restore
    • G06F 11/1453 - Management of the data involved in backup or backup restore using de-duplication of the data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 - Error detection; Error correction; Monitoring
    • G06F 11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/16 - Error detection or correction of the data by redundancy in hardware
    • G06F 11/20 - Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F 11/2053 - Error detection or correction of the data by redundancy in hardware using active fault-masking, where persistent mass storage functionality or persistent mass storage control functionality is redundant
    • G06F 11/2094 - Redundant storage or storage space

Abstract

The present invention provides a network system 100 for storing deduplicated data. The network system 100 includes a common network accessible repository 101, the repository 101 storing one or more containers 102. Each container 102 includes one or more data segments 103 and first segment metadata 104 for each data segment 103. The network system 100 also includes a plurality of standby nodes 105. For at least one container 102 in the repository 101, a standby node 105 stores second segment metadata 106 for each data segment 103 of the container 102, the second segment metadata 106 including at least an activity indicator 107 for each data segment 103 of the container 102.

Description

Network system and method for data de-duplication
Technical Field
The present invention relates to a network system and method for data de-duplication, and more particularly to a network system and method for storing deduplicated data, for example storing received data chunks as deduplicated data chunks. In particular, the present invention relates to the field of distributed deduplication, i.e. deduplication in a distributed environment such as a distributed server cluster.
Background
It has become common practice to reduce the size of stored backups by removing duplicate data using a process called "deduplication". Instead of storing another copy of duplicate data, the process stores some form of reference to the location where that data is already stored. These references and other items "about" the stored data are commonly referred to as metadata.
The metadata describing how a data chunk is stored is generally referred to as a deduplicated data chunk, and is essentially a list of the data segments in which the chunk's data is stored. A data segment is a contiguous sequence of bytes; after a data chunk to be stored is received, it is typically partitioned into such data segments (fragments). Typical data segment lengths vary from product to product, but average about 4 kB. A chunk may contain thousands of such data segments.
The data segments are stored in containers, each of which may contain thousands of data segments. Each data segment is stored in a container together with associated segment metadata, and the segment metadata of all data segments in a container is collectively referred to as container metadata. The segment metadata of a data segment may include the storage details of the data segment within the container and a strong hash of the data segment.
In addition, the segment metadata of a data segment may include a reference count indicating how many data chunks the (unique) data segment is found in. The initial reference count of any new data segment is typically 1, indicating its first use. When a new chunk references an existing data segment, the reference count of that data segment is incremented. When a chunk that references a data segment is deleted, the reference count of the data segment is decremented. When the reference count of a data segment reaches 0, its space may be reclaimed.
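For illustration only, the following sketch (hypothetical Python; the names are not taken from the patent) models this conventional layout, in which each segment's metadata, including its reference count, lives together inside the container:

```python
from dataclasses import dataclass, field
from hashlib import sha256

@dataclass
class SegmentMetadata:
    """Per-segment metadata stored alongside the segment inside a container."""
    strong_hash: bytes   # strong hash of the segment's bytes
    offset: int          # where the segment starts inside the container
    length: int          # segment length (about 4 kB on average)
    ref_count: int = 1   # a new segment starts with a reference count of 1

@dataclass
class Container:
    """A container holds up to thousands of data segments plus their metadata."""
    container_id: int
    segments: list[bytes] = field(default_factory=list)
    metadata: list[SegmentMetadata] = field(default_factory=list)

    def append_segment(self, data: bytes) -> int:
        offset = sum(len(s) for s in self.segments)
        self.segments.append(data)
        self.metadata.append(SegmentMetadata(sha256(data).digest(), offset, len(data)))
        return len(self.metadata) - 1   # index of the new segment in this container

# A deduplicated data chunk is then just a list of (container_id, segment_index) references.
chunk_recipe = [(42, 0), (42, 1), (17, 993)]
```

The problem discussed next follows directly from this layout: because the reference count sits inside the segment metadata record, every increment or decrement rewrites the whole record.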
However, conventional deduplication systems have a problem: the segment metadata of a data segment, which includes the reference count, must be rewritten each time the reference count is incremented or decremented. This results in a large number of input/output operations (I/Os), which negatively impacts the overall performance of the conventional deduplication system.
To address this problem, it has been proposed not to use reference counting at all, which leaves a live system requiring complex mark-and-sweep algorithms to reclaim unused data. However, such marking and sweeping of a live system is very complex and error prone, leaves garbage behind, requires multiple passes, and is therefore performance intensive.
Moreover, data deduplication is generally computationally intensive and is therefore desirably performed in a distributed manner, for example on one or more server clusters. In this case, the metadata is stored in a distributed database in each cluster, and several or all of the servers in the cluster can access it. Distributed databases typically handle all of the consistency and fault tolerance required for normal deduplication operation (e.g., temporary failure of a node, replacement of a node, disk failure, etc.).
However, the effective scalability of conventional deduplication systems is limited to only 2 to 4 servers per cluster, i.e., the number of servers that can act simultaneously on the same deduplication range is limited.
In addition, conventional deduplication systems have the problem that data cannot be recovered when a node in the cluster fails. Because critical data is not replicated, the loss of a single node may result in a complete loss of data.
However, simply replicating the distributed database to a remote database introduces latency into each database operation, requires a communication link to the remote site, and requires storage space for the remote database at that site. For a full backup, the database itself also needs to be replicated, and even these replicas must be synchronized. This requires a large amount of computing resources and results in further latency.
Disclosure of Invention
In view of the above-mentioned problems and disadvantages, the present invention aims to improve conventional deduplication. An object of the present invention is to provide a network system and method for storing deduplicated data, and in particular for storing deduplicated data in a distributed manner, in which the scalability of the network system is not limited. This means that there may be more than 4 servers per cluster, and in particular up to 64 servers. All servers in a cluster should be able to act simultaneously on the same deduplication range, i.e., on the same deduplicated stored data. Each server should also perform as well as or better than the servers in conventional distributed deduplication systems. Specifically, the number of necessary I/Os should be reduced. In addition, the necessary information should be replicated in order to avoid losing data if a node is lost.
The object of the invention is achieved by the solution provided in the independent claims. Advantageous embodiments of the invention are further defined in the dependent claims.
In particular, the present invention proposes separating certain portions of the segment metadata of a data segment, including the activity indicator (e.g., a reference count), from the rest of the segment metadata, and storing them in different physical locations. The present invention thereby demonstrates how activity indicators can be used reliably and efficiently even in a distributed environment.
A first aspect of the present invention provides a network system that stores deduplicated data, the network system including: a common network accessible repository storing one or more containers, each container comprising one or more data segments and a first segment metadata for each data segment; and a plurality of standby nodes, for at least one container in the repository, the standby nodes storing second segment metadata for each data segment of the container, the second segment metadata including at least an activity indicator for each data segment of the container.
The network system of the first aspect improves upon conventional systems by separating, for each data segment in the container, the second segment metadata, which includes the activity indicator, from the first segment metadata. Thus, the network system avoids rewriting the entire segment metadata of a data segment whenever the corresponding activity indicator changes; only the second piece of metadata needs to be rewritten. This saves a large number of I/Os and improves the overall performance of the network system, especially since the activity indicators of data segments typically change constantly during deduplication.
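Continuing the illustrative sketch from the background section (hypothetical Python; the names are assumptions rather than the patent's own), the separation can be pictured as two records held in different physical locations, so that a reference-count change touches only the second one:

```python
from dataclasses import dataclass, field

@dataclass
class FirstSegmentMetadata:
    """Kept in the repository next to the data segment; written once, read often."""
    strong_hash: bytes
    offset: int
    length: int

@dataclass
class SecondSegmentMetadata:
    """Kept (and replicated) on the standby nodes; updated frequently."""
    activity_indicators: list[int] = field(default_factory=list)  # one reference count per segment

def increment_reference(second: SecondSegmentMetadata, segment_index: int) -> None:
    # Only the second piece of metadata is rewritten; the first piece and the data stay untouched.
    second.activity_indicators[segment_index] += 1
```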
The network system of the first aspect has no practical limitation on its scalability: at least up to 64 servers (e.g., agents) in a cluster may be used to modify the data segments and to modify the first and second pieces of metadata of the data segments in the containers, respectively. This ability to expand the hardware provides scalability in processing power and in the overall amount of managed data. All servers in the cluster may act on the same deduplicated data simultaneously, and each server performs as well as or better than previous non-distributed servers.
In an implementation form of the first aspect, the liveness indicator of a data segment is a reference count of the data segment, the reference count indicating a number of deduplicated data chunks that reference the data segment.
Therefore, unlike the conventional system, the network system of the first aspect can reliably use reference counting in a distributed environment. By separating at least the reference count from the first piece of metadata (which includes, for example, a hash of the corresponding data segment), the first piece of metadata need not be overwritten whenever the reference count needs to be incremented or decremented. Therefore, the performance of the network system is significantly improved.
In a further implementation form of the first aspect, the plurality of standby nodes are configured to replicate the second piece of metadata of the at least one container among one another, in particular to replicate the activity indicator of each data segment of the container among one another.
Unlike conventional systems, the network system of the first aspect is thus able to replicate the necessary information to ensure that a complete loss of data is avoided in the event of a node failure. Specifically, the network system can tolerate up to k node failures when there are at least 2k + 1 nodes in the cluster.
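As a small arithmetic illustration of this replication requirement (a sketch assuming a majority-quorum replication scheme; not a definition taken from the patent):

```python
def replicas_needed(k_failures: int) -> int:
    """Copies of the second piece of metadata needed to tolerate k standby-node failures."""
    return 2 * k_failures + 1

def failures_tolerated(n_nodes: int) -> int:
    """Largest k satisfying 2k + 1 <= n, i.e. k <= (n - 1) / 2."""
    return (n_nodes - 1) // 2
```

For example, failures_tolerated(5) is 2, and tolerating those 2 failures requires replicas_needed(2) = 5 copies.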
In another implementation form of the first aspect, the second piece of metadata of the at least one container further includes a state of the container, the container state being "normal" if any agent can currently update the container, and "locked" if no agent other than the agent that currently locks the container can update it.
With these container states, only one agent at a time can perform an actual update process (i.e., a system update), such as appending data segments to the container or defragmenting the container. The update lock provided by the "locked" state prevents simultaneous updates. However, the most common and fast operations, such as incrementing and decrementing reference counts, need not use this lock and can instead use a basic check-and-set operation. In addition, the container can still be read consistently even in the "locked" state. This provides the ability to read logical entities at any time, even if a system update stops partway and leaves the container "locked".
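A minimal sketch of such a check-and-set primitive, assuming a versioned in-memory key-value store (the class and method names are illustrative, not part of the patent); later sketches in this document reuse it:

```python
import threading

class VersionedStore:
    """Toy stand-in for the replicated key-value store: every key carries a value and a version."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._data = {}   # key -> (value, version)

    def get(self, key):
        """Return (value, version); (None, 0) if the key does not exist yet."""
        return self._data.get(key, (None, 0))

    def check_and_set(self, key, expected_version, new_value) -> bool:
        """Apply the update only if the version has not changed since it was read."""
        with self._lock:
            _, version = self._data.get(key, (None, 0))
            if version != expected_version:
                return False              # another agent updated the entry in the meantime
            self._data[key] = (new_value, version + 1)
            return True
```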
In another implementation form of the first aspect, the second piece of metadata of each container further comprises a current version number of the second piece of metadata of the container.
In another implementation form of the first aspect, the second piece of metadata of each container further includes a current version number of the first piece of metadata of the container and a current version number of the data piece of the container.
Each of these version numbers may be used as a suffix of the corresponding file name. By providing such version numbers for the different components of the segment data (the data segments, the first segment metadata, and the second segment metadata), which are stored in different physical storage locations, the consistency of the logical container across the network system is guaranteed. For this purpose, a lease-like mechanism may be used.
In another implementation form of the first aspect, the plurality of standby nodes are configured to cooperate to atomically and consistently update any subset of the second segment metadata and the container state of at least one container, i.e., to update the version number of the first segment metadata of the container, the version number of the data segments, and/or a plurality of activity indicators of the second segment metadata, provided that the version number of the second segment metadata has not changed.
An update to any set of data items stored in a standby node and replicated between standby nodes is referred to herein as "atomic" if and only if the entire update becomes visible to all standby nodes simultaneously. In other words, any and all standby nodes that retrieve the set of data items receive either the consistent set of data items as it was prior to the update (if queried before the point in time of the update) or the consistent set of updated data items (if queried after the point in time of the update).
In another implementation form of the first aspect, the network system further comprises at least one agent for modifying the data segments, the first segment metadata, and/or the second segment metadata of a container, wherein the at least one agent is configured to restore the "normal" state of a container, based on the current version numbers of the first segment metadata and of the data segments, respectively, if the container is in the "locked" state and the version number of its second segment metadata has not changed for a period of time greater than a determined threshold.
Because the "normal" state may be restored, the network system may perform deduplication more efficiently.
In another implementation form of the first aspect, the at least one agent is configured to modify any of the version numbers of a container after modifying the second piece of metadata, the first piece of metadata, and/or the data piece of the container, respectively.
By keeping the version number up to date, consistency of the logical container on different storage locations of its segment data on the network system is guaranteed.
In another implementation form of the first aspect, any operations on the data segment and the first and second pieces of metadata are restricted such that the "normal" state of the container may be recovered after a crash at any point of the operation.
Therefore, data and storage space loss can be effectively avoided.
In another implementation form of the first aspect, for restoring the "normal" state of a container, the at least one agent is configured to: retrieve, from the second piece of metadata of the container, the current version numbers of the first piece of metadata and of the data segments as well as the number of data segments of the container; atomically and consistently increment the version number of the second piece of metadata if it has not changed since the retrieval; read from the repository the first piece of metadata having the current version number retrieved with the second piece of metadata; determine, from the read first piece of metadata, the size that the data segments of the container, identified by their current version number, should have; truncate the data segments of the container to the determined size if their size is larger than the determined size; and/or truncate the first piece of metadata of the container according to the retrieved number of data segments; and atomically and consistently reset the state of the container to "normal" and increment the version number of the second piece of metadata if the version number of the second piece of metadata has not changed in the meantime.
According to this implementation form, the at least one agent is used to perform a recovery procedure. It is noted that this recovery procedure is a preferred example; the at least one agent may be used to perform any other recovery procedure that restores the container.
In this manner, the "normal" state of the container can be effectively restored if an agent fails and does not return the container to a normal condition. After recovery, all three parts of the container are coherent and no data or storage space is lost.
In another implementation form of the first aspect, data may be recovered from the container in a "normal" state or in a "locked" state at any time.
This enables a recovery operation even if a real system update for any reason places the container in a "locked" state.
In another implementation form of the first aspect, the repository stores the data segments of each container in a first storage device, and stores the first segment metadata of each container in a second storage device, which may be different from the first storage device.
The first storage device and the second storage device may be physically separate devices. They may be of the same type (e.g., each may be an SSD), but are preferably different (e.g., they differ in read latency, capacity, or cost). The use of different (types of) storage devices makes it possible to place each part of the segment data according to its type (data segment, first segment metadata, or second segment metadata), which can improve the performance and reliability of the network system.
In another implementation form of the first aspect, the first storage device also stores, for each container, the first piece of metadata of the container.
Storing the first piece of metadata a second time improves the network system in case a restoration is needed.
In a further implementation form of the first aspect, the read latency of the second storage device is not higher than the read latency of the first storage device, and/or preferably the second storage device is a solid state drive (SSD) or a serial attached SCSI (SAS) storage device, and/or the first storage device is a serial advanced technology attachment (SATA) storage device.
Thus, data segments, which are written once and are rarely updated or read, are stored on large, low-cost storage devices, while the first piece of metadata, which is rarely written but often read, is stored on a storage device with low read latency. The network system thereby performs more efficiently and costs can be saved.
A second aspect of the present invention provides a method of storing deduplicated data, the method comprising storing one or more containers in a common network accessible repository, each container comprising one or more data segments and a first segment metadata for each data segment; and storing second segment metadata for each data segment of at least one container in the repository in the plurality of standby nodes, the second segment metadata including at least an activity indicator for each data segment of the container.
In an implementation form of the second aspect, the activity indicator of a data segment is a reference count of the data segment, the reference count indicating a number of deduplicated data chunks that reference the data segment.
In a further implementation form of the second aspect, the plurality of standby nodes replicate the second piece of metadata of the at least one container among one another, in particular the activity indicator of each data segment of the container.
In another implementation form of the second aspect, the second piece of metadata of the at least one container further includes a state of the container, the container state being "normal" if any agent can currently update the container, and "locked" if no agent other than the agent that currently locks the container can update it.
In another implementation form of the second aspect, the second piece of metadata of each container further comprises a current version number of the second piece of metadata of the container.
In another implementation form of the second aspect, the second piece of metadata of each container further includes a current version number of the first piece of metadata of the container and a current version number of the data piece of the container.
In another implementation form of the second aspect, the plurality of standby nodes cooperate to atomically and consistently update any subset of the second piece of metadata and the container state of at least one container, i.e., to update the version number of the first piece of metadata of the container, the version number of the data segments, and/or a plurality of activity indicators of the second piece of metadata, provided that the version number of the second piece of metadata has not changed.
In another implementation form of the second aspect, at least one agent modifies the data segments, the first piece of metadata, and/or the second piece of metadata of a container, wherein the at least one agent restores the "normal" state of a container, based on the current version numbers of the first piece of metadata and of the data segments, respectively, if the container is in the "locked" state and the version number of its second piece of metadata has not changed for a period of time greater than a determined threshold.
In another implementation form of the second aspect, the at least one agent modifies any of the version numbers of the containers after modifying the second piece of metadata, the first piece of metadata, and/or the data piece of the container, respectively.
In another implementation form of the second aspect, any operations on the data segment and the first and second pieces of metadata are restricted such that the "normal" state of the container may be recovered after a crash at any point of the operation.
In another implementation form of the second aspect, for restoring the "normal" state of a container, at least one agent retrieves, from the second piece of metadata of the container, the current version numbers of the first piece of metadata and of the data segments as well as the number of data segments of the container; atomically and consistently increments the version number of the second piece of metadata if it has not changed since the retrieval; reads from the repository the first piece of metadata having the current version number retrieved with the second piece of metadata; determines, from the read first piece of metadata, the size that the data segments of the container, identified by their current version number, should have; truncates the data segments of the container to the determined size if their size is larger than the determined size; and/or truncates the first piece of metadata of the container according to the retrieved number of data segments; and atomically and consistently resets the state of the container to "normal" and increments the version number of the second piece of metadata if the version number of the second piece of metadata has not changed in the meantime.
In another implementation form of the second aspect, data may be recovered from a container in the "normal" state or in the "locked" state at any time.
In another implementation form of the second aspect, the repository stores the data segments of each container in a first storage device, and stores the first segment of metadata of each container in a second storage device, which may be different from the first storage device.
In another embodiment of the second aspect, the first storage device further stores the first piece of metadata of the container for each container.
In another implementation form of the second aspect, the read latency of the second storage device is not higher than the read latency of the first storage device, and/or preferably the second storage device is a solid state drive (SSD) or a serial attached SCSI (SAS) storage device, and/or the first storage device is a serial advanced technology attachment (SATA) storage device.
A third aspect of the present invention provides a computer program product comprising program code for controlling a network system provided by the first aspect or one of its embodiments, or for performing the method provided by the second aspect or any one of its embodiments, when it is run on a computer.
The computer program product of the third aspect realizes all the benefits and effects of the network system of the first aspect and the method of the second aspect, respectively.
It should be noted that all devices, elements, units and means described in the present application may be implemented in software or hardware elements or any kind of combination thereof. All steps performed by the various entities described in the present application and the functions described to be performed by the various entities are intended to indicate that the respective entities are adapted or arranged to perform the respective steps and functions. Although in the following description of specific embodiments specific functions or steps performed by an external entity are not reflected in the description of specific elements of the entity performing the specific steps or functions, it should be clear to a skilled person that these methods and functions may be implemented in respective hardware or software elements or any combination thereof.
Drawings
The aspects and embodiments of the invention described above will be explained in the following description of specific embodiments, taken in conjunction with the accompanying drawings, in which:
fig. 1 shows a network system according to an embodiment of the present invention.
Fig. 2 shows a network system according to an embodiment of the invention.
Fig. 3 illustrates a network system according to an embodiment of the present invention.
Fig. 4 shows the logical container (b) and shows the contents of one data segment (a).
Fig. 5 illustrates a container physically stored in a network system according to an embodiment of the present invention.
Fig. 6 illustrates a network system according to an embodiment of the present invention.
FIG. 7 illustrates a method according to an embodiment of the invention.
Detailed Description
Fig. 1 illustrates a network system 100 according to an embodiment of the invention. Network system 100 is used to store deduplicated data, i.e., to store, for example, received data chunks as deduplicated data chunks.
Network system 100 includes a common network accessible repository 101 and a plurality of standby nodes 105.
Repository 101 may be a standard storage repository of a network system. However, repository 101 may include at least a first storage device and a second storage device different from the first storage device, i.e., the storage devices differ in, for example, type, storage capacity, read latency, and/or read speed. Preferably, the read latency of the second storage device is not higher than the read latency of the first storage device. Preferably, the second storage device is a solid state drive (SSD) or a serial attached SCSI (SAS) storage device. Preferably, the first storage device is a serial advanced technology attachment (SATA) storage device.
Repository 101 stores one or more containers 102, each container 102 including one or more data segments 103 and a first segment metadata 104 for each data segment 103. Preferably, the repository 101 stores the data segment 103 of the container 102 in the first storage means and stores the first segment metadata 104 of the container 102 in the second storage means.
Multiple standby nodes 105 may each maintain a database to store information. Specifically, for at least one container 102 in repository 101, standby node 105 stores a second segment of metadata 106 for each data segment 103 of container 102. The second segment of metadata 106 includes at least an activity indicator 107 of the data segment 103. Preferably, the liveness indicator 107 of the data segment 103 is a reference count of the data segment 103, wherein the reference count indicates a number of deduplicated data chunks referencing the data segment 103. The plurality of standby nodes 105 preferably replicate at least the second segment of metadata 106, and in particular, at least the liveness indicators 107 of each data segment 103 of the container 102, with each other.
Fig. 2 illustrates a network system 100 that builds on the network system of fig. 1, according to an embodiment of the present invention. Preferably, as shown in FIG. 2, the second piece of metadata 106 of at least one container 102 also includes a state 108 of the container 102. The container state 108 is "normal" if any agent can currently update the associated container 102, and "locked" if no agent other than the agent that currently locks the container 102 can update it. The second piece of metadata 106 of one or more containers 102, or of each container 102, may preferably further include a current version number 109 of the second piece of metadata 106. Additionally, even more preferably, the second piece of metadata 106 of one or more containers 102, or of each container 102, may further include the current version number 110 of the first piece of metadata 104 of the container 102 and/or the current version number 111 of the data segments 103. Fig. 2 shows a preferred embodiment in which the second piece of metadata 106 comprises all version numbers 109, 110 and 111 as well as the container state 108.
The plurality of standby nodes 105 may also cooperate to atomically and consistently update any subset of the second piece of metadata 106 and the container state of the at least one container 102, i.e., to update the version number 110 of the first piece of metadata 104, the version number 111 of the data segments 103, and/or the plurality of activity indicators 107 of the second piece of metadata 106 of the container 102, provided that the version number 109 of the second piece of metadata 106 has not changed.
Fig. 3 illustrates a network system 100 that builds on the network system 100 of fig. 1 or 2, according to an embodiment of the present invention. The network system 100 accordingly still comprises a plurality of standby nodes 105 and a repository 101; the repository 101 may be a file server as shown in fig. 3. Each standby node 105 may store a database that includes the second segment metadata 106 for the data segments 103 in one or more containers 102 stored in repository 101. To provide additional protection, network system 100 may also include one or more remote backup nodes (not shown in FIG. 3) that remotely replicate the standby nodes 105. Fig. 3 illustrates an example of a relatively small standby node deployment. Of course, larger deployments are also possible, scaling linearly to more standby nodes, e.g., up to 64 standby nodes 105.
FIG. 3 illustrates a network system 100 including a virtual machine monitor network, a distributed database network, a file server network, and an administrator network. Specifically, between the distributed database network and the file server network, there are a plurality of standby nodes 105. In the plurality of standby nodes 105, at least a second piece of metadata 106 is stored. Of course, multiple deduplicated data chunks may also be stored. Additionally, the plurality of standby nodes 105 may also include a deduplication index for each data segment 103, referenced by the deduplicated data chunks in the node.
FIG. 3 also shows a virtual machine monitor network comprising at least one virtual machine connected to the distributed database network. FIG. 3 also shows an administrator network connected to the file server network, wherein the administrator network includes at least one administrator node.
Fig. 4 shows the different logical components of a container 102 that is stored in a physically distributed manner in the network system 100 provided in fig. 1, 2 or 3. The container 102 logically includes segment data for each of a plurality of data segments 103. In particular, FIG. 5 illustrates the different read/write characteristics of the different components of a particular segment of data in container 102. The segment data logically includes the data segment 103 itself, the first piece of metadata 104, and the second piece of metadata 106. The data segment is stored, for example, in the form of compressed data. The first piece of metadata 104 may contain a hash or strong hash of the associated data segment 103, an index into the container 102, an offset into the container 102, the size of the data segment 103 in an uncompressed state, and/or the size of the data segment 103 in a compressed state. The second piece of metadata 106 includes at least an activity indicator 107, here exemplified as a reference count, which according to the present invention is physically separated from the data segment 103 and the first piece of metadata 104, respectively. The separate elements of the segment data in the container 102 are used differently in the deduplication scheme: the activity indicator 107, and optionally the rest of the second piece of metadata 106, is updated frequently; the first piece of metadata 104 is written once but read often; and the data segment 103 is written once and is rarely read or updated. Therefore, in the network system 100 of the present invention, the elements of the segment data are preferably stored separately in different locations.
Fig. 5 illustrates the manner in which the container 102, and in particular the plurality of segment data in the container 102, is physically stored in the network system 100, according to an embodiment of the present invention. The logical container 102 is stored by placing the component parts of its segment data in different storage tiers. Since data segments 103 are rarely written, read, and updated, the data segments 103 are preferably stored in a first storage device of repository 101, which is preferably large-capacity, low-cost, and optimized for sequential I/O, for example a SATA storage device. Since the first piece of metadata 104 is rarely written but often read, the first piece of metadata 104 is stored in a second storage device of the repository 101, which is preferably a storage device with low read latency, such as an SSD or SAS device. Since the second piece of metadata 106, here including the reference count 107, is updated frequently, it is stored in the plurality of standby nodes 105, i.e., it is not stored in the repository 101 with the other components of the logical container 102. Each of the plurality of standby nodes 105 provides a distributed, transactional, replicated, consistent key-value store.
Fig. 6 illustrates a network system 100 that builds on the network system 100 shown in fig. 3, according to an embodiment of the present invention. FIG. 6 illustrates the data placement in a physical cluster in detail. The different components of the segment data are stored in different tiers, i.e., where the storage characteristics are most suitable for the respective component. In this regard, FIG. 6 clearly illustrates the separation of the respective storage locations of the data segment 103, the first segment metadata 104, and the second segment metadata 106 including the activity indicator 107. The data segments 103 are advantageously stored in the file server 101 in a first storage device (also referred to as tier 2), such as SATA. The first piece of metadata 104 is advantageously stored in the file server 101 in a second storage device (also referred to as tier 1), such as an SSD or SAS. The second piece of metadata 106 is stored in the standby nodes 105 (also referred to as level 0), which replicate among one another at least the activity indicators 107, here the reference counts of the data segments 103 in the container 102. Due to this replication, network system 100 can tolerate up to k failures of nodes 105 when there are at least 2k + 1 nodes 105 in the cluster. The fault tolerance can be adjusted: if the user chooses to tolerate <k> failures, where k <= (n-1)/2, the second piece of metadata 106 in the key-value store of the standby nodes 105 must be replicated <2k + 1> times to guarantee a quorum. This uses more local storage space in the nodes 105 and incurs a performance penalty, but it ensures the "consistent" property of the key-value store mentioned above. Together with the "transactional" property of the key-value store, at least the consistency of the activity indicators 107 of the data segments 103 is guaranteed.
Therefore, for level 0, a cached and replicated local store (the fastest) is preferably selected. A key-value store is created in which the key is the container ID and the value may consist of three items: an activity indicator 107, such as a reference count, for each data segment 103 of the container 102; the state of the container 102 ("normal", "write", or "defragmentation"); and version numbers, preferably the current version number 109 of the second piece of metadata 106 of the container 102, and/or the current version number 110 of the first piece of metadata 104 of the container 102, and/or the current version number 111 of the data segments 103 of the container 102.
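The level 0 entry just described could be modelled as follows (a hypothetical sketch building on the VersionedStore above; the field names are assumptions, and the entry's own version number, called t0v below, is represented here by the store's version column):

```python
from dataclasses import dataclass, field

@dataclass
class Tier0Entry:
    """Value stored in the replicated key-value store under the container ID as key."""
    activity_indicators: list[int] = field(default_factory=list)  # one reference count per data segment
    state: str = "normal"   # "normal", "write", or "defragmentation"
    t1v: int = 0            # current version number of the first piece of metadata (level 1 file)
    t2v: int = 0            # current version number of the data segments (level 2 file)
```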
For level 1, a storage device with low read latency (SSD or fast SAS) is preferably selected, since this level sees few writes but frequent reads. It contains, for each container 102, a file storing all the first pieces of metadata 104. For example, the file name is "container-ID.<version number>".
For level 2, storage with few writes and few reads (network storage, possibly even archive-like storage) is preferably chosen. It contains, for each container 102, a file that preferably stores all first pieces of metadata 104 (again, for recovery) and all data segments 103. For example, the file name is "container-ID.<version number>".
Consistency of the logical container 102 is guaranteed across the tiers, preferably by using a lease-like mechanism and by giving the components of the segment data in each tier the version numbers 109 to 111. These version numbers 109 to 111 are referred to as t0v, t1v and t2v. The use cases presented below ensure that data is not lost in the event of a failure of the agent performing an action. For this purpose, data is preferably not deleted until it can be guaranteed that it is no longer needed. In addition, a restoration path for restoring consistency is preferably ensured.
A container may be in a "normal", "write", or "defragmentation" state, where the "write" and "defragmentation" states may be treated as equivalent. Any container that has a non-"normal" state and whose t0v has not increased within <n + m> time units, where <n> is the time in which "most" operations complete and <m> is a safety margin, is a candidate for recovery. Operations that attempt to complete after recovery will fail, but will cause no corruption.
The entries in layer 0 maintain the versions of the remaining components. In the lower layers, the version number may actually be a file-name suffix. In general, all operations described below may be restarted; this applies in particular to the recovery use cases. Layers 1 and 2 may be combined into one layer. Some operations may need to be performed or restarted at some point in time, but these are usually not time critical and can be delayed by adding a reminder to a list to be executed later. The term "delay" is used hereinafter to denote this. Multiple delayed operations on the same container 102 generally indicate that the container 102 needs to be recovered.
Some specific use cases are described below. Notably, in the normal flow of deduplication, the "deduplication" and "write new data" use cases are used together. The described use cases are examples of a set of use cases that may be used for a deduplication process and that allow recovery from each step of a failed process.
The use case of "deduplication" is now described. This use case only updates the activity indicator 107 in layer 0, e.g., reference count. The use case preferably comprises the steps of:
1. The level 0 entry of the container 102, which includes the second piece of metadata 106, is retrieved from the key-value store.
2. If the container state is not "normal", a different set of containers 102 is selected for deduplication.
3. The activity indicator 107 in the in-memory copy of the entry is updated.
4. If t0v has not changed, the activity indicator 107 is atomically updated in the key-value store and the version is updated to (t0v + 1) (check and set).
5. If this fails, the process starts again at #1. If t1v has changed in the meantime, the new first pieces of metadata 104 of the container 102 must be retrieved.
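A sketch of this retry loop, reusing the hypothetical VersionedStore and Tier0Entry defined above (the function name and return convention are assumptions, not the patent's):

```python
import copy

def deduplicate_against(store, container_id, segment_indices):
    """Increment the reference counts of the given segments with a check-and-set retry loop."""
    while True:
        entry, t0v = store.get(container_id)           # step 1: fetch the level 0 entry
        if entry is None or entry.state != "normal":
            return False                               # step 2: caller picks other containers
        updated = copy.deepcopy(entry)
        for i in segment_indices:
            updated.activity_indicators[i] += 1        # step 3: update the in-memory copy
        if store.check_and_set(container_id, t0v, updated):
            return True                                # step 4: atomic update, version becomes t0v + 1
        # step 5: t0v changed in the meantime; start again at step 1
```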
The "delete" use case is now described. This use case only updates the activity indicator 107 in layer 0, e.g., reference count. This use case occurs when a data block is deleted. The activity indicator 107 of the data segment 103 of the block must be decremented. The use case comprises the following steps:
1. a level 0 entry for the container 102 is retrieved from the key-value store.
2. If the container status is not "normal," the operation is resumed after a delay.
3. The activity indicator 107 in the in-memory copy of the entry is updated.
4. If t0v has not changed, the liveness indicator 107 is atomically updated in the key-value store and the version is updated to (t 0v + 1) (checked and set). If it fails, the operation is restarted after a delay.
The "write new data" (append) use case is now described. The use case preferably comprises the steps of:
1. A container 102 is selected as the append target.
2. The level 0 entry of the container 102 is retrieved from the key-value store. If the level 0 state is not "normal", a different container 102 is selected, starting again at #1.
3. If t0v has not changed, the state is atomically changed to "write" and the version is changed to (t0v + 1) (check and set). If this fails, a different container 102 is selected, starting again at #1.
4. A new first piece of metadata 104 and a new data segment 103 are appended to the layer 2 file "container-ID.<t2v>".
5. A new layer 1 file with file name "container-ID.<t1v + 1>" is written, which contains the old content plus the newly appended first pieces of metadata 104.
6. A new layer 0 entry is constructed from the second piece of metadata 106, with the activity indicators 107 expanded to include counts for the appended segments, the state "normal", and the versions (t0v + 2), (t1v + 1), (t2v). If the version is still (t0v + 1), this entry is stored atomically over the old level 0 entry (check and set).
7. If this update fails, a different container 102 is selected, starting again at #1.
8. The "container-ID.<t1v>" file is removed.
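A sketch of this append flow under the same assumptions as before (the tier directories, the byte-level file format, and the initial reference count of 1 for new segments are illustrative choices, not specified by the text):

```python
import copy
import os

def append_segments(store, container_id, new_first_metadata, new_segments,
                    tier1_dir="/tier1", tier2_dir="/tier2"):
    """new_first_metadata: serialized first pieces of metadata; new_segments: raw segment bytes."""
    entry, t0v = store.get(container_id)
    if entry is None or entry.state != "normal":
        return False                                          # steps 1-2: pick another container
    locked = copy.deepcopy(entry)
    locked.state = "write"
    if not store.check_and_set(container_id, t0v, locked):    # step 3: lock via check and set
        return False
    # Step 4: append first metadata and data segments to the layer 2 file "container-ID.<t2v>".
    with open(os.path.join(tier2_dir, f"{container_id}.{entry.t2v}"), "ab") as f2:
        for meta, data in zip(new_first_metadata, new_segments):
            f2.write(meta + data)
    # Step 5: write a new layer 1 file "container-ID.<t1v + 1>" = old content + new first metadata.
    old_t1 = os.path.join(tier1_dir, f"{container_id}.{entry.t1v}")
    new_t1 = os.path.join(tier1_dir, f"{container_id}.{entry.t1v + 1}")
    with open(old_t1, "rb") as src, open(new_t1, "wb") as dst:
        dst.write(src.read())
        for meta in new_first_metadata:
            dst.write(meta)
    # Step 6: build the new level 0 entry with counts for the appended segments.
    finished = copy.deepcopy(locked)
    finished.activity_indicators += [1] * len(new_segments)   # new segments start at 1 in this sketch
    finished.state = "normal"
    finished.t1v += 1                                          # t2v stays the same for an append
    if not store.check_and_set(container_id, t0v + 1, finished):
        return False                                           # step 7: give up on this container
    os.remove(old_t1)                                          # step 8: drop the old layer 1 file
    return True
```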
Next, the "defragmentation" use case is described. The use case preferably comprises the steps of:
1. The level 0 entry of the container 102 is retrieved from the key-value store.
2. If the layer 0 state is not "normal", the operation is restarted after a delay.
3. If t0v has not changed, the state is atomically changed to "defragmentation" and the version is changed to (t0v + 1) (check and set). If this fails, the operation is restarted after a delay.
4. A new layer 2 file with file name "container-ID.<t2v + 1>" is written, which contains only the segment data (first segment metadata 104 and data segments 103) with a positive activity indicator 107.
5. A new layer 1 file with file name "container-ID.<t1v + 1>" is written, containing only the first pieces of metadata 104 with a positive activity indicator 107.
6. A new level 0 entry is constructed using only the non-zero activity indicators 107, the state "normal", and the versions (t0v + 2), (t1v + 1), (t2v + 1). If the version is still (t0v + 1), this entry is stored atomically over the old level 0 entry (check and set).
7. If the update fails, the operation is restarted after a delay.
8. The "container-ID.<t1v>" and "container-ID.<t2v>" files are removed; these two files are no longer needed.
The "read (restore)" use case is now described. The use case preferably includes the following steps.
1. The level 0 entry for the container 102 is retrieved from the key-value store to find the current value of t2v.
2. The "container-ID.<t2v>" file is read from layer 2 into memory.
3. This may fail under a race condition if the container file is being updated. In that case, the process is restarted from #1.
4. The required data segments 103 are extracted from the in-memory copy of "container-ID.<t2v>".
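A sketch of this consistent read, again built on the earlier hypothetical helpers (error handling is reduced to the race described in step 3):

```python
import os

def read_container(store, container_id, tier2_dir="/tier2"):
    """Read the current layer 2 file of the container, retrying if it is replaced mid-read."""
    while True:
        entry, _ = store.get(container_id)                     # step 1: find the current t2v
        try:
            with open(os.path.join(tier2_dir, f"{container_id}.{entry.t2v}"), "rb") as f:
                return f.read()                                # steps 2 and 4: caller extracts segments
        except FileNotFoundError:
            continue                                           # step 3: raced with an update; retry
```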
The "primary recovery" use case will now be described as an exemplary recovery process. This process may be performed on any container 102 that has a non "normal" state and t0v has not been incremented in < n + m > time units. This indicates that the agent performing the modification terminates in the middle of the process. It is worth noting that if they fail, there are only three use cases that can put the container in an unconventional state: "new data written": write a new version of the layer 1 file but attach to the layer 2 file; "defragmentation": writing new versions of files in the layer 1 and the layer 2; "resume" (this use case was not a previous use case): no new data is written, and only the two above use cases are rolled back.
In all cases, the original layer 1 and layer 2 files still exist or can be recovered by truncating them. The use case preferably comprises the steps of:
1. The level 0 entry of the container is retrieved from the key-value store to find the current values of t1v and t2v and the number of data segments 103 (<q>) in the container 102.
2. If t0v has not changed, the state is atomically changed to "write" and the version is changed to (t0v + 1) (check and set). If the update fails, the operation is restarted after a delay.
3. The contents of the "container-ID.<t1v>" file in layer 1 are read to determine the size <s> that "container-ID.<t2v>" should have if it contained only the first <q> data segments 103.
4. If the "container-ID.<t2v>" file is larger than <s>, it is truncated to size <s>.
5. If the "container-ID.<t1v>" file has more than <q> data segments 103, it is truncated to contain only <q> data segments 103 (this should not occur).
6. A new layer 0 entry is constructed with the state "normal" and the versions (t0v + 2), t1v, and t2v. If the version is still (t0v + 1), this entry is stored atomically over the old level 0 entry (check and set). If the update fails, the operation is restarted after a delay.
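A sketch of primary recovery using the same hypothetical helpers; size_for_first_q is an assumed function that computes, from a layer 1 file, the layer 2 size <s> covering only the first <q> segments:

```python
import copy
import os

def primary_recovery(store, container_id, size_for_first_q,
                     tier1_dir="/tier1", tier2_dir="/tier2"):
    """Roll a crashed append or defragmentation back to a consistent "normal" container."""
    entry, t0v = store.get(container_id)                        # step 1: t1v, t2v and <q>
    q = len(entry.activity_indicators)                          # segment count per the level 0 entry
    locked = copy.deepcopy(entry)
    locked.state = "write"
    if not store.check_and_set(container_id, t0v, locked):      # step 2
        return False
    t1_path = os.path.join(tier1_dir, f"{container_id}.{entry.t1v}")
    t2_path = os.path.join(tier2_dir, f"{container_id}.{entry.t2v}")
    s = size_for_first_q(t1_path, q)                            # step 3: expected layer 2 size
    if os.path.getsize(t2_path) > s:
        os.truncate(t2_path, s)                                 # step 4: drop the partial append
    # Step 5 (truncating an over-long layer 1 file) is omitted; per the text it should not occur.
    finished = copy.deepcopy(entry)
    finished.state = "normal"                                   # step 6: back to "normal";
    return store.check_and_set(container_id, t0v + 1, finished) # t1v and t2v stay unchanged
```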
A "secondary recovery" use case is now described as an exemplary recovery process. This process may be performed on any container 102 that has a non "normal" state and t0v has not been incremented in < n + m > time units. This indicates that the modification is abandoned in the process for any reason. Notably, t0v is always guaranteed to be greater than t1v and t2v. The use case preferably comprises the steps of:
1. The level 0 entry of the container 102 is retrieved from the key-value store to find the current values of t1v and t2v and the number (<q>) of data segments 103 in the container 102.
2. If t0v has not changed, the state is atomically changed to "write" and the version is changed to (t0v + 1) (check and set).
3. (t0v + 1) is used for the new values t1v' and t2v'. Invariant: t1v < t1v' and t2v < t2v'.
4. The "container-ID.<t2v>" file is read into memory. The first <q> data segments 103 are written to a new "container-ID.<t2v'>" file in layer 2.
5. A new "container-ID.<t1v'>" file is reconstructed in layer 1 from the first pieces of metadata 104 read in step #3.
6. A new level 0 entry is constructed with the state "normal" and the versions (t0v + 2), t1v', and t2v'. If the version is still (t0v + 1), this entry is stored atomically over the old level 0 entry (check and set).
7. If the update fails, the operation is restarted after a delay.
8. All "container-ID. < k >" files in layer 1 are removed, where k < t1v'.
9. All "container-ID. < p >" files in layer 2 are removed, where p < t2v'.
It should be noted that, in the present invention, "atomically" is defined as follows: an update to a set of data items stored in a standby node and replicated between standby nodes is atomic if and only if the entire update becomes visible to all standby nodes simultaneously. In other words, any and all standby nodes that retrieve the set of data items receive either the consistent set of data items as it was prior to the update (if queried before the point in time of the update) or the consistent set of updated data items (if queried after the point in time of the update).
FIG. 7 illustrates a method 700 of storing deduplicated data, i.e., a method 700 of storing received data chunks as deduplicated data chunks. The method 700 may be performed by or in the network system 100 shown in the previous figures. In particular, the method 700 comprises a method step 701: one or more containers 102 are stored in the common network accessible repository 101, each container 102 including one or more data segments 103 and first segment metadata 104 for each data segment 103. The method 700 further comprises a method step 702: second segment metadata 106 for each data segment 103 of at least one container 102 in the repository 101 is stored in the plurality of standby nodes 105, the second segment metadata 106 including at least an activity indicator 107 for each data segment 103 of the container 102. It is obvious that the repository 101 mentioned in the method 700 may be the repository 101 shown in fig. 1 or fig. 2 or the file server 101 of the network system 100 shown in fig. 3 and 6. The plurality of standby nodes 105 referred to in the method 700 may be the standby nodes 105 of the network system 100 shown in figs. 1, 2, 3, and 6.
In general, the present invention provides a storage architecture for reliably storing deduplicated data. To this end, the present invention splits the deduplicated data and stores it across storage tiers. Updates to the activity indicators 107 of the segments are fast and transactional. Performance is improved because the storage characteristics match the usage patterns. Reading the first piece of metadata 104 (e.g., a hash of the data segment 103) is very fast. Costs are reduced by using cheaper storage for the large data segments 103. Since large numbers of data segments 103 are read sequentially from storage that is optimal for them, restores are very fast. Storage usage across the storage tiers is tied together by the version numbers 109 to 111 and by leases, which guarantee that no data is lost, that all data is always readable, and that interrupted operations can be restarted.
Thus, the present invention illustrates how global deduplication can be scaled linearly and distributed to an unlimited number of servers in a cluster with fault tolerance, while still maintaining performance and reliability without increasing cost. This is achieved without changing the hardware specification of traditional servers that do not support distributed global deduplication.
The invention has been described in connection with various embodiments and implementations as examples. Other variations will be understood and effected by those skilled in the art in practicing the claimed invention, from a study of the drawings, the disclosure, and the independent claims. In the claims as well as in the description, the term "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.

Claims (15)

1. A network system (100) for storing deduplicated data, the network system (100) comprising:
a common network accessible repository (101), the repository (101) storing one or more containers (102), each container (102) comprising one or more data segments (103) and a first segment metadata (104) for each data segment (103); and
a plurality of standby nodes (105), a standby node (105) storing, for at least one container (102) in the repository (101), a second segment metadata (106) for each data segment (103) of the container (102), the second segment metadata (106) including at least an activity indicator (107) for each data segment (103) of the container (102), the plurality of standby nodes (105) for replicating the activity indicators (107) for each data segment (103) of the container (102) with each other.
2. The network system (100) of claim 1,
the activity indicator (107) of a data segment (103) is a reference count of the data segment (103) indicating a number of deduplicated data chunks referencing the data segment (103).
3. The network system (100) of claim 1,
the second piece of metadata (106) of the at least one container (102) further comprises a state (108) of the container (102), and
the state (108) of the container (102) is "normal" if any agent is currently able to update the container (102), and the state (108) is "locked" if no agent other than the agent currently locking the container (102) is able to update the container (102).
4. The network system (100) of claim 1,
the second piece of metadata (106) of each container (102) further includes a current version number (109) of the second piece of metadata (106) of the container (102).
5. The network system (100) of claim 1,
the second piece of metadata (106) of each container (102) further includes a current version number (110) of the first piece of metadata (104) of the container (102) and a current version number (111) of the data segment (103) of the container (102).
6. The network system (100) according to claim 4 or 5,
the plurality of standby nodes (105) are configured to cooperate to atomically and consistently update any subset of the second piece of metadata (106) of at least one container (102), namely the state (108) of the container (102), the version number (110) of the first piece of metadata (104), the version number (111) of the data segments (103), and/or a plurality of activity indicators (107) of the second piece of metadata (106) of the container (102), if the version number (109) of the second piece of metadata (106) has not changed.
7. The network system (100) according to any one of claims 3-5, further comprising:
at least one agent for modifying the data segments (103), the first segment of metadata (104) and/or the second segment of metadata (106) of a container (102),
wherein the at least one agent is configured to restore the "normal" state (108) of a container (102) based on the current version numbers (110, 111) of the first piece of metadata (104) and of the data segments (103), respectively, if the container is in a "locked" state (108) before the restoration and the version number (109) of the second piece of metadata (106) has not changed for a period of time greater than a determined threshold.
8. The network system (100) of claim 7,
the at least one agent is configured to modify the respective version number (109, 110, 111) of a container (102) after modifying the second piece of metadata (106), the first piece of metadata (104), and/or the data segments (103) of the container (102), respectively.
9. The network system (100) of claim 7,
any operations on the data segments (103) and the first and second pieces of metadata (104, 106) are restricted such that the "normal" state (108) of the container (102) can be restored after a crash at any point of the operations.
10. The network system (100) of claim 7, wherein, for restoring the "normal" state (108) of a container (102), the at least one agent is configured to:
retrieve, from the second piece of metadata (106) of the container (102), the current version numbers (110, 111) of the first piece of metadata (104) and of the data segments (103), and the number of data segments (103) of the container (102),
atomically and consistently increment the version number (109) of the second piece of metadata (106) if the version number (109) of the second piece of metadata (106) has not changed since the retrieving,
read, from the repository (101), the first piece of metadata (104) having the current version number (110) of the first piece of metadata (104) as retrieved with the second piece of metadata (106),
determine, from the read first piece of metadata (104), the size that the data segments (103) of the container (102) having the current version number (111) of the data segments (103) of the container (102) should have,
truncate the data segments (103) of the container (102) to the determined size if the size of the data segments (103) is larger than the determined size, and/or
truncate the first piece of metadata (104) of the container (102) according to the retrieved number of data segments (103), and
atomically and consistently reset the state (108) of the container (102) to "normal" and increment the version number (109) of the second piece of metadata (106) if the version number (109) of the second piece of metadata (106) has not changed in the meantime.
11. The network system (100) according to any one of claims 3-5,
data may be recovered from a container (102) in a "normal" state (108) or in a "locked" state (108) at any time.
12. The network system (100) of claim 1,
the repository (101) stores the data segments (103) of each container (102) in a first storage device, and stores the first segment metadata (104) of each container (102) in a second storage device, which may be different from the first storage device.
13. The network system (100) of claim 12,
the first storage also stores, for each container (102), the first piece of metadata (104) for the container (102).
14. The network system (100) according to claim 12 or 13,
the read latency of the second storage device is not higher than the read latency of the first storage device, and/or
preferably, the second storage device is a solid state disk or solid state drive, SSD, or a serial attached SCSI, SAS, storage device, and/or the first storage device is a serial advanced technology attachment, SATA, storage device.
15. A method (700) of storing deduplicated data, the method (700) comprising:
storing (701) one or more containers (102) in a common network accessible repository (101), each container (102) comprising one or more data segments (103) and a first segment metadata (104) for each data segment (103), and
storing (702) second segment metadata (106) for each data segment (103) of at least one container (102) in the repository (101) in a plurality of standby nodes (105), the second segment metadata (106) including at least an activity indicator (107) for each data segment (103) of the container (102), the plurality of standby nodes (105) replicating the activity indicators (107) for each data segment (103) of the container (102) with each other.
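For illustration only, the recovery procedure outlined in claim 10 can be sketched on a simple in-memory stand-in as follows; the ContainerState class, the recover_normal function, and the single-process version checks are assumptions of this sketch and are not taken from the claims, which leave the concrete implementation open.

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class FirstEntry:
        # One entry of the first piece of metadata 104: digest and size of a data segment 103.
        digest: str
        size: int

    @dataclass
    class ContainerState:
        # In-memory stand-in for one container 102 and its metadata.
        data: bytearray = field(default_factory=bytearray)      # concatenated data segments 103
        first: List[FirstEntry] = field(default_factory=list)   # first piece of metadata 104
        state: str = "locked"                                    # state 108
        meta_version: int = 0                                    # version number 109
        segment_count: int = 0                                   # segment count recorded in 106

    def recover_normal(c: ContainerState, observed_version: int) -> bool:
        # observed_version: version number 109 as last seen by the recovering agent,
        # unchanged for longer than the lease threshold (compare claim 7).
        if c.meta_version != observed_version:
            return False                 # another agent made progress; abort recovery
        c.meta_version += 1              # claim the recovery (first bump of version 109)
        claimed_version = c.meta_version
        # Determine from the first metadata 104 the size the data segments 103
        # should have, and truncate any partially written tails.
        expected_size = sum(entry.size for entry in c.first[:c.segment_count])
        if len(c.data) > expected_size:
            del c.data[expected_size:]
        del c.first[c.segment_count:]
        # Reset state 108 to "normal" and bump version 109 again, provided the
        # version has not changed in the meantime (trivially true in one process;
        # a distributed system would use a compare-and-swap against the standby nodes).
        if c.meta_version != claimed_version:
            return False
        c.state = "normal"
        c.meta_version += 1
        return True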
CN201780093463.9A 2017-08-25 2017-08-25 Network system and method for data de-duplication Active CN110945483B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2017/071464 WO2019037876A1 (en) 2017-08-25 2017-08-25 Network system and method for deduplicating data

Publications (2)

Publication Number Publication Date
CN110945483A (en) 2020-03-31
CN110945483B (en) 2023-01-13

Family

ID=59702730

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201780093463.9A Active CN110945483B (en) 2017-08-25 2017-08-25 Network system and method for data de-duplication

Country Status (2)

Country Link
CN (1) CN110945483B (en)
WO (1) WO2019037876A1 (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8041907B1 (en) * 2008-06-30 2011-10-18 Symantec Operating Corporation Method and system for efficient space management for single-instance-storage volumes
US8694469B2 (en) * 2009-12-28 2014-04-08 Riverbed Technology, Inc. Cloud synthetic backups
CN102982180B (en) * 2012-12-18 2016-08-03 华为技术有限公司 Date storage method and equipment
US8954398B1 (en) * 2013-01-04 2015-02-10 Symantec Corporation Systems and methods for managing deduplication reference data
CN105917304A (en) * 2014-12-09 2016-08-31 华为技术有限公司 Apparatus and method for de-duplication of data
CN106066896B (en) * 2016-07-15 2021-06-29 中国人民解放军理工大学 Application-aware big data deduplication storage system and method

Also Published As

Publication number Publication date
CN110945483A (en) 2020-03-31
WO2019037876A1 (en) 2019-02-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant