WO2021104638A1

WO2021104638A1 - Devices, system and methods for optimization in deduplication

Info

Publication number: WO2021104638A1
Application number: PCT/EP2019/083040
Authority: WO
Inventors: Yaron MOR; Assaf Natanzon; Aviv Kuvent; Asaf Yeger
Original assignee: Huawei Technologies Co., Ltd.
Priority date: 2019-11-29
Filing date: 2019-11-29
Publication date: 2021-06-03
Also published as: CN113227993A

Abstract

An advanced deduplication method, in particular one with an additional deduplication tier, is disclosed. Specifically, the disclosure proposes a global server for deduplicating multiple storage servers. The global server is configured to maintain information regarding a set of hash values, each hash value being associated with a data chunk of data stored in the global server and/or the storage servers, and to notify the storage servers of a first range of the hash values. The global server is further configured to receive, from one or more of the storage servers, a request to modify the information with respect to one or more hash values falling into the first range of the hash values. The global server is further configured to modify the information with respect to the one or more hash values falling into the first range of the hash values, based on the request received from the one or more storage servers. The disclosure further proposes a storage server for deduplicating at a global server.

Description

DEVICES, SYSTEM AND METHODS FOR OPTIMIZATION IN DEDUPLICATION

TECHNICAL FIELD

The present disclosure relates to a data storage deduplication method, in particular to an optimized memory management method in a global deduplication server (GDS, also simply called a global server). The disclosure addresses a performance degradation by optimizing the memory management in the GDS.

BACKGROUND

Data deduplication (also referred to as data optimization) refers to reducing the number of bytes of data that need to be stored on disk or transmitted across a network, without compromising the fidelity or integrity of the original data, i.e., the reduction in bytes is lossless and the original data can be completely recovered. By reducing the resources needed to store and/or transmit data, data deduplication leads to savings in hardware costs (for storage and network transmission) and data-management costs (e.g., backup). As the amount of digitally stored data grows, these cost savings become significant.

Data deduplication typically uses a combination of techniques to eliminate redundancy within and between persistently stored files. One technique operates to identify identical regions of data in one or multiple files, and to physically store only one unique region (referred to as a chunk), while maintaining a pointer to that chunk in association with the file. Another technique is to mix data deduplication with compression, e.g., by storing compressed chunks.

Many organizations use dedicated servers to store data (i.e. storage servers). Data stored by different servers is often duplicated, resulting in space loss. A solution for this problem is deduplication, which includes storing only unique data up to a certain granularity, by using hashes to identify duplicates. However, such deduplication is performed at the granularity of a single storage server.

To prevent duplication of data across multiple storage servers, a concept of deduplication of deduplication (nested deduplication), including an additional deduplication tier (performing deduplication of multiple deduplication servers) is introduced. In particular, a GDS is proposed, which will store highly-duplicated data, as a solution to the problem of performing deduplication across multiple storage servers.

The GDS stores hashes sent by the storage servers which comply with the nested deduplication (i.e. a storage servers cluster), to determine whether a hash value appears in enough storage servers to warrant having the GDS take ownership of the data represented by this hash value. The hash value can be used to uniquely identify a respective data chunk. Since the GDS stores all hash values of the storage servers in the storage servers cluster (regardless of whether it also stores their data), this results in a large storage space required to store all these hashes. Consequently, holding all hashes in memory of the GDS is not possible, which further affects the performance of the GDS, in particular when replying to requests to store or delete hash values.

The standard solution for such a problem is caching of data in memory. Due to the way the GDS architecture is constructed (multiple storage servers are formed as the storage servers cluster communicating with one or more GDS), without a proper caching policy and method, this may result in many “cache-misses”, which will force disk-access each time.

This disclosure aims to solve this performance degradation by optimizing the memory management efficiency of the GDS when the hash values are stored in the memory of the GDS.

SUMMARY

In view of the above-mentioned problems, embodiments of the present disclosure aim to provide a data management solution, which optimizes a response time of a global server by reducing cache misses. An objective is to allow reading hash values more from memory and less from disk. Another aim is to reduce I/O operations.

The objective is achieved by the embodiments provided in the enclosed independent claims. Advantageous implementations of the embodiments of the present disclosure are further defined in the dependent claims.

A first aspect of the disclosure provides a global server for deduplicating multiple storage servers, wherein the global server is configured to: notify a first range of a set of hash values to the storage servers, wherein information regarding the set of hash values is maintained, and each hash value being associated with a data chunk of data stored in the global server and/or the storage servers; receive, from one or more of the storage servers, a request to modify the information with respect to one or more hash values falling into the first range of the hash values; and modify the information with respect to the one or more hash values falling into the first range of the hash values, based on the request received from the one or more storage servers.

This disclosure provides a global server that stores highly-duplicated data in a nested deduplication system. The nested deduplication system may comprise a global server and multiple storage servers. In particular, this disclosure proposes a communication manner between the global server and the multiple storage servers in which the global server initiates the communication. That is, a specific range of hash values is provided to the storage servers, and only requests with respect to hash values falling into that range can be reported to the global server. In this way, the global server controls which hash values are requested. As a consequence, cache misses in the global server are reduced, and thus the response time of the global server is reduced.

The term “global server” is an abbreviation of “global deduplication server”, and refers to a server for handling the highly-duplicated data in a storage system comprising multiple deduplication servers. In an implementation, the GDS can be implemented as a centralized device (e.g., a server), or deployed in one storage server of the multiple storage servers, or implemented in a distributed manner (e.g., a plurality of servers constitute a “virtual” global deduplication server).

In an implementation form of the first aspect, information regarding the set of hash values may be maintained in the global server or in a separate storage device accessible to the global server. This increases the flexibility in implementing the deduplication system.

In an implementation form of the first aspect, the global server is configured to send, to the storage servers, a broadcast message carrying the first range of the hash values. In particular, the global server may notify the storage servers of the hash values range that it is willing to accept, by using a broadcast message carrying that range. This improves the efficiency of the deduplication system.

In an implementation form of the first aspect, the global server comprises a disk storage, and the information comprises a hash metadata table including the set of hash values, the hash metadata table being stored in the disk storage.

A table storing hash values and information related to respective hash values, namely the hash metadata table, is stored in the local disk of the global server.

In an implementation form of the first aspect, the global server is configured to: divide the hash metadata table into N parts, N being a positive integer no less than 2, wherein each part of the hash metadata table is associated with a different range of the hash values.

It should be noted that, the hash metadata table is a sorted table, where all hash values are stored in order. For instance, the hash values may be stored in ascending order or in descending order in the table. When the hash metadata table is divided into parts, the hash values are divided into different ranges accordingly.
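The division described above can be illustrated with a minimal Python sketch. It splits a sorted collection of hash values into N contiguous parts and reports the range covered by each part. The function name `split_sorted_hashes` and the use of small integers as hash values are assumptions made for illustration only; the disclosure does not prescribe a particular data layout.

```python
from typing import List, Tuple


def split_sorted_hashes(hashes: List[int], n: int) -> List[Tuple[int, int, List[int]]]:
    """Split hash values into n contiguous parts (hypothetical helper).

    Returns one (lowest, highest, values) tuple per non-empty part,
    in ascending order, so the ranges of different parts never overlap.
    """
    if n < 2:
        raise ValueError("N must be an integer no less than 2")
    ordered = sorted(hashes)  # the table is kept in ascending order
    base, extra = divmod(len(ordered), n)
    parts = []
    start = 0
    for i in range(n):
        # Spread any remainder over the first parts so sizes differ by at most 1.
        size = base + (1 if i < extra else 0)
        values = ordered[start:start + size]
        start += size
        if values:
            parts.append((values[0], values[-1], values))
    return parts


# Six illustrative hash values divided into N = 3 parts.
parts = split_sorted_hashes([0x9F, 0x03, 0x41, 0xC7, 0x12, 0x88], 3)
```

In this sketch each part is described by the lowest and highest hash value it contains, which is one possible way of expressing the range associated with a part.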

In an implementation form of the first aspect, the global server further comprises a memory, wherein the global server is further configured to: upload a first part of the hash metadata table to the memory, wherein the first part of the hash metadata table is associated with the first range of the hash values; and modify the first part of the hash metadata table based on the request received from the one or more storage servers.

The range of hash values is divided into N parts, since the hash metadata table is divided into N parts. The global server will go over all parts, for instance in a cyclic fashion. The global server requests the storage servers to provide it with the hash values that they hold and that are contained in a respective part. Notably, the respective part is the part that is currently uploaded in the memory. Thus, the global server is able to read more from memory and less from disk.

In an implementation form of the first aspect, the hash metadata table comprises information about each hash value in the set of hash values and information regarding one or more storage servers being registered for that hash value.

The embodiments of this disclosure are based on a fact that when a storage server requests the global server to add a hash value, the global server will register that storage server for that hash value.

In an implementation form of the first aspect, the global server is configured to:

- in response to the request received from a storage server comprising a request to add a first hash value:

- in response to the first hash value not being included in the first part of the hash metadata table, add the first hash value into the first part of the hash metadata table, create a first water mark associated with the first hash value, and register the storage server that sent the request regarding the first hash value, wherein the first water mark indicates whether a data chunk having the first hash value is highly duplicated among the storage servers;

- in response to the first hash value being included in the first part of the hash metadata table, increase a value of the first water mark associated with the first hash value, and register the storage server that sent the request regarding the first hash value; and/or

- in response to the request received from a storage server comprising a request to delete a second hash value:

- decrease a value of a second water mark associated with the second hash value, and unregister the storage server that sent the request regarding the second hash value, wherein the second water mark indicates whether a data chunk having the second hash value is highly duplicated among the storage servers; and

- in response to the value of the second water mark being equal to 0, delete the second hash value from the first part of the hash metadata table.

Generally speaking, the global server creates or increases a water mark associated with a hash value, upon receiving a request to add that hash value, and registers the storage server, which sent the request, for that hash value. Accordingly, the global server decreases a value of a water mark associated with a hash value of a data when receiving a request from a storage server to remove the data, and unregisters the storage server for that hash value. Notably, according to embodiments of the present disclosure, a request received from the storage servers during a time period is related to a hash value falling into a specific range that is in the memory of the global server during the same time period.
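The add and delete handling summarized above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the names `HashEntry`, `handle_add` and `handle_delete`, the use of strings for hash values and server identifiers, and the choice to ignore a delete request for an unknown hash value are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class HashEntry:
    """One row of the in-memory part of the hash metadata table (hypothetical layout)."""
    water_mark: int = 0
    registered: Set[str] = field(default_factory=set)


def handle_add(part: Dict[str, HashEntry], hash_value: str, server_id: str) -> None:
    """Create or increase the water mark and register the requesting server."""
    entry = part.get(hash_value)
    if entry is None:
        entry = HashEntry()          # hash value not yet in this part: add it
        part[hash_value] = entry
    entry.water_mark += 1            # a newly created water mark ends up at 1
    entry.registered.add(server_id)


def handle_delete(part: Dict[str, HashEntry], hash_value: str, server_id: str) -> None:
    """Decrease the water mark, unregister the server, drop the row at 0."""
    entry = part.get(hash_value)
    if entry is None:
        return                       # assumption: unknown hash values are ignored
    entry.water_mark -= 1
    entry.registered.discard(server_id)
    if entry.water_mark == 0:
        del part[hash_value]


# Two storage servers add the same hash value, then both delete it.
part: Dict[str, HashEntry] = {}
handle_add(part, "ab12", "A")
handle_add(part, "ab12", "B")
mark_after_adds = part["ab12"].water_mark
handle_delete(part, "ab12", "A")
remaining = set(part["ab12"].registered)
handle_delete(part, "ab12", "B")
```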

In an implementation form of the first aspect, the global server is configured to: persist the updated first part of the hash metadata table to the disk storage.

After updating a respective part of the hash metadata table, which is currently in the memory of the global server, this part of the hash metadata table will be persisted back to the disk storage of the global server. That is, the updated part overwrites the old data of the same part.

In an implementation form of the first aspect, the global server is configured to: modify the information with respect to the one or more hash values falling into the first range of the hash values, based on all the requests received from the one or more storage servers in a predetermined time period.

In particular, after uploading a part of the hash metadata table to the memory, all requests received from the storage servers during a specific time period should be processed by the global server. That is, before persisting the updated part to the disk storage, the global server needs to handle each request to add or delete a hash value, and to act according to each request (i.e., to modify the uploaded part of the hash metadata table).
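The sequence of uploading a part, collecting the requests for its range, applying them in memory and persisting the part can be sketched as a single pass over all parts. The sketch below rests on several assumptions: a Python list stands in for the disk storage, each part maps an integer hash value to its water mark only, and the hypothetical callable `collect` stands in for both the notification of the range and the receipt of all requests within the predetermined time period.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Request = Tuple[str, int, str]   # (operation, hash value, storage server id)
Part = Dict[int, int]            # hash value -> water mark


def run_cycle(
    disk: List[Part],
    ranges: List[Tuple[int, int]],
    collect: Callable[[int, int], Iterable[Request]],
) -> None:
    """Go over all parts once, handling one part in memory at a time."""
    for index, (low, high) in enumerate(ranges):
        in_memory = dict(disk[index])            # upload the part to memory
        for operation, hash_value, _server in collect(low, high):
            if not low <= hash_value <= high:
                continue                         # assumption: out-of-range requests are skipped
            if operation == "add":
                in_memory[hash_value] = in_memory.get(hash_value, 0) + 1
            elif operation == "delete" and hash_value in in_memory:
                in_memory[hash_value] -= 1
                if in_memory[hash_value] == 0:
                    del in_memory[hash_value]
        disk[index] = in_memory                  # persist: overwrite the old copy of the part


# Illustrative requests held by the storage servers.
pending: List[Request] = [
    ("add", 5, "A"), ("add", 200, "B"), ("add", 5, "C"), ("delete", 130, "A"),
]


def collect(low: int, high: int) -> List[Request]:
    # Stand-in for the storage servers answering a range notification.
    return [r for r in pending if low <= r[1] <= high]


disk: List[Part] = [{}, {130: 1, 200: 2}]
run_cycle(disk, [(0, 127), (128, 255)], collect)
```

Because every request applied while a part is in memory falls into that part's range, no lookup in this sketch needs to touch a part that is still on disk.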

A second aspect of the present disclosure provides a storage server for deduplicating at a global server, wherein the storage server is configured to: in response to a request to add or delete a first data chunk, record the request associated with a first hash value of the first data chunk, wherein the first hash value is included in a set of hash values, the set of hash values is maintained, and each hash value is that of a stored data chunk; receive, from the global server, a notification of a first range of hash values; and send, to the global server, the recorded request to modify information in the global server with respect to the first hash value, if the first hash value falls into the first range of hash values.

In this topology, multiple storage servers may be connected to the global server. Each storage server may operate in a similar way. Notably, the storage server communicates with the global server in a manner in which the global server initiates the communication. In particular, the global server notifies the storage server of the manner of sending requests. For instance, the global server may instruct the storage server when to send requests and which requests can be sent. Accordingly, before sending a request to the global server, the storage server needs to store all requests received from users. The storage server supports the global server in reducing cache misses, and thus the response time of the global server can be optimized. In particular, the latency of the global server can be reduced.

In an implementation form of the second aspect, the storage server is configured to: delete the recorded request after sending the request to the global server.

Notably, after a request that is related to a hash value falling into the specific range provided by the global server has been sent to the global server, it is assumed that the global server will handle this request and act accordingly. To avoid sending the same request to the global server again, the storage server deletes from the record those requests that have been sent to the global server.

In an implementation form of the second aspect, the storage server is configured to: after deleting the recorded request, maintain information about a storage location of the first data chunk, in association with the maintained first hash value, wherein the storage location of the first data chunk is the storage server and/or the global server.

In particular, the storage server deletes the recorded request (which was saved until being handled by the global server); however, the storage server keeps a record regarding the hash value. For instance, information regarding whether the data associated with that hash value is saved locally on the storage server and/or at the global server, as well as a reference count associated with the hash value that indicates how often it is required among the storage server’s users, will be kept in the storage server.
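The storage-server behaviour described above (recording requests, sending those that fall into the notified range, deleting the sent requests while keeping the per-hash record) can be sketched as follows. The class `StorageServerLog`, its method names, and the representation of the storage location as the strings "local" or "global" are hypothetical and serve only as an illustration.

```python
from typing import Dict, List, Tuple


class StorageServerLog:
    """Hypothetical request log kept by a storage server between notifications."""

    def __init__(self) -> None:
        self.pending: List[Tuple[str, int]] = []   # recorded (operation, hash value)
        self.location: Dict[int, str] = {}         # hash value -> "local" or "global"

    def record(self, operation: str, hash_value: int) -> None:
        """Record a user-triggered add or delete until the global server asks for it."""
        self.pending.append((operation, hash_value))
        if operation == "add":
            # Assumption: newly written data starts out stored locally.
            self.location.setdefault(hash_value, "local")

    def on_notification(self, low: int, high: int) -> List[Tuple[str, int]]:
        """Return the recorded requests in [low, high] and drop them from the log."""
        to_send = [r for r in self.pending if low <= r[1] <= high]
        self.pending = [r for r in self.pending if not low <= r[1] <= high]
        return to_send   # the per-hash location record is deliberately kept


log = StorageServerLog()
log.record("add", 10)
log.record("delete", 300)
log.record("add", 42)
sent = log.on_notification(0, 100)
```

After the notification, only the request outside the range remains recorded, while the location information for the sent hash values is still available to the storage server.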

In an implementation form of the second aspect, the storage server is configured to: receive, from the global server, a broadcast message carrying the first range of hash values.

In an implementation form of the second aspect, the storage server is configured to: compare the first range of hash values with the first hash value; and determine whether the first hash value falls into the first range of hash values. In particular, for each recorded request associated with a respective hash value, the storage server will determine whether this request should be sent to the global server.

In an implementation form of the second aspect, the request comprises a request to add the first hash value of the first data chunk, if the user requested adding the first data chunk; or the request comprises a request to delete the first hash value of the first data chunk, if the user requested deleting the first data chunk.

A third aspect of the present disclosure provides a method performed by a global server, wherein the method comprises: notifying a first range of a set of hash values to the multiple storage servers, wherein information regarding the set of hash values is maintained, and each hash value is associated with a data chunk of data stored in the global server and/or the storage servers; receiving, from one or more of the storage servers, a request to modify the information with respect to one or more hash values falling into the first range of the hash values; and modifying the information with respect to the one or more hash values falling into the first range of the hash values, based on the request received from the one or more storage servers.

The method of the third aspect and its implementation forms provide the same advantages and effects as described above for the global server of the first aspect and its respective implementation forms.

A fourth aspect of the present disclosure provides a method performed by a storage server, wherein the method comprises: in response to a request to add or delete a first data chunk, recording the request associated with a first hash value of the first data chunk, wherein the first hash value is included in a set of hash values, and each hash value being of a stored data chunk; receiving, from the global server, a notification of a first range of hash values; and sending, to the global server, the recorded request to modify information in the global server with respect to the first hash value, if the first hash value falls into the first range of hash values.

The method of the fourth aspect and its implementation forms provide the same advantages and effects as described above for the storage server of the second aspect and its respective implementation forms.

A fifth aspect of the present disclosure provides a computer program product comprising computer readable code instructions which, when run in a computer, will cause the computer to perform a method according to the third or fourth aspects and their implementation forms.

A sixth aspect of the present disclosure provides a computer readable storage medium comprising computer program code instructions, being executable by a computer, for performing a method according to the third or fourth aspects and their implementation forms when the computer program code instructions run on a computer. The computer readable storage medium comprises one or more from the group: ROM (Read-Only Memory), PROM (Programmable ROM), EPROM (Erasable PROM), Flash memory, EEPROM (Electrically EPROM) and hard disk drive.

A seventh aspect of the present disclosure provides a global server for deduplicating multiple storage servers, including a processor and a memory. The memory stores instructions that cause the processor to perform the method according to the third aspect and its implementation forms.

An eighth aspect of the present disclosure provides a storage server for deduplicating at a global server, including a processor and a memory. The memory stores instructions that cause the processor to perform the method according to the fourth aspect and its implementation forms.

It has to be noted that all devices, elements, units and means described in the present application could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof.

BRIEF DESCRIPTION OF DRAWINGS

The above described aspects and implementation forms of the present disclosure will be explained in the following description of specific embodiments in relation to the enclosed drawings, in which

FIG. 1 shows a global server according to an embodiment of the disclosure.

FIG. 2 shows a topology according to an embodiment of the disclosure.

FIG. 3 shows data storage in a global server according to an embodiment of the disclosure.

FIG. 4 shows a storage server according to an embodiment of the disclosure.

FIG. 5 shows a flowchart of a method according to an embodiment of the disclosure.

FIG. 6 shows a flowchart of another method according to an embodiment of the disclosure.

DETAILED DESCRIPTION OF EMBODIMENTS

Illustrative embodiments of a method, device, and program product for optimization in deduplication are described with reference to the figures. Although this description provides a detailed example of possible implementations, it should be noted that the details are intended to be exemplary and in no way limit the scope of the application.

Moreover, an embodiment/example may refer to other embodiments/examples. For example, any description including but not limited to terminology, element, process, explanation and/or technical advantage mentioned in one embodiment/example is applicable to the other embodiments/examples.

FIG. 1 shows a global server 100 according to an embodiment of the disclosure. The global server 100 may comprise processing circuitry (not shown) configured to perform, conduct or initiate the various operations of the global server 100 described herein. The processing circuitry may comprise hardware and software. The hardware may comprise analog circuitry or digital circuitry, or both analog and digital circuitry. The digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or multi-purpose processors. In one embodiment, the processing circuitry comprises one or more processors and a non-transitory memory connected to the one or more processors. The non-transitory memory may carry executable program code which, when executed by the one or more processors, causes the global server 100 to perform, conduct or initiate the operations or methods described herein.

The global server 100 is adapted for deduplicating multiple storage servers 110 (one of which is illustrated). The global server 100 may be configured to maintain information 101 regarding a set of hash values, each hash value being associated with a data chunk of data stored in the global server and/or the storage servers 110. The global server 100 is further configured to notify a first range 1011 of the hash values to the storage servers 110. Then, the global server 100 is configured to receive, from one or more of the storage servers 110, a request 111 to modify the information 101 with respect to one or more hash values falling into the first range 1011 of the hash values. Further, the global server 100 is configured to modify the information 101 with respect to the one or more hash values falling into the first range 1011 of the hash values, based on the request 111 received from the one or more storage servers 110.

As the skilled person in the storage field will appreciate, the information regarding the set of hash values can also be maintained in a separate device (e.g., a storage server) accessible to the global server 100. The above description of the global server is not to be treated as a limitation on the implementation of the global server 100.

The embodiments of this disclosure apply to a nested deduplication topology. FIG. 2 shows a topology of a global server 100 and storage servers A, B and C 110, according to an embodiment of the disclosure. Notably, an actual number of storage servers 110 in implementations of a nested deduplication topology is not limited herein. The global server 100 provides an additional deduplication tier. That is, a deduplication of multiple deduplication servers (i.e. storage servers) 110 is performed. Typically, a number of application servers have access to the respective storage server. A user could write/read data to the storage server through the application server.

A hash value of a data chunk can be obtained by performing a hash function or hash algorithm on the data chunk. The hash value can be used to uniquely identify the respective data chunk. This disclosure does not limit the types of hashing and chunking techniques used in the storage servers, as long as they are identical across all servers. When a user writes data to the storage server 110, the storage server 110 may perform chunking and hashing of the data, to obtain a hash value of the data chunk. Since data stored by multiple deduplication servers or storage servers is often duplicated, to avoid a space loss, the storage server may request to store some data in a GDS.
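As a minimal illustration of chunking and hashing, the following Python sketch cuts data into fixed-size chunks and computes a digest for each chunk. Fixed-size chunking, the 4096-byte chunk size and SHA-256 are assumptions chosen for the example; the disclosure leaves the chunking and hashing techniques open, requiring only that all servers use the same ones. The function name `chunk_and_hash` is likewise hypothetical.

```python
import hashlib
from typing import List, Tuple


def chunk_and_hash(data: bytes, chunk_size: int = 4096) -> List[Tuple[str, bytes]]:
    """Split data into fixed-size chunks and return (hex digest, chunk) pairs."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    pairs = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        # SHA-256 is an illustrative choice of hash function.
        pairs.append((hashlib.sha256(chunk).hexdigest(), chunk))
    return pairs


# Two identical 4096-byte chunks followed by a shorter, different chunk.
pairs = chunk_and_hash(b"A" * 8192 + b"B" * 100, chunk_size=4096)
```

Identical chunks yield identical hash values, which is what allows duplicates to be detected by comparing hash values rather than the data itself.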

In particular, the GDS shown in FIG. 2 is the global server 100 according to an embodiment of the disclosure, as shown in FIG. 1. Each of the storage servers A, B and C shown in FIG. 2 is the storage server 110 according to an embodiment of the disclosure, as shown in FIG. 1. The global server 100 aims to store the highly-duplicated data of the storage servers 110. Generally, the determination of highly-duplicated data is done by the global server 100, according to some configurable thresholds. The storage servers 110 communicate with the global server 100 and send requests to store data, which the global server 100 may accept or reject according to the configured thresholds. The highly-duplicated data may be stored in the global server 100. Accordingly, the storage servers 110 may remove the highly-duplicated data from their local storage and rely on the global server 100. In some scenarios, the storage servers may also keep a copy of the highly-duplicated data.

This disclosure proposes to optimize a memory management of the hash values stored in the GDS. In this solution, the GDS will initiate the request to storage servers to provide it with hash values.

Notably, the global server 100 may comprise a disk storage 102, as shown in FIG. 3. In particular, the set of hash values is included in a table, namely, a hash metadata table, which is stored in the disk storage 102 of the global server 100.

Optionally, the global server 100 may be configured to divide the hash metadata table into N parts, particularly N equal parts, where N is a positive integer no less than 2. Each part of the hash metadata table is associated with a different range of the hash values. In particular, the hash values are arranged in a specific order in the hash metadata table, e.g., in ascending or descending order. That is, if the hash values are arranged in ascending order, a hash value in the N-th part of the hash metadata table will have a greater value than a hash value in the (N-1)-th part of the hash metadata table. The ranges of the hash values in respective parts of the hash metadata table will not overlap with each other.

Notably, the global server 100 may further comprise a memory 103, as shown in FIG. 3. Optionally, the global server 100 may be further configured to upload a first part of the hash metadata table to the memory 103. In particular, the first part of the hash metadata table is associated with the first range 1011 of the hash values. Accordingly, the global server 100 may be configured to modify the first part of the hash metadata table based on the request 111 received from the one or more storage servers 110.

FIG. 3 shows a data storage management of a global server 100, according to an embodiment of the disclosure. As shown in FIG. 3, the first part of the hash metadata table is uploaded, from the disk storage 102, to the memory 103. The uploaded first part of the hash metadata table corresponds to the first range 1011 of the hash values. Optionally, the global server 100 may send a broadcast message carrying a specific range, i.e., the first range 1011 according to this embodiment, to all the storage servers 110 connected to it. This informs the storage servers 110 of the range of hash values that the global server 100 is willing to accept. According to embodiments of this disclosure, the specific range corresponds to the part of the hash metadata table that is currently uploaded to the memory 103.

In particular, the hash metadata table comprises information about each hash value in the set of hash values and information regarding one or more storage servers 110 being registered for that hash value. For example, for each hash value, a data chunk having that hash value, a water mark associated with that hash value, and information about which storage servers 110 have requested to add that hash value, may be included in the hash metadata table.

Generally speaking, each storage server 110 will record hash values which are new, and hash values which were removed, between notifications from the global server 100. Once a notification with a specific range of hash values has been received, the storage server 110 sends recorded requests to add or delete hash values in that specific range.

Possibly, in an embodiment of this disclosure, a request received from a storage server 110 may comprise a request to add a first hash value. When the first hash value is not included in the first part of the hash metadata table, the global server 100 may be configured to add the first hash value into the first part of the hash metadata table, create a first water mark associated with the first hash value, and register the storage server 110 that sent the request regarding the first hash value. The first water mark indicates whether a data chunk having the first hash value is highly duplicated among the storage servers 110. For instance, if the first water mark has a value of 1, this means that there is one storage server 110 that requests to add the first hash value. It should be noted that, when the first hash value is not included in the first part of the hash metadata table, this means that although the first hash value falls into the first range 1011, it is currently not stored in the global server 100. When the first hash value is included in the first part of the hash metadata table, the global server 100 may be configured to increase a value of the first water mark associated with the first hash value, and register the storage server 110 that sent the request regarding the first hash value.

Possibly, in another embodiment of this disclosure, a request received from a storage server 110 may comprise a request to delete a second hash value. The global server 100 may be configured to decrease a value of a second water mark associated with the second hash value, and unregister the storage server 110 that sent the request regarding the second hash value. Similarly, the second water mark indicates whether a data chunk having the second hash value is highly duplicated among the storage servers 110. Further, when the value of the second water mark is equal to 0, the global server 100 may be configured to delete the second hash value from the first part of the hash metadata table. It should be noted that, when the value of the second water mark is equal to 0, this means that no storage server 110 currently still requests the second hash value, so it can be deleted from the hash metadata table.

Possibly, the global server 100 may receive both requests, i.e., a request to add the first hash value and a request to delete the second hash value, at the same time. Possibly, the first hash value and the second hash value may even be the same hash value. For instance, the storage server A may request to add a hash value, while the storage server B requests to delete that hash value.
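As a non-limiting illustration, the handling of add and delete requests described above may be sketched in Python as follows. The names HashEntry, handle_add and handle_delete, and the representation of the loaded part as a dictionary, are hypothetical and chosen only for this sketch; the water mark is modelled as a simple counter, which is an assumption consistent with the example in which a value of 1 corresponds to one requesting storage server.

```python
from dataclasses import dataclass, field


@dataclass
class HashEntry:
    """One row of the in-memory part of the hash metadata table (assumed layout)."""
    water_mark: int = 0
    registered: set = field(default_factory=set)  # ids of registered storage servers


def handle_add(part: dict, hash_value: str, server_id: str) -> None:
    """Add request: create the entry if absent, then raise the water mark and register."""
    entry = part.get(hash_value)
    if entry is None:
        entry = HashEntry()          # hash value not yet in this part of the table
        part[hash_value] = entry
    entry.water_mark += 1            # first add yields a water mark of 1
    entry.registered.add(server_id)


def handle_delete(part: dict, hash_value: str, server_id: str) -> None:
    """Delete request: lower the water mark, unregister, drop the entry at zero."""
    entry = part.get(hash_value)
    if entry is None:
        return                       # assumption: unknown hash values are ignored
    entry.water_mark -= 1
    entry.registered.discard(server_id)
    if entry.water_mark == 0:
        del part[hash_value]         # no storage server refers to this hash any more


part = {}
handle_add(part, "h1", "A")     # storage server A adds the hash value
handle_add(part, "h1", "B")     # storage server B adds the same hash value
handle_delete(part, "h1", "A")  # storage server A later deletes it
```

In this sketch, a hash value requested by storage servers A and B and later deleted by A remains in the table with a water mark of 1 and only B registered.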

Notably, while handling the requests, the global server 100 inserts the hash values into, or deletes the hash values from, the part of the table currently in the memory 103. After inserting a hash value, if a water mark associated with that hash value reaches a high water mark, the global server 100 may ask for the data having that hash value from the storage servers 110. After deleting a hash value, if a water mark associated with that hash value reaches a low water mark, the global server 100 may delete the data having that hash value. That is, based on whether the water mark associated with a respective hash value is above or below certain thresholds, the global server 100 will either request to receive the data for that hash value from some storage server 110, or will decide to evacuate the data of that hash value and will notify all the relevant storage servers 110 to re-claim ownership of this data.
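The threshold decision described above may, purely by way of example, be sketched as follows. The threshold values, the function names and the data_stored flag (indicating whether the global server already holds the data chunk) are assumptions made for this sketch and are not specified by the disclosure.

```python
HIGH_WATER_MARK = 3  # assumed threshold; the disclosure gives no concrete value
LOW_WATER_MARK = 1   # assumed threshold; the disclosure gives no concrete value


def action_after_add(water_mark: int, data_stored: bool) -> str:
    """Decide what to do after an add request raised the water mark."""
    if water_mark >= HIGH_WATER_MARK and not data_stored:
        # Highly duplicated chunk: ask a storage server for the data.
        return "request_data"
    return "none"


def action_after_delete(water_mark: int, data_stored: bool) -> str:
    """Decide what to do after a delete request lowered the water mark."""
    if water_mark <= LOW_WATER_MARK and data_stored:
        # No longer highly duplicated: evacuate the data and notify the
        # relevant storage servers to re-claim ownership.
        return "evacuate_data"
    return "none"
```

Keeping the two thresholds apart, as assumed here, avoids repeatedly transferring and evacuating the same data chunk when its water mark oscillates around a single value.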

After updating the first part of the hash metadata table, which is currently in the memory 103 of the global server 100, this updated part will be persisted back to the disk storage 102 of the global server 100. That is, the updated data overwrites the old data of the same part accordingly.

The global server 100 will go over each part one by one, in a cyclic fashion. Namely, after the first part of the hash metadata table in the disk storage 102 is updated, the global server 100 will continue the procedure to update a second part of the hash metadata table. Accordingly, the global server 100 may be configured to notify a second range of the hash values to the storage servers 110. The global server 100 may be further configured to receive, from one or more of the storage servers 110, a request to modify the information 101 with respect to one or more hash values falling into the second range of the hash values. Then, the global server 100 may be configured to modify the information 101 with respect to the one or more hash values falling into the second range of the hash values, based on the request received from the one or more storage servers 110.

In particular, the global server 100 may upload a second part of the hash metadata table from the disk storage 102 to the memory 103. The second part of the hash metadata table is associated with the second range of the hash values.
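The cyclic procedure described above (upload one part, notify its range, apply the received requests, persist the part, continue with the next part) may be sketched as follows, as one possible illustration only. Hash values are modelled as integers, the disk storage as a list of dictionaries, and the water mark as a plain counter; split_ranges, process_cycle and collect_requests are hypothetical names, and collect_requests stands in for the notification to, and the replies from, the storage servers.

```python
def split_ranges(hash_space: int, n_parts: int) -> list:
    """Divide [0, hash_space) into n_parts contiguous half-open ranges."""
    step = hash_space // n_parts
    return [
        (i * step, hash_space if i == n_parts - 1 else (i + 1) * step)
        for i in range(n_parts)
    ]


def process_cycle(disk: list, ranges: list, collect_requests) -> None:
    """Go over every part once: load, apply requests in range, persist."""
    for idx, (lo, hi) in enumerate(ranges):
        part = dict(disk[idx])                    # upload the part into memory
        for op, h, _server in collect_requests(lo, hi):
            if not lo <= h < hi:
                continue                          # outside the notified range
            count = part.get(h, 0)                # water mark as a counter
            if op == "add":
                part[h] = count + 1
            elif op == "delete" and count > 0:
                if count == 1:
                    del part[h]                   # water mark reached 0
                else:
                    part[h] = count - 1
        disk[idx] = part                          # persist, overwriting the old part


ranges = split_ranges(256, 4)
disk = [{} for _ in ranges]
pending = [("add", 10, "A"), ("add", 10, "B"), ("add", 200, "A"), ("delete", 10, "A")]
process_cycle(disk, ranges, lambda lo, hi: [r for r in pending if lo <= r[1] < hi])
```

In this sketch the disk storage is touched twice per part, once to load it and once to persist it, regardless of how many requests fall into the range.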

It should be understood that, according to embodiments of this disclosure, such a procedure may be performed every X minutes/hours/days, X being a positive number that can be configured or changed dynamically. In this way, only the requests related to hash values which fall into a pre-defined limited range can be sent to the global server 100 and further handled. The global server 100 has control over the requested hash values. Thus, the global server 100 does not need to continuously access the disk storage 102 to retrieve and update hash values; rather, it can process only the part of the hash values in the memory 103. That is, the global server 100 only needs to access the disk storage 102 once or twice while processing one of the parts. In addition, this allows better control over the amount of network traffic that is generated in a predefined time period. In particular, after uploading a part of the hash metadata table to the memory 103, all requests received from the storage servers 110 during a specific time period should be processed by the global server 100. That is, before persisting the updated part to the disk storage 102, the global server 100 needs to deal with each request to add or delete a hash value, and to act according to each request (to modify the uploaded part of the hash metadata table).

Accordingly, the global server 100 may be configured to modify the information 101 with respect to the one or more hash values falling into the first range 1011 of the hash values, based on all the requests received from the one or more storage servers 110 in a pre-determined time period.

By having the global server 100 initiate communication and control which range of hash values it processes at any point in time, the storage servers 110 only report a limited range of hash values to the global server 100 in a specific time period. Thereby, it is ensured that the hash values in this range are in the memory 103 of the global server 100. This further avoids cache misses. In this way, memory management in the global server 100 is optimized.

FIG. 4 shows a storage server 110 according to an embodiment of the disclosure. The storage server 110 may comprise processing circuitry (not shown) configured to perform, conduct or initiate the various operations of the storage server 110 described herein. The processing circuitry may comprise hardware and software. The hardware may comprise analog circuitry or digital circuitry, or both analog and digital circuitry. The digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or multi-purpose processors. In one embodiment, the processing circuitry comprises one or more processors and a non-transitory memory connected to the one or more processors. The non-transitory memory may carry executable program code which, when executed by the one or more processors, causes the storage server 110 to perform, conduct or initiate the operations or methods described herein.

The storage server 110 is adapted for deduplicating at a global server 100. In particular, the storage server 110 shown in FIG. 4 is the same storage server 110 as shown in FIG. 1 or FIG. 2, and the global server 100 shown in FIG. 4 is the same global server 100 as shown in FIGs. 1 to 3. The storage server 110, according to an embodiment of the disclosure, is configured to maintain a set of hash values, each hash value being of a stored data chunk. When a user requests to add or delete a first data chunk, the storage server 110 is configured to record the request 111 associated with a first hash value of the first data chunk. Further, the storage server 110 is configured to receive, from the global server 100, a notification of a first range 1011 of hash values. Accordingly, the storage server 110 is configured to send, to the global server 100, the recorded request 111 to modify information in the global server 100 with respect to the first hash value, if the first hash value falls into the first range 1011 of hash values.

Since data stored by multiple deduplication servers or storage servers is often duplicated, to avoid a space loss, the storage server 110 may request to store some data in a global server 100. According to embodiments of this disclosure, the global server 100 initiates the communication between the global server 100 and the storage servers 110, and controls which range of hash values can be provided to it. Therefore, the storage server 110, according to the embodiments of the disclosure, does not freely send a request to add or delete data to the global server 100, but rather records the requests and sends them (possibly separately) based on the notification or instruction from the global server 100.

Optionally, after sending the request 111 to the global server 100, the storage server 110 may be configured to delete the recorded request 111.

Notably, after the request 111 that is related to a hash value falling into a specific range provided by the global server 100 is sent to the global server 100, it is assumed that the global server 100 will deal with this request 111 and will act accordingly. To avoid sending the same request to the global server 100 again, the storage server 110 will delete from the record those requests that have been sent to the global server 100.

However, it should be noted that, even though the storage server 110 deletes the recorded request 111 (which is saved until it is handled by the global server 100), the storage server 110 still keeps a record regarding the hash value. For instance, the information about whether the data associated with that hash value is saved locally at the storage server 110, and/or at the global server 100, as well as a local count associated with the hash value that indicates how often this hash value is required by the storage server’s users, may be still stored in the storage server 110. Accordingly, after deleting the recorded request 111, the storage server 110 may be configured to maintain information about a storage location of the first data chunk, in association with the maintained first hash value, wherein the storage location of the first data chunk is the storage server 110 and/or the global server 100.
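The behaviour of the storage server 110 described above (recording requests, sending only those falling into the notified range, deleting the sent requests while keeping the record about the hash value) may be sketched as follows, by way of example only. The class name StorageServer, its attributes, the integer hash values and the half-open range are assumptions of this sketch; how a user's delete affects the local location record is not modelled.

```python
class StorageServer:
    """Minimal sketch of the request log kept by a storage server (hypothetical)."""

    def __init__(self, server_id: str):
        self.server_id = server_id
        self.pending = []    # recorded requests as (op, hash_value) tuples
        self.locations = {}  # hash_value -> set of storage locations (kept after sending)

    def record(self, op: str, hash_value: int) -> None:
        """Record a user request instead of sending it immediately."""
        self.pending.append((op, hash_value))
        if op == "add":
            # Assumption: a newly added chunk is first stored locally.
            self.locations.setdefault(hash_value, {"local"})

    def on_range_notification(self, lo: int, hi: int) -> list:
        """Return the recorded requests in [lo, hi) and remove them from the log."""
        to_send = [r for r in self.pending if lo <= r[1] < hi]
        self.pending = [r for r in self.pending if not lo <= r[1] < hi]
        return to_send  # the location record for these hash values is retained


server = StorageServer("A")
server.record("add", 5)
server.record("add", 100)
server.record("delete", 7)
sent = server.on_range_notification(0, 64)  # requests for hash values 5 and 7 are sent
```

After the notification, only the request for hash value 100 remains recorded, while the location record for hash value 5 is still available to the storage server.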

Optionally, the storage server 110 may be configured to receive, from the global server 100, a broadcast message carrying the first range 1011 of hash values.

Optionally, the storage server 110 may be further configured to compare the first range 1011 of hash values with the first hash value. Further, the storage server 110 is configured to determine, whether the first hash value falls into the first range 1011 of hash values.

In particular, for each recorded request associated with a respective hash value, the storage server 110 will determine whether this request should be sent to the global server 100. That means, a can-be-sent request must be related to a hash value that falls into a given range, i.e., the first range 1011, according to this embodiment.

Notably, the request 111 may comprise a request to add the first hash value of the first data chunk, if the user requested adding the first data chunk. Possibly, the request may also comprise a request to delete the first hash value of the first data chunk, if the user requested deleting the first data chunk.

Optionally, for a recorded request, even if this request is allowed to be sent to the global server 100, the storage server 110 may still decide whether to send it. For instance, for frequently accessed data, the storage server 110 may decide not to offload it to the global server 100. Thus, such data might remain in the local storage server to allow for a low read latency. Further, storage servers 110 can also decide not to offload certain data to the global server 100, e.g., some private data, or for security reasons.
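One possible form of this optional decision is sketched below. The function name should_send, the local_count and is_private parameters, and the hot_threshold value are hypothetical; the disclosure does not prescribe how frequently accessed or private data is identified.

```python
def should_send(hash_value: int, lo: int, hi: int,
                local_count: int, is_private: bool,
                hot_threshold: int = 100) -> bool:
    """Decide whether a recorded request is sent to the global server."""
    if not lo <= hash_value < hi:
        return False  # outside the notified range: must not be sent
    if is_private:
        return False  # private or security-sensitive data stays local
    if local_count >= hot_threshold:
        return False  # frequently accessed data stays local for low read latency
    return True
```

The range check remains mandatory in this sketch, whereas the two further checks reflect a local policy of the storage server.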

Notably, after the storage server 110 sends the recorded request 111 to the global server 100 for processing, new incoming requests from the user will continue to be recorded in the storage server 110. The global server 100 may request the storage server 110 to send further recorded requests, particularly when the global server 100 has successfully handled all received requests. It should be understood that, according to embodiments of this disclosure, when a user requests to add or delete a second data chunk, the storage server 110 may be further configured to record the request associated with a second hash value of the second data chunk. Then, the storage server 110 may be configured to receive, from the global server 100, a notification of a second range of hash values. Accordingly, the storage server 110 may be configured to send, to the global server 100, the recorded request to modify information in the global server 100 with respect to the second hash value, if the second hash value falls into the second range of hash values. Similarly to the previous embodiments, the storage server 110 may be configured to delete the recorded request after successfully sending the request to the global server 100.

In this disclosure, the storage server 110 sends requests to the global server 100, only in a specific time period (when it is indicated to), and only sends those requests that fulfil a condition provided by the global server 100. In particular, the request must be related to a hash value falling into a limited range that is notified by the global server 100. It allows better control over the amount of network traffic that is generated in a predefined time period.

FIG. 5 shows a method 500 performed by a global server 100 for deduplicating multiple storage servers 110 according to an embodiment of the present disclosure. In particular, the global server 100 corresponds to the global server 100 of FIG. 1. The method 500 comprises: a step 501 of maintaining information 101 regarding a set of hash values, each hash value being associated with a data chunk of data stored in the global server 100 and/or the storage servers 110; a step 502 of notifying a first range 1011 of the hash values to the storage servers 110; a step 503 of receiving, from one or more of the storage servers 110, a request 111 to modify the information 101 with respect to one or more hash values falling into the first range 1011 of the hash values; and a step 504 of modifying the information 101 with respect to the one or more hash values falling into the first range 1011 of the hash values, based on the request 111 received from the one or more storage servers 110. Particularly, the storage servers 110 are the storage servers 110 of FIG. 1. For the skilled person in the art, the step 501 may be optional in the implementation of the method 500.

FIG. 6 shows a method 600 performed by a storage server 110 for deduplicating at a global server 100, according to an embodiment of the present disclosure. In particular, the global server 100 is the global server 100 of FIG. 4, and the storage server 110 is the storage server 110 of FIG. 4. The method 600 comprises: a step 601 of maintaining a set of hash values, each hash value being of a stored data chunk; a step 602 of recording, when a user requests to add or delete a first data chunk, the request 111 associated with a first hash value of the first data chunk; a step 603 of receiving, from the global server 100, a notification of a first range 1011 of hash values; and a step 604 of sending, to the global server 100, the recorded request to modify information in the global server 100 with respect to the first hash value, if the first hash value falls into the first range 1011 of hash values. For the skilled person in the art, the step 601 may be optional in the implementation of the method 600.

The present disclosure has been described in conjunction with various embodiments as examples as well as implementations. However, other variations can be understood and effected by those persons skilled in the art and practicing the claimed disclosure, from the studies of the drawings, this disclosure and the independent claims. In the claims as well as in the description the word “comprising” does not exclude other elements or steps and the indefinite article “a” or “an” does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.

Furthermore, any method according to embodiments of the invention may be implemented in a computer program, having code means, which when run by processing means causes the processing means to execute the steps of the method. The computer program is included in a computer readable medium of a computer program product. The computer readable medium may comprise essentially any memory, such as a ROM (Read-Only Memory), a PROM (Programmable Read-Only Memory), an EPROM (Erasable PROM), a Flash memory, an EEPROM (Electrically Erasable PROM), or a hard disk drive.

Moreover, it is realized by the skilled person that embodiments of the global server 100 and the storage server 110 comprise the necessary communication capabilities in the form of e.g., functions, means, units, elements, etc., for performing the solution. Examples of other such means, units, elements and functions are: processors, memory, buffers, control logic, encoders, decoders, rate matchers, de-rate matchers, mapping units, multipliers, decision units, selecting units, switches, interleavers, de-interleavers, modulators, demodulators, inputs, outputs, antennas, amplifiers, receiver units, transmitter units, DSPs, trellis-coded modulation (TCM) encoder, TCM decoder, power supply units, power feeders, communication interfaces, communication protocols, etc. which are suitably arranged together for performing the solution.

Especially, the processor(s) of the global server 100 and the storage server 110 may comprise, e.g., one or more instances of a Central Processing Unit (CPU), a processing unit, a processing circuit, a processor, an Application Specific Integrated Circuit (ASIC), a microprocessor, or other processing logic that may interpret and execute instructions. The expression “processor” may thus represent a processing circuitry comprising a plurality of processing circuits, such as, e.g., any, some or all of the ones mentioned above. The processing circuitry may further perform data processing functions for inputting, outputting, and processing of data comprising data buffering and device control functions, such as call processing control, user interface control, or the like.

Claims

1. A global server (100) for deduplicating multiple storage servers (110), wherein the global server is configured to: notify a first range (1011) of a set of hash values to the storage servers (110), wherein information (101) regarding the set of hash values is maintained, and each hash value being associated with a data chunk of data stored in the global server (100) and/or the storage servers (110); receive, from one or more of the storage servers (110), a request (111) to modify the information (101) with respect to one or more hash values falling into the first range (1011) of the hash values; and modify the information (101) with respect to the one or more hash values falling into the first range (1011) of the hash values, based on the request (111) received from the one or more storage servers (110).

2. The global server (100) according to claim 1, configured to: send, to the storage servers (110), a broadcast message carrying the first range (1011) of the hash values.

3. The global server (100) according to claim 1 or 2, wherein: the global server (100) comprises a disk storage (102); the information (101) comprises a hash metadata table including the set of hash values, the hash metadata table being stored in the disk storage (102).

4. The global server (100) according to claim 3, configured to: divide the hash metadata table into N parts, N being a positive integer no less than 2, wherein each part of the hash metadata table is associated with a different range of the hash values.

5. The global server (100) according to claim 4, wherein: the global server (100) further comprises a memory (103); the global server (100) is further configured to: upload a first part of the hash metadata table to the memory (103), wherein the first part of the hash metadata table is associated with the first range (1011) of the hash values; and modify the first part of the hash metadata table based on the request (111) received from the one or more storage servers (110).

6. The global server (100) according to any of claims 3 to 5, wherein: the hash metadata table comprises information about each hash value in the set of hash values and information regarding one or more storage servers (110) being registered for that hash value.

7. The global server (100) according to claim 6, configured to:

- in response to the request received from a storage server (110) comprises a request to add a first hash value:

- in response to the first hash value is not included in the first part of the hash metadata table, add the first hash value into the first part of the hash metadata table, create a first water mark associated with the first hash value, and register the storage server that sent the request regarding the first hash value, wherein the first water mark indicating whether a data chunk having the first hash value is highly duplicated among the storage servers (110);

- in response to the request received from a storage server (110) comprises a request to delete a second hash value:

- decrease a value of a second water mark associated with the second hash value, and unregister the storage server that sent the request regarding the second hash value, wherein the second water mark indicating whether a data chunk having the second hash value is highly duplicated among the storage servers (110); and

8. The global server (100) according to any of claims 5 to 7, configured to: persist the updated first part of the hash metadata table to the disk storage (102).

9. The global server (100) according to any of claims 1 to 8, configured to: modify the information (101) with respect to the one or more hash values falling into the first range of the hash values, based on all the requests (111) received from the one or more storage servers (110) in a pre-determined time period.

10. A storage server (110) for deduplicating at a global server (100), wherein the storage server (110) is configured to: in response to a request to add or delete a first data chunk, record the request (111) associated with a first hash value of the first data chunk, wherein the first hash value is included in a set of hash values, the set of hash values is maintained and each hash value being of a stored data chunk; receive, from the global server (100), a notification of a first range (1011) of hash values; and send, to the global server (100), the recorded request (111) to modify information in the global server (100) with respect to the first hash value, if the first hash value falls into the first range (1011) of hash values.

11. The storage server (110) according to claim 10, configured to: delete the recorded request after sending the request to the global server (100).

12. The storage server (110) according to claim 11, configured to: after deleting the recorded request, maintain information about a storage location of the first data chunk, in association with the maintained first hash value, wherein the storage location of the first data chunk is the storage server and/or the global server (100).

13. The storage server (110) according to any of claims 10 to 12, configured to: receive, from the global server (100), a broadcast message carrying the first range of hash values.

14. The storage server (110) according to one of the claims 10 to 13, configured to: compare the first range of hash values with the first hash value; and determine, if the first hash value falls into the first range of hash values.

15. The storage server (110) according to one of the claims 10 to 14, wherein: the request comprises a request to add the first hash value of the first data chunk, if the user requested adding the first data chunk; or the request comprises a request to delete the first hash value of the first data chunk, if the user requested deleting the first data chunk.

16. A method (500) for deduplicating multiple storage servers (110), the method comprising: notifying (502) a first range (1011) of a set of hash values to the multiple storage servers (110), wherein information (101) regarding the set of hash values is maintained, and each hash value is associated with a data chunk of data stored in the global server (100) and/or the storage servers (110); receiving (503), from one or more of the storage servers, a request (111) to modify the information with respect to one or more hash values falling into the first range (1011) of the hash values; and modifying (504) the information (101) with respect to the one or more hash values falling into the first range of the hash values, based on the request (111) received from the one or more storage servers (110).

17. A method (600) for deduplicating at a global server, the method comprising: in response to a request to add or delete a first data chunk, recording (602) the request (111) associated with a first hash value of the first data chunk, wherein the first hash value is included in a set of hash values, and each hash value being of a stored data chunk; receiving (603), from the global server (100), a notification of a first range (1011) of hash values; and sending (604), to the global server (100), the recorded request (111) to modify information in the global server (100) with respect to the first hash value, if the first hash value falls into the first range (1011) of hash values.

18. A computer program product comprising computer readable code instructions which, when run in a computer will cause the computer to perform the method according to any of claim 16 or 17.

19. A computer readable storage medium comprising computer program code instructions, being executable by a computer, for performing a method according to any of claim 16 or 17 when the computer program code instructions runs on a computer.