WO2023041141A1 - Deduplication using cache eviction for strong and weak hashes - Google Patents

Deduplication using cache eviction for strong and weak hashes

Info

Publication number
WO2023041141A1
Authority
WO
WIPO (PCT)
Prior art keywords
weak
hash
hashes
strong
data
Prior art date
Application number
PCT/EP2021/075186
Other languages
French (fr)
Inventor
Assaf Natanzon
Aviv Kuvent
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to CN202180101408.6A priority Critical patent/CN117813591A/en
Priority to PCT/EP2021/075186 priority patent/WO2023041141A1/en
Publication of WO2023041141A1 publication Critical patent/WO2023041141A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 - Error detection; Error correction; Monitoring
    • G06F11/07 - Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 - Error detection or correction of the data by redundancy in operation
    • G06F11/1402 - Saving, restoring, recovering or retrying
    • G06F11/1446 - Point-in-time backing up or restoration of persistent data
    • G06F11/1448 - Management of the data involved in backup or backup restore
    • G06F11/1453 - Management of the data involved in backup or backup restore using de-duplication of the data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 - File systems; File servers
    • G06F16/13 - File access structures, e.g. distributed indices
    • G06F16/137 - Hash-based
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 - File systems; File servers
    • G06F16/17 - Details of further file system functions
    • G06F16/174 - Redundancy elimination performed by the file system
    • G06F16/1748 - De-duplication implemented within the file system, e.g. based on file segments
    • G06F16/1752 - De-duplication implemented within the file system, based on file chunks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 - Interfaces specially adapted for storage systems
    • G06F3/0602 - Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608 - Saving storage space on storage systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 - Interfaces specially adapted for storage systems
    • G06F3/0628 - Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 - Organizing or formatting or addressing of data
    • G06F3/064 - Management of blocks
    • G06F3/0641 - De-duplication techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 - Interfaces specially adapted for storage systems
    • G06F3/0668 - Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 - In-line storage system

Definitions

  • the present disclosure relates generally to the field of data deduplication; and more specifically, to a method of data management in a data storage system, and a data processing apparatus for a data storage system.
  • a data backup operation is used to create a backup of data at a target site in order to protect and recover data in an event of data loss in a source site.
  • events of data loss may include, but are not limited to, data corruption, hardware or software failure in the source site, accidental deletion of data, hacking, or malicious attacks.
  • a separate backup storage is extensively used to store the backup of the data present in the source site.
  • duplicate copies of data which are either shared over a network or stored in a data storage system, may unnecessarily increase the storage capacity requirements of the data storage system.
  • data deduplication solutions are required to detect and eliminate duplicate copies of data and thus, significantly decrease storage capacity requirements.
  • a typical process flow of a conventional data deduplication solution is: receiving a new write request; segmenting the data using a segmentation algorithm; calculating a strong hash and a weak hash for each segment; and searching for the strong hash in a given strong hash cache. If the strong hash is found in the given strong hash cache, the block identity (ID) is returned. If the strong hash is not found in the given strong hash cache, the weak hash is searched for in a given weak hash table. If the weak hash is found in the given weak hash table, the strong hash is retrieved from a given ID table for comparison. If the weak hash is not found in the given weak hash table, the data cannot be deduplicated, and a new block ID is generated.
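This conventional flow can be sketched in Python as a simplified, in-memory illustration (the fixed-size chunking, SHA-1 as the strong hash, and truncation to 8 bytes for the weak hash are assumptions of the sketch, not details of any particular product):

```python
import hashlib

BLOCK_SIZE = 4096  # fixed-length chunking, for simplicity

strong_hash_cache = {}  # strong hash -> block ID
weak_hash_table = {}    # weak hash -> block ID
id_table = {}           # block ID -> (strong hash, data location)
next_block_id = 0

def dedup_write(data: bytes):
    """Process a write request and return the block IDs covering it."""
    global next_block_id
    block_ids = []
    for off in range(0, len(data), BLOCK_SIZE):
        block = data[off:off + BLOCK_SIZE]
        strong = hashlib.sha1(block).digest()  # strong hash for the segment
        weak = strong[:8]                      # weak hash: a portion of its bits
        # 1) search the strong hash cache
        if strong in strong_hash_cache:
            block_ids.append(strong_hash_cache[strong])
            continue
        # 2) search the weak hash table, then 3) verify via the ID table
        bid = weak_hash_table.get(weak)
        if bid is not None and id_table[bid][0] == strong:
            strong_hash_cache[strong] = bid
            block_ids.append(bid)
            continue
        # 4) not deduplicable: generate a new block ID
        bid = next_block_id
        next_block_id += 1
        id_table[bid] = (strong, off)
        weak_hash_table[weak] = bid
        strong_hash_cache[strong] = bid
        block_ids.append(bid)
    return block_ids
```

Writing the same bytes again yields the same block IDs, so duplicate data is never assigned a new ID.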
  • ID: block identity
  • the conventional data deduplication solutions may use a separate memory, which is not part of the given weak hash table, to hold a set of candidate hashes retrieved from the given weak hash table.
  • the use of the separate memory increases the storage capacity requirements of the data storage system and is not desirable.
  • certain conventional data deduplication solutions make a trade-off between several performance aspects while degrading the performance for other application scenarios.
  • one conventional data deduplication solution may be optimized for use in an online deduplication scenario, which provides high throughput but with relatively high latency.
  • such a conventional data deduplication solution is not preferred for a source-based deduplication scenario, in which very low latency is required. Therefore, there exists a technical problem of inefficient and ineffective data deduplication solutions that manifest low performance under multiple application scenarios (e.g., source-based deduplication and generic inline deduplication).
  • the present disclosure provides a method of data management in a data storage system, and a data processing apparatus for a data storage system.
  • the present disclosure provides a solution to the existing problem of inefficiency and unreliability associated with the conventional data deduplication solutions, where the problem is compounded by the fact that the existing solutions compromise performance in terms of parameters such as latency, throughput, access rate, and hit rate, for example, under multiple application scenarios (e.g., source-based deduplication and generic inline deduplication).
  • An aim of the present disclosure is to provide a solution that at least partially overcomes the problems encountered in the prior art, and to provide an improved data deduplication solution that efficiently utilizes a cache eviction process and manifests high performance in terms of low latency, high throughput, low access rate, and high hit rate as compared to existing systems.
  • One or more objects of the present disclosure are achieved by the solutions provided in the enclosed independent claims. Advantageous implementations of the present disclosure are further defined in the dependent claims.
  • the present disclosure provides a computer-implemented method of data management in a data storage system.
  • the method comprises: dividing each data item in the data storage system into a plurality of blocks; calculating a strong hash for each block and generating an ID table comprising a list of strong hashes with a pointer to a location of the corresponding block; and calculating a weak hash for each strong hash and generating a weak hash table comprising a list of weak hashes with a pointer to a location of the corresponding strong hash in the ID table.
  • the method further comprises: dividing the incoming data item in the data storage system into a plurality of blocks; calculating a strong hash and a weak hash for each block; selecting one or more representative weak hashes for the incoming data item; searching for the representative weak hashes in the weak hash table; and recording a match between one or more of the representative weak hashes and a weak hash in the weak hash table, wherein the weak hash table comprises a cached portion, and a cache eviction algorithm is configured to determine whether to keep or evict each weak hash in the cached portion based on a number of matches recorded for the weak hash.
  • the method of the present disclosure provides an improved data deduplication solution by efficiently managing the data stored in the data storage system.
  • by virtue of the cached portion, the number of disk-read accesses is reduced, which consequently minimizes the access rate and latency in fetching the IO operations of the data storage system.
  • the method of the present disclosure can be efficiently used for multiple scenarios (e.g., source-based deduplication and generic inline deduplication) without degrading and compromising the performance (i.e., latency, throughput, access rate, and hit rate) for any scenario.
  • the method allows an efficient and reliable data deduplication solution to manage duplicate data as well as non-duplicate data.
  • the non-duplicate data is stored in the data storage system as a new data item.
  • selecting the representative weak hashes comprises selecting a predetermined number of the lowest value weak hashes.
  • the method of selection of the representative weak hashes becomes easy and deterministic because, for a given set of weak hash values, the same lowest weak hash values are selected as the representative weak hashes every time.
  • selecting the representative weak hashes comprises selecting one or more weak hashes for which a predetermined number of the most significant bits are equal to zero.
  • weak hashes with the same most significant bits are located closely in the cached portion within the weak hash table.
  • the cache eviction algorithm is able to cache full pages of strong hashes, which in turn makes the search in the weak hash table more efficient.
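The two selection rules described above (lowest values, or zeroed most significant bits) are both deterministic and can be sketched as follows (the hash width and bit counts are illustrative assumptions):

```python
def lowest_value_representatives(weak_hashes, k=2):
    """Pick the k lowest-valued weak hashes as representatives (deterministic)."""
    return sorted(weak_hashes)[:k]

def msb_zero_representatives(weak_hashes, n_bits=6, hash_bits=64):
    """Pick the weak hashes whose n_bits most significant bits are all zero."""
    threshold = 1 << (hash_bits - n_bits)
    return [h for h in weak_hashes if h < threshold]
```

Either rule produces the same representatives every time it is applied to the same set of weak hashes.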
  • the predetermined number is dynamically chosen based on a hit rate for the data storage system and an amount of data in the cached portion.
  • the cache eviction (using the cache eviction algorithm) ensures that the weak hashes of the incoming data item, which are searched against the weak hash table of the data storage system, are present in the cached portion.
  • the number of accesses to fetch IO operations from the disk is reduced, and consequently, latency in the data storage system is minimized.
  • in response to a match in the weak hash table, the method further comprises finding an associated strong hash from the ID table and checking the associated strong hash against one or more of the strong hashes calculated for the incoming data item.
  • the cache eviction algorithm caches full pages of strong hashes, which allows the search in the weak hash table to be more efficient.
  • the throughput and hit rate for the data storage system are increased.
  • the method further comprises: loading the associated strong hash and one or more neighboring strong hashes to a strong hash cache, wherein one or more strong hash values calculated for a new incoming data item are checked against the strong hash cache before the weak hash table is searched.
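One way to read this step: on a weak-hash match, the matched strong hash and its neighbors (consecutive block IDs) are prefetched into the strong hash cache, so a sequential incoming stream can keep hitting the cache without touching the weak hash table. A minimal sketch, with an illustrative prefetch radius:

```python
def load_with_neighbors(id_table, block_id, strong_hash_cache, radius=2):
    """Load the matched strong hash and its neighboring entries (consecutive
    block IDs in the ID table) into the strong hash cache."""
    # id_table: list of (strong hash, location), indexed by block ID
    for bid in range(max(0, block_id - radius), block_id + radius + 1):
        if bid < len(id_table):
            strong, _location = id_table[bid]
            strong_hash_cache[strong] = bid
```

Because IDs rise monotonically, neighboring IDs tend to point to data written consecutively, which is what makes the prefetch worthwhile.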
  • the weak hashes that saved access to the weak hash table due to many hits in the strong hash cache remain in the cached portion of the weak hash table for a longer time.
  • the incoming data item is written to the data storage system as a new data item.
  • the incoming data item is identified as a non-duplicate data item and is stored in the data storage system as a new data item.
  • the cache eviction algorithm is configured to assign a higher priority to keep weak hashes for which a recorded match corresponds to a data item received as part of a backup task.
  • the incoming data item is received as part of a backup task.
  • the method further comprises: receiving a second incoming data item which is not part of a backup task; dividing the second incoming data item in the data storage system into a plurality of blocks; calculating a strong hash and a weak hash for each block; searching for each weak hash in the weak hash table; and recording a match between one or more of the weak hashes and a weak hash in the weak hash table, wherein the cache eviction algorithm is configured to assign a lower priority to keep weak hashes for which a recorded match corresponds to the second data item which is not received as part of a backup task.
  • the incoming write requests which are not part of the backup task, can be efficiently stored in the data storage system.
  • the present disclosure provides a computer-readable medium configured to store instructions which, when executed by a processor, cause the processor to perform the method of aforementioned aspect.
  • the computer-readable medium achieves all the advantages and effects of the respective method of the present disclosure.
  • the present disclosure provides a data processing apparatus for a data storage system.
  • the data processing apparatus comprises a data indexing module, a data query module, and a cache eviction module.
  • the data indexing module is configured to: divide each data item in the data storage system into a plurality of blocks; calculate a strong hash for each block and generate an ID table comprising a list of strong hashes with a pointer to a location of the corresponding block; and calculate a weak hash for each strong hash and generate a weak hash table comprising a list of weak hashes with a pointer to a location of the corresponding strong hash in the ID table.
  • the data query module is configured to receive an incoming data item and, in response to receiving an incoming data item: divide the incoming data item in the data storage system into a plurality of blocks; calculate a strong hash and a weak hash for each block; select one or more representative weak hashes for the incoming data item; search for the representative weak hashes in the weak hash table; and record a match between one or more of the representative weak hashes and a weak hash in the weak hash table.
  • the weak hash table comprises a cached portion, and the cache eviction module is configured to determine whether to keep or evict each weak hash in the cached portion based on a number of matches recorded for the weak hash.
  • the data processing apparatus of the present disclosure performs the method of aforementioned aspect by efficiently indexing the weak hashes in the cached portion of the weak hash table. Further, the data processing apparatus provides an efficient and reliable data deduplication solution to manage duplicate data as well as non-duplicate data.
  • the non-duplicate data is stored in the data storage system as a new data item.
  • the duplicate data is not stored explicitly in the data storage system; rather, the corresponding weak hash is kept in the cached portion based on the number of matches recorded for the weak hash.
  • the data processing apparatus helps in the efficient utilization of the storage capacity of the data storage system.
  • selecting the representative weak hashes comprises selecting a predetermined number of the lowest value weak hashes.
  • the selection of the representative weak hashes becomes easy and deterministic because, for a given set of weak hash values, the same lowest weak hash values are selected as the representative weak hashes every time.
  • selecting the representative weak hashes comprises selecting one or more weak hashes for which a predetermined number of the most significant bits are equal to zero.
  • the cache eviction module caches full pages of strong hashes, which allows the search in the weak hash table to be more efficient.
  • the predetermined number is dynamically chosen based on a hit rate for the data storage system and an amount of data in the cached portion.
  • the data query module ensures that the weak hashes of the incoming data item, which are searched against the weak hash table of the data storage system, are present in the cached portion.
  • the number of accesses to fetch IO operations from the disk is reduced, and consequently, latency in the data storage system is minimized.
  • the data query module is further configured, in response to a match in the weak hash table, to find an associated strong hash from the ID table and check the associated strong hash against one or more of the strong hashes calculated for the incoming data item.
  • the cache eviction module is able to cache full pages of strong hashes, which allows the search in the weak hash table to be more efficient.
  • the throughput and hit rate for the data storage system are increased.
  • the data query module is further configured to load the associated strong hash and one or more neighboring strong hashes to a strong hash cache, wherein one or more strong hash values calculated for a new incoming data item are checked against the strong hash cache before the weak hash table is searched.
  • the weak hashes that saved access to the weak hash table due to many hits in the strong hash cache remain in the cached portion of the weak hash table for a longer time.
  • the incoming data item is written to the data storage system as a new data item.
  • the incoming data item is identified as a non-duplicate data item and is stored in the data storage system as a new data item.
  • the cache eviction module is configured to assign a higher priority to keep weak hashes for which a recorded match corresponds to a data item received as part of a backup task.
  • the incoming data is received as part of a backup task.
  • the data query module is further configured to: receive a second incoming data item which is not part of a backup task; divide the second incoming data item in the data storage system into a plurality of blocks; calculate a strong hash and a weak hash for each block; search for each weak hash in the weak hash table; and record a match between one or more of the weak hashes and a weak hash in the weak hash table; wherein the cache eviction module is configured to assign a lower priority to keep weak hashes for which a recorded match corresponds to the second data item which is not received as part of a backup task.
  • the incoming write requests which are not part of the backup task, can be efficiently stored in the data storage system.
  • FIG. 1A and FIG. 1B, collectively, are a flowchart of a method of data management in a data storage system, in accordance with an embodiment of the present disclosure;
  • FIG. 2 is a block diagram of a data storage system, in accordance with an embodiment of the present disclosure.
  • FIG. 3 is an illustration that depicts various operations of a data deduplication solution, in accordance with an embodiment of the present disclosure.
  • an underlined number is employed to represent an item over which the underlined number is positioned or an item to which the underlined number is adjacent.
  • a non-underlined number relates to an item identified by a line linking the non-underlined number to the item.
  • the non-underlined number is used to identify a general item at which the arrow is pointing.
  • FIG. 1A and FIG. 1B, collectively, are a flowchart of a method of data management in a data storage system, in accordance with an embodiment of the present disclosure.
  • a method 100 is performed by a data processing apparatus described in detail, for example, in FIG. 2.
  • the method 100 includes steps 102 to 116.
  • the method 100 comprises dividing each data item in the data storage system into a plurality of blocks.
  • each data item in the data storage system (i.e., data files, database files, IO writes, and the like) is broken down into a plurality of blocks.
  • the process of dividing is also referred to as chunking of data, i.e., splitting a large data item into small data items called chunks.
  • the plurality of blocks may also be referred to as data segments. Further, dividing each data item into the plurality of blocks may result in blocks of fixed-length (i.e., equal size blocks) or variable-length (i.e., unequal size blocks), depending upon the way chunking is performed.
  • the method 100 further comprises calculating a strong hash for each block and generating an ID table comprising a list of strong hashes with a pointer to a location of the corresponding block.
  • the ID table may include a mapping from an ID to a strong hash and the location on the disk where the block corresponding to the strong hash is located.
  • the IDs are monotonically rising so subsequent entries will have consecutive IDs. As such, if an ID matches a data item, it can be expected that the subsequent IDs will point to consecutive data items.
  • hashing is the process of converting a given key into a new value.
  • a hash function is used to generate the new value according to a mathematical algorithm.
  • the result of the hash function (i.e., the new value) is referred to as the strong hash or, simply, a hash.
  • the strong hashes may also be referred to as fingerprints of data segments (i.e., the plurality of blocks) since each hash may be of the same length but may be unique based on the data block.
  • a strong hash of a data block is a hash which has a very high probability of being unique for the data block. The probability may be so high that no two blocks of data stored on a given storage system will have the same ID even over years of usage.
  • the strong hash refers to the value that uniquely describes the data block.
  • the strong hash is a bit string that represents the data block that is processed. If a given data block is processed by a given hashing algorithm and later if the same hashing algorithm is applied on the same data block, then the same strong hash is created each time. Thus, if the same copies of data segments arrive, then the same strong hash is generated for all the copies. Further, the ID table is generated after calculating a strong hash for each block.
  • the ID table refers to a full index that maps block ID (i.e., the strong hash) into the actual address of the corresponding block.
  • the ID table includes a list of unique data blocks, sorted by the ID number of the block. Once a new data block is found, it is added to the end of the list with a new sequence number, which is the ID number of the data block. Hence, the ID table includes all the block IDs of all the blocks stored in the data storage system. Since the ID table comprises the list of strong hashes, which are unique in nature, when the same strong hash is generated for different blocks, the blocks are identified as duplicate blocks and are not stored repetitively. Thus, the size of the ID table is reduced.
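An append-only ID table with monotonically rising IDs, as described above, can be sketched as follows (the class and method names are illustrative, not from the disclosure):

```python
class IdTable:
    """Append-only ID table mapping block ID -> (strong hash, disk location).
    IDs rise monotonically, so blocks added consecutively get consecutive IDs."""

    def __init__(self):
        self._entries = []  # list index doubles as the block ID
        self._by_hash = {}  # strong hash -> block ID, for duplicate detection

    def add(self, strong_hash, location):
        """Return the existing ID for a duplicate block; otherwise append a
        new entry at the end of the list and return its new sequence number."""
        if strong_hash in self._by_hash:
            return self._by_hash[strong_hash]
        block_id = len(self._entries)
        self._entries.append((strong_hash, location))
        self._by_hash[strong_hash] = block_id
        return block_id

    def lookup(self, block_id):
        """Return (strong hash, location) for a block ID."""
        return self._entries[block_id]
```

Adding a block whose strong hash is already present returns the original ID, so duplicate blocks never grow the table.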
  • the method 100 further comprises calculating a weak hash for each strong hash and generating a weak hash table comprising a list of weak hashes with a pointer to a location of the corresponding strong hash in the ID table.
  • the weak hash is selected by generating a list of weak hashes for each strong hash and selecting one or more weak hashes from the list of weak hashes as the weak hash.
  • the weak hash is so named because it comprises only a portion of the bits of a corresponding strong hash. For example, for a strong hash having 160 bits, a weak hash may be generated by selecting, e.g., 64 bits out of the total 160 bits.
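For example, with SHA-1 standing in as a 160-bit strong hash (an illustrative choice, not mandated by the text), a 64-bit weak hash can be taken as 8 of its 20 bytes:

```python
import hashlib

def strong_hash(block: bytes) -> bytes:
    """160-bit strong hash of a block (SHA-1 used here only as an example)."""
    return hashlib.sha1(block).digest()  # 20 bytes = 160 bits

def weak_hash(strong: bytes) -> int:
    """64-bit weak hash: a portion (here the first 8 bytes) of the strong hash."""
    return int.from_bytes(strong[:8], "big")
```

Since the weak hash is derived purely from the strong hash, the same block always yields the same weak hash.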
  • the weak hash may be selected from the list of weak hashes in a deterministic manner, for example, by choosing the two weak hashes from the list that have the smallest values.
  • the weak hash table is generated after calculating the weak hash for each strong hash.
  • the weak hash table refers to a full index that maps a weak hash value into a block ID.
  • the weak hash table is the main index used for deduplication.
  • the weak hash table is a standard key-value data structure. Typically, such a data structure is implemented using a b-tree, and hashes that are close in value are stored close together on the disk.
  • the weak hash table has a built-in cache mechanism to store relevant weak hashes in a cached portion. Beneficially, latency and access rate to fetch read IOs are reduced, and throughput, as well as hit rate, is increased. Hence, the overall performance of the data deduplication solution is optimized.
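A toy stand-in for such a sorted index shows why closeness in value matters: entries close in value land on the same page, so one page read brings a weak hash into the cached portion together with its neighbors (the page size and layout are assumptions of the sketch):

```python
import bisect

class WeakHashTable:
    """Sorted key-value index: weak hash value -> block ID. Keeping the keys
    in value order means hashes close in value sit on the same page."""
    PAGE = 4  # entries per "page" (tiny, for illustration)

    def __init__(self, mapping):
        self._keys = sorted(mapping)
        self._vals = dict(mapping)

    def lookup(self, weak):
        return self._vals.get(weak)

    def page_of(self, weak):
        """Return every key on the page that holds `weak`, mimicking the set
        of entries a single disk read would bring into the cached portion."""
        i = bisect.bisect_left(self._keys, weak)
        start = (i // self.PAGE) * self.PAGE
        return self._keys[start:start + self.PAGE]
```

Looking up one weak hash effectively caches its whole page, which is what reduces subsequent disk reads.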
  • the method 100 comprises dividing the incoming data item in the data storage system into a plurality of blocks.
  • the incoming data item in the data storage system refers to a request to store a new data item (i.e., data files, database files, IO writes, and the like) as part of a backup task.
  • the incoming data item is mostly sequential, with large IOs.
  • the incoming data item is broken down into the plurality of blocks via the process of chunking (i.e., splitting a large incoming data item into small data items called data chunks).
  • the method 100 further comprises calculating a strong hash and a weak hash for each block.
  • hashing is the process of converting a given key into a new value.
  • a hash function is used to generate the new value according to a mathematical algorithm.
  • the result of the hash function (i.e., the new value) is referred to as the strong hash or, simply, a hash.
  • the strong hash is a bit string that represents the data block that is processed.
  • a corresponding strong hash and weak hash are calculated.
  • the strong hash refers to the value that uniquely describes the data block of the incoming data item.
  • the weak hash is so named because it comprises only a portion of the bits of the corresponding strong hash.
  • a weak hash may be generated by selecting, e.g., 64 bits out of the total 160 bits.
  • the method 100 further comprises selecting one or more representative weak hashes for the incoming data item.
  • the one or more representative weak hashes are selected by generating a list of weak hashes, one for each strong hash of the incoming data item, and selecting one or more weak hashes from the list of weak hashes as the one or more representative weak hashes for the incoming data item.
  • one or more representative weak hashes may be selected from the list of weak hashes in a deterministic manner, for example, by choosing the two weak hashes from the list that have the smallest values as the representative weak hashes.
  • selecting the representative weak hashes comprises selecting a predetermined number of the lowest value weak hashes.
  • the one or more representative weak hashes are selected by generating a list of weak hashes from the calculated strong hashes of each block of the incoming data item and selecting one or more of the weak hashes as the representative weak hashes.
  • selecting the representative weak hashes uses a predetermined process.
  • the predetermined process refers to a process in which no randomness is involved in the development of future states of the process. Thus, the predetermined process always produces the same output from a given starting condition or initial state.
  • the predetermined process may also refer to a probabilistic or determinative process. For example, selecting one or more representative weak hashes from the one or more weak hashes of each block of the incoming data item such that the one or more representative weak hashes have the lowest weak hash value.
  • selecting the representative weak hashes comprises selecting one or more weak hashes for which a predetermined number of the most significant bits are equal to zero.
  • the one or more representative weak hashes are selected by generating a list of weak hashes from the calculated strong hashes of each block of the incoming data item and selecting one or more of the weak hashes as the representative weak hashes.
  • selecting the representative weak hashes uses a predetermined process.
  • the predetermined process refers to a process in which no randomness is involved in the development of future states of the process. Thus, the predetermined process always produces the same output from a given starting condition or initial state.
  • the predetermined process may also refer to a probabilistic or determinative process.
  • MSBs: most significant bits
  • weak hashes with the same MSBs are located closely in the cached portion within the weak hash table.
  • the cache eviction algorithm is able to cache full pages of strong hashes, which allows the search in the weak hash table to be more efficient.
  • the predetermined number is dynamically chosen based on a hit rate for the data storage system and an amount of data in the cached portion.
  • the cache eviction algorithm dynamically decides on how many weak hashes are to be checked for entry of a weak hash in the cached portion of the weak hash table. For example, what minimum number of weak hashes are to be selected or how many MSBs are to be checked.
  • the predetermined number is based on the hit rate for the data storage system and the amount of data in the cached portion of the weak hash table, i.e., the one or more weak hashes for which the maximum number of matches are found are more likely to be placed in the cached portion of the weak hash table.
  • the cache eviction algorithm ensures that the weak hashes of the incoming data item, which are searched against the weak hash table of the data storage system, are present in the cached portion.
  • the number of accesses to fetch IO operations from the disk is reduced, and consequently, latency in the data storage system is minimized.
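The keep-or-evict decision driven by recorded match counts can be sketched as follows (the capacity bound and the least-matched-first victim rule are assumptions about one possible policy, not the claimed algorithm):

```python
class CachedPortion:
    """Cached portion of the weak hash table: every entry tracks how many
    matches were recorded for it; when the cache is full, the entry with the
    fewest recorded matches is evicted."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.matches = {}  # weak hash -> recorded match count

    def record_match(self, weak):
        if weak in self.matches:
            self.matches[weak] += 1
            return
        if len(self.matches) >= self.capacity:
            # evict the weak hash with the fewest recorded matches
            victim = min(self.matches, key=self.matches.get)
            del self.matches[victim]
        self.matches[weak] = 1
```

Frequently matched weak hashes accumulate counts and therefore stay cached, while rarely matched ones are the first to go.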
  • the method 100 further comprises searching for the representative weak hashes in the weak hash table.
  • the one or more representative weak hashes of the incoming data item (i.e., the incoming write request) are searched for in the weak hash table.
  • the weak hash table refers to a full index that maps all the weak hashes of the data storage system into the respective block IDs.
  • the searching for the representative weak hashes in the weak hash table corresponds to checking the one or more representative weak hashes of the incoming data item against the weak hashes in the weak hash table of the data storage system.
  • the method 100 further comprises recording a match between one or more of the representative weak hashes and a weak hash in the weak hash table, wherein the weak hash table comprises a cached portion, and a cache eviction algorithm is configured to determine whether to keep or evict each weak hash in the cached portion based on a number of matches recorded for the weak hash.
  • the data processing apparatus uses a built-in cache mechanism to locate the set of matches (weak hashes and their respective block IDs) closely on the disk.
  • the built-in cache mechanism results in a cached portion within the weak hash table.
  • the cached portion refers to the set of weak hashes and their respective block IDs for which matches are found against the one or more representative weak hashes of the incoming data item.
  • the representative weak hashes may be selected by a deterministic method, for example where the 6 most significant bits are 0, and therefore if the weak hashes are arranged on the disk in order of their value, then the neighborhood of each representative weak hash is highly likely to also have 0 in the 6 most significant bits. As such, if a representative weak hash is brought to the cache along with its neighboring weak hashes all those weak hashes will likely have 0 in the 6 most significant bits.
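The two deterministic selection strategies described in this disclosure (a predetermined number of lowest-value weak hashes, and weak hashes whose most significant bits are zero) can be sketched as follows; the 32-bit hash width and the parameter defaults are illustrative assumptions:

```python
def select_representatives(weak_hashes, k=4, msb_zero_bits=6, hash_bits=32):
    # Strategy 1: the k lowest-valued weak hashes (a min-hash style selection).
    lowest_k = sorted(weak_hashes)[:k]
    # Strategy 2: weak hashes whose top msb_zero_bits bits are all zero,
    # i.e. values below 2**(hash_bits - msb_zero_bits).
    threshold = 1 << (hash_bits - msb_zero_bits)
    msb_zero = [h for h in weak_hashes if h < threshold]
    return lowest_k, msb_zero
```

Because strategy 2 is deterministic on the hash value, a representative's neighbors in a table sorted by value very likely share the zeroed most significant bits, which is what makes loading a whole neighborhood into the cache worthwhile.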
  • the neighborhood of the block IDs is brought via a single hit in the weak hash table.
  • the hits to the cached portion take into account how many hits the weak hash entry brought from a strong hash cache, i.e., if the block ID of the weak hash is found and there are hits in the strong hash cache, then the hits from the strong hash cache are counted as hits to the weak hash table.
  • the weak hashes that saved access to the weak hash table due to many hits in the strong hash cache remain in the cached portion of the weak hash table for a longer time.
  • the cache eviction algorithm determines whether to keep or evict each weak hash in the cached portion based on the number of matches recorded for the weak hash. Moreover, the cache eviction algorithm decides dynamically on how many weak hashes to check for each incoming data item (e.g., how many minimum hashes to take or how many most significant bits to check) based on the current hit rate and the current amount of data in the cached portion of the weak hash table.
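A minimal sketch of an eviction policy keyed on recorded match counts, as described above. The class name, capacity parameter, and evict-the-minimum rule are assumptions; the disclosure specifies only that keep/evict decisions are based on the number of matches recorded per weak hash.

```python
class CachedPortion:
    def __init__(self, capacity):
        self.capacity = capacity
        self.matches = {}  # weak hash -> number of matches recorded

    def record_match(self, weak_hash):
        self.matches[weak_hash] = self.matches.get(weak_hash, 0) + 1

    def insert(self, weak_hash):
        if weak_hash in self.matches:
            return
        if len(self.matches) >= self.capacity:
            # Evict the weak hash with the fewest recorded matches.
            victim = min(self.matches, key=self.matches.get)
            del self.matches[victim]
        self.matches[weak_hash] = 0
```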
  • in response to a match in the weak hash table, the method 100 further comprises finding an associated strong hash from the ID table and checking the associated strong hash against one or more of the strong hashes calculated for the incoming data item.
  • the strong hash associated with the weak hashes is found from the ID table.
  • the ID table refers to a full index that maps block ID (i.e., the strong hash) into the actual address of the corresponding block.
  • the ID table includes a list of unique data blocks, sorted by the ID number of the block. Hence, the ID table includes all the block IDs of all the blocks stored in the data storage system.
  • the associated strong hash is checked against the one or more strong hashes calculated for the plurality of blocks of the incoming data item.
  • the cache eviction algorithm is able to cache full pages of strong hashes, which allows the search in the weak hash table to be more efficient.
  • the throughput and hit rate for the data storage system is increased.
  • the method 100 further comprises loading the associated strong hash and one or more neighboring strong hashes to a strong hash cache, wherein one or more strong hash values calculated for a new incoming data item are checked against the strong hash cache before the weak hash table is searched.
  • the associated strong hash corresponds to the strong hash associated with the weak hashes for which matches are found in the cached portion.
  • the one or more neighboring strong hashes correspond to the strong hashes, which are located close to the associated strong hash in the ID table.
  • the associated strong hash and the one or more neighboring strong hashes are loaded in the strong hash cache.
  • the strong hash cache refers to a small index that maps strong hashes into their respective block IDs.
  • the strong hash cache is stored in a memory and is used as an in-memory cache for strong hashes. Further, when a new incoming data item (i.e., a new write request) arrives at the data storage system, the one or more strong hash values calculated for the new incoming data item are checked against the strong hash cache of the data storage system before the weak hash table is searched. The checking of the calculated one or more strong hash values for each block of the new incoming data item against the strong hash cache corresponds to checking the calculated one or more strong hash values against the strong hashes of the strong hash cache.
  • the strong hashes of the strong hash cache comprise the associated strong hash and the one or more neighboring strong hashes.
  • the hits to the strong hash cache take into account how many hits the strong hash entry brought from the strong hash cache, i.e., if the strong hash of the strong hash cache matches the calculated one or more strong hash values of the new incoming data item, then the hits from the strong hash cache are counted as hits to the weak hash table.
  • the strong hashes that saved access to the weak hash table due to many hits in the strong hash cache remain in the cached portion of the weak hash table for a longer time.
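The lookup order described in the preceding points can be sketched as follows, with plain dicts standing in for the in-memory strong hash cache, the on-disk weak hash table, and the ID table (all names illustrative):

```python
def lookup_block(strong_hash, weak_hash, strong_cache, weak_table, id_table):
    # 1. Cheap in-memory check against the strong hash cache.
    block_id = strong_cache.get(strong_hash)
    if block_id is not None:
        return block_id                        # deduplicated via the cache
    # 2. Fall back to the (mostly on-disk) weak hash table.
    block_id = weak_table.get(weak_hash)
    if block_id is None:
        return None                            # no match: write as new data
    # 3. Verify with the strong hash from the ID table (weak hashes may collide).
    if id_table[block_id] == strong_hash:
        strong_cache[strong_hash] = block_id   # warm the cache for later items
        return block_id
    return None
```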
  • the incoming data item is written to the data storage system as a new data item.
  • the one or more representative weak hashes of the incoming data item are checked against the weak hashes of the weak hash table for a match. If no matches are found in the weak hash table, the incoming data item is identified as a non-deduplicate data item.
  • the non-duplicate data item corresponds to the data item for which no matches (or fingerprints) could be found in the data storage system.
  • the non-duplicate data item corresponds to the data item, which cannot be deduplicated because it does not exist in the data storage system.
  • a new block ID is generated for the non-duplicate data item, and the exact data item (i.e., the non-duplicate data item) is stored at an address in the data storage system as the new data item.
  • the cache eviction algorithm is configured to assign a higher priority to keep weak hashes for which a recorded match corresponds to a data item received as part of a backup task.
  • the backup task is used to backup data at a target site in order to protect and recover data in an event of data loss in a source site.
  • the cache eviction algorithm ensures that the weak hashes searched by the backup task (e.g., source-based deduplication) have higher priority to be kept in the cached portion of the weak hash table based on the number of matches recorded for the weak hashes.
  • the number of accesses to fetch IO operations from the disk is reduced, and consequently, latency in the backup task is minimized.
  • the incoming data item is received as part of a backup task.
  • the incoming data item refers to an incoming write request to store the data item (i.e., data files, database files, input-output (I/O) writes, and the like) on the data storage system.
  • the backup task is used to backup data at a target site in order to protect and recover data in an event of data loss in a source site.
  • the method 100 further comprises: receiving a second incoming data item which is not part of a backup task; dividing the second incoming data item in the data storage system into a plurality of blocks; calculating a strong hash and a weak hash for each block; searching for each weak hash in the weak hash table; and recording a match between one or more of the weak hashes and a weak hash in the weak hash table, wherein the cache eviction algorithm is configured to assign a lower priority to keep weak hashes for which a recorded match corresponds to the second data item which is not received as part of a backup task.
  • the second incoming data item refers to an incoming write request to store the data item (i.e., data files, database files, input-output (I/O) writes, and the like) on the data storage system. Moreover, the second incoming data item corresponds to small and random IO operations for generic inline deduplication.
  • the second incoming data item is broken down into the plurality of blocks. Further, the strong hash corresponding to each of the plurality of blocks is calculated. Further, one or more weak hashes corresponding to each calculated strong hash are calculated. Further, each of the one or more weak hashes is checked against the weak hashes of the weak hash table for a match.
  • the cache eviction algorithm keeps the weak hashes in the cached portion of the weak hash table. However, the cache eviction algorithm assigns a lower priority to the second incoming data item than the incoming data item because the second incoming data item is not part of the backup task.
  • the method 100 provides an improved data deduplication solution by efficiently managing the data stored in the data storage system with the help of the data indexing module.
  • By virtue of the cached portion, the number of disk-read accesses is reduced, which consequently minimizes the access rate and latency in fetching the IO operations of the data storage system.
  • the method 100 of the present disclosure can be efficiently used for multiple scenarios (e.g., source-based deduplication and generic inline deduplication) without degrading the performance (i.e., latency, throughput, access rate, and hit rate) for any scenario.
  • the method 100 allows an efficient and reliable data deduplication solution to manage duplicate data as well as non-duplicate data.
  • the non-duplicate data is stored in the data storage system as a new data item.
  • the duplicate data is not stored explicitly in the data storage system; rather, the corresponding weak hash is kept in the cached portion based on the number of matches recorded for the weak hash.
  • the method 100 of the present disclosure allows efficient utilization of the storage capacity of the data storage system.
  • steps 102 to 116 are only illustrative and other alternatives can also be provided where one or more steps are added, one or more steps are removed, or one or more steps are provided in a different sequence without departing from the scope of the claims herein.
  • the present disclosure provides a computer-readable medium configured to store instructions which, when executed by a processor, cause the processor to perform the method 100 of the aforementioned aspect.
  • the computer readable medium refers to a non-transitory computer-readable storage medium. Examples of implementation of the computer-readable media include, but are not limited to, Electrically Erasable Programmable Read-Only Memory (EEPROM), Random Access Memory (RAM), Read Only Memory (ROM), Hard Disk Drive (HDD), Flash memory, a Secure Digital (SD) card, Solid-State Drive (SSD), a computer readable storage medium, and/or CPU cache memory.
  • EEPROM Electrically Erasable Programmable Read-Only Memory
  • RAM Random Access Memory
  • ROM Read Only Memory
  • HDD Hard Disk Drive
  • SD Secure Digital
  • SSD Solid-State Drive
  • FIG. 2 is a block diagram of a data storage system, in accordance with an embodiment of the present disclosure.
  • FIG. 2 is described in conjunction with elements of FIGs. 1A and 1B, collectively.
  • With reference to FIG. 2, there is shown a block diagram 200 of a data storage system 202.
  • the data storage system 202 includes a control circuitry 204, a transceiver 206, a data processing apparatus 208, and a memory 210.
  • the data processing apparatus 208 further includes a data indexing module 208A, a data query module 208B, and a cache eviction module 208C.
  • the data storage system 202 refers to a computer storage system that stores information (i.e., data items such as data files or database files, I/O writes, etc.) in a storage medium, such as a storage disk.
  • Examples of data storage system 202 include, but are not limited to, a secondary storage system, a cloud server, a file storage system, a block storage system, an object storage system, or a combination thereof.
  • the control circuitry 204 includes a logic circuitry that may be communicatively coupled to the transceiver 206, the data processing apparatus 208, and the memory 210.
  • the control circuitry 204 controls the operations performed by the data processing apparatus 208 and the flow of data across the different modules of the data processing apparatus 208.
  • Examples of the control circuitry 204 may include, but are not limited to, a microprocessor, a microcontroller, a complex instruction set computing (CISC) processor, an application-specific integrated circuit (ASIC) processor, a reduced instruction set (RISC) processor, a very long instruction word (VLIW) processor, a central processing unit (CPU), a state machine, a data processing unit, and other processors or control circuitry.
  • the transceiver 206 includes suitable logic, circuitry, and/or interfaces that are configured to transmit/receive IO read/write operations.
  • Examples of the transceiver 206 include, but are not limited to, a transmitter/receiver antenna, an Internet-of-Things (IoT) controller, a
  • the data processing apparatus 208 refers to a computer component that uses a data structure technique to quickly retrieve records from a database file.
  • the data processing apparatus 208 may also be simply referred to as a module or a data indexing circuitry.
  • the data processing apparatus 208 includes suitable logic, circuitry, and/or interfaces that may be configured to execute the method 100 (of FIGs. 1A and 1B, collectively).
  • the data processing apparatus 208 includes the data indexing module 208A, which is configured to index the data stored in the data storage system 202.
  • the data processing apparatus 208 further includes the data query module 208B, which is configured to receive and index an incoming data item in the data storage system 202.
  • the data processing apparatus 208 further includes the cache eviction module 208C, which is configured to determine whether to keep or evict a weak hash in the cached portion of the data storage system 202.
  • the memory 210 includes a suitable logic, circuitry, and/or interfaces that may be configured to store machine code and/or instructions executable by the data processing apparatus 208. Examples of implementation of the memory 210 may include, but are not limited to, Hard Disk Drive (HDD), Flash memory, Solid-State Drive (SSD), Network Attached Storage (NAS), SSD Flash Drive Arrays, Hybrid Flash Arrays, Cloud Storage, and the like.
  • the data processing apparatus 208 executes the method 100 by efficiently managing a data item in the data storage system 202 with the help of the data indexing module 208A, the data query module 208B, and the cache eviction module 208C.
  • the operations of the data indexing module 208A include: dividing each data item in the data storage system 202 into a plurality of blocks; calculating a strong hash for each block, and generating an ID table comprising a list of strong hashes with a pointer to a location of the corresponding block; and calculating a weak hash for each strong hash and generating a weak hash table comprising a list of weak hashes with a pointer to a location of the corresponding strong hash in the ID table.
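The indexing operations above can be sketched as follows. SHA-256 as the strong hash, a 4 KB fixed block size, and taking the weak hash as a prefix of the strong hash are all illustrative assumptions, not requirements of the disclosure:

```python
import hashlib

BLOCK_SIZE = 4096  # fixed-length blocks; an assumption for this sketch

def index_data(data: bytes):
    id_table = {}         # strong hash -> location (offset) of the block
    weak_hash_table = {}  # weak hash   -> corresponding strong hash in id_table
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        strong = hashlib.sha256(block).hexdigest()
        weak = strong[:8]  # a portion of the strong hash's bits
        id_table[strong] = offset
        weak_hash_table[weak] = strong
    return id_table, weak_hash_table
```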
  • the operations of the data query module 208B includes: dividing the incoming data item in the data storage system 202 into a plurality of blocks; calculating a strong hash and a weak hash for each block; selecting one or more representative weak hashes for the incoming data item; searching for the representative weak hashes in the weak hash table; and recording a match between one or more of the representative weak hashes and a weak hash in the weak hash table.
  • the weak hash table comprises a cached portion
  • the cache eviction module 208C is configured to determine whether to keep or evict each weak hash in the cached portion based on a number of matches recorded for the weak hash.
  • selecting the representative weak hashes comprises selecting a predetermined number of the lowest value weak hashes. Further, selecting the representative weak hashes also comprises selecting one or more weak hashes for which a predetermined number of the most significant bits are equal to zero. Moreover, the predetermined number is dynamically chosen based on a hit rate for the data storage system 202 and an amount of data in the cached portion.
  • the data query module 208B in response to a match in the weak hash table, is configured to find an associated strong hash from the ID table and check the associated strong hash against one or more of the strong hashes calculated for the incoming data item.
  • the data query module 208B is further configured to load the associated strong hash, and one or more neighboring strong hashes to a strong hash cache, wherein one or more strong hash values calculated for a new incoming data item are checked against the strong hash cache before the weak hash table is searched. Further, if there are no matches in the weak hash table, the incoming data item is written to the data storage system 202 as a new data item.
  • the incoming data item is received as part of a backup task.
  • the cache eviction module 208C is configured to assign a higher priority to keep weak hashes for which a recorded match corresponds to a data item received as part of the backup task.
  • the data query module 208B is further configured to receive a second incoming data item which is not part of a backup task; divide the second incoming data item in the data storage system 202 into a plurality of blocks; calculate a strong hash and a weak hash for each block; search for each weak hash in the weak hash table; and record a match between one or more of the weak hashes and a weak hash in the weak hash table.
  • the cache eviction module 208C is configured to assign a lower priority to keep weak hashes for which a recorded match corresponds to the second data item which is not received as part of a backup task.
  • the data processing apparatus 208 provides an efficient and reliable data deduplication solution to manage duplicate data as well as non-duplicate data.
  • the non-duplicate data is stored in the data storage system as a new data item.
  • the duplicate data is not stored explicitly in the data storage system 202; rather, the corresponding weak hash is kept in the cached portion based on the number of matches recorded for the weak hash.
  • the data processing apparatus 208 helps in efficient utilization of the storage capacity of the data storage system 202.
  • FIG. 3 is an illustration that depicts various operations of a data deduplication solution, in accordance with an embodiment of the present disclosure.
  • FIG. 3 is described in conjunction with elements of FIGs. 1A, 1B, and 2.
  • a process 300 that depicts various operations of a data deduplication solution.
  • an incoming write request 302, a strong hash cache 304, a weak hash table 306, and an ID table 308.
  • the strong hash cache 304 further includes a strong hash 304A, and a block ID 304B.
  • the weak hash table 306 further includes a weak hash 306A, a block ID 306B, and a cached portion 306C.
  • the ID table 308 further includes a block ID 308A, a strong hash 308B, a ref-count 308C, and a disk address 308D.
  • the process 300 may correspond to the method 100 (of FIGs. 1A and 1B, collectively).
  • the incoming write request 302 refers to a request to store an incoming data item (i.e., data files, database files, I/O writes, and the like) in the data storage system 202.
  • the strong hash cache 304 refers to a small index that maps the strong hash 304A into the block ID 304B.
  • the strong hash cache 304 is stored in the memory 210 and is used as an in-memory cache for strong hashes.
  • the weak hash table 306 refers to a full index that maps the weak hash 306A into the block ID 306B.
  • the size of the weak hash table 306 is large, and hence, the weak hash table 306 is stored on the disk storage.
  • the weak hash table 306 is a standard key-value data structure. Typically, such a data structure is implemented using a B-tree, and hashes that are close in value are stored in nearby locations on the disk. When there is a hit, the data structure brings a set of keys and values that are located closely on the disk.
  • the weak hash table 306 has a built-in cache mechanism to store relevant weak hashes in the cached portion 306C.
  • the cached portion 306C is the main index used for deduplication. Beneficially, latency and access rate to fetch read IOs are reduced, and throughput, as well as hit rate, is increased. Hence, the overall performance of the data deduplication solution is optimized.
  • the ID table 308 refers to a full index that maps the block ID 308A of the strong hash 308B into the actual address of the data (i.e., the disk address 308D).
  • the ID table 308 includes all the block IDs of all the blocks stored in the data storage system 202. Hence, the size of the ID table 308 is large. Thus, the ID table 308 is stored on the disk storage. Further, the ID table 308 keeps a count of the references (i.e., pointers) for duplicate data items in the ref-count 308C.
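The ref-count maintenance can be sketched as follows: on a duplicate, no new data is written and only the count of references grows. The dict layout and function name are illustrative assumptions:

```python
def store_block(block_id, disk_address, id_table):
    entry = id_table.get(block_id)
    if entry is None:
        # First occurrence: record the block's address; ref-count starts at 1.
        id_table[block_id] = {"addr": disk_address, "ref_count": 1}
    else:
        # Duplicate data item: only the count of references (pointers) grows.
        entry["ref_count"] += 1
    return id_table[block_id]
```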
  • the incoming write request 302 is divided into a plurality of blocks, and strong hashes and weak hashes for each block are calculated.
  • the incoming write request 302 may include a data item (i.e., data files, database files, I/O writes, and the like) that is very large in size.
  • the data item is broken down into the plurality of blocks.
  • the dividing of the data item into the plurality of blocks may result in blocks of fixed-length (i.e., equal size blocks) or variable-length (i.e., unequal size blocks), depending upon the dividing algorithm performed.
  • strong hashes are calculated for each block of the data item (i.e., the incoming write request 302).
  • the calculated strong hashes for each block refer to a fingerprint of the corresponding block, which uniquely describes the corresponding block of the data item.
  • the calculated strong hashes for each block is a bit string that represents the corresponding block of the data item.
  • weak hashes are calculated for each of the calculated strong hashes.
  • the calculated weak hashes refer to a portion of the bits of a corresponding calculated strong hash of the incoming write request 302.
  • the calculated strong hashes for each block are checked against the strong hash cache 304.
  • the checking of the calculated strong hashes for each block against the strong hash cache 304 corresponds to checking the calculated strong hashes for each block against the strong hash 304A of the strong hash cache 304. If a match is found between the calculated strong hashes and the strong hash 304A, then the cache eviction algorithm keeps the calculated weak hashes corresponding to the strong hash 304A in the cached portion 306C for a longer time. On the other hand, if a match is not found between the calculated strong hashes and the strong hash 304A, then the control is sent to the weak hash table 306 for further processing.
  • one or more representative weak hashes selected for the incoming write request 302 are checked against the weak hash table 306.
  • the one or more representative weak hashes are selected by generating a list of weak hashes from the calculated strong hashes of each block of the incoming write request 302 and selecting one or more of the weak hashes as the representative weak hashes.
  • the checking of one or more representative weak hashes against the weak hash table 306 corresponds to checking the one or more representative weak hashes for the incoming write request 302 against the cached portion 306C of the weak hash table 306 for a match.
  • the cached portion 306C comprises weak hashes from the weak hash 306A of the weak hash table 306 based on the number of matches recorded for the weak hashes. Further, if the match is found for one or more representative weak hashes in the cached portion 306C, then an associated strong hash is retrieved from the ID table 308, and the weak hashes are kept in the cached portion 306C.
  • the associated strong hash refers to the strong hash corresponding to the weak hashes in the cached portion 306C for which matches are found. Hence, every time a weak hash is accessed, the neighborhood of the block IDs is brought via a single hit in the weak hash table 306.
  • the hits to the cached portion 306C take into account how many hits the neighborhood of strong hashes brought from the strong hash cache 304, i.e., the hits from the neighborhood of block IDs brought to the strong hash cache 304 are counted as hits to the weak hash table 306.
  • the weak hashes that saved access to the weak hash table 306 due to many hits in the strong hash cache 304 remain in the cached portion 306C of the weak hash table for a longer time.
  • the incoming write request 302 is written into the data storage system 202 as a new data item.
  • selecting the representative weak hashes comprises selecting a predetermined number of the lowest value weak hashes. Further, selecting the representative weak hashes also comprises selecting one or more weak hashes for which a predetermined number of the most significant bits are equal to zero. Moreover, the predetermined number is dynamically chosen based on a hit rate for the data storage system 202 and an amount of data in the cached portion 306C.
  • in response to a match in the weak hash table 306, the process 300 further comprises finding an associated strong hash from the ID table 308 and checking the associated strong hash against one or more of the strong hashes calculated for the incoming data item.
  • the process 300 further comprises loading the associated strong hash and one or more neighboring strong hashes to a strong hash cache 304, wherein one or more strong hash values calculated for a new incoming data item are checked against the strong hash cache 304 before the weak hash table 306 is searched. Further, if there are no matches in the weak hash table 306, the incoming data item is written to the data storage system 202 as a new data item.
  • the incoming data item is received as part of a backup task.
  • the cache eviction algorithm is configured to assign a higher priority to keep weak hashes for which a recorded match corresponds to a data item received as part of the backup task.
  • the process 300 further comprises: receiving a second incoming data item which is not part of a backup task; dividing the second incoming data item in the data storage system 202 into a plurality of blocks; calculating a strong hash and a weak hash for each block; searching for each weak hash in the weak hash table 306; and recording a match between one or more of the weak hashes and a weak hash in the weak hash table 306, wherein the cache eviction algorithm is configured to assign a lower priority to keep weak hashes for which a recorded match corresponds to the second data item which is not received as part of a backup task.
  • the typical main flow of conventional data deduplication solutions is receiving a new write request, segmenting the data using any segmentation algorithm, calculating a strong hash and a weak hash for each segment, and searching the strong hash in a given strong hash cache. If the strong hash is found in the given strong hash cache, the block ID is returned. If the strong hash is not found in the given strong hash cache, the weak hash is searched in a given weak hash table. If the weak hash is found in the given weak hash table, the strong hash is retrieved from a given ID table for comparison. If the weak hash is not found in the given weak hash table, the data cannot be deduplicated, and a new block ID is generated. However, since the given weak hash table is large and most of the given weak hash table is stored on the disk, the search requires read IOs from the disk. Thus, there is a significant impact on the latency as well as the read IO throughput.
  • the data deduplication solution of the present disclosure uses the cache eviction algorithm to keep or evict weak hashes in the cached portion 306C based on a number of matches recorded for the weak hashes.
  • the cached portion 306C of the present disclosure reduces the number of disk-read accesses, and consequently, the latency in the data deduplication solution is minimized.
  • Table 1 shows a comparison of the number of disk-read accesses for the conventional data deduplication solutions and the data deduplication solution of the present disclosure.
  • Table 1 includes the datasets, such as ‘Files 28Full’, ‘VMware 28Full’, ‘Oracle 28Full’, ‘Files 4F24Inc’, ‘VMware 4F24Inc’, and ‘Oracle 4F24Inc’.
  • Table 1 further includes the number of disk-read accesses for the conventional data deduplication solutions and the data deduplication solution of the present disclosure along with the improvement. It can be observed from Table 1 that the data deduplication solution of the present disclosure can save over 95% of the disk-read accesses for most of the datasets.
  • Table 2 shows a comparison of the deduplication ratio achieved for the source-based deduplication and the data deduplication solution of the present disclosure.
  • Table 2 includes the datasets, such as ‘Files 28Full’, ‘VMware 28Full’, ‘Oracle 28Full’, ‘Files 4F24Inc’, ‘VMware 4F24Inc’, and ‘Oracle 4F24Inc’.
  • Table 2 further includes the deduplication ratio achieved for the source-based deduplication and the data deduplication solution of the present disclosure. It can be observed from Table 2 that the deduplication ratio achieved for the data deduplication solution of the present disclosure is very similar to the deduplication ratio achieved for the source-based deduplication. Thus, the deduplication ratio is not degraded for the data deduplication solution of the present disclosure.
  • Table 1 Comparison of the number of disk-read accesses for the conventional data deduplication solutions and the data deduplication solution of the present disclosure.
  • Table 2 Comparison of deduplication ratio achieved for the source-based deduplication and the data deduplication solution of the present disclosure.
  • “Inc” refers to a weekly full backup over 4 weeks with daily incremental backups in between, i.e., 4 full and 24 incremental backups over 28 days. “Full” refers to a full backup on each of the 28 days.
  • the process 300 corresponds to the method for managing the data item in the data storage system 202 to provide an improved data deduplication solution.
  • By virtue of the cached portion 306C, the number of disk-read accesses is reduced, which consequently minimizes the access rate and latency in fetching the IO operations of the data storage system 202.
  • the process 300 can be efficiently used for multiple scenarios (e.g., source-based deduplication and generic inline deduplication) without degrading the performance (i.e., latency, throughput, access rate, and hit rate) for any scenario.
  • the process 300 allows an efficient and reliable data deduplication solution to manage duplicate data as well as non-duplicate data.
  • the non-duplicate data is stored in the data storage system 202 as a new data item.
  • the duplicate data is not stored explicitly in the data storage system 202; rather, the corresponding weak hashes are kept in the cached portion 306C based on the number of matches recorded for the weak hashes.
  • the process 300 allows efficient utilization of the storage capacity of the data storage system 202.
  • Various embodiments of the disclosure thus provide a computer-implemented method (i.e., the method 100) of data management in a data storage system 202.
  • the method 100 comprises: dividing each data item in the data storage system 202 into a plurality of blocks; calculating a strong hash for each block and generating an ID table comprising a list of strong hashes with a pointer to a location of the corresponding block; calculating a weak hash for each strong hash and generating a weak hash table comprising a list of weak hashes with a pointer to a location of the corresponding strong hash in the ID table; and in response to receiving an incoming data item: dividing the incoming data item in the data storage system 202 into a plurality of blocks; calculating a strong hash and a weak hash for each block; selecting one or more representative weak hashes for the incoming data item; searching for the representative weak hashes in the weak hash table; and recording a match between one or more of the representative weak hashes and a weak hash in the weak hash table.
  • the data processing apparatus 208 comprises a data indexing module 208A configured to: divide each data item in the data storage system 202 into a plurality of blocks; calculate a strong hash for each block and generate an ID table comprising a list of strong hashes with a pointer to a location of the corresponding block; and calculate a weak hash for each strong hash and generate a weak hash table comprising a list of weak hashes with a pointer to a location of the corresponding strong hash in the ID table.
  • the data processing apparatus 208 further comprises a data query module 208B configured to receive an incoming data item and, in response to receiving an incoming data item: divide the incoming data item in the data storage system 202 into a plurality of blocks; calculate a strong hash and a weak hash for each block; select one or more representative weak hashes for the incoming data item; search for the representative weak hashes in the weak hash table; and record a match between one or more of the representative weak hashes and a weak hash in the weak hash table.
  • the data processing apparatus 208 further comprises a cache eviction module 208C, wherein the weak hash table comprises a cached portion, and the cache eviction module 208C is configured to determine whether to keep or evict each weak hash in the cached portion based on a number of matches recorded for the weak hash.

Abstract

A computer-implemented method of data management in a data storage system includes dividing each data item into a plurality of blocks, calculating a strong hash and a weak hash for each block, and generating an ID table and a weak hash table. In response to receiving an incoming data item, the method further includes dividing the incoming data item into a plurality of blocks, calculating a strong hash and a weak hash for each block, selecting one or more representative weak hashes, searching for the representative weak hashes in the weak hash table, and recording a match between one or more of the representative weak hashes and a weak hash in the weak hash table. The weak hash table comprises a cached portion, and a cache eviction algorithm is configured to determine whether to keep or evict each weak hash in the cached portion based on a number of matches recorded for the weak hash. Thus, the number of accesses to the disk is reduced.

Description

DEDUPLICATION USING CACHE EVICTION FOR STRONG AND WEAK HASHES
TECHNICAL FIELD
The present disclosure relates generally to the field of data deduplication; and more specifically, to a method of data management in a data storage system, and a data processing apparatus for a data storage system.
BACKGROUND
Generally, a data backup operation is used to create a backup of data at a target site in order to protect and recover data in an event of data loss in a source site. Examples of the event of data loss may include, but are not limited to, data corruption, hardware or software failure in the source site, accidental deletion of data, hacking, or malicious attack. Thus, for safety reasons, a separate backup storage is extensively used to store the backup of the data present in the source site. However, duplicate copies of data, which are either shared over a network or stored in a data storage system, may unnecessarily increase the storage capacity requirements of the data storage system. Typically, data deduplication solutions are required to detect and eliminate duplicate copies of data and thus, significantly decrease storage capacity requirements.
Conventionally, a typical process flow of conventional data deduplication solutions is: receiving a new write request, segmenting the data using a segmentation algorithm, calculating a strong hash and a weak hash for each segment, and searching for the strong hash in a given strong hash cache. If the strong hash is found in the given strong hash cache, the block identity (ID) is returned. If the strong hash is not found in the given strong hash cache, the weak hash is searched in a given weak hash table. If the weak hash is found in the given weak hash table, the strong hash is retrieved from a given ID table for comparison. If the weak hash is not found in the given weak hash table, the data cannot be deduplicated, and a new block ID is generated. However, since the given weak hash table is large and most of it is stored on the disk, the search requires read IOs from the disk. Hence, there is a significant impact on the latency as well as the read-IO throughput. Further, in certain scenarios, the conventional data deduplication solutions may use a separate memory, which is not part of the given weak hash table, to evaluate candidates retrieved from the given weak hash table. However, the use of the separate memory increases the storage capacity requirements of the data storage system and is not desirable. Moreover, certain conventional data deduplication solutions make a trade-off between several performance aspects while degrading the performance for other application scenarios. For example, one conventional data deduplication solution may be optimized for use in an online deduplication scenario, which provides high throughput but with relatively high latency. Thus, such a conventional data deduplication solution is not preferred for a source-based deduplication scenario, in which very low latency is required.
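The conventional lookup order described above can be sketched as follows. This is an illustrative sketch that uses in-memory dictionaries as stand-ins for the strong hash cache, weak hash table, and ID table (in a real system the latter two are largely on disk, which is exactly where the read-IO cost arises), not the flow of any specific product:

```python
import hashlib

def lookup(block, strong_cache, weak_table, id_table):
    """Conventional deduplication lookup order: strong hash cache first,
    then the weak hash table, then the ID table for confirmation.
    Simplified: each weak hash maps to a single candidate block ID."""
    strong = hashlib.sha1(block).digest()        # strong hash of the segment
    weak = strong[:8]                            # weak hash: a portion of the strong hash
    if strong in strong_cache:                   # hit in the in-memory strong hash cache
        return strong_cache[strong]              # return the existing block ID
    if weak in weak_table:                       # otherwise search the weak hash table
        candidate_id = weak_table[weak]
        if id_table[candidate_id] == strong:     # retrieve strong hash from the ID table
            strong_cache[strong] = candidate_id  # promote to the strong hash cache
            return candidate_id
    return None                                  # no match: caller generates a new block ID
```

In practice the `weak_table` lookup is the expensive step, since most of that index resides on disk.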
Therefore, there exists a technical problem of inefficient and ineffective data deduplication solutions that manifest low performance under multiple application scenarios (e.g., source-based deduplication and generic inline deduplication).
Therefore, in light of the foregoing discussion, there exists a need to overcome the aforementioned drawbacks associated with conventional data deduplication solutions.
SUMMARY
The present disclosure provides a method of data management in a data storage system, and a data processing apparatus for a data storage system. The present disclosure provides a solution to the existing problem of inefficiency and unreliability associated with conventional data deduplication solutions, where the problem is compounded by the fact that the existing solutions compromise performance in terms of parameters such as latency, throughput, access rate, and hit rate, for example, under multiple application scenarios (e.g., source-based deduplication and generic inline deduplication). An aim of the present disclosure is to provide a solution that overcomes, at least partially, the problems encountered in the prior art, and to provide an improved data deduplication solution that efficiently utilizes a cache eviction process, which manifests high performance in terms of low latency, high throughput, low access rate, and high hit rate as compared to existing systems. One or more objects of the present disclosure are achieved by the solutions provided in the enclosed independent claims. Advantageous implementations of the present disclosure are further defined in the dependent claims.
In one aspect, the present disclosure provides a computer-implemented method of data management in a data storage system. The method comprises: dividing each data item in the data storage system into a plurality of blocks; calculating a strong hash for each block and generating an ID table comprising a list of strong hashes with a pointer to a location of the corresponding block; and calculating a weak hash for each strong hash and generating a weak hash table comprising a list of weak hashes with a pointer to a location of the corresponding strong hash in the ID table. In response to receiving an incoming data item, the method further comprises: dividing the incoming data item in the data storage system into a plurality of blocks; calculating a strong hash and a weak hash for each block; selecting one or more representative weak hashes for the incoming data item; searching for the representative weak hashes in the weak hash table; and recording a match between one or more of the representative weak hashes and a weak hash in the weak hash table, wherein the weak hash table comprises a cached portion, and a cache eviction algorithm is configured to determine whether to keep or evict each weak hash in the cached portion based on a number of matches recorded for the weak hash.
The method of the present disclosure provides an improved data deduplication solution by efficiently managing the data stored in the data storage system. By virtue of the cached portion, the number of disk-read accesses is reduced, which consequently minimizes the access rate and latency in fetching the IO operations of the data storage system. Further, the method of the present disclosure can be efficiently used for multiple scenarios (e.g., source-based deduplication and generic inline deduplication) without degrading or compromising the performance (i.e., latency, throughput, access rate, and hit rate) for any scenario. Moreover, the method allows an efficient and reliable data deduplication solution to manage duplicate data as well as non-duplicate data. The non-duplicate data is stored in the data storage system as a new data item. On the other hand, the duplicate data is not stored explicitly in the data storage system; rather, the corresponding weak hash is kept in the cached portion based on the number of matches recorded for the weak hash. Thus, the method of the present disclosure allows efficient utilization of the storage capacity of the data storage system.
In an implementation form, selecting the representative weak hashes comprises selecting a predetermined number of the lowest value weak hashes.
Beneficially, the selection of the representative weak hashes is easy and deterministic because, for a given set of weak hash values, the same lowest-value weak hashes are selected as the representative weak hashes every time.
In a further implementation form, selecting the representative weak hashes comprises selecting one or more weak hashes for which a predetermined number of the most significant bits are equal to zero.
Beneficially, weak hashes with the same most significant bits (MSBs) are located closely in the cached portion within the weak hash table. Thus, the cache eviction algorithm is able to cache full pages of strong hashes, which in turn makes the search in the weak hash table more efficient.
In a further implementation form, the predetermined number is dynamically chosen based on a hit rate for the data storage system and an amount of data in the cached portion.
Beneficially, the cache eviction (using the cache eviction algorithm) ensures that the weak hashes of the incoming data item, which are searched against the weak hash table of the data storage system, are present in the cached portion. Thus, the number of accesses to fetch IO operations from the disk is reduced, and consequently, latency in the data storage system is minimized.
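As a rough illustration of eviction driven by recorded match counts, the cached portion could be sketched as follows. The capacity bound, the dictionary layout, and the evict-the-fewest-matches policy are all assumptions for illustration, not the claimed cache eviction algorithm:

```python
class CachedWeakHashPortion:
    """Illustrative cached portion of a weak hash table: weak hashes with
    the fewest recorded matches are evicted first when the cache is full."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.matches = {}  # weak hash -> number of recorded matches

    def record_match(self, weak_hash) -> None:
        """Record a match for a weak hash; evict the coldest entry if needed."""
        self.matches[weak_hash] = self.matches.get(weak_hash, 0) + 1
        if len(self.matches) > self.capacity:
            coldest = min(self.matches, key=self.matches.get)  # fewest matches
            del self.matches[coldest]                          # evict it
```

A frequently matched weak hash therefore tends to remain in the cached portion, keeping its lookups off the disk.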
In a further implementation form, in response to a match in the weak hash table, the method further comprises finding an associated strong hash from the ID table and checking the associated strong hash against one or more of the strong hashes calculated for the incoming data item.
By virtue of checking the associated strong hash against one or more of the strong hashes calculated for the incoming data item, the cache eviction algorithm caches full pages of strong hashes, which allows the search in the weak hash table to be more efficient. Thus, the throughput and hit rate for the data storage system are increased.
In a further implementation form, the method further comprises: loading the associated strong hash and one or more neighboring strong hashes to a strong hash cache, wherein one or more strong hash values calculated for a new incoming data item are checked against the strong hash cache before the weak hash table is searched.
By virtue of the associated strong hash and the one or more neighboring strong hashes in the strong hash cache, the strong hashes that saved access to the weak hash table due to many hits in the strong hash cache remain in the cached portion of the weak hash table for a longer time.
In a further implementation form, if there are no matches in the weak hash table, the incoming data item is written to the data storage system as a new data item.
Beneficially, the incoming data item is identified as a non-duplicate data item and is stored in the data storage system as a new data item.
In a further implementation form, the cache eviction algorithm is configured to assign a higher priority to keep weak hashes for which a recorded match corresponds to a data item received as part of a backup task.
Beneficially, the number of accesses to fetch IO operations from the disk is reduced, and consequently, latency in the backup task is minimized.
In a further implementation form, the incoming data item is received as part of a backup task.
Beneficially, no additional hardware is required to perform deduplication (e.g., source-based deduplication), and bandwidth, as well as storage capacity requirements, are reduced.
In a further implementation form, the method further comprises: receiving a second incoming data item which is not part of a backup task; dividing the second incoming data item in the data storage system into a plurality of blocks; calculating a strong hash and a weak hash for each block; searching for each weak hash in the weak hash table; and recording a match between one or more of the weak hashes and a weak hash in the weak hash table, wherein the cache eviction algorithm is configured to assign a lower priority to keep weak hashes for which a recorded match corresponds to the second data item which is not received as part of a backup task.
By virtue of the second incoming data item, the incoming write requests, which are not part of the backup task, can be efficiently stored in the data storage system.
In another aspect, the present disclosure provides a computer-readable medium configured to store instructions which, when executed by a processor, cause the processor to perform the method of aforementioned aspect.
The computer-readable medium achieves all the advantages and effects of the respective method of the present disclosure.
In yet another aspect, the present disclosure provides a data processing apparatus for a data storage system. The data processing apparatus comprises a data indexing module, a data query module, and a cache eviction module. The data indexing module is configured to: divide each data item in the data storage system into a plurality of blocks; calculate a strong hash for each block and generate an ID table comprising a list of strong hashes with a pointer to a location of the corresponding block; and calculate a weak hash for each strong hash and generate a weak hash table comprising a list of weak hashes with a pointer to a location of the corresponding strong hash in the ID table. The data query module is configured to receive an incoming data item and, in response to receiving an incoming data item: divide the incoming data item in the data storage system into a plurality of blocks; calculate a strong hash and a weak hash for each block; select one or more representative weak hashes for the incoming data item; search for the representative weak hashes in the weak hash table; and record a match between one or more of the representative weak hashes and a weak hash in the weak hash table. The weak hash table comprises a cached portion, and the cache eviction module is configured to determine whether to keep or evict each weak hash in the cached portion based on a number of matches recorded for the weak hash.
The data processing apparatus of the present disclosure performs the method of aforementioned aspect by efficiently indexing the weak hashes in the cached portion of the weak hash table. Further, the data processing apparatus provides an efficient and reliable data deduplication solution to manage duplicate data as well as non-duplicate data. The non-duplicate data is stored in the data storage system as a new data item. On the other hand, the duplicate data is not stored explicitly in the data storage system; rather, the corresponding weak hash is kept in the cached portion based on the number of matches recorded for the weak hash. Thus, the data processing apparatus helps in the efficient utilization of the storage capacity of the data storage system.
In an implementation form, selecting the representative weak hashes comprises selecting a predetermined number of the lowest value weak hashes.
Beneficially, the selection of the representative weak hashes becomes easy and deterministic because, for a given set of weak hash values, the same lowest-value weak hashes are selected as the representative weak hashes every time.
In a further implementation form, selecting the representative weak hashes comprises selecting one or more weak hashes for which predetermined number of the most significant bits are equal to zero.
Beneficially, weak hashes with the same most significant bits (MSBs) are located closely in the cached portion within the weak hash table. Thus, the cache eviction module is able to cache full pages of strong hashes, which allows the search in the weak hash table to be more efficient.
In a further implementation form, the predetermined number is dynamically chosen based on a hit rate for the data storage system and an amount of data in the cached portion.
Beneficially, the data query module ensures that the weak hashes of the incoming data item, which are searched against the weak hash table of the data storage system, are present in the cached portion. Thus, the number of accesses to fetch IO operations from the disk is reduced, and consequently, latency in the data storage system is minimized.
In a further implementation form, the data query module is further configured, in response to a match in the weak hash table, to find an associated strong hash from the ID table and check the associated strong hash against one or more of the strong hashes calculated for the incoming data item.
By virtue of checking the associated strong hash against one or more of the strong hashes calculated for the incoming data item, the cache eviction module is able to cache full pages of strong hashes, which allows the search in the weak hash table to be more efficient. Thus, the throughput and hit rate for the data storage system are increased.
In a further implementation form, the data query module is further configured to load the associated strong hash and one or more neighboring strong hashes to a strong hash cache, wherein one or more strong hash values calculated for a new incoming data item are checked against the strong hash cache before the weak hash table is searched.
By virtue of the associated strong hash and the one or more neighboring strong hashes in the strong hash cache, the strong hashes that saved access to the weak hash table due to many hits in the strong hash cache remain in the cached portion of the weak hash table for a longer time.
In a further implementation form, if there are no matches in the weak hash table, the incoming data item is written to the data storage system as a new data item.
Beneficially, the incoming data item is identified as a non-duplicate data item and is stored in the data storage system as a new data item.
In a further implementation form, the cache eviction module is configured to assign a higher priority to keep weak hashes for which a recorded match corresponds to a data item received as part of a backup task.
Beneficially, the number of accesses to fetch IO operations from the disk is reduced, and consequently, latency in the backup task is minimized.
In a further implementation form, the incoming data is received as part of a backup task.
Beneficially, no additional hardware is required to perform deduplication (e.g., source-based deduplication), and bandwidth, as well as storage capacity requirements, are reduced.
In a further implementation form, the data query module is further configured to: receive a second incoming data item which is not part of a backup task; divide the second incoming data item in the data storage system into a plurality of blocks; calculate a strong hash and a weak hash for each block; search for each weak hash in the weak hash table; and record a match between one or more of the weak hashes and a weak hash in the weak hash table; wherein the cache eviction module is configured to assign a lower priority to keep weak hashes for which a recorded match corresponds to the second data item which is not received as part of a backup task.
By virtue of the second incoming data item, the incoming write requests, which are not part of the backup task, can be efficiently stored in the data storage system.
It has to be noted that all devices, elements, circuitry, units and means described in the present application could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof. It will be appreciated that features of the present disclosure are susceptible to being combined in various combinations without departing from the scope of the present disclosure as defined by the appended claims.
Additional aspects, advantages, features and objects of the present disclosure would be made apparent from the drawings and the detailed description of the illustrative implementations construed in conjunction with the appended claims that follow.
BRIEF DESCRIPTION OF THE DRAWINGS
The summary above, as well as the following detailed description of illustrative embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the present disclosure, exemplary constructions of the disclosure are shown in the drawings. However, the present disclosure is not limited to specific methods and instrumentalities disclosed herein. Moreover, those skilled in the art will understand that the drawings are not to scale. Wherever possible, like elements have been indicated by identical numbers.
Embodiments of the present disclosure will now be described, by way of example only, with reference to the following diagrams wherein:
FIG. 1A and FIG. 1B, collectively, is a flowchart of a method of data management in a data storage system, in accordance with an embodiment of the present disclosure;
FIG. 2 is a block diagram of a data storage system, in accordance with an embodiment of the present disclosure; and
FIG. 3 is an illustration that depicts various operations of a data deduplication solution, in accordance with an embodiment of the present disclosure.
In the accompanying drawings, an underlined number is employed to represent an item over which the underlined number is positioned or an item to which the underlined number is adjacent. A non-underlined number relates to an item identified by a line linking the nonunderlined number to the item. When a number is non-underlined and accompanied by an associated arrow, the non-underlined number is used to identify a general item at which the arrow is pointing.
DETAILED DESCRIPTION OF EMBODIMENTS
The following detailed description illustrates embodiments of the present disclosure and ways in which they can be implemented. Although some modes of carrying out the present disclosure have been disclosed, those skilled in the art would recognize that other embodiments for carrying out or practicing the present disclosure are also possible.
FIG. 1A and FIG. 1B, collectively, is a flowchart of a method of data management in a data storage system, in accordance with an embodiment of the present disclosure. With reference to FIG. 1A and FIG. 1B, there is shown a method 100. The method 100 is performed by a data processing apparatus described in detail, for example, in FIG. 2. The method 100 includes steps 102 to 116. At step 102, the method 100 comprises dividing each data item in the data storage system into a plurality of blocks. The amount of data items (i.e., data files, database files, IO writes, and the like) stored in the data storage system can be huge. In order to efficiently manage such huge data, each data item in the data storage system is broken down into a plurality of blocks. The process of dividing is also referred to as chunking of data, i.e., splitting a large data item into small data items called chunks. The plurality of blocks may also be referred to as data segments. Further, dividing each data item into the plurality of blocks may result in blocks of fixed-length (i.e., equal size blocks) or variable-length (i.e., unequal size blocks), depending upon the way chunking is performed.
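As a minimal illustration of the fixed-length case, chunking can be sketched as follows; the 4 KiB block size is an assumption for illustration only, and variable-length, content-defined chunking would require a different algorithm:

```python
def split_into_blocks(data: bytes, block_size: int = 4096):
    """Split a data item into fixed-length blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]
```

For example, a 10,000-byte data item splits into two 4,096-byte blocks and one 1,808-byte tail block.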
At step 104, the method 100 further comprises calculating a strong hash for each block and generating an ID table comprising a list of strong hashes with a pointer to a location of the corresponding block. For example, the ID table may include a mapping from an ID to a strong hash and the location on the disk where the block corresponding to the strong hash is located. In some examples, the IDs are monotonically rising so subsequent entries will have consecutive IDs. As such, if an ID matches a data item, it can be expected that the subsequent IDs will point to consecutive data items. Generally, hashing is the process of converting a given key into a new value. A hash function is used to generate the new value according to a mathematical algorithm. The result of the hash function (i.e., the new value) is referred to as the strong hash or, simply, a hash. The strong hashes may also be referred to as fingerprints of data segments (i.e., the plurality of blocks) since each hash may be of the same length but may be unique based on the data block. For example, a strong hash of a data block is a hash which has a very high probability of being unique for the data block. The probability may be so high that no two blocks of data stored on a given storage system will have the same ID even over years of usage. Alternatively stated, the strong hash refers to the value that uniquely describes the data block. As such, two data blocks with the same strong hash are identical with such a high probability that in a real practical system, they will be considered identical. In other words, the strong hash is a bit string that represents the data block that is processed. If a given data block is processed by a given hashing algorithm and later if the same hashing algorithm is applied on the same data block, then the same strong hash is created each time. Thus, if the same copies of data segments arrive, then the same strong hash is generated for all the copies. 
Further, the ID table is generated after calculating a strong hash for each block. The ID table refers to a full index that maps block ID (i.e., the strong hash) into the actual address of the corresponding block. The ID table includes a list of unique data blocks, sorted by the ID number of the block. Once a new data block is found, it is added to the end of the list with a new sequence number, which is the ID number of the data block. Hence, the ID table includes all the block IDs of all the blocks stored in the data storage system. Since the ID table comprises the list of strong hashes, which are unique in nature, when the same strong hash is generated for different blocks, the blocks are identified as duplicate blocks and are not stored repetitively. Thus, the size of the ID table is reduced.
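The ID table logic described above can be sketched as follows, assuming SHA-1 as the 160-bit strong hash (the disclosure does not mandate a specific hash function) and a plain in-memory list for the table; the duplicate check via a dictionary is a simplification of the full deduplication flow:

```python
import hashlib

class IDTable:
    """Maps monotonically rising block IDs to (strong hash, disk location)."""

    def __init__(self):
        self.entries = []  # index == block ID, so new blocks get consecutive IDs
        self.by_hash = {}  # strong hash -> block ID, to detect duplicate blocks

    def add(self, block: bytes, location: int) -> int:
        """Return the existing ID for a duplicate block, or append a new entry."""
        strong = hashlib.sha1(block).digest()
        if strong in self.by_hash:       # same strong hash: block is a duplicate
            return self.by_hash[strong]  # not stored repetitively
        block_id = len(self.entries)     # new block appended at the end of the list
        self.entries.append((strong, location))
        self.by_hash[strong] = block_id
        return block_id
```

Because IDs are assigned in arrival order, consecutive blocks of a sequential write receive consecutive IDs, as noted above.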
At step 106, the method 100 further comprises calculating a weak hash for each strong hash and generating a weak hash table comprising a list of weak hashes with a pointer to a location of the corresponding strong hash in the ID table. The weak hash is selected by generating a list of weak hashes for each strong hash and selecting one or more weak hashes from the list of weak hashes as the weak hash. The weak hash is so named because it comprises only a portion of the bits of a corresponding strong hash. For example, for a strong hash having 160 bits, a weak hash may be generated by selecting, e.g., 64 bits out of the total 160 bits. Furthermore, the weak hash may be selected from the list of weak hashes in a deterministic manner, for example, by choosing the two weak hashes from the list of weak hashes that have the minimal value. Further, the weak hash table is generated after calculating the weak hash for each strong hash. The weak hash table refers to a full index that maps a weak hash value into a block ID. The weak hash table is the main index used for deduplication. The weak hash table is a standard key-value data structure. Typically, such a data structure is implemented using a B-tree, and weak hashes that are close in value are stored at nearby locations on the disk. When there is a hit, the data structure returns a set of keys and values that are located close together on the disk. The weak hash table has a built-in cache mechanism to store relevant weak hashes in a cached portion. Beneficially, the latency and access rate to fetch read IOs are reduced, and the throughput, as well as the hit rate, is increased. Hence, the overall performance of the data deduplication solution is optimized.
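Following the 160-bit/64-bit example above, the weak hash derivation and the weak hash table mapping can be sketched as follows; taking the first 8 bytes of the strong hash is one possible deterministic choice, assumed here purely for illustration:

```python
def weak_hash_of(strong_hash: bytes) -> bytes:
    """Derive a 64-bit weak hash as a fixed portion of a 160-bit strong hash."""
    return strong_hash[:8]  # assumed choice: the first 8 of 20 bytes

def build_weak_hash_table(id_table_entries):
    """Weak hash table: maps each weak hash to the location (block ID) of the
    corresponding strong hash in the ID table."""
    return {weak_hash_of(strong): block_id
            for block_id, (strong, _location) in enumerate(id_table_entries)}
```

A lookup in this table thus yields a pointer into the ID table, from which the full strong hash can be retrieved for confirmation.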
At step 108, in response to receiving an incoming data item, the method 100 comprises dividing the incoming data item in the data storage system into a plurality of blocks. The incoming data item in the data storage system refers to a request to store a new data item (i.e., data files, database files, IO writes, and the like) as part of a backup task. The incoming data item is typically sequential and arrives in large IOs. In order to efficiently manage the incoming data item (or the incoming write request), the incoming data item is broken down into the plurality of blocks via the process of chunking (i.e., splitting a large incoming data item into small data items called data chunks).
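The chunking step can be sketched with a minimal fixed-size splitter. The 4096-byte block size is an illustrative assumption; variable-length (content-defined) chunking could be substituted without changing the rest of the pipeline:

```python
def chunk_fixed(data: bytes, block_size: int = 4096) -> list[bytes]:
    """Split an incoming data item into fixed-size blocks.

    The last block may be shorter than block_size; concatenating the
    blocks reconstructs the original data item exactly."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]
```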
At step 110, in response to receiving the incoming data item, the method 100 further comprises calculating a strong hash and a weak hash for each block. Generally, hashing is the process of converting a given key into a new value. A hash function is used to generate the new value according to a mathematical algorithm. The result of the hash function (i.e., the new value) is referred to as the strong hash or, simply, a hash. In other words, the strong hash is a bit string that represents the data block that is processed. Here, for each of the plurality of blocks generated for the incoming data item, a corresponding strong hash and weak hash are calculated. Alternatively stated, the strong hash refers to the value that uniquely describes the data block of the incoming data item. Further, the weak hash is so named because it comprises only a portion of the bits of the corresponding strong hash. For example, for a strong hash having 160 bits, a weak hash may be generated by selecting, e.g., 64 bits out of the total 160 bits.
At step 112, in response to receiving the incoming data item, the method 100 further comprises selecting one or more representative weak hashes for the incoming data item. The one or more representative weak hashes are selected by generating a list of weak hashes, one for each strong hash of the incoming data item, and selecting one or more weak hashes from the list of weak hashes as the one or more representative weak hashes for the incoming data item. Furthermore, one or more representative weak hashes may be selected from the list of weak hashes in a deterministic manner. For example, choosing two representative weak hashes from the list of weak hashes such that the two representative weak hashes have minimal value.
In accordance with an embodiment, selecting the representative weak hashes comprises selecting a predetermined number of the lowest value weak hashes. The one or more representative weak hashes are selected by generating a list of weak hashes from the calculated strong hashes of each block of the incoming data item and selecting one or more of the weak hashes as the representative weak hashes. Furthermore, selecting the representative weak hashes uses a predetermined process. The predetermined process refers to a process in which no randomness is involved in the development of future states of the process. Thus, the predetermined process always produces the same output from a given starting condition or initial state; it may also be referred to as a deterministic process. For example, one or more representative weak hashes are selected from the one or more weak hashes of each block of the incoming data item such that the one or more representative weak hashes have the lowest weak hash values.
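The lowest-value selection rule can be sketched as follows (the function name and the default of two representatives are illustrative assumptions):

```python
def select_min_hashes(weak_hashes: list[int], k: int = 2) -> list[int]:
    """Pick the k lowest-valued weak hashes as representatives.

    The selection depends only on the hash values themselves, so the same
    data item always yields the same representatives (deterministic)."""
    return sorted(set(weak_hashes))[:k]
```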
In accordance with an embodiment, selecting the representative weak hashes comprises selecting one or more weak hashes for which a predetermined number of the most significant bits are equal to zero. The one or more representative weak hashes are selected by generating a list of weak hashes from the calculated strong hashes of each block of the incoming data item and selecting one or more of the weak hashes as the representative weak hashes. Furthermore, selecting the representative weak hashes uses a predetermined process. The predetermined process refers to a process in which no randomness is involved in the development of future states of the process. Thus, the predetermined process always produces the same output from a given starting condition or initial state; it may also be referred to as a deterministic process. For example, one or more representative weak hashes are selected from the one or more weak hashes of each block of the incoming data item such that the predetermined number of the most significant bits (MSBs) of the one or more weak hashes are equal to zero. Beneficially, weak hashes with the same MSBs are located close together in the cached portion within the weak hash table. Thus, the cache eviction algorithm is able to cache full pages of strong hashes, which allows the search in the weak hash table to be more efficient.
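The MSB-zero selection rule can be sketched as follows, assuming a 64-bit weak hash (the width and the default of 6 zero bits are illustrative assumptions):

```python
WEAK_HASH_BITS = 64  # assumed weak-hash width

def select_msb_zero(weak_hashes: list[int], zero_bits: int = 6) -> list[int]:
    """Keep weak hashes whose top `zero_bits` most significant bits are all zero.

    Such hashes cluster at the low end of the value range, so when the weak
    hash table is laid out on disk in value order they land on nearby pages."""
    return [w for w in weak_hashes if w >> (WEAK_HASH_BITS - zero_bits) == 0]
```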
In an implementation, the predetermined number is dynamically chosen based on a hit rate for the data storage system and an amount of data in the cached portion. The cache eviction algorithm dynamically decides how many weak hashes are to be checked for entry of a weak hash into the cached portion of the weak hash table, for example, what minimum number of weak hashes are to be selected or how many MSBs are to be checked. The predetermined number is based on the hit rate for the data storage system and the amount of data in the cached portion of the weak hash table, i.e., the one or more weak hashes for which the maximum number of matches are found are more likely to be placed in the cached portion of the weak hash table. Beneficially, the cache eviction algorithm ensures that the weak hashes of the incoming data item, which are searched against the weak hash table of the data storage system, are present in the cached portion. Thus, the number of accesses to fetch IO operations from the disk is reduced, and consequently, latency in the data storage system is minimized.
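One possible shape of this dynamic tuning is sketched below. The thresholds and the adjustment policy are illustrative assumptions only; the disclosure specifies the inputs (hit rate and cached-portion fill), not a concrete rule:

```python
def tune_selection(current: int, hit_rate: float, cache_fill: float,
                   target_hit_rate: float = 0.8, max_fill: float = 0.9) -> int:
    """Adjust how many weak hashes are checked per incoming data item.

    A low hit rate with cache headroom suggests sampling more weak hashes;
    a nearly full cached portion suggests sampling fewer to limit churn."""
    if cache_fill >= max_fill and current > 1:
        return current - 1
    if hit_rate < target_hit_rate and cache_fill < max_fill:
        return current + 1
    return current
```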
At step 114, in response to receiving the incoming data item, the method 100 further comprises searching for the representative weak hashes in the weak hash table. The one or more representative weak hashes of the incoming data item (i.e., the incoming write request) are searched in the weak hash table of the data storage system for a match. The weak hash table refers to a full index that maps all the weak hashes of the data storage system into the respective block IDs. The searching for the representative weak hashes in the weak hash table corresponds to checking the one or more representative weak hashes of the incoming data item against the weak hashes in the weak hash table of the data storage system.
At step 116, in response to receiving the incoming data item, the method 100 further comprises recording a match between one or more of the representative weak hashes and a weak hash in the weak hash table, wherein the weak hash table comprises a cached portion, and a cache eviction algorithm is configured to determine whether to keep or evict each weak hash in the cached portion based on a number of matches recorded for the weak hash. When there is a hit (i.e., a match between the one or more representative weak hashes of the incoming data item and the weak hashes of the weak hash table), the data processing apparatus uses a built-in cache mechanism to locate the set of matches (weak hashes and their respective block IDs) close together on the disk. The built-in cache mechanism results in a cached portion within the weak hash table. The cached portion refers to the set of weak hashes and their respective block IDs for which matches are found against the one or more representative weak hashes of the incoming data item. The representative weak hashes may be selected by a deterministic method, for example, one in which the 6 most significant bits are 0; therefore, if the weak hashes are arranged on the disk in order of their value, the neighborhood of each representative weak hash is highly likely to also have 0 in the 6 most significant bits. As such, if a representative weak hash is brought to the cache along with its neighboring weak hashes, all those weak hashes will likely have 0 in the 6 most significant bits. Hence, every time a weak hash is accessed, the neighborhood of the block IDs is brought via a single hit in the weak hash table. The hits to the cached portion take into account how many hits the weak hash entry brought from a strong hash cache, i.e., if the block ID of the weak hash is found and there are hits in the strong hash cache, then the hits from the strong hash cache are counted as hits to the weak hash table.
Alternatively stated, the weak hashes that saved access to the weak hash table due to many hits in the strong hash cache remain in the cached portion of the weak hash table for a longer time. Thus, the cache eviction algorithm determines whether to keep or evict each weak hash in the cached portion based on the number of matches recorded for the weak hash. Moreover, the cache eviction algorithm decides dynamically on how many weak hashes to check for each incoming data item (e.g., how many minimum hashes to take or how many most significant bits to check) based on the current hit rate and the current amount of data in the cached portion of the weak hash table.
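The match-count-based eviction described above can be sketched as follows. The class is a simplified stand-in for the cached portion; crediting strong-hash-cache hits through `credited_hits` and evicting the entry with the fewest recorded matches are the essentials, while the names are illustrative assumptions:

```python
class CachedWeakHashes:
    """Sketch of the cached portion of the weak hash table.

    Each cached weak hash carries a match counter; hits that arrived via
    the strong hash cache are credited to it as well. When the cache is
    full, the entry with the fewest recorded matches is evicted."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.matches = {}   # weak hash -> recorded match count

    def record_match(self, weak: int, credited_hits: int = 1) -> None:
        if weak not in self.matches and len(self.matches) >= self.capacity:
            victim = min(self.matches, key=self.matches.get)  # fewest matches
            del self.matches[victim]
        self.matches[weak] = self.matches.get(weak, 0) + credited_hits
```

A weak hash whose neighborhood produced many strong-hash-cache hits accumulates a high count and therefore survives eviction longer, matching the behavior described above.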
In accordance with an embodiment, in response to a match in the weak hash table, the method 100 further comprises finding an associated strong hash from the ID table and checking the associated strong hash against one or more of the strong hashes calculated for the incoming data item. When the match is found between the one or more representative weak hashes of the incoming data item and the weak hashes of the weak hash table, the strong hash associated with the weak hashes is found from the ID table. The ID table refers to a full index that maps block ID (i.e., the strong hash) into the actual address of the corresponding block. The ID table includes a list of unique data blocks, sorted by the ID number of the block. Hence, the ID table includes all the block IDs of all the blocks stored in the data storage system. Further, when the associated strong hash is found, the associated strong hash is checked against the one or more strong hashes calculated for the plurality of blocks of the incoming data item. By virtue of checking the associated strong hash against one or more of the strong hashes calculated for the incoming data item, the cache eviction algorithm is able to cache full pages of strong hashes, which allows the search in the weak hash table to be more efficient. Thus, the throughput and hit rate for the data storage system is increased.
In accordance with an embodiment, the method 100 further comprises loading the associated strong hash and one or more neighboring strong hashes to a strong hash cache, wherein one or more strong hash values calculated for a new incoming data item are checked against the strong hash cache before the weak hash table is searched. The associated strong hash corresponds to the strong hash associated with the weak hashes for which matches are found in the cached portion. The one or more neighboring strong hashes correspond to the strong hashes, which are located close to the associated strong hash in the ID table. The associated strong hash and the one or more neighboring strong hashes are loaded in the strong hash cache. The strong hash cache refers to a small index that maps strong hashes into their respective block IDs. The strong hash cache is stored in a memory and is used as an in-memory cache for strong hashes. Further, when a new incoming data item (i.e., a new write request) arrives at the data storage system, the one or more strong hash values calculated for the new incoming data item are checked against the strong hash cache of the data storage system before the weak hash table is searched. The checking of the calculated one or more strong hash values for each block of the new incoming data item against the strong hash cache corresponds to checking the calculated one or more strong hash values against the strong hashes of the strong hash cache. The strong hashes of the strong hash cache comprise the associated strong hash and the one or more neighboring strong hashes. The hits to the strong hash cache take into account how many hits the strong hash entry brought from the strong hash cache, i.e., if the strong hash of the strong hash cache matches the calculated one or more strong hash values of the new incoming data item, then the hits from the strong hash cache are counted as hits to the weak hash table.
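Loading the associated strong hash together with its ID-table neighbors can be sketched as follows (the radius of neighbors loaded is an illustrative assumption; the ID table is simplified to a list of strong hashes indexed by block ID):

```python
def load_neighborhood(id_table_entries: list, block_id: int,
                      strong_hash_cache: dict, radius: int = 2) -> None:
    """Load a matched strong hash plus its neighbors into the in-memory
    strong hash cache (strong hash -> block ID).

    Neighboring block IDs in the ID table were ingested consecutively, so
    they are likely to be referenced next by a sequential backup stream."""
    lo = max(0, block_id - radius)
    hi = min(len(id_table_entries), block_id + radius + 1)
    for bid in range(lo, hi):
        strong_hash_cache[id_table_entries[bid]] = bid
```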
Alternatively stated, the strong hashes that saved access to the weak hash table due to many hits in the strong hash cache remain in the cached portion of the weak hash table for a longer time.
In accordance with an embodiment, if there are no matches in the weak hash table, the incoming data item is written to the data storage system as a new data item. The one or more representative weak hashes of the incoming data item are checked against the weak hashes of the weak hash table for a match. If no matches are found in the weak hash table, the incoming data item is identified as a non-duplicate data item. The non-duplicate data item corresponds to the data item for which no matches (or fingerprints) could be found in the data storage system. Alternatively stated, the non-duplicate data item corresponds to the data item which cannot be deduplicated because it does not exist in the data storage system. A new block ID is generated for the non-duplicate data item, and the exact data item (i.e., the non-duplicate data item) is stored at an address in the data storage system as the new data item.
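The overall per-block decision, check the strong hash cache, then the weak hash table with strong-hash verification, and otherwise write the block as new data, can be sketched end to end. All structures are plain dicts standing in for the on-disk tables, and SHA-1 with a 64-bit weak hash are the same illustrative assumptions as above:

```python
import hashlib

def handle_block(block: bytes, strong_cache: dict, weak_table: dict,
                 id_table: dict, store: dict) -> int:
    """Dedup a single block, returning its block ID.

    strong_cache: strong hash -> block ID (in-memory fast path)
    weak_table:   weak hash -> block ID   (main dedup index)
    id_table:     block ID -> strong hash (for verification)
    store:        block ID -> block data  (actual storage)"""
    strong = hashlib.sha1(block).digest()
    weak = int.from_bytes(strong[:8], "big")
    if strong in strong_cache:                 # fast path: in-memory hit
        return strong_cache[strong]
    if weak in weak_table:                     # weak-hash hit: verify strong hash
        block_id = weak_table[weak]
        if id_table[block_id] == strong:
            return block_id                    # confirmed duplicate, not re-stored
    block_id = len(id_table)                   # no match: write as new data
    id_table[block_id] = strong
    weak_table[weak] = block_id
    store[block_id] = block
    return block_id
```

Note that a weak-hash hit alone is not sufficient to declare a duplicate: because the weak hash carries fewer bits, it must be confirmed against the full strong hash before the block is skipped.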
In accordance with an embodiment, the cache eviction algorithm is configured to assign a higher priority to keep weak hashes for which a recorded match corresponds to a data item received as part of a backup task. The backup task is used to backup data at a target site in order to protect and recover data in an event of data loss in a source site. The backup task (e.g., source-based deduplication) removes redundant data items before transmitting data to the target site at the client or server-side (e.g., data storage system). The cache eviction algorithm ensures that the weak hashes searched by the backup task (e.g., source-based deduplication) have higher priority to be kept in the cached portion of the weak hash table based on the number of matches recorded for the weak hashes. Thus, the number of accesses to fetch IO operations from the disk is reduced, and consequently, latency in the backup task is minimized.
In accordance with an embodiment, the incoming data item is received as part of a backup task. The incoming data item refers to an incoming write request to store the data item (i.e., data files, database files, input-output (I/O) writes, and the like) on the data storage system. The backup task is used to backup data at a target site in order to protect and recover data in an event of data loss in a source site. The backup task (e.g., source-based deduplication) removes redundant data items before transmitting data to the target site at the client or server-side (e.g., data storage system). Thus, no additional hardware is required to perform deduplication. Moreover, bandwidth and storage capacity requirements are reduced.
In accordance with an embodiment, the method 100 further comprises: receiving a second incoming data item which is not part of a backup task; dividing the second incoming data item in the data storage system into a plurality of blocks; calculating a strong hash and a weak hash for each block; searching for each weak hash in the weak hash table; and recording a match between one or more of the weak hashes and a weak hash in the weak hash table, wherein the cache eviction algorithm is configured to assign a lower priority to keep weak hashes for which a recorded match corresponds to the second data item which is not received as part of a backup task. The second incoming data item refers to an incoming write request to store the data item (i.e., data files, database files, input-output (I/O) writes, and the like) on the data storage system. Moreover, the second incoming data item corresponds to small and random IO operations for generic inline deduplication. The second incoming data item is broken down into the plurality of blocks. Further, the strong hash corresponding to each of the plurality of blocks is calculated. Further, one or more weak hashes corresponding to each calculated strong hash are calculated. Further, each of the one or more weak hashes is checked against the weak hashes of the weak hash table for a match. Further, if the match is found, then the cache eviction algorithm keeps the weak hashes in the cached portion of the weak hash table. However, the cache eviction algorithm assigns a lower priority to the second incoming data item than the incoming data item because the second incoming data item is not part of the backup task.
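The two-tier priority described above, backup-task hashes kept preferentially over generic inline IO, can be sketched by extending the match-count eviction with a priority field. The two-level priority encoding and the tuple-ordered victim selection are illustrative assumptions:

```python
class PriorityEvictionCache:
    """Eviction favoring backup-task weak hashes.

    Each entry keeps [priority, match count]; the victim is the entry with
    the lowest priority, and among equals, the fewest recorded matches."""
    GENERIC, BACKUP = 0, 1

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = {}   # weak hash -> [priority, matches]

    def record(self, weak: int, priority: int) -> None:
        if weak not in self.entries and len(self.entries) >= self.capacity:
            victim = min(self.entries, key=lambda w: tuple(self.entries[w]))
            del self.entries[victim]
        entry = self.entries.setdefault(weak, [priority, 0])
        entry[0] = max(entry[0], priority)     # backup hits raise priority
        entry[1] += 1
```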
Thus, the method 100 provides an improved data deduplication solution by efficiently managing the data stored in the data storage system with the help of the data indexing module. By virtue of the cached portion, the number of disk-read accesses is reduced, which consequently minimizes the access rate and latency in fetching the IO operations of the data storage system. Further, the method 100 of the present disclosure can be efficiently used for multiple scenarios (e.g., source-based deduplication and generic inline deduplication) without degrading the performance (i.e., latency, throughput, access rate, and hit rate) for any scenario. Moreover, the method 100 allows an efficient and reliable data deduplication solution to manage duplicate data as well as non-duplicate data. The non-duplicate data is stored in the data storage system as a new data item. On the other hand, the duplicate data is not stored explicitly in the data storage system; rather, the corresponding weak hash is kept in the cached portion based on the number of matches recorded for the weak hash. Thus, the method 100 of the present disclosure allows efficient utilization of the storage capacity of the data storage system.
The steps 102 to 116 are only illustrative and other alternatives can also be provided where one or more steps are added, one or more steps are removed, or one or more steps are provided in a different sequence without departing from the scope of the claims herein.
In another aspect, the present disclosure provides a computer-readable medium configured to store instructions which, when executed by a processor, cause the processor to perform the method 100 of the aforementioned aspect. The computer-readable medium refers to a non-transitory computer-readable storage medium. Examples of implementation of the computer-readable media include, but are not limited to, Electrically Erasable Programmable Read-Only Memory (EEPROM), Random Access Memory (RAM), Read Only Memory (ROM), Hard Disk Drive (HDD), Flash memory, a Secure Digital (SD) card, Solid-State Drive (SSD), a computer readable storage medium, and/or CPU cache memory.
FIG. 2 is a block diagram of a data storage system, in accordance with an embodiment of the present disclosure. FIG. 2 is described in conjunction with elements of FIGs. 1A and 1B, collectively. With reference to FIG. 2, there is shown a block diagram 200 of a data storage system 202. The data storage system 202 includes a control circuitry 204, a transceiver 206, a data processing apparatus 208, and a memory 210. The data processing apparatus 208 further includes a data indexing module 208A, a data query module 208B, and a cache eviction module 208C.
The data storage system 202 refers to a computer storage system that stores information (i.e., data items such as data files or database files, I/O writes, etc.) in a storage medium, such as a storage disk. Examples of data storage system 202 include, but are not limited to, a secondary storage system, a cloud server, a file storage system, a block storage system, an object storage system, or a combination thereof.
The control circuitry 204 includes a logic circuitry that may be communicatively coupled to the transceiver 206, the data processing apparatus 208, and the memory 210. The control circuitry 204 controls the operations performed by the data processing apparatus 208 and the flow of data across the different modules of the data processing apparatus 208. Examples of the control circuitry 204 may include, but are not limited to, a microprocessor, a microcontroller, a complex instruction set computing (CISC) processor, an application-specific integrated circuit (ASIC) processor, a reduced instruction set (RISC) processor, a very long instruction word (VLIW) processor, a central processing unit (CPU), a state machine, a data processing unit, and other processors or control circuitry.
The transceiver 206 includes a suitable logic, circuitry, and/or interfaces that is configured to transmit/receive IO read/write operations. Examples of the transceiver 206 include, but are not limited to, a transmitter/receiver antenna, an Internet-of-Things (IoT) controller, a
The data processing apparatus 208 refers to a computer component that uses a data structure technique to quickly retrieve records from a database file. The data processing apparatus 208 may also be simply referred to as a module or a data indexing circuitry. The data processing apparatus 208 includes suitable logic, circuitry, and/or interfaces that may be configured to execute the method 100 (of FIGs. 1A and 1B, collectively). The data processing apparatus 208 includes the data indexing module 208A, which is configured to index the data stored in the data storage system 202. The data processing apparatus 208 further includes the data query module 208B, which is configured to receive and index an incoming data item in the data storage system 202. The data processing apparatus 208 further includes the cache eviction module 208C, which is configured to determine whether to keep or evict a weak hash in the cached portion of the data storage system 202. The memory 210 includes a suitable logic, circuitry, and/or interfaces that may be configured to store machine code and/or instructions executable by the data processing apparatus 208. Examples of implementation of the memory 210 may include, but are not limited to, Hard Disk Drive (HDD), Flash memory, Solid-State Drive (SSD), Network Attached Storage (NAS), SSD Flash Drive Arrays, Hybrid Flash Arrays, Cloud Storage, and the like.
In operation, the data processing apparatus 208 executes the method 100 by efficiently managing a data item in the data storage system 202 with the help of the data indexing module 208A, the data query module 208B, and the cache eviction module 208C. The operations of the data indexing module 208A include: dividing each data item in the data storage system 202 into a plurality of blocks; calculating a strong hash for each block, and generating an ID table comprising a list of strong hashes with a pointer to a location of the corresponding block; and calculating a weak hash for each strong hash and generating a weak hash table comprising a list of weak hashes with a pointer to a location of the corresponding strong hash in the ID table. In response to receiving an incoming data item, the operations of the data query module 208B includes: dividing the incoming data item in the data storage system 202 into a plurality of blocks; calculating a strong hash and a weak hash for each block; selecting one or more representative weak hashes for the incoming data item; searching for the representative weak hashes in the weak hash table; and recording a match between one or more of the representative weak hashes and a weak hash in the weak hash table. The weak hash table comprises a cached portion, and the cache eviction module 208C is configured to determine whether to keep or evict each weak hash in the cached portion based on a number of matches recorded for the weak hash.
In accordance with an embodiment, selecting the representative weak hashes comprises selecting a predetermined number of the lowest value weak hashes. Further, selecting the representative weak hashes also comprises selecting one or more weak hashes for which a predetermined number of the most significant bits are equal to zero. Moreover, the predetermined number is dynamically chosen based on a hit rate for the data storage system 202 and an amount of data in the cached portion.
In accordance with an embodiment, in response to a match in the weak hash table, the data query module 208B is configured to find an associated strong hash from the ID table and check the associated strong hash against one or more of the strong hashes calculated for the incoming data item. The data query module 208B is further configured to load the associated strong hash and one or more neighboring strong hashes to a strong hash cache, wherein one or more strong hash values calculated for a new incoming data item are checked against the strong hash cache before the weak hash table is searched. Further, if there are no matches in the weak hash table, the incoming data item is written to the data storage system 202 as a new data item.
In accordance with an embodiment, the incoming data item is received as part of a backup task. Further, the cache eviction module 208C is configured to assign a higher priority to keep weak hashes for which a recorded match corresponds to a data item received as part of the backup task.
In accordance with an embodiment, the data query module 208B is further configured to receive a second incoming data item which is not part of a backup task; divide the second incoming data item in the data storage system 202 into a plurality of blocks; calculate a strong hash and a weak hash for each block; search for each weak hash in the weak hash table; and record a match between one or more of the weak hashes and a weak hash in the weak hash table. Further, the cache eviction module 208C is configured to assign a lower priority to keep weak hashes for which a recorded match corresponds to the second data item which is not received as part of a backup task.
Thus, the data processing apparatus 208 provides an efficient and reliable data deduplication solution to manage duplicate data as well as non-duplicate data. The non-duplicate data is stored in the data storage system as a new data item. On the other hand, the duplicate data is not stored explicitly in the data storage system 202; rather, the corresponding weak hash is kept in the cached portion based on the number of matches recorded for the weak hash. Thus, the data processing apparatus 208 helps in efficient utilization of the storage capacity of the data storage system 202.
FIG. 3 is an illustration that depicts various operations of a data deduplication solution, in accordance with an embodiment of the present disclosure. FIG. 3 is described in conjunction with elements of FIGs. 1A, 1B, and 2. With reference to FIG. 3, there is shown a process 300 that depicts various operations of a data deduplication solution. There is further shown an incoming write request 302, a strong hash cache 304, a weak hash table 306, and an ID table 308. The strong hash cache 304 further includes a strong hash 304A, and a block ID 304B. The weak hash table 306 further includes a weak hash 306A, a block ID 306B, and a cached portion 306C. The ID table 308 further includes a block ID 308A, a strong hash 308B, a ref-count 308C, and a disk address 308D. The process 300 may correspond to the method 100 (of FIGs. 1A and 1B, collectively).
The incoming write request 302 refers to a request to store an incoming data item (i.e., data files, database files, I/O writes, and the like) in the data storage system 202.
The strong hash cache 304 refers to a small index that maps the strong hash 304A into the block ID 304B. The strong hash cache 304 is stored in the memory 210 and is used as an in-memory cache for strong hashes.
The weak hash table 306 refers to a full index that maps the weak hash 306A into the block ID 306B. In order to provide a good deduplication ratio (i.e., the ratio of the size of original data to the size of data after deduplication), the size of the weak hash table 306 is large, and hence, the weak hash table 306 is stored on the disk storage. The weak hash table 306 is a standard key-value data structure. Typically, such a data structure is implemented using a b-tree, so that hashes with nearby values are stored close together on the disk. When there is a hit, the data structure brings a set of keys and values that are located close together on the disk. The weak hash table 306 has a built-in cache mechanism to store relevant weak hashes in the cached portion 306C. The cached portion 306C is the main index used for deduplication. Beneficially, latency and the access rate for read IOs are reduced, and throughput, as well as hit rate, is increased. Hence, the overall performance of the data deduplication solution is optimized.
The ID table 308 refers to a full index that maps the block ID 308A of the strong hash 308B into the actual address of the data (i.e., the disk address 308D). The ID table 308 includes all the block IDs of all the blocks stored in the data storage system 202. Hence, the size of the ID table 308 is large. Thus, the ID table 308 is stored on the disk storage. Further, the ID table 308 keeps a count of the references (i.e., pointers) for duplicate data items in the ref-count 308C.
In operation, the incoming write request 302 is divided into a plurality of blocks, and strong hashes and weak hashes for each block are calculated. The incoming write request 302 may include a data item (i.e., data files, database files, I/O writes, and the like) that is very large in size. In order to efficiently manage such a large data item (i.e., the incoming write request 302), the data item is broken down into the plurality of blocks. The dividing of the data item into the plurality of blocks may result in blocks of fixed-length (i.e., equal size blocks) or variable-length (i.e., unequal size blocks), depending upon the dividing algorithm performed. Further, strong hashes are calculated for each block of the data item (i.e., the incoming write request 302). The calculated strong hash for each block refers to a fingerprint of the corresponding block, which uniquely describes the corresponding block of the data item. In other words, the calculated strong hash for each block is a bit string that represents the corresponding block of the data item. Further, weak hashes are calculated for each of the calculated strong hashes. The calculated weak hashes refer to a portion of the bits of a corresponding calculated strong hash of the incoming write request 302.
The calculated strong hashes for each block are checked against the strong hash cache 304. The checking of the calculated strong hashes for each block against the strong hash cache 304 corresponds to checking the calculated strong hashes for each block against the strong hash 304A of the strong hash cache 304. If a match is found between the calculated strong hashes and the strong hash 304A, then the cache eviction algorithm keeps the calculated weak hashes corresponding to the strong hash 304A in the cached portion 306C for a longer time. On the other hand, if a match is not found between the calculated strong hashes and the strong hash 304A, then the control is sent to the weak hash table 306 for further processing.
At weak hash table 306, one or more representative weak hashes selected for the incoming write request 302 are checked against the weak hash table 306. The one or more representative weak hashes are selected by generating a list of weak hashes from the calculated strong hashes of each block of the incoming write request 302 and selecting one or more of the weak hashes as the representative weak hashes. The checking of one or more representative weak hashes against the weak hash table 306 corresponds to checking the one or more representative weak hashes for the incoming write request 302 against the cached portion 306C of the weak hash table 306 for a match. The cached portion 306C comprises weak hashes from the weak hash 306A of the weak hash table 306 based on the number of matches recorded for the weak hashes. Further, if the match is found for one or more representative weak hashes in the cached portion 306C, then an associated strong hash is retrieved from the ID table 308, and the weak hashes are kept in the cached portion 306C. The associated strong hash refers to the strong hash corresponding to the weak hashes in the cached portion 306C for which matches are found. Hence, every time a weak hash is accessed, the neighborhood of the block IDs is brought via a single hit in the weak hash table 306. The hits to the cached portion 306C take into account how many hits the neighborhood of strong hashes brought from the strong hash cache 304, i.e., the hits from the neighborhood of block IDs brought to the strong hash cache 304 are counted as hits to the weak hash table 306. Alternatively stated, the weak hashes that saved access to the weak hash table 306 due to many hits in the strong hash cache 304 remain in the cached portion 306C of the weak hash table for a longer time.
On the other hand, if no matches are found for one or more representative weak hashes in the cached portion 306C, then the incoming write request 302 is written into the data storage system 202 as a new data item.
In accordance with an embodiment, selecting the representative weak hashes comprises selecting a predetermined number of the lowest value weak hashes. Further, selecting the representative weak hashes may also comprise selecting one or more weak hashes for which a predetermined number of the most significant bits are equal to zero. Moreover, the predetermined number is dynamically chosen based on a hit rate for the data storage system 202 and an amount of data in the cached portion 306C.
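Both selection heuristics can be sketched in a few lines. This is an illustration under assumed parameters: a 32-bit weak hash width and the default values of k and msb_zero_bits are examples, not values taken from the disclosure; the text only says the predetermined number is tuned dynamically from the hit rate and the cached-portion size.

```python
def select_representatives(weak_hashes, k=2, msb_zero_bits=None, hash_bits=32):
    """Select representative weak hashes either as (a) the k lowest values,
    or (b) those whose top `msb_zero_bits` bits are all zero. Both k and
    msb_zero_bits play the role of the 'predetermined number' above."""
    if msb_zero_bits is not None:
        # Heuristic (b): keep hashes whose most significant bits are zero.
        mask = ((1 << msb_zero_bits) - 1) << (hash_bits - msb_zero_bits)
        return [w for w in weak_hashes if w & mask == 0]
    # Heuristic (a): keep the k lowest-valued weak hashes.
    return sorted(weak_hashes)[:k]
```

Because both rules are content-defined (they depend only on the hash values, not on the position of the block), the same data selects the same representatives on every write, which is what makes the lookup repeatable.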
In accordance with an embodiment, in response to a match in the weak hash table 306, the process 300 further comprises finding an associated strong hash from the ID table 308 and checking the associated strong hash against one or more of the strong hashes calculated for the incoming data item. The process 300 further comprises loading the associated strong hash and one or more neighboring strong hashes to a strong hash cache 304, wherein one or more strong hash values calculated for a new incoming data item are checked against the strong hash cache 304 before the weak hash table 306 is searched. Further, if there are no matches in the weak hash table 306, the incoming data item is written to the data storage system 202 as a new data item.
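The neighbor-loading step described above can be sketched as follows, assuming the ID table is ordered so that adjacent entries correspond to adjacent blocks; the radius of the loaded neighborhood is a hypothetical parameter not specified in the disclosure.

```python
def load_neighborhood(id_table, index, strong_cache, radius=2):
    """On a weak-hash match, load the associated strong hash plus its
    neighbors (by position in the ID table) into the strong hash cache,
    so that sequential duplicate data hits the cache directly next time."""
    lo = max(0, index - radius)
    hi = min(len(id_table), index + radius + 1)
    for i in range(lo, hi):
        # Initialize a hit counter of 0 for entries not already cached.
        strong_cache.setdefault(id_table[i], 0)
    return hi - lo  # number of strong hashes now resident in the cache
```

This is the mechanism that lets a single weak-hash lookup amortize over a whole run of duplicate blocks: subsequent blocks of the same stream resolve in the strong hash cache without touching the on-disk weak hash table.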
In accordance with an embodiment, the incoming data item is received as part of a backup task. Further, the cache eviction algorithm is configured to assign a higher priority to keep weak hashes for which a recorded match corresponds to a data item received as part of the backup task. In accordance with an embodiment, the process 300 further comprises: receiving a second incoming data item which is not part of a backup task; dividing the second incoming data item in the data storage system 202 into a plurality of blocks; calculating a strong hash and a weak hash for each block; searching for each weak hash in the weak hash table 306; and recording a match between one or more of the weak hashes and a weak hash in the weak hash table 306, wherein the cache eviction algorithm is configured to assign a lower priority to keep weak hashes for which a recorded match corresponds to the second data item which is not received as part of a backup task.
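The backup-aware prioritization can be illustrated as a weighted eviction score. The additive bonus below is a hypothetical weighting invented for the example; the disclosure only states that backup-originated matches receive a higher priority, not how that priority is computed.

```python
BACKUP_BONUS = 10  # hypothetical weight for matches that came from a backup task

def eviction_score(match_count, from_backup):
    """Entries whose recorded matches came from a backup task score higher,
    so they survive longer in the cached portion."""
    return match_count + (BACKUP_BONUS if from_backup else 0)

def pick_victim(entries):
    """Choose the weak hash to evict: the one with the lowest score.
    entries maps weak hash -> (match_count, from_backup)."""
    return min(entries, key=lambda w: eviction_score(*entries[w]))
```

Under this scheme a weak hash with a single backup-originated match outlives a non-backup weak hash with several matches, reflecting the expectation that backup streams repeat.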
In conventional systems, in an example, the main flow of a typical data deduplication solution is: receiving a new write request; segmenting the data using a segmentation algorithm; calculating a strong hash and a weak hash for each segment; and searching for the strong hash in a given strong hash cache. If the strong hash is found in the given strong hash cache, the block ID is returned. If the strong hash is not found in the given strong hash cache, the weak hash is searched in a given weak hash table. If the weak hash is found in the given weak hash table, the strong hash is retrieved from a given ID table for comparison. If the weak hash is not found in the given weak hash table, the data cannot be deduplicated, and a new block ID is generated. However, since the given weak hash table is large and most of it is stored on the disk, the search requires read IOs from the disk. Thus, there is a significant impact on the latency as well as on the read IO throughput.
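The conventional lookup order just described can be condensed into a sketch. The hash functions are assumptions of the example (SHA-256 as the strong hash, its leading 4 bytes as the weak hash); in a real system the weak hash table is the large on-disk structure whose lookup incurs the read IOs discussed above.

```python
import hashlib

def strong_hash(block: bytes) -> bytes:
    return hashlib.sha256(block).digest()  # illustrative strong hash

def weak_hash(strong: bytes) -> int:
    return int.from_bytes(strong[:4], "big")  # illustrative weak hash derived from the strong hash

def dedup_lookup(block, strong_cache, weak_table, id_table):
    """Conventional order: strong hash cache -> weak hash table -> ID table.
    Returns the duplicate's block ID, or None meaning 'store as a new block'."""
    s = strong_hash(block)
    if s in strong_cache:
        return strong_cache[s]            # fast path: block ID from the cache
    w = weak_hash(s)
    if w in weak_table:                   # in practice a large, mostly on-disk table
        candidate = weak_table[w]         # pointer into the ID table (a disk read)
        if id_table[candidate] == s:      # verify the candidate with the strong hash
            return candidate
    return None                           # cannot deduplicate; a new block ID is generated
```

Every miss in the strong hash cache pays the weak-table and ID-table cost, which is exactly the disk-access burden that the cached portion 306C is introduced to avoid.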
In contrast to the conventional systems, the data deduplication solution of the present disclosure uses the cache eviction algorithm to keep or evict weak hashes in the cached portion 306C based on a number of matches recorded for the weak hashes. The cached portion 306C of the present disclosure reduces the number of disk-read accesses, and consequently, the latency of the data deduplication solution is minimized.
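The match-count-based eviction of the cached portion can be sketched as a small cache class. The capacity bound and the "evict the least-matched entry" tie-breaking are assumptions of the example; the disclosure only specifies that keep/evict decisions depend on the number of recorded matches.

```python
class WeakHashCache:
    """Minimal sketch of the cached portion 306C: when full, the weak hash
    with the fewest recorded matches is evicted first."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.matches = {}  # weak hash -> number of matches recorded

    def record_match(self, w):
        self.matches[w] = self.matches.get(w, 0) + 1

    def insert(self, w):
        if w in self.matches:
            return
        if len(self.matches) >= self.capacity:
            # Evict the entry with the fewest recorded matches.
            victim = min(self.matches, key=self.matches.get)
            del self.matches[victim]
        self.matches[w] = 0
```

Frequently matched weak hashes thus stay memory-resident, and lookups for recurring data avoid the disk entirely.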
Table 1 shows a comparison of the number of disk-read accesses for the conventional data deduplication solutions and the data deduplication solution of the present disclosure. Table 1 includes the datasets, such as ‘Files 28Full’, ‘VMware 28Full’, ‘Oracle 28Full’, ‘Files 4F24Inc’, ‘VMware 4F24Inc’, and ‘Oracle 4F24Inc’. Table 1 further includes the number of disk-read accesses for the conventional data deduplication solutions and the data deduplication solution of the present disclosure along with the improvement. It can be observed from Table 1 that the data deduplication solution of the present disclosure can save over 95% of the disk-read accesses for most of the datasets.
Table 2 shows a comparison of the deduplication ratio achieved for the source-based deduplication and the data deduplication solution of the present disclosure. Table 2 includes the datasets, such as ‘Files 28Full’, ‘VMware 28Full’, ‘Oracle 28Full’, ‘Files 4F24Inc’, ‘VMware 4F24Inc’, and ‘Oracle 4F24Inc’. Table 2 further includes the deduplication ratio achieved for the source-based deduplication and the data deduplication solution of the present disclosure. It can be observed from Table 2 that the deduplication ratio achieved for the data deduplication solution of the present disclosure is very similar to the deduplication ratio achieved for the source-based deduplication. Thus, the deduplication ratio is not degraded for the data deduplication solution of the present disclosure.
Table 1 : Comparison of the number of disk-read accesses for the conventional data deduplication solutions and the data deduplication solution of the present disclosure.
(Table 1 is reproduced as an image in the original publication; its values are not available in this text.)
Table 2: Comparison of deduplication ratio achieved for the source-based deduplication and the data deduplication solution of the present disclosure.
(Table 2 is reproduced as an image in the original publication; its values are not available in this text.)
“Inc” refers to a weekly (7-day) full backup for 4 weeks, i.e., 28 days with daily incremental backups between the full backups. “Full” refers to full backups over 28 days.
Thus, the process 300 corresponds to the method for managing the data item in the data storage system 202 to provide an improved data deduplication solution. By virtue of the cached portion 306C, the number of disk-read accesses is reduced, which consequently reduces the latency of IO operations in the data storage system 202. Further, the process 300 can be efficiently used for multiple scenarios (e.g., source-based deduplication and generic inline deduplication) without degrading the performance (i.e., latency, throughput, access rate, and hit rate) for any scenario. Moreover, the process 300 provides an efficient and reliable data deduplication solution to manage duplicate data as well as non-duplicate data. The non-duplicate data is stored in the data storage system 202 as a new data item. The duplicate data, on the other hand, is not stored again in the data storage system 202; rather, the corresponding weak hashes are kept in the cached portion 306C based on the number of matches recorded for the weak hashes. Thus, the process 300 allows efficient utilization of the storage capacity of the data storage system 202.
Various embodiments of the disclosure thus provide a computer-implemented method (i.e., the method 100) of data management in a data storage system 202. The method 100 comprises: dividing each data item in the data storage system 202 into a plurality of blocks; calculating a strong hash for each block and generating an ID table comprising a list of strong hashes with a pointer to a location of the corresponding block; calculating a weak hash for each strong hash and generating a weak hash table comprising a list of weak hashes with a pointer to a location of the corresponding strong hash in the ID table; and in response to receiving an incoming data item: dividing the incoming data item in the data storage system 202 into a plurality of blocks; calculating a strong hash and a weak hash for each block; selecting one or more representative weak hashes for the incoming data item; searching for the representative weak hashes in the weak hash table; and recording a match between one or more of the representative weak hashes and a weak hash in the weak hash table; wherein the weak hash table comprises a cached portion, and a cache eviction algorithm is configured to determine whether to keep or evict each weak hash in the cached portion based on a number of matches recorded for the weak hash.
Various embodiments of the disclosure thus further provide a data processing apparatus 208 for a data storage system 202. The data processing apparatus 208 comprises a data indexing module 208A configured to: divide each data item in the data storage system 202 into a plurality of blocks; calculate a strong hash for each block and generate an ID table comprising a list of strong hashes with a pointer to a location of the corresponding block; and calculate a weak hash for each strong hash and generate a weak hash table comprising a list of weak hashes with a pointer to a location of the corresponding strong hash in the ID table. The data processing apparatus 208 further comprises a data query module 208B configured to receive an incoming data item and, in response to receiving an incoming data item: divide the incoming data item in the data storage system 202 into a plurality of blocks; calculate a strong hash and a weak hash for each block; select one or more representative weak hashes for the incoming data item; search for the representative weak hashes in the weak hash table; and record a match between one or more of the representative weak hashes and a weak hash in the weak hash table. The data processing apparatus 208 further comprises a cache eviction module 208C, wherein the weak hash table comprises a cached portion, and the cache eviction module 208C is configured to determine whether to keep or evict each weak hash in the cached portion based on a number of matches recorded for the weak hash.
Modifications to embodiments of the present disclosure described in the foregoing are possible without departing from the scope of the present disclosure as defined by the accompanying claims. Expressions such as “including”, “comprising”, “incorporating”, “have”, “is” used to describe and claim the present disclosure are intended to be construed in a non-exclusive manner, namely allowing for items, components or elements not explicitly described also to be present. Reference to the singular is also to be construed to relate to the plural. The word “exemplary” is used herein to mean “serving as an example, instance or illustration”. Any embodiment described as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments and/or to exclude the incorporation of features from other embodiments. The word “optionally” is used herein to mean “is provided in some embodiments and not provided in other embodiments”. It is appreciated that certain features of the present disclosure, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the present disclosure, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable combination or as suitable in any other described embodiment of the disclosure.

Claims

1. A computer-implemented method (100) of data management in a data storage system (202), the method (100) comprising: dividing each data item in the data storage system (202) into a plurality of blocks; calculating a strong hash for each block and generating an ID table comprising a list of strong hashes with a pointer to a location of the corresponding block; calculating a weak hash for each strong hash and generating a weak hash table comprising a list of weak hashes with a pointer to a location of the corresponding strong hash in the ID table; and in response to receiving an incoming data item: dividing the incoming data item in the data storage system (202) into a plurality of blocks; calculating a strong hash and a weak hash for each block; selecting one or more representative weak hashes for the incoming data item; searching for the representative weak hashes in the weak hash table; and recording a match between one or more of the representative weak hashes and a weak hash in the weak hash table; wherein the weak hash table comprises a cached portion, and a cache eviction algorithm is configured to determine whether to keep or evict each weak hash in the cached portion based on a number of matches recorded for the weak hash.
2. The method (100) of claim 1, wherein selecting the representative weak hashes comprises selecting a predetermined number of the lowest value weak hashes.
3. The method (100) of claim 1, wherein selecting the representative weak hashes comprises selecting one or more weak hashes for which a predetermined number of the most significant bits are equal to zero.
4. The method (100) of claim 2 or claim 3, wherein the predetermined number is dynamically chosen based on a hit rate for the data storage system (202) and an amount of data in the cached portion.
5. The method (100) of any preceding claim, further comprising: in response to a match in the weak hash table, finding an associated strong hash from the ID table and checking the associated strong hash against one or more of the strong hashes calculated for the incoming data item.
6. The method (100) of claim 5, further comprising loading the associated strong hash and one or more neighboring strong hashes to a strong hash cache, wherein one or more strong hash values calculated for a new incoming data item are checked against the strong hash cache before the weak hash table is searched.
7. The method (100) of any preceding claim, wherein if there are no matches in the weak hash table, the incoming data item is written to the data storage system (202) as a new data item.
8. The method (100) of any preceding claim, wherein the cache eviction algorithm is configured to assign a higher priority to keep weak hashes for which a recorded match corresponds to a data item received as part of a backup task.
9. The method (100) of any preceding claim, wherein the incoming data item is received as part of a backup task.
10. The method (100) of claim 9, further comprising: receiving a second incoming data item which is not part of a backup task; dividing the second incoming data item in the data storage system (202) into a plurality of blocks; calculating a strong hash and a weak hash for each block; searching for each weak hash in the weak hash table; and recording a match between one or more of the weak hashes and a weak hash in the weak hash table; wherein the cache eviction algorithm is configured to assign a lower priority to keep weak hashes for which a recorded match corresponds to the second data item which is not received as part of a backup task.
11. A computer-readable medium configured to store instructions which, when executed by a processor, cause the processor to perform the method (100) of any preceding claim.
12. A data processing apparatus (208) for a data storage system (202), comprising: a data indexing module (208A) configured to: divide each data item in the data storage system (202) into a plurality of blocks; calculate a strong hash for each block and generate an ID table comprising a list of strong hashes with a pointer to a location of the corresponding block; and calculate a weak hash for each strong hash and generate a weak hash table comprising a list of weak hashes with a pointer to a location of the corresponding strong hash in the ID table; a data query module (208B) configured to receive an incoming data item and, in response to receiving an incoming data item: divide the incoming data item in the data storage system (202) into a plurality of blocks; calculate a strong hash and a weak hash for each block; select one or more representative weak hashes for the incoming data item; search for the representative weak hashes in the weak hash table; and record a match between one or more of the representative weak hashes and a weak hash in the weak hash table; and a cache eviction module (208C), wherein the weak hash table comprises a cached portion, and the cache eviction module (208C) is configured to determine whether to keep or evict each weak hash in the cached portion based on a number of matches recorded for the weak hash.
13. The data processing apparatus (208) of claim 12, wherein selecting the representative weak hashes comprises selecting a predetermined number of the lowest value weak hashes.
14. The data processing apparatus (208) of claim 12, wherein selecting the representative weak hashes comprises selecting one or more weak hashes for which a predetermined number of the most significant bits are equal to zero.
15. The data processing apparatus (208) of claim 13 or claim 14, wherein the predetermined number is dynamically chosen based on a hit rate for the data storage system (202) and an amount of data in the cached portion.
16. The data processing apparatus (208) of any one of claims 12 to 15, wherein the data query module (208B) is further configured, in response to a match in the weak hash table, to find an associated strong hash from the ID table and check the associated strong hash against one or more of the strong hashes calculated for the incoming data item.
17. The data processing apparatus (208) of claim 16, wherein the data query module (208B) is further configured to load the associated strong hash and one or more neighboring strong hashes to a strong hash cache, wherein one or more strong hash values calculated for a new incoming data item are checked against the strong hash cache before the weak hash table is searched.
18. The data processing apparatus (208) of any one of claims 12 to 17, wherein if there are no matches in the weak hash table, the incoming data item is written to the data storage system (202) as a new data item.
19. The data processing apparatus (208) of any one of claims 12 to 18, wherein the cache eviction module (208C) is configured to assign a higher priority to keep weak hashes for which a recorded match corresponds to a data item received as part of a backup task.
20. The data processing apparatus (208) of any one of claims 12 to 19, wherein the incoming data is received as part of a backup task.
21. The data processing apparatus (208) of claim 20, wherein the data query module (208B) is further configured to: receive a second incoming data item which is not part of a backup task; divide the second incoming data item in the data storage system (202) into a plurality of blocks; calculate a strong hash and a weak hash for each block; search for each weak hash in the weak hash table; and record a match between one or more of the weak hashes and a weak hash in the weak hash table; wherein the cache eviction module (208C) is configured to assign a lower priority to keep weak hashes for which a recorded match corresponds to the second data item which is not received as part of a backup task.
PCT/EP2021/075186 2021-09-14 2021-09-14 Deduplication using cache eviction for strong and weak hashes WO2023041141A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202180101408.6A CN117813591A (en) 2021-09-14 2021-09-14 Deduplication of strong and weak hashes using cache evictions
PCT/EP2021/075186 WO2023041141A1 (en) 2021-09-14 2021-09-14 Deduplication using cache eviction for strong and weak hashes


Publications (1)

Publication Number Publication Date
WO2023041141A1 true WO2023041141A1 (en) 2023-03-23

Family

ID=77914316

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2021/075186 WO2023041141A1 (en) 2021-09-14 2021-09-14 Deduplication using cache eviction for strong and weak hashes

Country Status (2)

Country Link
CN (1) CN117813591A (en)
WO (1) WO2023041141A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110066628A1 (en) * 2009-09-11 2011-03-17 Ocarina Networks, Inc. Dictionary for data deduplication
US20130013844A1 (en) * 2011-07-07 2013-01-10 Atlantis Computing, Inc. Intelligent content aware caching of virtual machine data by relevance to the ntfs file system
CN108984123A (en) * 2018-07-12 2018-12-11 郑州云海信息技术有限公司 A kind of data de-duplication method and device


Also Published As

Publication number Publication date
CN117813591A (en) 2024-04-02

Similar Documents

Publication Publication Date Title
US11650976B2 (en) Pattern matching using hash tables in storage system
USRE49148E1 (en) Reclaiming space occupied by duplicated data in a storage system
Fu et al. Design tradeoffs for data deduplication performance in backup workloads
US11093454B2 (en) Speeding deduplication using a most wanted digest cache
US9727573B1 (en) Out-of core similarity matching
US11157372B2 (en) Efficient memory footprint in deduplicated system storing with content based addressing
US8930307B2 (en) Method for removing duplicate data from a storage array
US9268653B2 (en) Extent metadata update logging and checkpointing
US20150193156A1 (en) Nvram data organization using self-describing entities for predictable recovery after power-loss
US20150012698A1 (en) Restoring temporal locality in global and local deduplication storage systems
US9740422B1 (en) Version-based deduplication of incremental forever type backup
US10353820B2 (en) Low-overhead index for a flash cache
US10921987B1 (en) Deduplication of large block aggregates using representative block digests
WO2016091282A1 (en) Apparatus and method for de-duplication of data
US9940069B1 (en) Paging cache for storage system
US10788988B1 (en) Controlling block duplicates
WO2023041141A1 (en) Deduplication using cache eviction for strong and weak hashes
US10795596B1 (en) Delayed deduplication using precalculated hashes
WO2022262990A1 (en) Method and system for indexing data item in data storage system and data indexing module
US20240028234A1 (en) Multi-fingerprint deduplication processing
US10353818B1 (en) Dataset paging cache for storage system
CN117255990A (en) Data management method in data storage system, data index module and data storage system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21777688

Country of ref document: EP

Kind code of ref document: A1