WO2022184272A1

WO2022184272A1 - Method for indexing a data item in a data storage system

Info

Publication number: WO2022184272A1
Application number: PCT/EP2021/055603
Authority: WO
Inventors: Idan Zach; Aviv Kuvent; Assaf Natanzon; Michael Sternberg; Elizabeth FIRMAN
Original assignee: Huawei Technologies Co., Ltd.
Priority date: 2021-03-05
Filing date: 2021-03-05
Publication date: 2022-09-09
Also published as: CN116917878A

Abstract

Provided is a computer-implemented method for indexing a data item in a data storage system. The method includes dividing the data item into one or more large blocks. The method includes dividing each large block into one or more small blocks. The method includes calculating a strong hash value for each of the small blocks and storing a list of strong hash values with a pointer to a location of the large block. The method includes, from the list of strong hash values calculated for each large block, selecting one or more representative hash values for the large block and setting a representative flag for the representative hash values in the list of strong hash values to indicate the selection and using this flag for more efficient updating of sparse index on changes to the data. The method includes compiling a sparse index including an entry for each large block.

Description

METHOD FOR INDEXING A DATA ITEM IN A DATA STORAGE SYSTEM

TECHNICAL FIELD

The present disclosure relates generally to a computer-implemented method for indexing a data item in a data storage system, and more particularly, the disclosure relates to a data indexing module for the data storage system. Moreover, the disclosure also relates to the data storage system including the data indexing module for indexing the data item in the data storage system. BACKGROUND

Deduplication is a method used to reduce an amount of data either passed over a network or stored in a data storage system, by identifying duplicates of the data at granularity level and avoiding passing or storing such duplicates explicitly. Deduplication is done by (i) dividing the data into segments, (ii) calculating fingerprints (i.e. strong hashes) per segment, and (iii) using the fingerprints to identify identical data (i.e. duplicate data). The main problem in the deduplication method is to efficiently locate existing fingerprints that have a high probability of being equal to incoming fingerprints since the number of fingerprints is significant for large amounts of data where the deduplication is most relevant. One method to solve the above-mentioned problem is through a sparse index. The method of using the sparse index includes (i) dividing the fingerprints into areas (i.e. association blocks), and (ii) selecting the representatives (i.e. weak hashes) from each area via a deterministic method. The representatives are derived from the selected fingerprints. For example, the two fingerprints (i.e. strong hashes) with the maximal value in that area are selected, and a subset of bits to be used from each of these fingerprints is picked as the representatives.

These representatives, with a pointer to the area they represent, are placed into the sparse index. When new fingerprints arrive (i.e. incoming fingerprints), the new representatives are selected from the new fingerprints in the same way as for the stored fingerprints. These new representatives are used for searching in the sparse index. Upon locating the same representatives in the sparse index, relevant fingerprint areas are uploaded and a one-by-one comparison of the incoming fingerprints with the existing fingerprints is performed. When equal fingerprints are identified, the duplicate data can be located.

The information in the sparse index needs to be consistent with the existing fingerprints. Otherwise, deduplication ratio may degrade as the existing fingerprints may not be represented in the sparse index. When writing large sequential input/output (IOs), updating the sparse index inline does not cause any major problems, as the overhead of reading the additional fingerprints to compute representatives is small. However, when performing small random overwrites, the fingerprints may be spread over many fingerprint areas. Further, updating the sparse index may require reading all fingerprints in these areas in order to calculate new representatives for the sparse index, which may result in a significant amount of additional IOs for each such small, random overwrite.

The problem of sparse index update may be solved offline (especially, when high performance is needed), by reading the relevant area of fingerprints, locating new representatives, and updating the sparse index accordingly. The disadvantages of updating the sparse index offline is that: (i) the degradation of the deduplication ratio until the update can be performed in the sparse index, (ii) the need to store or identify the changes done to the fingerprints, to know which areas to update offline in the sparse index, and (iii) in the highly loaded systems, the user may not have enough offline time to properly update all needed entries in the sparse index.

Therefore, there arises a need to address the aforementioned technical drawbacks and problems in updating the index of the data in the data storage system.

SUMMARY

It is an object of the disclosure to provide a computer-implemented method for indexing a data item in a data storage system, a data indexing module for the data storage system, and the data storage system including the data indexing module for indexing the data item in the data storage system while avoiding one or more disadvantages of prior art approaches.

This object is achieved by the features of the independent claims. Further, implementation forms are apparent from the dependent claims, the description, and the figures. The disclosure provides a computer-implemented method for indexing a data item in a data storage system, a data indexing module for the data storage system, and the data storage system including the data indexing module for indexing the data item in the data storage system.

According to a first aspect, there is provided a computer-implemented method for indexing a data item in a data storage system. The method includes dividing the data item into one or more large blocks. The method includes dividing each large block into a plurality of small blocks. The method includes calculating a strong hash value for each of the small blocks and storing a list of strong hash values with a pointer to a location of the large block. The method includes, from the list of strong hash values calculated for each large block, selecting one or more representative hash values for the large block and setting a representative flag for the representative hash values in the list of strong hash values to indicate the selection. The method includes compiling a sparse index including an entry for each large block. Each entry is based on the representative hash values and a pointer to the list of strong hash values for each large block.

The method provides the list of strong hashes including a record of the strong hashes which have been selected as representative hash values and included in the sparse index. Based on this additional metadata (i.e. the strong hashes and the representative hash values), the method can prioritize the update mechanism for these blocks specifically. This can improve the relevance of the sparse index over time, thereby avoiding some of the degradations that, otherwise, takes place over time due to changes in the underlying data item, for example between updates in a periodic update scheme. The method enables to update the sparse index inline in an efficient way to decrease the deduplication degradation.

Optionally, the method includes, in response to a change in the data item: (i) identifying a changed large block in the data item and one or more changed small blocks, (ii) determining whether the strong hash corresponding to each changed small block is associated with a representative flag, and (iii) if a strong hash corresponding to a changed small block was selected as a representative hash value, updating the sparse index by: (a) reselecting one or more representative hash values for the changed large block and resetting one or more representative flags in the corresponding list of strong hash values, and (b) recompiling the sparse index with an updated entry for the changed large block, based on the new representative hash values. The method can ensure that if a change to the underlying data item affects the references in the sparse index, then the sparse index is immediately updated. Hence, the deduplication ratio may not degrade because the sparse index updates the references up to date, for example, the new representative hash values and the pointers to the list of strong hash values for each changed large block are updated.

Optionally, if a strong hash corresponding to a changed small block was not selected as a representative hash value, the method includes updating the sparse index with a probability determined by a ratio between a number of representative hash values for the changed large block and a total number of strong hashes in the list of strong hash values for the changed large block. The method ensures that the probability of updating the sparse index when it is not directly affected by a change is low, and is correlated with the probability that any given change would affect the sparse index directly.

Optionally, selecting the representative hash values uses a determinative process. The determinative process may include selecting one or more largest hash values. Two representative hash values may be selected.

Optionally, compiling the sparse index includes calculating a weak hash for each representative hash value. Optionally, compiling the sparse index includes compressing each pointer by storing a hash value of a file path for the list, an indication of the corresponding large block location within the data item, and a file size indication for the data item. A length of the hash value of the file path may be based on the file size of the data item. The sparse index may be stored in a memory, and the lists of strong hash values stored in a disk storage. Optionally, each strong hash has about 20 bytes.

According to a second aspect, there is provided a computer-readable medium configured to store instructions which, when executed by a processor, cause the processor to execute the above method.

According to a third aspect, there is provided a data indexing module for a data storage system. The data indexing module includes one or more processors configured to execute the above method.

According to a fourth aspect, there is provided a data storage system. The data storage system includes one or more data storage units and the data indexing module as described above. The data indexing module provides the list of strong hashes including a record of the strong hashes which have been selected as representative hash values and included in the sparse index. Based on this additional metadata (i.e. the strong hashes and the representative hash values), the data indexing module can prioritize the update mechanism for these blocks specifically. This can improve the relevance of the sparse index over time, thereby avoiding some of the degradations that, otherwise, takes place over time due to changes in the underlying data item, for example between updates in a periodic update scheme.

The data indexing module enables to set additional representative flags for the representative hash values in the list of strong hash values to indicate the selection. The data indexing module enables to update the sparse index inline in an efficient way to decrease the deduplication degradation.

A technical problem in the prior art is resolved, where the technical problem is that to update the sparse index of the data in the data storage system and to decrease the deduplication degradation.

Therefore, in contradistinction to the prior art, according to the computer-implemented method for indexing the data item in the data storage system, the data indexing module and the data storage system for indexing the data item in the data storage system, the relevance of the sparse index over time is improved, thereby avoiding some of the degradations that, otherwise, may take place over time due to changes in the underlying data, for example between updates in a periodic update scheme. The method provides the list of strong hashes including a record of the strong hashes which have been selected as representative hash values and included in the sparse index. Based on this additional metadata (i.e. the strong hashes and the representative hash values), the method can prioritize the update mechanism for these blocks specifically.

These and other aspects of the disclosure will be apparent from and the implementation(s) described below.

BRIEF DESCRIPTION OF DRAWINGS

Implementations of the disclosure will now be described, by way of example only, with reference to the accompanying drawings, in which:

FIG. 1 is a block diagram of a data indexing module for a data storage system in accordance with an implementation of the disclosure;

FIG. 2 is a block diagram of a data storage system in accordance with an implementation of the disclosure;

FIGS. 3A-3B illustrate exemplary views of indexing a data item in a data storage system before and after a change in the data item respectively in accordance with an implementation of the disclosure;

FIGS. 4A-4B are flow diagrams that illustrate a method for indexing a data item in a data storage system in accordance with an implementation of the disclosure;

FIGS. 5A-5C are flow diagrams that illustrate a method for indexing a data item in a data storage system in response to a change in the data item in accordance with an implementation of the disclosure; and

FIG. 6 is an illustration of a computer system (e.g. a data storage system, a data storage unit, a data indexing module) in which the various architectures and functionalities of the various previous implementations may be implemented.

DETAILED DESCRIPTION OF THE DRAWINGS

Implementations of the disclosure provide a computer-implemented method for indexing a data item in a data storage system, a data indexing module for indexing a data item in the data storage system. The disclosure also relates to the data storage system including the data indexing module for indexing the data item in the data storage system.

To make solutions of the disclosure more comprehensible for a person skilled in the art, the following implementations of the disclosure are described with reference to the accompanying drawings.

Terms such as "a first", "a second", "a third", and "a fourth" (if any) in the summary, claims, and foregoing accompanying drawings of the disclosure are used to distinguish between similar objects and are not necessarily used to describe a specific sequence or order. It should be understood that the terms so used are interchangeable under appropriate circumstances, so that the implementations of the disclosure described herein are, for example, capable of being implemented in sequences other than the sequences illustrated or described herein. Furthermore, the terms "include" and "have" and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, a method, a system, a product, or a device that includes a series of steps or units, is not necessarily limited to expressly listed steps or units but may include other steps or units that are not expressly listed or that are inherent to such process, method, product, or device.

FIG. 1 is a block diagram of a data indexing module 100 for a data storage system in accordance with an implementation of the disclosure. The data indexing module 100 includes one or more processors 102A-N. The one or more processors 102A-N are configured to execute a method for indexing a data item in the data storage system. The one or more processors 102A-N are configured to divide the data item into one or more large blocks. The one or more processors 102A-N are configured to divide each large block into one or more small blocks. The one or more processors 102A-N are configured to calculate a strong hash value for each of the small blocks and store a list of strong hash values with a pointer to a location of the large block. The one or more processors 102A- N are configured to, from the list of strong hash values calculated for each large block, select one or more representative hash values for the large block and set a representative flag for the representative hash values in the list of strong hash values to indicate the selection. The one or more processors 102A-N are configured to compile a sparse index including an entry for each large block. Each entry is based on the representative hash values and a pointer to the list of strong hash values for each large block.

The data indexing module 100 provides the list of strong hashes including a record of the strong hashes which have been selected as representative hash values and included in the sparse index. Based on this additional metadata (i.e. the strong hashes and the representative hash values), the data indexing module 100 can prioritize the update mechanism for these blocks specifically. This can improve the relevance of the sparse index over time, thereby avoiding some of the degradations that, otherwise, takes place over time due to changes in the underlying data item, for example between updates in a periodic update scheme.

The data indexing module 100 sets the additional representative flag for the representative hash values in the list of strong hash values to indicate the selection. The data indexing module 100 enables to update the sparse index inline in an efficient way to decrease the deduplication degradation. The data indexing module 100 efficiently monitors a deduplication ratio and memory consumption during write input/outputs (IOs), to identify if the sparse index is used. If the sparse index is used, the data indexing module 100 monitors, how long after each write input/output (IO), the deduplication ratio is changed, or how much of a deduplication degradation occurs, as well as to measure a performance of the IO to identify when the sparse index is updated. If the deduplication degradation is low and the performance is relatively high, the data indexing module 100 may deduce that, in high probability, the sparse index is updated efficiently.

Optionally, the one or more processors 102A-N are configured to, in response to a change in the data item: (i) identify a changed large block in the data item and one or more changed small blocks, (ii) determine whether the strong hash corresponding to each changed small block is associated with a representative flag, and (iii) if a strong hash corresponding to a changed small block was selected as a representative hash value, update the sparse index by: (a) reselecting one or more representative hash values for the changed large block and reset the one or more representative flags in the corresponding list of strong hash values, and (b) recompiling the sparse index with an updated entry for the changed large block, based on the new representative hash values.

The data indexing module 100 can ensure that if a change to the underlying data items affects the references in the sparse index, then the sparse index is immediately updated. Hence, the deduplication ratio may not degrade because the sparse index updates the references up to date, for example, the new representative hash values and the pointers to the list of strong hash values for each changed large block are updated.

Optionally, if a strong hash corresponding to a changed small block was not selected as a representative hash value, the one or more processors 102A-N update the sparse index with a probability determined by a ratio between a number of representative hash values for the changed large block and a total number of strong hashes in the list of strong hash values for the changed large block. The data indexing module 100 ensures that the probability of updating the sparse index when it is not directly affected by a change is low, and is correlated with the probability that any given change would affect the sparse index directly.

Optionally, the probability of updating the sparse index is the probability determined by a ratio between the number of representative hash values for the changed large block and the total number of strong hashes in the list of strong hash values for the changed large block. Optionally, the ratio between the number of representative hash values for the changed large block and the total number of strong hashes in the list of strong hash values for the changed large block is small to keep the sparse index small which allows them to be entirely in a memory, thereby keeping probability as low. The sparse index may be stored in the memory, and the lists of strong hash values stored in a disk storage. Optionally, each strong hash has about 20 bytes.

Optionally, the one or more processors 102A-N select the representative hash values using a determinative process. The determinative process may include selecting one or more largest hash values. Two representative hash values may be selected. Optionally, when a strong hash corresponding to a changed small block was not selected as a representative hash value, the probability of the strong hash being the new representative hash value post-change is low, when the selection of the representative hash value is deterministic and relies on choosing the minimal or maximal strong hashes in the list of strong hash values (e.g. the strong hash values are uniformly distributed, so the probability is l/(#the strong hashes in the list of strong hash values)). Optionally, the probability of the change of the strong hash does not result in the updating of the sparse index, and as a result, the deduplication degradation ratio is low.

FIG. 2 is a block diagram of a data storage system 200 in accordance with an implementation of the disclosure. The data storage system 200 includes one or more data storage units 202A-N and a data indexing module 204. Optionally, the one or more data storage units 202A-N are communicatively connected to the data indexing module 204. The data indexing module 204 is configured to divide a data item into one or more large blocks. The data indexing module 204 is configured to divide each large block into one or more small blocks. The data indexing module 204 is configured to calculate a strong hash value for each of the small blocks and store a list of strong hash values with a pointer to a location of the large block. The data indexing module 204 is configured to, from the list of strong hash values calculated for each large block, select one or more representative hash values for the large block and set a representative flag for the representative hash values in the list of strong hash values to indicate the selection. The data indexing module 204 is configured to compile a sparse index including an entry for each large block. Each entry is based on the representative hash values and a pointer to the list of strong hash values for each large block.

The data storage system 200 provides the list of strong hashes including a record of the strong hashes which have been selected as representative hash values and included in the sparse index. Based on this additional metadata (i.e. the strong hashes and the representative hash values), the data storage system 200 can prioritize the update mechanism for these blocks specifically. This can improve the relevance of the sparse index over time, thereby avoiding some of the degradations that, otherwise, takes place overtime due to changes in the underlying data item, for example between updates in a periodic update scheme.

The data storage system 200 is a term referred to describe a data storage unit 202, or a group of data storage units 202A-N, that a network uses store copies of one or more data items across high-speed connections. The data storage units 202A-N are essential because it backs up critical data items/files and other data to a central location. Users can then easily access these data items/files. The data storage units 202A-N are storage devices that are connected to a network that allows storage and retrieval of data from a central location for authorised network users.

FIGS. 3A-3B illustrate exemplary views of indexing a data item in a data storage system before and after a change in the data item respectively in accordance with an implementation of the disclosure. FIG. 3A shows a sparse index 302, a logical view of the data items/files (e.g. a file 1 and a file 2) in a repository, and a list of strong hash values 308 of the files (e.g. the file 1, the file 2, etc.). The sparse index 302 may be stored in a memory, and the lists of strong hash values 308 may be stored in a disk storage 310. In this example, the file 1 is divided into one or more large blocks. Each large block is divided into one or more small blocks. Optionally, when the file 1 is written, the one or more small blocks may be bl to bi+5 and when the file 2 is written, the one or more small blocks may be bl to bj+5 as shown in the FIG. 3 A. A strong hash value is calculated for each of the small blocks of the file 1 and a list of strong hash values 308 of file 1 (e.g. Hl,l to Hl,k) is stored with a pointer 306 to a location of a large block 312 and the strong hash value is calculated for each of the small blocks of the file 2 and the list of strong hash values 308 of file 2 (e.g. H2,l to H2,s) is stored with the pointer 306 to a location of the large block 312 as shown in the FIG. 3A.

Optionally, selecting the representative hash values 304 uses a determinative process. The determinative process may include selecting one or more largest hash values. Two representative hash values 304 may be selected. Optionally, compiling the sparse index 302 includes calculating a weak hash for each representative hash value 304. One or more representative hash values 304 for the large block 312 for the file 1 may be selected as hi and h3 (i.e. the weak hashes) using the determinative process from the strong hash values HI, 3 and Hl,i+1 from the list of strong hash values 308 of the file 1 with the pointer 306 of 111 and 131 to the location of the large block 312 respectively to indicate the selection and the representative flags 314 are set as 1 and 1 for the strong hash values HI, 3 and Hl,i+1 respectively as shown in the FIG. 3A. Optionally, the one or more representative hash values 304 for the large block 312 for the file 2 may be selected as h2 and h3 (i.e. the weak hashes) using the determinative process from the strong hash values H2,2 and H2,7 from the list of strong hash values 308 of file 2 with the pointer 306 of 122 and 132 to the location of the large block 312 respectively to indicate the selection and the representative flags 314 are set as 1 and 1 for the strong hash values H2,2 and H2,7 respectively to indicate the selection as shown in the FIG. 3 A. Optionally, compiling the sparse index 302 includes compressing each pointer 306 by storing a hash value of a file path for the list 308, an indication of the corresponding large block location within the data item, and a file size indication for the data item. A length of the hash value of the file path may be based on the file size of the data item. The sparse index 302 includes an entry for each large block of the file 1 and the file 2. Each entry in the sparse index 302 is based on the representative hash values 304 of the data items (the file 1, the file 2 etc.) as hi, h2 h3 etc. and the pointer 306 to the list of strong hash values 308 for each large block of the file 1 and the file 2 (111, 122, 131, 132 etc.). The hi, h2, h3 etc. may be the weak hashes calculated for each representative hash value 304.

When the file 1 and the file 2 are randomly written by a data indexing module, and the flow to update the sparse index 302 is shown in the FIG. 3B.

The data indexing module randomly writes to the file 1 as follows:

1. Write to b2 -> update HI, 2 a. Random check for update -> no update

2. Write to bi -> update Hl,i a. Random check for update -> no update

3. Write to b3

update HI, 3 - set a representative flag 314 to 1 a. read Hl,l - Hl,i+5. b. deduce the previous representative hash values (i.e. the weak hashes) according to the set representative flag 314: hi, h3 c. calculate new representative hash values 304 (i.e. the weak hashes) hx, hy (from HI, 6 and Hi,l) d. update the sparse index 302 by: i. removing the pairs of the representative hash values 304 and the pointer 306 to the location of the large block 312 <hl,ll 1>, <h3,113> ii. adding the pairs of the representative hash values 304 and the pointer 306 to the location of the large block 312 <hx,lx2>, <hy,lyl> e. update the 4 representative flags 314.

Optionally, the data indexing module randomly writes to the file 2 as follows:

1. Write randomly j times to b8 - bj -> update H2,8 to H2,j a. Random check for update -> no update

2. Write to b4 -> update H2,4 - according to the probability a. random check for update -> update b. read H2,l - H2,j. c. deduce the previous representative hash values 304 (i.e. the weak hashes) according to the set representative flag 314: h2, h3 d. calculate new representative hash values 304 (i.e. the weak hashes) h2, hi (from H2,2 and H2,4) e. update the sparse index 302 by: i. removing the pair of the representative hash values 304 and the pointer 306 to the location of the large block 312 <h3,132> ii. adding the pair of the representative hash values 304 and the pointer 306 to the location of the large block 312 <hl,114> f. update the 2 representative flags 314.

FIGS. 4A-4B are flow diagrams that illustrate a method for indexing a data item in a data storage system in accordance with an implementation of the disclosure. At a step 402, the data item is divided into one or more large blocks. At a step 404, each large block is divided into one or more small blocks. At a step 406, a strong hash value is calculated for each of the small blocks and a list of strong hash values is stored with a pointer to a location of the large block. At a step 408, from the list of strong hash values calculated for each large block, one or more representative hash values is selected for the large block and a representative flag is set for the representative hash values in the list of strong hash values to indicate the selection. At a step 410, a sparse index including an entry for each large block is compiled. Each entry is based on the representative hash values and a pointer to the list of strong hash values for each large block.

The method provides the list of strong hashes including a record of the strong hashes which have been selected as representative hash values and included in the sparse index. Based on this additional metadata, the method can prioritize the update mechanism for these blocks specifically. This can improve the relevance of the sparse index over time, thereby avoiding some of the degradations that, otherwise, takes place over time due to changes in the underlying data item, for example between updates in a periodic update scheme.

With reference to FIGS. 4A-4B, FIGS. 5A-5C are flow diagrams that illustrate a method for indexing a data item in a data storage system in response to a change in the data item in accordance with an implementation of the disclosure. At a step 502, a changed large block in the data item and one or more changed small blocks are identified. At a step 504, it is determined whether the strong hash corresponding to each changed small block is associated with a representative flag. At a step 506, if a strong hash corresponding to a changed small block was selected as a representative hash value, the sparse index is updated by: (a) reselecting one or more representative hash values for the changed large block and resetting the one or more representative flags in the corresponding list of strong hash values, and (b) recompiling the sparse index with an updated entry for the changed large block, based on the new representative hash values. The method can ensure that if a change to the underlying data item affects the references in the sparse index, then the sparse index is immediately updated. At a step 508, if a strong hash corresponding to a changed small block was not selected as a representative hash value, the sparse index is updated with a probability determined by a ratio between a number of representative hash values for the changed large block and a total number of strong hashes in the list of strong hash values for the changed large block. The method ensures that the probability of updating the sparse index when it is not directly affected by a change is low, and is correlated with the probability that any given change would affect the sparse index directly.

The method stores the list of strong hashes with additional representative flag per representative hash to indicate the selection for the large block. The method enables to update the sparse index in an efficient way to decrease the deduplication degradation. The method efficiently monitors a deduplication ratio and a memory consumption during write input/outputs (IOs), to identify if the sparse index is used. If the sparse index is used, the method then monitors, how long after each write input/output (10), the deduplication ratio is changed, or how much of a deduplication degradation occurs, as well as to measure a performance of the IO to identify when the sparse index is updated. If the deduplication degradation is low and the performance is relatively high, the method can deduce that, in high probability, the sparse index is updated efficiently.

In an implementation, a computer-readable medium configured to store instructions which, when executed by a processor, cause the processor to execute the any of above methods as described in FIGS. 4A & 4B and FIGS. 5A-5C.

FIG. 6 is an illustration of a computer system (e.g. a data storage system, a data storage unit, a data indexing module) in which the various architectures and functionalities of the various previous implementations may be implemented. As shown, the computer system 600 includes at least one processor 604 that is connected to a bus 602, wherein the computer system 600 may be implemented using any suitable protocol, such as PCI (Peripheral Component Interconnect), PCI-Express, AGP (Accelerated Graphics Port), Hyper Transport, or any other bus or point-to-point communication protocol (s). The computer system 600 also includes a memory 606.

Control logic (software) and data are stored in the memory 606 which may take a form of random- access memory (RAM). In the disclosure, a single semiconductor platform may refer to a sole unitary semiconductor-based integrated circuit or chip. It should be noted that the term single semiconductor platform may also refer to multi-chip modules with increased connectivity which simulate on-chip modules with increased connectivity which simulate on-chip operation, and make substantial improvements over utilizing a conventional central processing unit (CPU) and bus implementation. Of course, the various modules may also be situated separately or in various combinations of semiconductor platforms per the desires of the user.

The computer system 600 may also include a secondary storage 610. The secondary storage 610 includes, for example, a hard disk drive and a removable storage drive, representing a floppy disk drive, a magnetic tape drive, a compact disk drive, digital versatile disk (DVD) drive, recording device, universal serial bus (USB) flash memory. The removable storage drives at least one of reads from and writes to a removable storage unit in a well-known manner.

Computer programs, or computer control logic algorithms, may be stored in at least one of the memory 606 and the secondary storage 610. Such computer programs, when executed, enable the computer system 600 to perform various functions as described in the foregoing. The memory 606, the secondary storage 610, and any other storage are possible examples of computer-readable media.

In an implementation, the architectures and functionalities depicted in the various previous figures may be implemented in the context of the processor 604, a graphics processor coupled to a communication interface 612, an integrated circuit (not shown) that is capable of at least a portion of the capabilities of both the processor 604 and a graphics processor, a chipset (namely, a group of integrated circuits designed to work and sold as a unit for performing related functions, and so forth).

Furthermore, the architectures and functionalities depicted in the various previous- described figures may be implemented in a context of a general computer system, a circuit board system, a game console system dedicated for entertainment purposes, an application-specific system. For example, the computer system 600 may take the form of a desktop computer, a laptop computer, a server, a workstation, a game console, an embedded system.

Furthermore, the computer system 600 may take the form of various other devices including, but not limited to a personal digital assistant (PDA) device, a mobile phone device, a smart phone, a television, and so forth. Additionally, although not shown, the computer system 600 may be coupled to a network (for example, a telecommunications network, a local area network (LAN), a wireless network, a wide area network (WAN) such as the Internet, a peer-to-peer network, a cable network, or the like) for communication purposes through an I/O interface 608.

It should be understood that the arrangement of components illustrated in the figures described are exemplary and that other arrangement may be possible. It should also be understood that the various system components (and means) defined by the claims, described below, and illustrated in the various block diagrams represent components in some systems configured according to the subject matter disclosed herein. For example, one or more of these system components (and means) may be realized, in whole or in part, by at least some of the components illustrated in the arrangements illustrated in the described figures.

In addition, while at least one of these components are implemented at least partially as an electronic hardware component, and therefore constitutes a machine, the other components may be implemented in software that when included in an execution environment constitutes a machine, hardware, or a combination of software and hardware. Although the disclosure and its advantages have been described in detail, it should be understood that various changes, substitutions, and alterations can be made herein without departing from the spirit and scope of the disclosure as defined by the appended claims.

Claims

1. A computer-implemented method for indexing a data item in a data storage system (200), the method comprising: dividing the data item into one or more large blocks; dividing each large block into a plurality of small blocks; calculating a strong hash value for each of the small blocks and storing a list of strong hash values (308) with a pointer (306) to a location of the large block (312); from the list of strong hash values (308) calculated for each large block, selecting one or more representative hash values (304) for the large block and setting a representative flag (314) for the representative hash values (304) in the list of strong hash values (308) to indicate the selection; and compiling a sparse index (302) comprising an entry for each large block, wherein each entry is based on the representative hash values (304) and the pointer (306) to the list of strong hash values (308) for each large block.

2. The method of claim 1, further comprising, in response to a change in the data item: identifying a changed large block in the data item and one or more changed small blocks; determining whether the strong hash corresponding to each changed small block is associated with a representative flag (314); and if a strong hash corresponding to a changed small block was selected as a representative hash value (304), updating the sparse index (302) by: reselecting one or more representative hash values (304) for the changed large block and resetting one or more representative flags (314) in the corresponding list of strong hash values (308); and recompiling the sparse index (302) with an updated entry for the changed large block, based on the new representative hash values (304).

3. The method of claim 2, wherein if a strong hash corresponding to a changed small block was not selected as a representative hash value (304), the method comprises updating the sparse index (302) with a probability determined by a ratio between a number of representative hash values (304) for the changed large block and a total number of strong hashes in the list of strong hash values (308) for the changed large block.

4. The method of any preceding claim, wherein selecting the representative hash values (304) uses a determinative process.

5. The method of claim 4, wherein the determinative process comprises selecting one or more largest hash values.

6. The method of any preceding claim, wherein two representative hash values (304) are selected.

7. The method of any preceding claim, wherein compiling the sparse index (302) includes calculating a weak hash for each representative hash value (304).

8. The method of any preceding claim, wherein compiling the sparse index (302) includes compressing each pointer (306) by storing a hash value of a file path for the list, an indication of the corresponding large block location within the data item and a file size indication for the data item, wherein a length of the hash value of the file path is based on the file size of the data item.

9. The method of any preceding claim, wherein the sparse index (302) is stored in a memory, and the lists of strong hash values stored in a disk storage (310).

10. The method of any preceding claim, wherein each strong hash has about 20 bytes.

11. A computer readable medium configured to store instructions which, when executed by a processor (102), cause the processor (102) to execute the method of any preceding claim.

12. A data indexing module (100, 204) for a data storage system (200), the data indexing module (100, 204) comprising one or more processors (102A-N) configured to execute the method of any one of claims 1 to 10.

13. A data storage system (200) comprising: one or more data storage units (202 A-N); and the data indexing module (100, 204) of claim 12.