WO2016072988A1 - Data chunk boundary - Google Patents

Data chunk boundary Download PDF

Info

Publication number
WO2016072988A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
chunks
subsequent
chunk
data set
Prior art date
Application number
PCT/US2014/064255
Other languages
French (fr)
Inventor
Kevin Lloyd Jones
Original Assignee
Hewlett Packard Enterprise Development Lp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hewlett Packard Enterprise Development Lp filed Critical Hewlett Packard Enterprise Development Lp
Priority to PCT/US2014/064255 priority Critical patent/WO2016072988A1/en
Publication of WO2016072988A1 publication Critical patent/WO2016072988A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/174Redundancy elimination performed by the file system
    • G06F16/1748De-duplication implemented within the file system, e.g. based on file segments
    • G06F16/1752De-duplication implemented within the file system, e.g. based on file segments based on file chunks

Definitions

  • Storage systems may include data deduplication techniques to help manage storage capacity requirements by reducing the occurrence of redundant data.
  • the storage system may store a single unique instance of the data to storage media.
  • the storage system replaces the redundant data copy with a pointer to the unique data instance instead of storing another copy of the data.
  • FIG. 1 is a block diagram of a computer system for data chunk boundary processing according to an example implementation.
  • FIG. 2 is a flow diagram of a computer system for data chunk boundary processing of Fig. 1 according to an example implementation.
  • FIGs. 3A and 3B are diagrams of operation of a computer system for data chunk boundary processing according to an example implementation.
  • FIG. 4 is an example block diagram showing a non-transitory, computer-readable medium that stores instructions for a computer system for data chunk boundary processing in accordance with an example
  • a backup system may generally store once each unique region or "chunk" of a set or collection of data (or "data set" herein). Such a chunk may be referred to as a "data chunk" herein. In one example, a data chunk is a region of data within a data set bounded by start and end points.
  • a backup system may perform deduplication on the basis of content-based fingerprints, such as cryptographic hash functions, of the content of the data chunks of the data collection to be backed up.
  • the backup system may compare respective hashes of data chunks of a data collection provided for backup to hashes of previously stored data chunks to determine which data chunks of the provided data collection have not been previously stored in persistent storage of the backup system and thus are to be stored in storage.
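The hash-based comparison just described can be sketched briefly in Python; this is an illustrative sketch only (the patent does not specify a language, a particular hash algorithm, or these function names), using SHA-256 as the content-based fingerprint:

```python
import hashlib

def deduplicate(chunks, store):
    """Store only chunks whose fingerprint is not already present.

    `chunks` is an iterable of byte strings; `store` maps a hash digest
    to the stored chunk. Names are illustrative, not from the patent.
    """
    refs = []  # the data set can be reconstructed from these references
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:   # unseen chunk: persist a single instance
            store[digest] = chunk
        refs.append(digest)       # duplicates become pointers to that instance
    return refs

store = {}
refs1 = deduplicate([b"alpha", b"beta", b"alpha"], store)
# only two unique chunks are stored, while three references are returned
```

The repeated chunk costs only a reference, which is the storage saving the deduplication process aims for.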
  • the system may employ chunking techniques, such as dividing a data collection into data chunks, which may have an impact on the deduplication process.
  • the consistency of the chunking technique utilized may have an impact on deduplication performance.
  • a chunking technique that is able to produce the same data chunks when provided the same data collection may result in good deduplication performance.
  • the hashes of the data chunks of an incoming data collection are likely to match hashes of previously stored data chunks when the same data collection has been stored previously.
  • Computer systems with deduplication techniques may receive a plurality of data sets from other systems or devices such as hosts.
  • the data sets may include subsequent data sets received after or subsequent to previous data sets.
  • these computer systems may encounter mutation conditions or variations in the data sets over time.
  • Such mutation or variation in data sets may involve insertion of metadata by hosts or storage applications of hosts.
  • such mutation conditions may involve alternate receipt or delivery order of subsets of the data of the data sets.
  • the occurrence of the mutation conditions may have a negative impact on the performance of the deduplication process.
  • a mutated version of a subsequent data set compared to a previous data set may be split or divided into component data chunks which differ from those of the previous or original data set due to different chunk boundaries.
  • the process may check the data chunks from the mutated data of the subsequent data set and determine that they are not duplicate data chunks and therefore not perform deduplication on the data.
  • the techniques of the present application facilitate the deduplication process of storage systems to help manage physical storage requirements by helping reduce or eliminate duplication of data within storage.
  • the techniques detect occurrence of duplication by splitting or dividing data sets into multiple component data chunks and identifying those chunks common to multiple data sets.
  • the techniques may store one instance of a data chunk common to multiple data sets. In one example, the techniques perform data chunk boundary processing on data sets.
  • the data chunk boundary process provides a feedback mechanism to establish data chunk boundaries for subsequent data sets based on boundaries from previous data sets.
  • the resultant data chunks identified using this boundary selection technique may be more likely to be common to multiple data sets, thus helping to increase the proportion of deduplicated data.
  • a system that includes a chunk boundary module configured to receive a plurality of data sets from a device such as a host or other device.
  • the chunk boundary module is configured to receive a subsequent data set which is received subsequent to a previous data set.
  • the chunk boundary module is configured to apply a chunk boundary process to divide data of a data set into data chunks.
  • the chunk boundary module is configured to compare data chunks of a subsequent data set to data chunks of the previous data set.
  • the chunk boundary module is configured to check if a region of unmatched data exists within subsequent data set and that region is bounded by a pair of data chunks which are determined to be the same as a pair of chunks bounding a similar sized region within previous data set.
  • if the chunk boundary module determines that this condition occurred, then the module creates a plurality of smaller data chunks by sectioning that region.
  • the data of subsequent data set has a higher probability of partially matching the set of smaller data chunks than of entirely matching the smaller data chunks.
  • the chunk boundary module is configured to compare data chunks that includes the module configured to apply a hash process on the data chunks of the subsequent data set and previous data set, and compare the hashes of the data chunks of the subsequent data set to the hashes of the data chunks of the previous data set.
  • the chunk boundary module is configured to predict the presence of an adjacent data chunk in the subsequent data set to be identical to the adjacent data chunk in the previous data set and confirm the prediction by means of hash comparison.
  • the chunk boundary module is configured to apply a deduplication process on data chunks that includes having the module configured to compare hashes of the data chunks in data storage and store the data chunks to storage if the hashes of the data chunks are not already stored in storage.
  • the chunk boundary module is configured to apply a chunk boundary process that includes having the module configured to establish the boundaries of a data chunk which includes to merge a plurality of adjacent subsequent chunks which have been found identical to a plurality of adjacent previous chunks.
  • the techniques of the present application may increase the proportion of data deduplicated by a deduplication process of a storage system. This may be achieved while having the system remain data agnostic such that the process may reduce the need to be aware of the nature or characteristics of the data or mutation or variation of data between data sets.
  • FIG. 1 is a block diagram of a system 100 with a computer system 102 for data chunk boundary processing according to an example
  • computer system 102 is coupled to data storage 104 as part of storage mechanisms to store and retrieve data.
  • computer system 102 may receive data from an external device such as a host.
  • the received data may be grouped or arranged as files as part of a file system.
  • a file system may include data blocks which are groups of data comprised of bytes of data organized as files as part of directory structures.
  • Another device or system such as a host (not shown), may send to computer system 102 write commands to write data blocks from the host to data storage 104. Further, the host may send to computer system 102 read commands to read data blocks back from storage and return the data blocks to the host.
  • the computer system 102 may be an apparatus such as an electronic device capable of processing data.
  • the computer system 102 includes a chunk boundary module 106 to manage the operation of the computer system including communication with data storage 104 and other devices such as host devices or computers.
  • the chunk boundary module 106 may interact with a host to process write commands to write data blocks from the host to the data storage.
  • the chunk boundary module 106 may interact with a host to process read commands to read data blocks back from storage and return the data blocks to the host.
  • chunk boundary module 106 may be configured to receive a plurality of data sets 108 (108-1 through 108-n) from a device such as a host or other device.
  • chunk boundary module 106 may receive a subsequent data set, such as second data set 108-2, which is received subsequent to a previous data set such as first data set 108-1.
  • the chunk boundary module 106 is configured to apply a chunk boundary process to divide data of data sets 108 into data chunks 110.
  • the chunk boundary module 106 is configured to determine data chunk boundaries 112 between data chunks of data sets.
  • the chunk boundaries 112 may include starting and end points of data chunks 110.
  • the chunk boundary module 106 is configured to compare data chunks 110 of a subsequent data set, such as second data set 108-2, to data chunks of the previous data set such as first data set 108-1.
  • chunk boundary module 106 is configured to check if a region of unmatched data exists within subsequent data set 108-2 and that region is bounded by a pair of data chunks 110 which are determined to be the same as a pair of chunks bounding a similar sized region within previous data set 108-1. If chunk boundary module 106 determines that this condition occurred, then the module creates a plurality of smaller data chunks 110 by sectioning that region. In one example, the data of subsequent data set 108-2 may have a higher probability of partially matching the set of smaller data chunks 110 than of entirely matching the smaller data chunks.
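The sectioning step above can be sketched minimally in Python, assuming each chunk is a byte string and matching is done by hash comparison; the split-in-half policy and all names are illustrative assumptions, not specified by the patent:

```python
import hashlib

def h(chunk):
    """Content-based fingerprint of a chunk (SHA-256 as an example)."""
    return hashlib.sha256(chunk).digest()

def section_unmatched(prev_chunks, next_chunks):
    """If an unmatched chunk in the subsequent set is bounded by chunks
    that match the previous set, replace it with two smaller chunks.
    A simplified sketch: real boundary selection would also compare
    the sizes of the bounded regions."""
    prev_hashes = {h(c) for c in prev_chunks}
    out = []
    for i, c in enumerate(next_chunks):
        bounded = (0 < i < len(next_chunks) - 1
                   and h(next_chunks[i - 1]) in prev_hashes
                   and h(next_chunks[i + 1]) in prev_hashes)
        if h(c) not in prev_hashes and bounded:
            mid = len(c) // 2
            out.extend([c[:mid], c[mid:]])  # section the unmatched region
        else:
            out.append(c)                   # keep matched chunks as-is
    return out
```

With smaller chunks in place of the mutated region, a later data set that mutates only part of that region can still match some of the smaller pieces.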
  • chunk boundary module 106 is configured to compare data chunks, which includes having the module apply a hash process on the data chunks 110 of subsequent data set 108-2 and previous data set 108-1.
  • the chunk boundary module 106 compares the hashes or hash values of data chunks 110 of subsequent data set 108-2 to the hashes of the data chunks of previous data set 108-1.
  • chunk boundary module 106 is configured to predict the presence of an adjacent data chunk in the subsequent data set to be identical to the adjacent data chunk in the previous data set and confirm the prediction by means of a hash comparison.
  • chunk boundary module 106 is configured to apply a deduplication process on data chunks 110, which includes having the module compare hashes or hash values of data chunks 110 in data storage 104 and store the data chunks to data storage 104 if the hashes of the data chunks are not already stored in data storage.
  • chunk boundary module 106 is configured to apply a chunk boundary process that includes having the module establish data chunk boundaries 112 of data chunks 110, which includes merging a plurality of adjacent subsequent chunks which have been found identical to a plurality of adjacent previous chunks.
  • data chunk boundary module 106 provides a feedback mechanism (shown by arrow 114) to establish data chunk boundaries 112 for subsequent data sets 108-2 based on data chunk boundaries from previous data sets 108-1.
  • the resultant data chunks identified using this boundary selection technique may be more likely to be common to multiple data sets, thus helping to increase the proportion of deduplicated data.
  • the computer system 102 may be any electronic device capable of data processing such as a server computer, mobile device and the like.
  • the functionality of the components of computer system 102 may be implemented in hardware, software or a combination thereof.
  • the computer system 102 may communicate with data storage 104 and other devices such as hosts using any electronic communication means including wired, wireless, network based such as storage area network (SAN), Ethernet, Fibre Channel and the like.
  • the data storage 104 may include a plurality of storage devices (not shown) configured to present logical storage devices to other electronic devices such as hosts.
  • electronic devices may be coupled to computer system 102, such as hosts, which may access the logical configuration of storage array as LUNS.
  • the storage devices may include any means to store data for later retrieval.
  • the storage devices may include non-volatile memory, volatile memory or a combination thereof. Examples of non-volatile memory include, but are not limited to, electrically erasable programmable read only memory (EEPROM) and read only memory (ROM). Examples of volatile memory include, but are not limited to, static random access memory (SRAM), and dynamic random access memory (DRAM). Examples of storage devices may include, but are not limited to, HDDs, CDs, DVDs, SSDs, optical drives, flash memory devices and other like devices.
  • the techniques of the present application may increase the proportion of data deduplicated by a deduplication process of a storage system. This may be achieved while having the system remain data agnostic such that the process may reduce the need to be aware of the nature or characteristics of the data or mutation or variation of data between data sets.
  • computer system 102 is for illustrative purposes and other implementations of the system may be employed to practice the techniques of the present application.
  • computer system 102 is shown as a single component but the computer system may include a plurality of computer systems coupled to data storage 104 to practice the techniques of the present application.
  • FIG. 2 is a flow diagram 200 of a computer system for data chunk boundary processing of Fig. 1 according to an example implementation.
  • chunk boundary module 106 is configured to receive data such as data sets 108 from another device such as a host coupled to computer system 102.
  • Processing may begin at block 202, wherein chunk boundary module 106 receives a plurality of subsequent data sets received subsequent to previous data sets. For example, chunk boundary module 106 may receive a second or subsequent data set 108-2 that is received subsequent to a first or previous data set 108-1. Processing then proceeds to block 204.
  • chunk boundary module 106 applies a chunk boundary process to divide data of a data set into data chunks.
  • the chunk boundary process may include processes such as identification of unique chunks of data, or byte patterns of data of the data sets. Processing then proceeds to block 206.
  • chunk boundary module 106 compares data chunks of a subsequent data set to data chunks of the previous data set. For example, chunk boundary module 106 may compare data chunks 110 of subsequent or second data set 108-2 to data chunks of previous or first data set 108-1. Processing then proceeds to block 208.
  • chunk boundary module 106 checks if a region of unmatched data exists within the subsequent data set and that region is bounded by a pair of data chunks which are determined to be the same as a pair of chunks bounding a similar sized region within a previous data set. If so, then chunk boundary module 106 creates a plurality of smaller data chunks by sectioning that region. In one example, data of a subsequent data set has a higher probability of partially matching the set of smaller data chunks than of entirely matching the smaller data chunks. Processing then may proceed to the End block or back to block 202 to continue to process other data sets.
  • processing may further include having chunk boundary module 106 configured to compare data chunks, which includes having the module apply a hash process on the data chunks 110 of subsequent data set 108-2 and previous data set 108-1.
  • the chunk boundary module 106 compares the hashes or hash values of data chunks 110 of subsequent data set 108-2 to the hashes of the data chunks of previous data set 108-1.
  • chunk boundary module 106 is configured to predict the presence of an adjacent data chunk in the subsequent data set to be identical to the adjacent data chunk in the previous data set and confirm the prediction by means of a hash comparison.
  • chunk boundary module 106 is configured to apply a deduplication process on data chunks 110, which includes having the module compare hashes or hash values of data chunks 110 in data storage 104 and store the data chunks to data storage 104 if the hashes of the data chunks are not already stored in data storage.
  • the chunk boundary module 106 is configured to apply a chunk boundary process that includes having the module establish data chunk boundaries 112 of data chunks 110, which includes merging a plurality of adjacent subsequent chunks which have been found identical to a plurality of adjacent previous chunks.
  • the techniques of the present application may increase the proportion of data deduplicated by a deduplication process of a storage system. This may be achieved while having the system remain data agnostic such that the process may reduce the need to be aware of the nature or characteristics of the data or mutation or variation of data between data sets.
  • chunk boundary module 106 may process a different number of data sets or in a different order.
  • Fig. 3A is a diagram 300 of operation of a computer system for data chunk boundary processing according to an example implementation.
  • the diagram 300 describes chunk boundary module 106 performing a process to check if a region of unmatched data exists within the subsequent data set and that region is bounded by a pair of data chunks which are determined to be the same as a pair of chunks bounding a similar sized region within the previous data set. If chunk boundary module 106 determines that this condition occurred, then the module creates a plurality of smaller data chunks by sectioning that region.
  • the data of subsequent data set may have a higher probability of partially matching the set of smaller data chunks than of entirely matching the smaller data chunks.
  • chunk boundary module 106 is configured to receive data such as data sets 108 from another device such as host coupled to computer system 102
  • Processing may begin wherein chunk boundary module 106 receives a second subsequent data set 108-2 and a third subsequent data set 108-3 that are received subsequent to first or previous data set 108-1.
  • chunk boundary module 106 applies a chunk process to divide or split first previous data set 108-1 into data chunks A, B, C, D, E, F.
  • chunk boundary module 106 applies a chunk process to divide or split second subsequent data set 108-2 into data chunks A, B, D, F. However, in this case, chunk boundary module 106 determines that data chunks C and E from previous data set 108-1 do not have a match in subsequent data set 108-2, which are marked as data chunk Unmatched-1 and data chunk Unmatched-2. The chunk boundary module 106 replaces data chunk Unmatched-1 with data chunks W, X as shown by arrow 302. In a similar manner, chunk boundary module 106 replaces data chunk Unmatched-2 with data chunks Y, Z as shown by arrow 304.
  • chunk boundary module 106 applies a chunk process to divide or split third subsequent data set 108-3 into data chunks A, B, W, D, Z, F.
  • the chunk boundary module 106 determines that data chunks X and Y from subsequent data set 108-2 do not have a match in subsequent data set 108-3, and marks them as unmatched data chunks.
  • subsequent data set 108-3 now matches data chunk W and data chunk Z but does not match data chunk X and data chunk Y.
  • this part of the data set may be characterized as being chronically volatile because large mutations or variations between data sets frequently occur here.
  • the techniques of the present application may increase the proportion of data deduplicated by a deduplication process of a storage system. This may be achieved while having the system remain data agnostic such that the process may reduce the need to be aware of the nature or characteristics of the data or mutation or variation of data between data sets.
  • chunk boundary module 106 may process a different number of data sets or in a different order.
  • Fig. 3B is a diagram 350 of operation of a computer system for data chunk boundary processing according to an example implementation.
  • the diagram 350 illustrates how chunk boundary module 106 establishes data chunks in subsequent data sets 108-2 by predicting its size based on knowledge of data chunks in previous data sets 108-1 .
  • chunk boundary module 106, given the presence of an identical data chunk in previous data sets 108-1 and subsequent data sets 108-2, predicts the presence of an adjacent data chunk in the subsequent data set to be identical to the adjacent data chunk in the previous data set and confirms the prediction by means of hash comparison.
  • chunk boundary module 106 is configured to receive data such as data sets 108 from another device such as host coupled to computer system 102
  • chunk boundary module 106 receives a first or previous data set 108-1 .
  • chunk boundary module 106 applies a chunk process to divide or split previous data set 108-1 into data chunks labeled A, B, C, D, E, F.
  • the chunk boundary module 106 generates or calculates hash values for data chunks which are shown as "#" symbols which represent hash values associated with the data chunks A#, B#, C#, D#, E#, F#.
  • chunk boundary module 106 receives a second or subsequent data set 108-2 subsequent to previous data set 108-1 .
  • chunk boundary module 106 applies a chunk process to divide or split subsequent data set 108-2 into data chunks.
  • chunk boundary module 106 determines or establishes data chunks in subsequent data set 108-2 by predicting its size based on knowledge of data chunks in previous data set 108-1 .
  • chunk boundary module 106 attempts to match data chunk A# of subsequent data set 108-2 by generating chunk length using a fingerprint technique such as a Rabin fingerprint.
  • the chunk boundary module 106 determines a hash value for the data chunk of subsequent data set 108-2 using a hash process and then compares the hash of data chunk A# of subsequent data set 108-2 with data chunk A# of previous data set 108-1.
  • chunk boundary module 106 attempts to predict the presence of data chunks following the previous data chunks of subsequent data set 108-2 using the length of the previous data chunk. For example, to illustrate operation, chunk boundary module 106 attempts to predict the presence of a data chunk labeled "Predicted" following the previous data chunk A# of subsequent data set 108-2 using the length of previous data chunk B#.
  • chunk boundary module 106 generates a hash value on the predicted data chunk labeled "Predicted#".
  • chunk boundary module 106 checks or compares whether the hash value of the "Predicted" data chunk of subsequent data set 108-2 is equal to the hash value of data chunk B# of previous data set 108-1. If chunk boundary module 106 determines that the hash values are the same, then the "Predicted" data chunk of subsequent data set 108-2 is equal to data chunk B# of previous data set 108-1.
  • chunk boundary module 106 continues applying the process of data chunk prediction for the remaining data chunks of subsequent data set 108-2. For example, the chunk boundary module 106 continues with the process of data chunk prediction for data chunk C of subsequent data set 108-2.
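The prediction-and-confirmation step above can be sketched as follows; the function name and the use of SHA-256 are assumptions for illustration, not taken from the patent:

```python
import hashlib

def predict_next(data, offset, prev_chunk):
    """Speculate that the bytes at `offset` repeat `prev_chunk` from the
    earlier data set, using that chunk's known length, and confirm with a
    hash comparison instead of re-running an expensive boundary search."""
    candidate = data[offset:offset + len(prev_chunk)]
    if hashlib.sha256(candidate).digest() == hashlib.sha256(prev_chunk).digest():
        return candidate   # prediction confirmed: reuse the previous boundary
    return None            # fall back to full boundary detection
```

A confirmed prediction both fixes the chunk boundary and identifies a duplicate chunk in a single hash comparison.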
  • chunk boundary module 106 may process a different number of data sets or in a different order.
  • the techniques of the present application may increase the proportion of data deduplicated by a deduplication process of a storage system. This may be achieved while having the system remain data agnostic such that the process may reduce the need to be aware of the nature or characteristics of the data or mutation or variation of data between data sets.
  • the techniques of the present application may employ or practice multiple or a combination of methods to establish data chunk boundaries of data sets.
  • a data chunk is a region of data within a data set bounded by start and end points.
  • the chunk boundary module 106 processes the bounded data by applying a hash process, such as a secure hash algorithm (SHA), to the data chunks to obtain hash values associated with the data chunks.
  • chunk boundary module 106 applies a Rabin Fingerprint process to establish boundaries between data chunks of data sets.
  • the techniques of the present application may employ or practice multiple or a combination of methods to establish data chunk boundaries of data sets.
  • chunk boundary module 106 applies a Rabin fingerprint process (or any similar algorithm) on unknown data sets to establish data chunk boundaries of data sets.
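A content-defined chunker in the spirit of a Rabin fingerprint can be sketched as below. The simple accumulating hash, the mask, and the minimum chunk size are illustrative stand-ins for a true rolling Rabin polynomial over a sliding window; the patent does not prescribe these parameters:

```python
def chunk_boundaries(data, mask=0x3F, min_size=32):
    """Content-defined chunking sketch: declare a boundary where a hash of
    the bytes seen so far matches a bit mask, giving an expected chunk size
    of roughly mask+1 bytes past min_size. Because boundaries depend only
    on content, identical data always yields identical chunks."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h * 31) + b) & 0xFFFFFFFF       # accumulate a simple hash
        if i - start + 1 >= min_size and (h & mask) == mask:
            chunks.append(data[start:i + 1])  # boundary found: emit chunk
            start, h = i + 1, 0               # reset for the next chunk
    if start < len(data):
        chunks.append(data[start:])           # trailing remainder
    return chunks
```

Determinism is the property that matters for deduplication: feeding the same data set through the chunker twice produces the same chunks, so their hashes match on the second pass.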
  • chunk boundary module 106 checks whether a data set is likely to be a mutation or variation of a previous data set. If chunk boundary module 106 believes this to be the case, then it establishes data chunk boundaries using information or feedback from the previous data set. For example, assume that a previous data set includes a data chunk A having a length of 4321 bytes followed by a data chunk B having a length of 9876 bytes. In this case, if chunk boundary module 106 identifies a data chunk X (e.g., identified using a Rabin Fingerprint process) from a new subsequent data set and finds that it matches data chunk A, then chunk boundary module 106 determines whether a data chunk B follows after data chunk A. In this case, chunk boundary module 106 does not need to execute a Rabin Fingerprint process to establish the presence of a data chunk B in the new data set.
  • chunk boundary module 106 speculates or predicts that the following data contains data chunk B and proceeds to generate a hash value on data having length 9876 bytes and compares that to the hash value of data chunk B. In this manner, chunk boundary module 106 conserves processing power by not having to execute a Rabin Fingerprint process, which is processor intensive.
  • if chunk boundary module 106 determines that a data set is believed to be a mutation or variation of a previous data set, then chunk boundary module 106 identifies data chunks by the presence of previously known data chunks A and C whose end and start positions bound mutated data (which in the previous data set was data chunk B). Thus, chunk boundary module 106 may be able to establish the boundaries of a previously unknown data chunk since the data chunk is bordered by previously known data chunks.
  • in one example, instead of identifying a single new data chunk using the third method above, chunk boundary module 106 chooses to generate or create a pair of new data chunks V and W.
  • chunk boundary module 106 generates a data chunk pattern comprising data chunks A, V, W, C. In this manner, chunk boundary module 106 detects occurrence of a mutation or variation at a specific location and finds that a subsequent mutated data set might well match data chunks A, V, C, leaving an unknown new data chunk in the position previously occupied by data chunk W.
  • chunk boundary module 106 applies the third method above thereby creating or generating two smaller size data chunks X and Y in place of the new data chunk.
  • the chunk boundary module 106 therefore generates a data chunk pattern that includes data chunks A, V, X, Y, C. In this manner, the chunk boundary module 106 applies the third and fourth methods in an iterative manner, continuing to the byte level of the data chunks.
  • the chunk boundary module 106 applies the third and fourth methods above which results in generation of multiple small sized data chunks. It may not be desirable for data storage to contain too high a proportion of such sized data chunks, since the associated metadata may consume too much data storage space.
  • chunk boundary module 106 may apply a further process to merge or combine small size data chunks together. In this case, chunk boundary module 106 establishes boundaries of a data chunk by merging a plurality of adjacent subsequent chunks which have been found identical to a plurality of adjacent previous chunks. The chunk boundary module 106 applies this technique by relying on the observation that sufficient repetition of data chunk patterns has occurred. For example, chunk boundary module 106 may decide to merge or combine data chunks A, V, X into a new single data chunk based on sufficient repetition of this pattern. This suggests there may be repetitive mutation or variation of data within the bounds of data chunk Y.
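The merge-on-repetition policy can be sketched as follows; the chunk labels, the repetition threshold, and the helper names are illustrative assumptions, not details from the patent:

```python
def count_run(sequences, run):
    """Count how many observed data sets contain `run` as adjacent chunks."""
    n = len(run)
    return sum(
        any(tuple(seq[i:i + n]) == tuple(run) for i in range(len(seq) - n + 1))
        for seq in sequences
    )

def merge_if_repeated(seq, run, observed, threshold=3):
    """Replace the adjacent chunks in `run` with one merged chunk once the
    run has repeated across enough previously observed data sets; the
    threshold is an illustrative policy knob."""
    if count_run(observed, run) < threshold:
        return seq                      # not enough repetition: keep small chunks
    merged, out, i, n = "+".join(run), [], 0, len(run)
    while i < len(seq):
        if tuple(seq[i:i + n]) == tuple(run):
            out.append(merged)          # collapse the run into one chunk
            i += n
        else:
            out.append(seq[i])
            i += 1
    return out
```

Merging stable runs keeps the chunk count, and hence the per-chunk metadata, from growing without bound after repeated sectioning.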
  • the techniques of the present application may increase the proportion of data deduplicated by a deduplication process of a storage system. This may be achieved while having the system remain data agnostic such that the process may reduce the need to be aware of the nature or characteristics of the data or mutation or variation of data between data sets.
  • FIG. 4 is an example block diagram showing a non-transitory, computer-readable medium that stores instructions for a computer system for data chunk boundary processing in accordance with an example
  • the non-transitory, computer-readable medium is generally referred to by the reference number 400 and may be included in devices of system 100 as described herein.
  • the non-transitory, computer-readable medium 400 may correspond to any typical storage device that stores computer-implemented instructions, such as programming code or the like.
  • the non-transitory, computer-readable medium 400 may include one or more of a non-volatile memory, a volatile memory, and/or one or more storage devices.
  • Examples of non-volatile memory include, but are not limited to, EEPROM and ROM.
  • Examples of volatile memory include, but are not limited to, SRAM, and DRAM.
  • Examples of storage devices include, but are not limited to, hard disk drives, compact disc drives, digital versatile disc drives, optical drives, and flash memory devices.
  • a processor 402 generally retrieves and executes the instructions stored in the non-transitory, computer-readable medium 400 to operate the devices of system 100 in accordance with an example.
  • the tangible, machine-readable medium 400 may be accessed by the processor 402 over a bus 404.
  • a first region 408 of the non-transitory, computer- readable medium 400 may include chunk boundary module functionality as described herein.
  • the software components may be stored in any order or configuration.
  • where the non-transitory, computer-readable medium 400 is a hard drive, the software components may be stored in non-contiguous, or even overlapping, sectors.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

In some examples, techniques include receiving a plurality of subsequent data sets received subsequent to previous data sets, applying a chunk boundary process to divide data of a data set into data chunks, and comparing data chunks of a subsequent data set to data chunks of the previous data set. If a region of unmatched data exists within the subsequent data set and that region is bounded by a pair of data chunks which are determined to be the same as a pair of chunks bounding a similar sized region within a previous data set, then a plurality of smaller data chunks is created by sectioning that region, wherein data of a subsequent data set has a higher probability of partially matching the set of smaller data chunks than of entirely matching the smaller data chunks.

Description

DATA CHUNK BOUNDARY BACKGROUND
[0001] Storage systems may include data deduplication techniques to help manage storage capacity requirements by reducing occurrence of redundant data. For example, the storage system may store a single unique instance of the data to storage media. The storage system replaces a redundant copy of the data with a pointer to the unique instance instead of storing another copy of the data.
BRIEF DESCRIPTION OF THE DRAWINGS
[0002] Fig. 1 is a block diagram of a computer system for data chunk boundary processing according to an example implementation.
[0003] Fig. 2 is a flow diagram of a computer system for data chunk boundary processing of Fig. 1 according to an example implementation.
[0004] Figs. 3A and 3B are diagrams of operation of a computer system for data chunk boundary processing according to an example implementation.
[0005] Fig. 4 is an example block diagram showing a non-transitory, computer-readable medium that stores instructions for a computer system for data chunk boundary processing in accordance with an example implementation.
DETAILED DESCRIPTION
[0006] Techniques such as data deduplication may enable data to be stored in a backup system more compactly and thus more cheaply. By performing deduplication, a backup system may generally store once each unique region or "chunk" of a set or collection of data (or "data set" herein). Such a chunk may be referred to as a "data chunk" herein. In one example, a data chunk is a region of data within a data set bounded by start and end points. In some examples, a backup system may perform deduplication on the basis of content-based fingerprints, such as cryptographic hash functions, of the content of the data chunks of the data collection to be backed up. In such examples, the backup system may compare respective hashes of data chunks of a data collection provided for backup to hashes of previously stored data chunks to determine which data chunks of the provided data collection have not been previously stored in persistent storage of the backup system and thus are to be stored in storage.
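For illustration only, this hash-based comparison might be sketched as follows; the SHA-256 digest, the dictionary-backed store, and the function names are assumptions for the sketch, not details taken from the application.

```python
import hashlib

def deduplicate(chunks, store):
    """Persist each unique data chunk once, keyed by its content hash.

    `store` maps digest -> chunk bytes; a data set is reduced to a
    list of digests referencing the stored unique chunks.
    """
    refs = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:
            store[digest] = chunk  # first sighting: store the unique instance
        refs.append(digest)        # later sightings become pointers only
    return refs
```

Storing the same data set a second time then adds no new entries to the store, which is the consistency property the next paragraph calls out.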
[0007] The system may employ chunking techniques, such as dividing a data collection into data chunks, which may have an impact on the performance of deduplication by the backup system. For example, the consistency of the chunking technique utilized may have an impact on deduplication performance. In one example, a chunking technique that is able to produce the same data chunks when provided the same data collection may result in good deduplication performance. In this case, the hashes of the data chunks of an incoming data collection are likely to match hashes of previously stored data chunks when the same data collection has been stored previously.
[0008] Computer systems with deduplication techniques may receive a plurality of data sets from other systems or devices such as hosts. The data sets may include subsequent data sets received after or subsequent to previous data sets. However, these computer systems may encounter mutation conditions or variations in the data sets over time. Such mutation or variation in data sets may involve insertion of metadata by hosts or storage applications of hosts. In other examples, such mutation conditions may involve alternate receipt or delivery order of subsets of the data of the data sets. The occurrence of the mutation conditions may have a negative impact on the performance of the deduplication process. For example, a mutated version of a subsequent data set compared to a previous data set may be split or divided into component data chunks which differ from those of the previous or original data set due to different chunk boundaries. In this case, the process may check the data chunks from the mutated data of the subsequent data set, determine that they are not duplicate data chunks, and therefore not perform deduplication on the data.
[0009] In some examples, the techniques of the present application facilitate the deduplication process of storage systems to help manage physical storage requirements by helping reduce or eliminate duplication of data within storage. The techniques detect occurrence of duplication by splitting or dividing data sets into multiple component data chunks and identifying those chunks common to multiple data sets. The techniques may store one instance of a data chunk common to multiple data sets. In one example, the techniques perform data chunk boundary processing on data sets. The data chunk boundary process provides a feedback mechanism to establish data chunk boundaries for subsequent data sets based on boundaries from previous data sets. The resultant data chunks identified using this boundary selection technique may be more likely to be common to multiple data sets, thus helping to increase the proportion of deduplicated data.
[0010] In one example, disclosed is a system that includes a chunk boundary module configured to receive a plurality of data sets from a device such as a host or other device. The chunk boundary module is configured to receive a subsequent data set which is received subsequent to a previous data set. The chunk boundary module is configured to apply a chunk boundary process to divide data of a data set into data chunks. The chunk boundary module is configured to compare data chunks of a subsequent data set to data chunks of the previous data set. The chunk boundary module is configured to check if a region of unmatched data exists within the subsequent data set and that region is bounded by a pair of data chunks which are determined to be the same as a pair of chunks bounding a similar sized region within the previous data set. If the chunk boundary module determines this condition occurred, then the module creates a plurality of smaller data chunks by sectioning that region. In one example, the data of the subsequent data set has a higher probability of partially matching the set of smaller data chunks than of entirely matching the smaller data chunks.
[0011] In other examples, the chunk boundary module is configured to compare data chunks by applying a hash process on the data chunks of the subsequent data set and previous data set, and comparing the hashes of the data chunks of the subsequent data set to the hashes of the data chunks of the previous data set. In another example, given the presence of an identical data chunk in previous and subsequent data sets, the chunk boundary module is configured to predict the presence of an adjacent data chunk in the subsequent data set to be identical to the adjacent data chunk in the previous data set and confirm the prediction by means of hash comparison.
[0012] In one example, the chunk boundary module is configured to apply a deduplication process on data chunks that includes having the module compare hashes of the data chunks in data storage and store the data chunks to storage if the hashes of the data chunks are not already stored in storage. In an example, the chunk boundary module is configured to apply a chunk boundary process that includes having the module establish the boundaries of a data chunk by merging a plurality of adjacent subsequent chunks which have been found identical to a plurality of adjacent previous chunks.
[0013] In this manner, these techniques may help improve storage performance. In one example, the techniques of the present application may increase the proportion of data deduplicated by a deduplication process of a storage system. This may be achieved while having the system remain data agnostic such that the process may reduce the need to be aware of the nature or characteristics of the data or mutation or variation of data between data sets.
[0014] Fig. 1 is a block diagram of a system 100 with a computer system 102 for data chunk boundary processing according to an example implementation.
[0015] In one example, computer system 102 is coupled to data storage 104 as part of storage mechanisms to store and retrieve data. In one example, computer system 102 may receive data from an external device such as a host. In one example, the received data may be grouped or arranged as files as part of a file system. A file system may include data blocks which are groups of data comprised of bytes of data organized as files as part of directory structures. Another device or system, such as a host (not shown), may send to computer system 102 write commands to write data blocks from the host to data storage 104. Further, the host may send to computer system 102 read commands to read data blocks back from storage and return the data blocks to the host.
[0016] The computer system 102 may be an apparatus such as an electronic device capable of processing data. The computer system 102 includes a chunk boundary module 106 to manage the operation of the computer system including communication with data storage 104 and other devices such as host devices or computers. The chunk boundary module 106 may interact with a host to process write commands to write data blocks from the host to the data storage. The chunk boundary module 106 may interact with a host to process read commands to read data blocks back from storage and return the data blocks to the host.
[0017] In one example, chunk boundary module 106 may be configured to receive a plurality of data sets 108 (108-1 through 108-n) from a device such as a host or other device. In one example, chunk boundary module 106 may receive a subsequent data set, such as second data set 108-2, which is received subsequent to a previous data set such as first data set 108-1. The chunk boundary module 106 is configured to apply a chunk boundary process to divide data of data sets 108 into data chunks 110. The chunk boundary module 106 is configured to determine data chunk boundaries 112 between data chunks of data sets. The chunk boundaries 112 may include starting and end points of data chunks 110. The chunk boundary module 106 is configured to compare data chunks 110 of a subsequent data set, such as second data set 108-2, to data chunks of the previous data set such as first data set 108-1.
[0018] In one example, chunk boundary module 106 is configured to check if a region of unmatched data exists within subsequent data set 108-2 and that region is bounded by a pair of data chunks 110 which are determined to be the same as a pair of chunks bounding a similar sized region within previous data set 108-1. If chunk boundary module 106 determines that this condition occurred, then the module creates a plurality of smaller data chunks 110 by sectioning that region. In one example, the data of subsequent data set 108-2 may have a higher probability of partially matching the set of smaller data chunks 110 than of entirely matching the smaller data chunks.
[0019] In other examples, chunk boundary module 106 is configured to compare data chunks by having the module apply a hash process on the data chunks 110 of subsequent data set 108-2 and previous data set 108-1. The chunk boundary module 106 compares the hashes or hash values of data chunks 110 of subsequent data set 108-2 to the hashes of the data chunks of previous data set 108-1. In another example, given the presence of an identical data chunk 110 in previous data set 108-1 and subsequent data set 108-2, chunk boundary module 106 is configured to predict the presence of an adjacent data chunk in the subsequent data set to be identical to the adjacent data chunk in the previous data set and confirm the prediction by means of a hash comparison. In one example, chunk boundary module 106 is configured to apply a deduplication process on data chunks 110 that includes having the module compare hashes or hash values of data chunks 110 in data storage 104 and store the data chunks to data storage 104 if the hashes of the data chunks are not already stored in data storage.
[0020] In another example, chunk boundary module 106 is configured to apply a chunk boundary process that includes having the module establish data chunk boundaries 112 of data chunks 110 by merging a plurality of adjacent subsequent chunks which have been found identical to a plurality of adjacent previous chunks. In this manner, data chunk boundary module 106 provides a feedback mechanism (shown by arrow 114) to establish data chunk boundaries 112 for subsequent data sets 108-2 based on data chunk boundaries from previous data sets 108-1. The resultant data chunks identified using this boundary selection technique may be more likely to be common to multiple data sets, thus helping to increase the proportion of deduplicated data.
[0021 ] The computer system 102 may be any electronic device capable of data processing such as a server computer, mobile device and the like. The functionality of the components of computer system 102 may be implemented in hardware, software or a combination thereof. The computer system 102 may communicate with data storage 104 and other devices such as hosts using any electronic communication means including wired, wireless, network based such as storage area network (SAN), Ethernet, Fibre Channel and the like.
[0022] The data storage 104 may include a plurality of storage devices (not shown) configured to present logical storage devices to other electronic devices such as hosts. In one example, electronic devices, such as hosts, may be coupled to computer system 102 and may access the logical configuration of the storage array as LUNs. The storage devices may include any means to store data for later retrieval. The storage devices may include non-volatile memory, volatile memory or a combination thereof. Examples of non-volatile memory include, but are not limited to, electrically erasable programmable read only memory (EEPROM) and read only memory (ROM). Examples of volatile memory include, but are not limited to, static random access memory (SRAM), and dynamic random access memory (DRAM). Examples of storage devices may include, but are not limited to, HDDs, CDs, DVDs, SSDs, optical drives, flash memory devices and other like devices.
[0023] In this manner, these techniques may help improve storage performance. In one example, the techniques of the present application may increase the proportion of data deduplicated by a deduplication process of a storage system. This may be achieved while having the system remain data agnostic such that the process may reduce the need to be aware of the nature or characteristics of the data or mutation or variation of data between data sets.
[0024] It should be understood that the description of computer system 102 above is for illustrative purposes and other implementations of the system may be employed to practice the techniques of the present application. For example, computer system 102 is shown as a single component but the computer system may include a plurality of computer systems coupled to data storage 104 to practice the techniques of the present application.
[0025] Fig. 2 is a flow diagram 200 of a computer system for data chunk boundary processing of Fig. 1 according to an example implementation.
[0026] In one example, to illustrate operation, it may be assumed that chunk boundary module 106 is configured to receive data such as data sets 108 from another device such as a host coupled to computer system 102.
[0027] Processing may begin at block 202, wherein chunk boundary module 106 receives a plurality of subsequent data sets received subsequent to previous data sets. For example, chunk boundary module 106 may receive a second or subsequent data set 108-2 that is received subsequent to a first or previous data set 108-1. Processing then proceeds to block 204.
[0028] At block 204, chunk boundary module 106 applies a chunk boundary process to divide data of a data set into data chunks. For example, the chunk boundary process may include processes such as identification of unique chunks of data, or byte patterns of data of the data sets. Processing then proceeds to block 206.
[0029] At block 206, chunk boundary module 106 compares data chunks of a subsequent data set to data chunks of the previous data set. For example, chunk boundary module 106 may compare data chunks 110 of subsequent or second data set 108-2 to data chunks of previous or first data set 108-1. Processing then proceeds to block 208.
[0030] At block 208, chunk boundary module 106 checks if a region of unmatched data exists within the subsequent data set and that region is bounded by a pair of data chunks which are determined to be the same as a pair of chunks bounding a similar sized region within a previous data set. If so, then chunk boundary module 106 creates a plurality of smaller data chunks by sectioning that region. In one example, data of a subsequent data set has a higher probability of partially matching the set of smaller data chunks than of entirely matching the smaller data chunks. Processing then may proceed to the End block or back to block 202 to continue to process other data sets.
[0031] In other examples, processing may further include having chunk boundary module 106 compare data chunks by applying a hash process on the data chunks 110 of subsequent data set 108-2 and previous data set 108-1. The chunk boundary module 106 compares the hashes or hash values of data chunks 110 of subsequent data set 108-2 to the hashes of the data chunks of previous data set 108-1. In another example, given the presence of an identical data chunk 110 in previous data set 108-1 and subsequent data set 108-2, chunk boundary module 106 is configured to predict the presence of an adjacent data chunk in the subsequent data set to be identical to the adjacent data chunk in the previous data set and confirm the prediction by means of a hash comparison. In one example, chunk boundary module 106 is configured to apply a deduplication process on data chunks 110 that includes having the module compare hashes or hash values of data chunks 110 in data storage 104 and store the data chunks to data storage 104 if the hashes of the data chunks are not already stored in data storage. The chunk boundary module 106 is configured to apply a chunk boundary process that includes having the module establish data chunk boundaries 112 of data chunks 110 by merging a plurality of adjacent subsequent chunks which have been found identical to a plurality of adjacent previous chunks.
[0032] In this manner, these techniques may help improve storage performance. In one example, the techniques of the present application may increase the proportion of data deduplicated by a deduplication process of a storage system. This may be achieved while having the system remain data agnostic such that the process may reduce the need to be aware of the nature or characteristics of the data or mutation or variation of data between data sets.
[0033] It should be understood that the above process 200 is for illustrative purposes and that other implementations may be employed to practice the techniques of the present application. For example, chunk boundary module 106 may process a different number of data sets or in a different order.
[0034] Fig. 3A is a diagram 300 of operation of a computer system for data chunk boundary processing according to an example implementation. The diagram 300 describes chunk boundary module 106 performing a process to check if a region of unmatched data exists within a subsequent data set and that region is bounded by a pair of data chunks which are determined to be the same as a pair of chunks bounding a similar sized region within a previous data set. If chunk boundary module 106 determines that this condition occurred, then the module creates a plurality of smaller data chunks by sectioning that region. In one example, the data of the subsequent data set may have a higher probability of partially matching the set of smaller data chunks than of entirely matching the smaller data chunks.
[0035] In one example, to illustrate operation, it may be assumed that chunk boundary module 106 is configured to receive data such as data sets 108 from another device such as a host coupled to computer system 102.
[0036] Processing may begin wherein chunk boundary module 106 receives a second subsequent data set 108-2 and a third subsequent data set 108-3 that are received subsequent to first or previous data set 108-1. In one example, chunk boundary module 106 applies a chunk process to divide or split first previous data set 108-1 into data chunks A, B, C, D, E, F.
[0037] Continuing with the process, chunk boundary module 106 then applies a chunk process to divide or split second subsequent data set 108-2 into data chunks A, B, D, F. However, in this case, chunk boundary module 106 determines that data chunks C and E from previous data set 108-1 do not have a match in subsequent data set 108-2, and these are marked as data chunk Unmatched-1 and data chunk Unmatched-2. The chunk boundary module 106 replaces data chunk Unmatched-1 with data chunks W, X as shown by arrow 302. In a similar manner, chunk boundary module 106 replaces data chunk Unmatched-2 with data chunks Y, Z as shown by arrow 304.
[0038] Continuing with the process, chunk boundary module 106 then applies a chunk process to divide or split third subsequent data set 108-3 into data chunks A, B, W, D, Z, F. The chunk boundary module 106 determines that data chunks X and Y from subsequent data set 108-2 do not have a match in subsequent data set 108-3, and these are marked as data chunk Unmatched-3 and data chunk Unmatched-4. In this case, subsequent data set 108-3 now matches data chunk W and data chunk Z but does not match data chunk X and data chunk Y. In this case, this part of the data set may be characterized as being chronically volatile because large mutations or variations between data sets frequently occur here.
[0039] In this manner, these techniques may help improve storage performance. In one example, the techniques of the present application may increase the proportion of data deduplicated by a deduplication process of a storage system. This may be achieved while having the system remain data agnostic such that the process may reduce the need to be aware of the nature or characteristics of the data or mutation or variation of data between data sets.
[0040] It should be understood that the above process is for illustrative purposes and that other implementations may be employed to practice the techniques of the present application. For example, chunk boundary module 106 may process a different number of data sets or in a different order.
[0041] Fig. 3B is a diagram 350 of operation of a computer system for data chunk boundary processing according to an example implementation. The diagram 350 illustrates how chunk boundary module 106 establishes data chunks in subsequent data set 108-2 by predicting their size based on knowledge of data chunks in previous data set 108-1. In one example, chunk boundary module 106, given the presence of an identical data chunk in previous data set 108-1 and subsequent data set 108-2, predicts the presence of an adjacent data chunk in the subsequent data set to be identical to the adjacent data chunk in the previous data set and confirms the prediction by means of hash comparison.
[0042] In one example, to illustrate operation, it may be assumed that chunk boundary module 106 is configured to receive data such as data sets 108 from another device such as a host coupled to computer system 102.
[0043] Processing may begin at block 352 wherein chunk boundary module 106 receives a first or previous data set 108-1. In one example, chunk boundary module 106 applies a chunk process to divide or split previous data set 108-1 into data chunks labeled A, B, C, D, E, F. The chunk boundary module 106 generates or calculates hash values for data chunks which are shown as "#" symbols which represent hash values associated with the data chunks A#, B#, C#, D#, E#, F#.
[0044] Continuing with the process at block 354, chunk boundary module 106 receives a second or subsequent data set 108-2 subsequent to previous data set 108-1. In one example, chunk boundary module 106 applies a chunk process to divide or split subsequent data set 108-2 into data chunks.
[0045] Continuing with the process at block 356, chunk boundary module 106 determines or establishes data chunks in subsequent data set 108-2 by predicting their size based on knowledge of data chunks in previous data set 108-1. In one example, chunk boundary module 106 attempts to match data chunk A# of subsequent data set 108-2 by generating chunk length using a fingerprint technique such as a Rabin fingerprint. The chunk boundary module 106 determines a hash value for the data chunk of subsequent data set 108-2 using a hash process and then compares the hash of data chunk A# of subsequent data set 108-2 with data chunk A# of previous data set 108-1.
[0046] Continuing with the process at block 358, chunk boundary module 106 attempts to predict the presence of data chunks following the previous data chunks of subsequent data set 108-2 using the length of the previous data chunk. For example, to illustrate operation, chunk boundary module 106 attempts to predict the presence of a data chunk labeled "Predicted" following the previous data chunk A# of subsequent data set 108-2 using the length of previous data chunk B#.
[0047] Continuing with the process at block 360, chunk boundary module 106 generates a hash value on the predicted data chunk labeled "Predicted#".
[0048] Continuing with the process at block 362, chunk boundary module 106 checks or compares whether the hash value of the "Predicted" data chunk of subsequent data set 108-2 is equal to the hash value of data chunk B# of previous data set 108-1. If chunk boundary module 106 determines that the hash values are the same, then the "Predicted" data chunk of subsequent data set 108-2 is equal to data chunk B# of previous data set 108-1.
[0049] Continuing with the process at block 364, chunk boundary module 106 continues applying the process of data chunk prediction for the remaining data chunks of subsequent data set 108-2. For example, the chunk boundary module 106 continues with the process of data chunk prediction of data chunk C of subsequent data set 108-2.
[0050] It should be understood that the above process is for illustrative purposes and that other implementations may be employed to practice the techniques of the present application. For example, chunk boundary module 106 may process a different number of data sets or in a different order.
[0051] In this manner, these techniques may help improve storage performance. In one example, the techniques of the present application may increase the proportion of data deduplicated by a deduplication process of a storage system. This may be achieved while having the system remain data agnostic such that the process may reduce the need to be aware of the nature or characteristics of the data or mutation or variation of data between data sets.
[0052] In other examples, the techniques of the present application may employ multiple methods, or a combination of methods, to establish data chunk boundaries of data sets. As explained above, a data chunk is a region of data within a data set bounded by start and end points. The chunk boundary module 106 processes the bounded data by applying a hash process to the data chunks, such as a secure hash algorithm (SHA), to obtain hash values associated with the data chunks. These techniques help establish chunk boundaries between data chunks of data sets. In one example, chunk boundary module 106 applies a Rabin fingerprint process to establish boundaries between data chunks of data sets. The methods are described in turn below.
[0053] In a first method, chunk boundary module 106 applies a Rabin fingerprint process (or any similar algorithm) on unknown data sets to establish data chunk boundaries of data sets.
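As a rough illustration of content-defined boundary selection, the following sketch uses a toy rolling hash in place of a true Rabin fingerprint; the mask, minimum size, and function name are arbitrary assumptions.

```python
def chunk_boundaries(data, mask=0x3F, min_size=16):
    """Declare a boundary whenever the low bits of a simple rolling hash
    are zero, so boundaries depend on the content itself rather than on
    absolute offsets within the data set."""
    boundaries, h, last = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h * 31) + byte) & 0xFFFFFFFF  # toy rolling hash, not Rabin
        if i + 1 - last >= min_size and (h & mask) == 0:
            boundaries.append(i + 1)        # chunk ends after byte i
            h, last = 0, i + 1
    return boundaries
```

Because the decision is content-driven, the same data always yields the same boundaries, which is what lets chunk hashes match across data sets.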
[0054] In a second method, chunk boundary module 106 checks whether a data set is likely to be a mutation or variation of a previous data set. If chunk boundary module 106 believes this to be the case, then it establishes data chunk boundaries using information or feedback from the previous data set. For example, assume that a previous data set includes a data chunk A having a length of 4321 bytes followed by a data chunk B having a length of 9876 bytes. In this case, if chunk boundary module 106 identifies a data chunk X (e.g., identified using a Rabin fingerprint process) from a new subsequent data set and finds that it matches (e.g., by hash comparison) and is identical to data chunk A, then it is feasible that a data chunk B follows after data chunk A. In this case, chunk boundary module 106 does not need to execute a Rabin fingerprint process to establish the presence of a data chunk B in the new data set. Here, chunk boundary module 106 speculates or predicts that the following data contains data chunk B and proceeds to generate a hash value on data having length 9876 bytes and compares that to the hash value of data chunk B. In this manner, chunk boundary module 106 conserves processing power by not having to execute a Rabin fingerprint process, which is processor intensive.
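The speculative matching in this second method might be sketched as follows; the SHA-256 hashes, the give-up-on-first-mismatch behavior, and the function names are assumptions for illustration, not the prescribed implementation.

```python
import hashlib

def predict_chunks(data, prev_chunks, fallback_chunker):
    """Chunk `data` by predicting that it repeats the previous data set's
    chunk sequence: take the next predicted length, hash the candidate
    bytes, and accept the chunk only if the hashes agree. On the first
    mismatch, fall back to ordinary fingerprint-based chunking."""
    prev_hashes = [hashlib.sha256(c).hexdigest() for c in prev_chunks]
    chunks, pos, i = [], 0, 0
    while pos < len(data) and i < len(prev_chunks):
        length = len(prev_chunks[i])            # speculated chunk length
        candidate = data[pos:pos + length]
        if hashlib.sha256(candidate).hexdigest() != prev_hashes[i]:
            break                               # speculation failed
        chunks.append(candidate)                # confirmed by hash comparison
        pos += length
        i += 1
    if pos < len(data):
        chunks.extend(fallback_chunker(data[pos:]))
    return chunks
```

Each confirmed prediction costs one hash over a known length rather than a rolling-fingerprint scan, which is where the processing saving comes from; a fuller implementation could resume prediction once a later chunk matches again.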
[0055] In a third method, if chunk boundary module 106 determines that a data set is believed to be a mutation or variation of a previous data set, then chunk boundary module 106 identifies data chunks by the presence of previously known data chunks A and C whose end and start positions bound mutated data (which in the previous data set was data chunk B). Thus, chunk boundary module 106 may be able to establish the boundaries of a previously unknown data chunk since the data chunk is bordered by previously known data chunks.
[0056] In a fourth method, instead of identifying a single new data chunk as in the third method above, chunk boundary module 106 chooses to generate or create a pair of new data chunks V and W. Thus, chunk boundary module 106 generates a data chunk pattern comprising data chunks A, V, W, C. In this manner, chunk boundary module 106 detects the occurrence of a mutation or variation at a specific location: a subsequent mutated data set might well match data chunks A, V, C, leaving an unknown new data chunk in the position previously occupied by data chunk W. In that case, chunk boundary module 106 applies the third method above, creating or generating two smaller data chunks X and Y in place of the new data chunk. Chunk boundary module 106 therefore generates a data chunk pattern that includes data chunks A, V, X, Y, C. In this manner, chunk boundary module 106 applies the third and fourth methods iteratively, potentially continuing down to the byte level of the data chunks.
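The iterative splitting of the fourth method can be sketched as follows. The midpoint split policy, the `matches` predicate, and the recursion limit are hypothetical choices not fixed by the disclosure; they only illustrate how an unmatched region is repeatedly halved, keeping halves that match previously seen chunks and re-splitting the rest.

```python
def split_region(start, end):
    """Split an unmatched region into a pair of chunks (V, W).
    The midpoint split is an assumed policy for illustration."""
    mid = (start + end) // 2
    return [(start, mid), (mid, end)]

def refine(start, end, matches, depth=0, max_depth=8):
    """Apply the third and fourth methods iteratively: keep any piece the
    caller-supplied `matches(start, end)` predicate recognizes as a
    previously seen chunk, and re-split pieces it does not, down toward
    byte-level granularity."""
    if end - start <= 1 or depth >= max_depth:
        return [(start, end)]  # cannot split further
    out = []
    for s, e in split_region(start, end):
        if matches(s, e):
            out.append((s, e))           # known chunk, keep whole
        else:
            out.extend(refine(s, e, matches, depth + 1, max_depth))
    return out
```

Each round of mutation thereby narrows the unknown region, so later data sets match an ever larger proportion of already known chunks.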
[0057] In one example, chunk boundary module 106 applies the third and fourth methods above, which results in generation of multiple small data chunks. It may not be desirable for data storage to contain too high a proportion of such small data chunks, since the associated metadata may consume too much storage space. In one example, chunk boundary module 106 may apply a further process to merge or combine small data chunks. In this case, chunk boundary module 106 establishes the boundaries of a data chunk by merging a plurality of adjacent subsequent chunks which have been found identical to a plurality of adjacent previous chunks. Chunk boundary module 106 applies this technique by relying on the observation that sufficient repetition of a data chunk pattern has occurred. For example, chunk boundary module 106 may decide to merge or combine data chunks A, V, X into a new single data chunk based on sufficient repetition of this pattern, which suggests that repetitive mutation or variation of data occurs within the bounds of data chunk Y.
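One way the merging step could be realized is sketched below. The pairwise counting policy, the repetition `threshold`, and the string chunk identifiers are assumptions for illustration; the disclosure requires only that "sufficient repetition" of an adjacent-chunk pattern be observed before merging.

```python
from collections import Counter

class ChunkMerger:
    """Track how often each adjacent chunk pair repeats across data sets
    and merge pairs seen at least `threshold` times, reducing the metadata
    overhead of many small chunks."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.pair_counts = Counter()

    def observe(self, chunk_ids):
        # record every adjacent pair seen in this data set's chunk pattern
        for pair in zip(chunk_ids, chunk_ids[1:]):
            self.pair_counts[pair] += 1

    def merge(self, chunk_ids):
        # rewrite the pattern, fusing pairs that have repeated enough
        out, i = [], 0
        while i < len(chunk_ids):
            if (i + 1 < len(chunk_ids)
                    and self.pair_counts[(chunk_ids[i], chunk_ids[i + 1])]
                    >= self.threshold):
                out.append(chunk_ids[i] + "+" + chunk_ids[i + 1])
                i += 2
            else:
                out.append(chunk_ids[i])
                i += 1
        return out
```

Repeated merging passes can fuse longer runs (e.g., A, V, X into a single chunk, as in the paragraph's example) while leaving the still-mutating chunk Y as its own unit.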
[0058] In this manner, these techniques may help improve storage performance. In one example, the techniques of the present application may increase the proportion of data deduplicated by a deduplication process of a storage system. This may be achieved while the system remains data agnostic, such that the process need not be aware of the nature or characteristics of the data, or of the mutation or variation of data between data sets.
[0059] Fig. 4 is an example block diagram showing a non-transitory, computer-readable medium that stores instructions for a computer system for data chunk boundary processing in accordance with an example implementation. The non-transitory, computer-readable medium is generally referred to by the reference number 400 and may be included in devices of system 100 as described herein. The non-transitory, computer-readable medium 400 may correspond to any typical storage device that stores computer-implemented instructions, such as programming code or the like. For example, the non-transitory, computer-readable medium 400 may include one or more of a non-volatile memory, a volatile memory, and/or one or more storage devices. Examples of non-volatile memory include, but are not limited to, EEPROM and ROM. Examples of volatile memory include, but are not limited to, SRAM and DRAM. Examples of storage devices include, but are not limited to, hard disk drives, compact disc drives, digital versatile disc drives, optical drives, and flash memory devices.
[0060] A processor 402 generally retrieves and executes the instructions stored in the non-transitory, computer-readable medium 400 to operate the devices of system 100 in accordance with an example. In an example, the non-transitory, computer-readable medium 400 may be accessed by the processor 402 over a bus 404. A first region 408 of the non-transitory, computer-readable medium 400 may include chunk boundary module functionality as described herein.
[0061] Although shown as contiguous blocks, the software components may be stored in any order or configuration. For example, if the non-transitory, computer-readable medium 400 is a hard drive, the software components may be stored in non-contiguous, or even overlapping, sectors.

Claims

What is claimed is:
1. A method comprising:
receiving a plurality of subsequent data sets received subsequent to previous data sets;
applying a chunk boundary process to divide data of a data set into data chunks;
comparing data chunks of a subsequent data set to data chunks of the previous data set; and
if a region of unmatched data exists within the subsequent data set and that region is bounded by a pair of data chunks which are determined to be the same as a pair of chunks bounding a similar sized region within a previous data set, then creating a plurality of smaller data chunks by sectioning that region, and wherein data of a subsequent data set has a higher probability of partially matching the set of smaller data chunks than of entirely matching the smaller data chunks.
2. The method of claim 1, wherein the comparing of data chunks further includes applying a hash process on the data chunks of the subsequent data set and previous data set, and comparing the hashes of the data chunks of the subsequent data set to the hashes of the data chunks of the previous data set.
3. The method of claim 1, wherein given the presence of an identical data chunk in previous and subsequent data sets, predicting the presence of an adjacent data chunk in the subsequent data set to be identical to the adjacent data chunk in the previous data set and confirming the prediction by means of hash comparison.
4. The method of claim 1, further comprising applying a deduplication process on the data chunks that includes comparing hashes of the data chunks in storage and storing the data chunks to storage if the hashes of the data chunks are not already stored in storage.
5. The method of claim 1, wherein the chunk boundary process further includes establishing the boundaries of a data chunk by merging a plurality of adjacent subsequent chunks which have been found identical to a plurality of adjacent previous chunks.
6. A computer system comprising:
a chunk boundary module to:
receive a plurality of subsequent data sets received subsequent to previous data sets,
apply a chunk boundary process to divide data of a data set into data chunks,
compare data chunks of a subsequent data set to data chunks of the previous data set, and
if a region of unmatched data exists within the subsequent data set and that region is bounded by a pair of data chunks which are determined to be the same as a pair of chunks bounding a similar sized region within a previous data set, then creating a plurality of smaller data chunks by sectioning that region, and wherein data of a subsequent data set has a higher probability of partially matching the set of smaller data chunks than of entirely matching the smaller data chunks.
7. The computer system of claim 6, wherein the compare of data chunks further includes to apply a hash process on the data chunks of the subsequent data set and previous data set, and compare the hashes of the data chunks of the subsequent data set to the hashes of the data chunks of the previous data set.
8. The computer system of claim 6, wherein given the presence of an identical data chunk in previous and subsequent data sets, predict the presence of an adjacent data chunk in the subsequent data set to be identical to the adjacent data chunk in the previous data set and confirm the prediction by means of hash comparison.
9. The computer system of claim 6, further comprising to apply a deduplication process on the data chunks that includes to compare hashes of the data chunks in storage and store the data chunks to storage if the hashes of the data chunks are not already stored in storage.
10. The computer system of claim 6, wherein the chunk boundary process further includes to establish the boundaries of a data chunk which includes to merge a plurality of adjacent subsequent chunks which have been found identical to a plurality of adjacent previous chunks.
11. An article comprising a non-transitory computer readable storage medium to store instructions that when executed cause a computer to:
receive a plurality of subsequent data sets received subsequent to previous data sets;
apply a chunk boundary process to divide data of a data set into data chunks;
compare data chunks of a subsequent data set to data chunks of the previous data set; and
if a region of unmatched data exists within the subsequent data set and that region is bounded by a pair of data chunks which are determined to be the same as a pair of chunks bounding a similar sized region within a previous data set, then creating a plurality of smaller data chunks by sectioning that region, and wherein data of a subsequent data set has a higher probability of partially matching the set of smaller data chunks than of entirely matching the smaller data chunks.
12. The article of claim 11, wherein the compare of data chunks further includes to apply a hash process on the data chunks of the subsequent data set and previous data set, and compare the hashes of the data chunks of the subsequent data set to the hashes of the data chunks of the previous data set.
13. The article of claim 11, further comprising instructions that if executed cause a computer to, given the presence of an identical data chunk in previous and subsequent data sets, predict the presence of an adjacent data chunk in the subsequent data set to be identical to the adjacent data chunk in the previous data set and confirm the prediction by means of hash comparison.
14. The article of claim 11, further comprising instructions that if executed cause a computer to further apply a deduplication process on the data chunks that includes to compare hashes of the data chunks in storage and store the data chunks to storage if the hashes of the data chunks are not already stored in storage.
15. The article of claim 11, wherein the chunk boundary process further includes to establish the boundaries of a data chunk by merging a plurality of adjacent subsequent chunks which have been found identical to a plurality of adjacent previous chunks.
PCT/US2014/064255 2014-11-06 2014-11-06 Data chunk boundary WO2016072988A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/US2014/064255 WO2016072988A1 (en) 2014-11-06 2014-11-06 Data chunk boundary


Publications (1)

Publication Number Publication Date
WO2016072988A1 true WO2016072988A1 (en) 2016-05-12

Family

ID=55909548

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2014/064255 WO2016072988A1 (en) 2014-11-06 2014-11-06 Data chunk boundary

Country Status (1)

Country Link
WO (1) WO2016072988A1 (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8447741B2 (en) * 2010-01-25 2013-05-21 Sepaton, Inc. System and method for providing data driven de-duplication services
US20130138620A1 (en) * 2011-11-28 2013-05-30 International Business Machines Corporation Optimization of fingerprint-based deduplication
US20130212074A1 (en) * 2010-08-31 2013-08-15 Nec Corporation Storage system
WO2013165389A1 (en) * 2012-05-01 2013-11-07 Hewlett-Packard Development Company, L.P. Determining segment boundaries for deduplication
US8712963B1 (en) * 2011-12-22 2014-04-29 Emc Corporation Method and apparatus for content-aware resizing of data chunks for replication



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 14905636; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 14905636; Country of ref document: EP; Kind code of ref document: A1)