US20190114288A1 - Transferring differences between chunks during replication - Google Patents

Transferring differences between chunks during replication

Info

Publication number
US20190114288A1
Authority
US
United States
Prior art keywords
chunk
data
node
storage node
requested
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abandoned
Application number
US16/209,598
Inventor
Murali Bashyam
Sreekanth Garigala
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Quest Software Inc
Original Assignee
Quest Software Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Quest Software Inc filed Critical Quest Software Inc
Priority to US16/209,598 priority Critical patent/US20190114288A1/en
Publication of US20190114288A1 publication Critical patent/US20190114288A1/en
Abandoned legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10File systems; File servers
    • G06F16/17Details of further file system functions
    • G06F16/174Redundancy elimination performed by the file system
    • G06F16/1748De-duplication implemented within the file system, e.g. based on file segments
    • G06F16/1752De-duplication implemented within the file system, e.g. based on file segments based on file chunks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14Error detection or correction of the data by redundancy in operation
    • G06F11/1402Saving, restoring, recovering or retrying
    • G06F11/1446Point-in-time backing up or restoration of persistent data
    • G06F11/1448Management of the data involved in backup or backup restore
    • G06F11/1453Management of the data involved in backup or backup restore using de-duplication of the data

Definitions

  • the present disclosure relates generally to data storage, and more specifically to the operation of storage systems in which data is replicated across different storage nodes.
  • Data is often stored in storage systems that include more than one storage node on which data may be stored.
  • the data stored on a primary storage node may be mirrored on one or more secondary storage nodes.
  • Data may be synchronized in this way for several purposes. For instance, storing data on more than one storage node may provide redundancy in case of storage node failure and/or improved data access times in case one storage node receives more access requests than it can handle in a timely fashion.
  • Some data storage systems may perform operations related to data deduplication.
  • data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data.
  • Deduplication techniques may be used to improve storage utilization or network data transfers by effectively reducing the number of bytes that must be sent or stored.
  • unique chunks of data, or byte patterns, are identified and stored during a process of analysis. As the analysis continues, other chunks are compared to the stored copy and a redundant chunk may be replaced with a small reference that points to the stored chunk. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times, the amount of data that must be stored or transferred can be greatly reduced.
  • the match frequency may depend at least in part on the chunk size. Different storage systems may employ different chunk sizes or may support variable chunk sizes.
  • Deduplication differs from standard file compression techniques. While standard file compression techniques typically identify short repeated substrings inside individual files, storage-based data deduplication involves inspecting potentially large volumes of data and identifying potentially large sections, such as entire files or large sections of files, that are identical, in order to store only one copy of a duplicate section. In some instances, this copy may be additionally compressed by single-file compression techniques. For example, a typical email system might contain 100 instances of the same one megabyte (MB) file attachment. Each time the system is backed up, all 100 instances of the attachment are saved, requiring 100 MB of storage space. With data deduplication, the storage space required may be limited to only one instance of the attachment. Subsequent instances may be referenced back to the saved copy, for a deduplication ratio of roughly 100 to 1.
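
The email attachment example can be sketched in a few lines. This is only an illustration under stated assumptions: the `ChunkStore` class and its method names are hypothetical, SHA-256 is an arbitrary choice of fingerprint, and the size of the references themselves is ignored, which is why the text says "roughly" 100 to 1.

```python
import hashlib


class ChunkStore:
    """Toy content-addressed store (hypothetical): each unique chunk is kept once."""

    def __init__(self):
        self.chunks = {}      # fingerprint -> chunk bytes (stored once)
        self.references = []  # one fingerprint per logical write

    def put(self, data: bytes) -> str:
        fp = hashlib.sha256(data).hexdigest()
        self.chunks.setdefault(fp, data)  # keep only the first copy
        self.references.append(fp)        # later copies become references
        return fp

    def logical_bytes(self) -> int:
        return sum(len(self.chunks[fp]) for fp in self.references)

    def stored_bytes(self) -> int:
        return sum(len(chunk) for chunk in self.chunks.values())


store = ChunkStore()
attachment = b"x" * (1024 * 1024)  # one 1 MB attachment
for _ in range(100):               # saved 100 times
    store.put(attachment)

ratio = store.logical_bytes() / store.stored_bytes()
print(f"stored {store.stored_bytes()} bytes, ratio {ratio:.0f}:1")
```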
  • FIG. 1 shows an example of an arrangement of data in a storage node, arranged in accordance with one or more embodiments.
  • FIG. 2 illustrates a particular example of a system that can be used in conjunction with the techniques and mechanisms of the present invention.
  • FIG. 3 illustrates a data replication method, performed in accordance with one or more embodiments.
  • FIG. 4 illustrates a source node chunk replication method, performed in accordance with one or more embodiments.
  • FIG. 5 illustrates a particular example of a storage system.
  • FIG. 6 illustrates a target node chunk replication method, performed in accordance with one or more embodiments.
  • a system uses a processor in a variety of contexts. However, it will be appreciated that a system can use multiple processors while remaining within the scope of the present invention unless otherwise noted.
  • the techniques and mechanisms of the present invention will sometimes describe a connection between two entities. It should be noted that a connection between two entities does not necessarily mean a direct, unimpeded connection, as a variety of other entities may reside between the two entities.
  • a processor may be connected to memory, but it will be appreciated that a variety of bridges and controllers may reside between the processor and memory. Consequently, a connection does not necessarily mean a direct, unimpeded connection unless otherwise noted.
  • techniques and mechanisms described herein may replicate data from a source storage node to a target storage node.
  • the replication source node may compute and store a set of fingerprints in a fingerprint index for each data chunk that is replicated.
  • the fingerprint index may map each fingerprint to its corresponding chunk address.
  • the source node may compute the fingerprints for the requested chunk. Then, the source node may look up each fingerprint in the fingerprint index and select the chunk address that has the highest frequency of occurrence. Next, the source node may determine a delta between the requested chunk and the selected chunk. If the delta is relatively small, then the source node may transmit to the target node an identifier for the selected chunk as well as the delta between the selected chunk and the requested chunk. Then, the target node may reconstruct the requested chunk based on the transmitted information.
  • a file can be logically broken into a sequence of chunks.
  • a chunk may be associated with metadata such as the offset in the file at which the chunk occurs, the chunk size, the portion of the chunk used at that offset, and a hash or fingerprint of the chunk.
  • files may be stored in duplicate on different storage nodes for any of a variety of purposes such as redundancy or reduced access times.
  • When data in a storage system is replicated, new files or changes to files made on a source node are transmitted to a replication target node so that the replication target node contains an accurate replica of the data stored on the source node.
  • chunk metadata such as chunk hashes, offset, and size information may be sent to a replication target node.
  • the target node may consult an index based on the chunk metadata and identify the chunks that it needs the source to transfer.
  • the target node may then transmit a message to the source node indicating the chunks that need to be transferred.
  • the requested chunks may be transmitted from the source node to the target node.
  • the nodes may communicate to confirm that all chunks have been received.
  • the second phase, in which the requested chunks are transmitted from the source node to the target node, is the most time consuming since a potentially large amount of data may need to be replicated over a network link.
  • the network link may have a relatively small throughput when compared to other data transfer links in the system.
  • a storage node may send and receive data via a fast internal LAN such as a 100 Mbps or 1 Gbps network.
  • data replicated between storage nodes may be transmitted via a slower connection, such as a WAN link operating at speeds of 64 Kb/s to 10 Mbps.
  • the transfer can be performed more quickly.
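
To make the link-speed comparison concrete, the following back-of-the-envelope calculation estimates how long one 8 kilobyte chunk takes to cross each kind of link mentioned above. It assumes decimal rate units (64 Kb/s = 64,000 bits per second) and ignores protocol overhead and latency; the function name is illustrative only.

```python
def transfer_seconds(num_bytes: int, bits_per_second: float) -> float:
    """Idealized transfer time: payload bits divided by link rate."""
    return num_bytes * 8 / bits_per_second


CHUNK_BYTES = 8 * 1024  # one 8 kilobyte chunk

links = {
    "64 Kb/s WAN": 64_000,
    "10 Mbps WAN": 10_000_000,
    "1 Gbps LAN": 1_000_000_000,
}

times = {name: transfer_seconds(CHUNK_BYTES, rate) for name, rate in links.items()}
for name, seconds in times.items():
    print(f"{name}: {seconds:.6f} s per chunk")
```

Under these assumptions a single chunk takes about one second on the slowest WAN link, so any reduction in bytes sent translates almost directly into replication time saved.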
  • One technique for reducing the volume of data transferred is compressing the data if it is compressible.
  • Another technique for reducing the data volume is to first identify similar chunks and then transfer only the difference between the chunks. After this difference, also referred to as a “delta”, is transferred from the source node to the target node, the target node can reconstruct the new chunk by applying the delta to the similar chunk.
  • the nature and workflow of backup applications is such that overwrites and modifications made to the files in the dataset being backed up result in incremental changes to previously stored chunks in the system or altogether new chunks. According to various embodiments, those incremental changes made to a chunk may be replicated without transferring the entire chunk.
  • a set of fingerprints may be computed and stored for a chunk.
  • a fingerprint may also be referred to as a hash or checksum.
  • Each fingerprint may correspond to an offset within the chunk. For instance, each chunk may be divided into a designated number of subchunks, and each fingerprint may correspond with a subchunk. Any of various hashing techniques may be used to compute the fingerprint. For instance, the fingerprint may be a Rabin checksum.
  • a change made to a portion of the chunk may be detected since only the checksums spanning the modified ranges of the chunk will change. The rest of the checksums, which span the unchanged ranges of the chunk, will stay the same.
  • data storage characteristics such as the chunk size and number of fingerprints per chunk may be strategically determined based on factors such as the characteristics of the underlying storage system. For instance, the system may store 8 checksums of 64 bytes each for every chunk in an index. If the chunk size is 8 kilobytes, then a checksum is calculated over each 1 kilobyte range of the chunk.
  • the system may identify the original chunk and the modified chunk as similar since 7 out of 8 of the checksums for the two chunks will match, with only the 5th checksum being different.
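
A minimal sketch of the per-subchunk checksum idea follows. It assumes an 8 kilobyte chunk with one checksum per 1 kilobyte range, as in the example above. BLAKE2b stands in for the Rabin checksum named in the text, and the 8-byte digest size is an arbitrary choice for illustration.

```python
import hashlib
from typing import List

SUBCHUNK_SIZE = 1024  # 8 KB chunk / 8 checksums = 1 KB per checksum


def subchunk_fingerprints(chunk: bytes, size: int = SUBCHUNK_SIZE) -> List[bytes]:
    """Return one fingerprint per fixed-size range of the chunk."""
    return [
        hashlib.blake2b(chunk[i:i + size], digest_size=8).digest()
        for i in range(0, len(chunk), size)
    ]


# Build an 8 KB chunk whose eight 1 KB ranges all differ from each other.
original = b"".join(bytes([n]) * SUBCHUNK_SIZE for n in range(8))

# Flip one byte inside the 5th kilobyte (index 4 when counting from zero).
edited = bytearray(original)
edited[4 * SUBCHUNK_SIZE + 10] ^= 0xFF
modified = bytes(edited)

matches = [
    a == b
    for a, b in zip(subchunk_fingerprints(original), subchunk_fingerprints(modified))
]
print(sum(matches), "of", len(matches), "checksums match")
```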
  • FIG. 1 shows an example of an arrangement of data in a storage node.
  • FIG. 1 includes a portion of a source storage node fingerprint index 100.
  • FIG. 1 also includes a representation of a requested data chunk 106 and a reference data chunk A 108.
  • the fingerprint index may be used to identify a reference data chunk that is similar to a requested data chunk.
  • a replication target node may determine that it needs to receive a particular data chunk in order to maintain a replica of data stored on a source storage node. When the replication target node makes such a determination, it transmits a request for the chunk to the source storage node.
  • a chunk may be logically divided into a number of subchunks.
  • the requested chunk 106 is divided into subchunks numbered 1-8.
  • These subchunks may correspond with data ranges within the chunk.
  • an 8 kilobyte chunk may be divided into 8 subchunks, each of 1 kilobyte.
  • the chunk size, subchunk size, and number of subchunks within a chunk may differ from the examples discussed herein.
  • the source storage node may hash each subchunk to determine a subchunk identifier. These subchunk identifiers may then be looked up in the source storage node fingerprint index.
  • the source storage node fingerprint index portion corresponding with the data subchunk portions associated with the requested chunk 106 is shown at 100.
  • the storage node fingerprint index includes a data column associated with the data subchunk identifier 102 and a data column associated with the chunk identifiers 104.
  • the data included in a row of the data subchunk identifier column 102 represents a fingerprint associated with a particular subchunk.
  • the data included in a row of the chunk identifiers column 104 represents one or more identifiers each corresponding with a particular data chunk stored in the storage system.
  • the storage node fingerprint index may be used to identify a chunk associated with a given subchunk.
  • If a chunk is listed in the fingerprint index as being associated with a particular subchunk, then the chunk includes the subchunk as a portion of the chunk. For instance, in FIG. 1, the first row of the fingerprint index portion indicates that the chunk A includes the data subchunk 1.
  • the relationship between subchunks and chunks may be one-to-one or one-to-many.
  • the data subchunk 1 is only found in chunk A
  • the data subchunk 2 is only found in chunk B
  • the data subchunk 3 is found in both chunk A and chunk B.
  • the source storage node may use the fingerprint index to identify a reference chunk that is similar to a requested chunk. For instance, in FIG. 1, the requested chunk includes 8 different subchunks, numbered 1-8.
  • the storage node fingerprint index indicates that the data subchunks 1, 3, 4, 5, 7, and 8 are each part of the chunk A.
  • the index also indicates that the data subchunks 2, 3, and 4 are each part of the chunk B.
  • the index indicates that the data subchunk 6 is not part of any chunk referenced by the fingerprint index.
  • the chunk A is the chunk that is most similar to the requested chunk 106 because the chunk A has the highest frequency of matches in the fingerprint index portion to the data subchunks included within the requested chunk 106.
  • the reference chunk A includes the data subchunks 1, 3, 4, 5, 7, and 8 that are each part of the requested chunk 106.
  • the reference chunk A also includes the data subchunks 9 and 10 that are not part of the requested chunk 106.
  • 6 out of 8 of the subchunks of the requested chunk 106 may be found within the reference chunk 108.
  • only 3 of the subchunks of the requested chunk 106 are part of the chunk B. Therefore, chunk A is a closer match to the requested chunk than chunk B.
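
The FIG. 1 index portion described above can be modeled as a simple mapping, shown here as a hedged sketch. The string keys `fp1` through `fp8` are placeholders for the fingerprints of subchunks 1-8, and the dictionary layout is an assumption for illustration, not the storage format of the index.

```python
# Hypothetical in-memory form of the FIG. 1 index portion:
# subchunk fingerprint -> identifiers of stored chunks containing that subchunk.
fingerprint_index = {
    "fp1": {"A"},
    "fp2": {"B"},
    "fp3": {"A", "B"},   # one-to-many: subchunk 3 is in both chunks
    "fp4": {"A", "B"},
    "fp5": {"A"},
    "fp6": set(),        # subchunk 6 is not part of any indexed chunk
    "fp7": {"A"},
    "fp8": {"A"},
}

# Fingerprints of the eight subchunks of the requested chunk 106, in order.
requested = ["fp1", "fp2", "fp3", "fp4", "fp5", "fp6", "fp7", "fp8"]


def matches_for(chunk_id: str) -> list:
    """Subchunks of the requested chunk that the given stored chunk contains."""
    return [fp for fp in requested if chunk_id in fingerprint_index.get(fp, set())]


unmatched = [fp for fp in requested if not fingerprint_index.get(fp)]

print("A:", len(matches_for("A")), "B:", len(matches_for("B")), "unmatched:", unmatched)
```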
  • the similarity between the requested chunk 106 and the reference chunk A 108 may be used to reduce the amount of data transmitted from the source node to the target replication node in response to the request for the requested chunk 106.
  • the source storage system may transmit data for reconstructing the requested chunk 106.
  • This data may include information such as an identifier corresponding to the reference chunk A 108, the missing data subchunks 2 and 6, and any metadata capable of being used to perform the reconstruction.
  • the fingerprint index may store offset information that indicates a location within the chunk at which a subchunk is located.
  • the offset information may be stored in conjunction with the fingerprint information, in conjunction with the chunk identification information, or in a separate data column.
  • a match between a subchunk fingerprint and a chunk in the fingerprint index may include a match on offset as well as fingerprint. Alternately, a match between a subchunk fingerprint and a chunk may occur even if the subchunk offset does not match.
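
The two matching policies (offset must match, or fingerprint alone suffices) can be sketched as follows. The index layout, the `lookup` helper, and the sample offsets are all hypothetical; the text leaves open where offset information is stored.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical layout: fingerprint -> list of (chunk_id, byte offset within that chunk)
OffsetIndex = Dict[str, List[Tuple[str, int]]]


def lookup(index: OffsetIndex, fingerprint: str, offset: int,
           require_offset_match: bool) -> Set[str]:
    """Return chunk ids matching the fingerprint, optionally at the same offset."""
    hits = index.get(fingerprint, [])
    if require_offset_match:
        return {chunk_id for chunk_id, off in hits if off == offset}
    return {chunk_id for chunk_id, _ in hits}


# Made-up entry: the same subchunk appears in A at byte 2048 and in B at byte 5120.
index = {"fp3": [("A", 2048), ("B", 5120)]}

strict = lookup(index, "fp3", offset=2048, require_offset_match=True)
relaxed = lookup(index, "fp3", offset=2048, require_offset_match=False)
print(strict, relaxed)
```

Requiring an offset match yields fewer but positionally aligned candidates, while ignoring the offset finds data that has shifted within a chunk at the cost of needing more metadata to reconstruct it.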
  • FIG. 1 depicts only a portion of an example arrangement of data on a storage system.
  • each storage node may store many different chunks. Accordingly, the fingerprint index at the source node may potentially be quite long and may indicate relationships between many different subchunks and many different chunks.
  • FIG. 2 illustrates one example of a system that can be used as a storage node in a deduplication system.
  • a system 200 suitable for implementing particular embodiments of the present invention includes a processor 201, a memory 203, an interface 211, persistent storage 205, and a bus 215 (e.g., a PCI bus).
  • When acting under the control of appropriate software or firmware, the processor 201 is responsible for tasks such as optimization.
  • Various specially configured devices can also be used in place of a processor 201 or in addition to processor 201. The complete implementation can also be done in custom hardware.
  • the interface 211 is typically configured to send and receive data packets or data segments over a network.
  • interfaces the device supports include Ethernet interfaces, frame relay interfaces, cable interfaces, DSL interfaces, token ring interfaces, and the like.
  • Persistent storage 205 may include disks, disk arrays, tape devices, solid state storage, etc.
  • various very high-speed interfaces may be provided such as fast Ethernet interfaces, Gigabit Ethernet interfaces, ATM interfaces, HSSI interfaces, POS interfaces, FDDI interfaces and the like.
  • these interfaces may include ports appropriate for communication with the appropriate media.
  • they may also include an independent processor and, in some instances, volatile RAM.
  • the independent processors may control such communications intensive tasks as packet switching, media control and management.
  • the system 200 uses memory 203 to store data and program instructions and maintain a local side cache.
  • the program instructions may control the operation of an operating system and/or one or more applications, for example.
  • the memory or memories may also be configured to store received metadata and batch requested metadata.
  • the present invention relates to tangible, machine readable media that include program instructions, state information, etc. for performing various operations described herein.
  • machine-readable media include hard disks, floppy disks, magnetic tape, optical media such as CD-ROM disks and DVDs; magneto-optical media such as optical disks, and hardware devices that are specially configured to store and perform program instructions, such as read-only memory devices (ROM) and programmable read-only memory devices (PROMs).
  • program instructions include both machine code, such as produced by a compiler, and files containing higher level code that may be executed by the computer using an interpreter.
  • FIG. 3 illustrates a data replication method 300, performed in accordance with one or more embodiments.
  • the method 300 may be performed at a source storage node in communication with a target replication node.
  • the method 300 may be used to replicate data stored on the source storage node to the target replication node. After replication, the replicated data is available on both nodes.
  • the method 300 may be performed at any of various times.
  • the method 300 may be performed when new data is received for storage on the source storage node.
  • replication may be performed periodically, at scheduled times, or upon request.
  • each chunk fingerprint is a hashed value that is computed by applying a hash function such as a Rabin hash to the underlying chunk data.
  • each chunk may be a file, a portion of a file, or any other range of data that may be stored in a storage system.
  • the techniques and mechanisms described herein apply generally to a wide variety of storage systems including storage systems that differ in terms of characteristics such as chunk size.
  • the chunk fingerprint may be used by the target replication node to determine whether the target replication node is missing the chunk corresponding to the chunk fingerprint. For instance, the target replication node may use a chunk fingerprint to look up the chunk in a database indexed by chunk fingerprint to determine whether the chunk is stored on the target replication node. If the chunk is already present on the target replication node, then the target replication node need not request the chunk from the source node.
  • the hashing function used to generate the chunk fingerprint need not uniquely identify a particular chunk.
  • a data chunk may include 8 kilobytes of data, while a chunk fingerprint may be 64 bytes, 512 bytes, or some other size.
  • a given chunk fingerprint may potentially correspond to two different chunks.
  • the target replication node may simply need to send a subsequent request for a chunk that at first appeared to be stored on the target replication node but in actuality was not.
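
The target-side check described above can be sketched as a lookup against an index keyed by chunk fingerprint. The names, fingerprint strings, and storage locations below are made up for illustration. As the text notes, a fingerprint hit may occasionally be a false positive, and this sketch does not handle that follow-up request.

```python
from typing import Dict, List


def chunks_to_request(offered_fingerprints: List[str],
                      target_index: Dict[str, str]) -> List[str]:
    """Return the offered fingerprints that the target does not already hold."""
    return [fp for fp in offered_fingerprints if fp not in target_index]


# Hypothetical target-side database: chunk fingerprint -> storage location.
target_index = {
    "fp-a": "/store/0001",
    "fp-c": "/store/0002",
}

# Fingerprints transmitted by the source node for the data being replicated.
offered = ["fp-a", "fp-b", "fp-c", "fp-d"]

missing = chunks_to_request(offered, target_index)
print("request from source:", missing)
```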
  • a request is received to transmit chunks to the target replication node.
  • the requested chunks may include those that are not yet stored on the target replication node.
  • the requested chunks may be identified by chunk identifiers or by the chunk fingerprints transmitted from the source node to the target node.
  • the requested chunks are provided to the target replication node.
  • each chunk may be provided in any of various ways. For example, in some instances the entire chunk may be transferred. In other instances, an identifier for a reference chunk may be transmitted along with delta information for reconstructing the requested chunk from the reference chunk. Techniques for providing requested chunks to the target replication node are discussed in further detail with respect to FIG. 4. Techniques for reconstructing a requested chunk from a reference chunk are discussed in further detail with respect to FIG. 6.
  • the subchunk index is updated to include the provided chunks.
  • An example of a subchunk index is shown in FIG. 1.
  • updating the subchunk index may involve storing or updating entries for each subchunk of a replicated chunk when necessary. For instance, if a new chunk is transmitted to the target replication node, then the subchunk index is updated to indicate an association between the new chunk and a subchunk fingerprint for each subchunk of the new chunk.
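
Updating the subchunk index after a chunk is replicated might look like the sketch below. It assumes fixed 1 kilobyte subchunks and uses SHA-256 in place of whatever fingerprint function the system employs. The function name and the `defaultdict` layout are illustrative assumptions.

```python
import hashlib
from collections import defaultdict


def update_subchunk_index(index, chunk_id: str, chunk: bytes,
                          subchunk_size: int = 1024) -> None:
    """Associate every subchunk fingerprint of `chunk` with `chunk_id`."""
    for offset in range(0, len(chunk), subchunk_size):
        fp = hashlib.sha256(chunk[offset:offset + subchunk_size]).hexdigest()
        index[fp].add(chunk_id)


index = defaultdict(set)  # subchunk fingerprint -> set of chunk ids

# Chunk A: eight distinct 1 KB subchunks.
chunk_a = b"".join(bytes([n]) * 1024 for n in range(8))
# Chunk B: shares its first four subchunks with A, then four identical new ones.
chunk_b = chunk_a[:4096] + b"\xff" * 4096

update_subchunk_index(index, "A", chunk_a)
update_subchunk_index(index, "B", chunk_b)

shared = [fp for fp, ids in index.items() if ids == {"A", "B"}]
print(len(index), "index entries,", len(shared), "shared by A and B")
```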
  • FIG. 4 illustrates a source node chunk replication method 400, performed in accordance with one or more embodiments.
  • the method 400 may be performed at a source storage node in communication with a target storage node.
  • the source storage node may be configured to provide data for replication to the target storage node.
  • the method 400 may be performed during the process of replication, when a request is received to replicate data to the target storage node.
  • a request for a data chunk is received.
  • the request may be received in response to a determination that the data chunk should be stored at the target storage node in order to replicate corresponding data stored on the source storage node.
  • the request may be received in response to a set of fingerprints transmitted to the target replication node, as discussed with respect to operations 302 and 304 in FIG. 3.
  • the request may identify the data chunk in any of various ways.
  • the request may include an identifier and/or a fingerprint corresponding with the requested data chunk.
  • a set of data chunk fingerprints for subchunks of the requested chunk are determined.
  • the set of data chunk fingerprints may be determined by first dividing the data chunk into subchunks. For instance, an 8 kilobyte chunk may be divided into 1 kilobyte subchunks. Then, a hash function may be applied to each subchunk to produce a corresponding fingerprint. Any of various types of hash functions may be used. For instance, the system may employ a Rabin hash function.
  • one or more data chunks associated with the fingerprint and stored on the target storage node are identified from the fingerprint index.
  • the one or more data chunks may be identified by looking up each fingerprint in the fingerprint index.
  • the subchunk fingerprint index maps each subchunk fingerprint to a chunk in which the subchunk is included.
  • Each subchunk fingerprint may be associated with zero, one, two, or more chunks.
  • identifying the one or more data chunks may involve creating a frequency list.
  • a frequency list may identify a number of data chunks that include subchunks within the requested data chunk. For each of the identified data chunks, a number of subchunks included in the identified data chunk may also be determined.
  • the chunk A includes 6 subchunks that overlap with the requested chunk, while the chunk B includes 3 subchunks that overlap with the requested chunk.
  • a frequency list for the requested chunk 106 shown in FIG. 1 would include chunk A linked with the frequency count 6 and the chunk B linked with the frequency count 3.
  • one of the identified data chunks having a high frequency of occurrence is selected.
  • the selected chunk may be the chunk identified at operation 406 that has the highest frequency of occurrence.
  • the frequency list may be sorted, and the highest frequency chunk may be selected.
  • the chunk A is selected.
  • the highest frequency chunk may be the chunk that has the most overlap with the requested chunk.
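
Building the frequency list and picking the highest-frequency chunk can be sketched with a counter. The index contents reproduce the FIG. 1 example. The placeholder fingerprint names and the `frequency_list` helper are assumptions for illustration.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def frequency_list(requested_fps: List[str],
                   index: Dict[str, Set[str]]) -> List[Tuple[str, int]]:
    """Count, per stored chunk, how many requested subchunks it contains.

    The result is sorted from highest to lowest count.
    """
    counts = Counter()
    for fp in requested_fps:
        for chunk_id in index.get(fp, ()):
            counts[chunk_id] += 1
    return counts.most_common()


# FIG. 1 example: placeholder fingerprints for subchunks 1-8 of the requested chunk.
index = {
    "fp1": {"A"}, "fp2": {"B"}, "fp3": {"A", "B"}, "fp4": {"A", "B"},
    "fp5": {"A"}, "fp7": {"A"}, "fp8": {"A"},  # fp6 has no entry at all
}
requested = [f"fp{n}" for n in range(1, 9)]

freq = frequency_list(requested, index)
selected = freq[0][0] if freq else None  # None means no similar chunk exists
print(freq, "->", selected)
```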
  • a difference (or delta) between the requested data chunk and the selected data chunk is determined.
  • the delta may represent the data included in the requested data chunk that is not also within the selected (or reference) data chunk.
  • the delta between the requested data chunk and the reference data chunk A is the set of two subchunks corresponding to subchunk 2 and subchunk 6.
  • identifying the delta may involve calculating a difference via an algorithm such as the VCDIFF algorithm for delta encoding.
  • the VCDIFF algorithm may identify delta data to include in conjunction with the reference chunk data as well as metadata for combining the delta data with the reference chunk data.
  • identifying the delta may involve identifying metadata for combining the delta data with the reference data chunk.
  • the metadata may include offset information.
  • the offset information may indicate that the subchunk 2 is located in the second position, while the subchunk 6 is located in the sixth position.
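
As a simplified stand-in for a delta encoder such as VCDIFF, the sketch below compares the requested and reference chunks position by position and records only the subchunks that differ, keyed by their 1-based position. It assumes both chunks have the same length and aligned subchunks, which a real delta algorithm does not require. The helper names are hypothetical.

```python
from typing import Dict

SUBCHUNK_SIZE = 1024


def subchunk_delta(requested: bytes, reference: bytes,
                   size: int = SUBCHUNK_SIZE) -> Dict[int, bytes]:
    """Map 1-based subchunk position -> requested data that differs from the reference."""
    delta = {}
    for start in range(0, len(requested), size):
        piece = requested[start:start + size]
        if reference[start:start + size] != piece:
            delta[start // size + 1] = piece
    return delta


def block(n: int) -> bytes:
    """A recognizable 1 KB subchunk filled with the byte value n."""
    return bytes([n]) * SUBCHUNK_SIZE


# Requested chunk holds subchunks 1-8; reference chunk A holds 9 and 10
# where the requested chunk holds 2 and 6, as in the FIG. 1 example.
requested = b"".join(block(n) for n in range(1, 9))
reference = b"".join(block(n) for n in [1, 9, 3, 4, 5, 10, 7, 8])

delta = subchunk_delta(requested, reference)
print("positions to send:", sorted(delta))
```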
  • the designated threshold may be strategically determined based on any of various factors such as the chunk size, the subchunk size, and the amount of metadata information needed to reconstruct a requested chunk from a reference chunk.
  • the determination made at 412 may reflect the various tradeoffs involved in reconstructing the requested chunk at the target node. For example, reconstructing the requested chunk at the target node involves some amount of computing resources. As another example, delta information, reference chunk identification information, and metadata information may still need to be transmitted from the source node to the target node. As yet another example, some chance may exist that the reference chunk is not actually present on the target storage node, which may involve additional network traffic such as transferring request messages and the entire requested chunk. Accordingly, if the reconstruction information is not significantly smaller than the requested data chunk, then transmitting the entire requested data chunk may be more efficient than transmitting the reconstruction information.
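
The threshold decision can be sketched as a size comparison. The 0.5 threshold, the function name, and the sample sizes are arbitrary assumptions; the text only says the threshold is chosen based on factors such as chunk size, subchunk size, and metadata overhead.

```python
def choose_transfer(chunk_size: int, delta_size: int, metadata_size: int,
                    threshold: float = 0.5) -> str:
    """Return "delta" if reconstruction info is small enough, else "full".

    `threshold` is a hypothetical fraction of the chunk size; the text does
    not specify a value.
    """
    reconstruction_size = delta_size + metadata_size
    if reconstruction_size <= threshold * chunk_size:
        return "delta"
    return "full"


# Two of eight 1 KB subchunks differ: sending the delta is worthwhile.
small_delta = choose_transfer(chunk_size=8192, delta_size=2048, metadata_size=64)
# Seven of eight subchunks differ: send the whole chunk instead.
large_delta = choose_transfer(chunk_size=8192, delta_size=7168, metadata_size=64)
print(small_delta, large_delta)
```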
  • the delta information and an identifier for the selected data chunk are transmitted to the requesting node.
  • the information transmitted may include any information for reconstructing the requested chunk at the target node.
  • the information transmitted may include metadata information such as offset data that identifies the location within the chunk at which the delta information is located.
  • the requested chunk is transmitted to the requesting node.
  • the entire requested chunk may be transmitted if transmitting difference information is inefficient for any of various reasons. Alternately, the entire requested chunk may be transmitted if no similar reference chunk is stored on the target node for use in reconstructing the requested chunk.
  • FIG. 5 illustrates a particular example of a system that can be used in conjunction with the techniques and mechanisms of the present invention.
  • data is received at an accelerated deduplication system 500 over an interface such as a network interface.
  • a data stream may be received in segments or blocks and maintained in system memory 503 .
  • a processor or CPU 501 maintains a state machine but offloads boundary detection and fingerprinting to a deduplication engine or deduplication accelerator 505.
  • the CPU 501 is associated with cache 511 and memory controller 513.
  • cache 511 and memory controller 513 may be integrated onto the CPU 501.
  • the deduplication engine or deduplication accelerator 505 is connected to the CPU 501 over a system bus 515 and detects boundaries using an algorithm such as Rabin to delineate segments of data in system memory 503 and generates fingerprints using hashing algorithms such as SHA-1 or MD5.
  • the deduplication engine 505 accesses the deduplication dictionary 507 to determine if a fingerprint is already included in the deduplication dictionary 507.
  • the deduplication dictionary 507 is maintained in persistent storage and maps segment fingerprints to segment storage locations. In particular embodiments, segment storage locations are maintained in fixed size extents. Datastore suitcases, references, metadata, etc., may be created or modified based on the result of the dictionary lookup.
  • the optimization software stack will communicate to the CPU 501 the final destination direct memory access (DMA) addresses for the data.
  • the DMA addresses can then be used to transfer the data through one or more bus bridges 517 and/or 527 and secondary buses 519 and/or 529.
  • a secondary bus is a peripheral component interconnect (PCI) bus 519.
  • Peripherals 551, 523, 525, 531, and 533 may be peripheral components and/or peripheral interfaces such as disk arrays, network interfaces, serial interfaces, timers, tape devices, etc.
  • FIG. 6 illustrates a target node chunk replication method 600, performed in accordance with one or more embodiments.
  • the method 600 may be performed at a target storage node configured to replicate data stored on a source storage node.
  • the method 600 may be performed during a replication operation for ensuring that the data stored on the target storage node is the same as corresponding data stored on the source storage node.
  • a request for a data chunk is transmitted to a source storage node.
  • the requested data chunk may be a portion of data to be replicated from the source storage node to the target storage node.
  • the requested data chunk may be a data chunk identified based on a set of chunk fingerprints transmitted from the source storage node to the target storage node, as discussed with respect to operation 302 in FIG. 3.
  • the request may identify the data chunk in any of various ways.
  • the request may include an identifier associated with the data chunk.
  • the request may include a fingerprint value associated with the requested data chunk.
  • data chunk reconstruction information is received from the source storage node.
  • the data chunk reconstruction information may include any information capable of being used to create the requested data chunk.
  • the data chunk reconstruction information may include an identifier corresponding to a reference data chunk, delta data that represents a difference in data between the reference data chunk and the requested data chunk, and/or metadata information for use in combining the reference data chunk with the delta data to create the requested data chunk.
  • a reference data chunk for reconstructing the requested data chunk is identified.
  • the reference data chunk may be identified based on information included in the data chunk reconstruction information received at operation 604 .
  • the source storage node may have out-of-date information regarding which data chunks are stored on the target storage node. For instance, an intervening operation between the time at which the source storage node determines that a data chunk is stored on the target storage node and the time at which the data chunk reconstruction information is received from the source storage node may have caused the reference data chunk to be deleted from the target storage node.
  • the determination made at operation 608 may be made at least in part by looking up information associated with the reference data chunk in a data dictionary residing at the target storage node.
  • a data dictionary may indicate a storage location corresponding to each data chunk residing in the storage system, indexed by an identifier associated with each data chunk.
  • the reference data chunk is combined with delta information to produce the requested data chunk.
  • combining the reference data chunk with the delta information may involve any operations related to reconstructing the requested data chunk at the target storage node.
  • the data corresponding with the reference data chunk may be retrieved from the storage system.
  • the delta information may be added in the appropriate positions in the reference data chunk to create the requested data chunk.
  • the data chunk reconstruction information may include metadata such as subchunk offsets that indicate one or more locations within the reference data chunk at which the delta information should be placed.
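  • Combining a reference chunk with delta information can be sketched as below. The representation of the delta as a list of `(offset, data)` pairs and the function name `apply_delta` are assumptions for illustration; the text only states that offsets indicate where the delta information is placed within the reference chunk.

```python
def apply_delta(reference: bytes, patches):
    """Overwrite ranges of the reference chunk with delta data.

    patches is a list of (offset, data) pairs, where each offset is a byte
    position within the reference chunk (a hypothetical format).
    """
    chunk = bytearray(reference)
    for offset, data in patches:
        if offset < 0 or offset + len(data) > len(chunk):
            raise ValueError("patch falls outside the reference chunk")
        chunk[offset:offset + len(data)] = data
    return bytes(chunk)


reference = b"AAAABBBBCCCCDDDD"          # four 4-byte subchunks
patches = [(4, b"XXXX"), (12, b"YYYY")]  # replace the 2nd and 4th subchunks
requested = apply_delta(reference, patches)
```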
  • a success message is transmitted to the source storage node.
  • the success message may identify the requested chunk.
  • the success message may include an identifier corresponding with the requested chunk and/or a fingerprint that identifies the requested chunk.
  • transmitting the success message to the source storage node may allow the source storage node to update the fingerprint index stored at the source storage node. In this way, the source storage node may be informed of the data stored at the target replication storage node. Then, when subsequent requests for data chunks are received at the source storage node, the source storage node may respond by determining whether to send the entire data chunk or data chunk reconstruction information, as discussed herein.
  • a request to the source storage node for transmitting the entire requested data chunk is transmitted.
  • an intervening action may have caused the reference data chunk to be no longer stored in the target node storage system.
  • the target node may be unable to reconstruct the requested chunk from the reference chunk.
  • the target node may transmit a new request for the source storage node to transmit the entire requested data chunk.
  • the source storage node may transmit reconstruction information based on a different reference data chunk.
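  • The target-side decision between reconstructing the chunk and falling back to a full transfer might look like the following sketch. The message layout (`reference_id`, `patches`), the return values, and the use of plain dictionaries for the data dictionary and storage system are hypothetical choices made for the example.

```python
def handle_reconstruction_info(info, data_dictionary, storage):
    """Return ("success", chunk) or ("need_full_chunk", None).

    data_dictionary maps chunk identifiers to storage locations, and storage
    maps locations to chunk bytes; both are in-memory stand-ins.
    """
    location = data_dictionary.get(info["reference_id"])
    if location is None:
        # The reference chunk is no longer stored on the target node,
        # so the target asks the source for the entire requested chunk.
        return "need_full_chunk", None
    chunk = bytearray(storage[location])
    for offset, data in info["patches"]:
        chunk[offset:offset + len(data)] = data
    return "success", bytes(chunk)


storage = {0: b"AAAABBBB"}
data_dictionary = {"chunk-A": 0}

ok = handle_reconstruction_info(
    {"reference_id": "chunk-A", "patches": [(4, b"ZZZZ")]},
    data_dictionary, storage)
stale = handle_reconstruction_info(
    {"reference_id": "chunk-B", "patches": [(0, b"ZZZZ")]},
    data_dictionary, storage)
```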
  • Although FIGS. 4 and 6 are described in the context of the replication of a single data chunk, potentially many different data chunks may be replicated. In this case, the operations discussed with respect to FIGS. 4 and 6 may be performed separately for each data chunk or may be combined for more than one data chunk. For example, data chunk reconstruction information may be received for potentially more than one data chunk in the same message or series of messages between the two nodes. As another example, the success message transmitted to the source storage node may identify a range or group of data chunks successfully stored at the target storage node.

Abstract

Techniques and mechanisms described herein facilitate the replication of data between storage nodes. According to various embodiments, a request to provide a data chunk to a target storage node may be received at a source data storage node. A reference data chunk may be identified based on fingerprint information associated with the requested data chunk. The reference data chunk may be stored on the target storage node. The reference data chunk and the requested data chunk may each include a first data portion. Data chunk reconstruction information may be transmitted from the source data storage node to the target data storage node. The data chunk reconstruction information may identify the reference data chunk. The data chunk reconstruction information may include data difference information for constructing the requested data chunk at the target data storage node based on the reference data chunk.

Description

    RELATED APPLICATION
  • This patent application is a continuation application that claims the benefit of the filing date of U.S. patent application Ser. No. 13/952,062, filed Jul. 26, 2013, and entitled “TRANSFERRING DIFFERENCES BETWEEN CHUNKS DURING REPLICATION” which is incorporated herein by reference in its entirety.
  • TECHNICAL FIELD
  • The present disclosure relates generally to data storage, and more specifically to the operation of storage systems in which data is replicated across different storage nodes.
  • DESCRIPTION OF RELATED ART
  • Data is often stored in storage systems that include more than one storage node on which data may be stored. In some systems, the data stored on a primary storage node may be mirrored on one or more secondary storage nodes. Data may be synchronized in this way for several purposes. For instance, storing data on more than one storage node may provide redundancy in case of storage node failure and/or improved data access times in case one storage node receives more access requests than it can handle in a timely fashion.
  • Some data storage systems may perform operations related to data deduplication. In computing, data deduplication is a specialized data compression technique for eliminating duplicate copies of repeating data. Deduplication techniques may be used to improve storage utilization or network data transfers by effectively reducing the number of bytes that must be sent or stored. In the deduplication process, unique chunks of data, or byte patterns, are identified and stored during a process of analysis. As the analysis continues, other chunks are compared to the stored copy and a redundant chunk may be replaced with a small reference that points to the stored chunk. Given that the same byte pattern may occur dozens, hundreds, or even thousands of times, the amount of data that must be stored or transferred can be greatly reduced. The match frequency may depend at least in part on the chunk size. Different storage systems may employ different chunk sizes or may support variable chunk sizes.
  • Deduplication differs from standard file compression techniques. While standard file compression techniques typically identify short repeated substrings inside individual files, storage-based data deduplication involves inspecting potentially large volumes of data and identifying potentially large sections (such as entire files or large sections of files) that are identical, in order to store only one copy of a duplicate section. In some instances, this copy may be additionally compressed by single-file compression techniques. For example, a typical email system might contain 100 instances of the same one megabyte (MB) file attachment. Without deduplication, each time the system is backed up, all 100 instances of the attachment are saved, requiring 100 MB of storage space. With data deduplication, the storage space required may be limited to only one instance of the attachment. Subsequent instances may be referenced back to the saved copy, for a deduplication ratio of roughly 100 to 1.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • The disclosure may best be understood by reference to the following description taken in conjunction with the accompanying drawings, which illustrate particular embodiments of the present invention.
  • FIG. 1 shows an example of an arrangement of data in a storage node, arranged in accordance with one or more embodiments.
  • FIG. 2 illustrates a particular example of a system that can be used in conjunction with the techniques and mechanisms of the present invention.
  • FIG. 3 illustrates a data replication method, performed in accordance with one or more embodiments.
  • FIG. 4 illustrates a source node chunk replication method, performed in accordance with one or more embodiments.
  • FIG. 5 illustrates a particular example of a storage system.
  • FIG. 6 illustrates a target node chunk replication method, performed in accordance with one or more embodiments.
  • DESCRIPTION OF PARTICULAR EMBODIMENTS
  • Reference will now be made in detail to some specific examples of the invention including the best modes contemplated by the inventors for carrying out the invention. Examples of these specific embodiments are illustrated in the accompanying drawings. While the invention is described in conjunction with these specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims.
  • For example, the techniques and mechanisms of the present invention will be described in the context of particular data storage mechanisms. However, it should be noted that the techniques and mechanisms of the present invention apply to a variety of different data storage mechanisms. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. Particular example embodiments of the present invention may be implemented without some or all of these specific details. In other instances, well known process operations have not been described in detail in order not to unnecessarily obscure the present invention.
  • Various techniques and mechanisms of the present invention will sometimes be described in singular form for clarity. However, it should be noted that some embodiments include multiple iterations of a technique or multiple instantiations of a mechanism unless noted otherwise. For example, a system uses a processor in a variety of contexts. However, it will be appreciated that a system can use multiple processors while remaining within the scope of the present invention unless otherwise noted. Furthermore, the techniques and mechanisms of the present invention will sometimes describe a connection between two entities. It should be noted that a connection between two entities does not necessarily mean a direct, unimpeded connection, as a variety of other entities may reside between the two entities. For example, a processor may be connected to memory, but it will be appreciated that a variety of bridges and controllers may reside between the processor and memory. Consequently, a connection does not necessarily mean a direct, unimpeded connection unless otherwise noted.
  • Overview
  • According to various embodiments, techniques and mechanisms described herein may replicate data from a source storage node to a target storage node. The replication source node may compute and store a set of fingerprints in a fingerprint index for each data chunk that is replicated. The fingerprint index may map each fingerprint to its corresponding chunk address. When the replication target node indicates a chunk that it needs, the source node may compute the fingerprints for the requested chunk. Then, the source node may look up each fingerprint in the fingerprint index and select the chunk address that has the highest frequency of occurrence. Next, the source node may determine a delta between the requested chunk and the selected chunk. If the delta is relatively small, then the source node may transmit to the target node an identifier for the selected chunk as well as the delta between the selected chunk and the requested chunk. Then, the target node may reconstruct the requested chunk based on the transmitted information.
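  • The source node's choice between sending a delta and sending the whole chunk can be sketched as follows. The text says only that the delta should be "relatively small"; the `max_delta_ratio` threshold of 0.5, the function name `build_response`, and the response layout are assumptions introduced for illustration.

```python
def build_response(requested: bytes, reference_id, delta, max_delta_ratio=0.5):
    """Pick reconstruction information or a full-chunk transfer.

    delta is a list of (offset, data) pairs; max_delta_ratio is a
    hypothetical cutoff for what counts as a "relatively small" delta.
    """
    delta_size = sum(len(data) for _, data in delta)
    if reference_id is not None and delta_size <= max_delta_ratio * len(requested):
        return {"type": "reconstruction",
                "reference_id": reference_id,
                "delta": delta}
    return {"type": "full_chunk", "data": requested}


requested = bytes(8000)
small = build_response(requested, "chunk-A", [(1000, bytes(1000))])
large = build_response(requested, "chunk-A", [(0, bytes(7000))])
no_reference = build_response(requested, None, [])
```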
  • Example Embodiments
  • In a deduplication based file-system, a file can be logically broken into a sequence of chunks. A chunk may be associated with metadata such as the offset in the file at which the chunk occurs, the chunk size, the portion of the chunk used at that offset, and a hash or fingerprint of the chunk.
  • In a replication storage system, files may be stored in duplicate on different storage nodes for any of a variety of purposes such as redundancy or reduced access times. When data in a storage system is replicated, new files or changes to files made on a source node are transmitted to a replication target node so that the replication target node contains an accurate replica of the data stored on the source node.
  • Various operations may be involved in replicating a file stored in such a system. In the first phase, chunk metadata such as chunk hashes, offset, and size information may be sent to a replication target node. The target node may consult an index based on the chunk metadata and identify the chunks that it needs the source to transfer. The target node may then transmit a message to the source node indicating the chunks that need to be transferred. In the second phase, the requested chunks may be transmitted from the source node to the target node. In the third phase, the nodes may communicate to confirm that all chunks have been received.
  • In many instances, the second phase is the most time consuming since a potentially large amount of data may need to be replicated over a network link. In some instances, the network link may have a relatively small throughput when compared to other data transfer links in the system. For instance, a storage node may send and receive data via a fast internal LAN such as a 100 Mbps or 1 Gbps network. However, data replicated between storage nodes may be transmitted via a slower connection, such as a WAN link operating at speeds of 64 Kb/s to 10 Mbps.
  • If the amount of data to be transferred can be reduced, the transfer can be performed more quickly. One technique for reducing the volume of data transferred is compressing the data if it is compressible. Another technique for reducing the data volume is to first identify similar chunks and then transfer only the difference between the chunks. After this difference, also referred to as a “delta”, is transferred from the source node to the target node, the target node can reconstruct the new chunk by applying the delta to the similar chunk.
  • The nature and workflow of backup applications is such that overwrites and modifications made to the files in the dataset being backed up result in incremental changes to previously stored chunks in the system or altogether new chunks. According to various embodiments, those incremental changes made to a chunk may be replicated without transferring the entire chunk.
  • According to various embodiments, a set of fingerprints may be computed and stored for a chunk. A fingerprint may also be referred to as a hash or checksum. Each fingerprint may correspond to an offset within the chunk. For instance, each chunk may be divided into a designated number of subchunks, and each fingerprint may correspond with a subchunk. Any of various hashing techniques may be used to compute the fingerprint. For instance, the fingerprint may be a Rabin checksum.
  • According to various embodiments, by storing a set of fingerprints for a chunk, a change made to a portion of the chunk may be detected since only the checksums spanning the modified ranges of the chunk will change. The rest of the checksums, which span the unchanged ranges of the chunk, will stay the same.
  • According to various embodiments, data storage characteristics such as the chunk size and number of fingerprints per chunk may be strategically determined based on factors such as the characteristics of the underlying storage system. For instance, the system may store 8 checksums of 64 bytes each for every chunk in an index. If the chunk size is 8 kilobytes, then a checksum is calculated over each 1 kilobyte range of the chunk.
  • As an illustrative example, suppose that a 100 byte modification has been made in the middle of a chunk at offset 4000 in a system with parameters as described in the preceding paragraph. In this case, the system may identify the original chunk and the modified chunk as similar since 7 out of 8 of the checksums for the two chunks will match, with only the 5th checksum being different.
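  • The example above can be reproduced with a short sketch. It assumes decimal kilobytes (1000-byte subchunks in an 8000-byte chunk), so that the 100 byte edit at offset 4000 falls entirely within the 5th subchunk; with 1024-byte subchunks the same edit would straddle the 4th and 5th subchunks. BLAKE2b with a 64-byte digest stands in for the Rabin checksum mentioned in the text, and the function name `subchunk_checksums` is hypothetical.

```python
import hashlib

SUBCHUNK_SIZE = 1000  # assumption: decimal kilobytes


def subchunk_checksums(chunk: bytes, size: int = SUBCHUNK_SIZE):
    """Return one 64-byte checksum per fixed-size subchunk."""
    return [hashlib.blake2b(chunk[i:i + size], digest_size=64).digest()
            for i in range(0, len(chunk), size)]


original = bytes(i % 251 for i in range(8000))
modified = bytearray(original)
modified[4000:4100] = b"\xff" * 100  # 100 byte modification at offset 4000
modified = bytes(modified)

before = subchunk_checksums(original)
after = subchunk_checksums(modified)

# 1-based positions of the checksums that differ between the two chunks
changed = [i + 1 for i, (a, b) in enumerate(zip(before, after)) if a != b]
matching = len(before) - len(changed)
```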
  • FIG. 1 shows an example of an arrangement of data in a storage node. FIG. 1 includes a portion of a source storage node fingerprint index 100. FIG. 1 also includes a representation of a requested data chunk 106 and a reference data chunk A 108. According to various embodiments, the fingerprint index may be used to identify a reference data chunk that is similar to a requested data chunk.
  • According to various embodiments, a replication target node may determine that it needs to receive a particular data chunk in order to maintain a replica of data stored on a source storage node. When the replication target node makes such a determination, it transmits a request for the chunk to the source storage node.
  • At 106, an example of such a requested chunk is shown. A chunk may be logically divided into a number of subchunks. For instance, the requested chunk 106 is divided into subchunks numbered 1-8. These subchunks may correspond with data ranges within the chunk. For example, an 8 kilobyte chunk may be divided into 8 subchunks, each of 1 kilobyte. However, according to various implementations, the chunk size, subchunk size, and number of subchunks within a chunk may differ from the examples discussed herein.
  • According to various embodiments, the source storage node may hash each subchunk to determine a subchunk identifier. These subchunk identifiers may then be looked up in the source storage node fingerprint index. The source storage node fingerprint index portion corresponding with the data subchunk portions associated with the requested chunk 106 is shown at 100.
  • According to various embodiments, the storage node fingerprint index includes a data column associated with the data subchunk identifier 102 and a data column associated with the chunk identifiers 104. The data included in a row of the data subchunk identifier column 102 represents a fingerprint associated with a particular subchunk. The data included in a row of the chunk identifiers column 104 represents one or more identifiers each corresponding with a particular data chunk stored in the storage system.
  • According to various embodiments, the storage node fingerprint index may be used to identify a chunk associated with a given subchunk. When a chunk is listed in the fingerprint index as being associated with a particular subchunk, then the chunk includes the subchunk as a portion of the chunk. For instance, in FIG. 1, the first row of the fingerprint index portion indicates that the chunk A includes the data subchunk 1.
  • According to various embodiments, the relationship between subchunks and chunks may be one-to-one or one-to-many. For instance, the data subchunk 1 is only found in chunk A, while the data subchunk 2 is only found in chunk B. However, the data subchunk 3 is found in both chunk A and chunk B.
  • According to various embodiments, the source storage node may use the fingerprint index to identify a reference chunk that is similar to a requested chunk. For instance, in FIG. 1, the requested chunk includes 8 different subchunks, numbered 1-8. The storage node fingerprint index indicates that the data subchunks 1, 3, 4, 5, 7, and 8 are each part of the chunk A. The index also indicates that the data subchunks 2, 3, and 4 are each part of the chunk B. Finally, the index indicates that the data subchunk 6 is not part of any chunk referenced by the fingerprint index.
  • At 108, a representation of the reference chunk A 108 is shown. In FIG. 1, the chunk A is the chunk that is most similar to the requested chunk 106 because the chunk A has the highest frequency of matches in the fingerprint index portion to the data subchunks included within the requested chunk 106. For instance, as shown at 108, the reference chunk A includes the data subchunks 1, 3, 4, 5, 7, and 8 that are each part of the requested chunk 106. The reference chunk A also includes the data subchunks 9 and 10 that are not part of the requested chunk 106. Thus, 6 out of 8 of the subchunks of the requested chunk 106 may be found within the reference chunk 108. In contrast, only 3 of the subchunks of the requested chunk 106 (subchunks 2, 3, and 4) are part of the chunk B. Therefore, chunk A is a closer match to the requested chunk than chunk B.
  • According to various embodiments, the similarity between the requested chunk 106 and the reference chunk A 108 may be used to reduce the amount of data transmitted from the source node to the target replication node in response to the request for the requested chunk 106. For instance, instead of sending each of the data subchunks that form the requested chunk 106, the source storage system may transmit data for reconstructing the requested chunk 106. This data may include information such as an identifier corresponding to the reference chunk A 108, the missing data subchunks 2 and 6, and any metadata capable of being used to perform the reconstruction.
  • According to various embodiments, various types of information may be stored within the fingerprint index. For example, the fingerprint index may store offset information that indicates a location within the chunk at which a subchunk is located. The offset information may be stored in conjunction with the fingerprint information, in conjunction with the chunk identification information, or in a separate data column. In particular embodiments, a match between a subchunk fingerprint and a chunk in the fingerprint index may include a match on offset as well as fingerprint. Alternately, a match between a subchunk fingerprint and a chunk may occur even if the subchunk offset does not match.
  • It should be noted that FIG. 1 depicts only a portion of an example arrangement of data on a storage system. In replicated storage systems, each storage node may store many different chunks. Accordingly, the fingerprint index at the source node may potentially be quite long and may indicate relationships between many different subchunks and many different chunks.
  • A variety of devices and applications can implement particular examples of the present invention. FIG. 2 illustrates one example of a system that can be used as a storage node in a deduplication system. According to particular example embodiments, a system 200 suitable for implementing particular embodiments of the present invention includes a processor 201, a memory 203, an interface 211, persistent storage 205, and a bus 215 (e.g., a PCI bus). When acting under the control of appropriate software or firmware, the processor 201 is responsible for tasks such as optimization. Various specially configured devices can also be used in place of a processor 201 or in addition to processor 201. The complete implementation can also be done in custom hardware. The interface 211 is typically configured to send and receive data packets or data segments over a network. Particular examples of interfaces the device supports include Ethernet interfaces, frame relay interfaces, cable interfaces, DSL interfaces, token ring interfaces, and the like. Persistent storage 205 may include disks, disk arrays, tape devices, solid state storage, etc.
  • In addition, various very high-speed interfaces may be provided such as fast Ethernet interfaces, Gigabit Ethernet interfaces, ATM interfaces, HSSI interfaces, POS interfaces, FDDI interfaces and the like. Generally, these interfaces may include ports appropriate for communication with the appropriate media. In some cases, they may also include an independent processor and, in some instances, volatile RAM. The independent processors may control such communications intensive tasks as packet switching, media control and management.
  • According to particular example embodiments, the system 200 uses memory 203 to store data and program instructions and maintain a local side cache. The program instructions may control the operation of an operating system and/or one or more applications, for example. The memory or memories may also be configured to store received metadata and batch requested metadata.
  • Because such information and program instructions may be employed to implement the systems/methods described herein, the present invention relates to tangible, machine readable media that include program instructions, state information, etc. for performing various operations described herein. Examples of machine-readable media include hard disks, floppy disks, and magnetic tape; optical media such as CD-ROM disks and DVDs; magneto-optical media such as optical disks; and hardware devices that are specially configured to store and perform program instructions, such as read-only memory devices (ROM) and programmable read-only memory devices (PROMs). Examples of program instructions include both machine code, such as produced by a compiler, and files containing higher level code that may be executed by the computer using an interpreter.
  • FIG. 3 illustrates a data replication method 300, performed in accordance with one or more embodiments. According to various embodiments, the method 300 may be performed at a source storage node in communication with a target replication node. The method 300 may be used to replicate data stored on the source storage node to the target replication node. After replication, the replicated data is available on both nodes.
  • According to various embodiments, the method 300 may be performed at any of various times. For example, the method 300 may be performed when new data is received for storage on the source storage node. As another example, replication may be performed periodically, at scheduled times, or upon request.
  • At 302, a set of chunk fingerprints is transmitted to the target replication node. According to various embodiments, each chunk fingerprint is a hashed value that is computed by applying a hash function such as a Rabin hash to the underlying chunk data.
  • According to various embodiments, each chunk may be a file, a portion of a file, or any other range of data that may be stored in a storage system. The techniques and mechanisms described herein apply generally to a wide variety of storage systems including storage systems that differ in terms of characteristics such as chunk size.
  • According to various embodiments, the chunk fingerprint may be used by the target replication node to determine whether the target replication node is missing the chunk corresponding to the chunk fingerprint. For instance, the target replication node may use a chunk fingerprint to look up the chunk in a database indexed by chunk fingerprint to determine whether the chunk is stored on the target replication node. If the chunk is already present on the target replication node, then the target replication node need not request the chunk from the source node.
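  • The target-side lookup described above can be sketched as below. SHA-256 stands in for the chunk hash function, and the names `fingerprint`, `find_missing`, and `target_index` are hypothetical; the index is modeled as a dictionary keyed by chunk fingerprint.

```python
import hashlib


def fingerprint(chunk: bytes) -> str:
    """Chunk fingerprint; SHA-256 is a stand-in for the hash in the text."""
    return hashlib.sha256(chunk).hexdigest()


def find_missing(offered_fingerprints, target_index):
    """Return the fingerprints whose chunks are not stored on the target."""
    return [fp for fp in offered_fingerprints if fp not in target_index]


# Chunks already stored on the target node, indexed by fingerprint.
target_index = {fingerprint(b"alpha"): 0, fingerprint(b"beta"): 1}

# Fingerprints sent by the source node at operation 302.
offered = [fingerprint(b"alpha"), fingerprint(b"gamma"), fingerprint(b"beta")]

to_request = find_missing(offered, target_index)
```

  • Only the fingerprint for the chunk the target lacks ends up in the request sent back to the source node.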
  • According to various embodiments, the hashing function used to generate the chunk fingerprint need not uniquely identify a particular chunk. For instance, a data chunk may include 8 kilobytes of data, while a chunk fingerprint may be 64 bytes, 512 bytes, or some other size. In this case, a given chunk fingerprint may potentially correspond to two different chunks. However, in the event of such a collision, the target replication node may simply need to send a subsequent request for a chunk that at first appeared to be stored on the target replication node but in actuality was not.
  • At 304, a request is received to transmit chunks to the target replication node. According to various embodiments, the requested chunks may include those that are not yet stored on the target replication node. The requested chunks may be identified by chunk identifiers or by the chunk fingerprints transmitted from the source node to the target node.
  • At 306, the requested chunks are provided to the target replication node. According to various embodiments, each chunk may be provided in any of various ways. For example, in some instances the entire chunk may be transferred. In other instances, an identifier for a reference chunk may be transmitted along with delta information for reconstructing the requested chunk from the reference chunk. Techniques for providing requested chunks to the target replication node are discussed in further detail with respect to FIG. 4. Techniques for reconstructing a requested chunk from a reference chunk are discussed in further detail with respect to FIG. 6.
  • At 308, the subchunk index is updated to include the provided chunks. An example of a subchunk index is shown in FIG. 1. According to various embodiments, updating the subchunk index may involve storing or updating entries for each subchunk of a replicated chunk when necessary. For instance, if a new chunk is transmitted to the target replication node, then the subchunk index is updated to indicate an association between the new chunk and a subchunk fingerprint for each subchunk of the new chunk.
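  • Updating the subchunk index for a newly provided chunk might be sketched as follows. The index is modeled as a mapping from subchunk fingerprint to a set of chunk identifiers, mirroring the two columns of FIG. 1; SHA-256 replaces the Rabin hash, and the function names and the tiny 4-byte subchunk size are assumptions chosen to keep the example readable.

```python
import hashlib
from collections import defaultdict


def subchunk_fingerprints(chunk: bytes, size: int):
    """Fingerprint each fixed-size subchunk of a chunk."""
    return [hashlib.sha256(chunk[i:i + size]).hexdigest()
            for i in range(0, len(chunk), size)]


def update_index(index, chunk_id, chunk: bytes, size: int = 1024):
    """Associate every subchunk fingerprint of the chunk with chunk_id."""
    for fp in subchunk_fingerprints(chunk, size):
        index[fp].add(chunk_id)


index = defaultdict(set)
update_index(index, "A", b"1111" + b"3333", size=4)
update_index(index, "B", b"2222" + b"3333", size=4)

shared = index[hashlib.sha256(b"3333").hexdigest()]  # found in both chunks
only_a = index[hashlib.sha256(b"1111").hexdigest()]  # found only in chunk A
```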
  • FIG. 4 illustrates a source node chunk replication method 400, performed in accordance with one or more embodiments. According to various embodiments, the method 400 may be performed at a source storage node in communication with a target storage node. The source storage node may be configured to provide data for replication to the target storage node. The method 400 may be performed during the process of replication, when a request is received to replicate data to the target storage node.
  • At 402, a request for a data chunk is received. According to various embodiments, the request may be received in response to a determination that the data chunk should be stored at the target storage node in order to replicate corresponding data stored on the source storage node. For instance, the request may be received in response to a set of fingerprints transmitted to the target replication node, as discussed with respect to operations 302 and 304 in FIG. 3.
  • According to various embodiments, the request may identify the data chunk in any of various ways. For example, the request may include an identifier and/or a fingerprint corresponding with the requested data chunk.
  • At 404, a set of data chunk fingerprints for subchunks of the requested chunk is determined. According to various embodiments, the set of data chunk fingerprints may be determined by first dividing the data chunk into subchunks. For instance, an 8 kilobyte chunk may be divided into 1 kilobyte subchunks. Then, a hash function may be applied to each subchunk to produce a corresponding fingerprint. Any of various types of hash functions may be used. For instance, the system may employ a Rabin hash function.
  • At 406, one or more data chunks associated with the fingerprints and stored on the target storage node are identified from the fingerprint index. According to various embodiments, the one or more data chunks may be identified by looking up each fingerprint in the fingerprint index. As discussed with respect to FIG. 1, the subchunk fingerprint index maps each subchunk fingerprint to a chunk in which the subchunk is included. Each subchunk fingerprint may be associated with zero, one, two, or more chunks.
  • According to various embodiments, identifying the one or more data chunks may involve creating a frequency list. A frequency list may identify a number of data chunks that include subchunks within the requested data chunk. For each of the identified data chunks, a number of subchunks included in the identified data chunk may also be determined.
  • For instance, in FIG. 1, the chunk A includes 6 subchunks that overlap with the requested chunk, while the chunk B includes 3 subchunks that overlap with the requested chunk. Accordingly, a frequency list for the requested chunk 106 shown in FIG. 1 would include chunk A linked with the frequency count 6 and the chunk B linked with the frequency count 3.
  • At 408, one of the identified data chunks having a high frequency of occurrence is selected. According to various embodiments, the selected chunk may be the chunk identified at operation 406 that has the highest frequency of occurrence. For instance, the frequency list may be sorted, and the highest frequency chunk may be selected. For example, in FIG. 1, the chunk A is selected. The highest frequency chunk may be the chunk that has the most overlap with the requested chunk.
  • At 410, a difference (or delta) between the requested data chunk and the selected data chunk is determined. According to various embodiments, the delta may represent the data included in the requested data chunk that is not also within the selected (or reference) data chunk. For instance, in FIG. 1, the delta between the requested data chunk and the reference data chunk A is the set of two subchunks corresponding to subchunk 2 and subchunk 6.
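  • Operations 406 and 408 can be sketched as follows. The example assumes the index is a plain mapping from subchunk fingerprint to a set of chunk identifiers. The fingerprint labels `s1` through `s8` are placeholders, and which three subchunks chunk B shares is an assumption chosen only to reproduce the counts 6 and 3 given for FIG. 1.

```python
from collections import Counter

def build_frequency_list(requested_fps, subchunk_index):
    """Count, per candidate chunk, how many requested subchunks it contains."""
    freq = Counter()
    for fp in requested_fps:
        for chunk_id in subchunk_index.get(fp, ()):
            freq[chunk_id] += 1
    return freq

def select_reference(freq):
    """Return the chunk with the highest overlap, or None if there is none."""
    if not freq:
        return None
    return freq.most_common(1)[0][0]

# Hypothetical index modeled on the FIG. 1 counts: A overlaps 6, B overlaps 3.
subchunk_index = {
    "s1": {"A", "B"}, "s3": {"A", "B"}, "s4": {"A", "B"},
    "s5": {"A"}, "s7": {"A"}, "s8": {"A"},
}
requested_fps = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"]
freq = build_frequency_list(requested_fps, subchunk_index)
reference_id = select_reference(freq)
```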
  • In particular embodiments, identifying the delta may involve calculating a difference via an algorithm such as the VCDIFF algorithm for delta encoding. The VCDIFF algorithm may identify delta data to include in conjunction with the reference chunk data as well as metadata for combining the delta data with the reference chunk data.
  • In particular embodiments, identifying the delta may involve identifying metadata for combining the delta data with the reference data chunk. For instance, the metadata may include offset information. The offset information may indicate that the subchunk 2 is located in the second position, while the subchunk 6 is located in the sixth position.
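  • A simple position-based delta with offset metadata can be sketched as follows. This is not VCDIFF; it is a deliberately simpler comparison of subchunks at matching positions, shown only to illustrate how delta data and offsets fit together. Positions are 1-based to match the "second position" and "sixth position" wording, and all names and sample values are hypothetical.

```python
def compute_delta(requested_subchunks, reference_subchunks):
    """Return [(position, data)] for requested subchunks that differ from
    the reference subchunk at the same 1-based position."""
    delta = []
    for pos, sub in enumerate(requested_subchunks, start=1):
        ref_index = pos - 1
        if ref_index >= len(reference_subchunks) or reference_subchunks[ref_index] != sub:
            delta.append((pos, sub))
    return delta

# Hypothetical data: the requested chunk differs from reference chunk A
# only at positions 2 and 6.
reference = [b"a1", b"a2", b"a3", b"a4", b"a5", b"a6", b"a7", b"a8"]
requested = [b"a1", b"NEW2", b"a3", b"a4", b"a5", b"NEW6", b"a7", b"a8"]
delta = compute_delta(requested, reference)
```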
  • At 412, a determination is made as to whether the difference data exceeds a designated threshold. According to various embodiments, the designated threshold may be strategically determined based on any of various factors such as the chunk size, the subchunk size, and the amount of metadata information needed to reconstruct a requested chunk from a reference chunk.
  • According to various embodiments, the determination made at 412 may reflect the various tradeoffs involved in reconstructing the requested chunk at the target node. For example, reconstructing the requested chunk at the target node involves some amount of computing resources. As another example, delta information, reference chunk identification information, and metadata information may still need to be transmitted from the source node to the target node. As yet another example, some chance may exist that the reference chunk is not actually present on the target storage node, which may involve additional network traffic such as transferring request messages and the entire requested chunk. Accordingly, if the reconstruction information is not significantly smaller than the requested data chunk, then transmitting the entire requested data chunk may be more efficient than transmitting the reconstruction information.
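  • The threshold check at 412 can be sketched as follows. The text does not specify a threshold or a metadata size, so every numeric default here (8 bytes of offset metadata per delta entry, a 20 byte reference identifier, and a threshold of half the chunk size) is an assumption chosen purely for illustration.

```python
def should_send_delta(delta, chunk_size,
                      metadata_bytes_per_entry=8,
                      reference_id_bytes=20,
                      threshold_ratio=0.5):
    """Return True if the reconstruction payload is small enough to be
    worth sending instead of the whole chunk.

    delta is a list of (position, data) pairs; all size parameters are
    hypothetical tuning knobs, not values taken from the source text.
    """
    payload = reference_id_bytes + sum(
        len(data) + metadata_bytes_per_entry for _, data in delta
    )
    return payload <= threshold_ratio * chunk_size

# Two differing 1 kilobyte subchunks in an 8 kilobyte chunk: send the delta.
small_delta = [(2, bytes(1024)), (6, bytes(1024))]
# Six differing subchunks: sending the whole chunk is likely cheaper.
large_delta = [(i, bytes(1024)) for i in range(1, 7)]
```

  • With these assumed values, the small delta costs 2,084 bytes against a 4,096 byte threshold, while the large delta costs 6,212 bytes and exceeds it.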
  • At 414, the delta information and an identifier for the selected data chunk are transmitted to the requesting node. According to various embodiments, the information transmitted may include any information for reconstructing the requested chunk at the target node. For instance, the information transmitted may include metadata information such as offset data that identifies the location within the chunk at which the delta information is located.
  • At 416, the requested chunk is transmitted to the requesting node. As discussed with respect to operation 412, the entire requested chunk may be transmitted if transmitting difference information is inefficient for any of various reasons. Alternately, the entire requested chunk may be transmitted if no similar reference chunk is stored on the target node for use in reconstructing the requested chunk.
  • FIG. 5 illustrates a particular example of a system that can be used in conjunction with the techniques and mechanisms of the present invention. According to various embodiments, data is received at an accelerated deduplication system 500 over an interface such as a network interface. A data stream may be received in segments or blocks and maintained in system memory 503. According to various embodiments, a processor or CPU 501 maintains a state machine but offloads boundary detection and fingerprinting to a deduplication engine or deduplication accelerator 505. The CPU 501 is associated with cache 511 and memory controller 513. According to various embodiments, cache 511 and memory controller 513 may be integrated onto the CPU 501.
  • In particular embodiments, the deduplication engine or deduplication accelerator 505 is connected to the CPU 501 over a system bus 515 and detects boundaries using an algorithm such as Rabin to delineate segments of data in system memory 503 and generates fingerprints using algorithms such as hashing algorithms like SHA-1 or MD-5. The deduplication engine 505 accesses the deduplication dictionary 507 to determine if a fingerprint is already included in the deduplication dictionary 507. According to various embodiments, the deduplication dictionary 507 is maintained in persistent storage and maps segment fingerprints to segment storage locations. In particular embodiments, segment storage locations are maintained in fixed size extents. Datastore suitcases, references, metadata, etc., may be created or modified based on the result of the dictionary lookup.
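  • The dictionary lookup performed by the deduplication engine can be sketched in software as follows. This is only an illustration under stated assumptions: the class `DedupDictionary` is hypothetical, the dictionary is held in memory rather than in persistent storage, storage locations are simple integers, and SHA-1 is used as the fingerprint function because the text names it as one option.

```python
import hashlib

class DedupDictionary:
    """Hypothetical in-memory map from segment fingerprint to storage location."""

    def __init__(self):
        self._locations = {}
        self._next_location = 0

    def store(self, segment: bytes):
        """Return (location, is_new). A segment already in the dictionary
        is not stored again; its existing location is returned."""
        fp = hashlib.sha1(segment).hexdigest()
        if fp in self._locations:
            return self._locations[fp], False
        location = self._next_location
        self._locations[fp] = location
        self._next_location += 1
        return location, True

dictionary = DedupDictionary()
first = dictionary.store(b"segment-one")
duplicate = dictionary.store(b"segment-one")
second = dictionary.store(b"segment-two")
```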
  • If the data needs to be transferred to persistent storage, the optimization software stack will communicate to the CPU 501 the final destination direct memory access (DMA) addresses for the data. The DMA addresses can then be used to transfer the data through one or more bus bridges 517 and/or 527 and secondary buses 519 and/or 529. An example of a secondary bus is a peripheral component interconnect (PCI) bus 519. Peripherals 551, 523, 525, 531, and 533 may be peripheral components and/or peripheral interfaces such as disk arrays, network interfaces, serial interfaces, timers, tape devices, etc.
  • FIG. 6 illustrates a target node chunk replication method 600, performed in accordance with one or more embodiments. According to various embodiments, the method 600 may be performed at a target storage node configured to replicate data stored on a source storage node. The method 600 may be performed during a replication operation for ensuring that the data stored on the target storage node is the same as corresponding data stored on the source storage node.
  • At 602, a request for a data chunk is transmitted to a source storage node. According to various embodiments, the requested data chunk may be a portion of data to be replicated from the source storage node to the target storage node. For instance, the requested data chunk may be a data chunk identified based on a set of chunk fingerprints transmitted from the source storage node to the target storage node, as discussed with respect to operation 302 in FIG. 3.
  • According to various embodiments, the request may identify the data chunk in any of various ways. For instance, the request may include an identifier associated with the data chunk. Alternately, or additionally, the request may include a fingerprint value associated with the requested data chunk.
  • At 604, data chunk reconstruction information is received from the source storage node. According to various embodiments, the data chunk reconstruction information may include any information capable of being used to create the requested data chunk. For instance, the data chunk reconstruction information may include an identifier corresponding to a reference data chunk, delta data that represents a difference in data between the reference data chunk and the requested data chunk, and/or metadata information for use in combining the reference data chunk with the delta data to create the requested data chunk.
  • At 606, a reference data chunk for reconstructing the requested data chunk is identified. According to various embodiments, the reference data chunk may be identified based on information included in the data chunk reconstruction information received at operation 604.
  • At 608, a determination is made as to whether the reference data chunk is stored in the target node storage system. In some instances, the source storage node may have out-of-date information regarding which data chunks are stored on the target storage node. For instance, an intervening operation between the time at which the source storage node determines that a data chunk is stored on the target storage node and the time at which the data chunk reconstruction information is received from the source storage node may have caused the reference data chunk to be deleted from the target storage node.
  • According to various embodiments, the determination made at operation 608 may be made at least in part by looking up information associated with the reference data chunk in a data dictionary residing at the target storage node. For instance, in a deduplication storage system, a data dictionary may indicate a storage location corresponding to each data chunk residing in the storage system, indexed by an identifier associated with each data chunk.
  • At 610, the reference data chunk is combined with delta information to produce the requested data chunk. According to various embodiments, combining the reference data chunk with the delta information may involve any operations related to reconstructing the requested data chunk at the target storage node. For example, the data corresponding with the reference data chunk may be retrieved from the storage system. Then, the delta information may be added in the appropriate positions in the reference data chunk to create the requested data chunk. The data chunk reconstruction information may include metadata such as subchunk offsets that indicate one or more locations within the reference data chunk at which the delta information should be placed.
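  • The combination step at 610 can be sketched as follows, assuming the delta arrives as a list of (position, data) pairs with 1-based subchunk positions. That wire format and the function name are assumptions made for the example; the text only requires that metadata such as subchunk offsets indicate where the delta information belongs.

```python
def reconstruct_chunk(reference_subchunks, delta):
    """Rebuild the requested chunk by placing each delta subchunk at its
    1-based position within a copy of the reference chunk's subchunks."""
    result = list(reference_subchunks)
    for pos, data in delta:
        # Grow the list if the requested chunk has more subchunks
        # than the reference chunk.
        while len(result) < pos:
            result.append(b"")
        result[pos - 1] = data
    return b"".join(result)

rebuilt = reconstruct_chunk([b"a", b"b", b"c"], [(2, b"X")])
extended = reconstruct_chunk([b"a", b"b", b"c"], [(4, b"d")])
```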
  • At 612, a success message is transmitted to the source storage node. According to various embodiments, the success message may identify the requested chunk. For instance, the success message may include an identifier corresponding with the requested chunk and/or a fingerprint that identifies the requested chunk.
  • According to various embodiments, transmitting the success message to the source storage node may allow the source storage node to update the fingerprint index stored at the source storage node. In this way, the source storage node may be informed of the data stored at the target replication storage node. Then, when subsequent requests for data chunks are received at the source storage node, the source storage node may respond by determining whether to send the entire data chunk or data chunk reconstruction information, as discussed herein.
  • At 614, a request to the source storage node for transmitting the entire requested data chunk is transmitted. According to various embodiments, as discussed with respect to 608, an intervening action may have caused the reference data chunk to be no longer stored in the target node storage system. In this case, the target node may be unable to reconstruct the requested chunk from the reference chunk. Accordingly, the target node may transmit a new request for the source storage node to transmit the entire requested data chunk. Alternately, the source storage node may transmit reconstruction information based on a different reference data chunk.
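  • The target-side decision across operations 606 through 614 can be sketched as follows. The message shapes (dictionaries with `type`, `chunk_id`, `reference_id`, and `delta` keys) and the in-memory `chunk_store` are hypothetical; they stand in for whatever protocol and data dictionary an embodiment uses.

```python
def handle_reconstruction_info(info, chunk_store):
    """Apply reconstruction information if the reference chunk is present;
    otherwise ask the source for the entire requested chunk."""
    reference = chunk_store.get(info["reference_id"])
    if reference is None:
        # 614: the reference chunk is no longer stored on the target.
        return {"type": "request_full_chunk", "chunk_id": info["chunk_id"]}
    # 610: copy the reference subchunks and overwrite the changed positions.
    subchunks = list(reference)
    for pos, data in info["delta"]:  # positions are 1-based
        subchunks[pos - 1] = data
    chunk_store[info["chunk_id"]] = subchunks
    # 612: report success so the source can update its index.
    return {"type": "success", "chunk_id": info["chunk_id"]}

chunk_store = {"A": [b"1", b"2", b"3"]}
ok = handle_reconstruction_info(
    {"chunk_id": "R", "reference_id": "A", "delta": [(2, b"X")]}, chunk_store)
missing = handle_reconstruction_info(
    {"chunk_id": "S", "reference_id": "gone", "delta": []}, chunk_store)
```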
  • Although FIGS. 4 and 6 are described in the context of the replication of a single data chunk, potentially many different data chunks may be replicated. In this case, the operations discussed with respect to FIGS. 4 and 6 may be performed separately for each data chunk or may be combined for more than one data chunk. For example, data chunk reconstruction information may be received for potentially more than one data chunk in the same message or series of messages between the two nodes. As another example, the success message transmitted to the source storage node may identify a range or group of data chunks successfully stored at the target storage node.
  • Although many of the components and processes are described above in the singular for convenience, it will be appreciated by one of skill in the art that multiple components and repeated processes can also be used to practice the techniques of the present invention.
  • While the invention has been particularly shown and described with reference to specific embodiments thereof, it will be understood by those skilled in the art that changes in the form and details of the disclosed embodiments may be made without departing from the spirit or scope of the invention. It is therefore intended that the invention be interpreted to include all variations and equivalents that fall within the true spirit and scope of the present invention.

Claims (1)

What is claimed is:
1. A method comprising:
receiving, at a source data storage node, a request to provide a data chunk to a target storage node;
identifying a reference data chunk based on fingerprint information associated with the requested data chunk, the reference data chunk being stored on the target storage node, the reference data chunk and the requested data chunk each including a first data portion; and
transmitting data chunk reconstruction information from the source data storage node to the target data storage node, the data chunk reconstruction information identifying the reference data chunk, the data chunk reconstruction information including data difference information for constructing the requested data chunk at the target data storage node based on the reference data chunk.
US16/209,598 2013-07-26 2018-12-04 Transferring differences between chunks during replication Abandoned US20190114288A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/209,598 US20190114288A1 (en) 2013-07-26 2018-12-04 Transferring differences between chunks during replication

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US13/952,062 US10146787B2 (en) 2013-07-26 2013-07-26 Transferring differences between chunks during replication
US16/209,598 US20190114288A1 (en) 2013-07-26 2018-12-04 Transferring differences between chunks during replication

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US13/952,062 Continuation US10146787B2 (en) 2013-07-26 2013-07-26 Transferring differences between chunks during replication

Publications (1)

Publication Number Publication Date
US20190114288A1 true US20190114288A1 (en) 2019-04-18

Family

ID=52391495

Family Applications (2)

Application Number Title Priority Date Filing Date
US13/952,062 Active 2034-02-12 US10146787B2 (en) 2013-07-26 2013-07-26 Transferring differences between chunks during replication
US16/209,598 Abandoned US20190114288A1 (en) 2013-07-26 2018-12-04 Transferring differences between chunks during replication

Family Applications Before (1)

Application Number Title Priority Date Filing Date
US13/952,062 Active 2034-02-12 US10146787B2 (en) 2013-07-26 2013-07-26 Transferring differences between chunks during replication

Country Status (1)

Country Link
US (2) US10146787B2 (en)

Also Published As

Publication number Publication date
US20150032978A1 (en) 2015-01-29
US10146787B2 (en) 2018-12-04

Legal Events

Date Code Title Description
STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

STPP Information on status: patent application and granting procedure in general

Free format text: FINAL REJECTION MAILED

STCB Information on status: application discontinuation

Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION