WO2015096847A1 - Method and apparatus for context aware based data de-duplication

Publication number: WO2015096847A1
Application number: PCT/EP2013/077894
Inventors: Ariel Kulik, Gil Sasson
Original assignee: Huawei Technologies Co., Ltd.

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/30: Information retrieval; database structures therefor; file system structures therefor
    • G06F 17/30067: File systems; file servers
    • G06F 17/30129: Details of further file system functionalities
    • G06F 17/3015: Redundancy elimination performed by the file system
    • G06F 17/30156: De-duplication implemented within the file system, e.g. based on file segments
    • G06F 17/30159: De-duplication implemented within the file system based on file chunks

Abstract

An apparatus and a method for context aware based data de-duplication are provided, the method comprising the steps of: assigning (S1) a de-duplication module by loading at least one metadata of data to be written and at least one metadata of written data into a metadata memory cache (40) and separating the data to be written into data chunks; counting (S2) a number of the data chunks of the data to be written and of the written data for each data segment by scanning the cached metadata in the metadata memory cache (40), the number of chunks representing a score of the data segment; and calling (S3) a data segment selection procedure providing a set of data segments based on the score of the data segment to de-duplicate the data to be written and the written data.

Description

METHOD AND APPARATUS FOR CONTEXT AWARE BASED DATA

DE-DUPLICATION

TECHNICAL FIELD

The present application relates to the field of context aware segment selection for data de-duplication, and particularly to a method and an apparatus for context aware based data de-duplication.

BACKGROUND

De-duplication is a specialized data compression technique for eliminating duplicate copies of repeating data, or chunks, which has proved to be highly useful for backup purposes.

De-duplication mechanisms mostly suffer from excessive resource requirements or from low throughput; thus, more sophisticated mechanisms are required in order to implement de-duplication in commercial products.

One of the common techniques to implement de-duplication is to hold data chunks in containers/segments which maintain the locality characteristics of the incoming data.

Common techniques solve the problem by using several indexing techniques, often in combination with caching. In prior art systems, an index maintains either a full or a partial (sparse) index of the fingerprints of the chunks stored in the system. By issuing a lookup operation in the index for some or all of the fingerprints of the chunks in the incoming block, these systems find containers or segments to be used for de-duplicating the data of the block.

The different techniques vary in the implementation of the index (RAM based or a combination of RAM with disk based), in the number of fingerprints in the index and the way fingerprints are chosen for the index, in the set of chunks which is queried in the index, or in further variables.

SUMMARY AND DESCRIPTION

It is the object of the invention to provide an improved technique for de-duplication systems which are used for storing backed up data.

This object is achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description and the figures.

According to a first aspect, a method for context aware based data de-duplication is provided, the method comprising the steps of: assigning a de-duplication module to a write operation by loading at least one structural metadata of written data into a metadata memory cache and separating the cached data to be written into data chunks; counting a number of the data chunks of the data to be written for each data segment by scanning the cached structural metadata in the metadata memory, the number of chunks representing a score of the data segment; and calling a data segment selection procedure providing a set of data segments based on the score of the data segment to de-duplicate the data to be written.

Within this technique, to perform de-duplication, a sequence of incoming data chunks is combined into a block; that is, the incoming data is given in the first place as a block. The de-duplication of a block is done against chunks in a limited number of segments.

For each block, the de-duplication mechanism needs to determine the set of segments the block would be de-duplicated against; this process is referred to as segment selection. The selection mechanism is required to fulfill high performance constraints and has a significant impact on the de-duplication ratio attained.

Backup systems are used to create, store and restore a collection of snapshots of one or multiple volumes or of one or multiple file systems. The backup systems work by generating an initial full backup, i.e. a snapshot, which contains all the relevant data, and multiple incremental backups or snapshots. In computer systems, a snapshot is the state of a system at a particular point in time. The term was coined as an analogy to snapshots in photography. It can refer to an actual copy of the state of a system or to a capability provided by certain systems, e.g. file systems.

Incremental backups only contain a subset of the snapshot's content. To access the full snapshot's content, data from both the snapshot and previous snapshots is used. In both techniques, blocks/areas/files which are being backed up have previous versions in the backup system generated by previous snapshots.

The invention solves the problem of segment selection in de-duplication systems which are used for storing backed up data. The invention can also be used for de-duplication of primary storage systems.

The present invention comprises a series of steps combined with a proprietary interface. The combination of the two solves this problem by combining different techniques with a context aware interface between the backup system and the de-duplication component.

The present invention is intended to be implemented within the context of a de-duplication system with a basic I/O scope of fixed sized blocks, where a single block size is in the range of 1 MB to 10 MB. However, the invention's basic concept can be implemented in different settings with proper adjustments.

According to the present invention, to store a block, the system maintains a meta-data object which holds, for each data chunk in the block, the hash of the chunk and the ID of the segment in which the chunk's data resides, or similar information. These objects are referred to as Block Meta Data objects. Indeed, the present invention may be implemented in a de-duplication system or in any other read/write or data storage system.

The interface part of the invention is that the backup system addresses the blocks in the de-duplication engine in a context aware manner, such as by logical block location and version, or by logical block, where the operation overrides the previous version of the data.

The series of steps may be the following: on write commands, the de-duplication engine loads to memory the block metadata files of the previous version/s of the logical block and those of adjacent logical blocks.
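Which block metadata files the engine loads for a given write could be enumerated as in the following sketch; the addressing scheme (base location, block index, version) and all names here are hypothetical assumptions, not taken from the actual engine:

```python
def local_metadata_keys(location, version, block_index, neighbors=1, prev_versions=1):
    """Enumerate the block metadata files to load on a write: previous
    version(s) of the logical block plus its adjacent logical blocks
    (here assumed to be loaded at the previous version as well)."""
    keys = []
    # Previous version(s) of the same logical block.
    for v in range(version - 1, version - 1 - prev_versions, -1):
        if v >= 1:
            keys.append((location, block_index, v))
    # Adjacent logical blocks, `neighbors` positions to each side.
    for d in range(1, neighbors + 1):
        for idx in (block_index - d, block_index + d):
            if idx >= 0:
                keys.append((location, idx, version - 1))
    return keys
```

For a write of version 5 of block 8, this yields version 4 of block 8 plus the neighbouring blocks 7 and 9.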

We refer to the data loaded from disk as local metadata. For each segment ID in the local metadata, the number of chunks which appear both in the block to be written and in the local metadata associated with that segment ID is counted. The value evaluated for each segment ID is its score. A segment selection mechanism is called to produce a collection of segments as well.
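The per-segment counting described above can be sketched as follows; the data shapes (a set of incoming fingerprints, and (fingerprint, segment ID) pairs as local metadata) are illustrative assumptions:

```python
from collections import Counter

def score_segments(incoming_fingerprints, local_metadata):
    """Count, for each segment ID in the local metadata, how many chunks
    of the incoming block already reside in that segment.

    incoming_fingerprints: set of fingerprints of the block to be written
    local_metadata: iterable of (fingerprint, segment_id) pairs loaded from
                    the block metadata of previous/adjacent logical blocks
    """
    scores = Counter()
    for fingerprint, segment_id in local_metadata:
        if fingerprint in incoming_fingerprints:
            scores[segment_id] += 1
    return scores  # mapping segment_id -> score
```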

The present invention advantageously uses the information from both techniques to determine the set of segments to de-duplicate against. Both in the case of incremental and of full backups there is a high likelihood that a new version of a block has a significant similarity to the previous version of the same block, or to adjacent blocks. A good example is that in an incremental backup with 4 MB granularity a 4 kB change will produce a 4 MB write to the de-duplication engine. This write can be almost entirely eliminated when de-duplicated against the previous version of the block.

De-duplication techniques are not aware of the context in which the de-duplication is used and therefore cannot easily locate the previous version and use the information for de-duplication.

The present invention provides a resource light mechanism which ensures data de-duplication between the data in the new block to the previous version/s of the logical block and adjacent logical blocks. Therefore, the present invention provides a significant improvement in the de-duplication ratio with little resource overhead.

The implementation of the present invention advantageously implies using a specific interface between the backup components and the de-duplication component, and usage of the information from the interface for the de-duplication process.

In a first possible implementation form of the method according to the first aspect, the step of assigning the de-duplication module comprises generating the metadata by means of a context aware processing of the written data or by means of logical block addressing of the written data.

This advantageously provides increased in-line de-duplication efficiency.

In a second possible implementation form of the method according to the first aspect as such or according to the first implementation form of the first aspect, the step of assigning the de-duplication module by loading the at least one metadata of the written data comprises loading a previous version of the written data and/or loading any version of a plurality of previous versions of the written data and/or loading an adjacent data block of the written data.

Thereby, high storage efficiency is achieved through data de-duplication.

In a third possible implementation form of the method according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, during the step of separating the cached data to be written into the data chunks, an evaluating of at least one hash value of the written data and of the data to be written is conducted.

In a fourth possible implementation form of the method according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, the written data is a block of data.

This advantageously provides an efficient method to store data that identifies and eliminates duplicate blocks of data during backups.

In a fifth possible implementation form of the method according to the fourth possible implementation form of the method according to the first aspect, the block of data is a sequence of bytes, having a block size between 1 megabyte and 10 megabytes or any other block size.

This beneficially provides an efficient method to store data that identifies and eliminates duplicate blocks of data during backups.

In a sixth possible implementation form of the method according to the fourth or the fifth possible implementation form of the method according to the first aspect, the size of the block of data is non-constant. This allows the data block size to be optimally adjusted to the requirements of the data de-duplication method.

In a seventh possible implementation form of the method according to the first aspect as such or according to any of the preceding implementation forms of the first aspect, each data chunk is a sequence of bytes, having an average chunk size of 1 kilobyte, 2 kilobytes, 4 kilobytes, 8 kilobytes or any size between 1 and 512 kilobytes.

This advantageously provides increased in-line de-duplication efficiency.

In an eighth possible implementation form of the method according to the seventh possible implementation form of the method according to the first aspect, the data chunks comprise a variable size.

According to a second aspect, the invention relates to an apparatus for context aware based data de-duplication, the apparatus comprising: a de-duplication module configured to load at least one structural metadata of written data into a metadata memory cache and to separate the cached data to be written into data chunks; a processing module configured to count a number of chunks existing in the data to be written for each data segment by scanning the cached structural metadata in the metadata memory cache, the number of chunks representing a score of the data segment; and a data selection module configured to provide a set of data segments based on the score of the data segment to de-duplicate the data to be written.

According to a third aspect, the invention relates to a back-up system comprising a file system and an apparatus according to the second aspect.

The methods, systems and devices described herein may be implemented as software in a digital signal processor (DSP), in a micro-controller or in any other side-processor, or as a hardware circuit within an application-specific integrated circuit (ASIC) or in a field-programmable gate array, which is an integrated circuit designed to be configured by a customer or a designer after manufacturing, hence "field-programmable".

The invention can be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations thereof, e.g. in available hardware of conventional mobile devices or in new hardware dedicated for processing the methods described herein.

BRIEF DESCRIPTION OF THE DRAWINGS

Further embodiments of the invention will be described with respect to the following figures, in which:

Fig. 1 shows a schematic diagram of a backup system comprising a file system and an apparatus for context aware based data de-duplication according to one embodiment of the present invention;

Fig. 2 shows a schematic diagram of a core data layout according to an embodiment of the invention;

Fig. 3 shows a block diagram of a method for context aware based data de-duplication according to a further embodiment of the present invention; and

Fig. 4 shows a block diagram of a method for context aware based data de-duplication according to a further embodiment of the present invention. DETAILED DESCRIPTION

In the associated figures, identical reference signs denote identical or at least equivalent elements, parts, units or steps. In addition, it should be noted that all of the accompanying drawings are not to scale.

The technical solutions in the embodiments of the present invention are described clearly and completely in the following with detailed reference to the accompanying drawings in the embodiments of the present invention.

Apparently, the described embodiments are only some embodiments of the present invention, rather than all embodiments. Based on the described embodiments of the present invention, all other embodiments obtained by persons of ordinary skill in the art without making any creative effort shall fall within the protection scope of the present invention.

Fig. 1 shows a schematic diagram of a backup system comprising a file system and an apparatus for context aware based data de-duplication according to one embodiment of the present invention.

Fig. 1 shows an embodiment of the present invention, wherein a de-duplication apparatus 100 along with its write path is illustrated. The de-duplication apparatus 100 makes use of the concept of sparse indexing as part of the defined mechanisms. The shown embodiment of the present invention refers to a de-duplication component or apparatus 100 which receives write, read and delete commands from a backup system BS. The apparatus 100 may be coupled between the backup system BS and a file system FS.

The apparatus 100 for context aware based data de-duplication may comprise a de-duplication module 10, a processing module 20, a data selection module 30, and a metadata memory cache 40. The de-duplication module 10 may be configured to load at least one structural metadata of written data into the metadata memory cache and to separate the cached data to be written into data chunks. The processing module 20 may be configured to count a number of chunks existing in the data to be written for each data segment by scanning the cached structural metadata in the metadata memory cache, the number of chunks representing a score of the data segment.

The data selection module 30 may be configured to provide a set of data segments based on the score of the data segment to de-duplicate the data to be written.

The metadata memory cache 40 may be configured to receive and store at least one metadata of data to be written and at least one metadata of written data.

The series of steps may be the following: on write commands, the de-duplication engine loads to memory the block metadata files of the previous version/s of the logical block and those of adjacent logical blocks.

For example, the de-duplication system interface would be:

- write (block logical location (string), version id (integer), data (buffer))

- read (block logical location (string), version id (integer), data (buffer))

- delete (block logical location (string), version id (integer)), whereas the backup system would use the logical block location as a unique identifier of the data source and location from which the data block was taken, such as "storage array name/lun id/offset".
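A minimal sketch of such a context aware interface is given below; the class name and the in-memory store are hypothetical stand-ins, and only the three operation signatures follow the description above:

```python
class DedupEngine:
    """Hypothetical facade for the de-duplication engine's context aware
    interface: each operation is addressed by logical block location and
    version id, e.g. ("array1/lun0/offset0", 1)."""

    def __init__(self):
        # Stand-in for the real de-duplicated storage backend.
        self._store = {}  # (location, version) -> data

    def write(self, block_logical_location: str, version_id: int, data: bytes) -> None:
        self._store[(block_logical_location, version_id)] = data

    def read(self, block_logical_location: str, version_id: int) -> bytes:
        return self._store[(block_logical_location, version_id)]

    def delete(self, block_logical_location: str, version_id: int) -> None:
        del self._store[(block_logical_location, version_id)]
```

Because the location string identifies the data source, a later `write` to the same location with a higher version id gives the engine the context it needs to find the previous version's metadata.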

A host node HN provides data to be saved, i.e. data to be written, to a backup system BS. The data to be saved or the data to be written may be present as a 4 MB block of data sent from the host node HN to the backup system BS. The backup system BS may send the data block read from the host node HN to the de-duplication system, i.e. the apparatus 100.

According to the present invention, a host node HN (Latin nodus, 'knot') may be a connection point, a redistribution point or a communication endpoint (some terminal equipment).

A network host node HN may be a computer connected to a computer network. A network host node HN may offer information resources, services, and applications to users or other nodes on the network. A network host node HN may be a network node that is assigned a network layer host address. According to the present invention, a file system FS is used to control how information is stored and retrieved.

The file system FS may be used on many different kinds of storage devices. Each storage device may use a different kind of media. Media that are used may be magnetic tape, optical disc, and flash memory. In some cases, the computer's main memory, Random-access memory, RAM or any other form of computer data storage, is used to create a temporary file system for short term use.

The term "file system" may refer to either the abstract data structures used to define files, or the actual software or firmware components that implement the abstract ideas. Some file systems are used on local data storage devices; others provide file access via a network protocol (e.g. Network File System (NFS), Server Message Block (SMB), or Plan 9 (9P) clients). The file systems may be "virtual", in that the "files" supplied are computed on request (e.g. procfs) or are merely a mapping into a different file system used as a backing store. The file system FS manages access to both the content of files and the metadata about those files.

The backup system BS may comprise a plurality of client computers and a backup server computer, the backup server computer comprising means for automatically performing regular backups of data from the client computers.

Optionally, in one embodiment of the present invention, each of the commands refers to a 4 MB block of data read from the drive being backed up. The de-duplication apparatus 100 stores the data on a file system FS by writing the de-duplicated data block.

The write operation or the method for context aware based data de-duplication can be conducted according to the following:

In a first step of the write operation, a write command arrives at the de-duplication system. In a second step of the write operation, the data block which is designated to be written and saved is split into chunks, and hash values of the chunks are evaluated.

In a third step of the write operation, the block meta-data files of the previous versions of the block and of nearby blocks are read. For each of the segments in these block meta-data files, the number of chunks is evaluated which belong to the specific segment and also appear in the content of the write command, based on the fingerprints. Subsequently, this number is set as the score of the segment. For example, when the system receives the command "write (disk7/block8, version 5, [some data buffer])", the system would load the block metadata of "disk7/block8" with "version 4". A further example would be a snapshot of a virtual machine created by a hypervisor such as VMware ESX or Microsoft Hyper-V.

In a fourth step of the write operation, sending a lookup command to the index for each of the chunks in the write command is conducted.

In a fifth step of the write operation, the following segments are selected for de-duplication: a.) If there are more than four segments with a score higher than, for example, 0.1 times the number of chunks in the data to be written, the four segments with the highest score are selected. b.) If there are fewer than four segments with a score higher than, for example, 0.1 times the number of chunks in the data to be written, all these segments are selected, and also segments found in the lookups of the fourth step of the write operation, so that the total number of selected segments is no more than four.
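The selection rule of this fifth step can be sketched as follows; the threshold factor (0.1) and the cap of four segments follow the example values above, while the function and parameter names are illustrative assumptions:

```python
def select_segments(scores, num_incoming_chunks, index_hits,
                    threshold_factor=0.1, max_segments=4):
    """scores: mapping segment_id -> score from the metadata scan
    index_hits: segment IDs found by the index lookups of the fourth step"""
    threshold = threshold_factor * num_incoming_chunks
    # Segments whose score clears the threshold, best first.
    strong = sorted((s for s in scores if scores[s] > threshold),
                    key=lambda s: scores[s], reverse=True)
    if len(strong) >= max_segments:
        return strong[:max_segments]  # case a.): take the top four
    # Case b.): fewer than four strong segments; top up with index
    # lookup results, avoiding duplicates, up to the cap.
    selected = list(strong)
    for seg in index_hits:
        if len(selected) >= max_segments:
            break
        if seg not in selected:
            selected.append(seg)
    return selected
```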

In a sixth step of the write operation, the selected segments are loaded from disk, and de-duplication is done against the chunks in them. Non-duplicated chunks are written to a new segment.

In a seventh step of the write operation, the new block meta-data files are saved to the file system FS.
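The de-duplication of the sixth step against the loaded segments, together with the block meta-data produced for the seventh step, can be sketched as follows; all names and data shapes are illustrative assumptions:

```python
def dedup_against_segments(fingerprints, chunks, loaded_segments):
    """fingerprints/chunks: per-chunk fingerprint and data of the block
    loaded_segments: mapping segment_id -> set of fingerprints stored there
    Returns (bmd, new_segment): bmd records, per chunk, the segment holding
    its data; non-duplicated chunks are written into new_segment."""
    NEW_SEGMENT_ID = "new"
    known = {}  # fingerprint -> segment_id
    for seg_id, fps in loaded_segments.items():
        for fp in fps:
            known.setdefault(fp, seg_id)
    new_segment = {}  # fingerprint -> chunk data of the freshly written segment
    bmd = []          # (fingerprint, segment_id) per chunk: the block meta-data
    for fp, chunk in zip(fingerprints, chunks):
        if fp in known:
            bmd.append((fp, known[fp]))  # duplicate: store a reference only
        else:
            new_segment[fp] = chunk      # unique chunk: write its data
            known[fp] = NEW_SEGMENT_ID
            bmd.append((fp, NEW_SEGMENT_ID))
    return bmd, new_segment
```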

Fig. 2 shows a schematic diagram of a core data layout according to an embodiment of the invention.

The core data layout of the system as shown in Fig. 2 is described in the following:

A single block de-duplication is done against the data chunks in a small number of segments. Two mechanisms are used for selecting the segments to de-duplicate against: At first, a sparse index technique is used which holds a few representative fingerprints of each segment. The indexing is used to approximate the similarity between an incoming block and an arbitrary segment.

Secondly, context aware de-duplication methods are used: when receiving a write command to a certain block, the front end loads the block meta-data files for the previous versions of the block and nearby blocks. The information in the block meta-data files is used to identify segments which share chunks with the data in the new write.

Optionally, in one embodiment of the present invention, each operation refers to a block of data, where the size of each block is 4 MB. The system basically supports three I/O operations: write, read and delete. In the following, write commands for writing operations will be described:

Optionally, in one embodiment of the present invention, the block addressing is in the form of logical block ID.

Optionally, in one embodiment of the present invention, the blocks are split into chunks by variable size chunking, with an average size of 8 kB.
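A toy sketch of such variable size chunking is given below; real engines typically use a windowed rolling hash such as Rabin fingerprinting, so the running hash here is only an illustration, not the actual mechanism:

```python
def chunk_variable(data: bytes, avg_size: int = 8192, min_size: int = 64):
    """Content-defined chunking sketch: a running hash of the bytes seen so
    far decides chunk boundaries, yielding variable-sized chunks of roughly
    avg_size on average (avg_size should be a power of two for the mask)."""
    mask = avg_size - 1
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * 31 + byte) & 0xFFFFFFFF  # running hash over the current chunk
        if (h & mask) == 0 and i + 1 - start >= min_size:
            chunks.append(data[start:i + 1])  # boundary hit: emit the chunk
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks
```

Because boundaries depend on content rather than offsets, an insertion near the start of a block only shifts boundaries locally, so most chunks keep their fingerprints and still de-duplicate.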

Optionally, in one embodiment of the present invention, the data chunks are represented by the hash value of their data; the hash value is often referred to as a fingerprint, i.e. data is uniquely identified by extracting from it a small key known as a fingerprint.
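For example, a fingerprint could be computed as follows; the choice of SHA-256 is an assumption for illustration, not mandated by the text:

```python
import hashlib

def fingerprint(chunk: bytes) -> str:
    # The chunk's fingerprint is a cryptographic hash of its data; equal
    # chunks map to equal fingerprints, so duplicates can be detected
    # without comparing the data itself.
    return hashlib.sha256(chunk).hexdigest()
```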

Optionally, in one embodiment of the present invention, the engine holds data in segments, where each segment persistently stores a set of chunks.

Optionally, in one embodiment of the present invention, to represent blocks, block meta-data files, BMD, are used. The file contains a list of chunks or hashes which comprise the block data. For each chunk, the file also comprises the segment ID of the segment in which the chunk's data can be found.

Optionally, in one embodiment of the present invention, by means of the metadata of the data segment, hash values of the chunks in the segment are created.
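The block meta-data file described above could be modelled as follows; the field and class names are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ChunkRecord:
    fingerprint: str   # hash of the chunk's data
    segment_id: int    # segment in which the chunk's data resides

@dataclass
class BlockMetaData:
    """Contents of a block meta-data (BMD) file: one record per chunk
    making up the block, plus the block's context aware address."""
    logical_location: str
    version_id: int
    chunks: List[ChunkRecord]

    def segment_ids(self):
        # Distinct segments referenced by this block, e.g. as candidates
        # during the metadata scan of the write path.
        return {rec.segment_id for rec in self.chunks}
```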

Fig. 3 shows a block diagram of a method for context aware based data de-duplication according to one embodiment of the present invention. As a first step of the method, assigning S10 a write command referring to a logical block or to a further version is conducted.

As a second step of the method for context aware based data de-duplication, splitting S11 the block into chunks and evaluating hash values is conducted.

As a third step of the method, loading S12 metadata objects of previous versions of the block (or of only one previous version of the block) and/or of previous or current versions of adjacent blocks is conducted.

As a fourth step of the method for context aware based data de-duplication, using S13 general techniques or regular techniques for segment selection is conducted.

As a fifth step of the method, evaluating S14 the score for each segment ID is conducted.

As a sixth step of the method, using S15 the information from both techniques (the second and fourth steps as well as the third and fifth steps) to determine the set of segments to de-duplicate against is conducted.

The second and fourth steps S11, S13 as well as the third and fifth steps S12, S14 may be implemented by parallel processing or any other form of computation in which many calculations are carried out simultaneously.

The steps are then solved concurrently ("in parallel"). Several different forms of parallel computing may be used: bit-level, instruction-level, data, and task parallelism.

Fig. 4 shows a block diagram of a method for context aware based data de-duplication according to one embodiment of the present invention.

A method for context aware based data de-duplication, the method comprising the steps of:

As a first step of the method, assigning S1 a de-duplication module to a write operation by loading at least one structural metadata of written data into a metadata memory cache and separating the cached data to be written into data chunks is performed.

As a second step of the method, counting S2 a number of the data chunks of the data to be written for each data segment is conducted by scanning the cached structural metadata in the metadata memory cache 40, the number of chunks representing a score of the data segment.

As a third step of the method, calling S3 a data segment selection procedure providing a set of data segments based on the score of the data segment to de-duplicate the data to be written is performed.

From the foregoing, it will be apparent to those skilled in the art that a variety of methods, systems, computer programs on recording media, and the like, are provided.

The present disclosure also supports a computer program product including computer executable code or computer executable instructions that, when executed, causes at least one computer to execute the performing and computing steps described herein.

Many alternatives, modifications, and variations will be apparent to those skilled in the art in light of the above teachings. Of course, those skilled in the art readily recognize that there are numerous applications of the invention beyond those described herein.

While the present invention has been described with reference to one or more particular embodiments, those skilled in the art recognize that many changes may be made thereto without departing from the scope of the present invention. It is therefore to be understood that within the scope of the appended claims and their equivalents, the inventions may be practiced otherwise than as specifically described herein.

In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage. A computer program may be stored or distributed on a suitable medium, such as an optical storage medium or a solid-state medium supplied together with or as part of other hardware, but may also be distributed in other forms, such as via the Internet or other wired or wireless telecommunication systems.

Claims

1. A method for context aware based data de-duplication comprising: assigning (S1) a de-duplication module to a write operation by loading at least one structural metadata of written data into a metadata memory cache (40) and separating the cached data to be written into data chunks; counting (S2) a number of the data chunks of the data to be written for each data segment by scanning the cached structural metadata in the metadata memory cache (40), the number of chunks representing a score of the data segment; and calling (S3) a data segment selection procedure, providing a set of data segments based on the score of the data segment to de-duplicate the data to be written.
2. The method according to claim 1,
wherein assigning (S1) the de-duplication module comprises generating the metadata by means of a context aware processing of the written data or by means of logical block addressing of the written data.
3. The method according to claim 1 or 2,
wherein assigning (S1) the de-duplication module by loading the at least one metadata of the written data comprises loading a previous version of the written data and/or loading any version of a plurality of previous versions of the written data and/or loading an adjacent data block of the written data.
4. The method according to one of the preceding claims 1 to 3,
wherein during separating the cached data to be written into the data chunks, an evaluating of at least one hash value of the written data and of the data to be written is conducted.
5. The method according to one of the preceding claims 1 to 4,
wherein the written data is a block of data.
6. The method according to claim 5,
wherein the block of data is a sequence of bytes, having a block size between 1 megabyte and 10 megabytes or any other block size.
7. The method according to claim 5 or 6,
wherein the size of the block of data is non-constant.
8. The method according to one of the preceding claims 1 to 7,
wherein each data chunk is a sequence of bytes, having an average chunk size of 1 kilobyte, 2 kilobytes, 4 kilobytes, 8 kilobytes or any size between 1 and 512 kilobytes.
9. The method according to claim 8,
wherein the data chunks comprise a variable size.
10. An apparatus (100) for context aware based data de-duplication, the apparatus
comprising:
a de-duplication module (10) configured to load at least one structural metadata of written data into a metadata memory cache (40) and to separate the cached data to be written into data chunks;
a processing module (20) configured to count a number of the data chunks of the data to be written for each data segment by scanning the cached structural metadata in the metadata memory cache (40), the number of chunks representing a score of the data segment; and
a data selection module (30) configured to provide a set of data segments based on the score of the data segment to de-duplicate the data to be written.
11. A backup system, BS, for a host node, HN, comprising a file system, FS, and an apparatus (100) for context aware based data de-duplication according to claim 10.
12. A computer program with a program code for performing a method according to any of claims 1 to 9, when the computer program runs on a computer.
PCT/EP2013/077894 2013-12-23 2013-12-23 Method and apparatus for context aware based data de-duplication WO2015096847A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2013/077894 WO2015096847A1 (en) 2013-12-23 2013-12-23 Method and apparatus for context aware based data de-duplication

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
PCT/EP2013/077894 WO2015096847A1 (en) 2013-12-23 2013-12-23 Method and apparatus for context aware based data de-duplication
CN 201380078408 CN105493080A (en) 2013-12-23 2013-12-23 Method and apparatus for context aware based data de-duplication

Publications (1)

Publication Number Publication Date
WO2015096847A1 true true WO2015096847A1 (en) 2015-07-02

Family

ID=49886942

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2013/077894 WO2015096847A1 (en) 2013-12-23 2013-12-23 Method and apparatus for context aware based data de-duplication

Country Status (2)

Country Link
CN (1) CN105493080A (en)
WO (1) WO2015096847A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100088296A1 (en) * 2008-10-03 2010-04-08 Netapp, Inc. System and method for organizing data to facilitate data deduplication
US7996371B1 (en) * 2008-06-10 2011-08-09 Netapp, Inc. Combining context-aware and context-independent data deduplication for optimal space savings

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5020673B2 (en) * 2007-03-27 2012-09-05 株式会社日立製作所 Computer system to prevent the storage of duplicate files
CN103034659B (en) * 2011-09-29 2015-08-19 国际商业机器公司 A method and system for data deduplication
CN103051671A (en) * 2012-11-22 2013-04-17 浪潮电子信息产业股份有限公司 Repeating data deletion method for cluster file system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7996371B1 (en) * 2008-06-10 2011-08-09 Netapp, Inc. Combining context-aware and context-independent data deduplication for optimal space savings
US20100088296A1 (en) * 2008-10-03 2010-04-08 Netapp, Inc. System and method for organizing data to facilitate data deduplication

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Bartlomiej Romanski ET AL: "Anchor-Driven Subchunk Deduplication", SYSTOR '11, 30 May 2011 (2011-05-30), pages 1-13, XP055035332, DOI: 10.1145/1987816.1987837 ISBN: 978-1-45-030773-4 Retrieved from the Internet: URL:http://www.9livesdata.com/files/ninelivesdata/systor37-romanski.pdf [retrieved on 2012-08-13] *
JIAYANG DU ET AL: "MassStore: A low bandwidth, high De-duplication efficiency network backup system", SYSTEMS AND INFORMATICS (ICSAI), 2012 INTERNATIONAL CONFERENCE ON, IEEE, 19 May 2012 (2012-05-19), pages 886-890, XP032192649, DOI: 10.1109/ICSAI.2012.6223150 ISBN: 978-1-4673-0198-5 *

Also Published As

Publication number Publication date Type
CN105493080A (en) 2016-04-13 application

Similar Documents

Publication Publication Date Title
Bhagwat et al. Extreme binning: Scalable, parallel deduplication for chunk-based file backup
US7567188B1 (en) Policy based tiered data deduplication strategy
US8527544B1 (en) Garbage collection in a storage system
US8352422B2 (en) Data restore systems and methods in a replication environment
US7636824B1 (en) System and method for efficient backup using hashes
US8370315B1 (en) System and method for high performance deduplication indexing
US20110246416A1 (en) Stubbing systems and methods in a data replication environment
US20140095816A1 (en) System and method for full virtual machine backup using storage system functionality
US20100161554A1 (en) Asynchronous distributed de-duplication for replicated content addressable storage clusters
US20110246429A1 (en) Stub file prioritization in a data replication system
US20100106691A1 (en) Remote backup and restore
US8510279B1 (en) Using read signature command in file system to backup data
US20130086006A1 (en) Method for removing duplicate data from a storage array
US20130097380A1 (en) Method for maintaining multiple fingerprint tables in a deduplicating storage system
US20100280997A1 (en) Copying a differential data store into temporary storage media in response to a request
US8806160B2 (en) Mapping in a storage system
US20130138620A1 (en) Optimization of fingerprint-based deduplication
US20140095439A1 (en) Optimizing data block size for deduplication
US7788220B1 (en) Storage of data with composite hashes in backup systems
US8639669B1 (en) Method and apparatus for determining optimal chunk sizes of a deduplicated storage system
US8949208B1 (en) System and method for bulk data movement between storage tiers
US20100280996A1 (en) Transactional virtual disk with differential snapshots
US20130042052A1 (en) Logical sector mapping in a flash storage array
US20110231362A1 (en) Extensible data deduplication system and method
US20130086353A1 (en) Variable length encoding in a storage system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13814968

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13814968

Country of ref document: EP

Kind code of ref document: A1