WO2021175446A1 - Devices and methods for eliminating defragmentation in deduplication - Google Patents

Devices and methods for eliminating defragmentation in deduplication

Info

Publication number
WO2021175446A1
Authority
WO
WIPO (PCT)
Prior art keywords
duplicated data
data block
segment
stored
segments
Application number
PCT/EP2020/056082
Other languages
French (fr)
Inventor
Yehonatan DAVID
Elizabeth FIRMAN
Michael Hirsch
Original Assignee
Huawei Technologies Co., Ltd.
Application filed by Huawei Technologies Co., Ltd.
Priority to CN202080006944.3A (publication CN113632059A)
Priority to PCT/EP2020/056082 (publication WO2021175446A1)
Publication of WO2021175446A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G06F3/0641 De-duplication techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/1724 Details of de-fragmentation performed by the file system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1744 Redundancy elimination performed by the file system using compression, e.g. sparse files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G06F16/1748 De-duplication implemented within the file system, e.g. based on file segments

Definitions

  • FIG. 1 shows a device 100 according to an embodiment of the invention.
  • the device 100 is adapted for storing duplicated data blocks.
  • the device 100 is configured to store one or more compressed containers 101.
  • each container 101 comprises a plurality of segments, wherein one or more duplicated data blocks are stored in the plurality of segments of the one or more containers 101.
  • the device 100 is configured to delete a first duplicated data block that is stored in a segment, particularly by decompressing the segment, replacing the first duplicated data block with zeros in the segment, and recompressing the segment.
  • the device 100 may comprise processing circuitry (not shown) configured to perform, conduct or initiate the various operations of the device 100 described herein.
  • the processing circuitry may comprise hardware and software.
  • the hardware may comprise analog circuitry or digital circuitry, or both analog and digital circuitry.
  • the digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or multi-purpose processors.
  • the device 100 may further comprise memory circuitry, which stores one or more instruction(s) that can be executed by the processor or by the processing circuitry, in particular under control of the software.
  • the memory circuitry may comprise a non-transitory storage medium storing executable software code which, when executed by the processor or the processing circuitry, causes the various operations of the device 100 to be performed.
  • the processing circuitry comprises one or more processors and a non-transitory memory connected to the one or more processors.
  • the non-transitory memory may carry executable program code which, when executed by the one or more processors, causes the device 100 to perform, conduct or initiate the operations or methods described herein.
  • the deduplication process, instead of storing duplicated data, stores some form of reference, e.g. a pointer, to where the duplicated data is already stored, as in the example of a typical deduplication structure shown in FIG. 2. A reference to a location where the duplicated data block is stored is commonly known as metadata.
  • the device 100 is further configured to maintain metadata of the one or more stored duplicated data blocks, wherein metadata of each stored duplicated data block comprises a reference to a location where the duplicated data block is stored.
  • A deduplication-based backup system divides a backup into variable-sized chunks or segments of data.
  • chunks of a container become obsolete.
  • remaining data will be copied or moved into a new container, as an example depicted in FIG. 3.
  • all the references from the old location need to be updated to the new location.
  • all the metadata that references the location of data needs to be locked until the transaction is complete. Otherwise, a pointer to data which was moved may be kept. This also affects backup and restore speeds.
  • For example, suppose a container originally held chunks X1, X2, X3, X4 and X5 from a backup X, and X2, X3 and X5 are then deleted. If data from another backup Y is later written into this container, the container may hold X1 Y1 Y2 X4 Y3, and part of that data may be deleted in turn. If the container is reused repeatedly, it may eventually reach the scenario shown in FIG. 4.
  • FIG. 4 illustrates the effect of reusing a container over a long period in a system.
  • Data in the reused container comes from a variety of different and possibly unrelated backups, and the user data of each backup is also scattered among a large number of containers.
  • Embodiments of this disclosure design a backup and delete process of a deduplication system to redeem space at an application level, which eliminates the need for defragmentation.
  • each container is divided into segments.
  • each container may further comprise segment metadata for each segment, which includes a location of the data segments in the container.
  • Each segment may be identified by an identifier.
  • Segment metadata may be used to map identifiers of segments to their physical addresses.
  • the device 100 may be further configured to obtain the one or more duplicated data blocks. Accordingly, the device 100 may be configured to fill the one or more duplicated data blocks into the plurality of segments of the one or more containers 101. Further, the device 100 may be configured to compress the plurality of segments of the one or more containers 101.
  • the backup stream is reconstructed according to the data.
  • the device 100 may be further configured to retrieve the stored first duplicated data block according to metadata of the first duplicated data block.
  • containers stored in the device 100 may be large containers. For instance, a size of the container may be bigger than 50MB.
  • Each segment is compressed in the device 100.
  • a header of a segment may contain the location of the segment in a file (or an object).
  • removing data from a physical storage can be implemented by reading the segments that contain redeemed chunks, decompressing the segments, replacing the redeemed parts with zeros, recompressing the segments, and replacing the old segments with the recompressed segments.
  • replacing the redeemed part with zeros means to replace each bit of the data block, which needs to be deleted, with a zero.
  • As shown in the file structure of FIG. 5 according to an embodiment of the present disclosure, after the deleted data is replaced with zeros and the segment is recompressed, space in the compressed container 101 is saved.
  • the read amplification refers to the number of disk reads per query, expressed as the ratio between the amount of data that is requested and the amount of data that actually needs to be read.
  • the locality of references is lost when the container is being reused. According to embodiments provided in this invention, the container is not reused, thus locality of references can be kept.
  • cold data refers to data which is kept over a number of backups (e.g., hundreds of backups).
  • hot data refers to data which is often replaced between backups. This process also implicitly keeps cold data and hot data separated in different containers.
  • each duplicated data block is associated with a reference value, which indicates how often the duplicated data block is being updated.
  • the device 100 may be further configured to store a duplicated data block that has a reference value higher than a preset value, and a duplicated data block that has a reference value lower than the preset value in different containers.
  • this process requires no locking, and no bad path handling.
  • the process of reclaiming data can be done independently on any container. It requires no locking, which means a backup and data recovery can be operated as usual. It also requires no locking for metadata, and there is no common area that needs update, so it can be done in parallel and/or distributed among any number of threads, processes, or computers.
  • cross references are kept in the system, which can be verified or rebuilt independently and in parallel.
  • this scheme can also work in any other deduplication system.
  • the device 100 comprises a disk, and the one or more containers are stored on the disk.
  • FIG. 6 shows a method 600 for storing duplicated data blocks according to an embodiment of the present disclosure.
  • the method 600 is performed by a device 100 shown in FIG. 1.
  • the method 600 comprises: a step 601 of storing one or more compressed containers, wherein each container comprises a plurality of segments, wherein one or more duplicated data blocks are stored in the plurality of segments of the one or more containers; and a step 602 of deleting a first duplicated data block that is stored in a segment by decompressing the segment, replacing the first duplicated data block with zeros in the segment, and recompressing the segment.
  • the method 600 may further comprise maintaining metadata of the one or more stored duplicated data blocks, wherein metadata of each stored duplicated data block comprises a reference to a location where the duplicated data block is stored.
  • each container further comprises segment metadata for each segment, which includes a location of the data segments in the container.
  • the method 600 may further comprise obtaining the one or more duplicated data blocks. Accordingly, the method may comprise filling the one or more duplicated data blocks into the plurality of segments of the one or more containers. Then, the method may further comprise compressing the plurality of segments of the one or more containers.
  • the method 600 may further comprise retrieving the stored first duplicated data block according to metadata of the first duplicated data block.
  • each duplicated data block is associated with a reference value, which indicates how often the duplicated data block is being updated.
  • the method 600 may comprise storing a duplicated data block that has a reference value higher than a preset value, and a duplicated data block that has a reference value lower than the preset value in different containers.
  • any method according to embodiments of the present disclosure may be implemented in a computer program, having code means, which when run by processing means causes the processing means to execute the steps of the method.
  • the computer program is included in a computer readable medium of a computer program product.
  • the computer readable medium may comprise essentially any memory, such as a ROM (Read-Only Memory), a PROM (Programmable Read-Only Memory), an EPROM (Erasable PROM), a Flash memory, an EEPROM (Electrically Erasable PROM), or a hard disk drive.
  • embodiments of the device 100 comprise the necessary communication capabilities in the form of, e.g., functions, means, units, elements, etc., for performing the solution.
  • means, units, elements and functions are: processors, memory, buffers, control logic, encoders, decoders, rate matchers, de-rate matchers, mapping units, multipliers, decision units, selecting units, switches, interleavers, de-interleavers, modulators, demodulators, inputs, outputs, antennas, amplifiers, receiver units, transmitter units, DSPs, trellis-coded modulation (TCM) encoder, TCM decoder, power supply units, power feeders, communication interfaces, communication protocols, etc. which are suitably arranged together for performing the solution.
  • the processor(s) of the device 100 may comprise, e.g., one or more instances of a Central Processing Unit (CPU), a processing unit, a processing circuit, a processor, an Application Specific Integrated Circuit (ASIC), a microprocessor, or other processing logic that may interpret and execute instructions.
  • the expression “processor” may thus represent a processing circuitry comprising a plurality of processing circuits, such as, e.g., any, some or all of the ones mentioned above.
  • the processing circuitry may further perform data processing functions for inputting, outputting, and processing of data comprising data buffering and device control functions, such as call processing control, user interface control, or the like.
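The point made in the list above, that reclaiming can run independently on any container without locking, can be sketched as follows. This is only an illustration under stated assumptions: zlib stands in for the unspecified compressor, a container is modelled as a plain list of compressed segments, and the names `reclaim`, `containers` and `work` are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def reclaim(container, obsolete):
    """Zero obsolete ranges in the segments of one container.

    container: list of compressed segments (assumed layout)
    obsolete:  segment number -> list of (offset, length) ranges to zero
    """
    result = []
    for number, segment in enumerate(container):
        ranges = obsolete.get(number)
        if not ranges:
            result.append(segment)  # untouched segments are not rewritten
            continue
        raw = bytearray(zlib.decompress(segment))
        for offset, length in ranges:
            raw[offset:offset + length] = bytes(length)
        result.append(zlib.compress(bytes(raw)))
    return result


containers = {
    "c0": [zlib.compress(b"x" * 100 + b"y" * 100)],
    "c1": [zlib.compress(b"p" * 50), zlib.compress(b"q" * 50)],
}
work = {"c0": {0: [(100, 100)]}, "c1": {1: [(0, 50)]}}

# Each task reads and writes only its own container, and block offsets do
# not change, so no shared metadata is updated and no lock is taken.
with ThreadPoolExecutor() as pool:
    futures = {name: pool.submit(reclaim, containers[name], work[name])
               for name in containers}
    reclaimed = {name: future.result() for name, future in futures.items()}
```

Because the tasks share no state, the same function could equally be distributed across processes or machines; the thread pool is just the shortest way to show it.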

Abstract

The present disclosure relates to a device and method for storing duplicated data blocks. Specifically, the disclosure proposes a device for eliminating a need for defragmentation in a deduplication backup system. The device is configured to store one or more compressed containers, wherein each container comprises a plurality of segments, wherein one or more duplicated data blocks are stored in the plurality of segments of the one or more containers. The device is further configured to delete a first duplicated data block that is stored in a segment by decompressing the segment, replacing the first duplicated data block with zeros in the segment, and recompressing the segment. In this way, a solution to remove obsolete data which does not require metadata updates, is provided.

Description

DEVICES AND METHODS FOR ELIMINATING DEFRAGMENTATION IN DEDUPLICATION
TECHNICAL FIELD
The present disclosure relates to a device and method for storing data, in particular, for storing duplicated data. In order to solve a problem of fragmentation caused in the deduplication process, embodiments of this disclosure provide a solution to simplify the defragmentation process. To this end, the present disclosure proposes a device and method that allows eliminating defragmentation in a backup storage.
BACKGROUND
Deduplication is a process of elimination of duplicated data in storage. Due to the nature of how the data is saved in a deduplicating storage, there is often a problem of fragmentation in deduplication. This is a software level fragmentation, which is done on top of a lower-level hardware that also does some defragmentation. This fragmentation can also be a cause for further read amplification.
In deduplication, user data is stored as pointers to the actual data. The actual data is stored in containers of some sort. When a user deletes data, chunks of a container become obsolete. Therefore, when performing deduplication, there are often obsolete data chunks in the repository that need to be collected. This usually requires an offline/background process to do a lot of work, and this also includes a lot of calculations, and sometimes complicated processes that update reference counts in a way that allows preventing errors.
Traditionally, to redeem space, the data can either be copied to another place, trying to keep the locality of references, or the space may be reused for new data to prevent the complexities of copying. A conventional manner of handling obsolete data is to wait until a certain amount of data in a container or area is obsolete, then copy the remaining data into a new container and update all the references from the old location to the new location. This process requires locking of metadata of the container, the container itself, and maybe other structures as well. It is difficult to parallelize the defragmentation process, which makes it error prone, and it may affect backup and restore speeds as well.
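To make the cost of the conventional approach concrete, the following minimal Python sketch copies live chunks into a new container and repoints their references. The data layout and the names `compact`, `chunks` and `references` are hypothetical illustrations, not taken from the disclosure.

```python
def compact(chunks, live_ids, references, new_name):
    """Conventional copy-forward: move live chunks and repoint references.

    chunks:     list of (chunk_id, data) in on-disk order
    references: chunk_id -> (container_name, offset), shared metadata
    """
    new_data = bytearray()
    for chunk_id, data in chunks:
        if chunk_id in live_ids:
            # Every moved chunk changes location, so shared metadata must
            # be updated (and, in a real system, locked while this runs).
            references[chunk_id] = (new_name, len(new_data))
            new_data += data
        else:
            references.pop(chunk_id, None)
    return bytes(new_data)


chunks = [("X1", b"a" * 10), ("X2", b"b" * 20), ("X3", b"c" * 30)]
references = {"X1": ("c0", 0), "X2": ("c0", 10), "X3": ("c0", 30)}

new_container = compact(chunks, {"X1", "X3"}, references, "c1")
print(references)  # X3 now points to a different container and offset
```

Any reader that resolved a reference before the update and dereferenced it afterwards would see stale data, which is why this style of space reclamation needs locking.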
SUMMARY
In view of the above-mentioned challenges, an objective for embodiments of the present disclosure is to provide a device and method, respectively, that eliminates a need for defragmentation in a deduplication backup system. An aim is, in particular, to avoid locking and/or updating metadata references when removing obsolete data. Another aim is to redeem space at an application level. Further, it is also desirable to be able to perform normal backup and restore operations in the meantime.
This is achieved by the embodiments of the present disclosure as described in the enclosed independent claims. Advantageous implementations of the embodiments of the present disclosure are further defined in the dependent claims.
A first aspect of the present disclosure provides a device for storing duplicated data blocks, the device being configured to: store one or more compressed containers, wherein each container comprises a plurality of segments, wherein one or more duplicated data blocks are stored in the plurality of segments of the one or more containers; and delete a first duplicated data block that is stored in a segment by decompressing the segment, replacing the first duplicated data block with zeros in the segment, and recompressing the segment.
The device of the first aspect has the advantage that a need for defragmentation in a deduplication backup system is reduced or eliminated. This is due to the replacement by zeros, and the compression of the segments.
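As a minimal sketch of why zero-filling plus recompression redeems space without moving anything, the following Python example uses zlib as a stand-in compressor (the disclosure does not name one). The function name `delete_block` and its offset/length parameters are hypothetical.

```python
import os
import zlib


def delete_block(compressed_segment: bytes, offset: int, length: int) -> bytes:
    """Zero out one block inside a compressed segment and recompress it.

    zlib is an assumed stand-in; the source does not specify a compressor.
    """
    raw = bytearray(zlib.decompress(compressed_segment))
    if offset < 0 or offset + length > len(raw):
        raise ValueError("block lies outside the segment")
    # Overwrite the obsolete block in place; the segment keeps its
    # uncompressed size, so offsets of the other blocks do not change.
    raw[offset:offset + length] = bytes(length)
    return zlib.compress(bytes(raw))


# Three incompressible 4 KiB blocks packed into one segment.
blocks = [os.urandom(4096) for _ in range(3)]
segment = zlib.compress(b"".join(blocks))

# Delete the middle block.
smaller = delete_block(segment, offset=4096, length=4096)

restored = zlib.decompress(smaller)
print(len(segment), "->", len(smaller))  # compressed size shrinks
```

The run of zeros compresses to almost nothing, so the stored segment becomes smaller, while the surviving blocks stay at the same offsets and their references remain valid.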
In an implementation form of the first aspect, the device is further configured to maintain metadata of the one or more stored duplicated data blocks, wherein metadata of each stored duplicated data block comprises a reference to a location where the duplicated data block is stored.
Metadata can be used to map the stored data blocks to their physical addresses.
In an implementation form of the first aspect, each container further comprises segment metadata for each segment, which includes a location of the data segments in the container.
Knowing the segment metadata, the stored data segment can be located.
In an implementation form of the first aspect, the device is further configured to: obtain the one or more duplicated data blocks; fill the one or more duplicated data blocks into the plurality of segments of the one or more containers; and compress the plurality of segments of the one or more containers.
The deduplication based backup system may divide a backup stream into variable sized blocks of data. The data blocks that need to be written may be aggregated into the containers, particularly into segments of the containers. According to this embodiment of the invention, each segment is compressed in the container.
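The write path described above can be sketched as follows, under stated assumptions: zlib is a stand-in compressor, the 64 KiB segment capacity is an arbitrary illustrative value, and `pack_blocks` and its index layout are hypothetical.

```python
import zlib

SEGMENT_SIZE = 64 * 1024  # assumed uncompressed segment capacity


def pack_blocks(blocks, segment_size=SEGMENT_SIZE):
    """Fill blocks into segments, then compress each segment separately.

    Returns (compressed_segments, block_index), where block_index[i] is
    (segment_number, offset, length) for blocks[i].
    """
    segments, index = [bytearray()], []
    for block in blocks:
        # Start a new segment when the current one would overflow; an
        # oversized block still gets a segment of its own.
        if segments[-1] and len(segments[-1]) + len(block) > segment_size:
            segments.append(bytearray())
        index.append((len(segments) - 1, len(segments[-1]), len(block)))
        segments[-1] += block
    return [zlib.compress(bytes(s)) for s in segments], index


blocks = [bytes([i]) * 30_000 for i in range(5)]  # variable sizes also work
compressed, index = pack_blocks(blocks)
```

Compressing each segment on its own, rather than the container as a whole, is what later allows a single segment to be decompressed and rewritten without touching its neighbours.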
In an implementation form of the first aspect, the device is further configured to retrieve the stored first duplicated data block according to metadata of the first duplicated data block.
For instance, during data recovery, data blocks will be located and read based on the metadata.
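A possible read path using the two levels of metadata (block metadata pointing at a segment, segment metadata pointing at a location in the container) might look like the sketch below. The `BlockRef` fields, the `segment_tables` layout and the use of zlib are all assumptions made for illustration.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockRef:
    container: str   # which container holds the block
    segment_id: int  # which segment inside that container
    offset: int      # position inside the uncompressed segment
    length: int


def read_block(ref, containers, segment_tables):
    """Resolve block metadata, then segment metadata, then read the block."""
    start, size = segment_tables[ref.container][ref.segment_id]
    compressed = containers[ref.container][start:start + size]
    raw = zlib.decompress(compressed)
    return raw[ref.offset:ref.offset + ref.length]


# Build a one-container example with two compressed segments.
seg0 = zlib.compress(b"AAAA" + b"BBBBBB")
seg1 = zlib.compress(b"CCCCCCCC")
containers = {"c0": seg0 + seg1}
segment_tables = {"c0": {0: (0, len(seg0)), 1: (len(seg0), len(seg1))}}

ref_b = BlockRef("c0", segment_id=0, offset=4, length=6)
print(read_block(ref_b, containers, segment_tables))  # b'BBBBBB'
```

In this layout only the per-container segment table changes when a segment is recompressed to a different size; block references, which address the uncompressed segment, stay as they are.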
In an implementation form of the first aspect, each duplicated data block is associated with a reference value, which indicates how often the duplicated data block is being updated.
Optionally, the reference value may be used to distinguish hot data and cold data in a storage.
In an implementation form of the first aspect, the device is further configured to store a duplicated data block that has a reference value higher than a preset value, and a duplicated data block that has a reference value lower than the preset value in different containers.
Preferably, it may be desired to keep cold data and hot data separated in different containers. The preset value may be configured based on specific implementations or requirements.
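The separation of hot and cold data by reference value can be illustrated with a short sketch. The threshold of 10, the container labels and the treatment of a value equal to the preset (placed with cold data here) are assumptions; the disclosure only states that blocks above and below the preset value go to different containers.

```python
from collections import defaultdict

PRESET_VALUE = 10  # assumed threshold; the source leaves it configurable


def choose_container(reference_value, preset=PRESET_VALUE):
    """Route a block by its reference value (how often it is updated)."""
    # Assumption: a value equal to the preset is treated as cold.
    return "hot" if reference_value > preset else "cold"


def place_blocks(blocks):
    """Group (block_id, reference_value) pairs by target container."""
    containers = defaultdict(list)
    for block_id, reference_value in blocks:
        containers[choose_container(reference_value)].append(block_id)
    return dict(containers)


placed = place_blocks([("a", 1), ("b", 50), ("c", 10), ("d", 11)])
print(placed)
```

Keeping frequently replaced blocks together means that deletions concentrate in a few containers, so fewer segments need to be decompressed and rewritten.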
In an implementation form of the first aspect, the device comprises a disk, and the one or more containers are stored on the disk.
A second aspect of the present disclosure provides a method for storing duplicated data blocks, comprising: storing one or more compressed containers, wherein each container comprises a plurality of segments, wherein one or more duplicated data blocks are stored in the plurality of segments of the one or more containers; and deleting a first duplicated data block that is stored in a segment by decompressing the segment, replacing the first duplicated data block with zeros in the segment, and recompressing the segment.
In an implementation form of the second aspect, the method further comprises maintaining metadata of the one or more stored duplicated data blocks, wherein metadata of each stored duplicated data block comprises a reference to a location where the duplicated data block is stored.
In an implementation form of the second aspect, each container further comprises segment metadata for each segment, which includes a location of the data segments in the container.
In an implementation form of the second aspect, the method further comprises: obtaining the one or more duplicated data blocks; filling the one or more duplicated data blocks into the plurality of segments of the one or more containers; and compressing the plurality of segments of the one or more containers.
In an implementation form of the second aspect, the method further comprises retrieving the stored first duplicated data block according to metadata of the first duplicated data block.
In an implementation form of the second aspect, each duplicated data block is associated with a reference value, which indicates how often the duplicated data block is being updated.
In an implementation form of the second aspect, the method comprises storing a duplicated data block that has a reference value higher than a preset value, and a duplicated data block that has a reference value lower than the preset value in different containers.
The method of the second aspect and its implementation forms provide the same advantages and effects as described above for the device of the first aspect and its respective implementation forms.
A third aspect of the present disclosure provides a computer program comprising a program code for carrying out, when implemented on a processor, the method according to the second aspect or any of its implementation forms. A fourth aspect of the present disclosure provides a non-transitory storage medium storing executable program code which, when executed by a processor, causes the method according to the second aspect or any of its implementation forms to be performed.
It has to be noted that all devices, elements, units and means described in the present application could be implemented in the software or hardware elements or any kind of combination thereof. All steps which are performed by the various entities described in the present application as well as the functionalities described to be performed by the various entities are intended to mean that the respective entity is adapted to or configured to perform the respective steps and functionalities. Even if, in the following description of specific embodiments, a specific functionality or step to be performed by external entities is not reflected in the description of a specific detailed element of that entity which performs that specific step or functionality, it should be clear for a skilled person that these methods and functionalities can be implemented in respective software or hardware elements, or any kind of combination thereof.
BRIEF DESCRIPTION OF DRAWINGS
The above described aspects and implementation forms will be explained in the following description of specific embodiments in relation to the enclosed drawings, in which
FIG. 1 shows a device according to an embodiment of the invention.
FIG. 2 shows a typical deduplication structure.
FIG. 3 shows an example of data structure when copying data to a new container.
FIG. 4 shows an example of data structure when reusing redeemed space.
FIG. 5 shows a file structure according to an embodiment of the invention.
FIG. 6 shows a method according to an embodiment of the invention.
DETAILED DESCRIPTION OF EMBODIMENTS
Illustrative embodiments of a method, a device, and a program product for storing duplicated data blocks in a deduplication system are described with reference to the figures. Although this description provides a detailed example of possible implementations, it should be noted that the details are intended to be exemplary and in no way limit the scope of the application.
Moreover, an embodiment/example may refer to other embodiments/examples. For example, any description including but not limited to terminology, element, process, explanation and/or technical advantage mentioned in one embodiment/example is applicable to the other embodiments/examples.
FIG. 1 shows a device 100 according to an embodiment of the invention. The device 100 is adapted for storing duplicated data blocks. The device 100 is configured to store one or more compressed containers 101. In particular, each container 101 comprises a plurality of segments, wherein one or more duplicated data blocks are stored in the plurality of segments of the one or more containers 101. Further, the device 100 is configured to delete a first duplicated data block that is stored in a segment, particularly by decompressing the segment, replacing the first duplicated data block with zeros in the segment, and recompressing the segment.
The device 100 may comprise processing circuitry (not shown) configured to perform, conduct or initiate the various operations of the device 100 described herein. The processing circuitry may comprise hardware and software. The hardware may comprise analog circuitry or digital circuitry, or both analog and digital circuitry. The digital circuitry may comprise components such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), or multi-purpose processors. The device 100 may further comprise memory circuitry, which stores one or more instruction(s) that can be executed by the processor or by the processing circuitry, in particular under control of the software. For instance, the memory circuitry may comprise a non-transitory storage medium storing executable software code which, when executed by the processor or the processing circuitry, causes the various operations of the device 100 to be performed. In one embodiment, the processing circuitry comprises one or more processors and a non-transitory memory connected to the one or more processors. The non-transitory memory may carry executable program code which, when executed by the one or more processors, causes the device 100 to perform, conduct or initiate the operations or methods described herein.
Notably, instead of storing duplicated data, the deduplication process stores some form of reference, e.g. a pointer, to where the duplicated data is already stored. An example of a typical deduplication structure is shown in FIG. 2. A reference to a location where the duplicated data block is stored is commonly known as metadata.
In particular, according to an embodiment of the invention, the device 100 is further configured to maintain metadata of the one or more stored duplicated data blocks, wherein metadata of each stored duplicated data block comprises a reference to a location where the duplicated data block is stored.
A deduplication-based backup system divides a backup into variable-sized chunks or segments of data. When a user deletes data, chunks of a container become obsolete. When handling obsolete data in a conventional solution, the remaining data is copied or moved into a new container, as depicted by way of example in FIG. 3. In this way, all the references to the old location need to be updated to the new location. When changing a location of data, all the metadata that references the location of the data needs to be locked until the transaction is complete. Otherwise, a stale pointer to the old location of the moved data may be kept. This also affects backup and restore speeds.
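The cost of the conventional approach can be illustrated with a minimal sketch. It is not taken from the disclosure: the function name copy_forward, the chunk layout, and the reference map are hypothetical, and the sketch only shows that every surviving chunk forces a reference rewrite once its data is moved.

```python
from typing import Dict, List, Set, Tuple

Chunk = Tuple[str, bytes]  # (chunk id, payload); hypothetical layout


def copy_forward(
    container: List[Chunk],
    live_ids: Set[str],
    references: Dict[str, Tuple[str, int]],
    new_name: str,
) -> Tuple[List[Chunk], int]:
    """Copy live chunks into a new container and repoint their references.

    `references` maps chunk id -> (container name, slot). It is mutated in
    place, which is the step that would need locking in a real system.
    Returns the new container and the number of references rewritten.
    """
    new_container: List[Chunk] = []
    rewritten = 0
    for chunk_id, payload in container:
        if chunk_id not in live_ids:
            references.pop(chunk_id, None)  # obsolete chunk: drop its reference
            continue
        new_container.append((chunk_id, payload))
        # The chunk now lives elsewhere, so its pointer has to change.
        references[chunk_id] = (new_name, len(new_container) - 1)
        rewritten += 1
    return new_container, rewritten
```

In this sketch the number of rewritten references grows with the number of surviving chunks, which is the overhead the embodiments described below aim to avoid.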
For instance, suppose a container originally held chunks X1, X2, X3, X4 and X5 from a backup X, and X2, X3 and X5 are then deleted. Later, if some data from another backup Y is written into this container, the container may hold X1 Y1 Y2 X4 Y3, part of which is deleted in turn. If the container keeps being reused, it may eventually reach the scenario shown in FIG. 4.
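The reuse example above can be replayed with a short sketch. The slot-list model and the helper names delete_chunks and reuse_free_slots are assumptions made for illustration only; a real container would track byte ranges rather than named slots.

```python
from typing import Iterable, List, Optional, Set


def delete_chunks(slots: List[Optional[str]], obsolete: Set[str]) -> List[Optional[str]]:
    """Mark obsolete chunks as free slots (None)."""
    return [None if chunk in obsolete else chunk for chunk in slots]


def reuse_free_slots(slots: List[Optional[str]], incoming: Iterable[str]) -> List[Optional[str]]:
    """Write incoming chunks into free slots, first free slot first."""
    pending = list(incoming)
    result: List[Optional[str]] = []
    for chunk in slots:
        if chunk is None and pending:
            result.append(pending.pop(0))
        else:
            result.append(chunk)
    return result


container = ["X1", "X2", "X3", "X4", "X5"]
container = delete_chunks(container, {"X2", "X3", "X5"})
container = reuse_free_slots(container, ["Y1", "Y2", "Y3"])
print(container)  # ['X1', 'Y1', 'Y2', 'X4', 'Y3']

# Backups that now share the container (first letter of each chunk name).
backups = {chunk[0] for chunk in container if chunk is not None}
```

After a single reuse the container already mixes chunks of two unrelated backups; repeating the cycle mixes more.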
FIG. 4 illustrates the effect of reusing containers in a system over a long period. Data in a reused container comes from a variety of different and possibly unrelated backups, and the user data of each backup is also scattered among a large number of containers.
Notably, when a data backup is performed in such a scenario, both the throughput performance and the deduplication performance are reduced. When a data recovery is performed, random chunks need to be read from different locations, which is a known problem and bottleneck that slows down the restore of deduplicated data.
Embodiments of this disclosure design a backup and delete process of a deduplication system to redeem space at an application level, which eliminates the need for defragmentation. Notably, each container is divided into segments. According to an embodiment of the invention, each container may further comprise segment metadata for each segment, which includes a location of the data segments in the container.
Each segment may be identified by an identifier. Segment metadata may be used to map identifiers of segments to their physical addresses.
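As a minimal sketch of such a mapping, the segment metadata could be held in a table keyed by segment identifier. The class and method names below are hypothetical, and the shrink operation is an assumption about how a smaller recompressed segment might be recorded without moving it.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SegmentLocation:
    offset: int  # byte offset of the compressed segment in the container
    length: int  # compressed length in bytes


class SegmentTable:
    """Maps segment identifiers to physical locations inside one container."""

    def __init__(self) -> None:
        self._by_id: Dict[str, SegmentLocation] = {}

    def register(self, segment_id: str, offset: int, length: int) -> None:
        self._by_id[segment_id] = SegmentLocation(offset, length)

    def locate(self, segment_id: str) -> SegmentLocation:
        return self._by_id[segment_id]

    def shrink(self, segment_id: str, new_length: int) -> None:
        """Record a shorter compressed length; the offset never changes."""
        old = self._by_id[segment_id]
        if new_length > old.length:
            raise ValueError("a recompressed segment must not grow in place")
        self._by_id[segment_id] = SegmentLocation(old.offset, new_length)
```

Because only the length field changes in this sketch, any reference that points at the segment by identifier or offset stays valid.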
During a backup, data blocks that need to be written are aggregated into containers to preserve the spatial locality of a backup stream. Optionally, according to an embodiment of the invention, the device 100 may be further configured to obtain the one or more duplicated data blocks. Accordingly, the device 100 may be configured to fill the one or more duplicated data blocks into the plurality of segments of the one or more containers 101. Further, the device 100 may be configured to compress the plurality of segments of the one or more containers 101.
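The fill-and-compress steps might look like the following sketch. The disclosure does not name a compression algorithm or a segment size, so zlib, the segment_size parameter, and the function name pack_container are assumptions chosen for illustration.

```python
import zlib
from typing import Dict, List, Sequence, Tuple


def pack_container(
    blocks: Sequence[bytes], segment_size: int
) -> Tuple[List[bytes], Dict[int, Tuple[int, int, int]]]:
    """Fill blocks into segments in stream order, then compress each segment.

    Returns the compressed segments and an index that maps the position of
    each block in the stream to (segment number, offset, length) within the
    uncompressed segment.
    """
    segments: List[bytes] = []
    index: Dict[int, Tuple[int, int, int]] = {}
    current = bytearray()
    for n, block in enumerate(blocks):
        # Close the current segment if this block would overflow it.
        if current and len(current) + len(block) > segment_size:
            segments.append(zlib.compress(bytes(current)))
            current = bytearray()
        index[n] = (len(segments), len(current), len(block))
        current += block
    if current:
        segments.append(zlib.compress(bytes(current)))
    return segments, index
```

Keeping blocks in arrival order is what preserves the spatial locality of the backup stream in this sketch; a block larger than segment_size simply gets a segment of its own.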
During a restore, the backup stream is reconstructed according to the data. Optionally, according to an embodiment of the invention, the device 100 may be further configured to retrieve the stored first duplicated data block according to metadata of the first duplicated data block.
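Retrieval according to metadata can be sketched as follows, assuming that each metadata entry records a container, a segment number, and a byte range inside the uncompressed segment. The BlockRef layout, the function names, and the use of zlib are assumptions, not details from the disclosure.

```python
import zlib
from typing import Dict, List, NamedTuple, Sequence


class BlockRef(NamedTuple):
    """Hypothetical block metadata: where a stored block can be found."""
    container_id: str
    segment_no: int
    offset: int
    length: int


def read_block(containers: Dict[str, List[bytes]], ref: BlockRef) -> bytes:
    """Decompress the referenced segment and slice out the block."""
    raw = zlib.decompress(containers[ref.container_id][ref.segment_no])
    return raw[ref.offset:ref.offset + ref.length]


def restore(containers: Dict[str, List[bytes]], recipe: Sequence[BlockRef]) -> bytes:
    """Rebuild a backup stream by following its references in order."""
    return b"".join(read_block(containers, ref) for ref in recipe)
```

A practical implementation would likely cache decompressed segments, since consecutive references of one backup tend to fall into the same segment when locality is preserved.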
Optionally, according to embodiments of the invention, containers stored in the device 100 may be large containers. For instance, a size of the container may be bigger than 50MB. Each segment is compressed in the device 100. Possibly, a header of a segment may contain the location of the segment in a file (or an object). With such a scheme, removing data from physical storage can be implemented by reading the segments that contain redeemed chunks, decompressing the segments, replacing the redeemed parts with zeros, recompressing the segments, and replacing the old segments with the recompressed segments. In particular, replacing the redeemed part with zeros means replacing each bit of the data block that needs to be deleted with a zero. As shown in the file structure of FIG. 5 according to an embodiment of the present disclosure, after the deleted data is replaced with zeros and the segment is recompressed, space in the compressed container 101 is saved.
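The delete step can be sketched with a general-purpose compressor. The disclosure does not specify one, so zlib is an assumption here, as are the names delete_block, pseudo_random_block, and BLOCK_SIZE; the filler data merely stands in for poorly compressible user data.

```python
import hashlib
import zlib

BLOCK_SIZE = 4096  # hypothetical block size for the demonstration


def delete_block(compressed_segment: bytes, offset: int, length: int) -> bytes:
    """Zero the block at [offset, offset + length) and recompress the segment."""
    raw = bytearray(zlib.decompress(compressed_segment))
    if offset < 0 or length < 0 or offset + length > len(raw):
        raise ValueError("block lies outside the segment")
    # Same-length overwrite: offsets of the other blocks do not shift.
    raw[offset:offset + length] = bytes(length)
    return zlib.compress(bytes(raw))


def pseudo_random_block(seed: int, size: int = BLOCK_SIZE) -> bytes:
    """Deterministic, poorly compressible filler standing in for user data."""
    out = bytearray()
    counter = 0
    while len(out) < size:
        out += hashlib.sha256(f"{seed}:{counter}".encode()).digest()
        counter += 1
    return bytes(out[:size])


blocks = [pseudo_random_block(i) for i in range(4)]
segment = zlib.compress(b"".join(blocks))
smaller = delete_block(segment, offset=BLOCK_SIZE, length=BLOCK_SIZE)
print(len(segment), len(smaller))  # the zeroed run compresses to almost nothing
```

The uncompressed layout is unchanged, so the remaining blocks keep their offsets, while the run of zeros takes up very little space once the segment is recompressed.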
In this way, there is no need to update the metadata references, since the location of the segment remains the same. In addition, the locality of references is also kept. The read amplification is improved, since areas that belong to the same backup are kept together. The read amplification refers to the number of disk reads per query, which is a ratio between the amount of data that is requested and the amount of data that actually needs to be read. Notably, in a conventional way of reusing redeemed space, the locality of references is lost when the container is being reused. According to embodiments provided in this invention, the container is not reused, thus locality of references can be kept.
This also helps deduplication since the locality of data (especially cold data) is kept. Notably, cold data refers to data which is kept over a number of backups (e.g., hundreds of), and hot data refers to data which is often replaced between backups. This process also implicitly keeps cold data and hot data separated in different containers.
In particular, according to an embodiment of the invention, each duplicated data block is associated with a reference value, which indicates how often the duplicated data block is being updated.
Preferably, according to an embodiment of the invention, the device 100 may be further configured to store a duplicated data block that has a reference value higher than a preset value, and a duplicated data block that has a reference value lower than the preset value in different containers.
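A minimal sketch of this separation is given below. The function name and the dictionary of reference values are hypothetical, and the disclosure does not say how a block whose reference value equals the preset value is handled; placing it in the second group is an assumption.

```python
from typing import Dict, List, Tuple


def split_by_reference_value(
    reference_values: Dict[str, int], preset: int
) -> Tuple[List[str], List[str]]:
    """Split block ids into those above the preset value and the rest.

    Each returned group would be written to its own container.
    """
    above: List[str] = []
    not_above: List[str] = []
    for block_id, value in reference_values.items():
        # Assumption: a value equal to the preset joins the second group.
        (above if value > preset else not_above).append(block_id)
    return above, not_above
```

Grouping blocks this way means that a container holding frequently updated blocks can be reclaimed often, while a container holding rarely updated blocks is seldom touched.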
It can be seen that this process requires no locking and no bad path handling. The process of reclaiming data can be done independently on any container. It requires no locking, which means that backup and data recovery can operate as usual. It also requires no locking for metadata, and there is no common area that needs to be updated, so it can be done in parallel and/or distributed among any number of threads, processes, or computers.
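Because each container is reclaimed on its own, the work can be handed to independent workers. The sketch below uses a thread pool and zlib as stand-ins; the function names and the shape of the deletion plan are assumptions, and no shared state is written by the workers.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Sequence, Tuple

Deletion = Tuple[int, int, int]  # (segment number, offset, length); hypothetical


def reclaim_container(segments: Sequence[bytes], deletions: Sequence[Deletion]) -> List[bytes]:
    """Zero-fill and recompress the affected segments of one container."""
    out = list(segments)
    for seg_no, offset, length in deletions:
        raw = bytearray(zlib.decompress(out[seg_no]))
        if offset < 0 or length < 0 or offset + length > len(raw):
            raise ValueError("deletion lies outside the segment")
        raw[offset:offset + length] = bytes(length)
        out[seg_no] = zlib.compress(bytes(raw))
    return out


def reclaim_all(
    containers: Dict[str, List[bytes]],
    plan: Dict[str, List[Deletion]],
    workers: int = 4,
) -> Dict[str, List[bytes]]:
    """Reclaim every container independently, one task per container."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            name: pool.submit(reclaim_container, segs, plan.get(name, []))
            for name, segs in containers.items()
        }
        return {name: future.result() for name, future in futures.items()}
```

Each task reads and returns only its own container, so the same function could equally be dispatched to separate processes or machines.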
Moreover, even in an unusual edge case, such as two entities working on the same container, there is still no risk of data corruption and no need to handle special cases.
In the previous embodiments, cross references are kept in the system, which can be verified or rebuilt independently and in parallel. However, this scheme can also work in any other deduplication system.
Possibly, according to an embodiment of the invention, the device 100 comprises a disk, and the one or more containers are stored on the disk.
FIG. 6 shows a method 600 for storing duplicated data blocks according to an embodiment of the present disclosure. In a particular embodiment of the disclosure, the method 600 is performed by a device 100 shown in FIG. 1. The method 600 comprises: a step 601 of storing one or more compressed containers, wherein each container comprises a plurality of segments, wherein one or more duplicated data blocks are stored in the plurality of segments of the one or more containers; and a step 602 of deleting a first duplicated data block that is stored in a segment by decompressing the segment, replacing the first duplicated data block with zeros in the segment, and recompressing the segment.
Optionally, according to an embodiment of the present disclosure, the method 600 may further comprise maintaining metadata of the one or more stored duplicated data blocks, wherein metadata of each stored duplicated data block comprises a reference to a location where the duplicated data block is stored.
Optionally, according to an embodiment of the present disclosure, each container further comprises segment metadata for each segment, which includes a location of the data segments in the container.
Optionally, according to an embodiment of the present disclosure, the method 600 may further comprise obtaining the one or more duplicated data blocks. Accordingly, the method may comprise filling the one or more duplicated data blocks into the plurality of segments of the one or more containers. Then, the method may further comprise compressing the plurality of segments of the one or more containers.
Optionally, according to an embodiment of the disclosure, the method 600 may further comprise retrieving the stored first duplicated data block according to metadata of the first duplicated data block.
Optionally, according to an embodiment of the disclosure, each duplicated data block is associated with a reference value, which indicates how often the duplicated data block is being updated.
Optionally, according to an embodiment of the disclosure, the method 600 may comprise storing a duplicated data block that has a reference value higher than a preset value, and a duplicated data block that has a reference value lower than the preset value in different containers.
Deduplication systems are dedicated, and a lot of trade-offs need to be considered when designing such systems. Based on the design proposed in embodiments of this disclosure, the following effects can be gained:
1. Eliminating the need for defragmentation.
2. Removing obsolete data while requiring no metadata updates.
3. Reducing overall read amplification, and increasing data locality.
4. Increasing a lifecycle of containers with cold data, and reducing a lifecycle of containers with hot data.
5. No need for offline defragmentation.
6. The process per container is completely independent, thus allowing fully distributed and parallel processing with no extra effort.
The present disclosure has been described in conjunction with various embodiments as examples as well as implementations. However, other variations can be understood and effected by those persons skilled in the art and practicing the claimed invention, from a study of the drawings, this disclosure and the independent claims. In the claims as well as in the description the word “comprising” does not exclude other elements or steps and the indefinite article “a” or “an” does not exclude a plurality. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used in an advantageous implementation.
Furthermore, any method according to embodiments of the present disclosure may be implemented in a computer program, having code means, which when run by processing means causes the processing means to execute the steps of the method. The computer program is included in a computer readable medium of a computer program product. The computer readable medium may comprise essentially any memory, such as a ROM (Read-Only Memory), a PROM (Programmable Read-Only Memory), an EPROM (Erasable PROM), a Flash memory, an EEPROM (Electrically Erasable PROM), or a hard disk drive.
Moreover, it is realized by the skilled person that embodiments of the device 100 comprise the necessary communication capabilities in the form of e.g., functions, means, units, elements, etc., for performing the solution. Examples of other such means, units, elements and functions are: processors, memory, buffers, control logic, encoders, decoders, rate matchers, de-rate matchers, mapping units, multipliers, decision units, selecting units, switches, interleavers, de-interleavers, modulators, demodulators, inputs, outputs, antennas, amplifiers, receiver units, transmitter units, DSPs, trellis-coded modulation (TCM) encoder, TCM decoder, power supply units, power feeders, communication interfaces, communication protocols, etc. which are suitably arranged together for performing the solution.
Especially, the processor(s) of the device 100 may comprise, e.g., one or more instances of a Central Processing Unit (CPU), a processing unit, a processing circuit, a processor, an Application Specific Integrated Circuit (ASIC), a microprocessor, or other processing logic that may interpret and execute instructions. The expression “processor” may thus represent a processing circuitry comprising a plurality of processing circuits, such as, e.g., any, some or all of the ones mentioned above. The processing circuitry may further perform data processing functions for inputting, outputting, and processing of data comprising data buffering and device control functions, such as call processing control, user interface control, or the like.

Claims

1. A device (100) for storing duplicated data blocks, the device (100) being configured to: store one or more compressed containers (101), wherein each container (101) comprises a plurality of segments, wherein one or more duplicated data blocks are stored in the plurality of segments of the one or more containers (101); and delete a first duplicated data block that is stored in a segment by decompressing the segment, replacing the first duplicated data block with zeros in the segment, and recompressing the segment.
2. The device (100) according to claim 1, being configured to: maintain metadata of the one or more stored duplicated data blocks, wherein metadata of each stored duplicated data block comprises a reference to a location where the duplicated data block is stored.
3. The device (100) according to claim 1 or 2, wherein each container (101) further comprises segment metadata for each segment, which includes a location of the data segments in the container (101).
4. The device (100) according to one of the claims 1 to 3, being configured to: obtain the one or more duplicated data blocks; fill the one or more duplicated data blocks into the plurality of segments of the one or more containers (101); and compress the plurality of segments of the one or more containers (101).
5. The device (100) according to one of the claims 2 to 4, being configured to: retrieve the stored first duplicated data block according to metadata of the first duplicated data block.
6. The device (100) according to one of the claims 2 to 5, wherein each duplicated data block is associated with a reference value, which indicates how often the duplicated data block is being updated.
7. The device (100) according to claim 6, being configured to: store a duplicated data block that has a reference value higher than a preset value, and a duplicated data block that has a reference value lower than the preset value in different containers (101).
8. The device (100) according to one of the claims 1 to 7, wherein the device (100) comprises a disk, and the one or more containers (101) are stored on the disk.
9. A method (600) for storing duplicated data blocks, comprising: storing (601) one or more compressed containers, wherein each container comprises a plurality of segments, wherein one or more duplicated data blocks are stored in the plurality of segments of the one or more containers; and deleting (602) a first duplicated data block that is stored in a segment by decompressing the segment, replacing the first duplicated data block with zeros in the segment, and recompressing the segment.
10. The method (600) according to claim 9, further comprising: maintaining metadata of the one or more stored duplicated data blocks, wherein metadata of each stored duplicated data block comprises a reference to a location where the duplicated data block is stored.
11. The method (600) according to claim 9 or 10, wherein each container further comprises segment metadata for each segment, which includes a location of the data segments in the container.
12. The method (600) according to one of the claims 9 to 11, further comprising: obtaining the one or more duplicated data blocks; filling the one or more duplicated data blocks into the plurality of segments of the one or more containers; and compressing the plurality of segments of the one or more containers.
13. The method (600) according to one of the claims 10 to 12, further comprising: retrieving the stored first duplicated data block according to metadata of the first duplicated data block.
14. The method (600) according to one of the claims 10 to 13, wherein each duplicated data block is associated with a reference value, which indicates how often the duplicated data block is being updated.
15. The method (600) according to claim 14, further comprising: storing a duplicated data block that has a reference value higher than a preset value, and a duplicated data block that has a reference value lower than the preset value in different containers.
16. A computer program product comprising a program code for carrying out, when implemented on a processor, the methods according to one of the claims 9 to 15.
PCT/EP2020/056082 2020-03-06 2020-03-06 Devices and methods for eliminating defragmentation in deduplication WO2021175446A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202080006944.3A CN113632059A (en) 2020-03-06 2020-03-06 Apparatus and method for eliminating defragmentation in deduplication
PCT/EP2020/056082 WO2021175446A1 (en) 2020-03-06 2020-03-06 Devices and methods for eliminating defragmentation in deduplication

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2020/056082 WO2021175446A1 (en) 2020-03-06 2020-03-06 Devices and methods for eliminating defragmentation in deduplication

Publications (1)

Publication Number Publication Date
WO2021175446A1 (en) 2021-09-10

Family

ID=69780207

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2020/056082 WO2021175446A1 (en) 2020-03-06 2020-03-06 Devices and methods for eliminating defragmentation in deduplication

Country Status (2)

Country Link
CN (1) CN113632059A (en)
WO (1) WO2021175446A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2013206404A1 (en) * 2008-09-26 2013-07-11 Commvault Systems, Inc. Systems and methods for managing single instancing data
US20150261776A1 (en) * 2014-03-17 2015-09-17 Commvault Systems, Inc. Managing deletions from a deduplication database
WO2016091282A1 (en) * 2014-12-09 2016-06-16 Huawei Technologies Co., Ltd. Apparatus and method for de-duplication of data
US20200042219A1 (en) * 2018-08-03 2020-02-06 EMC IP Holding Company LLC Managing deduplication characteristics in a storage system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10394757B2 (en) * 2010-11-18 2019-08-27 Microsoft Technology Licensing, Llc Scalable chunk store for data deduplication
US20120159098A1 (en) * 2010-12-17 2012-06-21 Microsoft Corporation Garbage collection and hotspots relief for a data deduplication chunk store
CN102214210B (en) * 2011-05-16 2013-03-13 华为数字技术(成都)有限公司 Method, device and system for processing repeating data
CN102684827B (en) * 2012-03-02 2015-07-29 华为技术有限公司 Data processing method and data processing equipment
CN102999605A (en) * 2012-11-21 2013-03-27 重庆大学 Method and device for optimizing data placement to reduce data fragments

Also Published As

Publication number Publication date
CN113632059A (en) 2021-11-09

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20710129

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20710129

Country of ref document: EP

Kind code of ref document: A1