WO2023006183A1 - Method for managing the storage of data segments on a storage device - Google Patents

Method for managing the storage of data segments on a storage device

Info

Publication number
WO2023006183A1
WO2023006183A1, PCT/EP2021/070951, EP2021070951W
Authority
WO
WIPO (PCT)
Prior art keywords
size
segments
data segments
data
segment
Prior art date
Application number
PCT/EP2021/070951
Other languages
French (fr)
Inventor
Elizabeth FIRMAN
Idan Zach
Assaf Natanzon
Aviv Kuvent
Yair Toaff
Original Assignee
Huawei Technologies Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/EP2021/070951 priority Critical patent/WO2023006183A1/en
Publication of WO2023006183A1 publication Critical patent/WO2023006183A1/en

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 - Interfaces specially adapted for storage systems
    • G06F3/0602 - Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608 - Saving storage space on storage systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 - Interfaces specially adapted for storage systems
    • G06F3/0628 - Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 - Organizing or formatting or addressing of data
    • G06F3/064 - Management of blocks
    • G06F3/0641 - De-duplication techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 - Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 - Interfaces specially adapted for storage systems
    • G06F3/0668 - Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 - In-line storage system
    • G06F3/0673 - Single storage device
    • H - ELECTRICITY
    • H03 - ELECTRONIC CIRCUITRY
    • H03M - CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 - Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 - Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3084 - Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
    • H03M7/3091 - Data deduplication
    • H03M7/3093 - Data deduplication using fixed length segments

Definitions

  • FIG. 1 is a block diagram of a storage device in accordance with an implementation of the disclosure;
  • FIG. 2 is an exemplary diagram that illustrates segmenting an input data stream into one or more variable size data segments using a storage device in accordance with an implementation of the disclosure;
  • FIG. 3 is an exemplary diagram that illustrates segmenting an input data stream into one or more variable size data segments of sizes 30 Kilobytes (KB), 30 KB, and 14 KB using a storage device in accordance with an implementation of the disclosure;
  • FIG. 4 is an exemplary diagram that illustrates random IOs of 32KB and re-segmentation results in accordance with an implementation of the disclosure;
  • FIGS. 5A and 5B are exemplary diagrams that illustrate random IO of 20KB and re-segmentation results in accordance with an implementation of the disclosure;
  • FIG. 6 is a flow diagram that illustrates a method for managing the storage of data segments on a storage device in accordance with an implementation of the disclosure; and
  • FIG. 7 is an illustration of a computing arrangement (e.g. a storage device) that is used in accordance with implementations of the disclosure.
  • Implementations of the disclosure provide a method for managing the storage of data segments on a storage device, and the storage device that is configured to manage the storage of data segments.
  • a process, a method, a system, a product, or a device that includes a series of steps or units is not necessarily limited to expressly listed steps or units but may include other steps or units that are not expressly listed or that are inherent to such process, method, product, or device.
  • FIG. 1 is a block diagram of a storage device 100 in accordance with an implementation of the disclosure.
  • the storage device 100 is configured to store one or more fixed-size data segments 102A-N of a determined fixed size.
  • the storage device 100 is configured to receive an input data stream.
  • the storage device 100 is configured to segment the input data stream into one or more variable size data segments each having a variable size.
  • the storage device 100 is configured to create the offset data segments from the one or more variable size data segments by offsetting the end of each variable size segment so that the size of each of the offset data segments is a multiple of the determined fixed size.
  • the storage device 100 is configured to identify duplicate data in the input data stream by detecting the duplicate data in the offset data segments.
  • An efficiency of the storage device 100 is improved by deduplicating the data segments without changing underlying data.
  • the storage device 100 writes the offset data segments and aligns them according to their variable size to obtain the fixed-size data segments. Once the offset data segments are written, the storage device 100 detects duplicates of the data in the fixed-size data segments.
  • the deduplication in the one or more fixed-size data segments 102A-N is enhanced by writing the offset segments, and not the actual segments.
  • the offset segments enhance the de-duplication ratio with minimal overhead.
  • the storage device 100 provides variable size segmentation of the input data stream to gain flexibility to shift in the input data stream while using the one or more fixed-size data segments 102A-N.
  • the storage device 100 provides offset data segments as highly compressible so that a storage capacity that is occupied by the offset data segments is optimized.
  • the storage device 100 is further configured to segment the input data stream, using as parameters the average, the maximum and the minimum sizes of each of the variable size segments created from segmenting the input data stream.
  • the storage device 100 is further configured to set the maximum size of each of the variable size segments to be smaller or equal to the determined fixed size, if the determined fixed size is greater than a first predetermined value.
  • the storage device 100 is further configured to set the minimum size of each of the variable size segments to be greater or equal to the determined fixed size, if the determined fixed size is smaller than a second predetermined value.
  • the storage device 100 is further configured to insert one or more reserved segments of the predetermined fixed size between consecutive offset data segments amongst the created offset data segments.
  • the storage device 100 is further configured so that the number of reserved segments inserted between a first and a second consecutive offset data segments amongst the created offset data segments, depends on the size of the variable size data segment from which the first offset data segment is created, on the minimum variable size and on the maximum variable size of the variable size data segments.
  • the storage device 100 further includes a repository for storing the offset data segment.
  • the storage device 100 is further configured to store, for each given offset data segment created from a given variable size data segment, the position of the given variable size data segment relative to a maximum data segment size in the input data stream, and the size of the given variable size segment.
  • the storage device 100 is further configured to create the offset data segments from the variable size data segments by padding the end of each variable size segment with a padding character so that the size of each of the offset data segments is a multiple of the determined fixed size.
  • FIG. 2 is an exemplary diagram that illustrates segmenting an input data stream 202 into one or more variable size data segments 206 using a storage device in accordance with an implementation of the disclosure.
  • the storage device is configured to store one or more fixed-size data segments 204A-N of a determined fixed size.
  • the storage device is configured to receive the input data stream 202.
  • the storage device is configured to segment the input data stream 202 into one or more variable size data segments 206 each having a variable size.
  • the storage device is configured to create the offset data segments 208A-B from the one or more variable size data segments 206 by offsetting the end of each variable size segment so that the size of each of the offset data segments 208A-B is a multiple of the determined fixed size.
  • the offset data segments 208A-B are shown with dots in the figure.
  • the one or more variable size data segments 206 are shown with diagonal lines in the figure.
  • the storage device segments the input data stream 202 into the one or more variable size data segments 206 based on a rolling hash and a central processing unit (CPU) optimized segmentation.
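  • As an illustration of such rolling-hash-based segmentation, the sketch below shows a generic content-defined chunking routine. The disclosure does not specify the hash function or the size parameters, so the window length, the boundary mask, and the 8/16/32 KB sizes used here are assumptions chosen only for the example.

```python
def segment_stream(data: bytes,
                   min_size: int = 8 * 1024,
                   avg_size: int = 16 * 1024,
                   max_size: int = 32 * 1024,
                   window: int = 48) -> list[bytes]:
    """Split data into variable size segments using a simple polynomial rolling hash."""
    mask = avg_size - 1                      # avg_size is assumed to be a power of two
    base, mod = 257, (1 << 61) - 1
    top = pow(base, window, mod)             # weight of the byte leaving the window

    segments, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = (h * base + byte) % mod                      # slide the new byte into the window
        if i - start >= window:
            h = (h - data[i - window] * top) % mod       # drop the byte that left the window
        length = i - start + 1
        if length >= min_size and ((h & mask) == mask or length >= max_size):
            segments.append(data[start:i + 1])           # content-defined (or forced) boundary
            start, h = i + 1, 0
    if start < len(data):
        segments.append(data[start:])                    # tail shorter than min_size
    return segments
```

  • Because a boundary is declared wherever the low bits of the rolling hash match a fixed pattern, identical content tends to produce identical segment boundaries even when it is shifted within the stream, which is what gives the variable size segmentation its flexibility to shift while still using fixed-size data segments.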
  • a determined fixed size of the one or more fixed-size data segments 204A-N varies between different storage devices, and is also configurable based on a file system used in the storage device.
  • the storage device is further configured to set the maximum size of each of the variable size segments to be smaller or equal to the determined fixed size, if the determined fixed size is greater than a first predetermined value.
  • the storage device is further configured to set the minimum size of each of the variable size segments to be greater or equal to the determined fixed size, if the determined fixed size is smaller than a second predetermined value.
  • the storage device aligns the one or more variable size data segments 206 to the determined fixed size, adds a padding character, and saves additional metadata to enable reads, thereby improving a de-duplication ratio.
  • the storage efficiency of the storage device is increased and the cost of storage is reduced.
  • when random IOs are received for an existing file (e.g. a live file), the storage device (i) performs variable size segmentation for a large write that is received as the random IOs, and (ii) for a small write, reads enough data around the location where the new data is to be written and performs variable size segmentation with the new data to maintain consistency with the variable size segmentation. As there might be more data segments to read and write, enough space is required for the new aligned variable size segments.
  • variable size segmentation is performed, in case of a large input data stream, to obtain the one or more fixed-size data segments 204A-N to maintain consistency with the variable size segmentation.
  • the de-duplication ratio remains the same.
  • the process in which the storage device creates new data segments on a live file is known as re-segmentation.
  • a data management model is used to optimize a reserved space and minimize overhead of the data to be read to achieve efficient re-segmentation.
  • the re-segmentation may be used in online or offline deduplication of data segments based on a workload. During the re-segmentation, input-output operations (IOs) and computational overhead are added; to perform the re-segmentation offline, the data is written in aligned data segments and a bitmap of the changed data segments is saved.
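  • A minimal sketch of such a bitmap of changed data segments is given below. It assumes one bit per fixed-size grain of a hypothetical repository file; the class, the 32 KB grain size, and the method names are illustrative and are not taken from the disclosure.

```python
class DirtyGrainBitmap:
    """Tracks which fixed-size grains of a repository file were touched by random IOs,
    so that an offline pass can later re-segment only those regions."""

    def __init__(self, file_size: int, grain_size: int = 32 * 1024):
        self.grain_size = grain_size
        n_grains = (file_size + grain_size - 1) // grain_size
        self.bits = bytearray((n_grains + 7) // 8)

    def mark_write(self, offset: int, length: int) -> None:
        """Record that the byte range [offset, offset + length) was overwritten."""
        first = offset // self.grain_size
        last = (offset + length - 1) // self.grain_size
        for grain in range(first, last + 1):
            self.bits[grain // 8] |= 1 << (grain % 8)

    def dirty_grains(self):
        """Yield grain indices that the offline re-segmentation should revisit."""
        for grain in range(len(self.bits) * 8):
            if self.bits[grain // 8] & (1 << (grain % 8)):
                yield grain


# Example: a 32 KB random write at offset 48 KB touches grains 1 and 2.
bitmap = DirtyGrainBitmap(file_size=1024 * 1024)
bitmap.mark_write(48 * 1024, 32 * 1024)
print(list(bitmap.dirty_grains()))   # [1, 2]
```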
  • the storage device includes a repository for storing the offset data segments 208A-B.
  • each user file is mapped to a file in the repository.
  • the file in the repository includes the offset data segments 208A-B, the padding character that is added, and a reserved space. The reserved space may be used for future re-segmentation using sparse files.
  • the metadata enables mapping the offset data segments 208A-B to the one or more fixed-size data segments 204A-N by considering the padding character.
  • the input data stream 202 may be divided into logical data segments. Each logical data segment has a maximum size. The maximum size of the fixed-size data segment is a predefined constant value, and it is important as it limits the number of reserved spaces.
  • the metadata is stored for each fixed-size data segment.
  • the metadata includes an offset and a size. The offset is relative to the offset of the fixed logical data segments of the input data stream 202 that are mapped to the one or more fixed-size data segments 204A-N. The offset of each fixed logical data segment is between 0 and 1.
  • the size of data of the one or more fixed-size data segments 204A-N is determined by mapping each fixed logical data segment to a constant number of the one or more fixed-size data segments 204A-N.
  • the size of the one or more fixed-size data segments 204A-N may be greater than 1.
  • the offset of the input data stream 202 is mapped to the offset of the repository using the metadata.
  • a fixed-size data segment 204A from which the reading operation starts is determined by (i) providing a start offset of the read that is relative to a user view of the fixed-size data segment 204A and determining a start offset of the read that is relative to a repository view, (ii) calculating an index of the fixed-size data segment 204A by dividing the start offset by the size of the fixed-size data segment 204A, (iii) searching for the relative offset in the corresponding one or more fixed-size data segments 204A-N, (iv) searching the metadata of the corresponding fixed-size data segments by providing the index of the fixed-size data segment 204A, and (v) identifying the fixed-size data segment 204A from which the reading operation starts.
  • the search of metadata of the corresponding fixed-size data segments includes a last fixed-size data segment from a previous fixed-size data segment. Once the start offset is identified, the metadata is retrieved from that point to read the corresponding fixed- size data segments.
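  • The sketch below illustrates this offset translation in a simplified form. It keeps, for each fixed-size data segment, only the number of user bytes actually stored in it and walks the metadata linearly; the procedure described above instead starts from an index computed by dividing the offset by the fixed size, which is omitted here for brevity. The structure and names are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class GrainMeta:
    # The disclosure stores, per fixed-size data segment, an offset and a size;
    # this simplified sketch only needs the size (real user bytes held by the grain).
    size: int


def find_read_start(user_offset: int, metadata: List[GrainMeta]) -> Tuple[int, int]:
    """Return (grain_index, offset_inside_grain) where a read at user_offset begins."""
    covered = 0                                    # user bytes accounted for so far
    for grain_index, meta in enumerate(metadata):  # walk grains in repository order
        if user_offset < covered + meta.size:      # the requested byte lives in this grain
            return grain_index, user_offset - covered
        covered += meta.size
    raise ValueError("offset is beyond the end of the file")


# Example: grains holding 20 KB, 16 KB and 8 KB of user data; a read at user offset 30 KB
# starts inside the second grain, 10 KB past the beginning of its data.
print(find_read_start(30 * 1024,
                      [GrainMeta(20 * 1024), GrainMeta(16 * 1024), GrainMeta(8 * 1024)]))
```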
  • the segmentation is performed on the input data stream 202 that is appended to leftover from previous segmentation.
  • the leftover from the previous segmentation is used to calculate the boundaries of data segments.
  • the de-duplication ratio is higher, especially in the case of the small-sized input data stream.
  • a ratio between maximum segment size and minimum segment size is considered as k.
  • For each write operation, segmentation may be performed for new files and re-segmentation may be performed for live files.
  • new data is to be written after performing the read operation on the fixed-size data segment 204A.
  • an offset that is relative to a user view is translated to a repository offset, and the relative offset within the fixed-size data segment 204A is saved to the metadata of the one or more fixed-size data segments 204A-N.
  • the size of data in the one or more fixed-size data segments 204A-N is saved.
  • the one or more fixed-size data segments 204A-N are reserved for re-segmentation by starting the write operation for each data segment at the corresponding fixed-size data segment that is k times of the one or more fixed-size data segments 204A-N.
  • the one or more fixed-size data segments 204A-N represent a present data segment that may include data from a next data segment to maintain a data segment alignment, not affecting the de-duplication ratio as there is no deletion of any data segment during alignment.
  • the write operation may depend on the size of the one or more fixed-size data segments 204A-N compared to the size of the one or more variable size data segments 206.
  • each data segment is mapped to 2k times of the fixed-size data segments 204A-N.
  • the end of the variable size data segment is padded with a padding character to align with the size of the fixed-size data segment.
  • Each variable size data segment may be mapped to k times of the one or more fixed- size data segments 204A-N.
  • each data segment of the input data stream 202 in the repository is mapped to k times of the one or more fixed-sized data segments 204A-N, including data and reserved space.
  • each variable size data segment is mapped to at most k times of the one or more fixed-size data segments 204A-N and at least to a single fixed-size data segment.
  • the end of the data segment is padded with the padding character to align with the size of the one or more fixed-size data segments 204A-N.
  • Each variable size data segment may be mapped to 2k times of the one or more fixed-size data segments 204A-N.
  • the repository file includes 2k times of the one or more fixed-size data segments 204A-N, including the data and the reserved space.
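  • The grain accounting above can be sketched as follows, assuming that the fixed-size grain equals the minimum segment size and that k is the ratio between the maximum and the minimum segment sizes; the function name and the 32 KB / k = 4 values are illustrative only.

```python
import math


def grains_for_segment(segment_size: int,
                       grain_size: int = 32 * 1024,
                       k: int = 4) -> tuple[int, int, int]:
    """Return (data_grains, padding_bytes, reserved_grains) for one variable size segment.

    Segments are assumed to be at most k * grain_size bytes, so data_grains never exceeds k.
    """
    data_grains = max(1, math.ceil(segment_size / grain_size))   # at least one grain
    padding_bytes = data_grains * grain_size - segment_size      # pad the tail to a grain multiple
    reserved_grains = max(0, k - data_grains)                    # room left for re-segmentation
    return data_grains, padding_bytes, reserved_grains


# Example: a 30 KB segment with 32 KB grains and k = 4 occupies one data grain,
# carries 2 KB of (highly compressible) padding, and leaves three grains reserved.
print(grains_for_segment(30 * 1024))   # (1, 2048, 3)
```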
  • an upper bound is set.
  • the upper bound of the random IO overhead is a single fixed-size data segment after the last modified fixed-size data segment; that is, three extra data IOs are present: one read and two writes, along with one delete.
  • the upper bound of the random IO overhead is the single data segment after the last modified fixed-size data segment, i.e. the number of IOs depends on the specific data segment span.
  • FIG. 3 is an exemplary diagram that illustrates segmenting an input data stream into one or more variable size data segments (e.g. 4 variable size data segments) of sizes 30 Kilobytes (KB), 30 KB, and 14 KB using a storage device in accordance with an implementation of the disclosure.
  • the exemplary diagram shows mapping of 2 variable size data segments to 2 fixed-size data segments of size 128 Kilobytes (KB) each.
  • Each fixed-size data segment includes 4 fixed-size data segment grains.
  • One of the 2 fixed-size data segments represents the start offsets in a range of 0-128 Kilobytes (KB), and
  • another fixed-size data segment represents the start offsets in a range of 128-256KB.
  • FIG. 4 is an exemplary diagram that illustrates random IOs of 32KB and re-segmentation results in accordance with an implementation of the disclosure.
  • the exemplary diagram includes writing new data of size 32KB at offset 16KB, which results in 5 data segments of sizes 20KB, 16KB, 8KB, 18KB, and 14KB using re-segmentation.
  • the re-segmentation includes calculating fixed-size data segments to read that are stored in a repository.
  • the exemplary diagram illustrates that writing starts at index 0 and ends at index 3, as shown in FIG. 3.
  • the fixed-size data segments in the range [0,3] are read.
  • the exemplary diagram includes performing re-segmentation on the fixed-size data segments with new data.
  • the last boundary of the fixed-size data segments is aligned to the previous fixed-size data segments. For example, the last boundary is currently aligned between 4 and 5, and was previously between 2 and 3, as shown in the figure.
  • the following mapping is performed based on the segmentation algorithm.
  • FIGS. 5A and 5B are exemplary diagrams that illustrate random IO of 20KB and re-segmentation results in accordance with an implementation of the disclosure.
  • FIG. 5A is an exemplary diagram that illustrates the random IO of 20KB and one of the results of re-segmentation.
  • the exemplary diagrams include writing new data of size 32KB at offset 16KB, which results in final data segments of sizes 20KB, 16KB, 8KB, 18KB, 8KB, and 14KB using re-segmentation.
  • the data segments 1-4 are segmented based on a segmentation algorithm.
  • the fixed-size data segment 5 includes a data segment that is created using the re-segmentation algorithm.
  • the re-segmentation includes calculating fixed-size data segments to read that are stored in a repository.
  • the exemplary diagrams depict that writing starts at index 4 and ends at index 5.
  • the fixed-size data segments in the range [4,5] are read.
  • the exemplary diagram illustrates performing re-segmentation on the fixed-size data segments with new data.
  • the segmentation is performed on 26KB, and data segment 3 of size 18KB is obtained.
  • the data segment 3 is stored in the fixed-size data segment 4, and a leftover of size 8KB is identified with no boundary. As the leftover is equal to the minimum data segment size, it is stored in its own fixed-size data segment, that is, at index 6.
  • FIG. 5B is an exemplary diagram that illustrates the random IO of 20 KB and another result of the re-segmentation.
  • the exemplary diagram includes writing new data of size 32KB at offset 20KB, which results in final data segments of sizes 20KB, 16KB, 20KB, and 20KB using re-segmentation.
  • the fixed-size data segment 7 includes a data segment that is created using a re-segmentation algorithm.
  • the re-segmentation includes calculating fixed-size data segments to read that are stored in the repository.
  • the exemplary method of writing starts at index 4 and ends at index 5.
  • the fixed-size data segments in the range [4,5] are read.
  • the exemplary diagram includes performing re-segmentation on the fixed-size data segments with new data.
  • the segmentation is performed on 26KB, and data segment 3 of size 20KB is obtained.
  • the data segment 3 is stored in the fixed-size data segment 4, and a leftover of size 6KB is identified with no boundary. As the leftover is less than the minimum data segment size, the next data segment is read, that is, data segment 7 with a size of 14KB.
  • the leftover and the read data segment are appended into a single data segment of size 20KB (6KB + 14KB) and are stored in data segment 7.
  • the re-segmentation algorithm provides writing variable size data segments on top of existing fixed-size deduplication while optimizing the random IOs using reserved space, thereby increasing a de-duplication ratio and the efficiency of a storage device.
  • FIG. 6 is a flow diagram that illustrates a method for managing the storage of data segments on a storage device in accordance with an implementation of the disclosure.
  • at a step 602, one or more fixed-size data segments of a determined fixed size are stored on the storage device.
  • an input data stream is received.
  • the input data stream is segmented into one or more variable size data segments each having a variable size.
  • offset data segments are created from the variable size data segments by offsetting the end of each variable size segment so that the size of each of the offset data segments is a multiple of the determined fixed size.
  • duplicate data is identified in the input data stream by comparing the offset data segments with the one or more fixed-size data segments.
  • the method improves an efficiency of the storage device by deduplicating the data segments without changing underlying data.
  • the method writes the offset data segments and aligns them according to their variable size to obtain the one or more fixed-size data segments. Once the offset data segments are written, the method can detect duplicates of the data in the fixed-size data segments.
  • the deduplication in the one or more fixed-size data segments is enhanced by writing the offset segments and not the actual segments.
  • the offset segments enhance the deduplication ratio with minimal overhead.
  • the method provides variable size segmentation of the input data stream to gain the flexibility to shift in the input data stream while using the one or more fixed-size data segments.
  • the method provides the offset data segments as highly compressible so that a storage capacity that is occupied by the offset data segments is optimized.
  • a deduplication method may be implemented in scenarios such as in-place or overwrite modifications.
  • segmenting the input data stream uses as parameters the average, the maximum and the minimum sizes of each of the variable size segments created from segmenting the input data stream.
  • the maximum size of each of the variable size segments is set to be smaller or equal to the determined fixed size.
  • the minimum size of each of the variable size segments is set to be greater or equal to the determined fixed size.
  • the method further includes inserting one or more reserved segments of the predetermined fixed size between consecutive offset data segments amongst the created offset data segments.
  • the number of reserved segments inserted between a first and a second consecutive offset data segments amongst the created offset data segments depends on the size of the variable size data segment from which the first offset data segment is created, on the minimum variable size and on the maximum variable size of the variable size data segments.
  • the method further includes storing the offset data segment into a repository.
  • the method further includes storing, for each given offset data segment created from a given variable size data segment, the position of the given variable size data segment in the input data stream, and the size of the given variable size segment.
  • creating offset data segments from the variable size data segments includes padding the end of each variable size segment with a padding character so that the size of each of the offset data segments is a multiple of the determined fixed size.
  • a ratio between maximum segment size and minimum segment size is considered as k.
  • MinSize: FSGrainSize (the minimal segment size)
  • Segment.UserOffset: the user offset at which the variable-length segment starts
  • the following code covers writing sequentially on the fixed-size data segments:
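  • A hedged Python sketch of such a sequential write is shown below; it is an illustration, not the original listing. Each variable size segment is padded up to a multiple of the fixed-size grain and written into its own slot of k grains, and the unused grains of the slot are left as reserved, sparse space. The constants, the SegmentMeta record, and the file layout are assumptions built around the parameter names listed above (FSGrainSize, MinSize, Segment.UserOffset).

```python
import math
from dataclasses import dataclass

FS_GRAIN_SIZE = 32 * 1024          # fixed-size grain; MinSize == FSGrainSize (assumed value)
MAX_SIZE = 128 * 1024              # maximum variable segment size (assumed value)
K = MAX_SIZE // FS_GRAIN_SIZE      # ratio between maximum and minimum segment sizes


@dataclass
class SegmentMeta:
    user_offset: int               # Segment.UserOffset: where the segment starts in the user file
    size: int                      # number of real (unpadded) bytes held by the segment


def write_sequential(repo_path: str, segments: list) -> list:
    """Write each variable size segment (at most MAX_SIZE bytes) into its own slot of K grains."""
    metadata, user_offset = [], 0
    with open(repo_path, "wb") as repo:
        for slot, seg in enumerate(segments):
            grains = max(1, math.ceil(len(seg) / FS_GRAIN_SIZE))
            padded = seg + b"\0" * (grains * FS_GRAIN_SIZE - len(seg))  # pad to a grain multiple
            repo.seek(slot * K * FS_GRAIN_SIZE)    # start of this segment's slot
            repo.write(padded)                     # unused grains of the slot stay sparse/reserved
            metadata.append(SegmentMeta(user_offset, len(seg)))
            user_offset += len(seg)
    return metadata
```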
  • the following code covers the re-segmentation algorithm:
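  • Again as a hedged illustration rather than the original listing, the sketch below re-reads only the segments overlapping a random write, splices in the new data, re-runs variable size segmentation on that span, and merges a leftover shorter than the minimum segment size into the following segment, mirroring the FIG. 5B example. segment_stream refers to the chunking sketch given earlier; the remaining names and the minimum size are assumptions.

```python
MIN_SIZE = 8 * 1024   # assumed minimum variable segment size


def resegment(segments: list, write_offset: int, new_data: bytes) -> list:
    """Apply a random write to an already segmented file and re-segment the affected span."""
    # 1. Locate the existing segments overlapping [write_offset, write_offset + len(new_data)).
    starts, pos = [], 0
    for seg in segments:
        starts.append(pos)
        pos += len(seg)
    end = write_offset + len(new_data)
    first = max(i for i, s in enumerate(starts) if s <= write_offset)
    last = max(i for i, s in enumerate(starts) if s < end)

    # 2. Read the affected span and splice the new data into it.
    span = bytearray(b"".join(segments[first:last + 1]))
    rel = write_offset - starts[first]
    span[rel:rel + len(new_data)] = new_data

    # 3. Re-run variable size segmentation on the modified span
    #    (segment_stream is the rolling-hash chunking sketch shown earlier).
    new_segs = segment_stream(bytes(span))

    # 4. A leftover shorter than the minimum segment size is merged with the next segment.
    tail = segments[last + 1:]
    if new_segs and tail and len(new_segs[-1]) < MIN_SIZE:
        new_segs[-1] = new_segs[-1] + tail[0]
        tail = tail[1:]

    return segments[:first] + new_segs + tail
```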
  • FIG. 7 is an illustration of an exemplary computing arrangement 700 (e.g. a storage device) in which the various architectures and functionalities of the various previous implementations may be implemented.
  • the computing arrangement 700 includes at least one processor 704 that is connected to a bus 702. The bus 702 may be implemented using any suitable protocol, such as PCI (Peripheral Component Interconnect), PCI-Express, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol(s).
  • the computing arrangement 700 also includes a memory 706. Control logic (software) and data are stored in the memory 706 which may take the form of random-access memory (RAM).
  • a single semiconductor platform may refer to a sole unitary semiconductor-based integrated circuit or chip. It should be noted that the term single semiconductor platform may also refer to multi-chip modules with increased connectivity which simulate on-chip operation and make substantial improvements over utilizing a conventional central processing unit (CPU) and bus implementation. Of course, the various modules may also be situated separately or in various combinations of semiconductor platforms per the desires of the user.
  • the computing arrangement 700 may also include a secondary storage 710.
  • the secondary storage 710 includes, for example, a hard disk drive and a removable storage drive, representing a floppy disk drive, a magnetic tape drive, a compact disk drive, a digital versatile disk (DVD) drive, a recording device, or a universal serial bus (USB) flash memory.
  • the removable storage drive at least one of reads from and writes to a removable storage unit in a well-known manner.
  • Computer programs, or computer control logic algorithms may be stored in at least one of the memory 706 and the secondary storage 710. Such computer programs, when executed, enable the computing arrangement 700 to perform various functions as described in the foregoing.
  • the memory 706, the secondary storage 710, and any other storage are possible examples of computer-readable media.
  • the architectures and functionalities depicted in the various previous figures may be implemented in the context of the processor 704, a graphics processor coupled to a communication interface 712, an integrated circuit (not shown) that is capable of at least a portion of the capabilities of both the processor 704 and a graphics processor, a chipset (i.e., a group of integrated circuits designed to work and be sold as a unit for performing related functions), etc.
  • the architectures and functionalities depicted in the various previous figures may be implemented in the context of a general computer system, a circuit board system, a game console system dedicated for entertainment purposes, or an application-specific system.
  • the computing arrangement 700 may take the form of a desktop computer, a laptop computer, a server, a workstation, a game console, or an embedded system.
  • the computing arrangement 700 may take the form of various other devices including, but not limited to a personal digital assistant (PDA) device, a mobile phone device, a smart phone, a television, etc.
  • the computing arrangement 700 may be coupled to a network (e.g., a telecommunications network, a local area network (LAN), a wireless network, a wide area network (WAN) such as the Internet, a peer-to-peer network, a cable network, or the like) for communication purposes through an I/O interface 708.

Abstract

Provided is a method for managing the storage of data segments on a storage device (100). The method includes storing one or more fixed-size data segments (102A-N, 204A-N) of a determined fixed size on the storage device (100). The method includes receiving an input data stream (202). The method includes segmenting the input data stream (202) into one or more variable size data segments (206) each having a variable size. The method includes creating offset data segments (208A-B) from the variable size data segments (206) by offsetting the end of each variable size segment so that the size of each of the offset data segments (208A-B) is a multiple of the determined fixed size. The method includes identifying duplicate data in the input data stream (202) by detecting the duplicate data in the offset data segments (208A-B).

Description

METHOD FOR MANAGING THE STORAGE OF DATA SEGMENTS ON A
STORAGE DEVICE
TECHNICAL FIELD
The disclosure relates to a method for managing the storage of data segments on a storage device, and more particularly, the disclosure relates to the storage device that is configured to manage the storage of data segments.
BACKGROUND
File system storage divides data of a file into fixed-size blocks, while a logical representation of the file points to the fixed-size blocks. Some existing storage systems optimize disk space by finding duplicates of the fixed-size blocks. Deduplication is a method that is used to reduce the amount of used storage by identifying granularity of duplicates of data and avoiding storing the duplicates of data explicitly. There are two methods to identify the granularity of data as a duplicate. The first method is used to divide the data into predefined fixed-size segments but misses duplicates of data that are aligned differently. The second method is used to divide the data into variable-size data segments by finding boundaries that depend on context and a criterion. However, the variable-size data segments consume more resources and computation time and are more complicated to implement.
Several known approaches are employed to perform variable size segmentation to store fixed-size chunks and to assign reference points to the fixed-size chunks. The reference points that are assigned to the fixed-size chunks include all the data or partial data (prefix or suffix) of that chunk. The existing approaches calculate fingerprints on variable size segments and search for matches in a fingerprint store to perform the deduplication method in backup streams. By saving a reference stream, the deduplication method saves a backup stream that points to the fixed size chunks to compose the variable size segments. This is done to handle misalignments between the variable size segments and the fixed-size chunks. Another known approach employs a method that uses hierarchical deduplication by storing fingerprints in a Merkle tree, where the leaf nodes represent fixed-size blocks. The hierarchical deduplication includes three levels. The first level includes segmenting the entire object into variable size segments and composing each variable size segment with the fixed-size blocks. For alignment, the last fixed-size block of each variable size segment can be smaller than the fixed size. The file systems that include the fixed size deduplication are aligned to the variable size segments using a segmentation technique and store additional metadata to identify the blocks smaller than the fixed size or to calculate segments that are multiples of the fixed-size blocks.
However, the aforementioned known approaches store additional fingerprint metadata that requires a lot of memory for fast searches. Moreover, the deduplication method is implemented either partially or entirely. Furthermore, the write-once read-many scenario is successful in implementing the deduplication method, whereas in-place or overwrite modifications are still unable to implement the deduplication method.
Therefore, there arises a need to address the aforementioned technical drawbacks in known techniques or technologies in managing the storage of data segments on a storage device.
SUMMARY
It is an object of the disclosure to provide a method for managing the storage of data segments on a storage device, and the storage device that is configured to manage the storage of data segments while avoiding one or more disadvantages of prior art approaches.
This object is achieved by the features of the independent claims. Further, implementation forms are apparent from the dependent claims, the description, and the figures.
The disclosure provides a method for managing the storage of data segments on a storage device, and the storage device that is configured to manage the storage of data segments.
According to a first aspect, there is provided a method for managing the storage of data segments on a storage device. The method includes storing a plurality of fixed-size data segments of a determined fixed size on the storage device. The method includes receiving an input data stream. The method includes segmenting the input data stream into a plurality of variable size data segments each having a variable size. The method includes creating offset data segments from the variable size data segments by offsetting the end of each variable size segment so that the size of each of the offset data segments is a multiple of the determined fixed size. The method includes identifying duplicate data in the input data stream by detecting the duplicate data in the offset data segments.
The method improves an efficiency of the storage device by deduplicating the data segments without changing underlying data. The method writes the offset data segments and aligns them according to their variable size to obtain the fixed-size data segments. Once the offset data segments are written, the method can detect duplicates of the data in the fixed-size data segments. The deduplication in the fixed-size data segments is enhanced by writing the offset segments and not the actual segments. The offset segments enhance the deduplication ratio with minimal overhead. The method provides variable size segmentation of the input data stream to gain the flexibility to shift in the input data stream while using the fixed-size data segments. The method provides the offset data segments as highly compressible so that a storage capacity that is occupied by the offset data segments is optimized. A fixed-size data segment deduplication method may be implemented in scenarios such as in-place or overwrite modifications.
Optionally, segmenting the input data stream uses as parameters the average, the maximum, and the minimum sizes of each of the variable size segments created from segmenting the input data stream. The parameters related to the segmentation provide efficient deduplication in cases such as in-place or overwrite modifications.
Optionally, if the determined fixed size is greater than a first predetermined value, the maximum size of each of the variable size segments is set to be smaller or equal to the determined fixed size. Preferably, in that case, the maximum size of each of the variable size segments is set to be equal to the determined fixed size. Optionally too, if the determined fixed size is smaller than a second predetermined value, the minimum size of each of the variable size segments is set to be greater or equal to the determined fixed size. Preferably, in that case, the minimum size of each of the variable size segments is set to be equal to the determined fixed size. The determined fixed size provides a threshold value for applying the deduplication method. Also, the threshold value helps to optimize the usage of memory during deduplication. Optionally, the method further includes inserting one or more reserved segments of the predetermined fixed size between consecutive offset data segments amongst the created offset data segments. The method provides writing variable size data segments on top of existing fixed-size deduplication, while optimizing the random IOs using reserved space, thereby increasing de-duplication ratio and the efficiency of the storage device.
Optionally, the number of reserved segments inserted between a first and a second consecutive offset data segments amongst the created offset data segments, depends on the size of the variable size data segment from which the first offset data segment is created, on the minimum variable size and on the maximum variable size of the variable size data segments.
Optionally, the method further includes storing the offset data segment into a repository.
The method optimizes the memory by inserting the offset data segments based on the variable size of data segments, thereby improving storage capacity of the storage device.
Optionally, the method further includes storing, for each given offset data segment created from a given variable size data segment, the position of the given variable size data segment in the input data stream relative to a maximum size of each of the variable size data segments, and the size of the given variable size data segment.
In the case of a large-sized input data stream, the memory is optimized by saving the relative offset, thereby improving the efficiency of the storage device. The relative offset is optionally an offset that is relative to the fixed size data segments that are divided from the input data stream.
Optionally, creating offset data segments from the variable size data segments comprises padding the end of each variable size segment with a padding character so that the size of each of the offset data segments is a multiple of the determined fixed size. The padding character occupies a negligible space as it is highly compressible. Hence, the memory is utilized efficiently.
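As an illustration of the padding step and of why the padding is cheap, the following sketch pads a segment up to a multiple of the fixed size and compares compressed sizes with and without the padding; the 32 KB fixed size, the zero padding character, and the use of zlib are assumptions made only for the example.

```python
import zlib


def pad_to_multiple(segment: bytes, fixed_size: int, pad_char: bytes = b"\0") -> bytes:
    """Pad the end of a variable size segment so its length is a multiple of fixed_size."""
    padding = (fixed_size - len(segment) % fixed_size) % fixed_size
    return segment + pad_char * padding


segment = bytes(range(256)) * 100             # 25,600 bytes of example data
padded = pad_to_multiple(segment, 32 * 1024)  # padded up to the next 32 KB boundary
extra = len(zlib.compress(padded)) - len(zlib.compress(segment))
# About 7 KB of raw padding is added, but only a handful of extra bytes remain after compression.
print(len(padded) - len(segment), extra)
```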
According to a second aspect, there is provided a storage device. The storage device is configured to store a plurality of fixed-size data segments of a determined fixed size. The storage device is configured to receive an input data stream. The storage device is configured to segment the input data stream into a plurality of variable size data segments each having a variable size. The storage device is configured to create the offset data segments from the variable size data segments by offsetting the end of each variable size segment so that the size of each of the offset data segments is a multiple of the determined fixed size. The storage device is configured to identify duplicate data in the input data stream by detecting the duplicate data in the offset data segments.
An efficiency of the storage device is improved by deduplicating the data segments without changing underlying data. The storage device writes the offset data segments and aligns them according to their variable size to obtain the fixed-size data segments. Once the offset data segments are written, the storage device detects duplicates of the data in the fixed-size data segments. The deduplication in the fixed-size data segments is enhanced by writing the offset segments and not the actual segments. The offset segments enhance the deduplication ratio with minimal overhead. The storage device provides variable size segmentation of the input data stream to gain the flexibility to shift in the input data stream while using the fixed-size data segments. The storage device provides the offset data segments as highly compressible so that the storage capacity occupied by the offset data segments is optimized. A deduplication method may be implemented in scenarios such as in-place or overwrite modifications.
Optionally, the storage device is further configured to segment the input data stream, using as parameters the average, the maximum and the minimum sizes of each of the variable size segments created from segmenting the input data stream. The storage device is configured to use the parameters that are related to the segmentation to provide efficient deduplication in cases such as in-place or overwrite modifications.
Optionally, the storage device is further configured to set the maximum size of each of the variable size segments to be smaller or equal to the determined fixed size, if the determined fixed size is greater than a first predetermined value. The storage device provides the determined fixed size as a threshold value for applying the deduplication method. Also, the threshold value helps to optimize the usage of memory during deduplication.
Optionally, the storage device is further configured to set the minimum size of each of the variable size segments to be greater or equal to the determined fixed size, if the determined fixed size is smaller than a second predetermined value.
Optionally, the storage device is further configured to insert one or more reserved segments of the predetermined fixed size between consecutive offset data segments amongst the created offset data segments. Optionally, the storage device is further configured so that the number of reserved segments inserted between a first and a second consecutive offset data segments amongst the created offset data segments depends on the size of the variable size data segment from which the first offset data segment is created, on the minimum variable size and on the maximum variable size of the variable size data segments. The storage device optimizes memory by inserting the offset data segments based on the variable size of data segments, thereby improving a storage capacity of the storage device.
Optionally, the storage device further includes a repository for storing the offset data segment.
Optionally, the storage device is further configured to store, for each given offset data segment created from a given variable size data segment, the position of the given variable size data segment in the input data stream relative to a maximum size of each of the variable size data segments, and the size of the given variable size data segment. In case of a large-sized input data stream, the storage device optimizes the memory by saving the relative offset, thereby improving the efficiency of the storage device. The relative offset is optionally an offset that is relative to the fixed size data segments that are divided from the input data stream.
Optionally, the storage device is further configured to create the offset data segments from the variable size data segments by padding the end of each variable size segment with a padding character so that the size of each of the offset data segments is a multiple of the determined fixed size.
The padding character occupies a negligible space as it is highly compressible. Hence, the memory is utilized efficiently.
A technical problem in the prior art is resolved, where the technical problem concerns requiring a large amount of memory for fast searches due to the storage of additional fingerprint metadata, and where the deduplication method cannot be applied to in-place or overwrite modifications.
Therefore, in contradistinction to the prior art, the method for managing the storage of data segments on the storage device, and the storage device configured to manage the storage of data segments, deduplicate the data segments without changing underlying data, thereby improving the performance of the storage device. The storage device writes the offset data segments and aligns them according to their variable size data segments to obtain the fixed-size data segments. Once the offset data segments are written, the storage device can detect duplicates of the data in the fixed-size data segments. The deduplication in the fixed-size data segments is enhanced by writing the offset segments and not the actual segments. The offset segments enhance the deduplication ratio with minimal overhead. The storage device provides variable size segmentation of the input data stream to gain the flexibility to shift in the input data stream while using the fixed-size data segments. The storage device provides offset data segments as highly compressible so that a storage capacity occupied by the offset data segments is optimized. A deduplication method is implemented in scenarios such as in-place or overwrite modifications.
These and other aspects of the disclosure will be apparent from, and elucidated with reference to, the implementation(s) described below.
BRIEF DESCRIPTION OF DRAWINGS
Implementations of the disclosure will now be described, by way of example only, with reference to the accompanying drawings, in which:
FIG. 1 is a block diagram of a storage device in accordance with an implementation of the disclosure;
FIG. 2 is an exemplary diagram that illustrates segmenting an input data stream into one or more variable size data segments using a storage device in accordance with an implementation of the disclosure;
FIG. 3 is an exemplary diagram that illustrates segmenting an input data stream into one or more variable size data segments of sizes 30 Kilobytes (KB), 30 KB, and 14 KB using a storage device in accordance with an implementation of the disclosure;
FIG. 4 is an exemplary diagram that illustrates random IOs of 32KB and re-segmentation results in accordance with an implementation of the disclosure;
FIGS. 5A and 5B are exemplary diagrams that illustrate random IO of 20KB and re-segmentation results in accordance with an implementation of the disclosure;
FIG. 6 is a flow diagram that illustrates a method for managing the storage of data segments on a storage device in accordance with an implementation of the disclosure; and
FIG. 7 is an illustration of a computing arrangement (e.g. a storage device) that is used in accordance with implementations of the disclosure.
DETAILED DESCRIPTION OF THE DRAWINGS
Implementations of the disclosure provide a method for managing the storage of data segments on a storage device, and the storage device that is configured to manage the storage of data segments.
To make solutions of the disclosure more comprehensible for a person skilled in the art, the following implementations of the disclosure are described with reference to the accompanying drawings.
Terms such as "a first", "a second", "a third", and "a fourth" (if any) in the summary, claims, and foregoing accompanying drawings of the disclosure are used to distinguish between similar objects and are not necessarily used to describe a specific sequence or order. It should be understood that the terms so used are interchangeable under appropriate circumstances, so that the implementations of the disclosure described herein are, for example, capable of being implemented in sequences other than the sequences illustrated or described herein. Furthermore, the terms "include" and "have" and any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, a method, a system, a product, or a device that includes a series of steps or units, is not necessarily limited to expressly listed steps or units but may include other steps or units that are not expressly listed or that are inherent to such process, method, product, or device.
FIG. 1 is a block diagram of a storage device 100 in accordance with an implementation of the disclosure. The storage device 100 is configured to store one or more fixed-size data segments 102A-N of a determined fixed size. The storage device 100 is configured to receive an input data stream. The storage device 100 is configured to segment the input data stream into one or more variable size data segments each having a variable size. The storage device 100 is configured to create the offset data segments from the one or more variable size data segments by offsetting the end of each variable size segment so that the size of each of the offset data segments is a multiple of the determined fixed size. The storage device 100 is configured to identify duplicate data in the input data stream by detecting the duplicate data in the offset data segments.
An efficiency of the storage device 100 is improved by deduplicating the data segments without changing underlying data. The storage device 100 writes the offset data segments and aligns them according to their variable size to obtain the fixed-size data segments. Once the offset data segments are written, the storage device 100 detects duplicates of the data in the fixed-size data segments. The deduplication in the one or more fixed-size data segments 102A-N is enhanced by writing the offset segments, and not the actual segments. The offset segments enhance the de-duplication ratio with minimal overhead. The storage device 100 provides variable size segmentation of the input data stream to gain flexibility to shift in the input data stream while using the one or more fixed-size data segments 102A-N. The storage device 100 provides offset data segments as highly compressible so that a storage capacity that is occupied by the offset data segments is optimized.
Optionally, the storage device 100 is further configured to segment the input data stream, using as parameters the average, the maximum and the minimum sizes of each of the variable size segments created from segmenting the input data stream.
Optionally, the storage device 100 is further configured to set the maximum size of each of the variable size segments to be smaller or equal to the determined fixed size, if the determined fixed size is greater than a first predetermined value.
Optionally, the storage device 100 is further configured to set the minimum size of each of the variable size segments to be greater or equal to the determined fixed size, if the determined fixed size is smaller than a second predetermined value.
Optionally, the storage device 100 is further configured to insert one or more reserved segments of the predetermined fixed size between consecutive offset data segments amongst the created offset data segments.
Optionally, the storage device 100 is further configured so that the number of reserved segments inserted between a first and a second consecutive offset data segments amongst the created offset data segments, depends on the size of the variable size data segment from which the first offset data segment is created, on the minimum variable size and on the maximum variable size of the variable size data segments. Optionally, the storage device 100 further includes a repository for storing the offset data segment.
Optionally, the storage device 100 is further configured to store, for each given offset data segment created from a given variable size data segment, the position of the given variable size data segment relative to a maximum data segment size in the input data stream, and the size of the given variable size segment.
Optionally, the storage device 100 is further configured to create the offset data segments from the variable size data segments by padding the end of each variable size segment with a padding character so that the size of each of the offset data segments is a multiple of the determined fixed size.
FIG. 2 is an exemplary diagram that illustrates segmenting an input data stream 202 into one or more variable size data segments 206 using a storage device in accordance with an implementation of the disclosure. The storage device is configured to store one or more fixed-size data segments 204A-N of a determined fixed size. The storage device is configured to receive the input data stream 202. The storage device is configured to segment the input data stream 202 into one or more variable size data segments 206 each having a variable size. The storage device is configured to create the offset data segments 208A-B from the one or more variable size data segments 206 by offsetting the end of each variable size segment so that the size of each of the offset data segments 208A-B is a multiple of the determined fixed size. The offset data segments 208A-B are shown with dots in the figure. The one or more variable size data segments 206 are shown with diagonal lines in the figure.
The storage device segments the input data stream 202 into the one or more variable size data segments 206 based on the use of a rolling hash and a central processing unit (CPU)-optimized segmentation. Optionally, the determined fixed size of the one or more fixed-size data segments 204A-N varies between different storage devices, and is also configurable based on the file system used in the storage device.
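For purposes of illustration only, the following Python sketch shows one generic form of rolling-hash based, content-defined segmentation with minimum, average and maximum size parameters. The hash, the mask and the size values are assumptions made for this sketch and are not the specific segmentation of the disclosure.

MIN_SIZE = 8 * 1024      # assumed minimum variable segment size
AVG_SIZE = 16 * 1024     # assumed average variable segment size (power of two)
MAX_SIZE = 32 * 1024     # assumed maximum variable segment size
MASK = AVG_SIZE - 1      # a boundary is declared when (hash & MASK) == 0

def segment_stream(data: bytes):
    """Yield variable size segments whose boundaries depend on the content."""
    start = 0
    h = 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF      # toy rolling hash over the bytes
        size = i - start + 1
        if size < MIN_SIZE:
            continue                            # never cut below the minimum size
        if (h & MASK) == 0 or size >= MAX_SIZE:
            yield data[start:i + 1]             # content-defined or forced boundary
            start = i + 1
            h = 0
    if start < len(data):
        yield data[start:]                      # leftover tail of the stream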
Optionally, the storage device is further configured to set the maximum size of each of the variable size segments to be smaller or equal to the determined fixed size, if the determined fixed size is greater than a first predetermined value. Optionally, the storage device is further configured to set the minimum size of each of the variable size segments to be greater or equal to the determined fixed size, if the determined fixed size is smaller than a second predetermined value.
Optionally, when large input outputs (IOs) to a new file are performed, the storage device aligns the one or more variable size data segments 206 to the determined fixed size, adds a padding character, and saves additional metadata to enable reads, thereby improving the de-duplication ratio. Hence, the storage efficiency of the storage device is increased and the cost of storage is reduced.
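As a minimal illustration of the alignment and padding referred to above, the following Python sketch rounds a variable size segment up to the next multiple of an assumed fixed grain size; the names pad_segment, FIXED_SIZE and PAD_CHAR are illustrative assumptions only.

FIXED_SIZE = 32 * 1024          # assumed fixed grain size (32 KB)
PAD_CHAR = b"\x00"              # assumed padding character; runs of it compress well

def pad_segment(segment: bytes, fixed_size: int = FIXED_SIZE) -> bytes:
    """Pad the end of a variable size segment to a multiple of the fixed size."""
    remainder = len(segment) % fixed_size
    if remainder == 0:
        return segment
    return segment + PAD_CHAR * (fixed_size - remainder)

# Example: a 30 KB segment becomes a 32 KB offset data segment.
assert len(pad_segment(b"x" * 30 * 1024)) == 32 * 1024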
Optionally, when random IOs are received to an existing file (e.g. a live file), the storage device (i) performs variable size segmentation for a large write that is received as the random IOs, and (ii) for a small write, reads enough data around the location where the new data is to be written and performs variable size segmentation with the new data to maintain consistency with the variable size segmentation. As there might be more data segments to read and write, enough space is required for the new aligned variable size segments.
Optionally, the variable size segmentation is performed, in the case of a large input data stream, to obtain the one or more fixed-size data segments 204A-N and to maintain consistency with the variable size segmentation. The de-duplication ratio remains the same. The creation of new data segments on a live file by the storage device is known as re-segmentation. Optionally, a data management model is used to optimize a reserved space and minimize the overhead of the data to be read, to achieve efficient re-segmentation. The re-segmentation may be used in online or offline deduplication of data segments based on a workload. During the re-segmentation, input-outputs (IOs) and computational overhead are added, thereby enabling writes in aligned data segments and saving a bitmap of changed data segments so that the re-segmentation can be performed offline.
Optionally, the storage device includes a repository for storing the offset data segments 208A-B. Optionally, each user file is mapped to a file in the repository. The file in the repository includes the offset data segments 208A-B, the padding character that is added, and a reserved space. The reserved space may be used for future re-segmentation using sparse files.
Optionally, for reading the input data stream 202, the metadata enables mapping the offset data segments 208A-B to the one or more fixed-size data segments 204A-N by considering the padding character. The input data stream 202 may be divided into logical data segments. Each logical data segment has a maximum size. The maximum size of the fixed-size data segment is a predefined constant value, and it is important as it limits the number of reserved spaces. Optionally, during a write operation, the metadata is stored for each fixed-size data segment. The metadata includes an offset and a size. The offset is relative to the offset of the fixed logical data segments of the input data stream 202 that are mapped to the one or more fixed-size data segments 204A-N. The offset of each fixed logical data segment is between 0 and 1. The size of data of the one or more fixed-size data segments 204A-N is determined by mapping each fixed logical data segment to a constant number of the one or more fixed-size data segments 204A-N. The size of the one or more fixed-size data segments 204A-N may be greater than 1. Optionally, the offset of the input data stream 202 is mapped to the offset of the repository using the metadata.
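For illustration, a per-segment metadata record of the kind described above can be sketched in Python as follows; the field names, the 32 KB logical segment size and the helper to_relative are assumptions of this sketch and not part of the disclosure.

from dataclasses import dataclass

LOGICAL_SEGMENT_SIZE = 32 * 1024   # assumed size of a fixed logical data segment

@dataclass
class SegmentMetadata:
    relative_offset: int           # offset inside the fixed logical data segment
    size: int                      # size of the variable size data segment

def to_relative(user_offset: int, logical_size: int = LOGICAL_SEGMENT_SIZE):
    """Split an absolute user offset into (logical segment index, relative offset)."""
    return user_offset // logical_size, user_offset % logical_size

# Example: a user offset of 60 KB falls into logical segment 1 at relative offset 28 KB.
print(to_relative(60 * 1024))      # (1, 28672)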
Optionally, during a read operation, the fixed-size data segment 204A from where the reading operation starts is determined by, (i) providing a start offset of the read that is relative to a user view of the fixed-size data segment 204A and determining a start offset of the read that is relative to a repository view, (ii) calculating an index of the fixed-size data segment 204A by dividing the start offset by the size of the fixed-size data segment 204A, (iii) searching the relative offset in the corresponding one or more fixed-size data segments 204A-N, (iv) searching the metadata of the corresponding fixed-size data segments by providing the index of the fixed-size data segment 204A, and (v) identifying the fixed-size data segment 204A from where the reading operation starts. Optionally, the search of the metadata of the corresponding fixed-size data segments includes a last fixed-size data segment from a previous fixed-size data segment. Once the start offset is identified, the metadata is retrieved from that point to read the corresponding fixed-size data segments.
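A much-simplified sketch of the read-side lookup is given below; the dictionary-based metadata, the backward walk to a previous segment and the use of sizes in KB are assumptions introduced only to illustrate the index calculation.

def find_read_start(start_offset_kb: int, fixed_size_kb: int, metadata: dict):
    """metadata maps a fixed-size data segment index to (relative_offset, size) in KB."""
    index = start_offset_kb // fixed_size_kb    # candidate fixed-size segment index
    relative = start_offset_kb % fixed_size_kb  # relative offset searched inside it
    # Fall back to a previous fixed-size data segment if the candidate has no entry.
    while index > 0 and index not in metadata:
        index -= 1
    return index, relative

# Example with assumed 32 KB fixed-size data segments and toy metadata.
md = {0: (0, 30), 3: (28, 30), 7: (28, 14)}
print(find_read_start(20, 32, md))              # (0, 20)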
Optionally, during the write operation, the segmentation is performed on the input data stream 202 that is appended to leftover from previous segmentation. The leftover from the previous segmentation is used to calculate the boundaries of data segments. Thus, for any size of a write operation, the de-duplication ratio is higher, especially in the case of the small-sized input data stream.
Optionally, a ratio between the maximum segment size and the minimum segment size is considered as k. For each write operation, segmentation may be performed for new files and re-segmentation may be performed for live files. During the write operation for the live files, new data is to be written after performing the read operation on the fixed-size data segment 204A. When writing the new data to the data segments, an offset that is relative to a user view is translated to a repository offset, and a relative offset in the fixed-size data segment 204A is saved to the metadata of the one or more fixed-size data segments 204A-N. The size of data in the one or more fixed-size data segments 204A-N is also saved. The one or more fixed-size data segments 204A-N are reserved for re-segmentation by starting the write operation for each data segment at the corresponding fixed-size data segment that is k times of the one or more fixed-size data segments 204A-N. The one or more fixed-size data segments 204A-N represent a present data segment that may include data from a next data segment to maintain a data segment alignment, without affecting the de-duplication ratio as no data segment is deleted during alignment. The write operation may depend on the size of the one or more fixed-size data segments 204A-N compared to the size of the one or more variable size data segments 206.
Optionally, when the size of the one or more fixed-size data segments 204A-N is less than or equal to the size of the one or more variable size data segments 206, each data segment is mapped to 2k times of the fixed-size data segments 204A-N. When a variable size data segment is smaller than the one or more fixed-size data segments 204A-N, the end of the variable size data segment is padded with a padding character to align with the size of the fixed-size data segment. Each variable size data segment may be mapped to k times of the one or more fixed-size data segments 204A-N. Hence, each data segment of the input data stream 202 in the repository is mapped to k times of the one or more fixed-size data segments 204A-N, including data and reserved space.
Optionally, when the size of the one or more fixed-size data segments 204A-N is greater than or equal to the variable size of the one or more variable size data segments 206, each variable size data segment is mapped to at most k times of the one or more fixed-size data segments 204A-N and at least to a single fixed-size data segment. When a variable size data segment is smaller than the fixed-size data segment, the end of the data segment is padded with the padding character to align with the size of the one or more fixed-size data segments 204A-N. Each variable size data segment may be mapped to two times of k times of the one or more fixed-size data segments 204A-N. Hence, for each size of the one or more variable size data segments 206, the repository file includes two times of k times of the one or more fixed-size data segments 204A-N, including the data and the reserved space.
Optionally, to initiate the re-segmentation process for the random IOs, an upper bound is set. When the size of the one or more fixed-size data segments 204A-N is less than or equal to the size of the one or more variable size data segments 206, the upper bound of the random IO overhead is a single fixed-size data segment after the last modified fixed-size data segment, that is, three extra data IOs are present: one read and two writes, along with one delete. When the size of the one or more fixed-size data segments 204A-N is greater than or equal to the size of the one or more variable size data segments 206, the upper bound of the random IO overhead is the single data segment after the last modified fixed-size data segment, i.e. the number of IOs depends on the specific data segment span.
FIG. 3 is an exemplary diagram that illustrates segmenting an input data stream into one or more variable size data segments (e.g. 4 variable size data segments) of sizes 30 Kilobytes (KB), 30 KB, and 14 KB using a storage device in accordance with an implementation of the disclosure. The exemplary diagram shows the mapping of 2 variable size data segments to 2 fixed-size data segments of size 128 Kilobytes (KB) each. Each fixed-size data segment includes 4 fixed-size data segment grains. One of the 2 fixed-size data segments represents the start offsets in a range of 0-128 Kilobytes (KB), and the other fixed-size data segment represents the start offsets in a range of 128-256KB. Two reserved fixed-size data segment grains with index1 and index2, of size 32KB, are shown in the figure. Padding of 18KB is shown in the fixed-size data segment that represents the start offsets in the range 128-256KB. Optionally, a ratio between a maximum segment size and a minimum segment size is considered as k.
The following mapping is performed by considering k = 4 according to a segmentation algorithm, based on the following calculations:
FSBlockIndex = Int(Segment.UserOffset / FSGrainSize)
FSBlockGrainIndex = Int((Segment.UserOffset % FSGrainSize) / MinSize)
The new data segment is stored in FSGrain.Index = 4 * FSBlockIndex + FSBlockGrainIndex
Segment 1 - offset 0, size 30KB -> FS grain 0 (offset 0)
Location calculation:
FSBlockIndex = Int(0 / 32) = 0
FSBlockGrainIndex = Int((0 % 32) / 8) = 0
FSGrain.Index = 4 * 0 + 0 = 0
Stored MD: offset = 0, size = 30
Segment 2 - offset 28, size 30KB -> FS grain 3 (offset 96KB)
Location calculation:
FSBlockIndex = Int(28 / 32) = 0
FSBlockGrainIndex = Int((28 % 32) / 8) = 3
FSGrain.Index = 4 * 0 + 3 = 3
Stored MD: offset = 28, size = 30
Segment 3 - offset 60, size 14KB -> FS grain 7 (offset 224KB)
Location calculation:
FSBlockIndex = Int(60 / 32) = 1
FSBlockGrainIndex = Int((60 % 32) / 8) = 3
FSGrain.Index = 4 * 1 + 3 = 7
Stored MD: offset = 28, size = 14
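The location calculations above can be reproduced with the short Python sketch below (sizes in KB); the function name and the default grain and minimum sizes are taken from the example for illustration only.

def fs_grain_index(user_offset_kb: int, grain_kb: int = 32, min_kb: int = 8) -> int:
    """FSGrain.Index = 4 * FSBlockIndex + FSBlockGrainIndex, as in the example above."""
    fs_block_index = user_offset_kb // grain_kb
    fs_block_grain_index = (user_offset_kb % grain_kb) // min_kb
    return 4 * fs_block_index + fs_block_grain_index

for offset_kb, size_kb in [(0, 30), (28, 30), (60, 14)]:
    print(offset_kb, size_kb, "-> FS grain", fs_grain_index(offset_kb))
# Prints FS grains 0, 3 and 7, matching Segments 1-3 above.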
FIG. 4 is an exemplary diagram that illustrates random IOs of 32KB and re-segmentation results in accordance with an implementation of the disclosure. The exemplary diagram includes writing new data at offset 16KB, of size 32KB, which results in 5 data segments of sizes 20KB, 16KB, 8KB, 18KB, and 14KB using re-segmentation. The re-segmentation includes calculating the fixed-size data segments to read that are stored in a repository. The exemplary diagram illustrates that writing starts at index 0 and ends at index 3, as shown in FIG. 3. The fixed-size data segments in the range [0,3] are read. The exemplary diagram includes performing re-segmentation on the fixed-size data segments with the new data. The last boundary of the fixed-size data segments is aligned to the previous fixed-size data segments. For example, the last boundary is currently aligned between 4 and 5, and was previously between 2 and 3, as shown in the figure. The following mapping is performed based on the segmentation algorithm.
Segment 1 - offset 0, size 20KB -> FS grain 0 (offset 0)
Location calculation:
FSBlockIndex = Int(0 / 32) = 0
FSBlockGrainIndex = Int((0 % 32) / 8) = 0
FSGrain.Index = 4 * 0 + 0 = 0
Stored MD: offset = 0, size = 20
Segment 2 - offset 20, size 16KB -> FS grain 2 (offset 64KB)
Location calculation:
FSBlockIndex = Int(16 / 32) = 0
FSBlockGrainIndex = Int((16 % 32) / 8) = 2
FSGrain.Index = 4 * 0 + 2 = 2
Stored MD: offset = 20, size = 16
Segment 3 - offset 36, size 8KB -> FS grain 4 (offset 128KB)
Location calculation:
FSBlockIndex = Int(36 / 32) = 1
FSBlockGrainIndex = Int((36 % 32) / 8) = 0
FSGrain.Index = 4 * 1 + 0 = 4
Stored MD: offset = 4, size = 8
Segment 4 - offset 44, size 18KB -> FS grain 5 (offset 160KB)
Location calculation:
FSBlockIndex = Int(44 / 32) = 1
FSBlockGrainIndex = Int((44 % 32) / 8) = 1
FSGrain.Index = 4 * 1 + 1 = 5
Stored MD: offset = 12, size = 18
Segment 5 - offset 62, size 14KB -> FS grain 7 (offset 224KB)
Location calculation:
FSBlockIndex = Int(62 / 32) = 1
FSBlockGrainIndex = Int((62 % 32) / 8) = 3
FSGrain.Index = 4 * 1 + 3 = 7
Stored MD: offset = 30, size = 14
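The choice of which stored grains to read before re-segmentation can be illustrated with the following Python sketch; the tuple layout and the overlap test are assumptions of this sketch, with sizes in KB and the segment layout taken from FIG. 3.

segments = [      # (user_offset, size, fs_grain_index) as laid out in FIG. 3
    (0, 30, 0),
    (28, 30, 3),
    (60, 14, 7),
]

def grains_to_read(write_offset_kb: int, write_size_kb: int):
    """Return the first and last stored grain whose segment overlaps the write range."""
    end = write_offset_kb + write_size_kb
    hit = [g for (off, size, g) in segments if off < end and off + size > write_offset_kb]
    return min(hit), max(hit)

# A 32 KB write at offset 16 KB touches the segments stored in grains 0 and 3,
# so the fixed-size data segments in the range [0, 3] are read, as described above.
print(grains_to_read(16, 32))   # (0, 3)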
FIGS. 5A and 5B are exemplary diagrams that illustrate a random IO of 20KB and re-segmentation results in accordance with an implementation of the disclosure. FIG. 5A is an exemplary diagram that illustrates the random IO of 20KB and one of the results of re-segmentation. The exemplary diagrams include writing new data at offset 16KB, of size 32KB, which results in final data segments of sizes 20KB, 16KB, 8KB, 18KB, 8KB, and 14KB using re-segmentation. The data segments 1-4 are segmented based on a segmentation algorithm. The fixed-size data segment 5 includes a data segment that is created using the re-segmentation algorithm. The re-segmentation includes calculating the fixed-size data segments to read that are stored in a repository. The exemplary diagrams depict writing that starts at index 4 and ends at index 5. The fixed-size data segments in the range [4,5] are read. The exemplary diagram illustrates performing re-segmentation on the fixed-size data segments with the new data. The segmentation is performed on 26KB and data segment 3 of size 18KB is obtained. The data segment 3 is stored in the fixed-size data segment 4, and a leftover of size 8KB is identified with no boundary. As the leftover is equal to the minimum data segment size, the leftover is stored in its own fixed-size data segment, that is at index 6.
FIG. 5B is an exemplary diagram that illustrates the random IO of 20 KB and another result of the re-segmentation. The exemplary diagram includes writing new data at offset 20KB, of size 32KB, which results in final data segments of sizes 20KB, 16KB, 20KB, and 20KB using re-segmentation. The fixed-size data segment 7 includes a data segment that is created using a re-segmentation algorithm. The re-segmentation includes calculating the fixed-size data segments to read that are stored in the repository. The exemplary writing starts at index 4 and ends at index 5. The fixed-size data segments in the range [4,5] are read. The exemplary diagram includes performing re-segmentation on the fixed-size data segments with the new data. The segmentation is performed on 26KB and data segment 3 of size 20KB is obtained. The data segment 3 is stored in the fixed-size data segment 4, and a leftover of size 6KB is identified with no boundary. As the leftover of 6KB is less than the minimum data segment size, the next data segment is read, that is data segment 7 with a size of 14KB. The leftover and the next data segment (6KB + 14KB) are appended into a single data segment of size 20KB and stored in data segment 7.
The re-segmentation algorithm enables writing variable size data segments on top of existing fixed-size deduplication while optimizing the random IOs using reserved space, thereby increasing the de-duplication ratio and the efficiency of a storage device.
FIG. 6 is a flow diagram that illustrates a method for managing the storage of data segments on a storage device in accordance with an implementation of the disclosure. At a step 602, one or more fixed-size data segments of a determined fixed size are stored on the storage device. At a step 604, an input data stream is received. At a step 606, the input data stream is segmented into one or more variable size data segments each having a variable size. At a step 608, offset data segments are created from the variable size data segments by offsetting the end of each variable size segment so that the size of each of the offset data segments is a multiple of the determined fixed size. At a step 610, duplicate data is identified in the input data stream by comparing the offset data segments with the one or more fixed-size data segments.
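The steps 602 to 610 can be illustrated end to end with the following Python sketch, which reuses the segment_stream and pad_segment helpers sketched earlier; the SHA-256 fingerprint set is only an illustrative stand-in for however duplicates are actually detected.

import hashlib

def ingest(stream: bytes, seen: set) -> int:
    """Segment, pad to the fixed size, and count duplicate offset data segments."""
    duplicates = 0
    for segment in segment_stream(stream):         # step 606: variable size segmentation
        offset_segment = pad_segment(segment)      # step 608: size is a multiple of the fixed size
        digest = hashlib.sha256(offset_segment).digest()
        if digest in seen:                         # step 610: duplicate detection
            duplicates += 1
        else:
            seen.add(digest)
    return duplicates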
The method improves an efficiency of the storage device by deduplicating the data segments without changing underlying data. The method writes the offset data segments and aligns them according to their variable size to obtain the one or more fixed-size data segments. Once the offset data segments are written, the method can detect duplicates of the data in the fixed-size data segments. The deduplication in the one or more fixed-size data segments is enhanced by writing the offset segments and not the actual segments. The offset segments enhance the deduplication ratio with minimal overhead. The method provides variable size segmentation of the input data stream to gain the flexibility to shift in the input data stream while using the one or more fixed-size data segments. The method provides the offset data segments as highly compressible so that a storage capacity that is occupied by the offset data segments is optimized. A deduplication method may be implemented in scenarios such as in-place or overwrite modifications.
Optionally, segmenting the input data stream uses as parameters the average, the maximum and the minimum sizes of each of the variable size segments created from segmenting the input data stream.
Optionally, if the determined fixed size is greater than a first predetermined value, the maximum size of each of the variable size segments is set to be smaller or equal to the determined fixed size.
Optionally, if the determined fixed size is smaller than a second predetermined value, the minimum size of each of the variable size segments is set to be greater or equal to the determined fixed size.
Optionally, the method further includes inserting one or more reserved segments of the predetermined fixed size between consecutive offset data segments amongst the created offset data segments.
Optionally, the number of reserved segments inserted between a first and a second consecutive offset data segments amongst the created offset data segments, depends on the size of the variable size data segment from which the first offset data segment is created, on the minimum variable size and on the maximum variable size of the variable size data segments. Optionally, the method further includes storing the offset data segment into a repository.
Optionally, the method further includes storing, for each given offset data segment created from a given variable size data segment, the position of the given variable size data segment in the input data stream, and the size of the given variable size segment.
Optionally, creating offset data segments from the variable size data segments, includes padding the end of each variable size segment with a padding character so that the size of each of the offset data segments is a multiple of the determined fixed size.
Optionally, a ratio between maximum segment size and minimum segment size is considered as k.
The following coding includes defining one or more fixed-size data segments by considering k = 4, where the variable segment size is greater than or equal to the fixed segment size.
1. MinSize = FSGrainSize (minimal segment size)
2. MaxSize = 4 * FSGrainSize
3. UserBlockSize = MaxSize
4. Segment = Variable-length segment
5. Segment.UserOffset = The user offset in which the variable-length segment starts
6. Segment.Size = The size (in bytes) of the variable-length segment. NOTE: MinSize <= Segment.Size <= MaxSize
7. FSGrainsFactor = 2 * k // 8 in our use-case
8. FSBlockSize = FSGrainsFactor * FSGrainSize
9. FSBlockIndex = Int(Segment.UserOffset / UserBlockSize)
10. FSBlockGrainIndex = [0, 1, 2, 3, ..., (FSGrainsFactor-1)]
11. FSBlockGrainIndex.Start = 2 * Int((Segment.UserOffset % UserBlockSize) / MinSize) + (Segment.UserOffset % FSGrainSize == 0 ? 0 : 1)
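The definitions above can be rendered in Python as follows; the concrete 8 KB grain size is an assumed value chosen only so that the sketch is runnable.

K = 4                                      # ratio between maximum and minimum segment size
FS_GRAIN_SIZE = 8 * 1024                   # assumed grain size for illustration
MIN_SIZE = FS_GRAIN_SIZE                   # minimal variable segment size
MAX_SIZE = K * FS_GRAIN_SIZE               # maximal variable segment size
USER_BLOCK_SIZE = MAX_SIZE
FS_GRAINS_FACTOR = 2 * K                   # 8 in this use-case
FS_BLOCK_SIZE = FS_GRAINS_FACTOR * FS_GRAIN_SIZE

def fs_block_index(user_offset: int) -> int:
    return user_offset // USER_BLOCK_SIZE

def fs_block_grain_index_start(user_offset: int) -> int:
    start = 2 * ((user_offset % USER_BLOCK_SIZE) // MIN_SIZE)
    return start + (0 if user_offset % FS_GRAIN_SIZE == 0 else 1)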
The following coding includes writing sequentially on the fixed size data segments.
1. Calculate NextSegment(UserOffset, Size)
2. Set FSBlockIndex = Int(Segment.UserOffset / UserBlockSize)
3. Set FSBlockGrainIndex.Start = 2 * Int((Segment.UserOffset % UserBlockSize) / FSGrainSize) + (Segment.UserOffset % FSGrainSize == 0 ? 0 : 1)
4. Store the new segment starting in FSGrain.Index = FSGrainsFactor * FSBlockIndex + FSBlockGrainIndex.Start
5. while true
6. Store the data [UserOffset, min(size, FSGrainSize)] in FSGrain.Index
7. size -= FSGrainSize
8. If (size <= 0) break
9. UserOffset += FSGrainSize
10. blockIndex = Int(UserOffset / UserBlockSize)
// If the segment crosses FSBlocks, jump to the 1st FSGrain (index=0) in the next FSSubBlock
11. If (blockIndex > FSBlockIndex)
12. FSGrain.Index = FSGrainsFactor * blockIndex
13. FSSubBlockIndex = blockIndex
14. else
// If the segment requires multiple FSGrains, use the consecutive FSGrains in the same FSSubBlock
15. FSGrain.Index++
16. end-if
17. end- while
18. DONE
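An informal Python sketch of the placement loop above is given below, continuing the assumed constants from the previous sketch; the store callback is a placeholder for the actual repository write, and the block index is carried across an FSBlock crossing so that consecutive grains are used within the new block.

def write_segment(user_offset: int, size: int, store) -> None:
    """Place one variable-length segment onto fixed-size grains, as in the steps above."""
    block_index = user_offset // USER_BLOCK_SIZE
    grain_index = FS_GRAINS_FACTOR * block_index + fs_block_grain_index_start(user_offset)
    while True:
        store(grain_index, user_offset, min(size, FS_GRAIN_SIZE))
        size -= FS_GRAIN_SIZE
        if size <= 0:
            break
        user_offset += FS_GRAIN_SIZE
        next_block = user_offset // USER_BLOCK_SIZE
        if next_block > block_index:
            # The segment crosses an FSBlock: jump to the first grain of the next block.
            grain_index = FS_GRAINS_FACTOR * next_block
            block_index = next_block
        else:
            # Otherwise keep using consecutive grains in the same block.
            grain_index += 1

# Example usage with a toy store callback that just prints the placement.
write_segment(0, 3 * FS_GRAIN_SIZE, lambda grain, offset, length: print(grain, offset, length))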
The following coding includes re-segmentation algorithm:
// Parameters: <offset, size> in user-view of the write chunk
1. Find the <FSBlockIndex, FSBlockGrainIndex> of the 1st FSGrain (FSGrain.Index.start) and the last FSGrain (FSGrain.Index.end)
2. Read all data in the range [FSGrain.Index.start, FSGrain.Index.end]
3. Run Segmentation on the data in the range [FSGrain.Index.start, FSGrain.Index.end]
// Let's denote the amount of "left over" data size after the segmentation as the remainder
4. If there is no remainder (remainder == 0) -> DONE
// Create a new segment with ONLY the data of the remainder
5. If remainder >= MinSize
6. Create a new segment for the remainder
7. Store the new segment in the appropriate location - <FSBlockIndex, FSBlockGrainIndex>
8. DONE
9. end-if
// remainder < MinSize
10. Read the next segment just after the FSGrain.Index.end (fully unchanged segment)
11. If (remainder + nextSegment.size) <= MaxSize
12. Add the remainder into the nextSegment
13. Store the modified segment in the appropriate location
14. // NOTE: the FSGrain location might be changed. Thus, we may need to delete the old segment
15. DONE
16. end-if
// (remainder + nextSegment.size) > MaxSize
// In this case, we are going to split the (remainder + nextSegment.size) into exactly 2 segments
17. Run segmentation on remainder+nextSegment with MaxSize = (remainder + nextSegment.size - MinSize) to find the 1st cut-point // NOTE: this special MaxSize setting ensures at least 2 segments
18. Create 2 new segments using the 1st cut-point
19. Store the 2 new segments in the appropriate locations
20. // NOTE: the locations might be different than the location of the old segment. Thus, we may need to delete the old segment
21. DONE
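The remainder handling of the re-segmentation algorithm above can be sketched in Python as follows; read_next_segment, store_segment and split_once are placeholders for the corresponding steps, and the default minimum and maximum sizes are the same assumed values as in the earlier sketches.

def handle_remainder(remainder: bytes, read_next_segment, store_segment, split_once,
                     min_size: int = 8 * 1024, max_size: int = 32 * 1024) -> None:
    """Decide what to do with the leftover data after re-running the segmentation."""
    if not remainder:
        return                                   # no remainder: nothing more to store
    if len(remainder) >= min_size:
        store_segment(remainder)                 # the remainder becomes its own segment
        return
    next_segment = read_next_segment()           # fully unchanged segment after the range
    merged = remainder + next_segment
    if len(merged) <= max_size:
        store_segment(merged)                    # fold the remainder into the next segment
    else:
        first, second = split_once(merged)       # otherwise split into exactly two segments
        store_segment(first)
        store_segment(second)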
FIG. 7 is an illustration of an exemplary computing arrangement 700 (e.g. a storage device) in which the various architectures and functionalities of the various previous implementations may be implemented. As shown, the computing arrangement 700 includes at least one processor 704 that is connected to a bus 702, wherein the computing arrangement 700 may be implemented using any suitable protocol, such as PCI (Peripheral Component Interconnect), PCI-Express, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol(s). The computing arrangement 700 also includes a memory 706. Control logic (software) and data are stored in the memory 706, which may take the form of random-access memory (RAM). In the present description, a single semiconductor platform may refer to a sole unitary semiconductor-based integrated circuit or chip. It should be noted that the term single semiconductor platform may also refer to multi-chip modules with increased connectivity which simulate on-chip operation and make substantial improvements over utilizing a conventional central processing unit (CPU) and bus implementation. Of course, the various modules may also be situated separately or in various combinations of semiconductor platforms per the desires of the user.
The computing arrangement 700 may also include a secondary storage 710. The secondary storage 710 includes, for example, a hard disk drive and a removable storage drive, representing a floppy disk drive, a magnetic tape drive, a compact disk drive, digital versatile disk (DVD) drive, recording device, universal serial bus (USB) flash memory. The removable storage drive at least one of reads from and writes to a removable storage unit in a well-known manner.
Computer programs, or computer control logic algorithms, may be stored in at least one of the memory 706 and the secondary storage 710. Such computer programs, when executed, enable the computing arrangement 700 to perform various functions as described in the foregoing. The memory 706, the secondary storage 710, and any other storage are possible examples of computer-readable media.
In an implementation, the architectures and functionalities depicted in the various previous figures may be implemented in the context of the processor 704, a graphics processor coupled to a communication interface 712, an integrated circuit (not shown) that is capable of at least a portion of the capabilities of both the processor 704 and a graphics processor, or a chipset (i.e., a group of integrated circuits designed to work and be sold as a unit for performing related functions, etc.).
Furthermore, the architectures and functionalities depicted in the various previous figures may be implemented in the context of a general computer system, a circuit board system, a game console system dedicated for entertainment purposes, an application-specific system. For example, the computing arrangement 700 may take the form of a desktop computer, a laptop computer, a server, a workstation, a game console, an embedded system. Furthermore, the computing arrangement 700 may take the form of various other devices including, but not limited to a personal digital assistant (PDA) device, a mobile phone device, a smart phone, a television, etc. Additionally, although not shown, the computing arrangement 700 may be coupled to a network (e.g., a telecommunications network, a local area network (LAN), a wireless network, a wide area network (WAN) such as the Internet, a peer-to-peer network, a cable network, or the like) for communication purposes through an I/O interface 708.
It should be understood that the arrangement of components illustrated in the figures described is exemplary and that other arrangements may be possible. It should also be understood that the various system components (and means) defined by the claims, described below, and illustrated in the various block diagrams represent components in some systems configured according to the subject matter disclosed herein. For example, one or more of these system components (and means) may be realized, in whole or in part, by at least some of the components illustrated in the arrangements illustrated in the described figures. In addition, while at least one of these components is implemented at least partially as an electronic hardware component, and therefore constitutes a machine, the other components may be implemented in software that, when included in an execution environment, constitutes a machine, hardware, or a combination of software and hardware.
Although the disclosure and its advantages have been described in detail, it should be understood that various changes, substitutions, and alterations can be made herein without departing from the spirit and scope of the disclosure as defined by the appended claims.

Claims

1. A method for managing the storage of data segments on a storage device (100), the method comprising: storing a plurality of fixed-size data segments (102A-N, 204 A-N) of a determined fixed size on the storage device (100), receiving an input data stream (202), segmenting the input data stream (202) into a plurality of variable size data segments (206) each having a variable size, creating offset data segments (208A-B) from the variable size data segments (206) by offsetting the end of each variable size segment so that the size of each of the offset data segments (208A-B) is a multiple of the determined fixed size, identifying duplicate data in the input data stream (202) by detecting said duplicate data in the offset data segments (208A-B).
2. A method according to claim 1, wherein segmenting the input data stream (202) uses as parameters the average, the maximum and the minimum sizes of each of the variable size segments created from segmenting the input data stream (202).
3. A method according to claim 2, wherein, if the determined fixed size is greater than a first predetermined value, the maximum size of each of the variable size segments is set to be smaller or equal to the determined fixed size.
4. A method according to any of claims 2 and 3, wherein if the determined fixed size is smaller than a second predetermined value, the minimum size of each of the variable size segments is set to be greater or equal to the determined fixed size.
5. A method according to any of claims 1 to 4, further comprising inserting one or more reserved segments of the predetermined fixed size between consecutive offset data segments amongst the created offset data segments.
6. A method according to claim 5, wherein the number of reserved segments inserted between a first and a second consecutive offset data segments amongst the created offset data segments (208A-B), depends on the size of the variable size data segment from which the first offset data segment is created, on the minimum variable size and on the maximum variable size of the variable size data segments (206).
7. A method according to any of claims 1 to 6, further comprising storing the offset data segment into a repository.
8. A method according to any of claims 1 to 7, further comprising storing, for each given offset data segment created from a given variable size data segment, the position of the given variable size data segment in the input data stream (202) relative to a maximum size of each of the variable size data segments, and the size of the given variable size data segment.
9. A method according to any of claims 1 to 8, wherein creating offset data segments (208A-B) from the variable size data segments (206), comprises padding the end of each variable size segment with a padding character so that the size of each of the offset data segments (208A-B) is a multiple of the determined fixed size.
10. A storage device (100) configured to: store a plurality of fixed-size data segments (102A-N, 204 A-N) of a determined fixed size, receive an input data stream (202), segment the input data stream (202) into a plurality of variable size data segments (206) each having a variable size, create offset data segments (208A-B) from the variable size data segments (206) by offsetting the end of each variable size segment so that the size of each of the offset data segments (208A-B) is a multiple of the determined fixed size, identify duplicate data in the input data stream (202) by detecting said duplicate data in the offset data segments (208A-B).
11. A storage device (100) according to claim 10, further configured to segment the input data stream (202), using as parameters the average, the maximum and the minimum sizes of each of the variable size segments created from segmenting the input data stream (202).
12. A storage device (100) according to claim 11, further configured to set the maximum size of each of the variable size segments to be smaller or equal to the determined fixed size, if the determined fixed size is greater than a first predetermined value.
13. A storage device (100) according to any of claims 11 and 12, further configured to set the minimum size of each of the variable size segments to be greater or equal to the determined fixed size, if the determined fixed size is smaller than a second predetermined value.
14. A storage device (100) according to any of claims 10 to 13, further configured to insert one or more reserved segment of the predetermined fixed size between consecutive offset data segments amongst the created offset data segments.
15. A storage device (100) according to claim 14, further configured so that the number of reserved segments inserted between a first and a second consecutive offset data segments amongst the created offset data segments (208A-B), depends on the size of the variable size data segment from which the first offset data segment is created, on the minimum variable size and on the maximum variable size of the variable size data segments (206).
16. A storage device (100) according to any of claims 10 to 15, further comprising a repository for storing the offset data segment.
17. A storage device (100) according to any of claims 10 to 16, further configured to store, for each given offset data segment created from a given variable size data segment, the position of the given variable size data segment in the input data stream (202) relative to a maximum size of each of the variable size data segments, and the size of the given variable size data segment.
18. A storage device (100) according to any of claims 10 to 17, further configured to create the offset data segments (208A-B) from the variable size data segments (206) by padding the end of each variable size segment with a padding character so that the size of each of the offset data segments (208A-B) is a multiple of the determined fixed size.
PCT/EP2021/070951 2021-07-27 2021-07-27 Method for managing the storage of data segments on a storage device WO2023006183A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8234468B1 (en) * 2009-04-29 2012-07-31 Netapp, Inc. System and method for providing variable length deduplication on a fixed block file system
US20130073527A1 (en) * 2011-09-16 2013-03-21 Symantec Corporation Data storage dedeuplication systems and methods
US20130238570A1 (en) * 2012-03-08 2013-09-12 Dell Products L.P. Fixed size extents for variable size deduplication segments
US9465808B1 (en) * 2012-12-15 2016-10-11 Veritas Technologies Llc Deduplication featuring variable-size duplicate data detection and fixed-size data segment sharing

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21749814

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE